OpenAI, March 2022
with info from https://huggingface.co/blog/rlhf
The next-token prediction objective is misaligned with the objective we actually care about: "follow the user's instructions helpfully and safely".
To compensate for the shortcomings of this loss, metrics like BLEU or ROUGE are used; they compare generated text to human reference text with simple n-gram overlap.
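As a rough illustration of what these n-gram metrics measure, here is a minimal sketch (not the official BLEU/ROUGE implementations) that scores a generated sentence by n-gram precision against a single human reference:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also appear in the reference (clipped counts)."""
    cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / max(sum(cand.values()), 1)

candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
print(ngram_precision(candidate, reference, 1))  # unigram precision
print(ngram_precision(candidate, reference, 2))  # bigram precision
```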
Explicit intentions: following instructions
Implicit intentions: staying truthful; not being biased, toxic, or harmful
Team of 40 screened contractors to label data. Collect human-written demonstrations of the desired output behavior on prompts submitted to the OpenAI API, plus some labeler-written prompts. Train supervised learning baselines on this data.
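A minimal sketch of what one step of this supervised fine-tuning could look like, using GPT-2 from Hugging Face `transformers` as a stand-in for the actual GPT-3 models (batching, masking of prompt tokens, and the real hyperparameters are omitted; the prompt/demonstration strings are made up):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One (prompt, human demonstration) pair; real training batches many of these.
prompt = "Explain gravity to a 6-year-old.\n"
demonstration = "Gravity is the invisible pull that makes things fall down."

enc = tokenizer(prompt + demonstration, return_tensors="pt")
# Standard causal-LM cross-entropy on the sequence; labels are shifted internally.
outputs = model(**enc, labels=enc["input_ids"])
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```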
residual dropout: dropout applied to the output of each sub-layer before it is added back into the residual stream.
cosine learning rate decay (with restart): the learning rate follows a cosine curve from its initial value down to a minimum; with restarts, it periodically jumps back to the initial value and decays again.
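A minimal PyTorch sketch of both ideas, assuming a generic feed-forward residual block (the dimensions, dropout rate, and scheduler periods are illustrative, not the paper's hyperparameters):

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

class ResidualBlock(nn.Module):
    """A sub-layer whose output is dropped out before the residual addition."""
    def __init__(self, d_model=512, p_drop=0.1):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.dropout = nn.Dropout(p_drop)   # residual dropout
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.dropout(self.ff(x)))

model = ResidualBlock()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
# Cosine decay that restarts every 1000 steps (T_0), doubling the period each time (T_mult).
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=1000, T_mult=2, eta_min=1e-6)

for step in range(3000):
    x = torch.randn(8, 512)
    loss = model(x).pow(2).mean()   # dummy objective just to drive the loop
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```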
Anthropic used transformer models from 10 million to 52 billion params.
DeepMind used their 280 billion param model Gopher.
Starting from the supervised fine-tuned model, remove the final softmax/unembedding layer and output a scalar reward instead (the scalar represents human preference). Model size: 6B instead of 175B (training a 175B reward model was unstable).
Trained on a dataset of comparisons between two model outputs. Use a cross-entropy loss: the difference in rewards represents the log odds that one response will be preferred to the other by a human labeler.
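A minimal sketch of this architecture change, using GPT-2 from Hugging Face `transformers` as a stand-in backbone (the paper uses a 6B GPT-3 variant; pooling the last token's hidden state is an assumption for illustration): the LM/unembedding head is dropped and a linear head maps a summary hidden state to one scalar.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    def __init__(self, base_name="gpt2"):
        super().__init__()
        self.body = AutoModel.from_pretrained(base_name)  # backbone without the unembedding/LM head
        self.value_head = nn.Linear(self.body.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.body(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        # Summarize the (prompt, response) pair with the hidden state of the last real token.
        last_idx = attention_mask.sum(dim=1) - 1
        summary = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.value_head(summary).squeeze(-1)  # one scalar reward per sequence

tokenizer = AutoTokenizer.from_pretrained("gpt2")
rm = RewardModel()
enc = tokenizer(["Prompt: say hi.\nResponse: Hello!"], return_tensors="pt")
print(rm(enc["input_ids"], enc["attention_mask"]))  # tensor with one scalar
```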
Collect dataset of human-labeled comparisons between outputs from models on a larger set of API prompts. Train a reward model on this dataset to predict which model output labelers would prefer.
Q: why not ask the humans to score the model output with a scalar instead?
A: Different humans assign different scores based on their values, meaning scores are uncalibrated and noisy. Rankings are more robust.
An Elo system (like in chess) is then used to turn the comparisons into a scalar reward signal for training.
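For reference, a sketch of the standard chess-style Elo update from a single pairwise comparison (the generic formula, not tied to any particular RLHF implementation):

```python
def elo_update(rating_a, rating_b, a_won, k=32):
    """Update two Elo ratings after one comparison; a_won is 1.0 if A is preferred, 0.0 otherwise."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    rating_a += k * (a_won - expected_a)
    rating_b += k * ((1.0 - a_won) - (1.0 - expected_a))
    return rating_a, rating_b

# Two completions start at 1000; the labeler prefers A.
print(elo_update(1000.0, 1000.0, a_won=1.0))  # A's rating rises, B's falls
```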
Comparisons are highly correlated within each labeling task: labelers rank K completions per prompt, which yields $\binom{K}{2}$ comparisons. If all comparisons are simply shuffled into one dataset, a single pass over the dataset causes the reward model to overfit (if each of the $\binom{K}{2}$ comparisons is treated as a separate data point, each completion is repeated across multiple comparisons, causing overfitting). Instead, all $\binom{K}{2}$ comparisons for each prompt form a single batch element. This is more efficient because it requires a single forward pass for each completion instead of $\binom{K}{2}$ forward passes for K completions, and it no longer overfits. Q: unclear? are they outputting a vector instead of a scalar? A: no, still one scalar reward per completion; the K completions of a prompt are scored in the same batch element and all $\binom{K}{2}$ pairwise loss terms are computed from those K scalars (see the sketch after the loss function below).
Loss function for reward model: $\mathrm{loss}(\theta) = -\frac{1}{\binom{K}{2}}\,\mathbb{E}_{(x, y_w, y_l)\sim D}\big[\log\big(\sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\big)\big]$, where $r_\theta(x, y)$ is the scalar reward for prompt $x$ and completion $y$, $y_w$ is the completion preferred by the labeler, $y_l$ the other one, $D$ the dataset of comparisons, and $\sigma$ the sigmoid.
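A minimal PyTorch sketch of this loss for one prompt, assuming the K completions have already been scored (one forward pass each, e.g. with a reward model like the hypothetical `RewardModel` sketched above), after which all $\binom{K}{2}$ pairwise terms are computed from those K scalars:

```python
import itertools
import torch
import torch.nn.functional as F

def pairwise_rm_loss(rewards, ranking):
    """
    rewards: tensor of shape (K,) -- scalar reward for each completion of one prompt.
    ranking: list of completion indices from most to least preferred by the labeler.
    Returns the mean -log sigmoid(r_winner - r_loser) over all K-choose-2 pairs.
    """
    losses = []
    for better, worse in itertools.combinations(ranking, 2):
        losses.append(-F.logsigmoid(rewards[better] - rewards[worse]))
    return torch.stack(losses).mean()

# Example: K = 3 completions; the labeler ranks completion 2 best, then 0, then 1.
rewards = torch.tensor([0.3, -1.2, 0.9], requires_grad=True)
loss = pairwise_rm_loss(rewards, ranking=[2, 0, 1])
loss.backward()
print(loss.item())
```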
Intuitively, the reward model needs similar capacity to the text generation model in order to understand the text given to it.
Use this reward model as a reward function and fine-tune the supervised learning baseline to maximize this reward using the PPO algorithm. The reward model is frozen during this stage; in some RLHF setups part of the policy LM's parameters are frozen as well, since fine-tuning an entire very large model is expensive.
Formulating it as an RL problem:
Bandit environment: presents a random customer prompt and expects a response. Given the prompt and response, it produces a reward determined by the reward model and ends the episode.
Add a per-token KL penalty from the supervised fine-tuned model to mitigate over-optimization of the reward model. This penalizes the RL policy for moving substantially away from the initial supervised fine-tuned model; without it, the policy could output gibberish that fools the reward model into giving it a high reward.
RL objective (maximized with PPO): $\mathrm{objective}(\phi) = \mathbb{E}_{(x, y)\sim \pi_\phi^{\mathrm{RL}}}\big[\, r_\theta(x, y) - \beta \log\big(\pi_\phi^{\mathrm{RL}}(y \mid x)\,/\,\pi^{\mathrm{SFT}}(y \mid x)\big) \big]$, where $\beta$ controls the strength of the per-token KL penalty.
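A minimal sketch of how the per-token KL-shaped reward could be computed during PPO rollouts, assuming the policy and the frozen SFT reference model both expose per-token log-probs for the sampled response (the names `policy_logprobs` / `sft_logprobs` are illustrative, and the PPO update itself is omitted):

```python
import torch

def kl_shaped_rewards(rm_score, policy_logprobs, sft_logprobs, beta=0.02):
    """
    rm_score: scalar reward from the reward model for the full (prompt, response).
    policy_logprobs, sft_logprobs: tensors of shape (T,) with per-token log-probs
        of the sampled response under the RL policy and the frozen SFT model.
    Returns per-token rewards: -beta * KL estimate at every token, plus the
    reward-model score added on the final token (the bandit episode ends there).
    """
    per_token_kl = policy_logprobs - sft_logprobs
    rewards = -beta * per_token_kl
    rewards[-1] = rewards[-1] + rm_score
    return rewards

# Toy example with a 4-token response.
policy_lp = torch.tensor([-1.2, -0.8, -2.0, -0.5])
sft_lp = torch.tensor([-1.0, -1.1, -1.9, -0.7])
print(kl_shaped_rewards(rm_score=1.3, policy_logprobs=policy_lp, sft_logprobs=sft_lp))
```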
Evaluation: labelers rate quality of model outputs on test set consisting of prompts from held-out customers (not represented in training data).