OpenAI, March 2022

with info from


  • fine-tune GPT-3 using supervised learning. Collect dataset of rankings of model outputs. Fine-tune using reinforcement learning from human feedback.
  • outputs from the 1.3B-parameter InstructGPT are preferred by labelers over outputs from the 175B GPT-3
  • alignment to user's intent.


Next token prediction objective \neq "follow user's instructions helpfully and safely"

To compensate for the shortcomings of the loss, we use metrics like BLEU or ROUGE that try to compare generated text to human text with simple n-gram overlap.
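To make "simple n-grams" concrete, here is a minimal sketch of clipped n-gram precision, the building block behind BLEU-style scores. It is not the full BLEU or ROUGE definition (no brevity penalty, no averaging over n), and the function names are made up for illustration.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all contiguous n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_ngram_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference (the "clipping" used by BLEU).
    """
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total
```

A score like this only measures surface overlap with a reference text, which is why it is a weak proxy for "helpful and safe".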

Explicit intentions: following instructions

Implicit intentions: staying truthful, not being biased, toxic, harmful

1. Supervised fine-tuning (optional)

Team of 40 screened contractors to label data. Collect human written demonstrations of desired output behavior on prompts submitted to OpenAI API and some labeler-written prompts. Train supervised learning baselines.

  • train for 16 epochs using cosine learning rate decay and residual dropout of 0.2
  • the model overfits on validation loss after 1 epoch, but training for more epochs still improves both the reward model score and human preference ratings.

residual dropout:

  • with probability $p = 0.2$, set the activation to 0, so no gradient flows through this neuron on that backward pass.
  • with probability $1 - p$, rescale the activation: $h := h/(1-p)$, which keeps its expected value unchanged.
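The two cases above can be sketched in plain Python. This is only an illustration of the drop-and-rescale rule on a list of activations; where exactly dropout is applied inside the transformer block is not shown, and the function name and the `rng` parameter are assumptions.

```python
import random

def residual_dropout(activations, p=0.2, rng=None):
    """Apply dropout with rescaling to a list of activations.

    Each activation is zeroed with probability p; otherwise it is
    divided by (1 - p) so the expected value stays the same.
    """
    rng = rng or random.Random()
    out = []
    for h in activations:
        if rng.random() < p:
            out.append(0.0)            # dropped
        else:
            out.append(h / (1.0 - p))  # kept and rescaled
    return out
```

At evaluation time dropout is normally switched off, which corresponds to calling this with `p=0.0`.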

cosine learning rate decay (with restart):
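One common form of this schedule, as a minimal sketch. The learning rate follows half a cosine wave from `lr_max` down to `lr_min`; the restart variant simply repeats that curve every `cycle_steps` steps. All parameter names and values are illustrative, not the settings used in the paper.

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine decay from lr_max (step 0) to lr_min (step total_steps)."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

def cosine_lr_with_restarts(step, cycle_steps, lr_max, lr_min=0.0):
    """Same curve, restarted at lr_max at the start of every cycle."""
    return cosine_lr(step % cycle_steps, cycle_steps, lr_max, lr_min)
```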


Anthropic used transformer models from 10 million to 52 billion params.

DeepMind used their 280 billion param model Gopher.

2. Reward modeling

Starting from the supervised fine-tuned model, remove the final layer that maps to token probabilities (they call it the unembedding layer) and output a scalar reward instead (the scalar reward represents human preference). Model size: 6B instead of 175B, because training a 175B reward model was unstable.

Trained on a dataset of comparisons between two model outputs. Use a cross-entropy loss: the difference in rewards represents the log odds that one response will be preferred to the other by a human labeler.
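The "difference in rewards = log odds" statement can be written out for a single comparison. A minimal sketch, with hypothetical function names: the probability that response A is preferred is the sigmoid of the reward difference, and the cross-entropy loss is the negative log of the probability assigned to the response the labeler actually preferred.

```python
import math

def preference_probability(reward_a, reward_b):
    """P(A preferred over B) = sigmoid(reward_a - reward_b)."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

def pairwise_loss(reward_preferred, reward_rejected):
    """Cross-entropy loss for one comparison where the first response won."""
    return -math.log(preference_probability(reward_preferred, reward_rejected))
```

Equal rewards give a probability of 0.5 and a loss of log 2; the loss shrinks as the preferred response's reward pulls ahead.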

Collect dataset of human-labeled comparisons between outputs from models on a larger set of API prompts. Train a reward model on this dataset to predict which model output labelers would prefer.

Q: why not ask the humans to score the model output with a scalar instead?

A: Different humans assign different scores based on their values, meaning scores are uncalibrated and noisy. Rankings are more robust.

Elo system (like in chess) is then used to generate a scalar reward signal for training.
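For reference, the standard Elo update as used in chess, as a minimal sketch. How exactly the resulting ratings are turned into a training signal is not specified in these notes; the K-factor of 32 and the function names are assumptions.

```python
def elo_expected(rating_a, rating_b):
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was, 0.5 for a tie.
    """
    expected_a = elo_expected(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b
```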

Comparisons are highly correlated within each labeling task. When all comparisons are simply shuffled into one dataset, a single pass over the dataset causes the reward model to overfit: if each of the $\binom{K}{2}$ comparisons is treated as a separate data point, each completion is repeated multiple times in the dataset. Instead, all $\binom{K}{2}$ comparisons for each prompt are trained on as a single batch element. This is more efficient because it requires a single forward pass for each completion instead of $\binom{K}{2}$ forward passes for $K$ completions, and the model no longer overfits. Q: unclear? are they outputting a vector instead of a scalar?
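One way to read the batching trick, as a sketch: the reward model still outputs one scalar per completion, but the K scalars for a prompt are computed once and then reused across all pairs. The code below assumes the K rewards are already computed and ordered from most to least preferred, with no ties; `prompt_loss` is a hypothetical name.

```python
import math
from itertools import combinations

def prompt_loss(rewards_ranked):
    """Average pairwise loss over all K-choose-2 pairs for one prompt.

    rewards_ranked: scalar rewards for the K completions of one prompt,
    ordered from most preferred to least preferred by the labeler.
    """
    pairs = list(combinations(range(len(rewards_ranked)), 2))
    total = 0.0
    for better, worse in pairs:
        diff = rewards_ranked[better] - rewards_ranked[worse]
        total += math.log(1.0 + math.exp(-diff))  # equals -log(sigmoid(diff))
    return total / len(pairs)
```

With K = 4 this uses 4 reward values (4 forward passes) to cover 6 comparisons, rather than running the model twice for each of the 6 pairs.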

Loss function for reward model:
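As given in the InstructGPT paper, where $r_\theta(x, y)$ is the scalar reward for prompt $x$ and completion $y$, $y_w$ is the preferred completion of the pair $(y_w, y_l)$, $\sigma$ is the sigmoid, and $D$ is the dataset of human comparisons:

```latex
\operatorname{loss}(\theta) = -\frac{1}{\binom{K}{2}}
  \mathbb{E}_{(x, y_w, y_l) \sim D}
  \left[ \log \sigma\left( r_\theta(x, y_w) - r_\theta(x, y_l) \right) \right]
```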


Intuitively, the reward model should have a capacity similar to that of the text generation model, so that it can understand the text it is given.

3. Reinforcement learning

Use this reward model as the reward function and fine-tune the supervised learning baseline to maximize this reward using the PPO algorithm. Some parameters of the LM may be kept frozen to reduce the cost of fine-tuning.

Formulating as a RL problem:

  • policy: language model that takes in a prompt and returns a sequence of text
  • action space: all tokens in vocabulary (50k tokens)
  • observation space: distribution of possible input token sequences (dimension = vocab size ^ input length)
  • reward function: combination of preference model + constraint on policy shift (KL penalty, see below)

The environment is a bandit environment: it presents a random customer prompt and expects a response to the prompt. Given the prompt and response, it produces a reward determined by the reward model and ends the episode.
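The one-step episode structure can be sketched as a tiny environment class. Everything here is hypothetical scaffolding: the class and method names are made up, and `toy_reward_model` is a stand-in that just counts words, not a learned preference model.

```python
import random

class PromptBanditEnv:
    """One prompt in, one response out, one reward, then the episode ends."""

    def __init__(self, prompts, reward_model, seed=0):
        self.prompts = list(prompts)
        self.reward_model = reward_model
        self.rng = random.Random(seed)
        self.current_prompt = None

    def reset(self):
        """Start an episode by sampling a random prompt."""
        self.current_prompt = self.rng.choice(self.prompts)
        return self.current_prompt

    def step(self, response):
        """Score the full response and end the episode."""
        reward = self.reward_model(self.current_prompt, response)
        done = True  # bandit setting: every episode has exactly one step
        return reward, done

def toy_reward_model(prompt, response):
    # Hypothetical stand-in: rewards longer responses.
    return float(len(response.split()))
```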

Add a per-token KL penalty from the supervised fine-tuned (SFT) model at each token to mitigate over-optimization of the reward model. This penalizes the RL policy for moving substantially away from the SFT model it was initialized from. Without it, the model could output gibberish that fools the reward model into giving it a high reward.
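A minimal sketch of how a per-token KL penalty can be combined with the reward model score. It assumes per-token log-probabilities of the sampled response under both the RL policy and the SFT model are available as non-empty lists of equal length. Adding the reward model score at the last token is a common implementation convention, not something stated in these notes, and the `beta` value is illustrative.

```python
def kl_penalized_rewards(rm_score, logprobs_policy, logprobs_sft, beta=0.02):
    """Per-token rewards for one sampled response.

    Each token gets -beta * (log pi_RL(token) - log pi_SFT(token)).
    The scalar reward model score is added at the final token
    (assumed convention).
    """
    rewards = [-beta * (lp - lq) for lp, lq in zip(logprobs_policy, logprobs_sft)]
    rewards[-1] += rm_score
    return rewards
```

If the policy assigns the same log-probabilities as the SFT model, the penalty vanishes and only the reward model score remains.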

RL objective:
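As given in the InstructGPT paper, where $\pi_\phi^{\mathrm{RL}}$ is the learned RL policy, $\pi^{\mathrm{SFT}}$ is the supervised fine-tuned model, $r_\theta$ is the reward model, $\beta$ is the KL penalty coefficient, and $\gamma$ weights an optional pretraining term (for plain PPO, $\gamma = 0$):

```latex
\operatorname{objective}(\phi) =
  \mathbb{E}_{(x, y) \sim D_{\pi_\phi^{\mathrm{RL}}}}
  \left[ r_\theta(x, y)
    - \beta \log \frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \right]
  + \gamma \, \mathbb{E}_{x \sim D_{\mathrm{pretrain}}}
  \left[ \log \pi_\phi^{\mathrm{RL}}(x) \right]
```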


Evaluation: labelers rate quality of model outputs on test set consisting of prompts from held-out customers (not represented in training data).