(or DPO - Stanford, 2023)
models trained with DPO:
The strength of this paper is that it reformulates RLHF (reinforcement learning from human feedback) to directly optimize the language model's parameters by making the rewards implicit, instead of first training a reward model and then fine-tuning the LLM with reinforcement learning.
RLHF begins with fine-tuning a pre-trained language model on high-quality data for downstream tasks such as dialogue or summarization; this initial language model is denoted $\pi^{\text{SFT}}$ (SFT stands for supervised fine-tuning).
The supervised fine-tuning model is prompted with prompts $x$ to produce pairs of answers $(y_1, y_2) \sim \pi^{\text{SFT}}(y \mid x)$. Each pair is presented to human labelers who express a preference for one answer. We denote $y_w$ and $y_l$ the winning and losing completions respectively.
The preferences are assumed to be generated by some latent reward model $r^*(x, y)$, which we don't have access to. The Bradley-Terry model gives us the probability of choosing one answer over the other as:

$$p^*(y_1 \succ y_2 \mid x) = \frac{\exp\left(r^*(x, y_1)\right)}{\exp\left(r^*(x, y_1)\right) + \exp\left(r^*(x, y_2)\right)}$$
The reward model is parameterized as a neural network $r_\phi(x, y)$, initialized from the supervised fine-tuning model with a linear layer on top that produces a single scalar prediction (the reward). By framing the problem as binary classification, we can do maximum likelihood estimation by minimizing the negative log-likelihood loss:

$$\mathcal{L}_R(r_\phi, \mathcal{D}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$$
Note from the paper: "to ensure a reward function with lower variance, prior works normalize the rewards, such that $\mathbb{E}_{x, y \sim \mathcal{D}}\left[r_\phi(x, y)\right] = 0$ for all $x$." Why does it lower variance? How do you normalize in order to respect this equality?
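For intuition, the pairwise reward-modeling loss is short to write down. Here is a minimal sketch (my own PyTorch code and names, not the paper's), assuming the reward model already returns one scalar per (prompt, completion) pair:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the Bradley-Terry model.

    reward_chosen:   (batch,) scalar rewards r_phi(x, y_w) for the preferred completions
    reward_rejected: (batch,) scalar rewards r_phi(x, y_l) for the rejected completions
    """
    # -log sigma(r_w - r_l); logsigmoid is the numerically stable way to compute it
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```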
The learned reward function $r_\phi$ then serves as the "environment" and produces a reward given a completion from the final language model $\pi_\theta$ (initialized to $\pi^{\text{SFT}}$).
Because this reward is not a direct function of the language model's outputs, and we used decoding to get the final sentence (which is discrete, since we auto-regressively pick tokens, e.g. the one with the highest softmax probability), we can't backpropagate through it to the model's parameters. The objective is thus not differentiable and is optimized with reinforcement learning.
"leverage an anlytical mapping from reward functions to optimal policies, which enables ut to transform a loss function over reward functions into a loss function over policies".
Start from the RL fine-tuning objective:

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\left[r_\phi(x, y)\right] - \beta\, \mathbb{D}_{\text{KL}}\left[\pi_\theta(y \mid x) \,\|\, \pi_{\text{ref}}(y \mid x)\right]$$

where $\beta$ controls the deviation from the base reference policy $\pi_{\text{ref}}$ (which is actually the initial supervised fine-tuning model $\pi^{\text{SFT}}$).
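As an aside, PPO-based RLHF pipelines typically optimize this objective by folding the KL term into the reward the RL algorithm sees. A rough sketch, with my own function and variable names:

```python
import torch

def kl_shaped_reward(reward: torch.Tensor,
                     logp_theta: torch.Tensor,
                     logp_ref: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    """Sequence-level reward commonly fed to PPO in RLHF:
    the learned reward minus a KL penalty toward the reference policy.

    reward:     (batch,) scalars r_phi(x, y) from the reward model
    logp_theta: (batch,) sum of log pi_theta(y_t | x, y_<t) over completion tokens
    logp_ref:   (batch,) the same quantity under the frozen reference policy
    """
    return reward - beta * (logp_theta - logp_ref)
```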
Reformulate the objective as a single KL divergence between the policy to learn $\pi_\theta$ and a constructed probability distribution

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right)$$

where $Z(x) = \sum_y \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right)$ is the partition function.
Since the KL divergence is minimized at 0 if and only if the two distributions are identical (Gibbs' inequality), the optimal policy is exactly the constructed distribution $\pi^*$.
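For reference, the intermediate algebra (following the paper's derivation) goes roughly like this; note that $Z(x)$ does not depend on $\pi$, so it can be pulled out of the minimization:

$$
\begin{aligned}
\max_{\pi} \;& \mathbb{E}_{x \sim \mathcal{D}}\, \mathbb{E}_{y \sim \pi(y \mid x)}\left[r(x, y)\right] - \beta\, \mathbb{D}_{\text{KL}}\left[\pi(y \mid x) \,\|\, \pi_{\text{ref}}(y \mid x)\right] \\
= \min_{\pi} \;& \mathbb{E}_{x \sim \mathcal{D}}\, \mathbb{E}_{y \sim \pi(y \mid x)}\left[\log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)} - \frac{1}{\beta} r(x, y)\right] \\
= \min_{\pi} \;& \mathbb{E}_{x \sim \mathcal{D}}\, \mathbb{E}_{y \sim \pi(y \mid x)}\left[\log \frac{\pi(y \mid x)}{\frac{1}{Z(x)} \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right)} - \log Z(x)\right] \\
= \min_{\pi} \;& \mathbb{E}_{x \sim \mathcal{D}}\left[\mathbb{D}_{\text{KL}}\left(\pi(y \mid x) \,\|\, \pi^*(y \mid x)\right) - \log Z(x)\right]
\end{aligned}
$$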
Next, we can isolate the reward function and express it as:

$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$
Plugging this into the Bradley-Terry preference model gets rid of the partition function, since the model only depends on the difference of rewards between the two completions:

$$p^*(y_w \succ y_l \mid x) = \sigma\left(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)$$
where $\sigma$ is the sigmoid function and $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$ is the reward implicitly defined by the language model and the reference model.
Maximizing the log-likelihood is equivalent to minimizing the log-loss (that's DPO!):

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

where the policy $\pi_\theta$ is parameterized by weights $\theta$.
What does $\pi_\theta(y \mid x)$ represent? It's the likelihood of completion $y$ given prompt $x$. Denoting $y^1, \dots, y^T$ the tokens in $y$, we can take the product of the single-token predictions generated auto-regressively:

$$\pi_\theta(y \mid x) = \prod_{t=1}^{T} \pi_\theta\left(y^t \mid x, y^{<t}\right)$$
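Putting the pieces together, here is a minimal PyTorch sketch of the DPO loss (my own helper names, not the paper's reference implementation); it assumes the logits are already aligned with the completion tokens and that a mask selects them:

```python
import torch
import torch.nn.functional as F

def sequence_logprob(logits: torch.Tensor, labels: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Sum of per-token log-probabilities log pi(y^t | x, y^<t) over the completion.

    logits: (batch, seq_len, vocab), already shifted so logits[:, t] predicts labels[:, t]
    labels: (batch, seq_len) token ids (prompt + completion + padding)
    mask:   (batch, seq_len) 1.0 for completion tokens, 0.0 for prompt and padding
    """
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = torch.gather(logps, 2, labels.unsqueeze(-1)).squeeze(-1)
    return (token_logps * mask).sum(-1)

def dpo_loss(logp_w: torch.Tensor, logp_l: torch.Tensor,
             ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """-log sigma of the implicit-reward margin between winning and losing completions."""
    margin = beta * (logp_w - ref_logp_w) - beta * (logp_l - ref_logp_l)
    return -F.logsigmoid(margin).mean()
```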
They show that any reward function can be reparameterized using two models as $r(x, y) = \beta \log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$ and, under the Bradley-Terry preference model, still represent the same preference distribution. The DPO parameterization thus does not constrain the set of reward models we can learn.
3 open-ended text generation tasks: controlled sentiment generation (IMDb), summarization (Reddit TL;DR), and single-turn dialogue (Anthropic Helpful and Harmless).
Evaluation: plot the achieved reward against the KL divergence from the reference policy. On the sentiment task they have access to the ground-truth reward function (a pre-trained sentiment classifier). They also use GPT-4 to compute win rates between models (a proxy for human evaluation).
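The KL term on the x-axis of that plot can be estimated by sampling completions from $\pi_\theta$ and averaging the sequence-level log-ratio; a rough sketch (my own helper, not the paper's code):

```python
import torch

def estimate_kl(logp_theta: torch.Tensor, logp_ref: torch.Tensor) -> torch.Tensor:
    """Monte Carlo estimate of KL(pi_theta || pi_ref).

    logp_theta, logp_ref: (num_samples,) sequence log-probabilities of completions
    sampled from pi_theta, scored under the policy and the reference model.
    """
    return (logp_theta - logp_ref).mean()
```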
Personal comment: I don't think their experiments are sufficient to show DPO is on par with RLHF, because:
they use smaller models; can DPO performance scale with the number of model parameters?
the tasks are far from production settings like ChatGPT. I would be interested to see how DPO-trained models compare to PPO-trained models in the LLM arena.