direct-preference-optimization

(or DPO; Rafailov et al., Stanford, 2023)

models trained with DPO:

  • https://huggingface.co/HuggingFaceH4/zephyr-7b-beta (a Hugging Face model: Mistral 7B fine-tuned on public datasets)
  • Grok by xAI

The strength of this paper is that it reformulates RLHF (reinforcement learning from human feedback) to directly optimize the language model's parameters by making the reward implicit, instead of first training a reward model and then fine-tuning the LLM with reinforcement learning.

DPO objective formulation

starting from RLHF

preliminaries

RLHF begins by fine-tuning a pre-trained language model on high-quality data for downstream tasks such as dialogue, summarization, etc. This initial language model is denoted $\pi^{SFT}$ (SFT stands for supervised fine-tuning).

reward modeling phase

The supervised fine-tuning model is prompted with prompts $x$ to produce pairs of answers $(y_1, y_2) \sim \pi^{SFT}(y \mid x)$. Each pair is presented to human labelers who express a preference for one answer. We denote by $y_w$ and $y_l$ the winning and losing completions respectively.
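
A minimal sketch of how such a pair could be sampled with the Hugging Face transformers API (the checkpoint and sampling settings are placeholders, not the paper's setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# placeholder checkpoint; in practice this would be the SFT model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
sft_model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Summarize the following post:\n..."
inputs = tokenizer(prompt, return_tensors="pt")

# sample two completions (y1, y2) ~ pi_SFT(y | x) for the same prompt x
with torch.no_grad():
    outputs = sft_model.generate(
        **inputs,
        do_sample=True,
        top_p=0.9,
        max_new_tokens=64,
        num_return_sequences=2,
        pad_token_id=tokenizer.eos_token_id,
    )

prompt_len = inputs["input_ids"].shape[1]
y1, y2 = (tokenizer.decode(o[prompt_len:], skip_special_tokens=True) for o in outputs)
```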

The preferences are assumed to be generated by some latent reward model $r^*(x, y)$, which we don't have access to. The Bradley-Terry model gives us the probability of choosing one answer $y_1$ over the other $y_2$ as:

$$p^*(y_1 \succ y_2 \mid x) = \frac{\exp\big(r^*(x, y_1)\big)}{\exp\big(r^*(x, y_1)\big) + \exp\big(r^*(x, y_2)\big)}$$
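
Dividing the numerator and denominator by $\exp\big(r^*(x, y_1)\big)$ shows this is just a sigmoid of the reward difference, which is where the $\sigma$ in the loss below comes from:

$$p^*(y_1 \succ y_2 \mid x) = \frac{1}{1 + \exp\big(r^*(x, y_2) - r^*(x, y_1)\big)} = \sigma\big(r^*(x, y_1) - r^*(x, y_2)\big)$$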

The reward is modeled as a neural network $r_\phi(x, y)$, initialized from the supervised fine-tuning model $\pi^{SFT}(y \mid x)$ with a linear layer on top that produces a single scalar prediction (the reward). By framing the problem as binary classification, we can perform maximum likelihood estimation by minimizing the negative log-likelihood loss:

$$\mathcal{L}_R(r_\phi, \mathcal{D}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]$$

Note from the paper: "to ensure a reward function with lower variance, prior works normalize the rewards, such that $\mathbb{E}_{x, y \sim \mathcal{D}}[r_\phi(x, y)] = 0$ for all $x$". Why does it lower variance? How do you normalize in order to respect this equality?
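
A minimal PyTorch sketch of this loss, assuming the reward model already produces one scalar per (prompt, completion) pair:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the Bradley-Terry model.

    reward_chosen / reward_rejected: shape (batch,), the scalar rewards
    r_phi(x, y_w) and r_phi(x, y_l) for each preference pair.
    """
    # -log sigmoid(r_w - r_l), averaged over the batch
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# toy usage with random rewards standing in for the reward model's outputs
r_w = torch.randn(8, requires_grad=True)
r_l = torch.randn(8, requires_grad=True)
loss = reward_model_loss(r_w, r_l)
loss.backward()
```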

RL fine-tuning phase

The learned reward function serves as the "environment" and produces a reward given a completion from the final language model $\pi_\theta$ (initialized to $\pi^{SFT}$).

Because this reward is not a direct function of the language model's parameters (we use decoding to get the final sentence, which is a discrete operation since we auto-regressively pick tokens from the softmax), we can't compute gradients of the reward with respect to the model's parameters. The objective is thus not differentiable and is optimized with reinforcement learning.

formulating DPO

"leverage an anlytical mapping from reward functions to optimal policies, which enables ut to transform a loss function over reward functions into a loss function over policies".

Start from the RL fine-tuning objective:

$$\max_\pi \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi}\big[r(x, y)\big] - \beta\, \mathbb{D}_{KL}\big[\pi(y \mid x) \,\Vert\, \pi_{ref}(y \mid x)\big]$$

where $\beta$ controls the deviation from the base reference policy $\pi_{ref}$ (which is in practice the initial supervised fine-tuning model $\pi^{SFT}$).

Reformulate the objective as a KL-divergence between the policy $\pi$ we want to learn and a constructed probability distribution $\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{ref}(y \mid x) \exp\!\big(\frac{1}{\beta} r(x, y)\big)$, where $Z(x) = \sum_y \pi_{ref}(y \mid x) \exp\!\big(\frac{1}{\beta} r(x, y)\big)$ is the partition function.

Since the KL-divergence is minimized at 0 if and only if the two distributions are identical (Gibbs' inequality), the optimal policy is the constructed distribution $\pi^*$.
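
The algebra behind this reformulation (a condensed version of the derivation in the paper's appendix): fold the reward into the log, then use $\pi_{ref}(y \mid x) \exp\!\big(\frac{1}{\beta} r(x, y)\big) = Z(x)\, \pi^*(y \mid x)$:

$$\begin{aligned}\mathbb{E}_{y \sim \pi}\big[r(x, y)\big] - \beta\, \mathbb{D}_{KL}\big[\pi \,\Vert\, \pi_{ref}\big] &= -\beta\, \mathbb{E}_{y \sim \pi}\left[\log \frac{\pi(y \mid x)}{\pi_{ref}(y \mid x) \exp\!\big(\frac{1}{\beta} r(x, y)\big)}\right] \\ &= -\beta\, \mathbb{D}_{KL}\big[\pi(y \mid x) \,\Vert\, \pi^*(y \mid x)\big] + \beta \log Z(x)\end{aligned}$$

Since $Z(x)$ does not depend on $\pi$, maximizing the left-hand side is exactly minimizing the KL term on the right.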

Next, we can isolate the reward function and express it as:

$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{ref}(y \mid x)} + \beta \log Z(x)$$

Plugging this into the Bradley-Terry preference model gets rid of the partition function $Z(x)$, since the preference probability only depends on the difference of rewards between the two completions:

$$\begin{aligned}p^*(y_1 \succ y_2 \mid x) &= \frac{1}{1 + \exp\Big(\beta \log \frac{\pi^*(y_2 \mid x)}{\pi_{ref}(y_2 \mid x)} - \beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{ref}(y_1 \mid x)}\Big)} \\ &= \sigma\big(\hat{r}(x, y_1) - \hat{r}(x, y_2)\big)\end{aligned}$$

where $\sigma$ is the sigmoid function and $\hat{r}(x, y) = \beta \log \frac{\pi(y \mid x)}{\pi_{ref}(y \mid x)}$ is the reward implicitly defined by the language model and the reference model.

Maximizing log-likelihood is equivalent to minimizing the log-loss (that's DPO!):

$$\begin{aligned}\mathcal{L}_{DPO}(\pi_\theta, \pi_{ref}) &= -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\big[\log p^*(y_w \succ y_l \mid x)\big] \\ &= -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\Big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)}\Big)\right]\end{aligned}$$

where the model is parameterized by weights $\theta$.
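
A minimal PyTorch sketch of this loss, assuming we already have the sequence log-probabilities $\log \pi_\theta(y \mid x)$ and $\log \pi_{ref}(y \mid x)$ for the chosen and rejected completions (see the next snippet for how those can be computed):

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_logp_chosen: torch.Tensor,    # log pi_theta(y_w | x), shape (batch,)
    policy_logp_rejected: torch.Tensor,  # log pi_theta(y_l | x), shape (batch,)
    ref_logp_chosen: torch.Tensor,       # log pi_ref(y_w | x), shape (batch,)
    ref_logp_rejected: torch.Tensor,     # log pi_ref(y_l | x), shape (batch,)
    beta: float = 0.1,
) -> torch.Tensor:
    # implicit rewards r_hat = beta * log(pi_theta / pi_ref)
    chosen_rewards = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_rewards = beta * (policy_logp_rejected - ref_logp_rejected)
    # -log sigmoid(r_hat_w - r_hat_l), averaged over the batch
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```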

What does $\pi_\theta(y \mid x)$ represent? It's the likelihood of completion $y$ given the prompt $x$. Denoting the tokens of $y$ as $y := [y^1, \dots, y^k]$, we can take the product of the single-token probabilities generated auto-regressively:

$$\pi_\theta(y \mid x) = \prod_{i=1}^{k} \pi_\theta(y^i \mid x, y^{<i})$$
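
In practice we work with the log-probability, i.e. the sum of per-token log-probabilities. A sketch of how this could be computed from a causal LM's logits (the prompt/padding mask handling is simplified):

```python
import torch
import torch.nn.functional as F

def sequence_logprob(logits: torch.Tensor, labels: torch.Tensor, completion_mask: torch.Tensor) -> torch.Tensor:
    """log pi_theta(y | x) = sum_i log pi_theta(y^i | x, y^{<i}).

    logits:          (batch, seq_len, vocab) from a causal LM run on [prompt + completion]
    labels:          (batch, seq_len) token ids of the same sequence
    completion_mask: (batch, seq_len) with 1 on completion tokens, 0 on prompt/padding
    """
    # logits at position i predict the token at position i + 1
    logits = logits[:, :-1, :]
    labels = labels[:, 1:]
    mask = completion_mask[:, 1:]
    # per-token log-probabilities of the observed tokens
    per_token_logps = torch.gather(
        F.log_softmax(logits, dim=-1), dim=2, index=labels.unsqueeze(-1)
    ).squeeze(-1)
    # keep only the completion tokens and sum over the sequence
    return (per_token_logps * mask).sum(dim=-1)
```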

Theoretical analysis

They show that, for any reward function, we can reparameterize it using two models as $\hat{r} = \beta \log \frac{\pi_\theta}{\pi_{ref}}$ and, under the Bradley-Terry preference model, learn the same preference distribution. The DPO objective thus does not constrain the set of reward models we can learn.

Experiments

3 open-ended text generation tasks:

  • controlled sentiment generation: given the prefix of a movie review from the IMDb dataset, generate a completion with positive sentiment. They fine-tune GPT-2 to get their reference model.
  • summarization: generate a summary of the main points of a forum post, using the Reddit TL;DR summarization dataset.
  • single-turn dialogue: using the Anthropic Helpful and Harmless dialogue dataset (170k dialogues between a human and an automated assistant), generate a response to the user's message.

Evaluation: plot reward vs. KL divergence from the reference policy. For the sentiment task they have access to the ground-truth reward function via a pre-trained sentiment classifier. For summarization and dialogue they use GPT-4 to compute a win rate between the models (a proxy for human evaluation).

Personal comment: I don't think their experiments are sufficient to show DPO is on par with RLHF, because:

  1. they use smaller models. Can DPO performance scale with the number of model parameters?

  2. the tasks are far from production settings like ChatGPT. I would be interested to see how DPO-trained models compare to PPO-trained models in an LLM arena.