direct-preference-optimization

(or DPO; Rafailov et al., Stanford, 2023)

models trained with DPO:

  • https://huggingface.co/HuggingFaceH4/zephyr-7b-beta (a Hugging Face model: Mistral 7B fine-tuned on public datasets)
  • Grok by xAI

The strength of this paper is that it reformulates RLHF (reinforcement learning from human feedback) to directly optimize the language model's parameters by making the reward implicit, instead of first training a reward model and then fine-tuning the LLM with reinforcement learning.

DPO objective formulation

starting from RLHF

preliminaries

RLHF begins by fine-tuning a pre-trained language model on high-quality data for downstream tasks such as dialogue, summarization, etc. This initial language model is denoted $\pi^{SFT}$ (SFT stands for supervised fine-tuning).

reward modeling phase

The supervised fine-tuning model is prompted with prompts $x$ to produce pairs of answers $(y_1, y_2) \sim \pi^{SFT}(y \mid x)$. Each pair is presented to human labelers who express a preference for one answer. We denote by $y_w$ and $y_l$ the winning and losing completions respectively.
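
A minimal sketch of how such a pair could be sampled with the Hugging Face transformers API (the checkpoint and sampling settings are placeholders, not the paper's setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# placeholder checkpoint; in practice this would be the SFT model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
sft_model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Summarize the following post:\n..."
inputs = tokenizer(prompt, return_tensors="pt")

# sample two completions (y1, y2) ~ pi_SFT(y | x) for the same prompt x
with torch.no_grad():
    outputs = sft_model.generate(
        **inputs,
        do_sample=True,
        top_p=0.9,
        max_new_tokens=64,
        num_return_sequences=2,
        pad_token_id=tokenizer.eos_token_id,
    )

prompt_len = inputs["input_ids"].shape[1]
y1, y2 = (tokenizer.decode(o[prompt_len:], skip_special_tokens=True) for o in outputs)
```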

The preferences are assumed to be generated by some latent reward model $r^*(x, y)$, which we don't have access to. The Bradley-Terry model gives us the probability of choosing one answer $y_1$ over the other $y_2$ as:

$$p^*(y_1 \succ y_2 \mid x) = \frac{\exp\big(r^*(x, y_1)\big)}{\exp\big(r^*(x, y_1)\big) + \exp\big(r^*(x, y_2)\big)}$$
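
Dividing the numerator and denominator by $\exp\big(r^*(x, y_1)\big)$ shows this is just a sigmoid of the reward difference, which is where the $\sigma$ in the loss below comes from:

$$p^*(y_1 \succ y_2 \mid x) = \frac{1}{1 + \exp\big(r^*(x, y_2) - r^*(x, y_1)\big)} = \sigma\big(r^*(x, y_1) - r^*(x, y_2)\big)$$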

The reward is modeled as a neural network $r_\phi(x, y)$, initialized from the supervised fine-tuning model $\pi^{SFT}(y \mid x)$ with a linear layer on top that produces a single scalar prediction (the reward). By framing the problem as binary classification, we can perform maximum likelihood estimation by minimizing the negative log-likelihood loss:

$$\mathcal{L}_R(r_\phi, \mathcal{D}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]$$

Note from the paper: "to ensure a reward function with lower variance, prior works normalize the rewards, such that $\mathbb{E}_{x, y \sim \mathcal{D}}[r_\phi(x, y)] = 0$ for all $x$". Why does it lower variance? How do you normalize in order to respect this equality?
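
A minimal PyTorch sketch of this loss, assuming the reward model already produces one scalar per (prompt, completion) pair:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the Bradley-Terry model.

    reward_chosen / reward_rejected: shape (batch,), the scalar rewards
    r_phi(x, y_w) and r_phi(x, y_l) for each preference pair.
    """
    # -log sigmoid(r_w - r_l), averaged over the batch
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# toy usage with random rewards standing in for the reward model's outputs
r_w = torch.randn(8, requires_grad=True)
r_l = torch.randn(8, requires_grad=True)
loss = reward_model_loss(r_w, r_l)
loss.backward()
```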

RL fine-tuning phase

The learned reward function serves as the "environment" and produces a reward given a completion from the final language model $\pi_\theta$ (initialized to $\pi^{SFT}$).

Because this reward is not a direct function of the language model's parameters (we use decoding to get the final sentence, which is a discrete operation since we auto-regressively pick tokens from the softmax), we can't compute gradients of the reward with respect to the model's parameters. The objective is thus not differentiable and is optimized with reinforcement learning.

formulating DPO

"leverage an anlytical mapping from reward functions to optimal policies, which enables ut to transform a loss function over reward functions into a loss function over policies".

Start from the RL fine-tuning objective:

$$\max_\pi \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi}\big[r(x, y)\big] - \beta\, \mathbb{D}_{KL}\big[\pi(y \mid x) \,\Vert\, \pi_{ref}(y \mid x)\big]$$

where $\beta$ controls the deviation from the base reference policy $\pi_{ref}$ (which is in practice the initial supervised fine-tuning model $\pi^{SFT}$).

Reformulate the objective as a KL-divergence between the policy $\pi$ we want to learn and a constructed probability distribution $\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{ref}(y \mid x) \exp\!\big(\frac{1}{\beta} r(x, y)\big)$, where $Z(x) = \sum_y \pi_{ref}(y \mid x) \exp\!\big(\frac{1}{\beta} r(x, y)\big)$ is the partition function.

Since the KL-divergence is minimized at 0 if and only if the two distributions are identical (Gibbs' inequality), the optimal policy is the constructed distribution $\pi^*$.
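
The algebra behind this reformulation (a condensed version of the derivation in the paper's appendix): fold the reward into the log, then use $\pi_{ref}(y \mid x) \exp\!\big(\frac{1}{\beta} r(x, y)\big) = Z(x)\, \pi^*(y \mid x)$:

$$\begin{aligned}\mathbb{E}_{y \sim \pi}\big[r(x, y)\big] - \beta\, \mathbb{D}_{KL}\big[\pi \,\Vert\, \pi_{ref}\big] &= -\beta\, \mathbb{E}_{y \sim \pi}\left[\log \frac{\pi(y \mid x)}{\pi_{ref}(y \mid x) \exp\!\big(\frac{1}{\beta} r(x, y)\big)}\right] \\ &= -\beta\, \mathbb{D}_{KL}\big[\pi(y \mid x) \,\Vert\, \pi^*(y \mid x)\big] + \beta \log Z(x)\end{aligned}$$

Since $Z(x)$ does not depend on $\pi$, maximizing the left-hand side is exactly minimizing the KL term on the right.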

Next, we can isolate the reward function and express it as:

$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{ref}(y \mid x)} + \beta \log Z(x)$$

Plugging this into the Bradley-Terry preference model gets rid of the partition function $Z(x)$, since the preference probability only depends on the difference of rewards between the two completions:

$$\begin{aligned}p^*(y_1 \succ y_2 \mid x) &= \frac{1}{1 + \exp\Big(\beta \log \frac{\pi^*(y_2 \mid x)}{\pi_{ref}(y_2 \mid x)} - \beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{ref}(y_1 \mid x)}\Big)} \\ &= \sigma\big(\hat{r}(x, y_1) - \hat{r}(x, y_2)\big)\end{aligned}$$

where $\sigma$ is the sigmoid function and $\hat{r}(x, y) = \beta \log \frac{\pi(y \mid x)}{\pi_{ref}(y \mid x)}$ is the reward implicitly defined by the language model and the reference model.

Maximizing log-likelihood is equivalent to minimizing the log-loss (that's DPO!):

$$\begin{aligned}\mathcal{L}_{DPO}(\pi_\theta, \pi_{ref}) &= -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\big[\log p^*(y_w \succ y_l \mid x)\big] \\ &= -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\Big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)}\Big)\right]\end{aligned}$$

where the model is parameterized by weights $\theta$.
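
A minimal PyTorch sketch of this loss, assuming we already have the sequence log-probabilities $\log \pi_\theta(y \mid x)$ and $\log \pi_{ref}(y \mid x)$ for the chosen and rejected completions (see the next snippet for how those can be computed):

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_logp_chosen: torch.Tensor,    # log pi_theta(y_w | x), shape (batch,)
    policy_logp_rejected: torch.Tensor,  # log pi_theta(y_l | x), shape (batch,)
    ref_logp_chosen: torch.Tensor,       # log pi_ref(y_w | x), shape (batch,)
    ref_logp_rejected: torch.Tensor,     # log pi_ref(y_l | x), shape (batch,)
    beta: float = 0.1,
) -> torch.Tensor:
    # implicit rewards r_hat = beta * log(pi_theta / pi_ref)
    chosen_rewards = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_rewards = beta * (policy_logp_rejected - ref_logp_rejected)
    # -log sigmoid(r_hat_w - r_hat_l), averaged over the batch
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```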

What does $\pi_\theta(y \mid x)$ represent? It's the likelihood of completion $y$ given the prompt $x$. Denoting the tokens of $y$ as $y := [y^1, \dots, y^k]$, we can take the product of the single-token probabilities generated auto-regressively:

$$\pi_\theta(y \mid x) = \prod_{i=1}^{k} \pi_\theta(y^i \mid x, y^{<i})$$
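
In practice we work with the log-probability, i.e. the sum of per-token log-probabilities. A sketch of how this could be computed from a causal LM's logits (the prompt/padding mask handling is simplified):

```python
import torch
import torch.nn.functional as F

def sequence_logprob(logits: torch.Tensor, labels: torch.Tensor, completion_mask: torch.Tensor) -> torch.Tensor:
    """log pi_theta(y | x) = sum_i log pi_theta(y^i | x, y^{<i}).

    logits:          (batch, seq_len, vocab) from a causal LM run on [prompt + completion]
    labels:          (batch, seq_len) token ids of the same sequence
    completion_mask: (batch, seq_len) with 1 on completion tokens, 0 on prompt/padding
    """
    # logits at position i predict the token at position i + 1
    logits = logits[:, :-1, :]
    labels = labels[:, 1:]
    mask = completion_mask[:, 1:]
    # per-token log-probabilities of the observed tokens
    per_token_logps = torch.gather(
        F.log_softmax(logits, dim=-1), dim=2, index=labels.unsqueeze(-1)
    ).squeeze(-1)
    # keep only the completion tokens and sum over the sequence
    return (per_token_logps * mask).sum(dim=-1)
```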

Theoretical analysis

They show that, for any reward function, we can reparameterize it using two models as $\hat{r} = \beta \log \frac{\pi_\theta}{\pi_{ref}}$ and, under the Bradley-Terry preference model, learn the same preference distribution. The DPO objective thus does not constrain the set of reward models we can learn.

Experiments

3 open-ended text generation tasks:

  • controlled sentiment generation: given the prefix of a movie review from the IMDb dataset, generate a completion with positive sentiment. They fine-tune GPT-2 to get their reference model.
  • summarization: generate a summary of the main points of a forum post, using the Reddit TL;DR summarization dataset.
  • single-turn dialogue: using the Anthropic Helpful and Harmless dialogue dataset (170k dialogues between a human and an automated assistant), generate a response to the user's message.

Evaluation: plot reward vs. KL divergence from the reference policy. For the sentiment task they have access to the ground-truth reward function via a pre-trained sentiment classifier. For summarization and dialogue they use GPT-4 to compute a win rate between the models (a proxy for human evaluation).

Personal comment: I don't think their experiments are sufficient to show DPO is on par with RLHF, because:

  1. they use smaller models. Can DPO performance scale with the number of model parameters?

  2. the tasks are far from production settings like ChatGPT. I would be interested to see how DPO-trained models compare to PPO-trained models in an LLM arena.