Mastering Llama 2 with Direct Preference Optimization

Introduction

Reinforcement Learning from Human Feedback (RLHF) has become the standard final training step for large language models (LLMs) like GPT-4 and Claude, aligning their outputs with human expectations for chat behavior and safety. However, traditional RLHF is complex: it requires fitting a reward model, estimating a value function, and carefully constraining the policy so it stays close to the original model and doesn't drift into gibberish. The Direct Preference Optimization (DPO) paper by Rafailov, Sharma, Mitchell, and colleagues simplifies this by replacing the RL-based objective with a straightforward binary cross-entropy loss, making the alignment process far more accessible.

This blog post introduces DPO as implemented in the TRL library, demonstrating how to fine-tune the 7B-parameter Llama v2 model on the Stack Exchange preference dataset, which contains ranked answers from various Stack Exchange sites.

DPO vs. PPO

Traditional RLHF uses a separate reward model and reinforcement learning (often PPO) to maximize the reward while staying close to a reference model via a KL penalty. DPO cleverly bypasses the reward model entirely by deriving an analytical mapping from the reward function to the optimal policy. This allows DPO to be optimized directly on preference data using only the reference model, removing the need for fiddly RL hyperparameters.
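
Concretely, writing \pi_\theta for the policy being trained, \pi_{\mathrm{ref}} for the frozen reference, and (y_w, y_l) for the preferred and dispreferred completions of a prompt x, the DPO objective from the paper is the binary cross-entropy loss

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]

where \sigma is the logistic function and \beta plays the role of the KL-penalty strength: larger values keep the policy closer to the reference. This is the same beta parameter passed to the trainer below.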

Training with TRL

A typical RLHF pipeline includes:

  1. Supervised fine-tuning (SFT)
  2. Annotating preference data
  3. Training a reward model
  4. RL optimization

DPO eliminates steps 3 and 4, requiring only SFT (step 1) and preference data formatted as a dictionary with three keys:

  • prompt: The context given to the model
  • chosen: The preferred response
  • rejected: The dispreferred response
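
Concretely, a single formatted example might look like this (the content below is hypothetical, purely for illustration):

sample = {
    "prompt": "Question: How do I sort a list in Python?\n\nAnswer: ",
    "chosen": "Use the built-in sorted() function, or list.sort() to sort in place.",
    "rejected": "You need to implement bubble sort yourself.",
}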

For the Stack Exchange dataset, this formatting is straightforward: a helper function, return_prompt_and_responses, extracts each question as the prompt and maps the two ranked answers to chosen and rejected (a sketch of this helper follows the snippet below). The DPOTrainer then needs the base model (from SFT) and a reference model (usually a frozen copy of the SFT model). The beta parameter (typically 0.1–0.5) controls how strongly the optimized policy is kept close to the reference model; lower values let it drift further. Here's a snippet:

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

# Policy model to optimize, plus a frozen copy to serve as the reference
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model_ref = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

training_args = TrainingArguments(output_dir="llama2-dpo")

dpo_trainer = DPOTrainer(
    model,
    model_ref,
    beta=0.1,  # strength of the implicit KL penalty toward the reference
    train_dataset=formatted_dataset,  # prompt/chosen/rejected records
    tokenizer=tokenizer,
    args=training_args,
)
dpo_trainer.train()
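
For reference, here is a minimal sketch of the return_prompt_and_responses helper that builds formatted_dataset, assuming the paired Stack Exchange dataset from the Hugging Face example (lvwerra/stack-exchange-paired) with question, response_j (preferred), and response_k (dispreferred) columns:

from datasets import load_dataset

def return_prompt_and_responses(samples):
    # Map each batch of records onto the prompt/chosen/rejected schema
    return {
        "prompt": [
            "Question: " + q + "\n\nAnswer: " for q in samples["question"]
        ],
        "chosen": samples["response_j"],    # preferred (higher-ranked) answer
        "rejected": samples["response_k"],  # dispreferred answer
    }

dataset = load_dataset("lvwerra/stack-exchange-paired", split="train")
formatted_dataset = dataset.map(
    return_prompt_and_responses,
    batched=True,
    remove_columns=dataset.column_names,
)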

Experiment with Llama v2

The TRL library integrates with PEFT and bitsandbytes for memory-efficient training. For example, you can use QLoRA to fine-tune the 4-bit quantized Llama v2 7B model. First, perform SFT using the SFTTrainer, then apply DPO with the DPOTrainer. Both steps leverage quantization and LoRA adapters to fit on a single GPU.
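
A minimal sketch of that setup (the quantization and LoRA hyperparameters here are illustrative defaults, not the exact configuration from the original experiment):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig

# Load the base model with 4-bit (QLoRA-style) quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters; q_proj/v_proj are common targets for Llama-style attention
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

Both SFTTrainer and DPOTrainer accept a peft_config argument, so the same adapter configuration can be reused for the SFT and DPO stages.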

Conclusion

DPO offers a simpler, more stable alternative to traditional RLHF for aligning LLMs with human preferences. Combined with TRL's efficient training tools, fine-tuning even large models like Llama v2 becomes practical and accessible.