Reinforcement Learning from Human Feedback (RLHF) is a cutting-edge technique that enables language models to better align with complex human preferences. Unlike traditional training, which optimizes a simple objective such as next-token prediction, RLHF incorporates direct human feedback to guide model behavior. This approach has been instrumental in the success of models like ChatGPT, allowing them to generate more useful, truthful, and creative responses.
The Three Steps of RLHF
RLHF involves a multi-stage training process with three core steps:
- Pretraining a language model: Start with a model already trained on general text data, such as GPT-3 or Gopher. This model should respond well to diverse instructions.
- Training a reward model: Generate preference data by having humans rank outputs from the language model. This data trains a reward model that assigns a scalar score to text based on human preferences.
- Fine-tuning with reinforcement learning: Use the reward model as a signal to fine-tune the original language model, optimizing it to produce text that humans prefer.
Pretraining Language Models
The initial language model is typically a large-scale transformer pretrained on diverse text. For RLHF, it's crucial that this model can handle a wide range of instructions. Companies like OpenAI, Anthropic, and DeepMind have used models from 10 billion to 280 billion parameters. Fine-tuning on curated human-written text can further improve the starting point, but it's not strictly necessary.
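As a concrete illustration, here is a minimal sketch using Hugging Face Transformers of loading a pretrained causal language model and sampling a completion from it. The model name "gpt2" is only a small stand-in for the much larger models mentioned above, and the prompt is invented for the example.

```python
# Minimal sketch: load a pretrained causal LM as the RLHF starting point.
# "gpt2" is a small stand-in; real RLHF pipelines start from far larger models.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
policy_model = AutoModelForCausalLM.from_pretrained(model_name)

# Sample a completion for a prompt, as the model would when generating
# candidate responses for human annotators to compare later.
prompt = "Explain why the sky is blue."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = policy_model.generate(**inputs, max_new_tokens=50, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```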
Reward Model Training
Creating a reward model that captures human preferences is a key innovation in RLHF. Instead of asking humans to assign scores directly (which can be noisy and subjective), the process uses pairwise comparisons. Annotators rank outputs from different models for the same prompt, and these rankings are converted into a scalar reward using systems like Elo. The reward model is often a smaller language model fine-tuned on this preference data. For example, OpenAI used a 6B parameter reward model alongside a 175B parameter generator.
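To make the reward-model objective concrete, below is a minimal PyTorch sketch of the pairwise ranking loss commonly used for this step: the model's scalar score for the preferred response is pushed above its score for the rejected one. The function name and the toy scores are invented for illustration; in practice the scores would come from the reward model's forward pass over each response.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_chosen: torch.Tensor,
                         reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss: encourage the score of the human-preferred
    response to exceed the score of the rejected one.
    Both inputs are scalar rewards per comparison, shape (batch,)."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with made-up scores for three human comparisons.
chosen = torch.tensor([1.2, 0.3, 0.8])     # scores for preferred responses
rejected = torch.tensor([0.4, 0.5, -0.1])  # scores for rejected responses
print(pairwise_reward_loss(chosen, rejected))
```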
Fine-Tuning with Reinforcement Learning
With a reward model in place, the original language model is fine-tuned using reinforcement learning, most commonly with Proximal Policy Optimization (PPO). The model generates text, receives a score from the reward model, and updates its parameters to maximize that reward. This step balances exploration (trying new outputs) with exploitation (sticking with known good outputs), while a KL-divergence penalty against the original frozen model keeps the fine-tuned policy from drifting into text that merely games the reward model. The result is a model that not only generates coherent text but also aligns with human values like helpfulness, honesty, and harmlessness.
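The sketch below illustrates, under simplified assumptions, how the reward signal used during RL fine-tuning is often composed: the reward model's score for a generated response minus a KL penalty computed from the fine-tuned policy's log-probabilities versus those of the original frozen model. The function name, coefficient value, and toy tensors are all illustrative, not a specific library's API.

```python
import torch

def rl_reward(reward_model_score: torch.Tensor,
              logprobs_policy: torch.Tensor,
              logprobs_initial: torch.Tensor,
              kl_coef: float = 0.1) -> torch.Tensor:
    """Combine the reward model's scalar score for a response with a KL
    penalty (estimated from per-token log-probabilities) that discourages
    the fine-tuned policy from drifting far from the frozen initial model."""
    kl_estimate = (logprobs_policy - logprobs_initial).sum()
    return reward_model_score - kl_coef * kl_estimate

# Toy example: a single generated response of 4 tokens.
score = torch.tensor(0.9)                            # reward model's score
lp_policy = torch.tensor([-1.0, -0.8, -1.2, -0.5])   # log-probs under policy
lp_initial = torch.tensor([-1.1, -0.9, -1.0, -0.6])  # log-probs under frozen model
print(rl_reward(score, lp_policy, lp_initial))
```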
Open-Source Tools for RLHF
Several open-source libraries and datasets support RLHF research, including Hugging Face's Transformers and Datasets, Anthropic's HH-RLHF dataset, and tools for reward modeling and PPO training. These resources make it easier for researchers to experiment with RLHF without starting from scratch.
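For example, Anthropic's HH-RLHF preference data can be pulled directly from the Hugging Face Hub with the Datasets library. The short snippet below assumes the dataset's published "chosen" and "rejected" fields, each holding one side of a human comparison.

```python
from datasets import load_dataset

# Load Anthropic's HH-RLHF preference dataset from the Hugging Face Hub.
dataset = load_dataset("Anthropic/hh-rlhf", split="train")

# Each record pairs a preferred ("chosen") and a dispreferred ("rejected")
# conversation for the same prompt, the format needed to train a reward
# model on pairwise comparisons.
example = dataset[0]
print(example["chosen"][:200])
print(example["rejected"][:200])
```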
What's Next for RLHF?
RLHF is still an active area of research, with many open questions about optimal model sizes, training procedures, and the best ways to collect human feedback. Future developments may include more scalable reward models, better alignment with nuanced human values, and broader applications beyond chatbots.