The Hugging Face team has announced the integration of trl with peft, making it easier to fine-tune large language models with reinforcement learning. The combination allows users to fine-tune models with up to 20 billion parameters using reinforcement learning from human feedback (RLHF) on a single 24GB consumer GPU, a significant reduction in hardware requirements.
Traditionally, RLHF involves three steps: fine-tuning a pretrained LLM on instructions, training a reward model, and then using reinforcement learning (typically PPO) to optimize the fine-tuned model against that reward. The challenge has been the memory footprint: the policy being trained, a frozen reference model for the KL penalty, and the gradients and optimizer states must all fit in GPU memory at once, which for a 20B model typically requires multiple high-end GPUs, as the back-of-the-envelope estimate below shows.
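To see why, consider a rough estimate of what unquantized PPO on a 20B model would need. The byte counts below are the standard sizes for bf16 weights and Adam's fp32 moment buffers; the figures are illustrative, not measured:

```python
# Back-of-the-envelope memory budget for PPO on a 20B-parameter model
# without quantization or adapters (illustrative, not measured).
params = 20e9

policy = params * 2 / 1e9       # ~40 GB: trainable policy weights in bf16
reference = params * 2 / 1e9    # ~40 GB: frozen reference copy for the KL penalty
gradients = params * 2 / 1e9    # ~40 GB: one bf16 gradient per trainable weight
adam_states = params * 8 / 1e9  # ~160 GB: two fp32 moment buffers (4 bytes each)

total = policy + reference + gradients + adam_states
print(f"~{total:.0f} GB before activations")  # ~280 GB, vs. 24 GB on a consumer GPU
```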
With trl and peft, the team leverages parameter-efficient fine-tuning techniques and 8-bit quantization. By loading the base model with 8-bit matrix multiplication and training only low-rank adapters (LoRA), the memory requirements drop dramatically: a 20B model in 8-bit occupies only about 20GB, leaving room on a 24GB GPU for the adapter weights, their gradients, and the optimizer states, which now cover only a tiny fraction of the parameters. The reference model also comes for free: because LoRA keeps the base weights frozen, disabling the adapters recovers the original model, so no second copy is needed.
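Concretely, the setup looks roughly like the sketch below. The model name and LoRA hyperparameters are illustrative choices, not values from the post:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

# Load the frozen base model in 8-bit via bitsandbytes (~1 byte per parameter)
base_model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b",  # any causal LM; a 20B model fits in ~20 GB at 8-bit
    load_in_8bit=True,
    device_map="auto",
)

# Cast layer norms and the LM head for stable training on top of int8 weights
# (newer peft versions call this prepare_model_for_kbit_training)
base_model = prepare_model_for_int8_training(base_model)

lora_config = LoraConfig(
    r=16,             # rank of the low-rank update matrices
    lora_alpha=32,    # scaling factor applied to the adapter output
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```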
The integration is demonstrated with a step-by-step example of training a LLaMA-style model on the IMDB dataset to generate positive movie reviews. The training loop uses PPO with a learned reward model that scores the sentiment of each generation, and the PEFT approach updates only a small fraction of the parameters, keeping the process efficient even on modest hardware; a condensed version of the loop is sketched below.
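This sketch follows the structure of trl's classic sentiment example: generate, score, then take a PPO step. The small gpt2-scale model names (lvwerra/gpt2-imdb, lvwerra/distilbert-imdb) and all hyperparameters are stand-ins so the loop is runnable anywhere; the post's actual model is far larger, but the loop is the same:

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, pipeline
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

config = PPOConfig(model_name="lvwerra/gpt2-imdb", learning_rate=1.41e-5,
                   batch_size=16, mini_batch_size=4)

tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)

# Truncated IMDB reviews serve as prompts the model must continue positively
dataset = load_dataset("imdb", split="train").select(range(256))

def tokenize(sample):
    sample["input_ids"] = tokenizer(sample["text"], truncation=True, max_length=8)["input_ids"]
    sample["query"] = tokenizer.decode(sample["input_ids"])
    return sample

dataset = dataset.map(tokenize)
dataset.set_format(type="torch")

def collator(data):
    return {key: [d[key] for d in data] for key in data[0]}

ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer,
                         dataset=dataset, data_collator=collator)

# A sentiment classifier supplies the reward: the more positive, the higher
reward_pipe = pipeline("sentiment-analysis", model="lvwerra/distilbert-imdb")

gen_kwargs = {"max_new_tokens": 24, "do_sample": True,
              "pad_token_id": tokenizer.eos_token_id}

for batch in ppo_trainer.dataloader:
    query_tensors = batch["input_ids"]

    # Generate a continuation for each prompt
    response_tensors = []
    for query in query_tensors:
        response = ppo_trainer.generate(query, **gen_kwargs)
        response_tensors.append(response.squeeze()[len(query):])
    batch["response"] = [tokenizer.decode(r) for r in response_tensors]

    # Reward = probability the classifier assigns to the POSITIVE class
    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    outputs = reward_pipe(texts, top_k=None)
    rewards = [
        torch.tensor(next(d["score"] for d in out if d["label"] == "POSITIVE"))
        for out in outputs
    ]

    # One PPO optimization step on this batch, then log training statistics
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)
```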
Performance benchmarks show that the PEFT approach matches or exceeds the quality of full fine-tuning while drastically reducing GPU memory usage.
The code is available in the TRL documentation, and the team invites the community to experiment with larger models and datasets.