Training large language models often runs into out-of-memory (OOM) errors on the hardware at hand. The integration of DeepSpeed's ZeRO optimizer into Hugging Face's Accelerate library now lets users train models up to 5x larger per GPU without modifying a single line of training code.
Why This Matters
Large models deliver state-of-the-art performance but are notoriously difficult to train. ZeRO (Zero Redundancy Optimizer) shards optimizer states, gradients, and model parameters across GPUs, significantly reducing memory usage. Accelerate now supports ZeRO stages 1–3, along with offloading to CPU or disk, making it easy to fit bigger models.
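To see where the savings come from, here is a rough back-of-the-envelope estimate of per-GPU memory for model states under each ZeRO stage. It assumes mixed-precision Adam and the 2 + 2 + 12 bytes-per-parameter accounting from the ZeRO paper; activations, buffers, and fragmentation come on top, so treat the numbers as illustrative rather than measured.

```python
# Rough per-GPU memory for *model states* under ZeRO, following the
# 2 (fp16 params) + 2 (fp16 grads) + 12 (fp32 optimizer states) bytes/param
# accounting from the ZeRO paper. Activations and buffers are not included.

def model_state_gb(num_params: float, num_gpus: int, zero_stage: int) -> float:
    params_b = 2 * num_params          # fp16 parameters
    grads_b = 2 * num_params           # fp16 gradients
    optim_b = 12 * num_params          # fp32 master params + Adam momentum/variance
    if zero_stage >= 1:                # Stage 1: shard optimizer states
        optim_b /= num_gpus
    if zero_stage >= 2:                # Stage 2: also shard gradients
        grads_b /= num_gpus
    if zero_stage >= 3:                # Stage 3: also shard the parameters
        params_b /= num_gpus
    return (params_b + grads_b + optim_b) / 1e9

for stage in (0, 1, 2, 3):
    print(f"ZeRO stage {stage}: ~{model_state_gb(900e6, num_gpus=2, zero_stage=stage):.1f} GB per GPU")
# stage 0 (plain DDP) ≈ 14.4 GB, stage 2 on 2 GPUs ≈ 8.1 GB, stage 3 ≈ 7.2 GB
```

On two GPUs, Stage 2 roughly halves the gradient and optimizer memory for a 900M-parameter model, which is what frees room for the much larger batch sizes reported below.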
Benchmark Results
In tests on a 2×24GB NVIDIA Titan RTX setup, finetuning a 900M-parameter DeBERTa model on MRPC showed:
- DDP (Distributed Data Parallel): max batch size 8, 103.57s/epoch, F1 0.931
- DeepSpeed ZeRO Stage 2: max batch size 40, 28.98s/epoch (3.5× faster), F1 0.936
All of this was achieved with zero changes to the training script; selecting the DeepSpeed plugin through `accelerate config` is enough. The script itself stays untouched, as the sketch below illustrates.
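As a concrete illustration of what "zero code changes" means, here is a minimal Accelerate training loop. A toy linear model and random data stand in for the DeBERTa/MRPC setup above; the point is that the same script runs under plain DDP or under DeepSpeed ZeRO depending solely on what `accelerate config` selected.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()  # picks up the backend chosen via `accelerate config`

# Toy stand-ins for the real model and dataset.
model = torch.nn.Linear(128, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
dataset = TensorDataset(torch.randn(256, 128), torch.randint(0, 2, (256,)))
loader = DataLoader(dataset, batch_size=8, shuffle=True)

# prepare() wraps everything for the selected backend (DDP, DeepSpeed ZeRO, ...).
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

model.train()
for inputs, labels in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    accelerator.backward(loss)  # instead of loss.backward(); lets DeepSpeed hook in
    optimizer.step()
```

Launched with `accelerate launch`, the identical script covers both the DDP and ZeRO Stage 2 rows above; only the answers given to `accelerate config` differ.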
How It Works
- ZeRO Stage 1: Shards optimizer states across GPUs.
- Stage 2: Shards optimizer states and gradients.
- Stage 3: Shards optimizer states, gradients, and model parameters.
- Optimizer/Param Offload: Moves sharded optimizer states and/or parameters to CPU memory or NVMe disk, to fit even larger models.
Accelerate integrates seamlessly with DeepSpeed, FairScale, and PyTorch FSDP. For tuning beyond these basic options, a full DeepSpeed config file can be supplied through `accelerate config` instead of relying on the plugin's defaults.
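The same choices can also be made programmatically. The sketch below constructs a DeepSpeedPlugin by hand and passes it to the Accelerator; the field names follow Accelerate's DeepSpeedPlugin API, but the exact set of accepted arguments has varied across releases, so check the version you have installed.

```python
from accelerate import Accelerator, DeepSpeedPlugin

# Express the ZeRO stage and offload choices in code rather than via `accelerate config`.
ds_plugin = DeepSpeedPlugin(
    zero_stage=2,                     # 1, 2, or 3, as described in the list above
    gradient_accumulation_steps=1,
    offload_optimizer_device="cpu",   # "none", "cpu", or "nvme"
    offload_param_device="none",      # parameter offload only applies with ZeRO Stage 3
)

accelerator = Accelerator(deepspeed_plugin=ds_plugin, mixed_precision="fp16")
# From here on, the training loop is identical to the one shown earlier.
```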
Getting Started
Run `accelerate config`, select DeepSpeed as the distributed type, choose your ZeRO stage, and launch training with `accelerate launch`.
For sequence-to-sequence tasks (e.g., chatbot training), ZeRO Stage 2 also delivers a 2× speedup by enabling larger batch sizes. The full examples are available on GitHub.
"You don't need to be an ML engineer to train large models — Accelerate + DeepSpeed does the heavy lifting."