
Supercharge Model Training with ZeRO: DeepSpeed and FairScale Explained

AI
April 26, 2026 · 5:52 PM

As machine learning models rapidly outpace GPU memory growth, many researchers struggle to train, or even load, large models. The ZeRO paper (Rajbhandari et al., 2019) introduced memory optimizations that make it possible to train models with up to a trillion parameters, with open-source implementations in DeepSpeed (Microsoft) and FairScale (Facebook). Hugging Face's Trainer now supports both via the --sharded_ddp (FairScale) and --deepspeed (DeepSpeed) command-line arguments.
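
To make that concrete, here is a hedged sketch of what switching between the two backends looks like on the command line. The script and flags mirror the ones used later in this post, but the exact spelling of --sharded_ddp varies across transformers versions, so treat this as an illustration rather than a copy-paste recipe:

# FairScale: ZeRO sharding is enabled with a single Trainer flag
python -m torch.distributed.launch --nproc_per_node=2 ./finetune_trainer.py \
--model_name_or_path t5-large --task translation_en_to_ro \
--fp16 --sharded_ddp

# DeepSpeed: uses its own launcher and always requires a JSON config file
deepspeed --num_gpus=2 ./finetune_trainer.py \
--model_name_or_path t5-large --task translation_en_to_ro \
--fp16 --deepspeed ds_config.json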

Multi-GPU Speedups

We benchmarked fine-tuning t5-large on English-to-Romanian translation using 2x 24GB Titan RTX GPUs. The baseline DistributedDataParallel (DDP) run topped out at a batch size (BS) of 16; layering on optimizations raised that ceiling and, in most cases, cut runtimes:

Method                        Max BS   Train Time (s)   Eval Time (s)
baseline                          16            30.95           56.33
fp16                              20            21.49           53.47
sharded_ddp                       30            25.91           47.56
sharded_ddp + fp16                30            17.38           45.66
deepspeed (no CPU offload)        40            10.40           34.93
deepspeed (CPU offload)           50            20.97           32.14

Both libraries deliver substantial gains in speed and maximum batch size. DeepSpeed currently offers more features but requires a JSON config file, while FairScale needs only a single flag and is easier to deploy. Think of it as an 80:20 rule: FairScale gets you most of the benefit with minimal effort, while DeepSpeed's extra tuning is worth the investment for production projects.
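
For reference, here is a hedged reconstruction of how the table's rows map to launch flags. This is an illustration, not the post's actual benchmark script; dataset, output, and other sizing flags are omitted:

# Baseline DDP row: plain distributed launch at BS=16
export BS=16
python -m torch.distributed.launch --nproc_per_node=2 ./finetune_trainer.py \
--model_name_or_path t5-large --task translation_en_to_ro \
--per_device_train_batch_size $BS --per_device_eval_batch_size $BS
# fp16 row:         add --fp16 (and raise BS to 20)
# sharded_ddp rows: add --sharded_ddp, plus --fp16 for the combined row (BS=30)
# deepspeed rows:   launch via `deepspeed` with --deepspeed ds_config.json (BS=40/50)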

Single-GPU: Fitting a Giant Model

DeepSpeed also benefits single-GPU setups. Attempting to train t5-3b (3 billion parameters) on a 24GB RTX 3090 without optimizations fails with out-of-memory (OOM) errors even at BS=1. With DeepSpeed and a single-GPU configuration, BS=20 works fine:

export BS=20
CUDA_VISIBLE_DEVICES=0 deepspeed --num_gpus=1 ./finetune_trainer.py \
--model_name_or_path t5-3b --n_train 60 --n_val 10 \
--per_device_eval_batch_size $BS --per_device_train_batch_size $BS \
--task translation_en_to_ro --fp16 --deepspeed ds_config_1gpu.json

Results: 8.85s train time and 3.62s eval time; pushing to BS=30 hits OOM. The ds_config_1gpu.json config enables activation checkpointing and CPU offloading, trading extra compute and host-device transfers for a far larger effective memory budget.
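
The post doesn't show ds_config_1gpu.json itself, so here is a hedged sketch of what such a config might look like. Key names follow DeepSpeed releases from this period (newer versions replace "cpu_offload" with an "offload_optimizer" block), and the values are illustrative rather than tuned:

# Hypothetical contents for ds_config_1gpu.json; not the benchmark's actual file
cat > ds_config_1gpu.json <<'EOF'
{
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "cpu_offload": true
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": true
  }
}
EOF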

Getting Started

For full documentation, refer to the Trainer Integrations page. Detailed benchmark scripts and configurations are available in this GitHub issue.