
Supercharge Model Training with ZeRO: DeepSpeed and FairScale Explained

AI
April 26, 2026 · 5:52 PM

As machine learning models rapidly outpace GPU memory growth, many researchers struggle to train, or even load, large models. The ZeRO paper (Rajbhandari et al., 2019) introduced memory optimizations that make it possible to train models with up to a trillion parameters, with open-source implementations in DeepSpeed (Microsoft) and FairScale (Facebook). Hugging Face's Trainer now supports both via the --sharded_ddp (FairScale) and --deepspeed (DeepSpeed) command-line arguments.
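
To make that concrete, here is a hedged sketch of what switching between the two backends looks like on the command line. The script and flags mirror the ones used later in this post, but the exact spelling of --sharded_ddp varies across transformers versions, so treat this as an illustration rather than a copy-paste recipe:

# FairScale: ZeRO sharding is enabled with a single Trainer flag
python -m torch.distributed.launch --nproc_per_node=2 ./finetune_trainer.py \
--model_name_or_path t5-large --task translation_en_to_ro \
--fp16 --sharded_ddp

# DeepSpeed: uses its own launcher and always requires a JSON config file
deepspeed --num_gpus=2 ./finetune_trainer.py \
--model_name_or_path t5-large --task translation_en_to_ro \
--fp16 --deepspeed ds_config.json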

Multi-GPU Speedups

We benchmarked fine-tuning t5-large on English-to-Romanian translation using 2x 24GB Titan RTX GPUs. The baseline DistributedDataParallel (DDP) run topped out at a batch size (BS) of 16; layering on optimizations raised that ceiling and, in most cases, cut runtimes:

Method                        Max BS   Train Time (s)   Eval Time (s)
baseline                          16            30.95           56.33
fp16                              20            21.49           53.47
sharded_ddp                       30            25.91           47.56
sharded_ddp + fp16                30            17.38           45.66
deepspeed (no CPU offload)        40            10.40           34.93
deepspeed (CPU offload)           50            20.97           32.14

Both libraries deliver substantial gains in speed and maximum batch size. DeepSpeed currently offers more features but requires a JSON config file, while FairScale needs only a single flag and is easier to deploy. Think of it as an 80:20 rule: FairScale gets you most of the benefit with minimal effort, while DeepSpeed's extra tuning is worth the investment for production projects.
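
For reference, here is a hedged reconstruction of how the table's rows map to launch flags. This is an illustration, not the post's actual benchmark script; dataset, output, and other sizing flags are omitted:

# Baseline DDP row: plain distributed launch at BS=16
export BS=16
python -m torch.distributed.launch --nproc_per_node=2 ./finetune_trainer.py \
--model_name_or_path t5-large --task translation_en_to_ro \
--per_device_train_batch_size $BS --per_device_eval_batch_size $BS
# fp16 row:         add --fp16 (and raise BS to 20)
# sharded_ddp rows: add --sharded_ddp, plus --fp16 for the combined row (BS=30)
# deepspeed rows:   launch via `deepspeed` with --deepspeed ds_config.json (BS=40/50)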

Single-GPU: Fitting a Giant Model

DeepSpeed also benefits single-GPU setups. Attempting to train t5-3b (3 billion parameters) on a 24GB RTX 3090 without optimizations fails with out-of-memory (OOM) errors even at BS=1. With DeepSpeed and a single-GPU configuration, BS=20 works fine:

export BS=20
CUDA_VISIBLE_DEVICES=0 deepspeed --num_gpus=1 ./finetune_trainer.py \
--model_name_or_path t5-3b --n_train 60 --n_val 10 \
--per_device_eval_batch_size $BS --per_device_train_batch_size $BS \
--task translation_en_to_ro --fp16 --deepspeed ds_config_1gpu.json

Results: 8.85s train time and 3.62s eval time; pushing to BS=30 hits OOM. The ds_config_1gpu.json config enables activation checkpointing and CPU offloading, trading extra compute and host-device transfers for a far larger effective memory budget.
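
The post doesn't show ds_config_1gpu.json itself, so here is a hedged sketch of what such a config might look like. Key names follow DeepSpeed releases from this period (newer versions replace "cpu_offload" with an "offload_optimizer" block), and the values are illustrative rather than tuned:

# Hypothetical contents for ds_config_1gpu.json; not the benchmark's actual file
cat > ds_config_1gpu.json <<'EOF'
{
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "cpu_offload": true
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": true
  }
}
EOF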

Getting Started

For full documentation, refer to the Trainer Integrations page. Detailed benchmark scripts and configurations are available in this GitHub issue.