In a recent technical blog post, engineers from Hugging Face demonstrated how to fine-tune the massive Llama 2 70B language model using PyTorch's Fully Sharded Data Parallel (FSDP), addressing key challenges of memory consumption, checkpoint saving, and training speed. The team leveraged the Hugging Face Transformers, Accelerate, and TRL libraries, along with SLURM for job scheduling.
FSDP shards optimizer states, gradients, and parameters across devices, performing all-gather operations during forward and backward passes and reduce-scatter for gradient averaging. The setup required at least one node with eight 80 GB A100 GPUs connected intra-node via NVLink, plus 1 TB of CPU RAM per node.
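For illustration, here is a minimal sketch of what FULL_SHARD wrapping looks like in raw PyTorch. The blog post itself drives FSDP through Accelerate's launcher and a config file, so the model name, wrapping policy, and other details below are assumptions for the sketch rather than the team's exact code:

```python
# Minimal sketch of FULL_SHARD wrapping in raw PyTorch (illustrative only).
import functools

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForCausalLM
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Naively loading the full model on every rank is what triggers the CPU RAM
# problem discussed below; the meta-device fix replaces this step.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf", torch_dtype=torch.bfloat16
)

# FULL_SHARD shards parameters, gradients, and optimizer states across ranks:
# parameters are all-gathered for each forward/backward pass and gradients are
# reduce-scattered afterwards. Each decoder layer becomes its own FSDP unit.
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    auto_wrap_policy=functools.partial(
        transformer_auto_wrap_policy, transformer_layer_cls={LlamaDecoderLayer}
    ),
    device_id=torch.cuda.current_device(),
)
```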
Three major obstacles emerged:
- CPU RAM overload during model loading: Loading the full 70B model on every GPU rank would consume ~2 TB of CPU RAM. The fix creates the model on the `meta` device without weights, loads the state dict only on rank 0, and initializes empty parameters on the other ranks. With `sync_module_states=True`, FSDP broadcasts the weights from rank 0 to all ranks before training, reducing peak CPU memory on non-zero ranks to negligible levels (see the first sketch after this list).
- Slow and error-prone intermediate checkpoint saving: Standard `FULL_STATE_DICT` saving with CPU offloading on rank 0 led to NCCL timeout errors. The solution uses sharded state dicts for intermediate checkpoints, which remain FSDP-compatible, and saves the full state dict only once at the end of training (second sketch below).
- Training speed and VRAM optimization: The team employed Flash Attention V2 and recent PyTorch nightly builds to accelerate computation and reduce the memory footprint (third sketch below). The complete code, FSDP config, and SLURM script are publicly available.
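A rough sketch of the low-CPU-memory loading pattern from the first item, assuming a recent Transformers and PyTorch release; the model name and exact arguments are assumptions, and the post wires this up through Accelerate rather than calling FSDP directly:

```python
# Only rank 0 materializes real weights; other ranks build the model on the
# meta device with no storage, and FSDP broadcasts rank 0's weights to them
# via sync_module_states=True.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "meta-llama/Llama-2-70b-hf"  # illustrative

if dist.get_rank() == 0:
    # One copy of the 70B weights in CPU RAM instead of one copy per GPU rank.
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
    param_init_fn = None
else:
    # Parameters created on the meta device allocate no memory.
    config = AutoConfig.from_pretrained(model_name)
    with torch.device("meta"):
        model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.bfloat16)
    # Materialize empty (uninitialized) parameters on the GPU; the real values
    # arrive via the broadcast below.
    param_init_fn = lambda module: module.to_empty(
        device=torch.cuda.current_device(), recurse=False
    )

model = FSDP(
    model,
    sync_module_states=True,   # broadcast rank 0's weights to all other ranks
    param_init_fn=param_init_fn,
    device_id=torch.cuda.current_device(),
)
```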
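A sketch of saving an intermediate checkpoint as a sharded state dict with PyTorch's distributed checkpoint utilities; the post selects the sharded format through the Accelerate FSDP config, and the directory name here is illustrative:

```python
# Each rank writes only its own shard, avoiding the rank-0 gather and CPU
# offload that caused NCCL timeouts with FULL_STATE_DICT.
import torch.distributed.checkpoint as dist_cp
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import StateDictType

# `model` is the FSDP-wrapped model from the previous sketch.
with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
    dist_cp.save_state_dict(
        state_dict={"model": model.state_dict()},
        storage_writer=dist_cp.FileSystemWriter("checkpoints/intermediate"),
    )
```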
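Finally, a sketch of enabling Flash Attention 2 at model load time. This assumes a recent Transformers release that accepts the attn_implementation argument and an installed flash-attn package; the original post may have enabled it differently:

```python
# Load the model with the Flash Attention 2 kernel for faster, lower-memory attention.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```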
By implementing these techniques, the team successfully fine-tuned Llama 2 70B on a code chat assistant dataset, demonstrating practical methods for training large models on limited resources.