Training ever-larger language models has become routine, but the engineering know-how behind them often stays in the shadows. This article pulls back the curtain on the hardware, software, and human collaboration that made BLOOM—a 176-billion-parameter multilingual model—possible.
A Quick Snapshot
| Component | Details |
|---|---|
| Hardware | 384 NVIDIA A100 80GB GPUs (48 nodes) |
| Software | Megatron-DeepSpeed (3D parallelism) |
| Architecture | GPT-3 with enhancements |
| Dataset | 350B tokens across 46 natural languages and 13 programming languages |
| Training Time | 3.5 months (Mar–Jul 2022) |
The People Behind the Machine
The project was spearheaded by Hugging Face co-founder Thomas Wolf, who aimed to prove that a small team could train a world-class multilingual model and openly release it. The engineering success rested on contributions from:
- Hugging Face BigScience team – dedicated half a dozen full-time staff to orchestrate training and infrastructure.
- Microsoft DeepSpeed team – integrated DeepSpeed with Megatron-LM and provided hands-on support.
- NVIDIA Megatron-LM team – developed the Megatron-LM framework and offered expert guidance.
- IDRIS / GENCI team – donated compute on the Jean Zay supercomputer and provided system administration.
- PyTorch team – fixed bugs and improved components critical to training.
- BigScience Engineering volunteers – contributed countless hours.
Key individual contributors include Olatunji Ruwase, Deepak Narayanan, Jeff Rasley, Jared Casper, Samyam Rajbhandari, and Rémi Lacroix.
Hardware Setup
The model was trained on the Jean Zay supercomputer (France), using:
- GPUs: 384 NVIDIA A100 80GB (48 nodes) with 32 spare GPUs
- Node: 8 GPUs per node, NVLink 4 inter-GPU connects, 4 Omni-Path links (a quick per-node sanity check follows this list)
- CPU: AMD EPYC 7543 32-core
- Memory: 512GB CPU RAM, 640GB GPU RAM per node
- Interconnect: Omni-Path Architecture (non-blocking fat tree)
- NCCL network: fully dedicated subnet for GPU communication
- Storage: GPFS shared with other users
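As a rough illustration of the per-node layout, here is a minimal sanity-check sketch in PyTorch; it simply reports whatever node it runs on, and the 8 × 80GB figures it should roughly match are the Jean Zay node spec from the list above, not something the code enforces:
```python
import torch

# Minimal sketch: report the GPU count and aggregate GPU memory of the
# current node. On a BLOOM-style Jean Zay node this should show
# 8 devices and roughly 640 GiB in total (8 x A100 80GB).
num_gpus = torch.cuda.device_count()
total_mem_gib = sum(
    torch.cuda.get_device_properties(i).total_memory for i in range(num_gpus)
) / 2**30
print(f"GPUs on this node: {num_gpus}, total GPU memory: {total_mem_gib:.0f} GiB")
```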
Checkpoints were massive: each full checkpoint (bf16 weights plus fp32 optimizer states) was 2.3TB, and the bf16 weights alone occupied 329GB.
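Those sizes follow directly from the parameter count. A back-of-the-envelope sketch, assuming ~176B parameters, 2 bytes per bf16 weight, and 12 extra bytes per parameter for an fp32 master copy plus two Adam optimizer states (assumed bookkeeping, not read from the BLOOM codebase):
```python
# Back-of-the-envelope checkpoint sizes for a ~176B-parameter model.
# Assumed layout: bf16 weights (2 bytes/param), plus an fp32 master copy
# (4 bytes/param) and two Adam states in fp32 (2 x 4 bytes/param).
params = 176e9

bf16_weights = params * 2                    # bf16 weights only
full_checkpoint = params * (2 + 4 + 4 + 4)   # weights + fp32 copy + Adam m, v

GiB, TiB = 2**30, 2**40
print(f"bf16 weights:    {bf16_weights / GiB:.0f} GiB")     # ~328 GiB
print(f"full checkpoint: {full_checkpoint / TiB:.2f} TiB")  # ~2.2 TiB
```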
Dataset
BLOOM was trained on the BigScience corpus, a 1.6TB composite multilingual dataset. After deduplication and cleaning, this amounted to 350 billion unique tokens spanning 46 natural languages and 13 programming languages. The model's vocabulary size is 250,680 tokens.
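Since the tokenizer and model were released publicly, the vocabulary figure is easy to verify yourself; this sketch assumes the `transformers` library and the `bigscience/bloom` checkpoint on the Hugging Face Hub:
```python
from transformers import AutoTokenizer

# Load the published BLOOM tokenizer and inspect its vocabulary.
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
print(tokenizer.vocab_size)  # should match the ~250k figure reported above

# The multilingual byte-level BPE vocabulary covers all training languages,
# so non-English text tokenizes without unknown tokens.
print(tokenizer.tokenize("Bonjour le monde"))
```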
The Engine: Megatron-DeepSpeed
The training stack combined two powerhouse libraries:
- DeepSpeed: Microsoft's optimization library, providing ZeRO data parallelism and pipeline parallelism (a minimal configuration sketch follows this list).
- Megatron-LM: NVIDIA’s large transformer framework, contributing tensor parallelism, fused CUDA kernels, and the data loader.
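To give a feel for how the DeepSpeed side is driven, here is an illustrative configuration of the kind DeepSpeed reads (normally as a JSON file passed on the command line), shown as a Python dict; the specific values are assumptions for illustration, not the production BLOOM settings:
```python
# Illustrative DeepSpeed configuration (assumed values, not the exact
# BigScience production config).
ds_config = {
    "train_micro_batch_size_per_gpu": 2,  # micro-batch per GPU (assumed)
    "gradient_accumulation_steps": 128,   # accumulation toward the global batch (assumed)
    "bf16": {"enabled": True},            # train in bfloat16
    "zero_optimization": {"stage": 1},    # ZeRO stage 1: shard optimizer states across DP ranks
}
```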
This union enabled 3D parallelism:
- Data Parallelism (DP): Replicate the model on multiple GPUs, each processing a different slice of data, synchronizing gradients after each step.
- Tensor Parallelism (TP): Each tensor is split into shards that live on different GPUs; the shards are processed separately and in parallel, and the partial results are synchronized at the end of the step.
- Pipeline Parallelism (PP): Layer groups are distributed across GPUs; each GPU handles a segment of the model, processing micro-batches in a pipeline.
Together, these techniques allowed training a 176B-parameter model across 384 GPUs with high efficiency. The Megatron-DeepSpeed fork used by BigScience included several custom additions to tailor the setup for BLOOM.
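To make the numbers concrete, here is a small accounting sketch of how the three dimensions can factor a 384-GPU cluster; the TP/PP/DP split and batch sizes shown are illustrative assumptions rather than the exact production values:
```python
# Illustrative 3D-parallelism accounting for a 384-GPU cluster.
# The split (TP=4, PP=12, DP=8) and the batch sizes are assumptions for
# illustration; any factorization with TP * PP * DP == world size works.
world_size = 384             # 48 nodes x 8 GPUs
tp = 4                       # tensor-parallel degree (GPUs sharing one layer's tensors)
pp = 12                      # pipeline stages (consecutive groups of layers)
dp = world_size // (tp * pp) # data-parallel replicas -> 8
assert tp * pp * dp == world_size

micro_batch = 2              # samples per GPU per pipeline micro-batch (assumed)
grad_accum = 128             # micro-batches accumulated per optimizer step (assumed)
global_batch = micro_batch * grad_accum * dp  # -> 2048 samples per step

print(f"TP={tp}, PP={pp}, DP={dp}, global batch={global_batch}")
```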
Why It Matters
BLOOM demonstrated that open, collaborative AI research can rival corporate efforts—both in scale and transparency. By sharing not just the model but the engineering blueprint, the project empowers the broader AI community to build, study, and improve large language models.