Training ever-larger language models has become routine, but the engineering know-how behind them often stays in the shadows. This article pulls back the curtain on the hardware, software, and human collaboration that made BLOOM—a 176-billion-parameter multilingual model—possible.
A Quick Snapshot
| Component | Details |
|---|---|
| Hardware | 384 NVIDIA A100 80GB GPUs (48 nodes) |
| Software | Megatron-DeepSpeed (3D parallelism) |
| Architecture | GPT-3 with enhancements |
| Dataset | 350B tokens across 46 natural languages and 13 programming languages |
| Training Time | 3.5 months (Mar–Jul 2022) |
The People Behind the Machine
The project was spearheaded by Hugging Face co-founder Thomas Wolf, who aimed to prove that a small team could train a world-class multilingual model and openly release it. The engineering success rested on contributions from:
- Hugging Face BigScience team – dedicated half a dozen full-time staff to orchestrate training and infrastructure.
- Microsoft DeepSpeed team – integrated DeepSpeed with Megatron-LM and provided hands-on support.
- NVIDIA Megatron-LM team – developed the Megatron-LM framework and offered expert guidance.
- IDRIS / GENCI team – donated compute on the Jean Zay supercomputer and provided system administration.
- PyTorch team – fixed bugs and improved components critical to training.
- BigScience Engineering volunteers – contributed countless hours.
Key individual contributors include Olatunji Ruwase, Deepak Narayanan, Jeff Rasley, Jared Casper, Samyam Rajbhandari, and Rémi Lacroix.
Hardware Setup
The model was trained on the Jean Zay supercomputer (France), using:
- GPUs: 384 NVIDIA A100 80GB (48 nodes) with 32 spare GPUs
- Node: 8 GPUs per node, NVLink 4 inter-GPU connects, 4 Omni-Path links (a quick per-node sanity check follows this list)
- CPU: AMD EPYC 7543 32-core
- Memory: 512GB CPU RAM, 640GB GPU RAM per node
- Interconnect: Omni-Path Architecture (non-blocking fat tree)
- NCCL network: fully dedicated subnet for GPU communication
- Storage: GPFS shared with other users
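As a rough illustration of the per-node layout, here is a minimal sanity-check sketch in PyTorch; it simply reports whatever node it runs on, and the 8 × 80GB figures it should roughly match are the Jean Zay node spec from the list above, not something the code enforces:
```python
import torch

# Minimal sketch: report the GPU count and aggregate GPU memory of the
# current node. On a BLOOM-style Jean Zay node this should show
# 8 devices and roughly 640 GiB in total (8 x A100 80GB).
num_gpus = torch.cuda.device_count()
total_mem_gib = sum(
    torch.cuda.get_device_properties(i).total_memory for i in range(num_gpus)
) / 2**30
print(f"GPUs on this node: {num_gpus}, total GPU memory: {total_mem_gib:.0f} GiB")
```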
Checkpoints were massive: each full checkpoint (bf16 weights plus fp32 optimizer states) was 2.3TB, and the bf16 weights alone occupied 329GB.
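Those sizes follow directly from the parameter count. A back-of-the-envelope sketch, assuming ~176B parameters, 2 bytes per bf16 weight, and 12 extra bytes per parameter for an fp32 master copy plus two Adam optimizer states (assumed bookkeeping, not read from the BLOOM codebase):
```python
# Back-of-the-envelope checkpoint sizes for a ~176B-parameter model.
# Assumed layout: bf16 weights (2 bytes/param), plus an fp32 master copy
# (4 bytes/param) and two Adam states in fp32 (2 x 4 bytes/param).
params = 176e9

bf16_weights = params * 2                    # bf16 weights only
full_checkpoint = params * (2 + 4 + 4 + 4)   # weights + fp32 copy + Adam m, v

GiB, TiB = 2**30, 2**40
print(f"bf16 weights:    {bf16_weights / GiB:.0f} GiB")     # ~328 GiB
print(f"full checkpoint: {full_checkpoint / TiB:.2f} TiB")  # ~2.2 TiB
```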
Dataset
BLOOM was trained on the BigScience corpus, a 1.6TB composite multilingual dataset. After deduplication and cleaning, this amounted to 350 billion unique tokens spanning 46 natural languages and 13 programming languages. The model's vocabulary size is 250,680 tokens.
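Since the tokenizer and model were released publicly, the vocabulary figure is easy to verify yourself; this sketch assumes the `transformers` library and the `bigscience/bloom` checkpoint on the Hugging Face Hub:
```python
from transformers import AutoTokenizer

# Load the published BLOOM tokenizer and inspect its vocabulary.
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
print(tokenizer.vocab_size)  # should match the ~250k figure reported above

# The multilingual byte-level BPE vocabulary covers all training languages,
# so non-English text tokenizes without unknown tokens.
print(tokenizer.tokenize("Bonjour le monde"))
```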
The Engine: Megatron-DeepSpeed
The training stack combined two powerhouse libraries:
- DeepSpeed: Microsoft's optimization library, providing ZeRO data parallelism and pipeline parallelism (a minimal configuration sketch follows this list).
- Megatron-LM: NVIDIA’s large transformer framework, contributing tensor parallelism, fused CUDA kernels, and the data loader.
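To give a feel for how the DeepSpeed side is driven, here is an illustrative configuration of the kind DeepSpeed reads (normally as a JSON file passed on the command line), shown as a Python dict; the specific values are assumptions for illustration, not the production BLOOM settings:
```python
# Illustrative DeepSpeed configuration (assumed values, not the exact
# BigScience production config).
ds_config = {
    "train_micro_batch_size_per_gpu": 2,  # micro-batch per GPU (assumed)
    "gradient_accumulation_steps": 128,   # accumulation toward the global batch (assumed)
    "bf16": {"enabled": True},            # train in bfloat16
    "zero_optimization": {"stage": 1},    # ZeRO stage 1: shard optimizer states across DP ranks
}
```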
This union enabled 3D parallelism:
- Data Parallelism (DP): Replicate the model on multiple GPUs, each processing a different slice of data, synchronizing gradients after each step.
- Tensor Parallelism (TP): Each tensor is split into shards that live on different GPUs; the shards are processed separately and in parallel, and the partial results are synchronized at the end of the step.
- Pipeline Parallelism (PP): Layer groups are distributed across GPUs; each GPU handles a segment of the model, processing micro-batches in a pipeline.
Together, these techniques allowed training a 176B-parameter model across 384 GPUs with high efficiency. The Megatron-DeepSpeed fork used by BigScience included several custom additions to tailor the setup for BLOOM.
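To make the numbers concrete, here is a small accounting sketch of how the three dimensions can factor a 384-GPU cluster; the TP/PP/DP split and batch sizes shown are illustrative assumptions rather than the exact production values:
```python
# Illustrative 3D-parallelism accounting for a 384-GPU cluster.
# The split (TP=4, PP=12, DP=8) and the batch sizes are assumptions for
# illustration; any factorization with TP * PP * DP == world size works.
world_size = 384             # 48 nodes x 8 GPUs
tp = 4                       # tensor-parallel degree (GPUs sharing one layer's tensors)
pp = 12                      # pipeline stages (consecutive groups of layers)
dp = world_size // (tp * pp) # data-parallel replicas -> 8
assert tp * pp * dp == world_size

micro_batch = 2              # samples per GPU per pipeline micro-batch (assumed)
grad_accum = 128             # micro-batches accumulated per optimizer step (assumed)
global_batch = micro_batch * grad_accum * dp  # -> 2048 samples per step

print(f"TP={tp}, PP={pp}, DP={dp}, global batch={global_batch}")
```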
Why It Matters
BLOOM demonstrated that open, collaborative AI research can rival corporate efforts—both in scale and transparency. By sharing not just the model but the engineering blueprint, the project empowers the broader AI community to build, study, and improve large language models.