DailyGlimpse

Achieving Blazing-Fast BLOOM Inference with DeepSpeed and Accelerate

AI
April 26, 2026 · 5:20 PM

This guide demonstrates how to achieve remarkably low per-token latency when generating text using the 176-billion-parameter BLOOM model. The model requires 352 GB of bfloat16 weights, making 8×80 GB A100 GPUs the most efficient setup. Alternatively, 2×8×40 GB A100s or 2×8×48 GB A6000s can be used.
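The memory figures follow directly from the parameter count; a quick sanity check (parameter count rounded to 176e9):

```python
# bfloat16 stores 2 bytes per parameter; int8 quantization stores 1.
params = 176e9

bf16_gb = params * 2 / 1e9  # decimal gigabytes, matching the 352 GB figure
int8_gb = params * 1 / 1e9

print(bf16_gb)  # 352.0
print(int8_gb)  # 176.0
```

This is also why int8 quantization halves the GPU memory requirement, as discussed below.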

Running inference on a single node typically yields the fastest throughput due to faster intra-node GPU interconnects. However, if you lack such hardware, CPU or NVMe offload allows running BLOOM on smaller GPUs at a much slower speed.

We also cover 8-bit quantized solutions using BitsAndBytes and DeepSpeed-Inference, which halve GPU memory requirements with a slight throughput trade-off.

Benchmarks

All benchmarks were performed on an 8×80 GB A100 node with 512 GB of CPU memory, using greedy generation to produce 100 new tokens. The table below reports milliseconds per generated token (lower is better) for various batch sizes (bs):

Project            bs=1    bs=8    bs=16   bs=32   bs=64   bs=128  bs=256  bs=512
accelerate bf16    230.38  31.78   17.84   10.89   OOM
accelerate int8    286.56  40.92   22.65   13.27   OOM
ds-inference fp16   44.02   5.70    3.01    1.68   1.00    0.69    OOM
ds-inference int8   89.09  11.44    5.88    3.09   1.71    1.02    0.71    OOM
ds-zero bf16       283     34.88   OOM
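These figures are plain wall-clock averages over all generated tokens. A minimal sketch of the measurement (the function name and timing approach are illustrative, not the repo's actual benchmark code):

```python
import time

def ms_per_token(generate_fn, batch, new_tokens=100):
    """Time one greedy generate call and average over all generated tokens."""
    start = time.perf_counter()
    generate_fn(batch, new_tokens)
    elapsed = time.perf_counter() - start
    # Each of the len(batch) sequences produces new_tokens tokens.
    return elapsed * 1000 / (new_tokens * len(batch))

# Toy stand-in for model.generate(), just to exercise the arithmetic:
latency = ms_per_token(lambda batch, n: time.sleep(0.05), ["prompt"] * 8)
print(f"{latency:.4f} ms/token")
```

Averaging over batch_size × new_tokens is what makes the large-batch numbers so small: the fixed cost of one forward pass is amortized across many streams.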

DeepSpeed-Inference achieves sub-millisecond per-token latency at large batch sizes by combining tensor parallelism with custom fused CUDA kernels. Accelerate uses naive pipeline parallelism: it works out of the box with any model, but only one GPU computes at a time. DeepSpeed-ZeRO shards the model across GPUs and runs a separate generate stream on each, so its reported per-token times can be divided by the number of GPUs to get the effective aggregate throughput.
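The tensor-parallel idea behind DeepSpeed-Inference can be seen in a toy matmul: each GPU holds a slice of a weight matrix and all slices compute simultaneously. Here NumPy arrays stand in for GPU shards; this illustrates the math only, not DeepSpeed's kernels:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))      # activations
W = rng.standard_normal((8, 16))     # full weight matrix

# Column-parallel split across 2 "GPUs": each rank holds half the columns
# and computes its slice; concatenating the slices recovers the full result.
shards = np.split(W, 2, axis=1)
partial = [x @ w for w in shards]
assert np.allclose(np.concatenate(partial, axis=1), x @ W)
print("tensor-parallel matmul matches the full matmul")
```

Because every rank works on every token, there is no idle GPU time, unlike pipeline parallelism.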

For quantized int8 models, the following results were obtained on 4×80 GB A100s:

Project            bs=1    bs=8    bs=16   bs=32   bs=64   bs=128
accelerate int8    284.15  40.14   21.97   OOM
ds-inference int8  156.51  20.11   10.38   5.50    2.96    OOM

Solutions

Clone the demo repository:

git clone https://github.com/huggingface/transformers-bloom-inference
cd transformers-bloom-inference

Three scripts are used, presented alphabetically by framework.

HuggingFace Accelerate

Accelerate handles large models by instantiating them with empty weights, analyzing layer sizes, and placing layers on available devices. It then transfers inputs/outputs as needed. While only one GPU works at a time, throughput is still decent. The same code runs on any setup, with CPU/disk offload available but slower.
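The placement step can be sketched in plain Python. This hypothetical greedy_device_map only illustrates the idea behind Accelerate's device_map="auto"; it is not Accelerate's actual algorithm:

```python
def greedy_device_map(layer_sizes_gb, device_budgets_gb):
    """Assign layers to devices in order, moving to the next device when
    the current one's memory budget is exhausted (spilling to 'cpu' last)."""
    devices = list(device_budgets_gb.items())  # [(name, budget_gb), ...]
    placement, d, used = {}, 0, 0.0
    for i, size in enumerate(layer_sizes_gb):
        while d < len(devices) and used + size > devices[d][1]:
            d, used = d + 1, 0.0
        placement[i] = devices[d][0] if d < len(devices) else "cpu"
        if d < len(devices):
            used += size
    return placement

# BLOOM has 70 transformer blocks, roughly 5 GB each in bf16 (352 GB / 70);
# with 8 GPUs of 80 GB every layer fits on a GPU (illustrative numbers).
pm = greedy_device_map([5.0] * 70, {f"cuda:{i}": 80 for i in range(8)})
print(pm[0], pm[69])  # cuda:0 cuda:4
```

If the GPU budgets were smaller, trailing layers would land on "cpu", which is exactly the offload case: correct, but much slower because weights must be copied in for every forward pass.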

Setup:

pip install "transformers>=4.21.3" "accelerate>=0.12.0"

Run:

python bloom-inference-scripts/bloom-accelerate-inference.py --name bigscience/bloom --batch_size 1 --benchmark

To enable 8-bit quantization from BitsAndBytes, pass --dtype int8:

python bloom-inference-scripts/bloom-accelerate-inference.py --name bigscience/bloom --dtype int8 --batch_size 1 --benchmark