This guide demonstrates how to achieve remarkably low per-token latency when generating text with the 176-billion-parameter BLOOM model. The weights alone take 352 GB in bfloat16 (176B parameters × 2 bytes), so the most efficient setup is a single node with 8×80 GB A100 GPUs. Alternatively, 2×8×40 GB A100s or 2×8×48 GB A6000s can be used.
Running inference on a single node typically yields the fastest throughput due to faster intra-node GPU interconnects. However, if you lack such hardware, CPU or NVMe offload allows running BLOOM on smaller GPUs at a much slower speed.
We also cover 8-bit quantized solutions using BitsAndBytes and DeepSpeed-Inference, which halve GPU memory requirements with a slight throughput trade-off.
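As a quick sanity check on these figures, the weight footprint follows directly from the parameter count and the bytes per parameter (a rough estimate that ignores activations, the KV cache, and framework overhead):

```python
# Back-of-the-envelope weight footprint for BLOOM-176B (activations, KV cache
# and framework overhead are not included).
n_params = 176e9
for dtype, bytes_per_param in [("bf16", 2), ("int8", 1)]:
    gb = n_params * bytes_per_param / 1e9
    print(f"{dtype}: {gb:.0f} GB of weights ({gb / 80:.1f} x 80GB A100s just to hold them)")

# bf16: 352 GB of weights (4.4 x 80GB A100s just to hold them)
# int8: 176 GB of weights (2.2 x 80GB A100s just to hold them)
```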
Benchmarks
All benchmarks were performed on an 8×80 GB A100 node with 512 GB of CPU memory, using greedy generation of 100 new tokens. The table below shows per-token generation time in milliseconds (total generation time divided by batch size × number of new tokens) for various batch sizes (bs); lower is better:
| Project | bs=1 | bs=8 | bs=16 | bs=32 | bs=64 | bs=128 | bs=256 | bs=512 |
|---|---|---|---|---|---|---|---|---|
| accelerate bf16 | 230.38 | 31.78 | 17.84 | 10.89 | OOM | | | |
| accelerate int8 | 286.56 | 40.92 | 22.65 | 13.27 | OOM | | | |
| ds-inference fp16 | 44.02 | 5.70 | 3.01 | 1.68 | 1.00 | 0.69 | OOM | |
| ds-inference int8 | 89.09 | 11.44 | 5.88 | 3.09 | 1.71 | 1.02 | 0.71 | OOM |
| ds-zero bf16 | 283 | 34.88 | OOM | | | | | |
At large batch sizes, DeepSpeed-Inference reaches sub-millisecond per-token latency thanks to tensor parallelism and custom fused CUDA kernels. Accelerate relies on naive pipeline parallelism, which is slower, but it works out of the box with any model. DeepSpeed-ZeRO can process a separate generate stream on each GPU in parallel, so its reported per-token times can effectively be divided by the number of GPUs used.
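For reference, per-token figures like those in the tables can be obtained by timing a greedy generate() call and spreading the wall time over every token produced in the batch. A minimal sketch, where per_token_ms is a hypothetical helper and the model and tokenizer are assumed to be loaded already:

```python
import time
import torch

# Hypothetical helper: per-token time = total wall time / (batch_size * new_tokens).
def per_token_ms(model, tokenizer, prompts, new_tokens=100):
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda:0")
    torch.cuda.synchronize()
    start = time.time()
    model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)  # greedy
    torch.cuda.synchronize()
    return (time.time() - start) * 1000 / (len(prompts) * new_tokens)
```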
For quantized int8 models, the following results were obtained on 4×80 GB A100s:
| Project | bs=1 | bs=8 | bs=16 | bs=32 | bs=64 | bs=128 |
|---|---|---|---|---|---|---|
| accelerate int8 | 284.15 | 40.14 | 21.97 | OOM | | |
| ds-inference int8 | 156.51 | 20.11 | 10.38 | 5.50 | 2.96 | OOM |
Solutions
Clone the demo repository:
git clone https://github.com/huggingface/transformers-bloom-inference
cd transformers-bloom-inference
Three scripts are used, one per solution, presented in alphabetical order by framework: HuggingFace Accelerate, DeepSpeed-Inference, and DeepSpeed-ZeRO.
HuggingFace Accelerate
Accelerate handles large models by instantiating them with empty weights, analyzing the size of each layer and the memory available on each device, and loading the checkpoint piece by piece onto the chosen devices. During generation it then transfers each layer's inputs and outputs to the right device as needed. Since this is naive pipeline parallelism, only one GPU works at any given time, yet throughput is still decent. The same code runs unchanged on any setup, and CPU or disk offload is also available, though much slower.
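As an illustration (a sketch of this loading path, not the demo script itself), a model can be spread across the available GPUs with device_map="auto"; the prompt and generation settings below are arbitrary:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# device_map="auto" lets Accelerate analyze layer sizes and place each layer on
# an available device; bfloat16 keeps the weights at 2 bytes per parameter.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)  # greedy
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```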
Setup:
pip install "transformers>=4.21.3" "accelerate>=0.12.0"
Run:
python bloom-inference-scripts/bloom-accelerate-inference.py --name bigscience/bloom --batch_size 1 --benchmark
To enable the 8-bit quantized solution from BitsAndBytes, first install bitsandbytes (pip install bitsandbytes) and then add --dtype int8 to the command above.
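The 8-bit path relies on the bitsandbytes integration in transformers. A minimal sketch of the corresponding loading call, assuming a transformers version that supports the load_in_8bit argument (the rest matches the bf16 sketch above):

```python
from transformers import AutoModelForCausalLM

# load_in_8bit=True quantizes the linear layers to int8 at load time,
# roughly halving GPU memory (about 176 GB instead of 352 GB of weights).
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    device_map="auto",
    load_in_8bit=True,
)
```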