This guide demonstrates how to achieve remarkably low per-token latency when generating text with the 176-billion-parameter BLOOM model. The weights alone take 352 GB in bfloat16 (176B parameters × 2 bytes), so the most efficient setup is a single node with 8×80 GB A100 GPUs. Alternatively, 2×8×40 GB A100s or 2×8×48 GB A6000s can be used.
Running inference on a single node typically yields the fastest throughput due to faster intra-node GPU interconnects. However, if you lack such hardware, CPU or NVMe offload allows running BLOOM on smaller GPUs at a much slower speed.
We also cover 8-bit quantized solutions using BitsAndBytes and DeepSpeed-Inference, which halve GPU memory requirements with a slight throughput trade-off.
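As a quick sanity check on these figures, the weight footprint follows directly from the parameter count and the bytes per parameter (a rough estimate that ignores activations, the KV cache, and framework overhead):

```python
# Back-of-the-envelope weight footprint for BLOOM-176B (activations, KV cache
# and framework overhead are not included).
n_params = 176e9
for dtype, bytes_per_param in [("bf16", 2), ("int8", 1)]:
    gb = n_params * bytes_per_param / 1e9
    print(f"{dtype}: {gb:.0f} GB of weights ({gb / 80:.1f} x 80GB A100s just to hold them)")

# bf16: 352 GB of weights (4.4 x 80GB A100s just to hold them)
# int8: 176 GB of weights (2.2 x 80GB A100s just to hold them)
```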
Benchmarks
All benchmarks were performed on an 8×80 GB A100 node with 512 GB of CPU memory, using greedy generation of 100 new tokens. The table below shows per-token generation time in milliseconds (total generation time divided by batch size × number of new tokens) for various batch sizes (bs); lower is better:
| Project | bs=1 | bs=8 | bs=16 | bs=32 | bs=64 | bs=128 | bs=256 | bs=512 |
|---|---|---|---|---|---|---|---|---|
| accelerate bf16 | 230.38 | 31.78 | 17.84 | 10.89 | OOM | | | |
| accelerate int8 | 286.56 | 40.92 | 22.65 | 13.27 | OOM | | | |
| ds-inference fp16 | 44.02 | 5.70 | 3.01 | 1.68 | 1.00 | 0.69 | OOM | |
| ds-inference int8 | 89.09 | 11.44 | 5.88 | 3.09 | 1.71 | 1.02 | 0.71 | OOM |
| ds-zero bf16 | 283 | 34.88 | OOM | | | | | |
At large batch sizes, DeepSpeed-Inference reaches sub-millisecond per-token latency thanks to tensor parallelism and custom fused CUDA kernels. Accelerate relies on naive pipeline parallelism, which is slower, but it works out of the box with any model. DeepSpeed-ZeRO can process a separate generate stream on each GPU in parallel, so its reported per-token times can effectively be divided by the number of GPUs used.
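For reference, per-token figures like those in the tables can be obtained by timing a greedy generate() call and spreading the wall time over every token produced in the batch. A minimal sketch, where per_token_ms is a hypothetical helper and the model and tokenizer are assumed to be loaded already:

```python
import time
import torch

# Hypothetical helper: per-token time = total wall time / (batch_size * new_tokens).
def per_token_ms(model, tokenizer, prompts, new_tokens=100):
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda:0")
    torch.cuda.synchronize()
    start = time.time()
    model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)  # greedy
    torch.cuda.synchronize()
    return (time.time() - start) * 1000 / (len(prompts) * new_tokens)
```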
For quantized int8 models, the following results were obtained on 4×80 GB A100s:
| Project | bs=1 | bs=8 | bs=16 | bs=32 | bs=64 | bs=128 |
|---|---|---|---|---|---|---|
| accelerate int8 | 284.15 | 40.14 | 21.97 | OOM | | |
| ds-inference int8 | 156.51 | 20.11 | 10.38 | 5.50 | 2.96 | OOM |
Solutions
Clone the demo repository:
git clone https://github.com/huggingface/transformers-bloom-inference
cd transformers-bloom-inference
Three scripts are used, one per solution, presented in alphabetical order by framework: HuggingFace Accelerate, DeepSpeed-Inference, and DeepSpeed-ZeRO.
HuggingFace Accelerate
Accelerate handles large models by instantiating them with empty weights, analyzing the size of each layer and the memory available on each device, and loading the checkpoint piece by piece onto the chosen devices. During generation it then transfers each layer's inputs and outputs to the right device as needed. Since this is naive pipeline parallelism, only one GPU works at any given time, yet throughput is still decent. The same code runs unchanged on any setup, and CPU or disk offload is also available, though much slower.
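As an illustration (a sketch of this loading path, not the demo script itself), a model can be spread across the available GPUs with device_map="auto"; the prompt and generation settings below are arbitrary:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# device_map="auto" lets Accelerate analyze layer sizes and place each layer on
# an available device; bfloat16 keeps the weights at 2 bytes per parameter.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)  # greedy
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```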
Setup:
pip install "transformers>=4.21.3" "accelerate>=0.12.0"
Run:
python bloom-inference-scripts/bloom-accelerate-inference.py --name bigscience/bloom --batch_size 1 --benchmark
To enable the 8-bit quantized solution from BitsAndBytes, first install bitsandbytes (pip install bitsandbytes) and then add --dtype int8 to the command above.
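The 8-bit path relies on the bitsandbytes integration in transformers. A minimal sketch of the corresponding loading call, assuming a transformers version that supports the load_in_8bit argument (the rest matches the bf16 sketch above):

```python
from transformers import AutoModelForCausalLM

# load_in_8bit=True quantizes the linear layers to int8 at load time,
# roughly halving GPU memory (about 176 GB instead of 352 GB of weights).
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    device_map="auto",
    load_in_8bit=True,
)
```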