Large language models are growing rapidly, with models like PaLM (540B parameters) and BLOOM (176B parameters) pushing the limits of available hardware. Running these models is expensive—for instance, inference on BLOOM-176B requires eight 80GB A100 GPUs (each around $15,000), and fine-tuning demands 72 such GPUs. To make these models more accessible, researchers are exploring methods like quantization and distillation that reduce memory usage without sacrificing performance.
This article introduces an 8-bit quantization technique called LLM.int8(), developed in collaboration between Hugging Face and BigScience. The method allows large models to run with half the memory while maintaining predictive accuracy. It is now fully integrated into the Hugging Face transformers library.
Understanding Model Memory Usage
The memory footprint of a model is determined by the number of parameters and the precision they are stored in. Common data types include the following (a short sketch after the list shows how they behave in practice):
- Float32 (FP32): 32 bits (4 bytes), the standard full-precision format.
- Float16 (FP16): 16 bits (2 bytes), half-precision with limited range, prone to overflow/underflow.
- Bfloat16 (BF16): 16 bits (2 bytes), retains FP32’s dynamic range but with less precision.
- Int8: 8 bits (1 byte), can store 256 discrete values.
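To make the trade-offs concrete, here is a minimal PyTorch sketch (assuming only that torch is installed) contrasting FP16's narrow range, BF16's wider range, and Int8's 256 discrete values:

```python
import torch

# FP16 overflows quickly: its largest representable value is 65504.
x = torch.tensor([60000.0], dtype=torch.float16)
print(x * 2)  # tensor([inf], dtype=torch.float16)

# BF16 keeps FP32's 8 exponent bits, so the same multiplication survives,
# but its 7 mantissa bits mean coarser rounding (less precision).
y = torch.tensor([60000.0], dtype=torch.bfloat16)
print(y * 2)  # ~120000, no overflow

# Int8 stores 256 discrete values: the signed range [-128, 127].
info = torch.iinfo(torch.int8)
print(info.min, info.max)  # -128 127
```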
During training, mixed precision is often used: weights are stored in FP32, while forward/backward passes run in FP16/BF16 for speed. For inference, half-precision weights often suffice, halving memory needs. For example, BLOOM-176B in FP32 requires ~704 GB, but in BF16 it drops to ~352 GB—still too large for most setups.
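The arithmetic behind those numbers is simply bytes per parameter times parameter count, as a quick back-of-the-envelope check shows:

```python
# Weight memory for BLOOM-176B at different precisions
# (weights only, ignoring activations and framework overhead).
params = 176e9
for name, bytes_per_param in [("FP32", 4), ("FP16/BF16", 2), ("Int8", 1)]:
    print(f"{name}: {params * bytes_per_param / 1e9:.0f} GB")
# FP32: 704 GB, FP16/BF16: 352 GB, Int8: 176 GB
```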
8-Bit Quantization
Quantization maps values from a higher-precision range to a lower-precision one, reducing memory by up to 4x (FP32 to Int8). However, naive quantization causes significant information loss. Two common 8-bit methods, sketched in code after this list, are:
- Zero-point quantization: Scales and shifts values to fit the target range.
- Absolute maximum (absmax) quantization: Normalizes by the absolute maximum value.
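Both schemes fit in a few lines of PyTorch. This is an illustrative sketch; the helper names are made up and not part of any library:

```python
import torch

def absmax_quantize(x: torch.Tensor):
    # Symmetric: map the largest-magnitude value to 127.
    scale = 127.0 / x.abs().max()
    q = (x * scale).round().clamp(-128, 127).to(torch.int8)
    return q, scale

def zeropoint_quantize(x: torch.Tensor):
    # Asymmetric: stretch the observed range [min, max] over all
    # 256 int8 values, then shift so that min maps to -128.
    scale = 255.0 / (x.max() - x.min())
    zeropoint = (-scale * x.min() - 128).round()
    q = (x * scale + zeropoint).round().clamp(-128, 127).to(torch.int8)
    return q, scale, zeropoint

x = torch.randn(8)
q, s = absmax_quantize(x)
print((q.float() / s - x).abs().max())         # small rounding error

q, s, zp = zeropoint_quantize(x)
print(((q.float() - zp) / s - x).abs().max())  # small rounding error
```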
Our LLM.int8() method builds on absmax quantization, applied vector-wise across the rows and columns of a matrix multiplication, together with a mixed-precision decomposition to preserve model quality. The key innovation is handling outlier features: rare but large-magnitude activation dimensions that emerge systematically once models reach roughly the 6B-parameter scale. These outlier dimensions are kept in FP16 while the remaining values are multiplied in Int8, so performance degrades negligibly even for 176B-parameter models.
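As a toy illustration of the decomposition idea (using simplified per-tensor scaling rather than the vector-wise scheme the real bitsandbytes kernels use; the function name and threshold value are invented for this sketch):

```python
import torch

def mixed_precision_matmul(X, W, threshold=6.0):
    # Columns of X containing any activation above `threshold` in
    # magnitude are treated as outlier features and kept unquantized
    # (they run in FP16 in the real kernel).
    outliers = (X.abs() > threshold).any(dim=0)
    out_hi = X[:, outliers] @ W[outliers, :]

    # Everything else goes through an int8 absmax matmul and is
    # rescaled back to floating point afterwards.
    Xr, Wr = X[:, ~outliers], W[~outliers, :]
    sx, sw = 127.0 / Xr.abs().max(), 127.0 / Wr.abs().max()
    Xq = (Xr * sx).round().clamp(-128, 127).to(torch.int8)
    Wq = (Wr * sw).round().clamp(-128, 127).to(torch.int8)
    out_int8 = (Xq.to(torch.int32) @ Wq.to(torch.int32)).float() / (sx * sw)

    return out_hi + out_int8

X = torch.randn(4, 8)
X[0, 3] = 20.0  # inject one outlier feature
W = torch.randn(8, 16)
print((X @ W - mixed_precision_matmul(X, W)).abs().max())  # small error
```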
Integration into Transformers
The integration was challenging: it had to work seamlessly across hundreds of model architectures. With accelerate and bitsandbytes, users can now load any Hugging Face model in 8-bit by passing a single extra argument:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom", device_map="auto", load_in_8bit=True)
This halves GPU memory relative to FP16, enabling users to run BLOOM-176B on four 80GB A100 GPUs instead of eight.
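The same flag works for any checkpoint. For a quick local test, the small bloom-560m checkpoint can stand in for the 176B model (this sketch assumes a CUDA GPU with bitsandbytes installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "bigscience/bloom-560m"  # small stand-in for the full model
model = AutoModelForCausalLM.from_pretrained(
    name, device_map="auto", load_in_8bit=True
)
tokenizer = AutoTokenizer.from_pretrained(name)

# Linear-layer weights are now int8: roughly half the FP16 footprint.
print(f"{model.get_memory_footprint() / 1e9:.2f} GB")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0]))
```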
Conclusion
8-bit quantization makes large transformers more accessible without sacrificing accuracy. The Hugging Face integration simplifies deployment, paving the way for broader use of massive language models. Future work aims to reduce memory further while maintaining performance.
This article is based on the paper "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale".