Hugging Face has published an overview comparing the two natively supported quantization schemes in its Transformers library: bitsandbytes and auto-gptq. The article aims to help users decide which method best suits their needs for running large models on smaller devices or fine-tuning adapters on quantized models.
Key Differences
bitsandbytes is praised for its ease of use: it requires no calibration data and works out of the box with any PyTorch model that contains torch.nn.Linear layers. It also offers cross-modality interoperability, enabling quantization of models such as Whisper, ViT, and Blip2. Additionally, adapters trained on a quantized model can be merged into the base model with no performance degradation.
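As a minimal sketch of the on-the-fly workflow (the model ID is only a placeholder), loading a model in 4-bit with bitsandbytes through Transformers requires nothing beyond a quantization config:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization is applied on the fly at load time; no calibration data is needed.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",  # example placeholder; any causal LM with nn.Linear layers works
    quantization_config=bnb_config,
    device_map="auto",
)
```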
auto-gptq excels at text-generation speed and supports quantization down to 2 bits, though quality degrades below 4 bits, so 4-bit quantization is recommended. GPTQ models are easily serializable and work on AMD GPUs. However, the method requires a calibration dataset and currently supports only language models.
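A rough sketch of GPTQ quantization through Transformers, assuming the same placeholder model ID; "c4" is one of the named calibration datasets accepted by GPTQConfig, and the one-off quantization pass runs at load time:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-350m"  # example placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Unlike bitsandbytes, GPTQ needs a calibration dataset and a tokenizer
# to prepare the calibration samples.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)
```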
Speed and Compatibility
Benchmarks indicate that GPTQ-quantized models are faster than bitsandbytes for text generation. bitsandbytes currently supports 8-bit serialization but not 4-bit serialization. The Hugging Face team notes that both libraries have room for improvement and that new features are under active development.
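A short sketch of the serialization difference, with placeholder repo IDs and output paths (the "opt-350m-gptq" directory stands in for a GPTQ checkpoint saved earlier, e.g. from the sketch above):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# bitsandbytes: 8-bit quantized weights can be saved back to disk,
# while 4-bit bitsandbytes weights are not serializable at the time of writing.
model_8bit = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",  # example placeholder
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
model_8bit.save_pretrained("opt-350m-8bit")

# GPTQ: an already-quantized checkpoint (local or on the Hub) reloads directly,
# with no re-quantization step.
gptq_model = AutoModelForCausalLM.from_pretrained("opt-350m-gptq", device_map="auto")
```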
The full article includes links to detailed resources, notebooks, and documentation for each quantization method.