Large language models (LLMs) have shown remarkable abilities in understanding and generating human-like text, but their size still makes them challenging to train and deploy on consumer hardware. In line with its mission to democratize machine learning, Hugging Face has integrated the AutoGPTQ library into Transformers, enabling users to quantize models to 8, 4, 3, or even 2-bit precision using the GPTQ algorithm. At 4-bit, the integration offers negligible accuracy degradation and inference speeds comparable to fp16 for small batch sizes, and it works on both Nvidia and AMD GPUs.
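As a rough sketch of what the user-facing side of the integration looks like, the snippet below loads an already-quantized checkpoint from the Hub and runs generation. It assumes a recent Transformers release with the optimum and auto-gptq packages installed, plus a supported GPU; the checkpoint name is purely illustrative.

```python
# Minimal sketch: loading a pre-quantized GPTQ checkpoint with Transformers.
# Assumes `pip install transformers optimum auto-gptq` and a CUDA (or ROCm) GPU.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7b-Chat-GPTQ"  # illustrative 4-bit checkpoint from the Hub

tokenizer = AutoTokenizer.from_pretrained(model_id)
# The quantization config stored in the checkpoint is picked up automatically;
# weights stay in int4 on the GPU and are dequantized on the fly during inference.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Quantization makes large models", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```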
GPTQ is a post-training quantization method that uses a calibration dataset to reduce model size while preserving performance. It employs a mixed int4/fp16 scheme: weights are quantized to int4 while activations remain in float16, and the weights are dequantized on the fly during inference. This yields nearly 4x memory savings over fp16 and potential speedups, since less data has to be moved between memory and compute.
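The sketch below shows what quantizing a model yourself with a calibration dataset can look like through the Transformers API. It assumes the same transformers/optimum/auto-gptq setup as above; the model id, calibration dataset choice, and output directory are placeholders for illustration only.

```python
# Minimal sketch: quantizing an fp16 model to 4-bit with a calibration dataset.
# Model id, dataset, and save path are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # a small model, used here only to keep the example light
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Weights are quantized to int4 while activations stay in fp16, matching the
# mixed int4/fp16 scheme described above; "c4" is used as the calibration data.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)

# Save (or push to the Hub) so the quantized weights can be reloaded directly later.
quantized_model.save_pretrained("opt-125m-gptq")
tokenizer.save_pretrained("opt-125m-gptq")
```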
The AutoGPTQ library supports a wide range of transformer architectures, going beyond earlier efforts focused solely on Llama. Hugging Face provides integrated APIs, documentation, and a Colab notebook to help users quantize, run inference, and fine-tune models with PEFT. The integration also extends to the Optimum library and Text-Generation-Inference for serving quantized models.
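For fine-tuning, one plausible pattern is to keep the quantized base weights frozen and train small LoRA adapters on top with PEFT. The sketch below assumes the peft package is installed alongside the stack above; the checkpoint name, LoRA hyperparameters, and target module names are illustrative assumptions, not prescribed values.

```python
# Minimal sketch: attaching LoRA adapters to a GPTQ-quantized model with PEFT.
# Assumes `pip install peft` in addition to transformers/optimum/auto-gptq.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "TheBloke/Llama-2-7b-Chat-GPTQ"  # illustrative quantized checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Freeze the quantized base weights and prepare norms/embeddings for stable training.
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # typical attention projections in Llama-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapters are trainable
```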
This release is part of Hugging Face's ongoing work to make large models accessible, following similar collaborations like bitsandbytes. For more details, see the original paper, the Transformers quantization documentation, and model repositories by The Bloke.