Transformer models have become the backbone of modern machine learning, excelling in NLP, computer vision, and speech tasks. However, deploying them in production is often challenging due to their massive size—large language models can exceed tens of gigabytes, making it difficult to achieve both high throughput and low latency. In a new partnership, Hugging Face and Amazon Web Services aim to solve this by optimizing Hugging Face Transformers for AWS Inferentia2, a purpose-built inference accelerator.
What is AWS Inferentia2?
AWS Inferentia2 is the second-generation inference chip from AWS, succeeding the 2019 Inferentia1. The new chip delivers a 4x increase in throughput and a 10x reduction in latency compared to its predecessor. Amazon EC2 Inf2 instances, powered by Inferentia2, offer up to 2.6x better throughput, 8.1x lower latency, and 50% better performance per watt than comparable GPU-based instances. Inf2 instances scale from 1 to 12 Inferentia2 chips, with direct chip-to-chip connectivity for distributed inference on models up to 175 billion parameters—like GPT-3 or BLOOM.
Deploying models on Inferentia2 requires minimal effort thanks to the optimum-neuron library. With a single line of code, users can compile their Hugging Face models for Inferentia2, removing the need for complex manual model slicing or optimization.
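For illustration, here is a minimal sketch of what that export looks like with the optimum-neuron Python API. The model ID, input shapes, and example sentence are illustrative choices (not taken from the original announcement), and the code assumes it runs on an Inf2 instance with the AWS Neuron SDK and optimum-neuron installed:

```python
from optimum.neuron import NeuronModelForSequenceClassification
from transformers import AutoTokenizer

# Neuron compiles for static shapes, so batch size and sequence length are fixed up front.
model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint
input_shapes = {"batch_size": 1, "sequence_length": 128}

# export=True compiles the Hugging Face checkpoint for Inferentia2 on load.
model = NeuronModelForSequenceClassification.from_pretrained(
    model_id, export=True, **input_shapes
)
model.save_pretrained("distilbert_neuron/")  # persist the compiled artifacts for reuse

# Inference: pad inputs to the compiled sequence length.
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer(
    "Inferentia2 makes transformer inference fast.",
    return_tensors="pt",
    padding="max_length",
    max_length=128,
    truncation=True,
)
outputs = model(**inputs)
print(outputs.logits)
```

The compiled model keeps the familiar Transformers-style interface, so existing inference code needs little or no change beyond the export step.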
Benchmarking Results
To validate performance claims, Hugging Face benchmarked popular models—BERT, RoBERTa, DistilBERT, ALBERT, and Vision Transformer—on Inferentia1, Inferentia2, and NVIDIA A10G GPUs. Experiments measured p95 latency across sequence lengths from 8 to 512 tokens with a batch size of 1.
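The exact benchmark harness is not reproduced here, but a p95 latency measurement for a single model and sequence length at batch size 1 typically looks like the sketch below. The function name, iteration counts, and dummy-input construction are illustrative assumptions, not Hugging Face's actual script:

```python
import time
import numpy as np

def measure_p95_latency(model, tokenizer, sequence_length, num_runs=300, warmup=30):
    """Approximate p95 latency in milliseconds for one model at a fixed sequence length."""
    # Build a dummy input padded to the target sequence length (batch size 1).
    inputs = tokenizer(
        "sample " * sequence_length,
        return_tensors="pt",
        padding="max_length",
        max_length=sequence_length,
        truncation=True,
    )

    # Warm-up runs so one-time compilation and caching effects do not skew the numbers.
    for _ in range(warmup):
        model(**inputs)

    latencies = []
    for _ in range(num_runs):
        start = time.perf_counter()
        model(**inputs)
        latencies.append((time.perf_counter() - start) * 1000)

    return np.percentile(latencies, 95)
```

Running this loop for each model and each sequence length on each target device yields the kind of p95 comparison summarized below.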
Key findings:
- Inferentia2 delivers, on average, 4.5x lower latency than NVIDIA A10G GPUs.
- Inferentia2 delivers 4x lower latency than Inferentia1 across the tested models.
- Results were consistent across all tested model architectures and sequence lengths.
"The benchmark confirms that the performance improvements claimed by AWS can be reproduced and validated by real use-cases and examples."
These improvements mean that developers can deploy state-of-the-art Transformers with significantly lower cost per inference and faster response times, making AI applications more accessible and efficient.
Conclusion
With native integration via the AWS Neuron SDK and optimum-neuron, Hugging Face users can now easily leverage AWS Inferentia2 to run large transformers at scale. The combination of ease of use, dramatic performance gains, and energy efficiency positions Inferentia2 as a compelling choice for production AI workloads. Whether you're running BERT for search or a massive LLM for conversational AI, Inferentia2 offers a path to better performance without added complexity.