
Hugging Face Details Millisecond Latency Achieved with Infinity and Modern CPUs

AI · April 26, 2026 · 5:42 PM

Hugging Face has released performance benchmarks for its Infinity inference solution running on the latest Intel Xeon CPUs, demonstrating significant latency and throughput improvements for Transformer models.

The company reports that Infinity on 3rd-generation Intel Xeon (Ice Lake) instances delivers up to 34% better latency and throughput than previous-generation Cascade Lake instances, and up to an 800% improvement (roughly a 9x speedup) over vanilla Transformer deployments on the same hardware.

Infinity is a containerized solution that provides optimized inference pipelines for popular Transformer models. It consists of the Infinity Container, a Docker-based inference server, and Infinity Multiverse, a model optimization service that tailors models for specific hardware targets.
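
Because the Infinity Container exposed its optimized models behind an HTTP endpoint, a deployment could be queried like any other inference server. The sketch below is illustrative only: the URL, port, and JSON payload schema are assumptions for this article, not Infinity's documented API.

```python
import requests

# Hypothetical Infinity Container endpoint; the real host, port, and
# payload schema depend on how the container is configured and deployed.
INFINITY_URL = "http://localhost:8080/predict"  # assumed address

def classify(text: str) -> dict:
    """Send one sequence-classification request and return the parsed JSON."""
    response = requests.post(INFINITY_URL, json={"inputs": text}, timeout=5)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    print(classify("Infinity makes CPU inference fast."))
```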

Benchmarks were conducted on Amazon EC2 C6i instances powered by Ice Lake processors with Intel AVX-512, Turbo Boost, and Deep Learning Boost. The tests covered a DistilBERT model for sequence classification across 192 configurations, varying the number of CPU cores (1, 2, 4, 8), the sequence length (8 to 512 tokens), and the batch size (1 to 32). The key metrics were end-to-end latency, covering preprocessing, prediction, and postprocessing, and throughput in requests per second.
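
The Infinity benchmark harness itself is not public, but a comparable grid can be approximated against a vanilla Transformers baseline, the same baseline the 800% comparison was measured against. The snippet below is a minimal sketch under assumptions: the checkpoint name and grid values are illustrative, and torch.set_num_threads stands in for the core-count dimension.

```python
import itertools
import time

import torch
from transformers import pipeline

# Vanilla-baseline sketch of the benchmark grid; Infinity itself was a
# closed, containerized product, so this only mirrors the methodology.
torch.set_num_threads(4)  # core count was a separate grid dimension (1, 2, 4, 8)

MODEL = "distilbert-base-uncased-finetuned-sst-2-english"  # assumed checkpoint
classifier = pipeline("text-classification", model=MODEL)

seq_lengths = [8, 16, 32, 64, 128, 256, 384, 512]  # assumed grid values
batch_sizes = [1, 2, 4, 8, 16, 32]

for seq_len, batch in itertools.product(seq_lengths, batch_sizes):
    texts = ["word " * seq_len] * batch  # roughly seq_len tokens per input
    start = time.perf_counter()
    classifier(texts, truncation=True, max_length=seq_len)
    elapsed = time.perf_counter() - start  # end to end: pre-, inference, post-processing
    print(f"seq={seq_len:3d} batch={batch:2d} "
          f"latency={elapsed * 1000:8.2f} ms throughput={batch / elapsed:7.1f} req/s")
```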

Hugging Face has made a live endpoint available for testing, and the full benchmark data is accessible in a public spreadsheet.

Update: As of December 2022, Hugging Face no longer offers Infinity as a commercial product. Users are directed to Inference Endpoints and the Optimum Intel and ONNX Runtime libraries, or to the Expert Acceleration Program for custom optimization work.
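
For readers migrating, the recommended path can be exercised directly with the open-source Optimum library. The snippet below is a minimal sketch of the ONNX Runtime route; the checkpoint name is an assumption, and the numbers it produces will of course differ from the Infinity figures above.

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

# Export a Transformers checkpoint to ONNX and serve it via ONNX Runtime,
# the route Hugging Face recommends now that Infinity is discontinued.
MODEL = "distilbert-base-uncased-finetuned-sst-2-english"  # assumed checkpoint

model = ORTModelForSequenceClassification.from_pretrained(MODEL, export=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL)

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("ONNX Runtime keeps CPU latency low."))
```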

For companies requiring low-latency, high-throughput AI inference at scale, Infinity on modern CPUs offered a cost-effective alternative to GPU-based deployments; for new projects, Hugging Face now points users to the successor tools above.