Context and Motivations
Since the original benchmarking of Transformer models in 2019, the Hugging Face ecosystem has grown exponentially, with over 9,000 models now available. Deploying BERT-like architectures at scale remains a challenge, prompting the development of the Hugging Face Inference API and a series of optimization techniques.
This article, the first in a series, focuses on hardware and software optimizations for CPU-based BERT inference. Key topics include establishing baselines, leveraging modern CPU features, core count scaling, and batch size scaling with multiple model instances.
Benchmarking Methodology
To measure performance, we use two standard metrics, both illustrated in the sketch below:
- Latency: the time taken to complete a single model execution (one forward pass).
- Throughput: the number of executions completed within a fixed amount of time.
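As a minimal sketch (not the actual benchmarking framework), assuming PyTorch and the `bert-base-cased` checkpoint, both metrics can be derived from a simple timing loop:

```python
import time

import torch
from transformers import AutoModel, AutoTokenizer

# Minimal timing loop: run repeated forward passes and derive
# latency / throughput from wall-clock time.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased").eval()
inputs = tokenizer("Benchmarking BERT inference on CPU.", return_tensors="pt")

latencies = []
with torch.no_grad():
    for _ in range(100):
        start = time.perf_counter()
        model(**inputs)
        latencies.append(time.perf_counter() - start)

avg = sum(latencies) / len(latencies)
print(f"Average latency : {avg * 1000:.2f} ms")
print(f"Throughput      : {1.0 / avg:.2f} samples/sec")
```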
The benchmarking framework has been rewritten to integrate the latest Transformers features and is based on Facebook AI Research's Hydra configuration library for reproducibility. The code is available on GitHub.
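As an illustration of the Hydra-based approach, a hypothetical entry point might look like the following; the config group names and the `conf/benchmark.yaml` layout are assumptions, not the actual schema of the repository:

```python
import hydra
from omegaconf import DictConfig

# Hypothetical Hydra entry point. Assumes a conf/benchmark.yaml file
# defining "backend", "batch_size", and "sequence_length" keys; these
# names are illustrative, not the benchmark repository's real schema.
@hydra.main(config_path="conf", config_name="benchmark")
def run(cfg: DictConfig) -> None:
    print(f"backend={cfg.backend} batch={cfg.batch_size} seq_len={cfg.sequence_length}")
    # ... instantiate the chosen backend and run the measurement loop here

if __name__ == "__main__":
    run()
```

Hydra's value here is reproducibility: every run records its full configuration, and sweeps over backends or batch sizes become command-line overrides rather than code changes.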
Support is included for PyTorch, TensorFlow, TorchScript, Google XLA, and ONNX Runtime, the latter offering specific optimizations for transformer models.
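As a hedged sketch of the ONNX Runtime path, the session below enables all graph optimizations, which include transformer-specific operator fusions; the `model.onnx` path is a placeholder for a previously exported BERT model, and the input shapes are illustrative:

```python
import numpy as np
from onnxruntime import GraphOptimizationLevel, InferenceSession, SessionOptions

# Enable all graph-level optimizations, including the transformer-specific
# fusions applied when ONNX Runtime recognizes the model pattern.
options = SessionOptions()
options.graph_optimization_level = GraphOptimizationLevel.ORT_ENABLE_ALL
options.intra_op_num_threads = 4  # illustrative thread count

# "model.onnx" stands in for a previously exported BERT checkpoint.
session = InferenceSession("model.onnx", options)

# Build a dummy feed from whatever inputs the export actually declares
# (typically input_ids, attention_mask, and sometimes token_type_ids).
feed = {inp.name: np.ones((1, 128), dtype=np.int64) for inp in session.get_inputs()}
outputs = session.run(None, feed)
```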
Baselines
We establish baseline performance using out-of-the-box settings for PyTorch and TensorFlow with the BERT-base model. These baselines serve as a reference for subsequent optimizations.
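For reference, an out-of-the-box TensorFlow baseline might look like the following, mirroring an equally plain PyTorch setup; no thread tuning or graph compilation is applied:

```python
from transformers import AutoTokenizer, TFAutoModel

# Default eager-mode TensorFlow inference: no thread configuration,
# no XLA compilation, framework defaults everywhere.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = TFAutoModel.from_pretrained("bert-base-cased")

inputs = tokenizer("Benchmarking BERT inference on CPU.", return_tensors="tf")
outputs = model(inputs)
```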
Scaling BERT Inference on Modern CPUs
Cores and Threads
Modern CPUs expose multiple physical cores, each usually presenting two hardware threads via simultaneous multithreading (Intel Hyper-Threading). Understanding how to allocate threads and processes across them is crucial for maximizing throughput.
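A minimal sketch of thread configuration in PyTorch, assuming two hardware threads per physical core:

```python
import os

import torch

# Illustrative thread setup: the right values depend on the machine,
# but the physical core count (ignoring hyper-threads) is a common
# starting point for intra-op parallelism.
physical_cores = os.cpu_count() // 2  # assumes 2 hardware threads per core
torch.set_num_threads(physical_cores)  # intra-op thread pool
torch.set_num_interop_threads(1)       # inter-op thread pool
print(f"Using {torch.get_num_threads()} intra-op threads")
```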
Multi-Socket Servers and CPU Affinity
On multi-socket servers, memory is physically attached to a specific socket (NUMA), so access latency depends on whether a thread reads local or remote memory. Setting CPU affinity keeps threads on the same socket as the memory they allocate, avoiding the remote-access penalty.
Tuning Thread Affinity & Memory Allocation
Using tools like numactl and taskset, we can bind threads to specific cores or sockets, improving cache utilization and reducing contention.
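On Linux, the same binding can also be done from Python; the core IDs below are illustrative and assume cores 0-15 sit on socket 0:

```python
import os

# Pin the current process to the cores of socket 0 (Linux only).
# Roughly equivalent shell invocations:
#   numactl --cpunodebind=0 --membind=0 python benchmark.py
#   taskset -c 0-15 python benchmark.py
os.sched_setaffinity(0, set(range(16)))
print(f"Affinity: {sorted(os.sched_getaffinity(0))}")
```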
Core Count Scaling
Using more cores does not always linearly improve performance due to overheads like cache misses and memory bandwidth saturation. Our experiments show diminishing returns beyond a certain core count.
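A sketch of such a sweep, again assuming PyTorch and `bert-base-cased`; the absolute numbers and the point of diminishing returns vary by machine:

```python
import time

import torch
from transformers import AutoModel, AutoTokenizer

# Sweep intra-op thread counts and record average latency to find
# where adding cores stops paying off.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased").eval()
inputs = tokenizer("Core count scaling example.", return_tensors="pt")

for num_threads in (1, 2, 4, 8, 16):
    torch.set_num_threads(num_threads)
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(20):
            model(**inputs)
        elapsed = (time.perf_counter() - start) / 20
    print(f"{num_threads:>2} threads: {elapsed * 1000:.1f} ms")
```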
Multi-Stream Inference
Multiple Independent Instances
Instead of scaling one model instance, we can run several instances in parallel, each handling separate inference requests. This approach increases throughput but requires careful resource allocation.
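A hedged sketch of this pattern with Python's `multiprocessing`, assuming a 16-core machine split into two 8-core instances (core IDs and the worker loop are illustrative):

```python
import os
from multiprocessing import Process

def serve(instance_id, cores):
    # Each worker pins itself to its own slice of cores (Linux only),
    # then loads a private model copy so instances never contend for
    # the same caches or thread pools.
    os.sched_setaffinity(0, cores)
    import torch
    from transformers import AutoModel
    torch.set_num_threads(len(cores))
    model = AutoModel.from_pretrained("bert-base-cased").eval()
    # ... pull requests from a queue and run inference here

if __name__ == "__main__":
    # Hypothetical split: two independent instances, 8 cores each.
    workers = [
        Process(target=serve, args=(0, set(range(0, 8)))),
        Process(target=serve, args=(1, set(range(8, 16)))),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```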
Smart Dispatching
Different model instances can be dedicated to different input sizes (e.g., sequence-length buckets), so each instance pads inputs only up to its bucket's bound rather than a global maximum, reducing wasted computation.
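As an illustrative sketch, a dispatcher might route requests by token count; the bucket bounds here are hypothetical:

```python
# Hypothetical dispatcher: send each request to the instance whose
# padded bucket wastes the least computation.
BUCKETS = [32, 128, 512]  # max sequence length handled by each instance

def pick_instance(num_tokens):
    """Return the index of the first instance whose bucket fits the input."""
    for index, max_len in enumerate(BUCKETS):
        if num_tokens <= max_len:
            return index
    return len(BUCKETS) - 1  # fall back to the largest bucket

print(pick_instance(20))   # -> 0, short inputs go to the seq-len-32 instance
print(pick_instance(300))  # -> 2
```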
Batch Size Scaling
Batching multiple inputs together reduces overhead per sample. Combining batching with multiple model instances yields significant throughput gains.
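A minimal batching sketch using the Transformers tokenizer's dynamic padding:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Batched inference: tokenize several requests together, padding to the
# longest sequence in the batch, then run a single forward pass.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased").eval()

requests = ["first request", "a somewhat longer second request", "third"]
batch = tokenizer(requests, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)
print(outputs.last_hidden_state.shape)  # (batch_size, seq_len, hidden_size)
```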
Conclusion
Optimizing BERT inference on CPUs involves a combination of hardware awareness, thread management, and parallelism techniques. The strategies described here—core affinity, multi-instance inference, and batching—can dramatically improve performance. Future posts will explore software-level optimizations and quantization.
Acknowledgments
Thanks to the Hugging Face team and the open-source community for their contributions.
References
- Original benchmarking blog (Medium)
- Transformers library (GitHub)
- Hugging Face Model Hub
- BERT paper (Devlin et al., 2018)
- The Illustrated Transformer (Jay Alammar)
- TorchScript documentation
- XLA documentation
- ONNX Runtime