Context and Motivations
Since the original benchmarking of Transformer models in 2019, the Hugging Face ecosystem has grown exponentially, with over 9,000 models now available. Deploying BERT-like architectures at scale remains a challenge, prompting the development of the Hugging Face Inference API and a series of optimization techniques.
This article, the first in a series, focuses on hardware and software optimizations for CPU-based BERT inference. Key topics include establishing baselines, leveraging modern CPU features, core count scaling, and batch size scaling with multiple model instances.
Benchmarking Methodology
To measure performance, we use two standard metrics, both illustrated in the sketch below:
- Latency: the time taken to complete a single model execution (one forward pass).
- Throughput: the number of executions completed within a fixed amount of time.
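As a minimal sketch (not the actual benchmarking framework), assuming PyTorch and the `bert-base-cased` checkpoint, both metrics can be derived from a simple timing loop:

```python
import time

import torch
from transformers import AutoModel, AutoTokenizer

# Minimal timing loop: run repeated forward passes and derive
# latency / throughput from wall-clock time.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased").eval()
inputs = tokenizer("Benchmarking BERT inference on CPU.", return_tensors="pt")

latencies = []
with torch.no_grad():
    for _ in range(100):
        start = time.perf_counter()
        model(**inputs)
        latencies.append(time.perf_counter() - start)

avg = sum(latencies) / len(latencies)
print(f"Average latency : {avg * 1000:.2f} ms")
print(f"Throughput      : {1.0 / avg:.2f} samples/sec")
```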
The benchmarking framework has been rewritten to integrate the latest Transformers features and is based on Facebook AI Research's Hydra configuration library for reproducibility. The code is available on GitHub.
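As an illustration of the Hydra-based approach, a hypothetical entry point might look like the following; the config group names and the `conf/benchmark.yaml` layout are assumptions, not the actual schema of the repository:

```python
import hydra
from omegaconf import DictConfig

# Hypothetical Hydra entry point. Assumes a conf/benchmark.yaml file
# defining "backend", "batch_size", and "sequence_length" keys; these
# names are illustrative, not the benchmark repository's real schema.
@hydra.main(config_path="conf", config_name="benchmark")
def run(cfg: DictConfig) -> None:
    print(f"backend={cfg.backend} batch={cfg.batch_size} seq_len={cfg.sequence_length}")
    # ... instantiate the chosen backend and run the measurement loop here

if __name__ == "__main__":
    run()
```

Hydra's value here is reproducibility: every run records its full configuration, and sweeps over backends or batch sizes become command-line overrides rather than code changes.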
Support is included for PyTorch, TensorFlow, TorchScript, Google XLA, and ONNX Runtime, the latter offering specific optimizations for transformer models.
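As a hedged sketch of the ONNX Runtime path, the session below enables all graph optimizations, which include transformer-specific operator fusions; the `model.onnx` path is a placeholder for a previously exported BERT model, and the input shapes are illustrative:

```python
import numpy as np
from onnxruntime import GraphOptimizationLevel, InferenceSession, SessionOptions

# Enable all graph-level optimizations, including the transformer-specific
# fusions applied when ONNX Runtime recognizes the model pattern.
options = SessionOptions()
options.graph_optimization_level = GraphOptimizationLevel.ORT_ENABLE_ALL
options.intra_op_num_threads = 4  # illustrative thread count

# "model.onnx" stands in for a previously exported BERT checkpoint.
session = InferenceSession("model.onnx", options)

# Build a dummy feed from whatever inputs the export actually declares
# (typically input_ids, attention_mask, and sometimes token_type_ids).
feed = {inp.name: np.ones((1, 128), dtype=np.int64) for inp in session.get_inputs()}
outputs = session.run(None, feed)
```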
Baselines
We establish baseline performance using out-of-the-box settings for PyTorch and TensorFlow with the BERT-base model. These baselines serve as a reference for subsequent optimizations.
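For reference, an out-of-the-box TensorFlow baseline might look like the following, mirroring an equally plain PyTorch setup; no thread tuning or graph compilation is applied:

```python
from transformers import AutoTokenizer, TFAutoModel

# Default eager-mode TensorFlow inference: no thread configuration,
# no XLA compilation, framework defaults everywhere.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = TFAutoModel.from_pretrained("bert-base-cased")

inputs = tokenizer("Benchmarking BERT inference on CPU.", return_tensors="tf")
outputs = model(inputs)
```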
Scaling BERT Inference on Modern CPUs
Cores and Threads
Modern CPUs expose multiple physical cores, each usually presenting two hardware threads via simultaneous multithreading (Intel Hyper-Threading). Understanding how to allocate threads and processes across them is crucial for maximizing throughput.
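A minimal sketch of thread configuration in PyTorch, assuming two hardware threads per physical core:

```python
import os

import torch

# Illustrative thread setup: the right values depend on the machine,
# but the physical core count (ignoring hyper-threads) is a common
# starting point for intra-op parallelism.
physical_cores = os.cpu_count() // 2  # assumes 2 hardware threads per core
torch.set_num_threads(physical_cores)  # intra-op thread pool
torch.set_num_interop_threads(1)       # inter-op thread pool
print(f"Using {torch.get_num_threads()} intra-op threads")
```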
Multi-Socket Servers and CPU Affinity
On multi-socket servers, memory is physically attached to a specific socket (NUMA), so access latency depends on whether a thread reads local or remote memory. Setting CPU affinity keeps threads on the same socket as the memory they allocate, avoiding the remote-access penalty.
Tuning Thread Affinity & Memory Allocation
Using tools like numactl and taskset, we can bind threads to specific cores or sockets, improving cache utilization and reducing contention.
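On Linux, the same binding can also be done from Python; the core IDs below are illustrative and assume cores 0-15 sit on socket 0:

```python
import os

# Pin the current process to the cores of socket 0 (Linux only).
# Roughly equivalent shell invocations:
#   numactl --cpunodebind=0 --membind=0 python benchmark.py
#   taskset -c 0-15 python benchmark.py
os.sched_setaffinity(0, set(range(16)))
print(f"Affinity: {sorted(os.sched_getaffinity(0))}")
```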
Core Count Scaling
Using more cores does not always linearly improve performance due to overheads like cache misses and memory bandwidth saturation. Our experiments show diminishing returns beyond a certain core count.
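A sketch of such a sweep, again assuming PyTorch and `bert-base-cased`; the absolute numbers and the point of diminishing returns vary by machine:

```python
import time

import torch
from transformers import AutoModel, AutoTokenizer

# Sweep intra-op thread counts and record average latency to find
# where adding cores stops paying off.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased").eval()
inputs = tokenizer("Core count scaling example.", return_tensors="pt")

for num_threads in (1, 2, 4, 8, 16):
    torch.set_num_threads(num_threads)
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(20):
            model(**inputs)
        elapsed = (time.perf_counter() - start) / 20
    print(f"{num_threads:>2} threads: {elapsed * 1000:.1f} ms")
```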
Multi-Stream Inference
Multiple Independent Instances
Instead of scaling one model instance, we can run several instances in parallel, each handling separate inference requests. This approach increases throughput but requires careful resource allocation.
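A hedged sketch of this pattern with Python's `multiprocessing`, assuming a 16-core machine split into two 8-core instances (core IDs and the worker loop are illustrative):

```python
import os
from multiprocessing import Process

def serve(instance_id, cores):
    # Each worker pins itself to its own slice of cores (Linux only),
    # then loads a private model copy so instances never contend for
    # the same caches or thread pools.
    os.sched_setaffinity(0, cores)
    import torch
    from transformers import AutoModel
    torch.set_num_threads(len(cores))
    model = AutoModel.from_pretrained("bert-base-cased").eval()
    # ... pull requests from a queue and run inference here

if __name__ == "__main__":
    # Hypothetical split: two independent instances, 8 cores each.
    workers = [
        Process(target=serve, args=(0, set(range(0, 8)))),
        Process(target=serve, args=(1, set(range(8, 16)))),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```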
Smart Dispatching
Different model instances can be dedicated to different input sizes (e.g., sequence-length buckets), so each instance pads inputs only up to its bucket's bound rather than a global maximum, reducing wasted computation.
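As an illustrative sketch, a dispatcher might route requests by token count; the bucket bounds here are hypothetical:

```python
# Hypothetical dispatcher: send each request to the instance whose
# padded bucket wastes the least computation.
BUCKETS = [32, 128, 512]  # max sequence length handled by each instance

def pick_instance(num_tokens):
    """Return the index of the first instance whose bucket fits the input."""
    for index, max_len in enumerate(BUCKETS):
        if num_tokens <= max_len:
            return index
    return len(BUCKETS) - 1  # fall back to the largest bucket

print(pick_instance(20))   # -> 0, short inputs go to the seq-len-32 instance
print(pick_instance(300))  # -> 2
```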
Batch Size Scaling
Batching multiple inputs together reduces overhead per sample. Combining batching with multiple model instances yields significant throughput gains.
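A minimal batching sketch using the Transformers tokenizer's dynamic padding:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Batched inference: tokenize several requests together, padding to the
# longest sequence in the batch, then run a single forward pass.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased").eval()

requests = ["first request", "a somewhat longer second request", "third"]
batch = tokenizer(requests, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)
print(outputs.last_hidden_state.shape)  # (batch_size, seq_len, hidden_size)
```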
Conclusion
Optimizing BERT inference on CPUs involves a combination of hardware awareness, thread management, and parallelism techniques. The strategies described here—core affinity, multi-instance inference, and batching—can dramatically improve performance. Future posts will explore software-level optimizations and quantization.
Acknowledgments
Thanks to the Hugging Face team and the open-source community for their contributions.
References
- Original benchmarking blog (Medium)
- Transformers library (GitHub)
- Hugging Face Model Hub
- BERT paper (Devlin et al., 2018)
- The Illustrated Transformer (Jay Alammar)
- TorchScript documentation
- XLA documentation
- ONNX Runtime