Inside the Optimization Journey: How BLOOM Inference Got 5x Faster

April 26, 2026 · 5:18 PM

This article tells the behind-the-scenes story of building an efficient inference server for the large language model BLOOM, cutting latency by 5x and raising throughput by 50x over several weeks of work. The team shares their struggles and wins, from porting the training code into the transformers library to exploring multiple optimization routes.

Creating BLOOM

BLOOM was trained with Megatron-DeepSpeed, so the checkpoint and modeling code had to be carefully ported to the transformers library before it could be served. This month-long task involved nearly 200 commits, and the team created much smaller test models to iterate quickly during development.
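
The article doesn't show the actual test configurations, but as a rough sketch, a tiny randomly initialized BLOOM variant can be built with transformers' BloomConfig so the full modeling code path runs in seconds (the sizes below are illustrative, not the team's real settings):

```python
from transformers import BloomConfig, BloomForCausalLM

# Randomly initialized tiny BLOOM: exercises the same modeling code as the
# 176B checkpoint, but builds in seconds. Sizes here are illustrative only.
tiny_config = BloomConfig(
    vocab_size=1024,
    hidden_size=64,
    n_layer=2,
    n_head=8,
)
tiny_model = BloomForCausalLM(tiny_config)
print(sum(p.numel() for p in tiny_model.parameters()), "parameters")
```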

Porting to Transformers

A key challenge was maintaining generation quality while optimizing for speed. The team established a test suite with fixed prompts and greedy decoding to keep outputs deterministic, accepting minor logit differences when they didn't affect the generated text. A configurable flag was added to handle the small numerical differences introduced by tensor parallelism.
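
A hedged sketch of what such a regression test might look like, assuming an already-loaded model and tokenizer; the prompt, expected text, and tolerance below are placeholders, not the team's actual suite:

```python
import torch

def check_generation(model, tokenizer, prompt, expected_text, ref_logits=None):
    # Fixed prompt + greedy decoding keeps the output deterministic across runs.
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
    text = tokenizer.decode(out[0], skip_special_tokens=True)
    assert text == expected_text, f"generation drifted: {text!r}"
    # Logits may differ slightly between implementations; that is acceptable
    # as long as the argmax (and hence the greedy output) is unchanged.
    if ref_logits is not None:
        with torch.no_grad():
            logits = model(**inputs).logits
        assert torch.allclose(logits, ref_logits, atol=1e-3)
```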

First Inference (PP + Accelerate)

Initial inference used pipeline parallelism via Accelerate, which spreads the model's layers across devices and runs them in sequence. It worked, but performance was insufficient, so the team explored multiple routes: JAX/Flax on TPUs, ONNX/TRT, DeepSpeed, webserver ideas, and pure PyTorch.
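
For reference, this is roughly what that first path looks like with Accelerate's device_map="auto", which places layers across available devices and runs them sequentially; the article doesn't include the exact loading code:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    device_map="auto",           # Accelerate places layers across GPUs (and CPU/disk)
    torch_dtype=torch.bfloat16,  # half precision so the 176B weights fit
)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(output[0]))
```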

Final Route: PyTorch + TP + Custom Kernel + torch.jit.script

The winning combination involved writing more efficient PyTorch code, supporting tensor parallelism, adding a custom kernel, and scripting hot functions with torch.jit.script. Low-hanging fruit like operator fusion and memory optimizations yielded significant gains, while some experiments (like flash attention) were abandoned due to complexity.
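
As one illustration of the torch.jit.script side (not the team's actual code), scripting an elementwise function lets the fuser collapse several pointwise operations, here a bias-add plus tanh-approximated GELU, into fewer kernel launches:

```python
import torch

@torch.jit.script
def bias_gelu(bias: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    # Bias-add followed by the tanh approximation of GELU; under TorchScript
    # these pointwise ops can be fused into a single GPU kernel.
    y = x + bias
    return y * 0.5 * (1.0 + torch.tanh(0.79788456 * y * (1.0 + 0.044715 * y * y)))

x = torch.randn(8, 1024)
bias = torch.randn(1024)
out = bias_gelu(bias, x)
```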

Low-hanging Fruits

Key optimizations included reducing memory allocations, using in-place operations, and fusing layer normalization. These simple changes provided a substantial speedup without sacrificing correctness.
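
A toy sketch of the allocation-reducing pattern described here; the shapes and function names are hypothetical:

```python
import torch
import torch.nn.functional as F

hidden_size = 2048
hidden = torch.randn(16, hidden_size)
residual = torch.randn(16, hidden_size)

# Allocating version: `hidden + residual` creates a fresh tensor every call.
def step_alloc(hidden, residual):
    return F.layer_norm(hidden + residual, (hidden_size,))

# In-place version: the residual add reuses hidden's storage, saving one
# allocation per layer per generated token.
def step_inplace(hidden, residual):
    hidden.add_(residual)
    return F.layer_norm(hidden, (hidden_size,))
```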

Epic Fail

Attempts to use full model parallelism with DeepSpeed failed due to instability and performance regressions, leading the team to stick with their custom approach.

Custom Kernel

The team wrote a custom CUDA kernel for the attention mechanism, integrating it with PyTorch's JIT scripting. This kernel reduced overhead and improved GPU utilization.
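
The kernel source isn't shown in the article; for orientation, this is a plain-PyTorch version of the masked-scale-softmax step that such an attention kernel typically replaces, which in eager mode costs several separate launches and intermediate tensors:

```python
import torch

def masked_softmax_reference(scores: torch.Tensor,
                             mask: torch.Tensor,
                             scale: float) -> torch.Tensor:
    scores = scores * scale                  # kernel launch 1
    scores = scores.masked_fill(mask, -1e9)  # launch 2, plus an intermediate tensor
    return torch.softmax(scores, dim=-1)     # launch 3

# Example shapes: (batch, heads, query_len, key_len); mask is boolean.
scores = torch.randn(1, 8, 16, 16)
mask = torch.triu(torch.ones(16, 16, dtype=torch.bool), diagonal=1)  # causal mask
probs = masked_softmax_reference(scores, mask, scale=0.125)
```

A fused kernel performs all three steps in a single pass over the scores, cutting memory traffic and launch overhead.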

Webserver Part

The final server was built with Python's asyncio and a custom batching mechanism to maximize throughput. The system handles concurrent requests efficiently.
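
A minimal, self-contained sketch of asyncio-based dynamic batching in the spirit of that description (not the team's actual server); run_model_batch is a hypothetical stand-in for the real batched generation call:

```python
import asyncio

async def handle_request(queue: asyncio.Queue, prompt: str) -> str:
    # Each request enqueues its prompt with a future and waits for the result.
    fut = asyncio.get_running_loop().create_future()
    await queue.put((prompt, fut))
    return await fut

async def batching_worker(queue: asyncio.Queue, max_batch: int = 8) -> None:
    while True:
        batch = [await queue.get()]              # wait for at least one request
        while len(batch) < max_batch and not queue.empty():
            batch.append(queue.get_nowait())     # opportunistically fill the batch
        prompts = [p for p, _ in batch]
        results = await asyncio.to_thread(run_model_batch, prompts)
        for (_, fut), result in zip(batch, results):
            fut.set_result(result)

def run_model_batch(prompts):
    # Hypothetical placeholder for the batched model.generate(...) call.
    return [p + " [generated]" for p in prompts]

async def main():
    queue = asyncio.Queue()
    worker = asyncio.create_task(batching_worker(queue))
    replies = await asyncio.gather(*(handle_request(queue, f"prompt {i}") for i in range(4)))
    print(replies)
    worker.cancel()

asyncio.run(main())
```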

Last Notes

The team acknowledges that many optimization avenues remain unexplored, including flash attention and OpenAI Triton, and invites the community to share new ideas. The journey highlights the importance of iterative testing, small-model prototypes, and the willingness to abandon promising approaches when they don't deliver.