DailyGlimpse

BLOOMZ on Habana Gaudi2: Fastest LLM Inference Yet

AI
April 26, 2026 · 5:02 PM

Large language models with hundreds of billions of parameters, such as BLOOM and its instruction-tuned variant BLOOMZ, are notoriously difficult to serve for inference because of their memory and compute demands. A new benchmark shows that Habana's Gaudi2 accelerator, combined with the Optimum Habana library, can outperform even Nvidia's A100 80GB GPU, delivering speedups of up to 2.89x.

BLOOMZ, a 176-billion-parameter model fine-tuned for better zero-shot generalization, requires 352 GB of memory just for its weights in 16-bit precision. A Gaudi2 server packs eight accelerators with 96 GB of memory each, enough to host a model of that size. The key enabler is DeepSpeed-inference, which implements model parallelism and KV-cache optimization and is now integrated with Habana's SynapseAI SDK.
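The 352 GB figure follows directly from the parameter count: 16-bit weights cost 2 bytes each. A minimal back-of-the-envelope sketch (the helper function is ours, not part of any library):

```python
# Memory arithmetic for BLOOMZ-176B in 16-bit precision.
# Parameter count and per-device HBM size are taken from the article;
# the function name is illustrative only.

def model_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed to hold the weights alone, in GB (1 GB = 10**9 bytes)."""
    return num_params * bytes_per_param / 1e9

bloomz_gb = model_memory_gb(176e9)  # 176B parameters at 2 bytes each
server_gb = 8 * 96                  # 8 Gaudi2 devices x 96 GB HBM each

print(f"BLOOMZ weights: {bloomz_gb:.0f} GB")       # 352 GB
print(f"Gaudi2 server:  {server_gb} GB")           # 768 GB
print(f"Fits on one server: {bloomz_gb <= server_gb}")  # True
```

This is also why model parallelism is mandatory here: no single 96 GB device can hold the weights, so DeepSpeed-inference shards them across all eight accelerators.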

In latency tests (7-token prompt, 100-token greedy generation), Gaudi2 processed BLOOMZ-176B in 3.103 seconds vs. 4.402 seconds on A100 — a 1.42x improvement. For the 7-billion-parameter BLOOMZ-7B, Gaudi2 was 2.89x faster (0.734s vs. 2.119s on a single A100). Even first-generation Gaudi offered a better price-performance ratio than A100 for smaller models, costing about $13/hour vs. $30+/hour.
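The quoted speedups can be recomputed directly from the reported latencies, as a quick sanity check on the numbers above:

```python
# Reproduce the speedup figures from the latencies reported in the article
# (7-token prompt, 100 greedy-decoded tokens).

latencies_s = {
    "BLOOMZ-176B": {"gaudi2": 3.103, "a100": 4.402},
    "BLOOMZ-7B":   {"gaudi2": 0.734, "a100": 2.119},
}

for model, t in latencies_s.items():
    speedup = t["a100"] / t["gaudi2"]
    print(f"{model}: {speedup:.2f}x faster on Gaudi2")
# BLOOMZ-176B: 1.42x faster on Gaudi2
# BLOOMZ-7B: 2.89x faster on Gaudi2
```

Note that the 7B model fits on a single device, so its 2.89x gap reflects raw per-accelerator performance rather than multi-device scaling.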

"For the 176-billion-parameter checkpoint, Gaudi2 is 1.42x faster than A100 80GB."

The Habana team recently added support for DeepSpeed-inference and HPU graphs, enabling competitive performance on latency-sensitive applications. Users can access Gaudi2 via the Intel Developer Cloud.

These results position Gaudi2 as a strong alternative to GPUs for large language model inference, especially as models continue to grow. Further optimizations are expected in upcoming SynapseAI releases.