Large language models with hundreds of billions of parameters, such as BLOOM and its instruction-tuned variant BLOOMZ, are notoriously difficult to deploy for inference because of their memory and compute demands. A new benchmark shows that Habana's Gaudi2 accelerator, combined with the Optimum Habana library, can outperform Nvidia's A100 80GB GPU, with speedups of up to 2.89x.
BLOOMZ, a 176-billion-parameter model fine-tuned for better zero-shot generalization, needs 352 GB of memory just to store its weights in 16-bit precision (176 billion parameters at 2 bytes each). A Gaudi2 server packs eight accelerators with 96 GB of memory each, enough to host such a model once it is sharded. The key enabler is DeepSpeed-inference, which splits the model across devices (model parallelism) and reuses cached attention keys and values during generation (KV cache), and it is now integrated with Habana's SynapseAI SDK.
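In practice, this kind of run is driven from a short Python script started by a multi-process launcher, with one process per accelerator. The sketch below shows the general shape of such a script, assuming Habana's DeepSpeed fork and SynapseAI are installed; the checkpoint name, mp_size value, and plain from_pretrained loading are illustrative, and a real 176B run needs a more memory-conscious loading scheme.

```python
# Hedged sketch: sharded BLOOMZ inference with DeepSpeed-inference.
# Assumes Habana's DeepSpeed fork and SynapseAI are installed and that this
# script is started by a launcher that spawns one process per device.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloomz-7b1"  # swap in bigscience/bloomz for the 176B checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Shard the weights across devices (model parallelism) and switch to
# DeepSpeed's optimized inference path.
engine = deepspeed.init_inference(
    model,
    mp_size=8,               # one shard per device on an 8-card Gaudi2 server (illustrative)
    dtype=torch.bfloat16,
)
model = engine.module

inputs = tokenizer("DeepSpeed-inference makes it possible to", return_tensors="pt")
inputs = inputs.to(model.device)

# Greedy decoding with the KV cache enabled, matching the benchmark setup.
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False, use_cache=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```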
In latency tests (a 7-token prompt, 100 tokens generated with greedy decoding), Gaudi2 completed the BLOOMZ-176B run in 3.103 seconds versus 4.402 seconds on A100, a 1.42x speedup. For the 7-billion-parameter BLOOMZ-7B, Gaudi2 was 2.89x faster (0.734 s vs. 2.119 s on a single A100). Even first-generation Gaudi offered a better price-performance ratio than A100 for smaller models, at roughly $13/hour versus more than $30/hour.
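The exact benchmark harness is not reproduced here, but the latency figures above correspond to timing a single greedy generation of 100 new tokens from a short prompt. A minimal, backend-agnostic sketch of such a measurement with the transformers API could look like this; the prompt text, warm-up count, and 7B checkpoint name are illustrative rather than the benchmark's actual settings.

```python
# Rough latency measurement sketch: time greedy generation of 100 new tokens
# from a short prompt. Prompt and warm-up count are illustrative only.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloomz-7b1"   # 7B variant; the 176B model needs multi-device sharding
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

# On Gaudi, the model and inputs would be moved to the "hpu" device and run
# through Optimum Habana's generation utilities; this sketch stays device-agnostic.
inputs = tokenizer("Translate to English: Je t'aime.", return_tensors="pt")

# Warm-up runs so one-time compilation and caching do not skew the timing.
for _ in range(3):
    model.generate(**inputs, max_new_tokens=100, do_sample=False, use_cache=True)

start = time.perf_counter()
model.generate(**inputs, max_new_tokens=100, do_sample=False, use_cache=True)  # greedy decoding
latency = time.perf_counter() - start
print(f"Latency for 100 generated tokens: {latency:.3f} s")
```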
"For the 176-billion-parameter checkpoint, Gaudi2 is 1.42x faster than A100 80GB."
The Habana team recently added support for DeepSpeed-inference and HPU graphs, enabling competitive performance in latency-sensitive applications. Users can access Gaudi2 through the Intel Developer Cloud.
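HPU graphs record a sequence of device operations once and replay it on later calls, which trims host-side launch overhead for the short, repetitive steps of token-by-token decoding. A minimal sketch of wrapping a model this way, assuming the habana_frameworks PyTorch bridge is available (the wrap_in_hpu_graph helper follows Habana's documented usage, but the exact import path can vary between SynapseAI releases):

```python
# Sketch: wrap a model in HPU graphs to reduce host-side launch overhead.
# Assumes SynapseAI and the habana_frameworks PyTorch bridge are installed;
# the helper name and import path may differ across releases.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import habana_frameworks.torch as ht

model_name = "bigscience/bloomz-7b1"   # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model = model.eval().to("hpu")

# Record the model's forward pass as an HPU graph so repeated decoding steps
# replay it instead of re-launching every op from the host. Optimum Habana's
# generation utilities additionally rely on static shapes, omitted here.
model = ht.hpu.wrap_in_hpu_graph(model)

inputs = tokenizer("Habana Gaudi2 is", return_tensors="pt").to("hpu")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False, use_cache=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```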
These results position Gaudi2 as a strong alternative to GPUs for large language model inference, especially as models continue to grow. Further optimizations are expected in upcoming SynapseAI releases.