In a series of benchmark tests, Intel's Habana Gaudi2 AI accelerator has demonstrated significant performance gains over both its predecessor and Nvidia's A100 80GB GPU. The new hardware, designed for deep learning workloads, delivered up to 3x the training speed of first-generation Gaudi and nearly 2x the speed of the A100 in BERT pre-training tasks.
Gaudi2 features 96GB of memory per accelerator, triple the 32GB of the original Gaudi, enabling larger batch sizes and higher training throughput. For instance, pre-training BERT with a batch size of 64 samples per device took just 1 hour and 33 minutes on Gaudi2, compared with 8 hours and 53 minutes on first-gen Gaudi, roughly a 5.7x reduction in total training time.
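The headline speedup follows directly from the two wall-clock times; a quick check of the arithmetic:

```python
# Convert both reported training times to minutes and compare.
gaudi1_minutes = 8 * 60 + 53   # 533 minutes on first-gen Gaudi
gaudi2_minutes = 1 * 60 + 33   # 93 minutes on Gaudi2

speedup = gaudi1_minutes / gaudi2_minutes
print(f"{speedup:.2f}x faster")  # ~5.73x
```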
Inference performance also saw a boost: Stable Diffusion image generation latency dropped to 0.925 seconds per image on Gaudi2, versus 3.25 seconds on first-gen Gaudi and 2.63 seconds on the A100. These results were achieved using the Optimum Habana library, which seamlessly supports both Gaudi generations.
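For readers who want to reproduce the Stable Diffusion measurement, image generation with Optimum Habana typically looks like the sketch below. The checkpoint and Gaudi configuration names are illustrative assumptions, not details taken from the benchmark, and the script requires a machine with Habana accelerators and the SynapseAI stack installed:

```python
from optimum.habana.diffusers import GaudiStableDiffusionPipeline

# Assumed model checkpoint and Gaudi config repository; adjust to your setup.
pipeline = GaudiStableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    use_habana=True,       # run on HPU instead of CPU/GPU
    use_hpu_graphs=True,   # capture HPU graphs to reduce host-side overhead
    gaudi_config="Habana/stable-diffusion",
)

# Per-image latency is the total generation time divided by the batch size.
images = pipeline(
    prompt="a photo of an astronaut riding a horse",
    num_images_per_prompt=8,
).images
```

Because the same pipeline class runs on both Gaudi generations, the only variable in such a comparison is the hardware the instance exposes.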
Access to Gaudi2 is available through the Intel Developer Cloud, where users can request instances with eight Gaudi2 accelerators. The Habana SynapseAI SDK ensures compatibility, allowing existing workflows for first-gen Gaudi to run unchanged on the new hardware.
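The unchanged-workflow claim reflects how Optimum Habana abstracts the device: the same training script targets either accelerator generation, and only the instance it is launched on differs. A minimal sketch, assuming a model and tokenized dataset are already prepared (the names below are illustrative, not from the benchmark):

```python
from optimum.habana import GaudiTrainer, GaudiTrainingArguments

# The same arguments work on first-gen Gaudi and Gaudi2; the SynapseAI
# stack on the instance selects the appropriate kernels for the hardware.
args = GaudiTrainingArguments(
    output_dir="./bert-pretraining",
    use_habana=True,                               # run on HPU
    use_lazy_mode=True,                            # lazy-mode graph execution
    gaudi_config_name="Habana/bert-base-uncased",  # assumed config repository
    per_device_train_batch_size=64,                # batch size from the benchmark
)

trainer = GaudiTrainer(
    model=model,                  # a transformers PreTrainedModel (assumed)
    args=args,
    train_dataset=train_dataset,  # a pre-tokenized dataset (assumed)
)
trainer.train()
```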