Optimum Habana v1.7 on Habana Gaudi2 delivers a 2.5x speedup over Nvidia A100 and a 1.4x speedup over H100 when fine-tuning BridgeTower, a state-of-the-art vision-language model. The performance leap comes from hardware-accelerated data loading, a technique applicable to any workload constrained by data loading, a bottleneck especially common in vision models.
BridgeTower Overview
Vision-Language (VL) models have become dominant across many multimodal tasks. BridgeTower improves upon traditional approaches by introducing multiple bridge layers that connect the top layers of uni-modal encoders with each layer of the cross-modal encoder. This enables effective bottom-up cross-modal alignment between visual and textual representations at varying semantic levels.
Pretrained on just 4 million images, BridgeTower achieves state-of-the-art results: it attains 78.73% accuracy on the VQAv2 test-std set, outperforming the previous best (METER) by 1.09% with minimal added parameters. Scaling further yields 81.15% accuracy, surpassing models trained on far larger datasets.
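To make the model concrete, here is a minimal sketch of scoring an image-text pair with BridgeTower through Hugging Face Transformers; the checkpoint name, image URL, and caption are illustrative assumptions rather than the exact setup used in the benchmark below.

```python
import requests
from PIL import Image
from transformers import BridgeTowerForImageAndTextRetrieval, BridgeTowerProcessor

# Illustrative checkpoint; any BridgeTower ITM checkpoint works the same way.
ckpt = "BridgeTower/bridgetower-large-itm-mlm-itc"
processor = BridgeTowerProcessor.from_pretrained(ckpt)
model = BridgeTowerForImageAndTextRetrieval.from_pretrained(ckpt)

# Placeholder image and caption.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = "two cats lying on a couch"

# Encode the pair and read out the image-text matching (ITM) logit.
inputs = processor(image, text, return_tensors="pt")
outputs = model(**inputs)
itm_logit = outputs.logits[0, 1]  # higher means the caption matches the image
```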
Hardware Comparison
- Nvidia H100: Latest generation with Transformer Engine supporting FP8 mixed precision, 80GB memory.
- Nvidia A100 80GB: Third-gen Tensor Core technology, widely available from cloud providers.
- Habana Gaudi2: Second-gen AI accelerator with 8 HPUs per server, 96GB memory each, easy to use via Optimum Habana.
Benchmark Setup
We fine-tuned a BridgeTower Large checkpoint (866M parameters), pretrained on Conceptual Captions, SBU Captions, MSCOCO Captions, and Visual Genome with masked language modeling (MLM), image-text matching (ITM), and image-text contrastive (ITC) objectives. The downstream dataset was the New Yorker Caption Contest. Hyperparameters were identical across accelerators, with a batch size of 48 per device.
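On Gaudi2, the run uses Optimum Habana's drop-in replacements for the Transformers Trainer classes. The following is a minimal sketch under stated assumptions: the checkpoint name, Gaudi configuration name, and epoch count are illustrative, and train_dataset stands in for a preprocessed New Yorker Caption Contest split (preparation omitted).

```python
from optimum.habana import GaudiTrainer, GaudiTrainingArguments
from transformers import BridgeTowerForImageAndTextRetrieval

# Illustrative checkpoint standing in for the 866M-parameter BridgeTower Large model.
model = BridgeTowerForImageAndTextRetrieval.from_pretrained(
    "BridgeTower/bridgetower-large-itm-mlm-itc"
)

training_args = GaudiTrainingArguments(
    output_dir="bridgetower-newyorker",  # placeholder output path
    per_device_train_batch_size=48,      # batch size used in the benchmark
    num_train_epochs=5,                  # illustrative; not specified in the text
    use_habana=True,                     # run on HPUs
    use_lazy_mode=True,                  # Gaudi lazy-mode graph execution
    gaudi_config_name="Habana/clip",     # assumed Gaudi config; pick one matching your model
)

trainer = GaudiTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # assumed: a preprocessed New Yorker Caption Contest split
)
trainer.train()
```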
Data loading is often a bottleneck in image-heavy workloads due to CPU-bound decoding and augmentations. The experiment varied the number of dedicated data-loading subprocesses (dataloader_num_workers):
| Device | workers=0 | workers=1 | workers=2 |
|---|---|---|---|
| Gaudi2 | 601.5 samples/s | 747.4 samples/s | 768.7 samples/s |
| H100 | 336.5 samples/s | 580.1 samples/s | 602.1 samples/s |
| A100 | 227.5 samples/s | 339.7 samples/s | 345.4 samples/s |
With 2 workers, Gaudi2 is 1.28x faster than H100 and 2.23x faster than A100. Even without dedicated workers, Gaudi2 outpaces H100 by 1.79x and A100 by 2.64x. The headline 2.5x and 1.4x figures additionally rely on Gaudi2's hardware-accelerated data loading, which offloads image decoding to the device and pushes throughput beyond the workers=2 numbers above.
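Under the hood, dataloader_num_workers is passed straight to PyTorch's DataLoader as num_workers. The sketch below illustrates the mechanism with a synthetic torchvision dataset; the image size and transform are assumptions for illustration only.

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# A CPU-bound decode-and-augment pipeline typical of vision fine-tuning.
transform = transforms.Compose([
    transforms.Resize((294, 294)),
    transforms.ToTensor(),
])
dataset = datasets.FakeData(size=1_000, transform=transform)

# num_workers=0 prepares batches in the training process, serializing data
# prep with device compute; num_workers=2 spawns two subprocesses that
# prefetch batches so the accelerator is not starved.
loader = DataLoader(dataset, batch_size=48, num_workers=2, pin_memory=True)

for images, labels in loader:
    pass  # forward/backward would run here while workers prefetch ahead
```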
Key Takeaway
Simply increasing dataloader_num_workers, a one-line change, lets users dramatically accelerate fine-tuning on Gaudi2, H100, or A100. Optimum Habana makes it possible to port Transformers scripts to Gaudi with minimal code changes, unlocking these performance gains effortlessly.
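For reference, here is what the one-line change looks like in a standard Transformers script; output_dir is a placeholder, and the same argument is accepted by Optimum Habana's GaudiTrainingArguments.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",          # placeholder
    dataloader_num_workers=2,  # the one-line change: two dedicated data-loading subprocesses
)
```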