Transformer models in language, vision, and speech continue to grow in size, demanding ever more resources for training. Hugging Face and Microsoft's ONNX Runtime teams have collaborated to optimize fine-tuning for large models. The integration of ONNX Runtime Training into Hugging Face's Optimum library delivers training time improvements of 35% or more, with individual models seeing speedups ranging from 39% to 130%.
Performance Results
When using ONNX Runtime combined with DeepSpeed ZeRO Stage 1, popular Hugging Face models achieve significant speedups. All tests ran on a single NVIDIA A100 node with 8 GPUs; the baseline PyTorch runs used the AdamW optimizer, while the ONNX Runtime runs used the Fused Adam optimizer. Key software versions: PyTorch 1.14.0, ONNX Runtime 1.14.0, DeepSpeed 0.6.6, Hugging Face Transformers 4.24.0, Optimum 1.4.1, CUDA 11.6.2.
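For reference, a DeepSpeed ZeRO Stage 1 configuration of the kind used in such runs can be kept quite small. The sketch below is illustrative only: the file name and the "auto" values are assumptions rather than the benchmark's actual settings; it simply writes out a config that a Hugging Face trainer can consume.

```python
# Illustrative DeepSpeed ZeRO Stage 1 config (assumed values, not the exact
# benchmark settings), written to a JSON file for the trainer to consume.
import json

ds_config = {
    "zero_optimization": {"stage": 1},      # ZeRO Stage 1: shard optimizer states
    "fp16": {"enabled": True},              # mixed precision training
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

with open("zero_stage_1.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```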
What is Optimum?
Hugging Face's Optimum library extends the Transformers ecosystem to maximize hardware efficiency. It integrates accelerators like ONNX Runtime and specialized hardware (e.g., Intel Habana Gaudi) to speed up both training and inference. Optimum maintains the ease of use of Transformers, allowing developers to adapt their workflows for lower latency and reduced computational cost.
What is ONNX Runtime Training?
ONNX Runtime speeds up large model training by up to 40% standalone, and by up to 130% when combined with DeepSpeed. It optimizes memory and compute through efficient memory planning, kernel optimizations, multi-tensor Adam updates, an FP16 optimizer, mixed precision training, and graph fusions. ONNX Runtime supports NVIDIA and AMD GPUs and offers custom operator extensibility.
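As a minimal sketch of the standalone path, ONNX Runtime Training provides an ORTModule wrapper (from the torch-ort package) that redirects a PyTorch model's forward and backward passes through ONNX Runtime. The toy model and random data below are placeholders, not the benchmarked workloads.

```python
# Minimal sketch: wrapping a PyTorch model with ORTModule so training runs
# through ONNX Runtime. The toy model and random data are placeholders.
import torch
from torch_ort import ORTModule

model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)
model = ORTModule(model)  # forward and backward now execute via ONNX Runtime

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

inputs = torch.randn(32, 128)
labels = torch.randint(0, 10, (32,))

optimizer.zero_grad()
loss = loss_fn(model(inputs), labels)
loss.backward()
optimizer.step()
```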
Using ONNX Runtime in Optimum
The ORTTrainer API extends the standard Transformers Trainer to use ONNX Runtime as the backend. It provides a complete training and evaluation loop with support for hyperparameter search, mixed precision, and multi-GPU distributed training. Developers can combine ONNX Runtime with DeepSpeed ZeRO-1 for further memory savings. After training, models can be saved in PyTorch or exported to ONNX format for simplified deployment.
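The sketch below shows how a standard Trainer setup might be switched over to ORTTrainer. The checkpoint, dataset, and hyperparameter values are illustrative assumptions rather than the settings used in the benchmarks above; the optim and commented deepspeed arguments show how the fused optimizer and ZeRO Stage 1 can be enabled.

```python
# Sketch: swapping the Transformers Trainer for Optimum's ORTTrainer.
# Checkpoint, dataset, and hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from optimum.onnxruntime import ORTTrainer, ORTTrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

dataset = load_dataset("glue", "sst2")
dataset = dataset.map(
    lambda batch: tokenizer(batch["sentence"], padding="max_length", truncation=True),
    batched=True,
)

training_args = ORTTrainingArguments(
    output_dir="./results",
    optim="adamw_ort_fused",        # ONNX Runtime's fused Adam optimizer
    fp16=True,                      # mixed precision training
    per_device_train_batch_size=16,
    num_train_epochs=3,
    # deepspeed="zero_stage_1.json",  # optionally combine with DeepSpeed ZeRO-1
)

trainer = ORTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
)

trainer.train()
trainer.save_model()  # saves in PyTorch format; ONNX export is also possible
```

Because ORTTrainer mirrors the Trainer API, the switch is typically limited to changing the imports and the arguments class.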
Getting Started
To leverage these optimizations, install the required packages and configure the trainer as shown in the Optimum documentation. The integration enables faster, more efficient training without sacrificing ease of use.
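As a rough starting point, installation along these lines is typical; the extra name and the torch-ort configuration step are assumptions to verify against the current Optimum documentation, since packaging details change between releases.

```
pip install optimum[onnxruntime-training]
python -m torch_ort.configure
```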