Speed Up Transformer Inference with Optimum and ONNX Runtime

AI
April 26, 2026 · 5:36 PM

The Hugging Face team has released Optimum 1.2, bringing inference support to its optimization library. This update allows developers to accelerate Transformer models using ONNX Runtime while maintaining compatibility with the familiar Transformers pipeline API.

Optimum is an open-source extension of Hugging Face Transformers that provides a unified interface for performance optimization tools. With the new release, users can replace standard AutoModelForXxx classes with ORTModelForXxx equivalents, enabling optimized inference on ONNX Runtime.

For example, a question-answering pipeline can be switched from PyTorch to ONNX with minimal code changes:

from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForQuestionAnswering

# Load an ONNX checkpoint into the ONNX Runtime counterpart of AutoModelForQuestionAnswering
model = ORTModelForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2")
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")

# The ORT model plugs into the standard Transformers pipeline unchanged
optimum_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
result = optimum_qa(question="What's my name?", context="My name is Philipp and I live in Nuremberg.")
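
The example above loads a checkpoint that already contains an exported ONNX graph. For a plain PyTorch checkpoint, Optimum can perform the export at load time; a minimal sketch using the release's from_transformers flag:

# Export the PyTorch weights to ONNX while loading (one-off conversion)
model = ORTModelForQuestionAnswering.from_pretrained(
    "deepset/roberta-base-squad2",
    from_transformers=True,
)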

In addition to inference, Optimum 1.2 includes tools for quantization (ORTQuantizer) and graph optimization (ORTOptimizer). These can be used to further reduce model size and improve speed before deployment. All optimized models can be pushed to the Hugging Face Hub for community sharing.
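
As an illustration, dynamic quantization with ORTQuantizer looks roughly like the sketch below. The avx512_vnni preset and the local file names are assumptions for this example; the preset should match the target CPU, and the exact method names may differ between Optimum versions.

from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Build a quantizer for the question-answering checkpoint
quantizer = ORTQuantizer.from_pretrained("deepset/roberta-base-squad2", feature="question-answering")

# Dynamic (no calibration data needed) INT8 quantization tuned for AVX512-VNNI CPUs
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=True)

# Write out both the plain and the quantized ONNX graphs
quantizer.export(
    onnx_model_path="model.onnx",
    onnx_quantized_model_output_path="model-quantized.onnx",
    quantization_config=qconfig,
)

The quantized graph can then be loaded back with ORTModelForQuestionAnswering and dropped into the same pipeline as before.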

The release addresses the growing need for efficient deployment of Transformer models in production. As companies move from research to large-scale workloads, Optimum aims to reduce latency and resource consumption without sacrificing accuracy.