
Streamline AI Model Deployment: Convert Transformers to ONNX with Hugging Face Optimum

April 26, 2026 · 5:31 PM

Every day, hundreds of Transformer models are uploaded to the Hugging Face Hub, built with frameworks like PyTorch and TensorFlow. For production deployment, exporting these models to a serialized format like ONNX enables optimized execution on specialized runtimes and hardware.

This guide explains:

  1. What is ONNX?
  2. What is Hugging Face Optimum?
  3. Supported Transformer architectures.
  4. How to convert a BERT model to ONNX using three methods.

What is ONNX?

The Open Neural Network Exchange (ONNX) is an open standard that defines a common set of operators and a common file format for representing deep learning models across frameworks. When a model is exported to ONNX, it becomes an intermediate representation of a computational graph.

ONNX itself is not a runtime; it is a format consumed by engines such as ONNX Runtime, which in turn target a wide range of hardware accelerators.
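Because the exported file is just a serialized graph, you can open and inspect it with the onnx Python package. A minimal sketch, assuming a model file such as the torch-model.onnx produced later in this guide:

import onnx

# Load the serialized model and verify it conforms to the ONNX spec
onnx_model = onnx.load("torch-model.onnx")
onnx.checker.check_model(onnx_model)

# The payload is a protobuf describing nodes, inputs, and outputs
print(onnx.helper.printable_graph(onnx_model.graph))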

What is Hugging Face Optimum?

Hugging Face Optimum is an open-source library extending Hugging Face Transformers. It provides a unified API for performance optimization tools, enabling conversion, quantization, graph optimization, and accelerated training and inference.
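As a taste of that unified API, the sketch below applies dynamic int8 quantization to an exported model. It assumes a recent Optimum release with ONNX Runtime support installed; the avx2 preset and output directory are illustrative choices, and older Optimum versions used from_transformers=True instead of export=True:

from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Export a PyTorch checkpoint to ONNX on the fly
model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english", export=True
)

# Dynamic (weight-only) quantization tuned for AVX2 CPUs
qconfig = AutoQuantizationConfig.avx2(is_static=False, per_channel=False)
quantizer = ORTQuantizer.from_pretrained(model)
quantizer.quantize(save_dir="onnx-quantized/", quantization_config=qconfig)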

Supported Transformer Architectures

Many architectures are supported, including ALBERT, BART, BERT, DistilBERT, ELECTRA, GPT Neo, GPT-J, GPT-2, RoBERTa, T5, ViT, XLM, and more. See the full list in the ONNX section of the Transformers documentation.

How to Convert a BERT Model to ONNX

Below are three methods to export distilbert-base-uncased-finetuned-sst-2-english, a DistilBERT model fine-tuned for sentiment analysis.

Using torch.onnx (Low-Level)

pip install transformers torch

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Dummy input used to trace the model's forward pass during export
dummy_input = tokenizer("This is a sample", return_tensors="pt")

torch.onnx.export(
    model,
    tuple(dummy_input.values()),  # (input_ids, attention_mask)
    f="torch-model.onnx",
    input_names=['input_ids', 'attention_mask'],
    output_names=['logits'],
    # Mark batch and sequence dimensions as dynamic so the exported graph
    # accepts inputs of any size; for sequence classification the logits
    # only vary along the batch dimension
    dynamic_axes={'input_ids': {0: 'batch_size', 1: 'sequence'},
                  'attention_mask': {0: 'batch_size', 1: 'sequence'},
                  'logits': {0: 'batch_size'}},
    do_constant_folding=True,
    opset_version=13,
)

Using transformers.onnx (Mid-Level)

pip install transformers[onnx] torch

The transformers.onnx package streamlines the export: every supported architecture registers an ONNX configuration object that declares the model's inputs, outputs, and dynamic axes, so none of them have to be specified by hand.
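A sketch of the programmatic API, based on the transformers.onnx module as it shipped in Transformers before being superseded by Optimum (FeaturesManager looks up the configuration object registered for an architecture and task):

from pathlib import Path

import transformers
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers.onnx import FeaturesManager

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Look up the ONNX config registered for this architecture and task
model_kind, onnx_config_cls = FeaturesManager.check_supported_model_or_raise(
    model, feature="sequence-classification"
)
onnx_config = onnx_config_cls(model.config)

# The config supplies input/output names and dynamic axes automatically
onnx_inputs, onnx_outputs = transformers.onnx.export(
    preprocessor=tokenizer,
    model=model,
    config=onnx_config,
    opset=13,
    output=Path("trfs-model.onnx"),
)

The same export is also available as a one-liner from the command line: python -m transformers.onnx --model=distilbert-base-uncased-finetuned-sst-2-english onnx/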

Using Optimum (High-Level)

Optimum provides the highest-level API: a single from_pretrained call downloads the checkpoint, selects the right ONNX configuration, and runs the export.
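A minimal sketch, assuming the optimum[onnxruntime] extra is installed (recent releases use export=True; older ones used from_transformers=True):

pip install optimum[onnxruntime]

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"

# Download the PyTorch weights and convert them to ONNX in one call
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Save the ONNX model and tokenizer for deployment
model.save_pretrained("onnx/")
tokenizer.save_pretrained("onnx/")

The exported model is a drop-in replacement in the transformers pipeline API, and the optimum-cli export onnx command offers the same conversion from the shell.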

Next Steps

After converting to ONNX, you can optimize and run your models using ONNX Runtime or other compatible runtimes for maximum efficiency.
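To close the loop, here is a minimal inference sketch with the onnxruntime package, assuming the torch-model.onnx file exported earlier:

import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
session = ort.InferenceSession("torch-model.onnx", providers=["CPUExecutionProvider"])

# Tokenize to NumPy arrays; the keys match the input_names used at export
inputs = tokenizer("I love this movie!", return_tensors="np")
logits = session.run(["logits"], dict(inputs))[0]
print(logits)  # shape (1, 2): scores for NEGATIVE and POSITIVE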