The adoption of BERT and other Transformer models is growing rapidly, with applications extending beyond NLP into computer vision, speech, and time-series analysis. As companies move from experimentation to production, accelerating these models becomes critical. AWS Inferentia, a chip purpose-built for inference, is advertised to deliver up to 80% lower cost per inference and up to 2.3x higher throughput than comparable GPU-based EC2 instances, thanks to its dedicated NeuronCores.
This tutorial shows how to speed up BERT inference for text classification using Hugging Face Transformers, Amazon SageMaker, and AWS Inferentia. You'll learn to convert a Transformer model to AWS Neuron format, create a custom inference script, upload the model to S3, deploy a real-time endpoint, and evaluate performance.
Step 1: Convert Your Model to AWS Neuron
The AWS Neuron SDK compiles PyTorch and TensorFlow models for Inferentia. First, install the SDK and required packages:
pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
pip install "torch-neuron==1.9.1.*" "neuron-cc[tensorflow]" "sagemaker>=2.79.0" transformers==4.12.3 --upgrade
Then, load a pre-trained model and tokenizer, create a dummy input for static shape (e.g., batch size 1, sequence length 128), and trace the model with torch_neuron:
import torch
import torch.neuron
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, torchscript=True)
# Create dummy input (batch_size=1, seq_len=128)
dummy_input = "This is a sample input for tracing."
inputs = tokenizer(dummy_input, return_tensors="pt", padding="max_length", truncation=True, max_length=128)
# Trace and compile for Inferentia
neuron_model = torch.neuron.trace(model, example_inputs=(inputs["input_ids"], inputs["attention_mask"]))
Note: the Neuron compiler requires static input shapes. The compiled model only accepts inputs with the exact shape used during tracing, so every request must be padded or truncated to the same sequence length (here, 128).
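As a quick sanity check, you can compare the compiled model's output against the original PyTorch model on the same padded input. This is a minimal sketch reusing model, inputs, and neuron_model from above; note that executing the compiled model requires the Neuron runtime, so run it on an inf1 instance:

# Compare the compiled model against the original on the traced input
# (executing the Neuron model requires an inf1 instance)
with torch.no_grad():
    cpu_logits = model(inputs["input_ids"], inputs["attention_mask"])[0]
    neuron_logits = neuron_model(inputs["input_ids"], inputs["attention_mask"])[0]

# Small numerical differences are expected after compilation
print(torch.max(torch.abs(cpu_logits - neuron_logits)))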
Step 2: Create the Inference Script
Write a custom inference.py that overrides the Hugging Face Inference Toolkit's default handlers: model_fn loads the compiled Neuron model and tokenizer, and predict_fn handles tokenization, inference, and postprocessing for text classification.
import os
import torch
import torch.neuron
from transformers import AutoTokenizer

# Name of the traced model file inside the model archive
MODEL_FILENAME = "model.pt"

def model_fn(model_dir):
    """Load the compiled Neuron model and tokenizer from model_dir."""
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = torch.jit.load(os.path.join(model_dir, MODEL_FILENAME))
    return model, tokenizer

def predict_fn(data, model_and_tokenizer):
    """Tokenize the request, run the Neuron model, return class probabilities."""
    model, tokenizer = model_and_tokenizer
    texts = data.pop("inputs", data)
    # Pad/truncate to the static shape used at compile time (seq_len=128)
    inputs = tokenizer(texts, return_tensors="pt", padding="max_length",
                       truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(inputs["input_ids"], inputs["attention_mask"])
    probabilities = torch.softmax(outputs[0], dim=-1)
    return probabilities.tolist()
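Before uploading anything, you can smoke-test the two handlers locally. A quick sketch, assuming the traced model and tokenizer have been saved into a local model/ directory as done in the next step (like the check above, executing the compiled model requires an inf1 instance):

# Local smoke test of the handlers (run next to inference.py;
# assumes model/ contains model.pt and the tokenizer files)
from inference import model_fn, predict_fn

artifacts = model_fn("model")
print(predict_fn({"inputs": "This movie was fantastic!"}, artifacts))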
Step 3: Upload to S3
Arrange the traced model, tokenizer, and inference script in the layout the Hugging Face Inference Toolkit expects (model files at the archive root, the script under code/), package everything as model.tar.gz, and upload the archive to an Amazon S3 bucket.
import os
import shutil
import tarfile
import boto3

# Arrange artifacts in the layout the Hugging Face Inference Toolkit expects:
# model files at the root, the inference script under code/
os.makedirs("model/code", exist_ok=True)
neuron_model.save("model/model.pt")
tokenizer.save_pretrained("model")
shutil.copy("inference.py", "model/code/inference.py")

# SageMaker expects a single model.tar.gz archive
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("model", arcname=".")

# Upload to S3
s3 = boto3.client("s3")
bucket = "my-huggingface-models"
s3.upload_file("model.tar.gz", bucket, "distilbert-neuron/model.tar.gz")
Step 4: Deploy on SageMaker
Create a SageMaker model with the Hugging Face Inference Toolkit, pointing model_data at the model.tar.gz archive, then deploy it as a real-time endpoint on an Inferentia instance (ml.inf1.xlarge). The instance type is passed to deploy(), not to the model constructor.
from sagemaker.huggingface import HuggingFaceModel

role = "arn:aws:iam::...:role/SageMakerRole"

huggingface_model = HuggingFaceModel(
    model_data=f"s3://{bucket}/distilbert-neuron/model.tar.gz",
    role=role,
    transformers_version="4.12",
    pytorch_version="1.9",
    py_version="py37",
)
# The model is already compiled with neuron-cc; this tells SageMaker to
# resolve the Neuron inference container instead of the default image
huggingface_model._is_compiled_model = True

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf1.xlarge",
)
Step 5: Evaluate Performance
Test the endpoint with sample inputs and measure latency/throughput.
text = "This movie was fantastic!"
result = predictor.predict({"inputs": text})
print(result)  # e.g., [[0.02, 0.98]] -> probabilities for [NEGATIVE, POSITIVE]
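To get a rough picture of latency, time repeated requests from the client side. A minimal sketch; the numbers include network overhead, so use a proper load-testing tool for rigorous benchmarks:

import time

# Measure client-side latency over repeated requests
payload = {"inputs": "This movie was fantastic!"}
latencies = []
for _ in range(100):
    start = time.perf_counter()
    predictor.predict(payload)
    latencies.append(time.perf_counter() - start)

latencies.sort()
print(f"avg: {sum(latencies) / len(latencies) * 1000:.1f} ms")
print(f"p95: {latencies[int(0.95 * len(latencies))] * 1000:.1f} ms")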
Clean up by deleting the endpoint and model when done:
predictor.delete_endpoint()
predictor.delete_model()
Conclusion
AWS Inferentia offers a cost-effective way to accelerate BERT inference. By combining Hugging Face Transformers with the Neuron SDK, you can easily convert and deploy models for high-performance, low-latency applications on SageMaker. This tutorial gives you a practical starting point for production-grade Transformer inference.