The Hugging Face team has rolled out significant performance improvements for TensorFlow models in the Transformers library, focusing on faster computation and easier deployment with TensorFlow Serving. The updates primarily target BERT, RoBERTa, ELECTRA, and MPNet models.
Computational Performance
Benchmarks comparing BERT performance with TensorFlow Serving in version 4.2.0 against the official Google implementation show speed improvements of up to 10%. For instance, with a batch size of 8, the Hugging Face implementation processes a batch in 21.5 ms versus Google's 24 ms. These gains hold across batch sizes, and v4.2.0 is also roughly twice as fast as the previous 4.1.1 release.
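As a rough sketch of how such latencies can be measured (the exact benchmark configuration is not given here; the model name, batch size, and sequence length below are assumptions), a simple wall-clock timing loop might look like:

import time
import tensorflow as tf
from transformers import TFBertModel

# Hypothetical setup: bert-base-cased, batch size 8, sequence length 128.
model = TFBertModel.from_pretrained("bert-base-cased")
batch = {
    "input_ids": tf.ones((8, 128), dtype=tf.int32),
    "attention_mask": tf.ones((8, 128), dtype=tf.int32),
}
model(batch)  # warm-up call so one-time setup cost is excluded from timing

runs = 100
start = time.perf_counter()
for _ in range(runs):
    model(batch)
print(f"mean latency: {(time.perf_counter() - start) / runs * 1000:.1f} ms")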
TensorFlow Serving Integration
TensorFlow Serving, part of the TensorFlow Extended (TFX) ecosystem, makes it easy to deploy models behind HTTP or gRPC APIs. Models must first be exported to the SavedModel format, which bundles the model graph and weights. Starting with v4.2.0, Hugging Face Transformers simplifies SavedModel creation with three key features, illustrated in the sketch after this list:
- Flexible sequence length: The sequence length can be adjusted between inference runs.
- Full input availability: All model inputs are accessible for inference.
- Grouped outputs: Hidden states and attention outputs are bundled into single tensors.
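A minimal sketch of what these features mean at inference time, assuming a standard model has already been exported with model.save_pretrained("my_model", saved_model=True) (covered in the next section), so the default input_ids-based signature is in place:

import tensorflow as tf

# Load the exported SavedModel and grab its default serving signature.
loaded = tf.saved_model.load("my_model/saved_model/1")
serving_fn = loaded.signatures["serving_default"]

# Flexible sequence length: the same signature accepts any sequence length,
# and all model inputs (attention_mask, token_type_ids, ...) are available.
for seq_len in (8, 32):
    outputs = serving_fn(
        input_ids=tf.ones((1, seq_len), dtype=tf.int32),
        attention_mask=tf.ones((1, seq_len), dtype=tf.int32),
        token_type_ids=tf.zeros((1, seq_len), dtype=tf.int32),
    )
    print({name: tensor.shape for name, tensor in outputs.items()})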
Creating a SavedModel
To create a SavedModel, call save_pretrained() with saved_model=True. To accept custom inputs, such as inputs_embeds instead of input_ids, subclass the model and override its serving method with a new input_signature using the @tf.function decorator. The example below defines a custom input signature:
from transformers import TFBertForSequenceClassification
import tensorflow as tf

class MyOwnModel(TFBertForSequenceClassification):
    # Replace the default serving signature with one that takes
    # inputs_embeds of shape (batch, sequence, hidden size) instead
    # of input_ids.
    @tf.function(input_signature=[{
        "inputs_embeds": tf.TensorSpec((None, None, 768), tf.float32, name="inputs_embeds"),
        "attention_mask": tf.TensorSpec((None, None), tf.int32, name="attention_mask"),
        "token_type_ids": tf.TensorSpec((None, None), tf.int32, name="token_type_ids"),
    }])
    def serving(self, inputs):
        output = self.call(inputs)
        return self.serving_output(output)

# Instantiate the subclass with pretrained weights and export it.
model = MyOwnModel.from_pretrained("bert-base-cased")
model.save_pretrained("my_model", saved_model=True)
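With saved_model=True, save_pretrained() writes the SavedModel under my_model/saved_model/1, and the overridden serving method becomes the exported serving_default signature, so this model now expects inputs_embeds rather than input_ids at inference time.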
Deployment
The recommended way to install TensorFlow Serving is via Docker. Once the SavedModel is exported, it can be served with a single docker run command pointing at the model directory; the running server then accepts requests over the REST API or gRPC.
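As a hedged illustration of the REST side, a prediction request could look like the snippet below, assuming TensorFlow Serving is running locally on its default REST port (8501) and serving a model with the standard input_ids signature under the hypothetical name "bert":

import requests

# Illustrative token ids; in practice these come from the matching tokenizer.
payload = {
    "instances": [{
        "input_ids": [101, 7592, 1010, 2088, 999, 102],
        "attention_mask": [1, 1, 1, 1, 1, 1],
        "token_type_ids": [0, 0, 0, 0, 0, 0],
    }]
}
response = requests.post("http://localhost:8501/v1/models/bert:predict", json=payload)
print(response.json())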
These enhancements make Hugging Face Transformers a more efficient choice for production NLP pipelines, offering both speed and flexibility.