The Hugging Face team and external contributors have recently expanded the Transformers library with a diverse set of TensorFlow vision models, including Vision Transformer (ViT), Masked Autoencoders, RegNet, and ConvNeXt. This article demonstrates how to deploy a Vision Transformer model for image classification locally using TensorFlow Serving (TF Serving), which provides both REST and gRPC endpoints.
Saving the Model
All TensorFlow models in Transformers include a save_pretrained() method that, when called with saved_model=True, exports the model in the SavedModel format required by TF Serving. For example, loading and saving a ViT model:
from transformers import TFViTForImageClassification
temp_model_dir = "vit"
ckpt = "google/vit-base-patch16-224"
model = TFViTForImageClassification.from_pretrained(ckpt)
model.save_pretrained(temp_model_dir, saved_model=True)
Inspecting the serving signature reveals that the model expects a 4-D pixel_values input of shape (batch_size, num_channels, height, width) and outputs logits of shape (-1, 1000).
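You can verify this yourself by loading the exported artifact and printing its default signature. The sketch below assumes save_pretrained() wrote the SavedModel to vit/saved_model/1, which is the Transformers convention:

import tensorflow as tf

# Load the exported SavedModel and inspect its default serving signature.
loaded = tf.saved_model.load("vit/saved_model/1")
serving_fn = loaded.signatures["serving_default"]
print(serving_fn.structured_input_signature)  # expects pixel_values
print(serving_fn.structured_outputs)          # logits of shape (-1, 1000)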
Model Surgery
To reduce cognitive load and training-serving skew, it's beneficial to embed preprocessing and postprocessing steps directly into the model graph.
Preprocessing
Essential preprocessing for ViT includes scaling pixel values to [0, 1], normalizing to [-1, 1] with the image processor's mean and standard deviation, resizing to 224x224, and transposing to channel-first format. The following functions handle these operations:
import tensorflow as tf
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained(ckpt)
SIZE = 224  # input resolution expected by google/vit-base-patch16-224
CONCRETE_INPUT = "pixel_values"  # input name in the model's serving signature

def normalize_img(img, mean=processor.image_mean, std=processor.image_std):
    # Scale to [0, 1], then normalize with the processor's mean and std.
    img = img / 255.0
    mean = tf.constant(mean)
    std = tf.constant(std)
    return (img - mean) / std

def preprocess(string_input):
    # The incoming string must be web-safe (URL-safe) base64.
    decoded_input = tf.io.decode_base64(string_input)
    decoded = tf.io.decode_jpeg(decoded_input, channels=3)
    resized = tf.image.resize(decoded, size=(SIZE, SIZE))
    normalized = normalize_img(resized)
    # Channels-last -> channels-first, as the model expects.
    normalized = tf.transpose(normalized, (2, 0, 1))
    return normalized

@tf.function(input_signature=[tf.TensorSpec([None], tf.string)])
def preprocess_fn(string_input):
    decoded_images = tf.map_fn(
        preprocess, string_input, fn_output_signature=tf.float32
    )
    return {CONCRETE_INPUT: decoded_images}
Accepting base64-encoded, compressed images as input keeps request payloads far smaller than sending raw pixel tensors over REST or gRPC. Note that tf.io.decode_base64() expects web-safe (URL-safe) base64, so clients must encode accordingly.
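As a quick sanity check, you can run preprocess_fn eagerly and confirm the output shape. This is a minimal sketch assuming a local JPEG named cat.jpg; any image will do:

import base64

# tf.io.decode_base64 expects web-safe base64, hence urlsafe_b64encode.
with open("cat.jpg", "rb") as f:
    b64str = base64.urlsafe_b64encode(f.read()).decode("utf-8")

batch = preprocess_fn(tf.constant([b64str]))
print(batch[CONCRETE_INPUT].shape)  # (1, 3, 224, 224)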
Postprocessing and Model Export
Postprocessing maps logits to class labels (e.g., ImageNet-1k). To embed both preprocessing and postprocessing, you can create a new model that wraps the original:
class TFServingViT(tf.keras.Model):
    def __init__(self, model):
        super().__init__()
        self.model = model
        # ImageNet-1k class names from the model config, used to map
        # predicted indices to human-readable labels.
        self.class_names = tf.constant(
            [model.config.id2label[i] for i in range(len(model.config.id2label))]
        )

    @tf.function(input_signature=[tf.TensorSpec([None], tf.string)])
    def call(self, string_input):
        # Preprocess the base64 string input into pixel values.
        pixel_values = preprocess_fn(string_input)[CONCRETE_INPUT]
        # Run the model.
        logits = self.model(pixel_values).logits
        # Postprocess: softmax over the logits, then take the top-5
        # predictions and look up their class names.
        probabilities = tf.nn.softmax(logits, axis=-1)
        values, indices = tf.math.top_k(probabilities, k=5)
        labels = tf.gather(self.class_names, indices)
        return {"labels": labels, "scores": values}
Then, export this wrapper model:
serving_model = TFServingViT(model)
tf.saved_model.save(serving_model, "tf_serving_vit/1")
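Before starting the server, it's worth a local smoke test. This reuses the b64str from the sanity check above:

outputs = serving_model(tf.constant([b64str]))
print(outputs["labels"])  # top-5 ImageNet-1k class names
print(outputs["scores"])  # corresponding softmax probabilities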
Deployment with TensorFlow Serving
Pull the official TF Serving Docker image and start a container pointing to the exported model directory (port 8500 serves gRPC, 8501 serves REST):
docker run -p 8500:8500 -p 8501:8501 --name tf_serving_vit \
  --mount type=bind,source=$(pwd)/tf_serving_vit,target=/models/vit \
  -e MODEL_NAME=vit -t tensorflow/serving
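Once the container is up, you can confirm the model is loaded by hitting TF Serving's model-status endpoint, shown here with the requests package:

import requests

# GET /v1/models/<name> returns the model's version and availability state.
status = requests.get("http://localhost:8501/v1/models/vit")
print(status.json())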
Querying the REST Endpoint
Send a POST request whose instance is the web-safe base64 string itself. Because the graph calls tf.io.decode_base64() on the input, don't use TF Serving's {"b64": ...} wrapper, which would decode the payload into raw bytes before it reaches the model:
curl -X POST http://localhost:8501/v1/models/vit:predict \
  -d '{"instances": ["<base64_image>"]}'
Querying the gRPC Endpoint
The gRPC endpoint listens on port 8500 and accepts the same web-safe base64 string input; you build a PredictRequest and call the Predict RPC, as sketched below.
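This is a minimal sketch using the tensorflow-serving-api package (pip install tensorflow-serving-api). The input key string_input matches the argument name of the wrapper's serving function here, but it's worth confirming against your exported signature:

import base64
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

with open("cat.jpg", "rb") as f:
    b64str = base64.urlsafe_b64encode(f.read()).decode("utf-8")

channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "vit"
request.model_spec.signature_name = "serving_default"
request.inputs["string_input"].CopyFrom(tf.make_tensor_proto([b64str]))

response = stub.Predict(request, timeout=10.0)
print(tf.make_ndarray(response.outputs["labels"]))
print(tf.make_ndarray(response.outputs["scores"]))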
Wrapping Up
This approach allows you to deploy Hugging Face TensorFlow vision models with all necessary processing baked in, making them easy to consume via standard endpoints. For the complete code, refer to the accompanying Colab notebook.