In a previous post, we demonstrated how to deploy a Vision Transformer (ViT) model from Hugging Face Transformers locally using TensorFlow Serving. Now, we take it to the next level by scaling that deployment with Docker and Kubernetes, enabling it to handle real-world traffic.
Why Docker and Kubernetes?
The standard workflow for scaling ML model deployments is to containerize the application with Docker and orchestrate it with Kubernetes. This approach is battle-tested in industry, offering granular control over resources, autoscaling, and security. Managed services like SageMaker or Vertex AI simplify ML deployment, but Docker and Kubernetes give you full control of the serving stack and are widely adopted.
This guide uses Google Kubernetes Engine (GKE) to provision a cluster, but the concepts apply to Minikube or other Kubernetes solutions.
Containerization with Docker
We start with a TensorFlow Serving model that accepts raw image bytes and handles preprocessing and postprocessing. The model must be stored in the SavedModel format and organized in the directory structure <MODEL_NAME>/<VERSION>/<SavedModel>.
Preparing the Docker Image
First, extract the SavedModel tarball into the correct directory:
MODEL_TAR=model.tar.gz
MODEL_NAME=hf-vit
MODEL_VERSION=1
MODEL_PATH=models/$MODEL_NAME/$MODEL_VERSION
mkdir -p $MODEL_PATH
tar -xvf $MODEL_TAR --directory $MODEL_PATH
The models directory will look like:
/models
/models/hf-vit
/models/hf-vit/1
/models/hf-vit/1/keras_metadata.pb
/models/hf-vit/1/variables
/models/hf-vit/1/variables/variables.index
/models/hf-vit/1/variables/variables.data-00000-of-00001
/models/hf-vit/1/assets
/models/hf-vit/1/saved_model.pb
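If you are unsure of the model's input and output tensor names (useful later when constructing prediction requests), you can inspect the serving signature with the saved_model_cli tool, assuming TensorFlow is installed in your local environment:
saved_model_cli show --dir $MODEL_PATH --tag_set serve --signature_def serving_default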
Next, build a custom TensorFlow Serving image by copying the model into a running base container:
docker run -d --name serving_base tensorflow/serving
docker cp models/hf-vit serving_base:/models/hf-vit
After copying, commit the container to a new image. The --change flag overrides the MODEL_NAME environment variable so that TensorFlow Serving loads hf-vit instead of the base image's default model:
docker commit --change "ENV MODEL_NAME hf-vit" serving_base hf-vit-serving
docker rm -f serving_base
This image, hf-vit-serving, contains the model and is ready for deployment.
Running Locally
To test the image locally:
docker run -p 8501:8501 -t hf-vit-serving
The model will be available at http://localhost:8501/v1/models/hf-vit:predict.
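Before sending a full prediction request, you can check that the model loaded correctly via TensorFlow Serving's model status endpoint:
curl http://localhost:8501/v1/models/hf-vit
A healthy response lists version 1 with state AVAILABLE.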
Pushing to a Registry
For Kubernetes, push the image to a container registry like Docker Hub or Google Container Registry (GCR). For GCR:
docker tag hf-vit-serving gcr.io/[PROJECT_ID]/hf-vit-serving:latest
docker push gcr.io/[PROJECT_ID]/hf-vit-serving:latest
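If this is your first push to GCR, authorize Docker against Google's registries first:
gcloud auth configure-docker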
Deploying on Kubernetes
Provisioning a Cluster on GKE
Create a Kubernetes cluster using gcloud:
gcloud container clusters create tf-serving-cluster --num-nodes=3 --zone=us-central1-a
Get credentials:
gcloud container clusters get-credentials tf-serving-cluster --zone=us-central1-a
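To confirm kubectl now points at the new cluster, list the nodes; all three should report Ready:
kubectl get nodes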
Writing Kubernetes Manifests
Create a deployment manifest deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vit-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vit-server
  template:
    metadata:
      labels:
        app: vit-server
    spec:
      containers:
        - name: vit-container
          image: gcr.io/[PROJECT_ID]/hf-vit-serving:latest
          ports:
            - containerPort: 8501
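Under sustained traffic it is worth adding resource requests and limits to the container spec above, so the scheduler can place pods sensibly. A minimal sketch; the CPU and memory values below are illustrative placeholders, not tuned numbers:
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
            limits:
              cpu: "2"
              memory: 4Gi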
Then create a service manifest service.yaml to expose the deployment:
apiVersion: v1
kind: Service
metadata:
  name: vit-service
spec:
  type: LoadBalancer
  selector:
    app: vit-server
  ports:
    - port: 8501
      targetPort: 8501
Performing the Deployment
Apply the manifests:
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
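To block until the rollout finishes, you can optionally run:
kubectl rollout status deployment/vit-deployment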
Check the status:
kubectl get pods
kubectl get services
Wait for the external IP to be assigned to the service.
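You can watch the service until the IP appears, or extract it directly with a JSONPath query:
kubectl get service vit-service --watch
kubectl get service vit-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}'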
Testing the Endpoint
Once the service is up, test with a sample image:
curl -X POST http://[EXTERNAL_IP]:8501/v1/models/hf-vit:predict -H "Content-Type: application/json" -d '{"instances": [{"image": {"b64": "<base64_encoded_image>"}}]}'
Replace <base64_encoded_image> with a base64-encoded image string.
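As a concrete sketch, the whole request can be assembled in the shell. This assumes a local file named image.jpg and GNU coreutils base64 (-w 0 disables line wrapping; macOS base64 emits a single line by default):
B64=$(base64 -w 0 image.jpg)
echo "{\"instances\": [{\"image\": {\"b64\": \"$B64\"}}]}" > request.json
curl -X POST http://[EXTERNAL_IP]:8501/v1/models/hf-vit:predict -H "Content-Type: application/json" -d @request.json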
Notes on TF Serving Configurations
TensorFlow Serving exposes a number of configuration options, such as request batching and model version policies. For production, consider enabling batching to improve throughput. Flags appended after the image name are forwarded by the image's entrypoint to the tensorflow_model_server binary:
docker run -p 8501:8501 -t hf-vit-serving --enable_batching=true
In Kubernetes, you can pass arguments via the args field in the container spec.
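For example, the batching flag above translates to the following container spec in deployment.yaml (arguments in args are appended to the image's entrypoint, which forwards them to tensorflow_model_server):
      containers:
        - name: vit-container
          image: gcr.io/[PROJECT_ID]/hf-vit-serving:latest
          args:
            - "--enable_batching=true"
          ports:
            - containerPort: 8501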
Conclusion
By containerizing the ViT model with Docker and deploying it on Kubernetes, you achieve a scalable, production-ready serving infrastructure. This approach is widely used in industry and provides fine-grained control over deployments. The same method can be applied to other TensorFlow Serving models.
Acknowledgment
This work builds on the previous post on deploying ViT with TensorFlow Serving. All code is available in the accompanying repository.