In a previous series of posts, we demonstrated how to deploy a Vision Transformer (ViT) model from the Hugging Face Transformers library locally and on a Kubernetes cluster. Now, we turn to Google Cloud's Vertex AI platform, which offers the same scalability as Kubernetes but with far less code.
What is Vertex AI?
According to Google Cloud, Vertex AI provides tools to support the entire ML workflow, across different model types and levels of ML expertise. For model deployment, it offers:
- Authentication
- Autoscaling based on traffic
- Model versioning
- Traffic splitting between versions
- Rate limiting
- Model monitoring and logging
- Support for online and batch predictions
For TensorFlow models, Vertex AI provides off-the-shelf serving utilities, and it also supports other popular frameworks such as PyTorch and scikit-learn.
The Serving Model
We use the same ViT B/16 model implemented in TensorFlow, serialized with preprocessing and postprocessing operations embedded to reduce training-serving skew. The model accepts base64-encoded image strings, resizes images to 224x224, normalizes to [-1, 1], transposes to channels-first layout, runs inference, and outputs confidence scores and string labels.
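As a rough sketch, the exported serving function might look like the following. The input tensor name pixel_values, the LABELS constant of class-name strings, and the loaded model object are assumptions for illustration; the actual export code lives in the accompanying Colab notebook:

import tensorflow as tf

# Assumed to exist already: `model`, a TF ViT classification model,
# and LABELS, a tf.constant of class-name strings.
CONCRETE_INPUT = "pixel_values"  # assumed name of the model's input tensor

def preprocess(b64_string):
    # Decode the (web-safe) base64 payload into an RGB image tensor.
    image = tf.io.decode_jpeg(tf.io.decode_base64(b64_string), channels=3)
    image = tf.image.resize(image, (224, 224))
    image = image / 127.5 - 1  # scale pixel values to [-1, 1]
    return tf.transpose(image, (2, 0, 1))  # channels-first, as ViT expects

@tf.function(input_signature=[tf.TensorSpec([None], tf.string, name="string_input")])
def serving_fn(string_input):
    # Preprocess the batch, run inference, and turn logits into
    # confidence scores and human-readable labels.
    images = tf.map_fn(preprocess, string_input, fn_output_signature=tf.float32)
    probs = tf.nn.softmax(model({CONCRETE_INPUT: images}).logits, axis=1)
    return {
        "confidence": tf.reduce_max(probs, axis=1),
        "label": tf.gather(LABELS, tf.argmax(probs, axis=1)),
    }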
Deployment Workflow
Start by uploading your trained TensorFlow SavedModel to a Google Cloud Storage (GCS) bucket. The deployment then revolves around two Vertex AI components, described below.
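One way to perform the upload from Python is via tf.io.gfile, which understands gs:// URIs (the bucket and directory names below are illustrative; the gsutil CLI works equally well):

import os
import tensorflow as tf

GCS_BUCKET = "gs://hf-tf-vision"  # illustrative bucket name
LOCAL_MODEL_DIR = "vit"  # local SavedModel directory

# Copy the SavedModel tree into the bucket, file by file.
for dirpath, _, filenames in tf.io.gfile.walk(LOCAL_MODEL_DIR):
    dest_dir = os.path.join(GCS_BUCKET, dirpath)
    tf.io.gfile.makedirs(dest_dir)
    for name in filenames:
        tf.io.gfile.copy(
            os.path.join(dirpath, name),
            os.path.join(dest_dir, name),
            overwrite=True,
        )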
Vertex AI Model Registry
This fully managed registry stores your models and manages multiple versions of them, so you don't have to handle storage or access control yourself. It supports TensorFlow SavedModel, scikit-learn, and XGBoost formats.
Vertex AI Endpoint
An endpoint receives prediction requests and returns responses. It lets you configure the model version to serve, the VM specification (CPU, memory, accelerators), the number of compute nodes, traffic splits between versions, and monitoring.
Performing the Deployment
Using the google-cloud-aiplatform Python SDK, we complete the process in four steps:
- Upload the model to the registry.
- Create an endpoint.
- Deploy the model to the endpoint.
- Make prediction requests.
Here is a snippet that uploads the model to the registry. It assumes PROJECT_ID, REGION, GCS_BUCKET, and LOCAL_MODEL_DIR were defined earlier:

from google.cloud.aiplatform import gapic as aip

# Parent resource and regional API endpoint used by all subsequent calls.
PARENT = f"projects/{PROJECT_ID}/locations/{REGION}"
client_options = {"api_endpoint": f"{REGION}-aiplatform.googleapis.com"}
model_service_client = aip.ModelServiceClient(client_options=client_options)

tf28_gpu_model_dict = {
    "display_name": "ViT Base TF2.8 GPU model",
    "artifact_uri": f"{GCS_BUCKET}/{LOCAL_MODEL_DIR}",
    "container_spec": {
        "image_uri": "us-docker.pkg.dev/vertex-ai/prediction/tf2-gpu.2-8:latest",
    },
}
tf28_gpu_model = (
    model_service_client.upload_model(parent=PARENT, model=tf28_gpu_model_dict)
    .result(timeout=180)
    .model
)
Replace GCS_BUCKET with your bucket path (e.g., gs://hf-tf-vision). The container_spec points to a pre-built Vertex AI serving image that matches the model's TensorFlow version (here, 2.8 with GPU support). Once the upload completes, create an endpoint and deploy the model to it, as sketched below.
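Here is a rough sketch of the remaining steps using the same GAPIC-style clients (the display names, machine type, accelerator, and replica counts are illustrative choices, not requirements):

endpoint_service_client = aip.EndpointServiceClient(client_options=client_options)

# Step 2: create an endpoint.
tf28_gpu_endpoint = (
    endpoint_service_client.create_endpoint(
        parent=PARENT,
        endpoint={"display_name": "ViT Base TF2.8 GPU endpoint"},
    )
    .result(timeout=300)
    .name
)

# Step 3: deploy the uploaded model onto a GPU-backed node,
# routing 100% of traffic to this deployment.
endpoint_service_client.deploy_model(
    endpoint=tf28_gpu_endpoint,
    deployed_model={
        "model": tf28_gpu_model,
        "display_name": "ViT Base TF2.8 GPU deployment",
        "dedicated_resources": {
            "min_replica_count": 1,
            "max_replica_count": 1,
            "machine_spec": {
                "machine_type": "n1-standard-8",
                "accelerator_type": aip.AcceleratorType.NVIDIA_TESLA_T4,
                "accelerator_count": 1,
            },
        },
    },
    traffic_split={"0": 100},
).result(timeout=1200)

For step 4, send a base64-encoded image for online prediction. The input key string_input must match the name of the serving signature's input tensor (an assumption carried over from the serving-function sketch above):

import base64
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value

prediction_client = aip.PredictionServiceClient(client_options=client_options)

# Web-safe base64, to pair with tf.io.decode_base64 on the model side.
with open("image.jpg", "rb") as f:
    b64str = base64.urlsafe_b64encode(f.read()).decode("utf-8")

instances = [json_format.ParseDict({"string_input": b64str}, Value())]
response = prediction_client.predict(endpoint=tf28_gpu_endpoint, instances=instances)
print(response.predictions)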
Conclusion
Vertex AI simplifies deploying and scaling ML models with minimal code. By following the steps above, you can serve your ViT model for production use.
For the complete example, see the accompanying Colab notebook.