In a previous series of posts, we demonstrated how to deploy a Vision Transformer (ViT) model from the Hugging Face Transformers library locally and on a Kubernetes cluster. Now, we turn to Google Cloud's Vertex AI platform, which offers the same scalability as Kubernetes but with far less code.
What is Vertex AI?
According to Google Cloud, Vertex AI provides tools to support the entire ML workflow, across different model types and levels of ML expertise. For model deployment, it offers:
- Authentication
- Autoscaling based on traffic
- Model versioning
- Traffic splitting between versions
- Rate limiting
- Model monitoring and logging
- Support for online and batch predictions
For TensorFlow models, Vertex AI provides off-the-shelf serving utilities, and it also supports other popular frameworks such as PyTorch and scikit-learn.
The Serving Model
We use the same ViT B/16 model implemented in TensorFlow, serialized with preprocessing and postprocessing operations embedded to reduce training-serving skew. The model accepts base64-encoded image strings, resizes images to 224x224, normalizes to [-1, 1], transposes to channels-first layout, runs inference, and outputs confidence scores and string labels.
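As a rough sketch, the exported serving function might look like the following. The input tensor name pixel_values, the LABELS constant of class-name strings, and the loaded model object are assumptions for illustration; the actual export code lives in the accompanying Colab notebook:

import tensorflow as tf

# Assumed to exist already: `model`, a TF ViT classification model,
# and LABELS, a tf.constant of class-name strings.
CONCRETE_INPUT = "pixel_values"  # assumed name of the model's input tensor

def preprocess(b64_string):
    # Decode the (web-safe) base64 payload into an RGB image tensor.
    image = tf.io.decode_jpeg(tf.io.decode_base64(b64_string), channels=3)
    image = tf.image.resize(image, (224, 224))
    image = image / 127.5 - 1  # scale pixel values to [-1, 1]
    return tf.transpose(image, (2, 0, 1))  # channels-first, as ViT expects

@tf.function(input_signature=[tf.TensorSpec([None], tf.string, name="string_input")])
def serving_fn(string_input):
    # Preprocess the batch, run inference, and turn logits into
    # confidence scores and human-readable labels.
    images = tf.map_fn(preprocess, string_input, fn_output_signature=tf.float32)
    probs = tf.nn.softmax(model({CONCRETE_INPUT: images}).logits, axis=1)
    return {
        "confidence": tf.reduce_max(probs, axis=1),
        "label": tf.gather(LABELS, tf.argmax(probs, axis=1)),
    }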
Deployment Workflow
Start by uploading your trained TensorFlow SavedModel to a Google Cloud Storage (GCS) bucket. The deployment then revolves around two Vertex AI components, described below.
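One way to perform the upload from Python is via tf.io.gfile, which understands gs:// URIs (the bucket and directory names below are illustrative; the gsutil CLI works equally well):

import os
import tensorflow as tf

GCS_BUCKET = "gs://hf-tf-vision"  # illustrative bucket name
LOCAL_MODEL_DIR = "vit"  # local SavedModel directory

# Copy the SavedModel tree into the bucket, file by file.
for dirpath, _, filenames in tf.io.gfile.walk(LOCAL_MODEL_DIR):
    dest_dir = os.path.join(GCS_BUCKET, dirpath)
    tf.io.gfile.makedirs(dest_dir)
    for name in filenames:
        tf.io.gfile.copy(
            os.path.join(dirpath, name),
            os.path.join(dest_dir, name),
            overwrite=True,
        )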
Vertex AI Model Registry
This fully managed registry stores your models and manages multiple versions of them, so you don't have to handle storage or access control yourself. It supports TensorFlow SavedModel, scikit-learn, and XGBoost formats.
Vertex AI Endpoint
An endpoint receives prediction requests and returns responses. It lets you configure the model version to serve, the VM specification (CPU, memory, accelerators), the number of compute nodes, traffic splits between versions, and monitoring.
Performing the Deployment
Using the google-cloud-aiplatform Python SDK, we complete the process in four steps:
- Upload the model to the registry.
- Create an endpoint.
- Deploy the model to the endpoint.
- Make prediction requests.
Here is a snippet that uploads the model to the registry. It assumes PROJECT_ID, REGION, GCS_BUCKET, and LOCAL_MODEL_DIR were defined earlier:

from google.cloud.aiplatform import gapic as aip

# Parent resource and regional API endpoint used by all subsequent calls.
PARENT = f"projects/{PROJECT_ID}/locations/{REGION}"
client_options = {"api_endpoint": f"{REGION}-aiplatform.googleapis.com"}
model_service_client = aip.ModelServiceClient(client_options=client_options)

tf28_gpu_model_dict = {
    "display_name": "ViT Base TF2.8 GPU model",
    "artifact_uri": f"{GCS_BUCKET}/{LOCAL_MODEL_DIR}",
    "container_spec": {
        "image_uri": "us-docker.pkg.dev/vertex-ai/prediction/tf2-gpu.2-8:latest",
    },
}
tf28_gpu_model = (
    model_service_client.upload_model(parent=PARENT, model=tf28_gpu_model_dict)
    .result(timeout=180)
    .model
)
Replace GCS_BUCKET with your bucket path (e.g., gs://hf-tf-vision). The container_spec points to a pre-built Vertex AI serving image that matches the model's TensorFlow version (here, 2.8 with GPU support). Once the upload completes, create an endpoint and deploy the model to it, as sketched below.
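Here is a rough sketch of the remaining steps using the same GAPIC-style clients (the display names, machine type, accelerator, and replica counts are illustrative choices, not requirements):

endpoint_service_client = aip.EndpointServiceClient(client_options=client_options)

# Step 2: create an endpoint.
tf28_gpu_endpoint = (
    endpoint_service_client.create_endpoint(
        parent=PARENT,
        endpoint={"display_name": "ViT Base TF2.8 GPU endpoint"},
    )
    .result(timeout=300)
    .name
)

# Step 3: deploy the uploaded model onto a GPU-backed node,
# routing 100% of traffic to this deployment.
endpoint_service_client.deploy_model(
    endpoint=tf28_gpu_endpoint,
    deployed_model={
        "model": tf28_gpu_model,
        "display_name": "ViT Base TF2.8 GPU deployment",
        "dedicated_resources": {
            "min_replica_count": 1,
            "max_replica_count": 1,
            "machine_spec": {
                "machine_type": "n1-standard-8",
                "accelerator_type": aip.AcceleratorType.NVIDIA_TESLA_T4,
                "accelerator_count": 1,
            },
        },
    },
    traffic_split={"0": 100},
).result(timeout=1200)

For step 4, send a base64-encoded image for online prediction. The input key string_input must match the name of the serving signature's input tensor (an assumption carried over from the serving-function sketch above):

import base64
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value

prediction_client = aip.PredictionServiceClient(client_options=client_options)

# Web-safe base64, to pair with tf.io.decode_base64 on the model side.
with open("image.jpg", "rb") as f:
    b64str = base64.urlsafe_b64encode(f.read()).decode("utf-8")

instances = [json_format.ParseDict({"string_input": b64str}, Value())]
response = prediction_client.predict(endpoint=tf28_gpu_endpoint, instances=instances)
print(response.predictions)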
Conclusion
Vertex AI simplifies deploying and scaling ML models with minimal code. By following the steps above, you can serve your ViT model for production use.
For the complete example, see the accompanying Colab notebook.