
Building a Serverless Text Sentiment Pipeline on Google Cloud


The Goal

I wanted to create a microservice that automatically detects whether a customer review left in Discord is positive or negative, so that I could treat the comment accordingly and improve the customer experience. For instance, if a review was negative, I could build a feature that contacts the customer, apologizes for the poor quality of service, and lets them know that our support team will reach out as soon as possible to assist them and hopefully fix the problem. Since I don't expect more than 2,000 requests per month, I didn't impose any constraints on latency or scalability.

The Transformers Library

I was a bit confused at the beginning when I downloaded the .h5 file. I expected it to be compatible with tensorflow.keras.models.load_model, but this wasn't the case. After a few minutes of research I figured out that the file was a weights checkpoint rather than a full Keras model. I then tried out the hosted API that Hugging Face offers and read more about the pipeline feature of the Transformers library. Since the results from both the API and the pipeline were great, I decided to serve the model through the pipeline on my own server.

Below is the official example from the Transformers GitHub page:

from transformers import pipeline

classifier = pipeline('sentiment-analysis')
print(classifier('We are very happy to include pipeline into the transformers repository.'))
# [{'label': 'POSITIVE', 'score': 0.9978193640708923}]
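
By default the pipeline downloads whichever model the task maps to; you can also pin an explicit checkpoint. A minimal sketch, assuming the DistilBERT SST-2 checkpoint from the Hugging Face Hub (the same model used later in this post):

from transformers import pipeline

# Pin an explicit checkpoint instead of relying on the task's default model.
classifier = pipeline('sentiment-analysis',
                      model='distilbert-base-uncased-finetuned-sst-2-english')
print(classifier('The support team never got back to me.'))
# -> something like [{'label': 'NEGATIVE', 'score': ...}]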

Deploy transformers to Google Cloud

I chose GCP because it's the cloud environment I already use in my personal organization.

Step 1 - Research

I already knew that I could serve a Transformers model behind an API framework like Flask. Searching the Google Cloud AI documentation, I found a service for hosting TensorFlow models called AI Platform Prediction. I also came across App Engine and Cloud Run there, but I was concerned about App Engine's memory limits and was not very familiar with Docker.
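
For context, here is a minimal sketch of what serving the pipeline behind Flask could look like; the /predict route and the port are my own placeholder choices:

from flask import Flask, jsonify, request
from transformers import pipeline

app = Flask(__name__)
# Load the model once at startup so every request reuses it.
classifier = pipeline('sentiment-analysis')

@app.route('/predict', methods=['POST'])
def predict():
    # Expect a JSON body such as {"text": "Great service!"}.
    text = request.get_json(force=True)['text']
    return jsonify(classifier(text))

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)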

Step 2 - Test on AI-Platform Prediction

As the model is a checkpoint rather than a "pure TensorFlow" SavedModel, and I couldn't convert it into one, I figured out that the example on this page wouldn't work. From there I saw that I could write some custom code to load the pipeline instead of handling the model directly, which seemed easier. I also learned that I could define pre-prediction and post-prediction actions, which could be useful in the future for pre- or post-processing data to match customers' needs. I followed Google's guide but ran into an issue, as the service was still in beta and not everything was stable.
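
A rough sketch of what that custom code could look like, assuming AI Platform Prediction's (beta) custom prediction routine interface with its from_path and predict hooks; any pre- or post-processing would wrap the classifier call:

from transformers import pipeline

class SentimentPredictor(object):
    """Wraps the Transformers pipeline for AI Platform Prediction."""

    def __init__(self, classifier):
        self._classifier = classifier

    def predict(self, instances, **kwargs):
        # instances is a list of strings; pre-prediction and
        # post-prediction steps would go around this call.
        return self._classifier(instances)

    @classmethod
    def from_path(cls, model_dir):
        # model_dir points at the files uploaded with the model version.
        return cls(pipeline('sentiment-analysis', model=model_dir))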

Step 3 - Test on App Engine

I moved to Google App Engine, as it's a service I am familiar with, but encountered an installation issue with TensorFlow due to a missing system dependency. I then tried PyTorch, which worked on an F4_1G instance, but a single instance couldn't handle more than 2 concurrent requests, which isn't great performance-wise.

Step 4 - Test on Cloud Run

Lastly, I moved to Cloud Run with a Docker image. I followed this guide to get an idea of how it works. On Cloud Run, I could configure more memory and more vCPUs to perform the prediction with PyTorch. I ditched TensorFlow, as PyTorch seems to load the model faster.
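
Memory and CPU are set at deploy time. A hedged example, with placeholder project, service, and region names:

gcloud run deploy sentiment-api \
    --image gcr.io/MY_PROJECT/sentiment-api \
    --memory 2Gi \
    --cpu 2 \
    --region europe-west1 \
    --allow-unauthenticated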

Implementation of the serverless pipeline

The final solution consists of four different components:

  • main.py, which handles incoming requests and calls the pipeline (a sketch follows below).
  • Dockerfile, used to build the image that will be deployed on Cloud Run (also sketched below).
  • Model folder, containing pytorch_model.bin, config.json, and vocab.txt.
    • Model: DistilBERT base uncased finetuned SST-2
    • To download the model folder, grab these files from the model's page on the Hugging Face Hub.
    • You don't need to keep the rust_model.ot or the tf_model.h5, since the model is served with PyTorch.
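
To illustrate how the pieces fit together: main.py can reuse the Flask sketch from Step 1, swapping the pipeline line so it loads the local model folder shipped inside the image instead of downloading anything at startup:

# Load the DistilBERT SST-2 files (pytorch_model.bin, config.json,
# vocab.txt) from the local folder copied into the Docker image.
classifier = pipeline('sentiment-analysis', model='./model')

And a hedged sketch of the Dockerfile; the base image, the requirements.txt file, and the folder layout are assumptions on my part:

FROM python:3.8-slim

WORKDIR /app

# Install torch, transformers and flask from a requirements file.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code and the local model folder into the image.
COPY main.py .
COPY model ./model

# Cloud Run routes requests to the port the app listens on (8080 here).
CMD ["python", "main.py"]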