Sentiment analysis models typically require access to unencrypted text to determine whether a message is positive, negative, or neutral. This poses significant privacy risks when handling sensitive personal data. Homomorphic encryption offers a solution by enabling computations on encrypted data without decryption.
This article demonstrates how to build a sentiment analysis model that operates on encrypted data using the Concrete-ML library, which lets data scientists turn machine learning models into their fully homomorphic encryption (FHE) equivalents without requiring cryptography expertise.
Pipeline Overview
The workflow combines a transformer model for text representation and XGBoost for classification, all executed on encrypted data:
- Text Representation: Use a pre-trained RoBERTa transformer (fine-tuned for sentiment analysis) to convert each text into a 768-dimensional vector by averaging the last-layer hidden states.
- Classification: Train an XGBoost model on the hidden representations to predict sentiment (negative, neutral, positive).
- Encryption: Apply Concrete-ML to convert the XGBoost classifier into an FHE-compatible model, allowing predictions on encrypted inputs.
Step-by-Step Implementation
1. Environment Setup
pip install -U pip setuptools
pip install concrete-ml transformers datasets
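To confirm the installation succeeded, a quick import check is enough; it exercises the same class used later in this article:
python -c "from concrete.ml.sklearn import XGBClassifier; print('concrete-ml OK')"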
2. Load and Explore Dataset
Using the Twitter Airline Sentiment dataset:
from datasets import load_dataset
train = load_dataset("osanseviero/twitter-airline-sentiment")["train"].to_pandas()
text_X = train['text']
y = train['airline_sentiment']
y = y.replace(['negative', 'neutral', 'positive'], [0, 1, 2])  # map labels to integers
The dataset is imbalanced: 62.7% negative, 21.2% neutral, and 16.1% positive examples.
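These proportions are easy to verify directly from the labels:
# Class distribution of the airline_sentiment labels
print(train["airline_sentiment"].value_counts(normalize=True))
# negative ~0.627, neutral ~0.212, positive ~0.161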
3. Text Representation with Transformer
Load a RoBERTa model fine-tuned for sentiment analysis on tweets:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment-latest")
transformer_model = AutoModelForSequenceClassification.from_pretrained(
"cardiffnlp/twitter-roberta-base-sentiment-latest"
)
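The representation helper below takes a device argument; define it once and move the model there, falling back to CPU when no GPU is available:
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
transformer_model = transformer_model.to(device)
transformer_model.eval()  # inference only; disables dropout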
Convert texts to hidden representations by averaging token-level outputs:
import numpy as np
import torch

def text_to_tensor(texts, model, tokenizer, device):
    """Average the last-layer hidden states of each text into one 768-dim vector."""
    tensors = []
    for text in texts:
        # Truncate to the model's maximum sequence length to avoid errors on long texts
        tokens = tokenizer.encode(text, truncation=True, return_tensors="pt").to(device)
        with torch.no_grad():
            outputs = model(tokens, output_hidden_states=True)
        # Mean-pool over the token dimension of the final hidden layer
        hidden = outputs.hidden_states[-1].mean(dim=1).cpu().numpy()
        tensors.append(hidden)
    return np.vstack(tensors)
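A quick sanity check confirms the expected shape, one 768-dimensional row per input text:
demo = text_to_tensor(["great flight!", "they lost my luggage"], transformer_model, tokenizer, device)
print(demo.shape)  # (2, 768) for a roberta-base backbone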
4. Train XGBoost Classifier
Split data and train on hidden representations:
from sklearn.model_selection import train_test_split
import xgboost as xgb
X_train, X_test, y_train, y_test = train_test_split(
text_X, y, test_size=0.1, random_state=42
)
X_train_repr = text_to_tensor(X_train.tolist(), transformer_model, tokenizer, device)
X_test_repr = text_to_tensor(X_test.tolist(), transformer_model, tokenizer, device)
model = xgb.XGBClassifier()
model.fit(X_train_repr, y_train)
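It is worth recording a cleartext baseline before moving to FHE; accuracy_score gives a quick (if coarse, given the class imbalance) summary:
from sklearn.metrics import accuracy_score
baseline_acc = accuracy_score(y_test, model.predict(X_test_repr))
print(f"Cleartext XGBoost accuracy: {baseline_acc:.3f}")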
5. Convert to FHE-Compatible Model
Use Concrete-ML to compile the model for encrypted inference:
from concrete.ml.sklearn import XGBClassifier

# Train the quantized, FHE-friendly equivalent on the same representations
concrete_model = XGBClassifier()
concrete_model.fit(X_train_repr, y_train)

# Compile to an FHE circuit; the training set calibrates the input quantization
circuit = concrete_model.compile(X_train_repr)
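In recent Concrete-ML releases, the compiled model can be checked in simulation before paying the cost of real encrypted execution (the fhe keyword is version-dependent; older releases used an execute_in_fhe flag instead):
# Fast sanity check: runs the quantized circuit without actual encryption
y_sim = concrete_model.predict(X_test_repr, fhe="simulate")
print((y_sim == y_test.to_numpy()).mean())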
6. Predict on Encrypted Data
The client encrypts the input and sends it to the server; the server runs the FHE circuit and returns an encrypted prediction, which only the client can decrypt:
# Client side: the circuit operates on quantized integers, so the raw float
# features are quantized with the model's built-in helper before encryption
q_input = concrete_model.quantize_input(X_test_repr[[0]])
encrypted_input = circuit.encrypt(q_input)
# Server side: run the FHE circuit on the ciphertext only
encrypted_prediction = circuit.run(encrypted_input)
# Client side: decrypt the (still quantized) prediction
prediction = circuit.decrypt(encrypted_prediction)
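For local testing, the same round trip is available as a one-liner on the model itself (again assuming a recent Concrete-ML version), which quantizes, encrypts, runs, and decrypts internally:
# Equivalent high-level call: quantize + encrypt + run + decrypt in one step
y_fhe = concrete_model.predict(X_test_repr[[0]], fhe="execute")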
Deployment on Hugging Face Spaces
A complete demo is available on Hugging Face Spaces, showcasing client-server interaction where text is encrypted before being sent to the model.
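Below is a minimal sketch of that client-server split using Concrete-ML's deployment helpers; the class and method names follow recent Concrete-ML documentation and may differ across versions, and the "./fhe_model" and "./keys" paths are placeholders:
from concrete.ml.deployment import FHEModelClient, FHEModelDev, FHEModelServer

# One-time: save the compiled model artifacts for client and server
FHEModelDev(path_dir="./fhe_model", model=concrete_model).save()

# Client: generate keys, then quantize + encrypt + serialize the input
client = FHEModelClient(path_dir="./fhe_model", key_dir="./keys")
evaluation_keys = client.get_serialized_evaluation_keys()
encrypted_input = client.quantize_encrypt_serialize(X_test_repr[[0]])

# Server: run the circuit on ciphertext only
server = FHEModelServer(path_dir="./fhe_model")
server.load()
encrypted_result = server.run(encrypted_input, evaluation_keys)

# Client: decrypt and dequantize the prediction
scores = client.deserialize_decrypt_dequantize(encrypted_result)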
Conclusion
Homomorphic encryption enables privacy-preserving sentiment analysis with only a small accuracy cost from quantization. By combining a transformer for feature extraction with XGBoost for classification, developers can deploy models whose servers never see raw user data. This approach is well suited to applications such as analyzing private messages or medical records.