Understanding Embeddings
An embedding is a numerical representation of a piece of information—such as text, documents, images, or audio—that captures its semantic meaning. For example, the sentence "What is the main benefit of voting?" can be represented as a list of 384 numbers like [0.84, 0.42, ..., 0.02]. This numerical form allows us to compute distances between embeddings to measure semantic similarity.
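To make "distance between embeddings" concrete, here is a minimal sketch of cosine similarity, the metric most often used to compare embeddings. The vectors are toy 3-dimensional stand-ins for real 384-dimensional embeddings, chosen only for illustration:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot(a, b) / (||a|| * ||b||), in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real 384-dimensional embeddings.
query_vec = [0.84, 0.42, 0.02]
similar_vec = [0.80, 0.45, 0.10]
unrelated_vec = [-0.10, 0.05, 0.99]

print(cosine_similarity(query_vec, similar_vec))    # close to 1.0
print(cosine_similarity(query_vec, unrelated_vec))  # much lower
```

Semantically related texts produce embeddings that point in similar directions, so their cosine similarity approaches 1, while unrelated texts score near 0.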
Embeddings aren't limited to text. You can create an embedding of an image and compare it with a text embedding to check if a sentence describes the image. This capability powers systems for image search, classification, description, and more.
Embeddings are generated using open-source libraries like Sentence Transformers, which can create state-of-the-art embeddings from text and images for free.
What Are Embeddings For?
"[...] once you understand this ML multitool (embedding), you'll be able to build everything from search engines to recommendation systems to chatbots and a whole lot more." — Dale Markowitz, Google Cloud
Once information is embedded, countless industrial applications open up. Google Search uses embeddings to match text to text and text to images; Snapchat uses them for ad targeting; Meta (Facebook) employs them for social search.
To leverage embeddings, companies must embed their datasets, enabling algorithms to search, sort, and group efficiently. While this can be expensive and technically complex, open-source tools make it accessible.
Getting Started with Embeddings
We'll build a small Frequently Asked Questions (FAQs) engine: given a user query, identify the most similar FAQ. We'll use the US Social Security Medicare FAQs.
First, we need to embed our dataset. The Hugging Face Inference API allows this with a simple POST call. Since embeddings capture semantic meaning, we can compare them to find the closest match to a query.
In summary, we will:
- Embed Medicare's FAQs using the Inference API.
- Upload the embedded questions to the Hugging Face Hub for free hosting.
- Compare a customer query to the embedded dataset to find the most similar FAQ.
1. Embedding a Dataset
Select a pre-trained model for creating embeddings. We'll use sentence-transformers/all-MiniLM-L6-v2 because it's small but powerful. Future posts will explore other models and trade-offs.
First, log in to the Hugging Face Hub and create a write token in your Account Settings. Store it in hf_token.
model_id = "sentence-transformers/all-MiniLM-L6-v2"
hf_token = "get your token in http://hf.co/settings/tokens"
To generate embeddings, use the endpoint https://api-inference.huggingface.co/pipeline/feature-extraction/{model_id} with headers {"Authorization": f"Bearer {hf_token}"}. Here's a function that receives a dictionary of texts and returns a list of embeddings:
import requests

api_url = f"https://api-inference.huggingface.co/pipeline/feature-extraction/{model_id}"
headers = {"Authorization": f"Bearer {hf_token}"}

def query(texts):
    response = requests.post(api_url, headers=headers, json={"inputs": texts, "options": {"wait_for_model": True}})
    return response.json()
The first request may take about 20 seconds as the model downloads. Use a retry decorator (install with pip install retry) to wait 10 seconds and retry up to three times if needed. Subsequent calls are much faster.
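To show what that retry behavior looks like without pulling in the `retry` package, here is a minimal stdlib stand-in that mimics its `@retry(tries=3, delay=10)` decorator; the implementation below is a sketch, not the library's actual code:

```python
import time
import functools

def retry(tries=3, delay=10):
    """Minimal stand-in for the retry package's decorator: re-run the
    function on any exception, waiting `delay` seconds between up to
    `tries` attempts."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(tries):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == tries - 1:
                        raise  # out of attempts: surface the error
                    time.sleep(delay)
        return wrapper
    return decorator

# Usage sketch: decorate the query function so a cold model (still
# loading on the first request) triggers an automatic retry.
# @retry(tries=3, delay=10)
# def query(texts): ...
```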
The Inference API does not enforce strict rate limits; instead, Hugging Face load-balances requests across its infrastructure. For large-scale embedding jobs, consider the Hugging Face Accelerated Inference API.
2. Host Embeddings for Free on the Hugging Face Hub
After embedding your dataset, you can upload the vectors to the Hugging Face Hub for free hosting and easy access. This enables persistent storage and sharing of embeddings.
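One common approach, sketched below with hypothetical toy data, is to store the embeddings as a CSV file (one row per FAQ, one column per dimension), which can then be pushed to a dataset repo on the Hub:

```python
import pandas as pd

# Hypothetical output of the query() function above: one embedding per FAQ.
faqs = ["How do I get a replacement Medicare card?",
        "How do I sign up for Medicare?"]
embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # toy 3-dim vectors

# One row per FAQ, one column per embedding dimension.
df = pd.DataFrame(embeddings, index=faqs)
df.to_csv("embeddings.csv", index_label="faq")

# The resulting file can be uploaded to a Hub dataset repo, for example
# with huggingface_hub.HfApi().upload_file(...) or by committing it via git.
```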
3. Get the Most Similar FAQ to a Query
With the embedded dataset hosted, you can compare a new user query against all FAQs using cosine similarity or other distance metrics. The FAQ with the highest similarity score is returned as the answer.
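The Sentence Transformers library ships a `util.semantic_search` helper for exactly this; as a library-free sketch, the ranking step can be written with NumPy (the function name and toy vectors below are illustrative, not part of any API):

```python
import numpy as np

def top_k_similar(query_embedding, dataset_embeddings, k=3):
    """Return (index, score) pairs of the k most similar rows,
    ranked by cosine similarity."""
    q = np.asarray(query_embedding, dtype=float)
    d = np.asarray(dataset_embeddings, dtype=float)
    scores = (d @ q) / (np.linalg.norm(d, axis=1) * np.linalg.norm(q))
    top = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in top]

# Toy example: the second FAQ embedding is closest to the query,
# so index 1 is ranked first.
faq_embeddings = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query_embedding = [0.7, 0.7]
print(top_k_similar(query_embedding, faq_embeddings, k=2))
```

In the FAQ engine, the returned indices map back to the original questions, and the top-ranked FAQ is served as the answer.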
Additional Resources
For deeper learning, explore training advanced embedding models and techniques in the Sentence Transformers documentation.