Creating a custom leaderboard to track and compare AI models is a powerful way to benchmark performance. In this guide, we walk through the process using Vectara's hallucination leaderboard as a real-world example.
What You'll Need
- A Hugging Face account
- A dataset for evaluation
- A scoring metric (e.g., hallucination rate)
- Basic Python scripting skills
Step 1: Define Your Metric
Vectara's leaderboard measures how often models produce hallucinations—factually incorrect or nonsensical outputs. For your own board, choose a metric that matters to your use case, like accuracy, coherence, or safety.
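To make the later steps concrete, here is a deliberately simple sketch of such a metric. It flags a response as hallucinated whenever the expected answer string is missing from it; a crude proxy (real boards typically use an NLI model or an LLM judge), but enough to wire the pipeline together:
def compute_hallucination(response: str, expected: str) -> float:
    # Crude proxy: score 1.0 (hallucination) when the expected answer
    # string never appears in the response, 0.0 otherwise.
    return 0.0 if expected.lower() in response.lower() else 1.0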
Step 2: Prepare Evaluation Data
Create a test set of prompts and expected responses. For hallucination testing, use questions with verifiable facts. Vectara's dataset includes a mix of open-domain queries with known answers.
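As a sketch, a small JSONL file works well. The prompts below are placeholders, but the "prompt" and "expected" field names match what the Step 3 script reads:
import json

# Hypothetical examples with verifiable answers.
examples = [
    {"prompt": "What is the capital of Australia?", "expected": "Canberra"},
    {"prompt": "Who wrote 'Pride and Prejudice'?", "expected": "Jane Austen"},
]

# Write JSONL so the datasets library can load it directly.
with open("eval_set.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
You can load this locally with load_dataset("json", data_files="eval_set.jsonl", split="train"), or push it to the Hub so others can reproduce your runs.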
Step 3: Set Up the Evaluation Pipeline
Use Hugging Face's datasets library to load your evaluation data. Then write a short Python script to:
- Load each model from the Hub
- Generate responses for your test set
- Score each response using your metric
from datasets import load_dataset
from transformers import pipeline

# Load evaluation data
data = load_dataset("your-org/your-dataset", split="test")

# Initialize the model under evaluation
generator = pipeline("text-generation", model="model-id")

# Evaluate: generate a response for each prompt, then score it
results = []
for example in data:
    # return_full_text=False keeps the prompt out of the generated text
    response = generator(example["prompt"], max_new_tokens=100,
                         return_full_text=False)[0]["generated_text"]
    # compute_hallucination is the scoring function sketched in Step 1
    score = compute_hallucination(response, example["expected"])
    results.append({"model": "model-id", "prompt": example["prompt"], "score": score})
Step 4: Aggregate Results
Aggregate the scores per model. Vectara reports an average hallucination rate across all prompts. Store the results in a structured format such as JSON or a pandas DataFrame.
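A minimal aggregation sketch with pandas, assuming the results list produced in Step 3:
import pandas as pd

# "results" is the list of per-prompt records from Step 3.
df = pd.DataFrame(results)

# Mean hallucination rate per model, best (lowest) first.
leaderboard = (
    df.groupby("model")["score"]
      .mean()
      .reset_index(name="hallucination_rate")
      .sort_values("hallucination_rate")
)

leaderboard.to_json("leaderboard.json", orient="records", indent=2)
Writing the table to JSON keeps the display layer in Step 5 decoupled from the evaluation code.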
Step 5: Create the Leaderboard on Hugging Face
- Push your aggregated results to a Hugging Face dataset repository
- Build a Hugging Face Space (e.g., with Gradio) to display a live leaderboard; a minimal sketch follows this list
- Include model names, scores, and links to model cards
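Here is a minimal sketch of the Gradio piece, assuming the leaderboard.json written in Step 4; the title and file path are placeholders:
import gradio as gr
import pandas as pd

# Read the aggregated scores written in Step 4.
leaderboard = pd.read_json("leaderboard.json")

# Render model names as markdown links to their Hub model cards.
leaderboard["model"] = leaderboard["model"].apply(
    lambda m: f"[{m}](https://huggingface.co/{m})"
)

with gr.Blocks() as demo:
    gr.Markdown("# Hallucination Leaderboard")
    gr.Dataframe(leaderboard, datatype=["markdown", "number"])

demo.launch()
Pushing this script as app.py to a Space gives you a public, always-on leaderboard page.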
Step 6: Maintain and Update
As new models are released, add them to your pipeline and refresh the leaderboard. Vectara updates its board regularly to reflect the latest improvements.
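One way to keep the pipeline extensible is to parametrize it over a list of model ids (the ids below are placeholders), so adding a model is a one-line change:
from transformers import pipeline

# Hypothetical list of Hub model ids; append newly released models here
# and re-run to refresh the leaderboard.
MODELS = ["your-org/model-a", "your-org/model-b"]

results = []
for model_id in MODELS:
    generator = pipeline("text-generation", model=model_id)
    for example in data:  # "data" and compute_hallucination come from earlier steps
        response = generator(example["prompt"], max_new_tokens=100,
                             return_full_text=False)[0]["generated_text"]
        results.append({
            "model": model_id,
            "prompt": example["prompt"],
            "score": compute_hallucination(response, example["expected"]),
        })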
Example: Vectara's Hallucination Leaderboard
Vectara's leaderboard at huggingface.co/spaces/vectara/leaderboard tracks over 50 models. It scores model-written summaries of a curated document set with Vectara's openly released Hughes Hallucination Evaluation Model (HHEM). The board is public and encourages community submissions.
By following these steps, you can build a transparent, reproducible benchmark for any AI capability. Whether you're tracking accuracy, bias, or creativity, a Hugging Face leaderboard makes your results accessible to the community.