Creating a custom leaderboard to track and compare AI models is a powerful way to benchmark performance. In this guide, we walk through the process using Vectara's hallucination leaderboard as a real-world example.
What You'll Need
- A Hugging Face account
- A dataset for evaluation
- A scoring metric (e.g., hallucination rate)
- Basic Python scripting skills
Step 1: Define Your Metric
Vectara's leaderboard measures how often models produce hallucinations—factually incorrect or nonsensical outputs. For your own board, choose a metric that matters to your use case, like accuracy, coherence, or safety.
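To make the later steps concrete, here is a deliberately simple sketch of such a metric. It flags a response as hallucinated whenever the expected answer string is missing from it; a crude proxy (real boards typically use an NLI model or an LLM judge), but enough to wire the pipeline together:
def compute_hallucination(response: str, expected: str) -> float:
    # Crude proxy: score 1.0 (hallucination) when the expected answer
    # string never appears in the response, 0.0 otherwise.
    return 0.0 if expected.lower() in response.lower() else 1.0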
Step 2: Prepare Evaluation Data
Create a test set of prompts and expected responses. For hallucination testing, use questions with verifiable facts. Vectara's dataset includes a mix of open-domain queries with known answers.
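As a sketch, a small JSONL file works well. The prompts below are placeholders, but the "prompt" and "expected" field names match what the Step 3 script reads:
import json

# Hypothetical examples with verifiable answers.
examples = [
    {"prompt": "What is the capital of Australia?", "expected": "Canberra"},
    {"prompt": "Who wrote 'Pride and Prejudice'?", "expected": "Jane Austen"},
]

# Write JSONL so the datasets library can load it directly.
with open("eval_set.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
You can load this locally with load_dataset("json", data_files="eval_set.jsonl", split="train"), or push it to the Hub so others can reproduce your runs.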
Step 3: Set Up the Evaluation Pipeline
Use Hugging Face's datasets library to load your evaluation data. Then write a short Python script to:
- Load each model from the Hub
- Generate responses for your test set
- Score each response using your metric
from datasets import load_dataset
from transformers import pipeline

# Load evaluation data
data = load_dataset("your-org/your-dataset", split="test")

# Initialize the model under evaluation
generator = pipeline("text-generation", model="model-id")

# Evaluate: generate a response for each prompt, then score it
results = []
for example in data:
    # return_full_text=False keeps the prompt out of the generated text
    response = generator(example["prompt"], max_new_tokens=100,
                         return_full_text=False)[0]["generated_text"]
    # compute_hallucination is the scoring function sketched in Step 1
    score = compute_hallucination(response, example["expected"])
    results.append({"model": "model-id", "prompt": example["prompt"], "score": score})
Step 4: Aggregate Results
Aggregate the scores per model. Vectara reports an average hallucination rate across all prompts. Store the results in a structured format such as JSON or a pandas DataFrame.
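A minimal aggregation sketch with pandas, assuming the results list produced in Step 3:
import pandas as pd

# "results" is the list of per-prompt records from Step 3.
df = pd.DataFrame(results)

# Mean hallucination rate per model, best (lowest) first.
leaderboard = (
    df.groupby("model")["score"]
      .mean()
      .reset_index(name="hallucination_rate")
      .sort_values("hallucination_rate")
)

leaderboard.to_json("leaderboard.json", orient="records", indent=2)
Writing the table to JSON keeps the display layer in Step 5 decoupled from the evaluation code.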
Step 5: Create the Leaderboard on Hugging Face
- Push your aggregated results to a Hugging Face dataset repository
- Build a Hugging Face Space (e.g., with Gradio) to display a live leaderboard; a minimal sketch follows this list
- Include model names, scores, and links to model cards
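Here is a minimal sketch of the Gradio piece, assuming the leaderboard.json written in Step 4; the title and file path are placeholders:
import gradio as gr
import pandas as pd

# Read the aggregated scores written in Step 4.
leaderboard = pd.read_json("leaderboard.json")

# Render model names as markdown links to their Hub model cards.
leaderboard["model"] = leaderboard["model"].apply(
    lambda m: f"[{m}](https://huggingface.co/{m})"
)

with gr.Blocks() as demo:
    gr.Markdown("# Hallucination Leaderboard")
    gr.Dataframe(leaderboard, datatype=["markdown", "number"])

demo.launch()
Pushing this script as app.py to a Space gives you a public, always-on leaderboard page.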
Step 6: Maintain and Update
As new models are released, add them to your pipeline and refresh the leaderboard. Vectara updates its board regularly to reflect the latest improvements.
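One way to keep the pipeline extensible is to parametrize it over a list of model ids (the ids below are placeholders), so adding a model is a one-line change:
from transformers import pipeline

# Hypothetical list of Hub model ids; append newly released models here
# and re-run to refresh the leaderboard.
MODELS = ["your-org/model-a", "your-org/model-b"]

results = []
for model_id in MODELS:
    generator = pipeline("text-generation", model=model_id)
    for example in data:  # "data" and compute_hallucination come from earlier steps
        response = generator(example["prompt"], max_new_tokens=100,
                             return_full_text=False)[0]["generated_text"]
        results.append({
            "model": model_id,
            "prompt": example["prompt"],
            "score": compute_hallucination(response, example["expected"]),
        })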
Example: Vectara's Hallucination Leaderboard
Vectara's leaderboard at huggingface.co/spaces/vectara/leaderboard tracks over 50 models. It scores model-written summaries of a curated document set with Vectara's openly released Hughes Hallucination Evaluation Model (HHEM). The board is public and encourages community submissions.
By following these steps, you can build a transparent, reproducible benchmark for any AI capability. Whether you're tracking accuracy, bias, or creativity, a Hugging Face leaderboard makes your results accessible to the community.