
Can AI Reliably Replace Human Judgment in Data Labeling? A New Study Investigates

Since the launch of ChatGPT, the development of large language models (LLMs) has surged, especially models fine-tuned to follow instructions. However, comparing these models remains challenging due to a lack of rigorous benchmarks. Evaluating instruction-tuned models is inherently difficult because user preference often hinges on qualitative style, whereas traditional NLP tasks come with narrower, more objectively scorable definitions.

A common claim is that a new LLM is "preferred to ChatGPT N% of the time," but that assessment often relies on GPT-4 as the judge rather than on direct human feedback; the implicit goal is to approximate human judgment. Reinforcement learning from human feedback (RLHF) popularized interfaces for comparing model outputs and using the resulting data to train reward models that predict which text humans prefer. Rating and ranking model outputs has since become a general-purpose evaluation tool.
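
To make the reward-modeling step concrete, here is a minimal sketch of the pairwise objective commonly used for such reward models (a Bradley-Terry style loss). It is an illustration, not the study's implementation, and the score tensors below are made up:

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(reward_chosen: torch.Tensor,
                             reward_rejected: torch.Tensor) -> torch.Tensor:
    # P(chosen beats rejected) = sigmoid(r_chosen - r_rejected);
    # minimizing the negative log of that probability pushes the
    # preferred completion's reward above the rejected one's.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Hypothetical per-completion scores from a reward model, one pair per entry.
r_chosen = torch.tensor([1.2, 0.4, 2.0])
r_rejected = torch.tensor([0.3, 0.9, 1.1])
loss = pairwise_preference_loss(r_chosen, r_rejected)
```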

Using a language model to evaluate other models is efficient, but a critical piece is missing: validating that the automated judge actually agrees with the human judgments it is meant to replace. In this post, we examine when you can trust LLM-generated labels by expanding the Open LLM Leaderboard evaluation suite.
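
One straightforward check of that alignment, sketched here rather than taken from the study, is to score the two label sources against each other on the same pairwise comparisons, using both raw agreement and a chance-corrected statistic. The labels below are invented for illustration:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical pairwise labels: 1 means response A was preferred, 0 means B.
human_labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
gpt4_labels  = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

raw_agreement = sum(h == g for h, g in zip(human_labels, gpt4_labels)) / len(human_labels)
kappa = cohen_kappa_score(human_labels, gpt4_labels)  # corrects for chance agreement

print(f"raw agreement: {raw_agreement:.2f}, Cohen's kappa: {kappa:.2f}")
```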

Evaluating Preferences of Open-Source Models

Human involvement in data curation is costly. Only a handful of human-labeled preference datasets are openly available for training, such as Anthropic's HHH data, OpenAssistant's dialogue rankings, and OpenAI's summarization datasets. Preference labels from human raters can be converted into a relative Elo ranking between models: a global ranking derived from pairwise comparisons.
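
To show how pairwise preferences become a global ranking, here is a minimal Elo sketch. The update rule below is the standard chess formulation; the match outcomes and starting ratings are hypothetical, and the study's exact procedure may differ:

```python
def update_elo(rating_a: float, rating_b: float, a_wins: bool,
               k: float = 32.0, scale: float = 400.0) -> tuple[float, float]:
    # Expected score of A under the Elo model, then a K-factor update.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))
    delta = k * ((1.0 if a_wins else 0.0) - expected_a)
    return rating_a + delta, rating_b - delta

# Hypothetical head-to-head outcomes between two models, both starting at 1000.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
for a_wins in [True, True, False, True]:
    ratings["model_a"], ratings["model_b"] = update_elo(
        ratings["model_a"], ratings["model_b"], a_wins)
```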

To investigate, we curated a held-out set of instruction prompts and completions from popular open-source models: Koala 13b, Vicuna 13b, OpenAssistant 12b, and Dolly 12b. We collected 327 high-quality human-written prompts across diverse categories such as generation, brainstorming, QA, summarization, commonsense, and coding. Prompts averaged 24 tokens in length and completions 69 tokens.

We evaluated model outputs with both professional human labelers (via Scale AI) and GPT-4. Labelers rated responses on a Likert scale (1–8) for helpfulness and truthfulness in pairwise comparisons. Using this data, we computed bootstrapped Elo scores based on win probabilities.
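
As a rough sketch of what bootstrapped Elo can mean in practice (assuming each pairwise rating is reduced to a win/loss outcome, which the study's exact pipeline may handle differently), one can resample the comparisons with replacement and recompute Elo many times, yielding a distribution of scores per model. Model names and outcomes here are illustrative:

```python
import random

def elo_from_matches(matches, k=32.0, scale=400.0, start=1000.0):
    # One sequential pass of Elo updates over (winner, loser) pairs.
    ratings = {}
    for winner, loser in matches:
        ra = ratings.setdefault(winner, start)
        rb = ratings.setdefault(loser, start)
        expected = 1.0 / (1.0 + 10 ** ((rb - ra) / scale))
        ratings[winner] = ra + k * (1.0 - expected)
        ratings[loser] = rb - k * (1.0 - expected)
    return ratings

def bootstrap_elo(matches, n_rounds=1000, seed=0):
    # Resample with replacement and recompute Elo each round, so every
    # model ends up with a distribution of scores instead of a point estimate.
    rng = random.Random(seed)
    rounds = []
    for _ in range(n_rounds):
        sample = [rng.choice(matches) for _ in matches]
        rng.shuffle(sample)  # Elo is order-dependent, so shuffle each round
        rounds.append(elo_from_matches(sample))
    return rounds

# Hypothetical (winner, loser) outcomes distilled from the pairwise ratings.
matches = [("vicuna-13b", "koala-13b"), ("koala-13b", "dolly-12b"),
           ("vicuna-13b", "dolly-12b"), ("oasst-12b", "dolly-12b")]
rounds = bootstrap_elo(matches, n_rounds=200)
```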

Our findings highlight the strengths and limitations of using LLMs as evaluators. While GPT-4 offers speed and scalability, it may not perfectly replicate human preferences, especially in nuanced tasks. The study underscores the need for careful calibration when substituting human labels with AI-generated ones.