Large language models can now be evaluated on zero-shot classification tasks with Evaluation on the Hub, a new low-code tool that makes it simple to measure model performance without writing custom code or requiring expensive infrastructure.
Zero-shot evaluation is a popular method for assessing how well language models perform tasks they haven't been explicitly trained on. Researchers have shown that these models often learn capabilities during training without labeled examples. The Inverse Scaling Prize is one recent community effort that uses zero-shot evaluation across model sizes to find tasks where larger models may actually perform worse than smaller ones.
Enabling zero-shot evaluation on the Hub
Evaluation on the Hub now supports zero-shot evaluation for any causal language model on the Hub. This feature measures how likely a trained model is to produce a given set of tokens and doesn't require labeled training data, saving researchers from costly annotation efforts.
The underlying AutoTrain infrastructure has been upgraded to allow large models to be evaluated for free. For instance, a model with 66 billion parameters previously took 35 minutes just to load and compile; now, evaluating it on a zero-shot task with 2,000 sentence-length examples takes about 3.5 hours. The tool currently supports models up to 66 billion parameters, with support for even larger models on the way.
In a zero-shot text classification task, a dataset provides prompts and possible completions. The tool concatenates completions with prompts, sums the log-probabilities, normalizes them, and compares with the correct completion to report accuracy.
Case study: WinoBias evaluation
To demonstrate, the team evaluated various OPT models on WinoBias, a coreference task measuring gender bias related to occupations. WinoBias tests whether a model picks a stereotypical pronoun to fill in a sentence mentioning an occupation.
The dataset was formatted as a zero-shot task where classification options are different pronoun completions. The target is the anti-stereotypical completion (e.g., for "developer," which is stereotypically male, the anti-stereotypical pronoun is "she").
After submitting evaluation jobs via the interface, results appear as pull requests on the model's Hub repository. Plotting the results revealed an inverse scaling trend: smaller models were more likely to select the anti-stereotypical pronoun, while larger models tended to learn stereotypical gender-occupation associations. This corroborates findings from BIG-Bench and prior work showing larger models are more biased and generate more toxic text.
Better research tools for everyone
Open science has benefited from community-driven tools like EleutherAI's Language Model Evaluation Harness and the BIG-bench project. Evaluation on the Hub adds a low-code option that makes it easy to compare zero-shot performance across models, whether by FLOPS or model size, or between different training corpora.
The zero-shot text classification task is highly flexible—any dataset that can be turned into a Winograd schema—where examples to compare differ by only a few words—can be evaluated on multiple models at once. The goal is to simplify uploading new datasets and benchmarking.
Researchers can use this tool to explore the inverse scaling problem: tasks where larger models perform worse. The Inverse Scaling Prize competition challenges researchers to find such tasks. The team encourages the community to try zero-shot evaluation on models of all sizes and submit interesting findings to round 2 of the competition.