Today, Hugging Face introduced Evaluation on the Hub, a new tool powered by AutoTrain that enables users to evaluate any model on any dataset directly from the Hub without writing a single line of code. This initiative aims to address what the company calls an "evaluation crisis" in modern AI, where the reproducibility and comparability of model performance are hampered by inconsistent evaluation practices and implementation bugs.
Evaluation on the Hub allows users to:
- Find the best model for a task: Browse leaderboards for a dataset or evaluate a specific model not yet listed.
- Evaluate on new datasets: Upload a custom dataset to the Hub and run evaluations on multiple models with consistent methodology (a minimal upload sketch follows this list).
- Test models across diverse datasets: For example, evaluate a question-answering model on several related QA datasets to see how it generalizes.
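For the custom-dataset workflow, getting data onto the Hub is typically a short script with the `datasets` library. The snippet below is a minimal sketch, not part of the announcement itself: the CSV file name and the repository ID `my-username/my-eval-dataset` are placeholders, and it assumes you are already authenticated (for example via `huggingface-cli login`).

```python
from datasets import load_dataset

# Load a local CSV with, say, "text" and "label" columns (placeholder file name).
dataset = load_dataset("csv", data_files="my_eval_data.csv")

# Push the dataset to the Hub so it gets a dataset page, from which
# Evaluation on the Hub can be launched. The repo ID is hypothetical.
dataset.push_to_hub("my-username/my-eval-dataset")
```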
The system integrates seamlessly with the Hugging Face ecosystem. Users launch evaluations from dataset pages, and the results are automatically posted as pull requests on model cards, embedding verified performance metrics in a standard format. As a practical demonstration, the team evaluated hundreds of models on key datasets and updated their model cards with results.
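Because the metrics are written into the model card's YAML metadata, they can also be read programmatically. The snippet below is a sketch assuming the `huggingface_hub` library and a hypothetical repository ID; it simply inspects whatever evaluation results a card already carries.

```python
from huggingface_hub import ModelCard

# "some-user/some-model" is a placeholder; substitute any model repository
# whose card includes evaluation results in its metadata.
card = ModelCard.load("some-user/some-model")

# card.data holds the YAML front matter; evaluation results are stored under
# the "model-index" block (task, dataset, and metric entries).
print(card.data.to_dict().get("model-index"))
```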
The tool also includes an advanced configuration mode for fine-tuning evaluation parameters, such as task type, dataset splits, column mappings, and metrics. For instance, users can select the F1 score, accuracy, or Matthews correlation coefficient for classification tasks. A worked example using an image dataset of dogs, muffins, and fried chicken illustrates the process step by step.
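To make those metric choices concrete, here is a small sketch using Hugging Face's `evaluate` library on toy binary-classification outputs. This illustrates what the metrics measure; it is not the tool's internal implementation, and the example labels are invented.

```python
import evaluate

# Toy binary classification outputs, purely for illustration.
references  = [0, 1, 1, 0, 1, 0]
predictions = [0, 1, 0, 0, 1, 1]

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")
mcc = evaluate.load("matthews_correlation")

print(accuracy.compute(predictions=predictions, references=references))
print(f1.compute(predictions=predictions, references=references))
print(mcc.compute(predictions=predictions, references=references))
```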
Evaluation on the Hub is now available as a Spaces app, inviting the community to participate in creating a more robust and transparent evaluation ecosystem.