DailyGlimpse

Huggy Lingo: Using Machine Learning to Enhance Language Metadata on Hugging Face Hub

AI
April 26, 2026 · 4:46 PM

The Hugging Face Hub has become a central repository for machine learning models, datasets, and applications. As the number of datasets grows, accurate metadata is crucial for discoverability. However, a significant portion of datasets lack language metadata, making them hard to find. In a new experiment, Huggy Lingo uses machine learning to automatically detect the languages of datasets and update their metadata via librarian bots.

Language Metadata for Datasets

Currently, around 50,000 public datasets are hosted on the Hugging Face Hub. Only about 13% include language metadata, and English is the most common tag: roughly 19% of the datasets that specify a language list en. The remaining 87% have no language information at all, hindering search and filtering.
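A coverage figure like this can be reproduced by listing public datasets and checking their tags. The sketch below is an assumption about how one might do it with the huggingface_hub client (the `language:` tag prefix and the `tags` attribute on dataset info objects are how the Hub currently exposes card metadata); the network call is left as a comment so the helper stays self-contained.

```python
def language_tag_coverage(dataset_infos):
    """Return (tagged, total): how many datasets carry at least one
    'language:*' tag versus the total number inspected."""
    tagged = sum(
        1
        for info in dataset_infos
        if any(tag.startswith("language:") for tag in (getattr(info, "tags", None) or []))
    )
    return tagged, len(dataset_infos)

# Usage against the live Hub (requires `pip install huggingface_hub`
# and iterates over every public dataset, so it takes a while):
#
#   from huggingface_hub import HfApi
#   infos = list(HfApi().list_datasets())
#   tagged, total = language_tag_coverage(infos)
#   print(f"{tagged}/{total} datasets ({tagged / total:.1%}) declare a language")
```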

Language metadata is vital for finding datasets relevant to specific languages, especially for training LLMs in underrepresented languages. It also helps identify language biases on the Hub.

Predicting Languages with Machine Learning

To address this, the team developed a method to sample text from datasets using the dataset viewer API, avoiding full downloads. For each dataset, they fetch 20 rows from likely text columns (e.g., text, prompt) and pass them to Meta's fastText language identification model, which supports 217 languages.
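The sampling step above can be sketched with the dataset viewer's `first-rows` endpoint plus a couple of pure helpers. This is an illustrative approximation, not the team's actual code: the list of "likely text columns" is an assumption, and the fastText prediction step (which needs the `fasttext` package and the `facebook/fasttext-language-identification` checkpoint from the Hub) is shown as comments so the snippet runs without extra dependencies.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Column names treated as "likely text" -- an assumption for illustration.
LIKELY_TEXT_COLUMNS = ("text", "prompt", "sentence", "question", "answer")


def fetch_first_rows(dataset, config="default", split="train"):
    """Sample rows via the dataset viewer API instead of downloading the dataset."""
    query = urlencode({"dataset": dataset, "config": config, "split": split})
    url = f"https://datasets-server.huggingface.co/first-rows?{query}"
    with urlopen(url) as resp:
        return json.load(resp)["rows"]  # list of {"row_idx": ..., "row": {...}}


def pick_text_samples(rows, columns=LIKELY_TEXT_COLUMNS):
    """Collect non-empty string values from likely text columns."""
    samples = []
    for item in rows:
        row = item.get("row", {})
        for col in columns:
            value = row.get(col)
            if isinstance(value, str) and value.strip():
                # fastText expects single-line input.
                samples.append(value.replace("\n", " "))
    return samples


def strip_label(ft_label):
    """fastText returns labels like '__label__eng_Latn'; keep the language code."""
    return ft_label.removeprefix("__label__")

# Prediction step (requires `pip install fasttext huggingface_hub`):
#
#   import fasttext
#   from collections import Counter
#   from huggingface_hub import hf_hub_download
#
#   model = fasttext.load_model(
#       hf_hub_download("facebook/fasttext-language-identification", "model.bin"))
#   texts = pick_text_samples(fetch_first_rows("some-user/some-dataset"))[:20]
#   labels, _ = model.predict(texts)
#   top_language = Counter(strip_label(l[0]) for l in labels).most_common(1)
```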

Updating Metadata with Librarian Bots

Once predictions are made, librarian bots create pull requests to add the identified language tags to dataset cards. This automated process aims to enhance metadata quality across the Hub.
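A bot-style metadata update of this kind could be sketched with huggingface_hub's `metadata_update`, which can open a pull request rather than committing directly. The merge helper below is an assumption about how predicted tags might be combined with any existing ones without clobbering them; the repo id is hypothetical, and the PR call itself is commented out because it needs a write token.

```python
def merged_language_metadata(existing, predicted):
    """Build a metadata patch adding predicted language codes that are not
    already on the dataset card. Returns None when there is nothing to add."""
    current = existing.get("language") or []
    if isinstance(current, str):  # cards may store a single code as a string
        current = [current]
    new = [lang for lang in predicted if lang not in current]
    if not new:
        return None
    return {"language": current + new}

# Opening the pull request (requires huggingface_hub and a write token):
#
#   from huggingface_hub import metadata_update
#   patch = merged_language_metadata({}, ["en"])
#   if patch:
#       metadata_update("some-user/some-dataset",  # hypothetical repo id
#                       patch, repo_type="dataset", create_pr=True)
```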

"We're using machine learning to detect the language of Hub datasets with no language metadata, and librarian bots to make pull requests to add this metadata."

This initiative improves dataset discoverability and helps the community better understand language representation on the Hub.