Introduction
🤗 Datasets is an open-source library for downloading and preparing datasets across all domains. Its minimal API lets you load a dataset in one line of Python and provides efficient pre-processing functions. It gives access to a large and growing collection of audio datasets, which makes it a practical starting point for researchers and practitioners. This guide covers the audio-specific features that streamline working with audio data.
The Hub
The Hugging Face Hub hosts open-source models, datasets, and demos. It includes a growing collection of audio datasets for various domains, tasks, and languages. Thanks to tight integration with 🤗 Datasets, any dataset on the Hub can be downloaded with a single line of code.
Datasets on the Hub can be filtered by task.
At the time of writing, there are 77 speech recognition datasets and 28 audio classification datasets, and the numbers keep growing. Clicking on a dataset such as common_voice opens its dataset card, which shows details about the dataset, models trained on it, and a preview of audio samples that you can play directly in the browser.
Load an Audio Dataset
The load_dataset function handles downloading, extracting, and preparing data in one step. For example, to load the GigaSpeech dataset:
from datasets import load_dataset
# Download and prepare the "xs" configuration of GigaSpeech
gigaspeech = load_dataset("speechcolab/gigaspeech", "xs")
# Show the available splits
print(gigaspeech)
Output:
DatasetDict({
    train: Dataset({...})
    validation: Dataset({...})
    test: Dataset({...})
})
The returned DatasetDict acts like a Python dictionary. Access the training split with gigaspeech["train"] and the first item with gigaspeech["train"][0].
Easy to Load, Easy to Process
Audio datasets include an audio feature that stores the audio array, sampling rate, and path. You can cast audio columns to a desired sampling rate and apply transformations efficiently using map. The library also supports streaming mode, which loads data on the fly without downloading the entire dataset.
Streaming Mode: The Silver Bullet
Streaming mode is ideal for huge datasets that would take too long to download or too much disk space to store. Enable it by passing streaming=True to load_dataset. Samples are then fetched progressively as you iterate over the dataset, so you can start working with the data without downloading it in full.
A Tour of Audio Datasets on the Hub
Explore datasets by task or language. Use the Hub's filters to find datasets like common_voice (multilingual speech recognition) or esc50 (audio classification). Each dataset card includes a preview, making it easy to assess quality before use.
Closing Remarks
🤗 Datasets simplifies audio dataset handling from download to preprocessing. With one-line loading, built-in audio features, and streaming, it's an essential tool for audio machine learning projects.