Whisper, OpenAI's pre-trained model for automatic speech recognition (ASR), has set new benchmarks thanks to training on 680,000 hours of labeled audio-transcription data, including 117,000 hours of multilingual data. This enables it to support over 96 languages, many of which are low-resource. Unlike earlier models such as Wav2Vec 2.0, which learn intermediate speech representations through unsupervised pre-training, Whisper learns a direct speech-to-text mapping, so it requires less fine-tuning to achieve strong performance.
In a detailed guide, Hugging Face demonstrates how to fine-tune Whisper for any multilingual ASR dataset using the 🤗 Transformers library. The blog covers the Whisper model architecture, the Common Voice dataset, and the theory behind fine-tuning, along with executable code for data preparation and training. A streamlined version is available as a Google Colab notebook.
Whisper adopts a Transformer-based encoder-decoder (sequence-to-sequence) architecture. The encoder processes log-Mel spectrograms derived from raw audio and produces hidden-state representations. The decoder then autoregressively predicts text tokens, conditioned on both the previously generated tokens and the encoder states, effectively embedding a language model internally. The whole system is trained end-to-end with a cross-entropy loss on the predicted text tokens.
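To make the encoder input concrete, here is a minimal NumPy sketch of a log-Mel spectrogram front end in the same spirit as Whisper's (16 kHz audio, 25 ms windows, 10 ms hop, 80 mel bins). The `log_mel_spectrogram` function and its filterbank construction are simplified illustrations, not Whisper's actual preprocessing code:

```python
import numpy as np

def log_mel_spectrogram(audio, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Toy log-Mel spectrogram mirroring Whisper's front-end settings.
    Simplified sketch: the real model uses a carefully tuned filterbank."""
    # Frame the signal and apply a Hann window
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Triangular mel filterbank (textbook construction)
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2),
                                    n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        if center > left:
            fb[m - 1, left:center] = (np.arange(left, center) - left) / (center - left)
        if right > center:
            fb[m - 1, center:right] = (right - np.arange(center, right)) / (right - center)
    mel = power @ fb.T
    # Log compression, floored to avoid log(0)
    return np.log10(np.maximum(mel, 1e-10))

spec = log_mel_spectrogram(np.random.randn(16000))  # 1 second of noise
print(spec.shape)  # (98, 80): 98 frames x 80 mel bins
```

The encoder consumes a matrix like this (frames by mel bins) rather than the raw waveform, which is why the feature extractor is configured before training.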
The model comes in five sizes (tiny, base, small, medium, large), with multilingual checkpoints for each. For demonstration, the guide fine-tunes the multilingual 'small' checkpoint (244M parameters) on a low-resource language from the Common Voice dataset, showing that competitive results can be achieved with as little as 8 hours of fine-tuning data.
Key steps include:
- Environment Setup: Installing required libraries.
- Dataset Loading: Accessing Common Voice data.
- Data Preparation: Configuring feature extractor, tokenizer, and data collator.
- Training & Evaluation: Using the Trainer API.
- Demo Creation: Building an interactive Gradio app.
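Within the data-preparation step, the data collator pads the audio features and the label token IDs independently. A framework-free sketch of the label-side logic is below; the `pad_labels` helper and the pad token ID are illustrative only, since the guide's actual collator uses the processor's padding methods and returns PyTorch tensors:

```python
def pad_labels(label_sequences, pad_token_id=50257):
    """Pad tokenized transcriptions to equal length, then mask the padding
    with -100 so the cross-entropy loss ignores those positions.
    (-100 is the default ignore index in PyTorch's cross-entropy loss.)"""
    max_len = max(len(seq) for seq in label_sequences)
    batch = []
    for seq in label_sequences:
        padded = seq + [pad_token_id] * (max_len - len(seq))
        batch.append([tok if i < len(seq) else -100
                      for i, tok in enumerate(padded)])
    return batch

print(pad_labels([[7, 8, 9], [7, 8]]))  # [[7, 8, 9], [7, 8, -100]]
```

Masking padding with the loss's ignore index ensures the model is only penalized on real transcription tokens, not on batch-alignment filler.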
This resource empowers developers to adapt Whisper for custom multilingual ASR tasks efficiently.