IBM has released two new open-source speech recognition models, Granite Speech 4.1 2B and Granite Speech 4.1 2B-NAR, designed to deliver high accuracy without the heavy compute demands typically associated with production-grade ASR systems. Both models are available on Hugging Face under the Apache 2.0 license.
The Granite Speech 4.1 2B model supports multilingual automatic speech recognition (ASR) and bidirectional automatic speech translation (AST) across English, French, German, Spanish, Portuguese, and Japanese. Its non-autoregressive counterpart, Granite Speech 4.1 2B-NAR, is optimized for low-latency ASR tasks and supports English, French, German, Spanish, and Portuguese, but not Japanese. A third variant, Granite Speech 4.1 2B-Plus, adds speaker-attributed ASR and word-level timestamps.
On the Open ASR Leaderboard (as of April 2026), Granite Speech 4.1 2B achieves a mean Word Error Rate (WER) of 5.33%, with particularly strong performance on LibriSpeech clean (1.33% WER) and LibriSpeech other (2.5% WER).
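For readers unfamiliar with the metric, WER is the number of word substitutions, deletions, and insertions needed to turn the model's transcript into the reference, divided by the number of words in the reference. The sketch below computes it with a word-level edit distance. The function name `word_error_rate` and the sample sentences are illustrative only, and the code skips the text normalization (casing, punctuation) that benchmark scoring usually applies, so it will not reproduce leaderboard figures exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missing)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
    print(f"WER: {wer:.2%}")
```

Read this way, a 5.33% mean WER corresponds to roughly one word error for every 19 reference words.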
The architecture consists of three components: a speech encoder using 16 conformer blocks trained with Connectionist Temporal Classification (CTC), a speech-text modality adapter featuring a 2-layer window query transformer (Q-Former), and a language model. The encoder uses frame importance sampling to focus on informative audio segments, while the Q-Former compresses acoustic features by a factor of 10, producing a 10Hz embedding rate for the LLM.
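To make the 10Hz figure concrete, the short sketch below estimates how many audio embeddings the language model would receive for a clip of a given length. Only the 10Hz rate and the compression factor of 10 come from the description above. The function names, the rounding up of partial frames, and the inference that features arrive at about 100 frames per second before compression are assumptions for illustration, not details confirmed by IBM.

```python
import math

EMBEDDING_RATE_HZ = 10    # embedding rate seen by the LLM, as stated in the article
COMPRESSION_FACTOR = 10   # temporal compression factor, as stated in the article


def implied_input_frame_rate_hz() -> int:
    """Frame rate before compression, assuming the factor of 10 applies end to end."""
    return EMBEDDING_RATE_HZ * COMPRESSION_FACTOR


def llm_audio_embeddings(duration_seconds: float) -> int:
    """Approximate number of audio embeddings passed to the LLM.

    Rounding up assumes a partial final window is padded rather than dropped;
    the real model may handle clip boundaries differently.
    """
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    return math.ceil(duration_seconds * EMBEDDING_RATE_HZ)


if __name__ == "__main__":
    for seconds in (5, 30, 60):
        print(f"{seconds:>3} s of audio -> about {llm_audio_embeddings(seconds)} embeddings")
    print(f"Implied pre-compression rate: {implied_input_frame_rate_hz()} frames/s")
```

Under these assumptions a 30-second clip occupies about 300 positions in the language model's input, which is part of why the compression step matters for inference cost.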