Wav2Vec2, the state-of-the-art speech recognition model from Meta AI, has been a game-changer since its release in September 2020. With over 250,000 monthly downloads on Hugging Face, it's widely used, but it has a critical limitation: the Transformer architecture's quadratic attention cost means running the model on long audio files can quickly exhaust memory. This post explains how to overcome that using chunking with stride, a technique that leverages the Connectionist Temporal Classification (CTC) architecture.
The Problem: Long Audio = Crash
Standard Wav2Vec2 inference on a sufficiently long file will crash due to its O(n²) memory usage. For example:
```python
from transformers import pipeline

pipe = pipeline(model="facebook/wav2vec2-base-960h")
pipe("very_long_file.mp3")  # Out of memory!
```
However, adding `chunk_length_s=10` makes the same call work, as shown below. The rest of this article explains how.
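The fix is a single extra argument to the pipeline call (the file name is a placeholder, as above):

```python
pipe("very_long_file.mp3", chunk_length_s=10)  # Works!
```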
Simple Chunking (and Why It Fails)
The naive approach is to split the audio into short chunks (e.g., 10 seconds), transcribe each one separately, and concatenate the results. This is computationally efficient, but quality degrades near chunk boundaries, where the model lacks context. Attempts to cut only at silence points are unreliable in practice.
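To make the failure mode concrete, here is a minimal sketch of naive chunking using the pipeline defined above; the 10-second chunk size and the use of librosa for loading are illustrative choices, not part of any pipeline API.

```python
import librosa

# Load the audio at the sampling rate the model expects (16 kHz for Wav2Vec2).
audio, sr = librosa.load("very_long_file.mp3", sr=16000)

chunk_s = 10
chunk_samples = int(chunk_s * sr)

# Naive chunking: split, transcribe each piece in isolation, concatenate.
# Words straddling a boundary get cut in half, so errors pile up at the seams.
texts = []
for start in range(0, len(audio), chunk_samples):
    chunk = audio[start : start + chunk_samples]
    texts.append(pipe({"raw": chunk, "sampling_rate": sr})["text"])

naive_transcription = " ".join(texts)
```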
Chunking with Stride (The Solution)
A CTC model emits one set of logits per audio frame, and each frame is decoded independently into a single character (or a blank token). This lets us:
- Process overlapping chunks so the model has full context in the center.
- Discard the logits from the overlapping edges (the "stride").
- Chain the remaining logits to reconstruct a near-perfect transcription.
In practice, most errors are confined to the discarded strides. Enabling this in the pipeline is simple:
```python
output = pipe("very_long_file.mp3", chunk_length_s=10, stride_length_s=(4, 2))
```
Here, each chunk is 10 seconds long with a 4-second left stride and a 2-second right stride, so only the logits for the central 4 seconds of each chunk are kept (the first and last chunks also keep their outer edge).
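Under the hood, the strides are converted from seconds to model frames and the corresponding logits are dropped before decoding. The sketch below reimplements that idea by hand purely for illustration: the 320-samples-per-logit ratio is the usual Wav2Vec2 feature-extractor downsampling factor, and the helper name `chunked_logits` is made up rather than taken from the pipeline internals.

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

SR = 16000               # sampling rate expected by the model
SAMPLES_PER_LOGIT = 320  # Wav2Vec2 downsamples roughly 320 audio samples per logit

def chunked_logits(audio, chunk_s=10, stride_s=(4, 2)):
    """Run the model on overlapping chunks and keep only the central logits."""
    chunk = chunk_s * SR
    left, right = stride_s[0] * SR, stride_s[1] * SR
    step = chunk - left - right
    kept = []
    for start in range(0, len(audio), step):
        piece = audio[start : start + chunk]
        is_last = start + chunk >= len(audio)
        inputs = processor(piece, sampling_rate=SR, return_tensors="pt")
        with torch.no_grad():
            logits = model(inputs.input_values).logits[0]
        # Drop logits that came from the striding context, except at the outer edges.
        keep_from = 0 if start == 0 else left // SAMPLES_PER_LOGIT
        keep_to = logits.shape[0] if is_last else logits.shape[0] - right // SAMPLES_PER_LOGIT
        kept.append(logits[keep_from:keep_to])
        if is_last:
            break
    return torch.cat(kept)

# Decode the chained logits exactly as for a single short file, e.g.:
# text = processor.decode(chunked_logits(audio).argmax(dim=-1))
```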
Chunking with Language Models
Wav2Vec2 models can be augmented with an n-gram language model to improve word error rate. Since the LM decoder operates on the same logits, the stride technique works without modification: just pass the same `chunk_length_s` (and `stride_length_s`) arguments.
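As an example, the call looks the same with a checkpoint that ships an n-gram decoder; the checkpoint name below is one published example, and LM decoding additionally requires the `pyctcdecode` and `kenlm` packages to be installed.

```python
from transformers import pipeline

# An example checkpoint that bundles an n-gram LM (requires pyctcdecode + kenlm).
pipe_with_lm = pipeline(model="patrickvonplaten/wav2vec2-base-100h-with-lm")

# Chunking with stride works exactly as before; the LM decodes the chained logits.
output = pipe_with_lm("very_long_file.mp3", chunk_length_s=10, stride_length_s=(4, 2))
```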
Live Inference
CTC models are single-pass (non-autoregressive) and very fast, making them well suited to live transcription. By re-running inference each time a small amount of new audio arrives (e.g., every second on a sliding 10-second window), the model can output text incrementally, reducing perceived latency. This requires more inference steps, but enables real-time display.
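A rough way to prototype this without a microphone stack is to slide a window over an incoming buffer and re-transcribe it on every hop. Everything below (the window and hop lengths, the simulated stream) is an illustrative assumption rather than a dedicated streaming API.

```python
import numpy as np
from transformers import pipeline

pipe = pipeline(model="facebook/wav2vec2-base-960h")

SR = 16000
WINDOW_S, HOP_S = 10, 1  # 10-second context window, refreshed every second

buffer = np.zeros(0, dtype=np.float32)

def on_new_audio(samples: np.ndarray) -> str:
    """Append freshly captured samples and re-transcribe the last WINDOW_S seconds."""
    global buffer
    buffer = np.concatenate([buffer, samples])
    window = buffer[-WINDOW_S * SR:]
    return pipe({"raw": window, "sampling_rate": SR})["text"]

# Simulated stream: feed 1-second slices of a pre-recorded array `audio`
# (e.g., loaded with librosa at 16 kHz) and print the rolling transcription.
# for i in range(0, len(audio), HOP_S * SR):
#     print(on_new_audio(audio[i : i + HOP_S * SR]))
```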