Reformer: Training Transformer Models on Sequences of Half a Million Tokens with Under 8GB RAM

The Reformer model, introduced by Kitaev, Kaiser et al. (2020), stands as one of the most memory-efficient transformer architectures for long-sequence modeling. It can process up to half a million tokens at once using less than 8GB of RAM, a significant leap from conventional models like BERT, which are limited to 512 tokens. This efficiency is achieved through four key innovations: a novel self-attention layer, chunked feed-forward layers, reversible residual layers, and axial positional encodings. Each component is optimized to reduce memory without sacrificing performance, making the Reformer ideal for tasks such as summarization and question answering that require processing extensive input sequences.

The Reformer's self-attention layer comes in two variants: local self-attention and locality-sensitive hashing (LSH) self-attention. Local attention restricts each token to a nearby window, reducing the cost from quadratic to linear in sequence length. LSH attention goes further: it hashes query and key vectors so that similar tokens fall into the same bucket, and each token attends only to tokens in its bucket, bringing the complexity down to roughly O(L log L). This approximation preserves modeling quality while drastically cutting memory usage.
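For readers using the Hugging Face Transformers library, the two attention variants are chosen per layer through the model configuration. The snippet below is only a minimal sketch: the layer pattern, chunk lengths, and number of hashing rounds are illustrative placeholders, not recommended settings.

```python
from transformers import ReformerConfig, ReformerModel

# Illustrative Reformer configuration; the values below are placeholders, not tuned settings.
config = ReformerConfig(
    attn_layers=["local", "lsh", "local", "lsh"],  # attention variant used by each layer
    local_attn_chunk_length=64,   # window size for local self-attention
    lsh_attn_chunk_length=64,     # chunk length used inside LSH self-attention
    num_hashes=2,                 # more hashing rounds -> better approximation, more compute
    is_decoder=True,              # causal (left-to-right) attention
)
model = ReformerModel(config)
```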

Chunked feed-forward layers address the memory bottleneck of the wide feed-forward networks that follow each attention layer. Because the feed-forward layer processes every position independently, the input sequence can be split into chunks of positions that are pushed through the network one after another; the result is mathematically identical, but only one chunk's intermediate activations need to be held in memory at a time, at the cost of some extra computation from the sequential processing. Reversible residual layers eliminate the need to store activations for backpropagation: the architecture lets each layer's inputs be recomputed from its outputs during the backward pass, saving substantial memory in deep networks.
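To make the chunking idea concrete, here is a small, self-contained PyTorch sketch (not the Hugging Face implementation; the module sizes and chunk size are arbitrary illustrative choices). Splitting the sequence into chunks yields exactly the same output while only one chunk's wide intermediate activations exist at a time; in the Reformer this is combined with reversible layers so the savings carry over to the backward pass as well.

```python
import torch
import torch.nn as nn

def chunked_feed_forward(ff: nn.Module, x: torch.Tensor, chunk_size: int) -> torch.Tensor:
    """Apply a position-wise feed-forward module to x of shape (batch, seq_len, d_model)
    one chunk of positions at a time, so only one chunk's intermediate activations
    are alive at once."""
    if chunk_size <= 0:
        return ff(x)                              # no chunking
    chunks = x.split(chunk_size, dim=1)           # split along the sequence axis
    return torch.cat([ff(chunk) for chunk in chunks], dim=1)

# Toy usage: a wide feed-forward block whose intermediate activations dominate memory.
d_model, d_ff = 256, 4096
ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
x = torch.randn(1, 2048, d_model)
out = chunked_feed_forward(ff, x, chunk_size=256)  # same result as ff(x), lower peak memory
```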

Finally, axial positional encodings make it practical to encode positions for extremely long sequences by factoring the single large position-embedding matrix into two much smaller ones. Instead of one L × d table, the model stores two tables of shapes L1 × d1 and L2 × d2, where L1 × L2 = L and d1 + d2 = d, cutting the parameter count for positional encodings from L × d down to L1 × d1 + L2 × d2. That reduction is what makes encoding positions for half a million tokens feasible.
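The factorization is easy to see in a few lines of PyTorch. The sketch below is purely illustrative (a toy grid rather than the Hugging Face implementation, with made-up shapes): two small tables are broadcast over a position grid, concatenated along the feature dimension, and flattened back into per-position encodings.

```python
import torch
import torch.nn as nn

# Illustrative axial positional encoding; all shapes are examples.
# A length-L sequence is viewed as an L1 x L2 grid (L = L1 * L2), and two small
# learned tables with feature sizes d1 and d2 (d = d1 + d2) replace one L x d table.
L1, L2 = 128, 64            # toy grid: L = 8,192 positions
d1, d2 = 64, 192            # d = 256 model dimensions

emb_row = nn.Parameter(torch.randn(L1, 1, d1))   # varies along the first grid axis
emb_col = nn.Parameter(torch.randn(1, L2, d2))   # varies along the second grid axis

# Broadcast each table across the other axis, concatenate features, flatten to (L, d).
pos = torch.cat(
    [emb_row.expand(L1, L2, d1), emb_col.expand(L1, L2, d2)], dim=-1
).reshape(L1 * L2, d1 + d2)
print(pos.shape)  # torch.Size([8192, 256])

# Parameter count: L1*d1 + L2*d2 = 128*64 + 64*192 = 20,480 vs. L*d = 2,097,152.
# With L1 = 512, L2 = 1024 (L = 524,288 positions), the factored tables need about
# 0.23M parameters versus roughly 134M for a full position-embedding table.
```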

These combined innovations allow the Reformer to push the boundaries of long-sequence modeling, enabling researchers to train on datasets that were previously out of reach. The model is available in the Hugging Face Transformers library, and Hugging Face's detailed blog post on the Reformer provides further configuration guidance.
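As a starting point, a pretrained Reformer can be loaded directly from the Transformers library. The example below is a minimal sketch: it assumes the Reformer checkpoint trained on Crime and Punishment that is published on the Hugging Face Hub ("google/reformer-crime-and-punishment"), and the prompt and generation settings are arbitrary.

```python
from transformers import ReformerModelWithLMHead, ReformerTokenizer

# Assumed checkpoint: the Reformer trained on Crime and Punishment, published on the Hub.
model_id = "google/reformer-crime-and-punishment"
tokenizer = ReformerTokenizer.from_pretrained(model_id)
model = ReformerModelWithLMHead.from_pretrained(model_id)

# Generate a continuation of a short prompt (sampling settings are arbitrary).
inputs = tokenizer("A few months later", return_tensors="pt")
output_ids = model.generate(inputs["input_ids"], max_length=120, do_sample=True)
print(tokenizer.decode(output_ids[0]))
```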