DailyGlimpse

DeepSeek-V4 Breaks Barriers: Million-Token Contexts Achieved with Compressed Attention

AI
April 26, 2026 · 6:00 PM

DeepSeek AI has unveiled the DeepSeek-V4 series, a pair of Mixture-of-Experts (MoE) language models designed to make one-million-token context windows practical and cost-effective at inference time. DeepSeek-V4-Pro has 1.6 trillion total parameters with 49 billion activated per token; DeepSeek-V4-Flash has 284 billion total parameters with 13 billion activated per token. Both models natively support a context length of one million tokens and were trained on 33 trillion and 32 trillion tokens, respectively. Model checkpoints for all four variants, including the base versions, are publicly available on Hugging Face.
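
The gap between "total" and "activated" parameters comes from MoE routing: each token is dispatched to only a handful of experts, so only a small slice of the model runs per forward pass. Below is a minimal sketch of a generic top-k MoE feed-forward layer in PyTorch; the expert count, gating, and dispatch loop are illustrative assumptions, not DeepSeek-V4's published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k MoE feed-forward layer: E experts, only k run per token."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 64, k: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(d_model, num_experts)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). The router picks k experts per token.
        weights, idx = self.router(x).topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalize over the k picks
        out = torch.zeros_like(x)
        for slot in range(self.k):             # run only the selected experts
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

moe = TopKMoE(d_model=256, d_ff=1024, num_experts=64, k=4)
y = moe(torch.randn(8, 256))  # 8 tokens; each touches only 4 of the 64 experts
```

With 64 experts and k = 4, roughly a sixteenth of the expert parameters are active per token, which is the same mechanism, at far larger scale, behind V4-Pro activating 49B of 1.6T parameters.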

The core challenge DeepSeek-V4 addresses is the quadratic computational complexity of standard Transformer attention, which makes million-token sequences prohibitively expensive to serve. DeepSeek's solution rests on four key innovations: a hybrid attention architecture, a redesigned residual-connection scheme, a new optimizer, and FP4 quantization-aware training.
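
To see why dense attention breaks down at this scale, a quick back-of-the-envelope calculation helps. The model shape below (head size, head count, layer count) is an assumed illustration, not DeepSeek-V4's published specification.

```python
# Back-of-the-envelope cost of dense attention at one million tokens.
# The model shape here is assumed for illustration, not DeepSeek-V4's spec.
seq_len = 1_000_000
d_head, n_heads, n_layers = 128, 32, 64

# A naively materialized fp16 attention score matrix for ONE head:
scores_bytes = seq_len ** 2 * 2
print(f"one dense attention map: {scores_bytes / 1e12:.1f} TB")   # 2.0 TB

# Uncompressed fp16 KV cache for the full sequence (K and V):
kv_bytes = 2 * seq_len * n_layers * n_heads * d_head * 2
print(f"dense KV cache: {kv_bytes / 1e9:.0f} GB")                 # ~1049 GB

# Folding every m tokens into one compressed entry shrinks both the cache
# and the candidate set a sparse selector must score by a factor of m:
m = 16
print(f"KV cache compressed {m}x: {kv_bytes / m / 1e9:.0f} GB")   # ~66 GB
```

In practice, fused kernels avoid materializing the full score matrix, but compute still scales quadratically and KV-cache memory linearly with sequence length, which is exactly what the compressed-attention design attacks.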

The central architectural innovation is a hybrid attention mechanism that interleaves two schemes across Transformer layers: Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). CSA compresses the Key-Value (KV) cache of every m consecutive tokens into a single entry using a learned token-level compressor, then applies DeepSeek Sparse Attention (DSA), in which each query token attends only to the top-k selected compressed KV entries. A component called the Lightning Indexer performs this sparse selection by scoring queries against the compressed KV blocks. HCA follows the same recipe but is more aggressive, consolidating the KV entries of every m' tokens (where m' ≫ m) into a single compressed entry. Both variants also include a sliding-window attention branch over the most recent n_win tokens to capture local dependencies, as in the sketch below.
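
To make the mechanism concrete, here is a minimal single-head, single-query sketch in PyTorch. The article specifies only the high-level design, so the linear compressor, the indexer's dot-product scoring, the shared compressor for K and V, and the simple additive merge of the two branches are all illustrative assumptions, not DeepSeek's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompressedSparseAttentionSketch(nn.Module):
    """One CSA head for a single query: compress the KV cache every m tokens,
    let an indexer pick the top-k compressed entries, and add a raw
    sliding-window branch over the most recent n_win tokens."""
    def __init__(self, d: int, m: int = 16, top_k: int = 64, n_win: int = 512):
        super().__init__()
        self.m, self.top_k, self.n_win = m, top_k, n_win
        self.compress = nn.Linear(m * d, d)  # stand-in for the learned compressor
        self.indexer = nn.Linear(d, d)       # stand-in for the Lightning Indexer

    def forward(self, q, k, v):
        # q: (1, d) current query; k, v: (T, d) per-token cache, T >= m.
        T, d = k.shape
        n_blocks = T // self.m

        # 1. Fold every m-token block of K and V into one compressed entry
        #    (one compressor for both, to keep the sketch short).
        kc = self.compress(k[: n_blocks * self.m].reshape(n_blocks, self.m * d))
        vc = self.compress(v[: n_blocks * self.m].reshape(n_blocks, self.m * d))

        # 2. Indexer scores the query against compressed keys; keep the top-k.
        scores = kc @ self.indexer(q).squeeze(0)
        idx = scores.topk(min(self.top_k, n_blocks)).indices

        # 3. Sparse attention over the selected compressed entries only.
        att = F.softmax(q @ kc[idx].T / d ** 0.5, dim=-1)
        sparse_out = att @ vc[idx]

        # 4. Sliding-window branch over the most recent raw tokens.
        kw, vw = k[-self.n_win:], v[-self.n_win:]
        att_w = F.softmax(q @ kw.T / d ** 0.5, dim=-1)
        local_out = att_w @ vw

        return sparse_out + local_out  # additive merge is an assumption

q, k, v = torch.randn(1, 64), torch.randn(4096, 64), torch.randn(4096, 64)
out = CompressedSparseAttentionSketch(d=64)(q, k, v)  # -> (1, 64)
```

Under this reading, HCA is the same recipe with the much larger block size m' in place of m, trading retrieval granularity for an even smaller cache; the sliding-window branch is what preserves fine-grained recent context in both variants.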

This release marks a significant step toward making long-context language models viable for real-world applications, eliminating a major bottleneck in scaling AI systems to handle extensive documents, codebases, or conversations.