Moonshot AI, the developer behind the Kimi.ai assistant, has open-sourced FlashKDA, a high-performance CUDA kernel library for Kimi Delta Attention. Released under the MIT license on GitHub, FlashKDA delivers prefill speedups of 1.72x to 2.22x over standard baselines on NVIDIA H20 GPUs and serves as a drop-in replacement for the flash-linear-attention library.
Kimi Delta Attention (KDA) is a linear attention mechanism that refines Gated DeltaNet by replacing its scalar, per-head decay with channel-wise gating, letting each feature dimension of the recurrent state forget at its own rate. Because the mechanism scales linearly with sequence length, it sidesteps the quadratic cost of standard softmax attention. KDA is central to Moonshot's Kimi Linear hybrid model (48B total parameters, 3B activated), which interleaves KDA and MLA layers in a 3:1 ratio to cut KV cache usage by 75% and deliver up to 6x higher decoding throughput at a 1-million-token context.
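To make the mechanism concrete, here is a minimal single-head reference recurrence for a channel-wise gated delta rule, written in plain PyTorch for illustration. The tensor layout, argument names, and the step-by-step loop are assumptions chosen for readability; FlashKDA's actual kernels process the sequence in parallel chunks rather than one timestep at a time.

```python
import torch

def kda_recurrence(q, k, v, g, beta, scale):
    # Illustrative sketch only, not FlashKDA's implementation.
    # q, k: (T, d_k); v: (T, d_v)
    # g: (T, d_k) per-channel log-decay gates (assumed layout)
    # beta: (T,) delta-rule update strengths in (0, 1)
    T, d_k = k.shape
    d_v = v.shape[-1]
    S = torch.zeros(d_k, d_v)                 # fixed-size recurrent state, replaces the KV cache
    out = torch.empty(T, d_v)
    for t in range(T):
        S = g[t].exp().unsqueeze(-1) * S      # channel-wise decay: a diagonal gate, not a scalar
        err = v[t] - k[t] @ S                 # delta-rule prediction error
        S = S + beta[t] * torch.outer(k[t], err)
        out[t] = (scale * q[t]) @ S           # linear-attention readout
    return out
```

Because the state S has a fixed d_k x d_v size regardless of T, per-token compute and memory stay constant as the sequence grows, which is where the linear scaling comes from.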
FlashKDA implements an efficient KDA forward pass, taking queries, keys, values, channel-wise gates, beta logits, and a scale factor as inputs, along with optional recurrent states for multi-turn inference. The kernels are built on NVIDIA's CUTLASS framework and target the Hopper architecture, optimizing prefill performance and making long-context linear attention models more practical for production use.
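Given that interface description, a usage sketch might look like the following. The module name flash_kda, the function kda_forward, and its exact signature are hypothetical stand-ins inferred from the article's list of inputs and from flash-linear-attention's chunked-kernel conventions; they are not confirmed against the repository.

```python
import torch
from flash_kda import kda_forward  # hypothetical import, for illustration

B, T, H, Dk, Dv = 1, 4096, 8, 128, 128
q = torch.randn(B, T, H, Dk, device="cuda", dtype=torch.bfloat16)
k = torch.randn(B, T, H, Dk, device="cuda", dtype=torch.bfloat16)
v = torch.randn(B, T, H, Dv, device="cuda", dtype=torch.bfloat16)
g = torch.randn(B, T, H, Dk, device="cuda", dtype=torch.float32)   # channel-wise gate logits
beta = torch.randn(B, T, H, device="cuda", dtype=torch.float32)    # beta logits

# initial_state would carry the recurrent state from a previous turn,
# so a new turn resumes without reprocessing prior context.
out, final_state = kda_forward(q, k, v, g, beta,
                               scale=Dk ** -0.5, initial_state=None)
```

The initial_state/final_state pair corresponds to the optional recurrent states the article describes for multi-turn inference: the fixed-size state from one turn seeds the next.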