The Qwen team has released FlashQLA, a high-performance linear attention kernel library that achieves up to a 3x speedup on NVIDIA Hopper GPUs compared to existing implementations. Designed specifically for the Gated Delta Network (GDN) attention mechanism used in the Qwen3.5 and Qwen3.6 model families, the library optimizes memory layout, instruction scheduling, and hardware utilization to make linear attention more efficient.
Linear attention reduces the complexity of standard softmax attention from O(n²) to O(n), making it far more scalable for long sequences. GDN uses an exponentially decaying gate to control how context propagates through the recurrent state, and FlashQLA exploits this formulation for performance gains. Before FlashQLA, the standard GDN implementation relied on the Flash Linear Attention (FLA) library with Triton kernels. FlashQLA, built on the TileLang compiler framework, supersedes that approach with custom-designed kernels fine-tuned for NVIDIA Hopper GPUs.
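To make the O(n) claim concrete, here is a minimal, non-fused reference sketch of a gated delta-rule recurrence of the kind GDN uses. This is an illustrative NumPy implementation written from the published gated delta rule, not FlashQLA's actual kernel code or API; the function name, shapes, and the exact placement of the decay gate `alpha` and write strength `beta` are assumptions for illustration.

```python
import numpy as np

def gated_delta_attention(q, k, v, alpha, beta):
    """Illustrative (non-fused) sketch of a gated delta-rule recurrence.

    Shapes (single head):
      q, k : (T, d_k)   queries / keys
      v    : (T, d_v)   values
      alpha: (T,)       per-step decay gate in (0, 1)
      beta : (T,)       per-step write strength in (0, 1)

    The state S has fixed shape (d_v, d_k), so each step costs
    O(d_k * d_v) regardless of sequence length -- this is what makes
    the overall recurrence linear in T, versus O(T^2) for softmax
    attention.
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))
    o = np.zeros((T, d_v))
    for t in range(T):
        # Decay the old state, erase the old association along k_t,
        # then write the new (k_t -> v_t) association.
        S = alpha[t] * S @ (np.eye(d_k) - beta[t] * np.outer(k[t], k[t])) \
            + beta[t] * np.outer(v[t], k[t])
        # Read out against the current query.
        o[t] = S @ q[t]
    return o
```

A real kernel library like FlashQLA would chunk this sequential loop into blockwise matrix multiplications and fuse the gating into the tiles to keep the GPU's tensor cores busy; the sketch above only shows the underlying recurrence.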
FlashQLA is released under the MIT License and is available as an open-source project. It directly addresses the computational bottleneck of long-context LLM inference and training, a critical area in advancing AI efficiency.