Qwen Team Releases FlashQLA: A High-Performance Attention Kernel Library for Up to 3× Speedup on NVIDIA Hopper GPUs

The race to make large language models faster and cheaper to run has been fought on two fronts: model architecture and hardware. But there is a third, often underestimated frontier – the GPU kernel. A kernel is the low-level program that actually performs mathematical operations on the GPU. Writing a good one requires understanding not just the math but also the architecture, instruction set, and hardware characteristics of the chip you're targeting. Most ML engineers never write kernels directly; they rely on libraries like FlashAttention or compilers like Triton to do it for them.
Enter FlashQLA: QwenLM's contribution to this space. Released under the MIT License and built on the TileLang compiler framework, it is a high-performance kernel library specifically optimized for the Gated Delta Network (GDN) attention mechanism – the linear attention component that powers the Qwen3.5 and Qwen3.6 model families.
What Is Linear Attention and Why Is It Important?
To understand what FlashQLA solves, it helps to understand what standard softmax attention costs. In a typical Transformer, the attention mechanism has O(n²) complexity – meaning that doubling the sequence length quadruples the computation. This is a fundamental bottleneck that makes processing long documents, large code files, or extended conversations expensive.
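To make the quadratic cost concrete, here is a minimal single-head softmax attention in PyTorch (an illustrative sketch, not FlashQLA code): the n × n score matrix it materializes is what grows quadratically with sequence length.

```python
import torch

def softmax_attention(q, k, v):
    # q, k, v: (seq_len, head_dim)
    # scores has shape (seq_len, seq_len) -- both its memory footprint
    # and the matmuls that produce and consume it scale as O(n^2).
    scores = (q @ k.T) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

n, d = 4096, 128
q, k, v = (torch.randn(n, d) for _ in range(3))
out = softmax_attention(q, k, v)  # materializes a 4096 x 4096 score matrix
```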
Linear attention replaces softmax with a formulation that reduces this to O(n) complexity, so cost grows linearly with sequence length. The Gated Delta Network (GDN) is one such linear attention method, and it is used in Qwen's hybrid model architecture, where GDN layers alternate with standard full-attention layers. This hybrid design aims for the best of both worlds: the expressiveness of full attention where it matters most, and the efficiency of linear attention everywhere else.
GDN uses what is called a gated architecture – a decay gate controls how much of the recurrent state is carried forward at each step. This gate is key to how FlashQLA achieves its performance gains.
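A deliberately simplified sketch of a gated delta-rule recurrence in PyTorch makes this concrete (the exact GDN formulation used in Qwen models differs; the gate names alpha and beta here are illustrative): the state is updated once per token, so cost grows linearly with sequence length, and the decay gate controls how much of the state survives each step.

```python
import torch

def gated_delta_recurrence(q, k, v, alpha, beta):
    # q, k, v: (seq_len, d); alpha, beta: (seq_len,) gates in (0, 1).
    # One O(d*d) state update per token => O(n) in sequence length.
    seq_len, d = q.shape
    S = torch.zeros(d, d)              # recurrent "fast weight" state
    out = torch.empty(seq_len, d)
    for t in range(seq_len):
        S = alpha[t] * S               # decay gate: forget old state
        # Delta rule: nudge the state's prediction for k[t] toward v[t].
        S = S + beta[t] * torch.outer(k[t], v[t] - k[t] @ S)
        out[t] = q[t] @ S
    return out
```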
The Problem with Existing Kernels
Before FlashQLA, the standard implementation of GDN kernels came from the Flash Linear Attention (FLA) library, which is written in Triton – OpenAI's Python-based GPU programming language. While Triton makes kernel development far more accessible, it comes with a trade-off: the kernels it produces aren't always optimal for specific hardware, particularly NVIDIA's Hopper architecture (the H100 and H200 GPU generation).
The Hopper architecture introduced features such as warpgroup-level Tensor Core instructions and asynchronous data pipelines that Triton cannot always exploit to their full potential. This is the gap FlashQLA is designed to fill.
What FlashQLA Does Differently
FlashQLA provides fused, performance-optimized kernels for both the forward pass (used during inference and training) and the backward pass (used during training for gradient computation) of GDN chunked prefill. The result is a 2–3× speedup on forward passes and a 2× speedup on backward passes compared to the FLA Triton kernels in most cases on NVIDIA Hopper GPUs.
Three techniques drive these gains:
1. Gate-driven intra-card context parallelism: Context parallelism (CP) refers to splitting a long sequence across multiple processing units so they can work on different parts simultaneously. FlashQLA exploits the exponential decay property of the GDN gate to make this split statistically sound – gate decay means that tokens further apart in a sequence have a diminishing influence on each other. This allows FlashQLA to automatically enable intra-card CP under tensor-parallel (TP), long-sequence, and low-compute settings, improving GPU Streaming Multiprocessor (SM) utilization without requiring manual configuration (see the chunked-computation sketch after this list).
2. Hardware-friendly algebraic rearrangement: FlashQLA refactors parts of the forward and backward computation of GDN chunked prefill to reduce pressure on three types of GPU hardware units: Tensor Cores (which handle matrix multiplication), CUDA Cores (which handle scalar and vector operations), and Special Function Units (which handle transcendental operations such as exponentials). Importantly, this is done without sacrificing numerical accuracy – a critical guarantee when the kernels are used for model training.
3. Warp-specialized fused kernels in TileLang: Rather than decomposing the computation into a series of separate kernels (slow) or fusing everything into a single monolithic kernel (too rigid to optimize), FlashQLA takes a middle path. It uses TileLang to build a few fused kernels with warpgroup specialization – a technique that assigns different warpgroups (groups of 128 threads on Hopper) to dedicated roles, such as one warpgroup moving data from global memory to shared memory while another simultaneously drives Tensor Core matrix multiplications. This overlap of data movement, Tensor Core computation, and CUDA Core computation is what allows FlashQLA to approach the hardware's theoretical peak throughput.
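The following PyTorch sketch shows why the gate makes chunked, parallel-friendly computation possible. It implements a simplified chunked gated linear attention (the delta-rule correction and the numerical safeguards of real GDN kernels are omitted): within each chunk, the recurrence collapses into a few dense matmuls – exactly the shape Tensor Cores want – and chunks interact only through a single exponentially decayed state.

```python
import torch

def chunked_gated_linear_attention(q, k, v, alpha, chunk=64):
    # q, k, v: (seq_len, d); alpha: (seq_len,) per-token decay in (0, 1).
    # Real kernels track decay in log space for numerical stability.
    n, d = q.shape
    S = torch.zeros(d, d)                  # state carried across chunks
    out = torch.empty(n, d)
    for s in range(0, n, chunk):
        qc, kc, vc = q[s:s+chunk], k[s:s+chunk], v[s:s+chunk]
        D = torch.cumprod(alpha[s:s+chunk], 0).unsqueeze(-1)  # cumulative decay
        # Inter-chunk: queries read the decayed carried-in state.
        inter = (qc * D) @ S
        # Intra-chunk: causal attention with relative decay D_i / D_j.
        intra = torch.tril((qc * D) @ (kc / D).T) @ vc
        out[s:s+chunk] = inter + intra
        # Fold this chunk into the state for the next one.
        S = D[-1] * S + (kc * (D[-1] / D)).T @ vc
    return out
```

Because distant chunks communicate only through the decayed state S, splitting a sequence across SMs (intra-card context parallelism) costs little accuracy when the gate decays quickly.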
Benchmarks
FlashQLA was benchmarked against two baselines: the FLA Triton kernels (version 0.5.0, Triton 3.5.1) and FlashInfer (version 0.6.9), using TileLang 0.1.8, on NVIDIA H200 GPUs. The benchmarks used head configurations from the Qwen3.5 and Qwen3.6 model families, with head counts hv ∈ {64, 48, 32, 24, 16, 8}, corresponding to tensor parallelism settings from TP1 to TP8.
Forward (FWD) benchmarks measure single-kernel latency across models and TP settings under various sequence lengths. Backward (BWD) benchmarks examine the relationship between the number of tokens in a batch and the latency of a single update step.
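The post doesn't reproduce the raw latency tables, but single-kernel latency is typically measured with CUDA events; a generic harness along these lines (not FlashQLA's actual benchmark script) would do the job:

```python
import torch

def time_kernel(fn, *args, warmup=10, iters=100):
    # Warm up first (JIT compilation, caches), then time with CUDA
    # events so we measure device time rather than Python overhead.
    for _ in range(warmup):
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # mean latency in milliseconds
```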
Key Takeaways
- FlashQLA is a high-performance attention kernel library built by the Qwen team on TileLang, specifically optimized for the forward and backward passes of Gated Delta Network (GDN) chunked prefill.
- It achieves a 2–3× forward speedup and a 2× backward speedup over the FLA Triton kernels in most cases on NVIDIA Hopper GPUs (SM90+), with meaningful performance gains in both pre-training and inference.
- Three innovations drive the performance gains: automatic gate-driven intra-card context parallelism, hardware-friendly algebraic rearrangements that reduce Tensor Core, CUDA Core, and SFU load without loss of numerical accuracy, and warp-specialized fused TileLang kernels that overlap data movement, Tensor Core computation, and CUDA Core computation.
- GDN is a linear attention method with O(n) complexity, used in Qwen's hybrid model architecture alongside standard full-attention layers – making efficient GDN kernels essential for both training and long-context inference.
- FlashQLA is open source under the MIT license and requires SM90 or newer, CUDA 12.8+, and PyTorch 2.8+, with straightforward pip installation and both high-level and low-level Python APIs for kernel compilation.
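The post doesn't document the API itself, so the following is a purely hypothetical sketch of what calling a high-level chunked-prefill entry point could look like – the package, function, and argument names are placeholders, not FlashQLA's real interface; consult the GitHub repo for the actual API.

```python
import torch
# Hypothetical usage sketch -- every name below (package, function,
# tensor layout) is a placeholder, not FlashQLA's documented API.
import flashqla  # assumed: installed via pip from the GitHub repo

B, T, H, D = 1, 8192, 16, 128
q = torch.randn(B, T, H, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)
gate = torch.rand(B, T, H, device="cuda")  # per-token decay gate in (0, 1)

# Placeholder entry point for the GDN chunked-prefill forward kernel.
out = flashqla.gdn_chunked_prefill(q, k, v, gate)
```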
Check out the GitHub Repo for technical details.



