NeurIPS 2025 Best Paper Review: Qwen’s Systematic Exploration of Attention Gating

nimda December 13, 2025

One little trick can bring about enhanced training stability, the use of larger learning rates and improved scaling properties

The Enduring Popularity of AI’s Most Prestigious Conference

By all accounts this year’s NeurIPS, the world’s premier AI conference, was one of the largest and most active in its history. This year’s conference was held at the San Diego Convention Center in San Diego, California from Sunday, November 30, 2025 through Sunday, December 7, 2025. To give a sense of the scale, NeurIPS 2025 received 21,575 valid paper submissions. The rise from 2023 (~12.3k) to 2025 (~21.6k) reflects a ~75–80% jump over two years, or roughly 30% per year on average. In-person attendance has been equally impressive, usually running to tens of thousands of people and often capped by venue size, with past locations operating near the upper limit of what the physical venue can handle. Reinforcement learning dominated the conversation this year, with the field shifting from scaling models to tuning them for specific use cases. Industry momentum appeared to centre strongly around Google, with Google DeepMind in particular surging and pushing new and refreshing research directions, for example continual learning and nested learning, rather than just “bigger LLMs”. The scale and intensity of the conference perhaps reflect both the pace of AI progress and the cultural peak of the modern AI gold rush.

**Figure 1:** Game of spot the speaker. A typical NeurIPS main presentation hall which, as you can see, verges on a stadium-level setting. In this brave new world, AI researchers have become rockstars. 📖 **Source:** photo by author.

This year, the exhibitor hall was packed, with major industry players from technology, finance, and AI infrastructure all setting out their stalls to demonstrate their latest breakthroughs, highlight open roles to talented delegates, and hand out the ever-coveted branded “stash” — pens, T-shirts, water bottles, and more. The especially fortunate conference goer might even receive an invite to company-hosted “after-parties”, that have become a staple of the NeurIPS experience and an ideal opportunity to decompress, shed the information-overload and network, from Konwinski’s Laude Lounge to the invite-only Model Ship cruise packed with top researchers. Diamond sponsors this year included Ant Group, Google, Apple, ByteDance, Tesla, and Microsoft. The buy-side presence this year was particularly strong, with leading firms such as Citadel, Citadel Securities, Hudson River Trading, Jane Street, Jump Trading, and The D. E. Shaw Group represented. On the infrastructure and tooling side, Lambda showcased its GPU cloud platform, while companies like Ollama and Poolside highlighted advances in local LLM runtimes and frontier model development.

**Figure 2:** Nobel laureate Geoffrey Hinton’s presentation at the Google booth (picture taken at NeurIPS 2018). Well-known AI researchers and industry titans are a common sight throughout NeurIPS. 📖 **Source:** photo by author.

The NeurIPS Expo showcased many equally fascinating applied-AI demos. Highlights included BeeAI, demonstrating how autonomous agents can behave reliably across different LLM backends; a multimodal forensic search system capable of scanning large video corpora with AI; an AI-accelerated LiDAR processing demo that showed how heterogeneous compute can dramatically speed up 3D perception; and LLM-driven data-engineering workflows that automate ingestion, transformation, and quality checks. It’s clear from the EXPO that AI is heading full steam ahead toward agents, multimodal intelligence, accelerated perception, and end-to-end automated data systems.

**Figure 3:** Smile you’re on camera! The NeurIPS EXPO always has some fascinating exhibitions, including robotics, hardware, neuromorphic systems, etc. 📖 **Source:** photo by author.

The NeurIPS Best Paper Award ceremony arguably represents a pinnacle of the conference and a celebration of its most impactful work. The best paper awards are given to exceptionally innovative and impactful research that is likely to have an immediate and long-lasting effect on the field of AI. It goes without saying that a best paper award is a major professional accomplishment in a highly competitive and fast-moving research field. It is even more impressive if we take into account the massive volume of papers submitted to NeurIPS. Standing out in that crowd is exceptionally difficult.

The Anatomy of a NeurIPS Best Paper: Exploring the benefits of Gated Attention in LLMs

Gating Explained: How a Tiny Valve Controls Big Neural Models

In the remainder of this article, we take a deep dive into one of this year’s best papers from NeurIPS: “Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free” by the Qwen team. Arguably, this dense paper title packs a lot of information into a very small footprint, so, in what follows, I’ll unpack the paper piece by piece with the objective of giving practicing Data Scientists a clear mental model of attention gating and concrete takeaways from the paper they can immediately apply to their own work.

First, we begin with an understanding of the gate, the core module under study in the paper. What exactly is a gate in the context of neural networks? A gate is nothing more than a signal modulation mechanism, a computational unit that takes the output of an existing transformation in the network and regulates it by selectively amplifying, attenuating, or suppressing parts of the input signal.

Instead of allowing every activation to flow unchanged through the network, a gate introduces a learned control pathway that determines how much of the transformed information should pass forward.

Operationally speaking, a gate computes a vector of coefficients, typically using a sigmoid, softmax, or occasionally a ReLU-based squashing function, and these coefficients are applied multiplicatively to another vector of activations originating from an upstream computation. This has the effect of regulating how much of that input makes its way downstream, a bit like twisting a tap handle to and fro to regulate the volume of water passing through. That’s all there is to it: now you understand gating, what it is and how it is applied.

**Figure 4:** One of the simplest possible gating mechanisms. The input is modulated by a vector of coefficients computed in the Gate module, which applies a linear projection of the input data followed by a sigmoid non-linearity. The sigmoid squeezes the projected coefficients so that they lie between 0 and 1, which is ideal for a gate as its purpose is to modulate how much information from the input makes its way through to the next layer. 📖 **Source:** image by author.

Because the gating weights are typically learnable parameters, the network can discover during training how to modulate internal signals in ways that minimise the overall network loss. In this way, the gate becomes a dynamic filter, adjusting the internal information flow based on the input context, the model’s continually evolving parameters, and the gradients received during optimisation.
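To make the tap analogy concrete, here is a minimal NumPy sketch of the sigmoid gate described above. The names `W_gate`, `x` and the toy dimension are illustrative assumptions, and the gate weights are random stand-ins for parameters that a real network would learn.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d = 4                                   # toy feature dimension (assumption)
x = rng.standard_normal(d)              # activations from an upstream layer
W_gate = rng.standard_normal((d, d))    # gate weights: random stand-in for learned parameters

gate = sigmoid(W_gate @ x)              # one coefficient per feature, strictly between 0 and 1
y = gate * x                            # elementwise modulation: each feature is scaled down by its gate

print("gate:", np.round(gate, 3))
print("x   :", np.round(x, 3))
print("y   :", np.round(y, 3))
```

Because every coefficient lies between 0 and 1, a sigmoid gate of this form can only attenuate a feature, never amplify it; other activation choices relax that constraint.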

A Brief Tour Down Memory Lane: The Long History of Gating

It’s worth taking in a little bit of the history of gating before we move to the main contributions of the paper. Gating is really nothing new, and the Qwen paper didn’t invent this standard component. Their contribution lies elsewhere and will be covered shortly. In fact, gating has been a core mechanism in deep architectures for decades. For example, Long Short-Term Memory (LSTM) networks, introduced in 1997, pioneered the systematic use of multiplicative gates (the input, forget, and output gates) to regulate the flow of information through time. These gates act as learned filters that determine which signals should be written to memory, which should be retained, and which should be exposed to downstream layers. By controlling information flow in this fine-grained way, LSTMs effectively mitigated the multiplicative explosion or vanishing of gradients that hampered early recurrent networks, enabling stable long-term credit assignment during backpropagation through time (BPTT).

Applying Gating to the LLM Attention Block

The Qwen team’s contribution focuses on applying gating directly to the transformer’s softmax attention mechanism, a specific type of configuration called attention gating. In this article, I won’t spend too much time on the what of attention, as there are many resources out there to learn about it, including this recent course by the DeepLearning.ai team and this prior article I’ve written on the subject. In a super brief summary, attention is the core mechanism in the transformer architecture that lets each input sequence token gather contextual information from any other token in the sequence, enabling tokens to ‘communicate’ during training and inference, sharing information regardless of how far apart they appear in the input. The computational graph for the popular scaled dot product attention (SDPA) is shown below:

**Figure 5:** The Transformer’s attention mechanism, applied to a toy sequence “Big Cat”. The input tokens are projected into queries, keys, and values. The attention module compares queries with keys to form an attention map, which is then used to weight the values. The result is an enriched representation of each token. 📖 **Source:** image by author.

Although attention gating has been used for many years, the Qwen team highlight a surprising gap in our body of knowledge: as AI practitioners we’ve broadly applied attention gating without truly understanding why it works or how it shapes learning dynamics. The Qwen team’s work shows that we’ve been benefiting from this module for a long time without a rigorous, systematic account of its effectiveness or the conditions under which it performs best. The Qwen paper does just that and plugs the gap, with the NeurIPS best paper selection committee citation mentioning:

“This paper represents a substantial amount of work that is possible only with access to industrial scale computing resources, and the authors’ sharing of the results of their work, which will advance the community’s understanding of attention in large language models, is highly commendable, especially in an environment where there has been a move away from open sharing of scientific results around LLMs.”

NeurIPS 2025 Selection Committee statement.

Given the sheer amount of dollars flowing in and the massive commercial interest in AI these days, it’s really nice to see that the Qwen team decided to deliver this rich batch of lessons learnt to the wider community, rather than keep these informational nuggets behind closed doors. In doing so, the Qwen team have delivered a beautiful paper packed with practical lessons and clear explanations of the why behind attention gating, all distilled in a way that Data Scientists can immediately take and apply in real-world models.

The Qwen team’s systematic study makes several concrete contributions to knowledge that can be easily and immediately applied to improve many standard LLM architectures:

Positioning of Gating: Putting a gating module right after the scaled dot product attention (SDPA) output or, to a lesser extent, right after the value projection provides enhanced LLM performance through the introduction of a non-linearity and the inducement of input-dependent sparsity. They also study key parameterisations of the gating module, such as the type of activation function (SiLU or sigmoid) and the combination function (multiplication, addition).
Attention Sink and Massive Activations: Gating can radically curtail the power of the attention sink phenomenon, where most if not all of the attention in a layer concentrates on a single token — I cover this phenomenon in detail later. By suppressing these extreme activations, the model becomes far more numerically stable during optimisation, eliminating the loss spikes that typically appear in deep or long-training runs. This increased stability allows the model to tolerate substantially higher learning rates, unlocking better scaling without the divergence seen in ungated transformers.
Context Length Extension: Gating also facilitates context-length extension without requiring full model retraining. In practice, this means a model can be trained with a relatively short context window and later scaled to much longer sequences by retrospectively adjusting components such as the RoPE base. This adjustment effectively reparameterises the positional embedding geometry, allowing the model to operate at extended context lengths (e.g., up to 32k tokens) while preserving stability and without degrading previously learned representations.

**Figure 6:** First-token attention scores across layers in the baseline model. The early layers exhibit low attention to the first token, followed by a sharp increase around layer 6, with mid–late layers maintaining elevated attention. This is the attention sink phenomenon. The dashed line marks the mean score (0.467). 📖 **Source:** adapted by author from the original paper:

Leveraging Gating to Improve Performance, Learning Stability and Attention Mechanics

The Qwen team focus their investigation on how gating interacts with the LLM’s softmax attention module, aiming to understand its influence on the module’s learning dynamics and to identify the optimal placement of the gate: for example, after the Q, K, or V projections, after the attention computation, or after the dense layers. The setup of this study is illustrated in the following diagram:

**Figure 7:** The Qwen paper studies the placement of the Gating module with respect to the scaled dot product attention (SDPA) layer. Gating at the **SDPA output (G1)** or on the **value pathway (G2)** yields the strongest gains. These positions give the model the most direct control over what information flows through the attention block, allowing it to suppress noisy interactions or amplify useful ones. 📖 **Source:** adapted by author from the original paper:

The authors evaluate both mixture-of-experts (MoE) models (15B total parameters, 2.54B active) and dense feed-forward network (FFN) models (1.7B). The MoE variant uses 128 experts, top-8 softmax gating and fine-grained experts. Models are trained on subsets of a 4T-token high-quality corpus covering multilingual, math, and general knowledge data, with a sequence length of 4096. Training uses AdamW defaults, with specific learning-rate and batch-size details provided per experiment. They find that gating adds minimal overhead (<2% latency). Evaluation covers standard few-shot benchmarks: HellaSwag, MMLU, GSM8K, HumanEval, C-Eval, and CMMLU, plus perplexity tests across domains including English, Chinese, code, math, law, and literature.

The experimental evaluation is organised to study the following questions in a systematic way. I also add the key takeaways beneath each research question, which apply equally to the MoE and dense FFN models tested by the authors:

Q1: Where is it best to place the gating in the attention head? After the K, Q, V projections? After the scaled dot product attention? After the final multi-head attention concatenation?

The authors find that inserting gating at the output of the scaled dot product attention (SDPA) module (G1) or after the value map (G2) are the most effective placements.
Furthermore, placement at the SDPA output (G1) is more effective than at G2. To explain this, the authors demonstrate that gating at the SDPA output induces very low, sparse gating scores, which is correlated with superior task performance.
Value gating (G2) produces higher, less sparse scores and performs worse than SDPA-output gating (G1). Sparsity is key to performance. This suggests that sparsity is most useful when the gating depends on the current query, allowing the model to filter irrelevant context. The gate decides what to suppress or amplify based on what the current token needs.

**Figure 8:** Most gating scores sit close to zero, revealing a sparse activation pattern (SDPA-output gating, elementwise application). The dashed line shows the average gate value (0.116). 📖 **Source:** adapted by author from the original paper:

Their experiments with input-independent gating confirm this: it offers minor gains through added non-linearity but lacks the selective sparsity provided by query-dependent gating.

This finding is best explained through an example. Even though the K and V maps are technically input-dependent, they are not conditioned on the current query token. For example, if the query is “Paris,” the value tokens might be “France,” “capital,” “weather,” or “Eiffel Tower,” but each value token only knows its own representation and not what Paris is asking for. G2 gating bases its decision on the source tokens themselves, which may be irrelevant to the query’s needs. In contrast, G1 gating is computed from the query representation, so it is able to selectively suppress or amplify context based on what the query is actually trying to retrieve. This leads to sparser, cleaner gating and better performance for G1, whereas the Qwen team finds that G2 tends to produce higher, noisier scores and weaker results.
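The query-dependence described above can be sketched in a few lines of NumPy. This is a toy illustration rather than the Qwen implementation: it assumes the G1 gate is a sigmoid of a linear projection of the current token's hidden state, and `W_g`, the toy sizes and the random weights are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, H = 4, 8, 2            # tokens, model dim, heads (toy sizes)
Dh = D // H                  # per-head dimension

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def heads(t):
    # [T, D] -> [H, T, Dh]
    return t.reshape(T, H, Dh).transpose(1, 0, 2)

x = rng.standard_normal((T, D))   # token hidden states
# Random stand-ins for learned projections; W_g is the (assumed) gate projection.
W_q, W_k, W_v, W_g = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(4))

Q, K, V = heads(x @ W_q), heads(x @ W_k), heads(x @ W_v)
attn = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(Dh))   # [H, T, T]
sdpa_out = attn @ V                                      # [H, T, Dh]

# G1-style gate: row t is computed only from token t, the token issuing the query,
# and is applied to what that token retrieved from the rest of the sequence.
gate = heads(sigmoid(x @ W_g))                           # [H, T, Dh]
gated_out = gate * sdpa_out

print("mean gate value:", round(float(gate.mean()), 3))
```

The point to notice is that `sdpa_out[:, t]` mixes information from many source tokens, while `gate[:, t]` depends only on token `t`, so the gate filters the retrieved context from the perspective of the token doing the retrieving.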

Q2: Do we regulate the output via elementwise multiplication for fine-grained control or do we just learn a scalar that coarsely adjusts output?

Elementwise gating forms part of the best-performing configuration, although the authors report that the granularity of the gate has a comparatively minor impact on performance.

Q3: As attention in LLMs is typically multi-headed, do we share gates across heads or do we learn head-specific gating?

The authors are unequivocal that gating must be learned per head rather than shared across heads. They find that when gates are shared, the model tends to produce larger, less selective gating values, which dilutes head-level specialization and harms performance. In contrast, head-specific gating preserves each head’s unique role and consistently yields better results. Interestingly, the authors state that head-specific gating is the most critical design choice and has the largest effect on performance, with the granularity of the gating and the choice of activation function having a more minor impact.

Q4: We can modulate the output either multiplicatively or additively. Which approach works better?

The results in the paper show that multiplicative SDPA gating is better than additive. When using a gating function in softmax attention, we are better off multiplying its output rather than adding it.

Q5: What activation function makes more sense in the gating module, a sigmoid or a SiLU?

Sigmoid outperforms SiLU when used in the best-performing configuration, namely elementwise gating applied to the SDPA output (G1). Replacing sigmoid with SiLU in this setup consistently leads to worse results, indicating that sigmoid is the more effective activation for gating.
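The design axes in Q2 to Q5 are easiest to compare side by side as tensor shapes. The sketch below is an illustration under stated assumptions, not the paper's code: the weight names (`W_elem`, `W_head`, `W_shared`), the toy sizes and the exact form of the additive variant are all hypothetical, and random tensors stand in for learned parameters and for the SDPA output.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, H = 4, 8, 2
Dh = D // H

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def silu(z):
    return z * sigmoid(z)

x = rng.standard_normal((T, D))              # token hidden states
sdpa_out = rng.standard_normal((T, H, Dh))   # stand-in for the per-head SDPA output

# Elementwise, head-specific: one gate value per head and per channel.
W_elem = rng.standard_normal((D, H * Dh))
g_elem = sigmoid(x @ W_elem).reshape(T, H, Dh)

# Headwise: a single scalar per head, broadcast over that head's channels.
W_head = rng.standard_normal((D, H))
g_head = sigmoid(x @ W_head)[:, :, None]     # [T, H, 1]

# Head-shared: one set of Dh gate values reused by every head.
W_shared = rng.standard_normal((D, Dh))
g_shared = sigmoid(x @ W_shared)[:, None, :] # [T, 1, Dh]

out_elem = g_elem * sdpa_out       # multiplicative, elementwise, head-specific
out_head = g_head * sdpa_out       # multiplicative, headwise
out_shared = g_shared * sdpa_out   # multiplicative, shared across heads

# One possible additive variant with a SiLU activation (assumed form).
out_additive = sdpa_out + silu(x @ W_elem).reshape(T, H, Dh)

for name, W in [("elementwise", W_elem), ("headwise", W_head), ("shared", W_shared)]:
    print(f"{name:>11}: {W.size} gate parameters")
```

Note that SiLU is unbounded above and can go slightly negative, so unlike the sigmoid it does not keep the gate coefficients between 0 and 1.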

Mitigating the Scourge of Attention Sinks

A key issue in LLMs is attention sinking, where the first token absorbs most of the attention weight and overwhelms the rest of the sequence, leading to disproportionately large activations that can destabilise training and distort the model’s representations. Importantly, the Qwen team show that gating can mitigate this effect, with the SDPA output gating reducing the massive activations and attention sink.

**Figure 9:** When the attention distribution collapses onto the first token, its value vector dominates the weighted sum, leading to an outsized activation while the rest of the sequence is effectively ignored. 📖 **Source:** image by author.
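A toy calculation shows how an attention sink swamps the output. The sketch below hand-crafts the attention logits so that every query strongly prefers the first token; the logit value of 6.0 and the toy sizes are arbitrary assumptions chosen for illustration, not numbers from the paper.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

T, Dh = 6, 4
rng = np.random.default_rng(0)
V = rng.standard_normal((T, Dh))        # value vectors, one per token

scores = np.zeros((T, T))
scores[:, 0] = 6.0                      # hypothetical sink: every query favours token 0

# Causal mask: a token may not attend to positions after it.
future = np.triu(np.ones((T, T), dtype=bool), k=1)
scores[future] = -np.inf

attn = softmax(scores)                  # each row sums to 1
first_token_share = attn[:, 0].mean()   # average attention mass on token 0
out = attn @ V                          # attention output per token
max_dev = np.abs(out - V[0]).max()      # how far any output row is from V[0]

print("mean attention on first token:", round(float(first_token_share), 3))
print("max deviation of outputs from V[0]:", round(float(max_dev), 3))
```

With the attention mass collapsed onto token 0, every output row is almost a copy of `V[0]`, which is the situation depicted in Figure 9.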

Extending Context Length by Changing the Rotary Position Embeddings (RoPE) Base

To build long-context models, the Qwen team follow a three-stage training strategy, detailed below. This training strategy gives a further fascinating insight into how frontier labs train large-scale models, and what tools they find effective:

Expanding RoPE base: First, they expand the Rotary Position Embeddings (RoPE) base from 10k to 1M, which flattens the positional frequency curve and allows stable attention at much longer positions.
Mid-Training: the Qwen team then continue training the model for an additional 80B tokens using 32k-length sequences. This continuation phase (sometimes called “mid-training”) lets the model adapt naturally to the new RoPE geometry without relearning everything.
YaRN Extension: they then apply YaRN (Yet another RoPE extensioN method) to expand the context length up to 128k, without further training.
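To see what the first step does numerically, the sketch below compares rotation wavelengths for the two bases. It assumes the standard RoPE parameterisation, in which slice i rotates by base^(-2i/d) radians per token, and a head dimension of 128, which is a hypothetical choice made for the example.

```python
import numpy as np

def rope_wavelengths(head_dim, base):
    """Tokens needed for each 2-D slice to complete one full rotation."""
    i = np.arange(head_dim // 2)
    theta = base ** (-2.0 * i / head_dim)   # radians of rotation per token, per slice
    return 2 * np.pi / theta

head_dim = 128  # hypothetical head dimension
for base in (10_000, 1_000_000):
    wl = rope_wavelengths(head_dim, base)
    print(f"base={base:>9,}: shortest={wl[0]:.1f} tokens, longest={wl[-1]:,.0f} tokens")
```

Raising the base leaves the fastest slice untouched and stretches all the others, so the slowest slices take far longer to wrap around, which is the intuition behind more stable attention at long positions.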

Let’s step back and briefly clarify what RoPE is and why it matters in LLMs. Without injecting positional information, a Transformer’s attention mechanism has no sense of where tokens appear in a sequence. Like many techniques in AI, positional embeddings and RoPE rest on a simple geometric intuition that makes everything really clear. In a simple 2D analogy, you can imagine token embeddings as a cloud of points scattered in space, with no indication of their order or relative spacing in the original sequence.

RoPE encodes position by rotating each 2D slice of the query/key embedding by an angle proportional to the token’s position. The embedding is partitioned into many 2D sub-vectors, each assigned its own rotation frequency (θ₁, θ₂, …), so different slices rotate at different speeds. Low-frequency slices rotate slowly and capture broad, long-range positional structure, while high-frequency slices rotate rapidly and capture fine-grained, short-range relationships. Together, these multi-scale rotations allow attention to infer relative distances between tokens across both local and global contexts. This is a beautiful idea and implementation, and it’s methods like these that make me grateful to be working in the field of AI.

**Figure 10:** Illustration of RoPE (Rotary Position Embedding). Each query/key vector is divided into 2-D slices, and each slice is rotated by an angle proportional to the token’s position. The coloured patches at the *bottom* of each slice show where the rotated 2-D subvector now lies after applying RoPE. The shading at the foot of each vector slice indicates that the location within the slice shifts, giving a new orientation determined by the slice’s rotation frequency. Because different slices rotate at different speeds, their coloured patches appear in different places, allowing RoPE to encode positional information across multiple frequency bands. 📖 **Source:** image by author based on Figure 1 in the original RoPE paper:

The key insight here is that the relative angle between two rotated embeddings naturally encodes their relative distance in the sequence, allowing the attention mechanism to infer ordering and spacing through geometry alone. This makes positional information a property of how queries and keys interact. For example, if the tokens are close in the sequence, their rotations will be similar, which equates to a large dot product, giving a higher attention weight. Conversely, when tokens are farther apart, their rotations differ more, so the dot product between their queries and keys changes in a position-dependent way, typically reducing attention to distant tokens unless the model has learned that long-range interactions are important.
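The relative-position property can be checked directly. Below is a minimal NumPy sketch of RoPE that pairs consecutive dimensions into 2-D slices; the pairing convention, the toy dimension of 8 and the function name `rope` are assumptions for illustration, since real implementations differ in how they lay out the slices.

```python
import numpy as np

def rope(vec, pos, base=10_000.0):
    """Rotate each consecutive 2-D slice of `vec` by the angle pos * theta_i."""
    d = vec.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)   # one frequency per slice
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = vec[0::2], vec[1::2]                    # the two coordinates of every slice
    out = np.empty_like(vec)
    out[0::2] = x1 * cos - x2 * sin                  # standard 2-D rotation
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# Same relative offset (3 tokens) at two very different absolute positions.
score_near_start = rope(q, 5) @ rope(k, 2)
score_far_along = rope(q, 105) @ rope(k, 102)

print("scores match:", bool(np.isclose(score_near_start, score_far_along)))
print("norm preserved:", bool(np.isclose(np.linalg.norm(rope(q, 7)), np.linalg.norm(q))))
```

The two scores agree up to floating-point error because rotating the query by position m and the key by position n leaves a dot product that depends only on m − n, and because rotations never change vector length.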

YaRN is a modern and flexible way to extend an LLM’s context window without retraining, and without causing the instabilities seen in naïvely extrapolated RoPE. RoPE begins to fail at long ranges because its highest-frequency rotational dimensions wrap around too quickly. Once positions exceed the training horizon, those dimensions produce repeated phases, meaning tokens that are far apart can appear deceptively similar in positional space. This phase aliasing (or matching) destabilises attention and can cause it to collapse. YaRN fixes this by smoothly stretching the RoPE frequency spectrum, preserving the model’s short-range positional behaviour while gradually interpolating to lower frequencies for long-range positions. The result is a positional embedding scheme that behaves naturally up to 32k, 64k, or even 128k tokens, with far less distortion than older NTK or linear-scaling methods. Once their model was found to be stable at 32k, the Qwen team applied YaRN to further interpolate the RoPE frequencies, extending the effective context window to 128k.
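The frequency stretching can be sketched as follows. This is a simplified illustration of the per-dimension interpolation ramp described in the YaRN paper, not the Qwen team's code: the ramp thresholds `alpha=1` and `beta=32`, the head dimension of 128 and the function name are assumptions, and YaRN's additional attention-temperature scaling is omitted entirely.

```python
import numpy as np

def yarn_frequencies(head_dim, base, scale, train_len, alpha=1.0, beta=32.0):
    """Blend original and interpolated RoPE frequencies, slice by slice."""
    i = np.arange(head_dim // 2)
    theta = base ** (-2.0 * i / head_dim)          # original RoPE frequencies
    # Number of full turns each slice completes inside the training window.
    rotations = train_len * theta / (2 * np.pi)
    # ramp = 1: fast slice, keep as is; ramp = 0: slow slice, divide by `scale`.
    ramp = np.clip((rotations - alpha) / (beta - alpha), 0.0, 1.0)
    return (1.0 - ramp) * theta / scale + ramp * theta

# 32k -> 128k is a 4x extension; base 1M as in the first training stage.
original = 1_000_000 ** (-2.0 * np.arange(64) / 128)
stretched = yarn_frequencies(128, 1_000_000, scale=4.0, train_len=32_768)

print("fastest slice unchanged:", bool(np.isclose(stretched[0], original[0])))
print("slowest slice slowed 4x:", bool(np.isclose(stretched[-1], original[-1] / 4.0)))
```

Fast-rotating slices, which carry short-range information, are left alone, while slow-rotating slices are slowed by the full scale factor, with a linear blend in between.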

In their evaluation, the Qwen team find that, within the trained 32k window, SDPA-gated models slightly outperform the baseline, indicating that gating improves attention dynamics without harming long-context stability, even under substantial positional scaling.

Additionally, with the YaRN extension and in the large-context regime, they find that the SDPA-output gated network significantly outperforms the baseline at context lengths between 64k and 128k. The authors tie this performance increase to the mitigation of the attention sink phenomenon, which they surmise the baseline model relies upon to distribute attention scores across tokens. They hypothesise that the SDPA-output gated model is much less sensitive to the RoPE- and YaRN-induced changes to the positional encoding scheme and to context length adjustments. Applying YaRN, which does not require further training, may disrupt these learned sink patterns, leading to the observed degradation in the baseline model’s performance. The SDPA-gated model, in contrast, does not rely on the attention sink to stabilise attention.

Coding Up our Own Gating Implementation

Before we conclude, it can be instructive to try to code up an implementation of an AI technique directly from a paper, and it’s a great way to solidify the key concepts. To this end, we’ll walk through a simple Python implementation of scaled dot product attention with sigmoid gating.

We will first define our key hyperparameters, such as the sequence length (seq_len), the hidden dimension of the model (d_model), the number of heads (n_heads) and the head dimension (head_dim).

import numpy as np

np.random.seed(0)

# ---- Toy config ----
seq_len   = 4      # tokens
d_model   = 8      # model dim
n_heads   = 2
head_dim  = d_model // n_heads

We next define some (fake) token embeddings (simply generated randomly here), alongside our randomly initialised projection weights (not learnt for the purposes of this simple example).

# Fake token embeddings
x = np.random.randn(seq_len, d_model)        # [T, D]

# ---- Projection weights ----
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)
W_o = np.random.randn(d_model, d_model)      # output projection

We then define the usual suspects, softmax, sigmoid, and also a method to split the dimension D into n_heads:

def softmax(logits, axis=-1):
    logits = logits - np.max(logits, axis=axis, keepdims=True)
    exp = np.exp(logits)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# ---- Helper: split/concat heads ----
def split_heads(t):
    # [T, D] -> [H, T, Dh]
    return t.reshape(seq_len, n_heads, head_dim).transpose(1, 0, 2)

def concat_heads(t):
    # [H, T, Dh] -> [T, D]
    return t.transpose(1, 0, 2).reshape(seq_len, d_model)

Now we can dive into the core gating implementation and see exactly how it works in practice. In all of the examples below, we use random tensors as stand-ins for the learned gate parameters that a real model would train end-to-end.

# ============================================================
# Forward pass
# ============================================================
def attention_with_gates(x):
    # 1) Linear projections
    Q = x @ W_q   # [T, D]
    K = x @ W_k
    V = x @ W_v

    # ----- G4: gate on Queries (after W_q) -----
    G4 = sigmoid(np.random.randn(*Q.shape))
    Q = G4 * Q

    # ----- G3: gate on Keys (after W_k) -----
    G3 = sigmoid(np.random.randn(*K.shape))
    K = G3 * K

    # ----- G2: gate on Values (after W_v) -----
    G2 = sigmoid(np.random.randn(*V.shape))
    V = G2 * V

    # 2) Split into heads
    Qh = split_heads(Q)      # [H, T, Dh]
    Kh = split_heads(K)
    Vh = split_heads(V)

    # 3) Scaled Dot Product Attention per head
    scale = np.sqrt(head_dim)
    scores = Qh @ Kh.transpose(0, 2, 1) / scale   # [H, T, T]
    attn   = softmax(scores, axis=-1)
    head_out = attn @ Vh                          # [H, T, Dh]

    # 4) Concat heads
    multi_head_out = concat_heads(head_out)       # [T, D]

    # ----- G1: gate on concatenated heads (before W_o) -----
    G1 = sigmoid(np.random.randn(*multi_head_out.shape))
    multi_head_out = G1 * multi_head_out

    # 5) Output projection
    y = multi_head_out @ W_o                      # [T, D]

    # ----- G5: gate on final dense output -----
    G5 = sigmoid(np.random.randn(*y.shape))
    y = G5 * y

    return {
        "Q": Q, "K": K, "V": V,
        "G2": G2, "G3": G3, "G4": G4,
        "multi_head_out": multi_head_out,
        "G1": G1, "final_out": y, "G5": G5,
    }

out = attention_with_gates(x)
print("Final output shape:", out["final_out"].shape)

The code above inserts gating modules at five locations, replicating the positioning in the Qwen paper: the query map (G4), key map (G3), value map (G2), the output of the SDPA module (G1), and the final dense output (G5). Although the Qwen team recommend using only the G1 configuration in practice (placing a single gate on the SDPA output), we include all five here for illustration. The goal is to show that gating is simply a lightweight modulation mechanism applied to different pathways within the attention block. Hopefully this makes the overall concept feel more concrete and intuitive.

Conclusions & Final Thoughts

In this article we took a whistle-stop tour into the concept of gating for softmax attention in LLMs and covered the key lessons learnt from the NeurIPS 2025 paper, “Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free”.

The Qwen paper is an AI tour-de-force and a treasure trove of practical findings that are immediately applicable to improving most contemporary LLM architectures. The Qwen team have produced an exhaustive study into the configuration of gating for LLM softmax attention, throwing light on this important component. There’s no doubt in my mind that most, if not all, frontier AI labs will be furiously scrambling to update their architectures in line with the guidance coming out of the Qwen paper, one of this year’s NeurIPS best papers, a highly coveted achievement in the field. As we speak there are probably thousands of GPUs crunching away at learning LLMs with gating module configurations inspired by the clear lessons in the Qwen paper.

Kudos to the Qwen team for making this knowledge public for the benefit of the entire community. The original code can be found here if you are interested in incorporating the Qwen team’s implementation into your own models or driving their research further. Every great research contribution leads to more questions (there are turtles all the way down!), and unanswered questions remain, such as what internal dynamics change when a gate is added, and why this leads to the observed robustness across positional regimes.

Disclaimer: The views and opinions expressed in this article are solely my own and do not represent those of my employer or any affiliated organisations. The content is based on personal reflections and speculative thinking about the future of science and technology. It should not be interpreted as professional, academic, or investment advice. These forward-looking perspectives are intended to spark discussion and imagination, not to make predictions with certainty.

📚 Further Learning

Alex Heath (2025) — Google’s Rise, RL Mania, and a Party Boat — A first-hand roundup of NeurIPS 2025 takeaways, highlighting the surge of reinforcement learning, Google/DeepMind’s momentum, and the increasingly extravagant conference party culture. Published in Sources, a newsletter analysing AI industry trends.
Jianlin Su et al. (2024) — RoFormer: Enhanced Transformer with Rotary Position Embedding — The original RoPE paper that introduced rotary position embeddings, now used universally in LLMs. It explains how rotational encoding preserves relative position information and clarifies why changing the RoPE base affects long-range attention behavior.
Bowen Peng et al. (2023) — YaRN: Efficient Context Window Extension of Large Language Models — The core reference behind YaRN interpolation. This work shows how adjusting RoPE frequencies through smooth extrapolation can extend models to 128k+ contexts without retraining.
Zihan Qiu et al. (2025) — Gated Attention for Large Language Models: Non-Linearity, Sparsity, and Attention-Sink-Free — The definitive study on gating in softmax attention, reviewed in this article. It introduces SDPA-output gating (G1), explains why sigmoid gating introduces non-linearity and sparsity, shows how gating eliminates attention sinks, and demonstrates superior context-length generalization under RoPE/YaRN modifications.
Guangxuan Xiao et al. (2023) — StreamingLLM: Efficient Streaming LMs with Attention Sinks — The paper that formally identifies the “attention sink” phenomenon: early tokens attracting disproportionately large attention weights. It explains why baseline transformers often collapse attention to the first token.
Mingjie Sun et al. (2024) — Massive Activations in Large Language Models — Shows that extremely large hidden activations in specific layers propagate through the residual stream and cause pathological attention distributions. The Qwen paper empirically validates this link and demonstrates how gating suppresses massive activations.
Noam Shazeer (2020) — GLU Variants Improve Transformer — The foundational reference for gating inside feedforward blocks (SwiGLU, GEGLU). Modern LLMs heavily rely on this family of gated FFN activations; the Qwen paper connects this lineage to gating inside attention itself.
Hochreiter & Schmidhuber (1997) — LSTM: Long Short-Term Memory — The earliest and most influential gating architecture. LSTMs introduce input, output, and forget gates for selective information passage, the conceptual precursor to all modern gating strategies, including SDPA-output gating in transformers.
Xiangming Gu et al. (2024) — When Attention Sink Emerges in Language Models — Provides a modern empirical treatment of attention sinks, key biases, and non-informative early-token dominance.
Dong et al. (2025) — LongRed: Mitigating Short-Text Degradation of Long-Context LLMs — Offers a mathematical derivation (referenced in Qwen) showing how modifying RoPE changes attention distributions and hidden-state geometry.