Meet EAGLE 3.1: A Speculative Decoding Algorithm That Fixes Attention Drift in LLM Inference

Speculative decoding is a way to speed up language model inference. A small, fast draft model proposes several tokens. A large target model verifies them in parallel. When the proposals are accepted, generation is faster. When a proposal is rejected, the system falls back to the target model's own token.
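The draft-then-verify loop can be sketched in a few lines. The version below is a minimal illustration under simplifying assumptions, not the EAGLE or vLLM implementation: both models are hypothetical toy functions (`toy_target`, `toy_draft`) that map a token list to the next token, verification is greedy exact-match, and the target is called position by position where a real system would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[Token]:
    """Run one draft-then-verify round and return the tokens to append."""
    # 1) The draft model proposes k tokens autoregressively.
    proposal: List[Token] = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # 2) The target checks each proposed position in order. This loop only
    #    imitates the result of a single batched verification pass.
    accepted: List[Token] = []
    for tok in proposal:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3) Every draft was accepted, so the target adds one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prefix: List[Token], draft_next: NextToken,
             target_next: NextToken, k: int, max_new: int) -> List[Token]:
    """Repeat speculative steps until max_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[: len(prefix) + max_new]


# Hypothetical toy models: the target counts upward modulo 10, and the
# draft agrees with it except right after the token 4.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


print(speculative_step([1], toy_draft, toy_target, k=3))  # [2, 3, 4, 5]
print(speculative_step([3], toy_draft, toy_target, k=3))  # [4, 5]
```

With greedy verification the generated sequence is the same one the target model would produce alone; the draft model only changes how many target passes are needed to get there.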

The EAGLE series, which includes EAGLE 1, EAGLE 2, and EAGLE 3, has become one of the most widely adopted and deployed families of speculative decoding algorithms across research and production. Today, the EAGLE Team, the vLLM Team, and the TorchSpec Team give that family a targeted robustness upgrade with the introduction of EAGLE 3.1.

What was going wrong

Although speculative decoding works well in controlled settings, performance often degrades under different chat templates, long-context inputs, or out-of-distribution system prompts.

The EAGLE Team traced this weakness to a phenomenon they call attention drift: as speculation depth increases, the drafter gradually shifts attention away from the sink tokens and toward its own generated tokens.

In simple terms: the drafter is a small model that predicts future tokens. As speculation deepens, it starts attending to its own previous outputs instead of the original context. This reduces acceptance length and output stability.

Two root causes were identified. First, the fused input representation becomes increasingly imbalanced as the higher-layer hidden states dominate the drafter input. Second, the hidden-state magnitude grows across speculation steps because of the unnormalized residual path. Together, these effects make the drafter progressively less stable at deeper speculation depths.

Two Architectural Fixes in EAGLE 3.1

To address attention drift, EAGLE 3.1 introduces two key architectural improvements: FC normalization applied after each target hidden state and before the FC layer, and feeding post-norm hidden states into the next decoding step.

FC normalization stabilizes the hidden states that the drafter receives from the target model. Without it, hidden-state magnitude grows across steps, which makes the drafter progressively less reliable. Applying normalization at each step keeps the input bounded.
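A small numeric sketch helps show why normalizing each target hidden state before the FC layer rebalances the fused input. Everything here is an illustrative assumption rather than the EAGLE 3.1 code: the normalization is a plain RMS norm without learned gain, the three hidden states and their scales (1×, 3×, 20×) are hypothetical, and the FC layer is a random matrix-vector product.

```python
import math
import random


def rms_norm(v, eps=1e-6):
    """Scale a vector to unit root-mean-square (no learned gain, for brevity)."""
    scale = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / scale for x in v]


def energy_shares(hidden_states):
    """Fraction of the concatenated input's squared norm held by each state."""
    energies = [sum(x * x for x in h) for h in hidden_states]
    total = sum(energies)
    return [e / total for e in energies]


def fuse(hidden_states, fc_weight, normalize):
    """Optionally normalize each state, concatenate, then apply the FC layer."""
    if normalize:
        hidden_states = [rms_norm(h) for h in hidden_states]
    fused = [x for h in hidden_states for x in h]
    return [sum(w * x for w, x in zip(row, fused)) for row in fc_weight]


# Hypothetical target hidden states taken from a low, middle and high layer.
base = [0.5, -1.0, 1.5, -0.5]
states = [
    [1.0 * x for x in base],   # low layer
    [3.0 * x for x in base],   # middle layer
    [20.0 * x for x in base],  # high layer, much larger magnitude
]

rng = random.Random(0)
fc_weight = [[rng.uniform(-0.1, 0.1) for _ in range(12)] for _ in range(4)]

raw_shares = energy_shares(states)
normed_shares = energy_shares([rms_norm(h) for h in states])
drafter_input = fuse(states, fc_weight, normalize=True)

print([round(s, 3) for s in raw_shares])
print([round(s, 3) for s in normed_shares])
```

In this toy setup the high-layer state holds about 98% of the unnormalized input's squared norm, while after per-state normalization each of the three states contributes about a third.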

The post-norm design makes the drafter behave more like an iterative rollout across decoding steps, rather than simply stacking more layers on top of the target model.
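The effect of feeding post-norm hidden states back into the next step can be illustrated with a toy rollout. This is a sketch under stated assumptions, not the actual drafter: `toy_drafter_step` is a hypothetical residual block whose branch simply returns half of its input, and the normalization is an RMS norm without learned gain.

```python
import math


def rms(v):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(x * x for x in v) / len(v))


def rms_norm(v, eps=1e-6):
    """Scale a vector to unit root-mean-square (no learned gain)."""
    scale = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / scale for x in v]


def toy_drafter_step(h):
    """Hypothetical residual block: output = input + branch(input)."""
    branch = [0.5 * x for x in h]
    return [a + b for a, b in zip(h, branch)]


def rollout(h0, steps, post_norm):
    """Return the magnitude of the hidden state passed to each next step."""
    h = list(h0)
    magnitudes = []
    for _ in range(steps):
        h = toy_drafter_step(h)
        if post_norm:
            # Feed the normalized state, not the raw residual sum, forward.
            h = rms_norm(h)
        magnitudes.append(rms(h))
    return magnitudes


h0 = [1.0, -1.0, 0.5, -0.5]
raw = rollout(h0, steps=5, post_norm=False)
normed = rollout(h0, steps=5, post_norm=True)

print([round(m, 2) for m in raw])
print([round(m, 2) for m in normed])
```

Along the unnormalized residual path the magnitude compounds at every step, while the post-norm variant hands each step an input of the same scale, which is what makes successive steps look like repeated applications of the same function.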

What These Fixes Deliver

Compared to EAGLE 3, EAGLE 3.1 shows: better extrapolation from training time to inference time, stronger long-context robustness, higher robustness to variations in chat templates and system prompts, and more stable acceptance length across different serving environments.

For long-context workloads, EAGLE 3.1 achieves up to 2× longer acceptance length compared to EAGLE 3.
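For readers new to the metric, acceptance length is roughly the number of tokens produced per target-model verification pass. The helper below is a minimal sketch under one common convention, assumed here rather than taken from the article: each pass emits the accepted draft tokens plus one token from the target itself. The function name and the example counts are hypothetical and are not benchmark data.

```python
from typing import Sequence


def mean_acceptance_length(accepted_drafts_per_step: Sequence[int]) -> float:
    """Average number of tokens emitted per target forward pass.

    Each verification step emits the accepted draft tokens plus one token
    from the target itself (a correction or a bonus token), hence the +1.
    """
    if not accepted_drafts_per_step:
        raise ValueError("need at least one verification step")
    emitted = [n + 1 for n in accepted_drafts_per_step]
    return sum(emitted) / len(emitted)


# Hypothetical counts for four verification steps with 3 drafted tokens each.
example_counts = [3, 0, 2, 1]
print(mean_acceptance_length(example_counts))  # 2.5
```

Under this convention a value of 1.0 means no draft token was ever accepted, so a higher acceptance length translates directly into fewer target passes per generated token.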

Training Infrastructure: TorchSpec

TorchSpec now provides training support for EAGLE 3.1 and for future speculative decoding algorithms. By reducing training overhead and streamlining experimentation workflows, TorchSpec helps accelerate the iteration and evaluation of next-generation speculative decoding research and deployments.

Based on TorchSpec and vLLM, the research team retrained and open-sourced an EAGLE 3.1 draft model for Kimi K2.6, available on Hugging Face. The model serves as a reference example of using EAGLE 3.1 with TorchSpec for training and vLLM for serving in a real-world deployment.

vLLM Integration: Config-Driven and Backward-Compatible

EAGLE 3.1 lands in vLLM as a config-driven extension of the existing EAGLE 3 implementation. The integration includes FC normalization support, post-norm hidden state feedback, and removal of hard-coded assumptions about the target hidden states.

Backward compatibility with existing EAGLE 3 checkpoints is fully preserved. EAGLE 3.1 draft models can be plugged in directly through the speculative decoding path.

vllm serve nvidia/Kimi-K2.6-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --tool-call-parser kimi_k2 \
  --enable-auto-tool-choice \
  --reasoning-parser kimi_k2 \
  --attention-backend tokenspeed_mla \
  --speculative-config '{"model":"lightseekorg/kimi-k2.6-eagle3.1-mla","method":"eagle3","num_speculative_tokens":3}' \
  --language-model-only
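Because the draft model is selected through a JSON string, it can be convenient to build the command from a dictionary instead of hand-writing the quoting. The sketch below only reassembles the flags and values shown in the command above; the variable names are illustrative, and it prints the command rather than launching a server.

```python
import json
import shlex

# Same keys and values as the --speculative-config JSON in the command above.
speculative_config = {
    "model": "lightseekorg/kimi-k2.6-eagle3.1-mla",
    "method": "eagle3",
    "num_speculative_tokens": 3,
}

argv = [
    "vllm", "serve", "nvidia/Kimi-K2.6-NVFP4",
    "--trust-remote-code",
    "--tensor-parallel-size", "4",
    "--tool-call-parser", "kimi_k2",
    "--enable-auto-tool-choice",
    "--reasoning-parser", "kimi_k2",
    "--attention-backend", "tokenspeed_mla",
    # Compact separators reproduce the whitespace-free JSON used above.
    "--speculative-config", json.dumps(speculative_config, separators=(",", ":")),
    "--language-model-only",
]

# shlex.join adds the shell quoting the JSON argument needs.
command = shlex.join(argv)
print(command)
```

Keeping the configuration as a dictionary makes it easy to change `num_speculative_tokens` or swap the draft model without touching the rest of the command line.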

Benchmark results on Kimi K2.6

The research team evaluated the Kimi K2.6 EAGLE 3.1 draft model on Kimi-K2.6-NVFP4 with vLLM (TP=4, GB200, non-disagg) on the SPEED-Bench coding dataset. EAGLE 3.1 delivers 2.03× higher per-user throughput at concurrency 1. The speedup remains meaningful as concurrency scales: 1.71× at C=4 and 1.66× at C=16.

Marktechpost Visual Explainer

01 / 07

vLLM · May 26, 2026


The EAGLE team, the vLLM team, and the TorchSpec team have jointly released EAGLE 3.1, a targeted fix for speculative decoding instability in production LLM deployments.

#speculative-decoding
#vLLM
#LLM-inference
#serving

02 / 07

Background

What is Speculative Decoding?


A method to accelerate LLM inference using two models working together.

  • A small, fast draft model proposes several tokens ahead
  • A large target model verifies all proposed tokens in one pass
  • Accepted tokens are kept; rejected tokens fall back gracefully to the target model's output
  • The result: higher throughput with no change in output quality

03 / 07

The problem

Attention Drift in EAGLE 3


EAGLE 3's performance degrades in real-world use under three conditions:

  • Varying chat templates
  • Long-context inputs
  • Out-of-distribution system prompts

Cause: attention drift. As speculation depth increases, the drafter shifts attention away from the sink tokens and toward its own generated tokens.

04 / 07

The cause

Two Root Causes

  • The fused input representation becomes increasingly imbalanced: the higher-layer hidden states dominate the drafter input
  • The hidden-state magnitude grows across speculation steps because of the unnormalized residual path
  • Together, these make the drafter progressively less stable at deeper speculation depths

05 / 07

Architecture

Two Architectural Fixes

Fix 1
FC normalization is applied after each target hidden state and before the FC layer. It keeps the hidden-state magnitude bounded across decoding steps.

Fix 2
Post-norm hidden state feedback: normalized hidden states are fed into the next decoding step, which makes the drafter behave more like an iterative rollout than a stack of extra layers.

06 / 07

Benchmarks · SPEED-Bench Coding · GB200 TP=4

Per-user throughput vs. no-spec baseline

2.03× at concurrency 1

1.71× at concurrency 4

1.66× at concurrency 16

For long-context workloads, EAGLE 3.1 achieves up to 2× longer acceptance length compared to EAGLE 3. Tested on Kimi-K2.6-NVFP4 with vLLM.

07 / 07

Deployment · vLLM v0.22.0

How to Deploy EAGLE 3.1


EAGLE 3.1 is backward-compatible with EAGLE 3 checkpoints and is already merged into vLLM main. Stable release: v0.22.0.

vllm serve nvidia/Kimi-K2.6-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --tool-call-parser kimi_k2 \
  --enable-auto-tool-choice \
  --reasoning-parser kimi_k2 \
  --attention-backend tokenspeed_mla \
  --speculative-config \
    '{"model":"lightseekorg/kimi-k2.6-eagle3.1-mla",
      "method":"eagle3",
      "num_speculative_tokens":3}' \
  --language-model-only

Key Takeaways

  • EAGLE 3.1 fixes attention drift, a newly identified instability in which the drafter loses focus on the sink tokens at deeper speculation depths.
  • Two architectural changes, FC normalization and post-norm hidden state feedback, stabilize the drafter across all speculation steps.
  • For long-context workloads, EAGLE 3.1 delivers up to 2× longer acceptance length compared to EAGLE 3.
  • Benchmarks on Kimi-K2.6-NVFP4 show 2.03× per-user throughput at concurrency 1, tapering to 1.66× at C=16.
  • EAGLE 3.1 is backward-compatible with EAGLE 3 checkpoints and is already merged into vLLM main, shipping with v0.22.0.

Michal Sutter is a data science expert with a Master of Science in Data Science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at turning complex data sets into actionable insights.
