LmCast :: Stay tuned in

Eagle 3.1: Collaboration Between the EAGLE Team, vLLM Team, and TorchSpec Team

Recorded: May 26, 2026, 1:15 p.m.

EAGLE 3.1: Advancing Speculative Decoding Through Collaboration Between the EAGLE Team, vLLM, and TorchSpec
vLLM Blog, May 26, 2026, by the EAGLE Team, vLLM Team, and TorchSpec Team
The EAGLE series (EAGLE 1, EAGLE 2, and EAGLE 3) has become one of the most widely adopted and practically deployed families of speculative decoding algorithms across both research and production systems.
Today, the EAGLE team, vLLM team, and TorchSpec team are excited to jointly introduce EAGLE 3.1 — a major step forward in speculative decoding robustness, efficiency, and deployability.
EAGLE 3.1 Innovations
While speculative decoding performs well in controlled settings, performance often degrades under different chat templates, long-context inputs, or out-of-distribution system prompts.
The EAGLE team traced this fragility to a phenomenon we call attention drift — as speculation depth increases, the drafter gradually shifts attention away from sink tokens and toward its own generated tokens.
We identified two underlying issues. First, the fused input representation becomes increasingly imbalanced as higher-layer hidden states dominate the drafter input. Second, hidden-state magnitude grows across speculation steps due to the unnormalized residual path. Together, these effects make the drafter progressively less stable at deeper speculation depths.
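The two issues can be made concrete with a toy calculation. The sketch below is not the EAGLE code: the function names, the layer magnitudes, and the residual branch (which is assumed to stay perfectly aligned with the residual stream, a worst case) are all illustrative assumptions. It only shows how a larger high-layer hidden state dominates a concatenated input, and how an unnormalized residual path lets magnitude accumulate step after step.

```python
import numpy as np

def rms(x: np.ndarray) -> float:
    """Root-mean-square magnitude of a vector."""
    return float(np.sqrt(np.mean(np.square(x))))

def energy_shares(hidden_states: list) -> list:
    """Fraction of the fused (concatenated) input's squared norm owned by each layer."""
    energies = [float(np.sum(np.square(h))) for h in hidden_states]
    total = sum(energies)
    return [e / total for e in energies]

def unnormalized_rollout(h0: np.ndarray, steps: int, gain: float = 0.5) -> list:
    """RMS of the residual stream after each step of h <- h + gain * h / rms(h).

    The branch output has RMS equal to `gain` and points along h (toy assumption),
    and nothing ever rescales h itself, so the magnitude only accumulates.
    """
    h = h0.astype(np.float64)
    history = []
    for _ in range(steps):
        h = h + gain * h / rms(h)
        history.append(rms(h))
    return history

# Hypothetical low, middle, and high layer hidden states with RMS 1, 2, and 8.
layers = [np.full(8, 1.0), np.full(8, 2.0), np.full(8, 8.0)]
print([round(s, 3) for s in energy_shares(layers)])  # [0.014, 0.058, 0.928]
print(unnormalized_rollout(np.ones(8), steps=4))     # RMS rises by `gain` every step
```

In this toy setting the highest layer carries about 93% of the fused input's energy, and the state handed to each deeper speculation step is larger than the one before it.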
Figure 1: EAGLE 3 vs. EAGLE 3.1 architecture comparison. EAGLE 3.1 adds FC normalization after each target hidden state and feeds post-norm hidden states into the next decoding step.
To address this issue, EAGLE 3.1 introduces two key architectural improvements:

FC normalization after each target hidden state and before the FC layer
Feeding post-norm hidden states into the next decoding step

Intuitively, the post-norm design makes the method behave more like recursively invoking the drafter across decoding steps, rather than simply appending additional layers to the target model.
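A minimal numerical sketch of the two changes follows. It assumes the normalization is an RMSNorm without learned weights and uses a toy residual update in place of the real drafter layer; the function names, the averaging FC weights, and the input magnitudes are hypothetical and are not taken from the EAGLE or vLLM code.

```python
import numpy as np

def rms(x: np.ndarray) -> float:
    """Root-mean-square magnitude of a vector."""
    return float(np.sqrt(np.mean(np.square(x))))

def rms_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Rescale x to (approximately) unit RMS; the learned scale is omitted."""
    return x / np.sqrt(np.mean(np.square(x)) + eps)

def fuse_with_fc_norm(hidden_states: list, fc_weight: np.ndarray) -> np.ndarray:
    """Normalize each target hidden state, concatenate, then apply the FC layer."""
    normed = [rms_norm(h) for h in hidden_states]
    return fc_weight @ np.concatenate(normed)

def fed_in_magnitudes(h0, steps: int, gain: float = 0.5, post_norm: bool = True) -> list:
    """RMS of the hidden state handed to each drafting step."""
    h = np.asarray(h0, dtype=np.float64)
    fed_in = []
    for _ in range(steps):
        fed_in.append(rms(h))
        out = h + gain * rms_norm(h)             # toy drafter layer: residual + branch
        h = rms_norm(out) if post_norm else out  # post-norm feedback vs. raw output
    return fed_in

# Hypothetical layer outputs with RMS 1, 2, and 8; FC weights are a plain average.
layers = [np.full(8, 1.0), np.full(8, 2.0), np.full(8, 8.0)]
fc_weight = np.full((4, 24), 1.0 / 24)
print(fuse_with_fc_norm(layers, fc_weight))                          # each layer contributes equally
print(fed_in_magnitudes(np.full(8, 3.0), steps=4, post_norm=False))  # keeps growing
print(fed_in_magnitudes(np.full(8, 3.0), steps=4, post_norm=True))   # settles near 1
```

With post-norm feedback every step after the first sees an input of roughly the same scale, which matches the intuition of recursively invoking the same drafter rather than stacking ever-deeper layers.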
These changes significantly improve robustness across deployment scenarios. Compared with EAGLE 3, EAGLE 3.1 demonstrates:

Better training-time to inference-time extrapolation
Stronger long-context robustness
Higher resilience to chat template and system prompt variation
More stable acceptance length across diverse serving environments

In long-context workloads, EAGLE 3.1 achieves up to 2× longer acceptance length compared with EAGLE 3.
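For readers less familiar with the metric, here is a small sketch of how acceptance length can be computed. It assumes greedy verification (a draft token is accepted only while it matches the token the target model would have produced) and counts one extra token per step for the token the target always contributes. The helper names and token ids are made up for illustration and are not measurements.

```python
def accepted_prefix_length(draft_tokens: list, target_tokens: list) -> int:
    """Number of leading draft tokens that match the target's own choices."""
    accepted = 0
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            break
        accepted += 1
    return accepted

def mean_acceptance_length(accepted_per_step: list) -> float:
    """Average tokens emitted per verification step: accepted drafts plus one."""
    if not accepted_per_step:
        raise ValueError("need at least one verification step")
    return 1 + sum(accepted_per_step) / len(accepted_per_step)

# Made-up (draft, target) token ids for four verification steps.
steps = [
    ([11, 42, 7], [11, 42, 7]),
    ([11, 42, 7], [11, 9, 7]),
    ([5, 6, 8], [3, 6, 8]),
    ([5, 6, 8], [5, 6, 1]),
]
accepted = [accepted_prefix_length(d, t) for d, t in steps]
print(accepted)                          # [3, 1, 0, 2]
print(mean_acceptance_length(accepted))  # 2.5
```

Under this definition, a 2× longer acceptance length means roughly twice as many tokens are emitted per target-model forward pass.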
EAGLE 3.1 Training with TorchSpec
TorchSpec now provides efficient training support for EAGLE 3.1 and future speculative decoding algorithms.
By lowering training overhead and simplifying experimentation workflows, TorchSpec helps accelerate iteration and exploration for next-generation speculative decoding research and deployment.
Based on TorchSpec and vLLM, we also trained and open-sourced an EAGLE 3.1 draft model for Kimi K2.6:
https://huggingface.co/lightseekorg/kimi-k2.6-eagle3.1-mla
The model serves as an example of deploying EAGLE 3.1 with TorchSpec training and vLLM serving support on a real-world serving model.
EAGLE 3.1 Integration with vLLM
EAGLE 3.1 lands in vLLM as a config-driven extension of the existing EAGLE 3 implementation.
The integration includes:

FC normalization support
Post-norm hidden-state feedback
Removal of hardcoded assumptions around target hidden states

At the same time, backward compatibility with existing EAGLE 3 checkpoints is fully preserved. As a result, EAGLE 3.1 draft models can be plugged directly into the same speculative-decoding code path, for example:
vllm serve nvidia/Kimi-K2.6-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --tool-call-parser kimi_k2 \
  --enable-auto-tool-choice \
  --reasoning-parser kimi_k2 \
  --attention-backend tokenspeed_mla \
  --speculative-config '{"model":"lightseekorg/kimi-k2.6-eagle3.1-mla","method":"eagle3","num_speculative_tokens":3}' \
  --language-model-only
This makes draft-model upgrades in production vLLM serving smooth and easy.
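The --speculative-config value in the command above is a JSON object passed as a single shell argument. When launch commands are generated by a script, a small helper such as the one sketched below can build that argument without hand-escaping quotes. The helper name is hypothetical; the keys and values are the ones shown in the command, and only the Python standard library is used.

```python
import json
import shlex

def speculative_config_flag(draft_model: str, num_speculative_tokens: int,
                            method: str = "eagle3") -> str:
    """Build the --speculative-config argument as one shell-safe string."""
    config = {
        "model": draft_model,
        "method": method,  # the example command keeps "eagle3" for the 3.1 draft model
        "num_speculative_tokens": num_speculative_tokens,
    }
    payload = json.dumps(config, separators=(",", ":"))  # compact JSON, no spaces
    return f"--speculative-config {shlex.quote(payload)}"

print(speculative_config_flag("lightseekorg/kimi-k2.6-eagle3.1-mla", 3))
```

Swapping in a different draft model then only requires changing the first argument, which reflects the config-driven design described above.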
The support has already been merged into the current main branch of vLLM and will be available via vLLM's nightly release as well as the upcoming v0.22.0 release.
As an early data point, we benchmarked the Kimi K2.6 EAGLE 3.1 draft model on Kimi-K2.6-NVFP4 with vLLM (TP=4, GB200, non-disaggregated serving) on the SPEED-Bench coding dataset. Relative to the no-speculation baseline, EAGLE 3.1 delivers 2.03× higher per-user output throughput at concurrency 1, and the speedup stays meaningful as concurrency scales (1.71× at C=4, 1.66× at C=16).
Figure 2: Per-user output throughput (TPS) on Kimi-K2.6-NVFP4 with vLLM, TP=4, GB200 on SPEED-Bench coding. EAGLE 3.1-MLA vs. no-spec baseline.
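To put the reported ratios in latency terms, the short calculation below converts each speedup into the implied reduction in time per output token. It assumes that per-user output throughput is the reciprocal of a user's mean time per output token; the function name is illustrative, and the only inputs are the three ratios quoted above.

```python
# Reported per-user throughput ratios vs. the no-spec baseline, keyed by concurrency.
reported_speedup = {1: 2.03, 4: 1.71, 16: 1.66}

def time_per_token_reduction(speedup: float) -> float:
    """Fractional drop in time per output token implied by a throughput ratio."""
    return 1 - 1 / speedup

for concurrency, speedup in reported_speedup.items():
    reduction = time_per_token_reduction(speedup)
    print(f"C={concurrency}: {speedup:.2f}x throughput, "
          f"{reduction:.1%} less time per output token")
# C=1: 2.03x throughput, 50.7% less time per output token
# C=4: 1.71x throughput, 41.5% less time per output token
# C=16: 1.66x throughput, 39.8% less time per output token
```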
Open-Source Collaboration Across the Ecosystem
This collaboration between the EAGLE team, vLLM team, and TorchSpec team is a strong example of open-source work spanning algorithm research, system optimization, and training infrastructure.
The EAGLE team continues advancing speculative decoding algorithms, vLLM helps bring these innovations into production inference systems at scale, and TorchSpec enables efficient training and rapid experimentation for future speculative decoding algorithms.
Together, we hope to continue raising the overall baseline for speculative decoding and driving further improvements in token efficiency across the broader LLM ecosystem.

The EAGLE, vLLM, and TorchSpec teams have jointly introduced EAGLE 3.1, a significant advance in the robustness, efficiency, and deployability of speculative decoding. It targets a known fragility: speculative decoding performance often degrades under different chat templates, long-context inputs, or out-of-distribution system prompts. The EAGLE team traced this fragility to a phenomenon termed attention drift, in which the drafter shifts attention away from sink tokens and toward its own generated tokens as speculation depth increases. Two underlying issues drive it: the fused input representation becomes imbalanced as higher-layer hidden states dominate the drafter input, and hidden-state magnitude grows across speculation steps because of the unnormalized residual path.

To resolve these issues, EAGLE 3.1 makes two architectural changes. First, it applies FC normalization after each target hidden state and before the FC layer. Second, it feeds the post-norm hidden states into the next decoding step. The post-norm design makes the method behave more like recursively invoking the drafter across decoding steps than like appending additional layers to the target model. Compared with EAGLE 3, the result is better training-time to inference-time extrapolation, stronger long-context robustness, higher resilience to chat template and system prompt variation, and more stable acceptance length across serving environments. In long-context workloads, EAGLE 3.1 achieves up to 2× longer acceptance length than EAGLE 3.

TorchSpec provides efficient training support for EAGLE 3.1 and future speculative decoding algorithms, lowering training overhead and simplifying experimentation workflows. Using TorchSpec and vLLM, the teams also trained and open-sourced an EAGLE 3.1 draft model for Kimi K2.6, which serves as a practical example of TorchSpec training combined with vLLM serving.

EAGLE 3.1 is integrated into vLLM as a config-driven extension of the existing EAGLE 3 implementation. The integration adds FC normalization support and post-norm hidden-state feedback, and removes hardcoded assumptions about target hidden states. Backward compatibility with existing EAGLE 3 checkpoints is fully preserved, so EAGLE 3.1 draft models plug directly into the same speculative-decoding code path, which keeps draft-model upgrades in production vLLM serving straightforward. The support has been merged into the main branch of vLLM and will be available in nightly releases and the upcoming v0.22.0 release. In a benchmark of the Kimi K2.6 EAGLE 3.1 draft model on Kimi-K2.6-NVFP4 with vLLM (TP=4, GB200) on the SPEED-Bench coding dataset, per-user output throughput was 2.03× higher than the no-speculation baseline at concurrency 1, 1.71× at concurrency 4, and 1.66× at concurrency 16.

The collaboration combines algorithm research from the EAGLE team, production inference from vLLM, and training infrastructure from TorchSpec, with the shared aim of improving token efficiency across the broader LLM ecosystem.