NVIDIA AI Releases Nemotron 3: A Hybrid Mamba-Transformer MoE Stack for Long-Horizon Agentic AI

NVIDIA released the Nemotron 3 family of open models as part of a full AI stack that includes model weights, datasets and reinforcement learning tools. The family comes in three sizes, Nano, Super and Ultra, and is aimed at multi-agent systems that require long-horizon reasoning and tight control over costs. Nano has about 30 billion parameters with about 3 billion active per token, Super has about 100 billion parameters with up to 10 billion active per token, and Ultra has 500 billion parameters with up to 50 billion active per token.

Model family and target workloads
Nemotron 3 is introduced as an open model family for agentic applications. The line contains Nano, Super and Ultra models, each aimed at a different workload profile.
Nemotron 3 Nano is a hybrid Mamba-Transformer mixture-of-experts language model with about 31.6 billion parameters. Only about 3.2 billion parameters are active per forward pass, or 3.6 billion including embeddings. This sparse activation lets the model retain high representational capacity while keeping compute per token low.
Nemotron 3 Super has about 100 billion parameters with up to 10 billion active per token. Nemotron 3 Ultra scales this design to 500 billion parameters with up to 50 billion active per token. Super targets high-accuracy inference for large multi-agent applications, while Ultra is intended for complex research and planning workflows.
Nemotron 3 Nano is available now as open weights and recipes, on Hugging Face and as an NVIDIA NIM microservice. Super and Ultra are scheduled for the first half of 2026.
NVIDIA Nemotron 3 Nano delivers 4 times higher token throughput than Nemotron 2 Nano and reduces token consumption significantly, while supporting native context lengths of up to 1 million tokens. This combination is aimed at multi-agent systems running long-context workloads such as long documents and large code bases.


Hybrid Mamba Transformer MoE architecture
The core design of Nemotron 3 is a hybrid Mamba-Transformer MoE architecture. The models mix sequential Mamba blocks, attention blocks and mixture-of-experts blocks within a single stack.
In Nemotron 3 Nano, the research team defines a pattern that interleaves Mamba 2 blocks, attention blocks and MoE blocks. The standard feedforward layers from previous Nemotron generations are replaced by MoE layers. A learned router selects a small subset of experts for each token, for example 6 out of 128 experts in Nano, which keeps the active parameter count close to 3.2 billion while the full model holds 31.6 billion parameters.
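The routing mechanism described above can be sketched as follows. This is a minimal top-k MoE sketch, not Nemotron's implementation; the expert shapes, router, and weighting scheme are illustrative assumptions.

```python
import numpy as np

def topk_moe_layer(x, W_router, experts, k=6):
    """Toy top-k MoE layer: route each token to k of n experts,
    weight their outputs by a softmax over the selected router logits."""
    logits = x @ W_router                         # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]    # indices of the k best experts
    sel = np.take_along_axis(logits, topk, axis=-1)
    w = np.exp(sel - sel.max(-1, keepdims=True))  # softmax over selected experts only
    w /= w.sum(-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j, e in enumerate(topk[t]):
            out[t] += w[t, j] * experts[e](x[t])
    return out

rng = np.random.default_rng(0)
d, n_experts, k = 16, 128, 6
x = rng.standard_normal((4, d))
W_router = rng.standard_normal((d, n_experts))
# each "expert" is a tiny linear map here, standing in for a full FFN
Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_experts)]
experts = [lambda v, W=W: v @ W for W in Ws]
y = topk_moe_layer(x, W_router, experts, k)
print(y.shape)  # (4, 16)
```

Because only 6 of 128 experts run per token, the per-token compute stays close to that of a small dense model even though all 128 experts' parameters exist in memory.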


Mamba 2 handles long-range sequence modeling with state-space style updates, attention layers provide direct token-to-token interactions for structurally sensitive tasks, and MoE provides parameter scaling without proportional compute scaling. The important point is that most layers run fast sequential or sparse expert computation, and full attention is used only where it matters most for reasoning.
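The division of labor above amounts to a stack where attention is the rare block. The pattern below is hypothetical (NVIDIA's actual layer ordering and ratios are not reproduced here); it only illustrates the idea of a mostly-Mamba, mostly-sparse stack with occasional full attention:

```python
# Hypothetical hybrid stack: mostly Mamba and sparse MoE blocks,
# with full attention at only a few positions. Ratios are illustrative,
# not Nemotron 3's published pattern.
PATTERN = (["mamba", "mamba", "moe"] * 3 + ["attention"]) * 4

def block_fractions(pattern):
    """Fraction of the stack occupied by each block type."""
    n = len(pattern)
    return {b: pattern.count(b) / n for b in set(pattern)}

fracs = block_fractions(PATTERN)
print(sorted(fracs.items()))
# attention is 1 block in 10; the rest is cheap sequential or sparse compute
```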
In Nemotron 3 Super and Ultra, NVIDIA adds LatentMoE. Tokens are projected into a lower-dimensional latent space, experts operate in that latent space, and the output is projected back. This design allows more experts at the same communication and compute cost, which supports more specialization across tasks and languages.
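The down-project, expert, up-project flow can be sketched as below. The dimensions and the shared projection matrices are illustrative assumptions, not Nemotron's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, n_experts = 64, 16, 8  # illustrative sizes, not Nemotron's

# shared projections in and out of the latent space
W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_up = rng.standard_normal((d_latent, d_model)) / np.sqrt(d_latent)
# expert weights live entirely in the small latent space
experts = [rng.standard_normal((d_latent, d_latent)) / np.sqrt(d_latent)
           for _ in range(n_experts)]

def latent_moe(x, expert_idx):
    """Project to latent space, apply the chosen expert there, project back."""
    z = x @ W_down               # (tokens, d_latent)
    z = z @ experts[expert_idx]  # expert computation at reduced width
    return z @ W_up              # back to model width

x = rng.standard_normal((4, d_model))
y = latent_moe(x, expert_idx=3)
print(y.shape)  # (4, 64)
```

With these toy sizes, each expert costs d_latent x d_latent parameters instead of d_model x d_model, a 16x reduction, so many more experts fit in the same parameter and communication budget.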
Super and Ultra include multi-token prediction. Multiple output heads share a common trunk and predict several future tokens in a single pass. During training this improves the learning signal, and at inference it enables speculative-decoding style generation that produces several tokens per full forward pass.
Training data, precision format and context window
Nemotron 3 is trained on a large corpus of text and code. The research team reports pretraining on about 25 trillion tokens, with more than 3 trillion new unique tokens beyond the Nemotron 2 generation. Nemotron 3 Nano uses Nemotron Common Crawl v2.1, Nemotron CC Code and Nemotron Pretraining Code v2, as well as specialized datasets of scientific and reasoning content.
Super and Ultra are trained largely in NVFP4, a 4-bit floating-point format optimized for NVIDIA accelerators. Matrix multiplications run in NVFP4 while accumulation uses higher precision. This reduces memory traffic and improves throughput while keeping accuracy close to higher-precision formats.
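The low-precision-multiply, high-precision-accumulate pattern can be simulated numerically. The sketch below uses a toy symmetric integer quantizer, not the real NVFP4 format (which is a floating-point format with hardware block scaling); it only shows why accumulating in higher precision keeps the result usable:

```python
import numpy as np

def quantize_4bit(x):
    """Toy symmetric 4-bit quantizer (NOT the real NVFP4 format):
    map each tensor onto 15 signed integer levels with one scale."""
    scale = np.abs(x).max() / 7.0 + 1e-12
    q = np.clip(np.round(x / scale), -7, 7)
    return q, scale

def matmul_low_precision(a, b):
    """Multiply 4-bit-quantized operands; accumulate in float64."""
    qa, sa = quantize_4bit(a)
    qb, sb = quantize_4bit(b)
    return (qa @ qb) * (sa * sb)  # accumulation happens at high precision

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 8))
b = rng.standard_normal((8, 8))
exact = a @ b
approx = matmul_low_precision(a, b)
rel_err = np.abs(approx - exact).mean() / np.abs(exact).mean()
print(f"mean relative error: {rel_err:.3f}")
```

Even with coarse 4-bit operands, the high-precision accumulation keeps errors from compounding across the reduction dimension, which is the property NVFP4 training relies on at much larger scale.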
All Nemotron 3 models support context windows up to 1 million tokens. The architecture and training pipeline are tuned for long-horizon reasoning at this length, which matters in multi-agent environments that span long traces and shared working memory between agents.
Key Takeaways
- Nemotron 3 is an open model family in three sizes for agentic AI: Nemotron 3 comes in Nano, Super and Ultra variants. Nano has about 30 billion parameters with about 3 billion active per token, Super has about 100 billion parameters with up to 10 billion active per token, and Ultra has 500 billion parameters with up to 50 billion active per token. The family targets multi-agent applications that require long-horizon reasoning.
- Hybrid Mamba Transformer MoE with a 1 million token context: The Nemotron 3 models use a hybrid Mamba 2 plus Transformer architecture with a mixture of experts and support a context window of 1 million tokens. This design provides long-context handling with high throughput, where only a small set of experts runs on each token and attention is used where it matters most for reasoning.
- LatentMoE and multi-token prediction in Super and Ultra: The Super and Ultra variants add LatentMoE, where expert computation takes place in a lower-dimensional latent space, which lowers communication costs and allows more experts, and multi-token prediction heads that generate several future tokens per forward pass. These changes improve quality and enable speculative-decoding style acceleration for long text and chain-of-thought workloads.
- Large-scale training data and NVFP4 precision for efficiency: Nemotron 3 is pretrained on about 25 trillion tokens, with more than 3 trillion new tokens beyond the previous generation, while Super and Ultra are trained mainly in NVFP4, a 4-bit floating-point format for NVIDIA GPUs. This combination improves throughput and reduces memory usage while keeping accuracy close to standard precision.
Check out the Paper, technical blog and model weights on HF. Feel free to check out our GitHub page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don't forget to join our 100k+ ML SubReddit and subscribe to our newsletter.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the power of Artificial Intelligence for the benefit of society. His latest endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its extensive coverage of machine learning and deep learning stories that are technically sound yet easily understood by a wide audience. The platform boasts more than 2 million monthly views, reflecting its popularity among readers.



