NVIDIA Releases Nemotron-Labs-TwoTower: An Open-Weight Diffusion Language Model Built on the Frozen Autoregressive Nemotron-3-Nano-30B-A3B Backbone

NVIDIA has released Nemotron-Labs-TwoTower, a diffusion language model built on a frozen, pretrained autoregressive backbone. It ships as an open model under the NVIDIA Nemotron Open Model license. The release targets the decoding throughput bottleneck.
Autoregressive (AR) models decode one token at a time. That serial process limits throughput. Diffusion language models take another route. They generate tokens in parallel and refine them iteratively.
Most diffusion language models use a single network for two jobs: representing clean tokens and denoising corrupted ones at every step. TwoTower splits these jobs across two towers. It retains 98.7% of the AR baseline's aggregate benchmark quality. It also reports 2.42× higher wall-clock throughput.
TL;DR
- TwoTower splits diffusion decoding into a frozen AR context tower and a trained denoiser tower.
- It retains 98.7% of AR quality at 2.42× throughput (γ=0.8, S=16, 2×H100).
- The denoiser was trained on ~2.1T tokens; the backbone used 25T.
- One checkpoint supports diffusion, mock-AR, and AR modes.
Nemotron-Labs-TwoTower
TwoTower is a block-wise autoregressive diffusion model. It is built on Nemotron-3-Nano-30B-A3B, an open-weight hybrid backbone. That backbone combines Mamba-2, self-attention, and mixture-of-experts (MoE) layers.
Each tower has 52 layers: 23 Mamba-2, 6 self-attention, and 23 MoE. The released checkpoint ships both towers, about 60B parameters in total. Active parameters per token are about 3B per tower. Each MoE layer has 128 routed experts, 6 of which are active per token, plus 2 shared experts.
Both towers start out as copies of the same backbone checkpoint. Only the denoiser tower is trained. The AR context tower stays frozen throughout. The denoiser was trained on ~2.1T tokens, a fraction of the backbone's 25T-token pretraining budget.
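The layer and expert counts quoted above can be sanity-checked with a few lines of arithmetic. This is an illustrative sketch only: the `TowerConfig` class and its field names are hypothetical and are not part of the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TowerConfig:
    """Hypothetical container for the per-tower numbers quoted in the article."""
    mamba2_layers: int = 23
    attention_layers: int = 6
    moe_layers: int = 23
    routed_experts: int = 128
    active_routed_experts: int = 6
    shared_experts: int = 2

    @property
    def total_layers(self) -> int:
        return self.mamba2_layers + self.attention_layers + self.moe_layers

    @property
    def routed_fraction_active(self) -> float:
        # Share of the routed experts that run for any single token.
        return self.active_routed_experts / self.routed_experts


cfg = TowerConfig()
print(cfg.total_layers)                     # 52
print(f"{cfg.routed_fraction_active:.1%}")  # 4.7%
```

The small active fraction is what keeps per-token compute near 3B parameters per tower even though each tower holds roughly 30B.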
How the Two Towers Work
The AR context tower runs causally over the prompt and committed tokens. It produces per-layer KV caches and Mamba-2 final states. It preserves the backbone's autoregressive capability.
The diffusion denoiser tower refines noisy blocks. Within a block, it uses bidirectional attention. It remains causal with respect to previous clean blocks.
The towers connect layer by layer. Denoiser layer i cross-attends to context tower layer i. This layer-aligned cross-attention gives the denoiser multi-level access to the backbone's representations. Previous methods pass along only the last hidden state.
Two other denoiser modifications matter. The denoiser's Mamba-2 layers take their initial state from the context tower's Mamba states. The diffusion timestep conditions each layer through a single adaLN timestep module. That adaLN module adds only 1.5M parameters.
Generation runs block by block. Each block starts with S [MASK] tokens. The denoiser refines the block over T steps, then commits it. The context tower then processes the committed tokens to update its cache.
This explains why multiple denoising steps can still beat single-token decoding. Autoregressive decoding generates one token per step. TwoTower generates multiple tokens per step before updating the cache.
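The attention pattern described here (bidirectional inside a block, causal across blocks) can be pictured as a boolean mask. The sketch below is a simplification that treats everything as a single sequence; `block_attention_mask` is a hypothetical helper, and the released model may realize the same pattern differently, for example by reading earlier blocks through the other tower's cache.

```python
def block_attention_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    mask = []
    for q in range(seq_len):
        q_block = q // block_size
        # A query sees every key in its own block and in all earlier blocks.
        row = [(k // block_size) <= q_block for k in range(seq_len)]
        mask.append(row)
    return mask


mask = block_attention_mask(seq_len=6, block_size=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xxx...
# xxx...
# xxx...
# xxxxxx
# xxxxxx
# xxxxxx
```

Positions in the first block see each other but nothing later, while positions in the second block see both blocks.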
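For readers unfamiliar with adaLN (adaptive layer normalization), the sketch below shows the general idea as it is commonly used in diffusion transformers: normalize the activations, then apply a scale and shift that depend on the timestep. It is not TwoTower's implementation. `ada_layer_norm` and the fixed scale and shift values are hypothetical; in a real model they would come from a small learned projection of a timestep embedding.

```python
import math


def layer_norm(x: list[float], eps: float = 1e-5) -> list[float]:
    """Plain layer normalization without learned parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def ada_layer_norm(x: list[float], scale: list[float], shift: list[float]) -> list[float]:
    """y = LN(x) * (1 + scale) + shift, with scale and shift set by the timestep."""
    normed = layer_norm(x)
    return [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]


hidden = [1.0, 2.0, 3.0, 4.0]
# With zero scale and shift, adaLN reduces to plain layer normalization.
identity = ada_layer_norm(hidden, scale=[0.0] * 4, shift=[0.0] * 4)
# A non-zero scale and shift (here constants) modulate the normalized activations.
modulated = ada_layer_norm(hidden, scale=[0.5] * 4, shift=[0.1] * 4)
```

Because the conditioning is only a scale and a shift, a module like this can stay small relative to the rest of the network.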
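The block-by-block loop can be sketched with a toy stand-in for the denoiser. Everything here is hypothetical: `toy_denoiser` returns random tokens and confidences, and the rule of always unmasking the single most confident slot when nothing clears the threshold is an assumption made so that every step makes progress. It is not taken from NVIDIA's code.

```python
import random

MASK = None  # stand-in for the [MASK] token


def toy_denoiser(block, rng):
    """Hypothetical denoiser: proposes (token, confidence) for each masked slot."""
    return {i: (rng.randrange(1000), rng.random())
            for i, tok in enumerate(block) if tok is MASK}


def generate_block(block_size, max_steps, gamma, rng):
    """Refine one block of masks, unmasking slots whose confidence reaches gamma."""
    block = [MASK] * block_size
    steps = 0
    while steps < max_steps and any(tok is MASK for tok in block):
        proposals = toy_denoiser(block, rng)
        accepted = {i: t for i, (t, c) in proposals.items() if c >= gamma}
        if not accepted:
            # Assumption: unmask the most confident slot so each step makes progress.
            best = max(proposals, key=lambda i: proposals[i][1])
            accepted = {best: proposals[best][0]}
        for i, tok in accepted.items():
            block[i] = tok
        steps += 1
    return block, steps


rng = random.Random(0)
committed = []
for _ in range(2):
    block, steps = generate_block(block_size=16, max_steps=16, gamma=0.8, rng=rng)
    # In TwoTower, the frozen AR tower would now process this block and update its cache.
    committed.extend(block)
    print(f"committed {len(block)} tokens in {steps} denoising steps")
```

Whenever a block finishes in fewer steps than it has tokens, more than one token was committed per denoising step on average, which is where the throughput gain comes from.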
Measurements
Evaluation uses BF16 on 2×H100 GPUs. The default operating point is confidence-based unmasking with threshold γ=0.8 and block size S=16. The table compares the AR baseline against TwoTower diffusion decoding.
| Benchmark | Nemotron-3-Nano-30B-A3B (AR) | Nemotron-Labs-TwoTower (diffusion) |
|---|---|---|
| MMLU (5-shot, acc) | 78.56 | 78.24 |
| MMLU-Pro (5-shot, CoT EM) | 62.59 | 60.93 |
| ARC-Challenge (25-shot, acc_norm) | 91.72 | 92.66 |
| WinoGrande (5-shot, acc) | 76.09 | 76.09 |
| RACE (0-shot, acc) | 88.90 | 88.90 |
| HumanEval (0-shot) | 79.27 | 75.58 |
| MBPP-Sanitized (3-shot) | 74.71 | 74.28 |
| GSM8K (8-shot, acc) | 92.49 | 90.14 |
| MATH-500 (4-shot) | 84.40 | 80.60 |
| Global MMLU Lite (5-shot) | 73.97 | 73.94 |
| MGSM (8-shot, avg acc) | 80.80 | 80.40 |
| Quality retained | 100% | 98.7% |
| Generation throughput (× AR) | 1.0× | 2.42× |
General-knowledge scores stay close to the AR baseline: MMLU is within half a point, and MMLU-Pro drops 1.66 points. Code and math show modest degradation. Commonsense and multilingual scores are essentially retained, and ARC-Challenge improves slightly. Lowering γ commits more tokens per step and raises throughput, at reduced quality.
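The 98.7% figure can be cross-checked against the per-benchmark scores in the table. The article does not say how the aggregate is computed, so the sketch below tries two simple options: the mean of per-task ratios and the ratio of summed scores. Both are assumptions about the method, and both land near the reported value.

```python
# (AR score, TwoTower score) for each benchmark row in the table.
scores = {
    "MMLU": (78.56, 78.24),
    "MMLU-Pro": (62.59, 60.93),
    "ARC-Challenge": (91.72, 92.66),
    "WinoGrande": (76.09, 76.09),
    "RACE": (88.90, 88.90),
    "HumanEval": (79.27, 75.58),
    "MBPP-Sanitized": (74.71, 74.28),
    "GSM8K": (92.49, 90.14),
    "MATH-500": (84.40, 80.60),
    "Global MMLU Lite": (73.97, 73.94),
    "MGSM": (80.80, 80.40),
}

per_task = [tt / ar for ar, tt in scores.values()]
mean_of_ratios = sum(per_task) / len(per_task)
ratio_of_sums = sum(tt for _, tt in scores.values()) / sum(ar for ar, _ in scores.values())

print(f"mean of per-task ratios: {mean_of_ratios:.2%}")  # about 98.65%
print(f"ratio of summed scores:  {ratio_of_sums:.2%}")   # about 98.67%
```

Either way, the aggregate rounds to 98.7%, with HumanEval and MATH-500 contributing most of the gap.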
Running It: Three Generation Modes
The checkpoint exposes three inference modes. Full two-tower diffusion uses 2 GPUs, about 59GB per GPU in BF16. AR-only mode runs on a single 80GB GPU.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, trust_remote_code=True,
)
# context tower -> GPU 0, denoiser tower -> GPU 1
model.place_towers_on_devices("cuda:0", "cuda:1")
model.eval()

prompt = "France is a country "
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
outputs = model.generate_mask_diffusion(
    inputs["input_ids"], max_new_tokens=128,
    block_size=16, steps_per_block=16, mask_token_id=3,
    temperature=0.1, confidence_threshold=0.8,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
The three methods are generate_mask_diffusion(), generate_mock_ar(), and generate_ar(). Mask diffusion commits up to block_size tokens per step. Mock-AR and AR generate one token per step.
When It Fits: Use Cases
The most direct use case is high-throughput batch generation. A data team producing synthetic text can trade a small quality drop for throughput. At γ=0.8, that trade is 1.3% quality for 2.42× speed.
A second use case is tuning the quality-speed trade-off. Raising γ preserves more quality, according to NVIDIA's paper. Lowering γ commits more tokens per step and runs faster.
The third use case is flexible inference. The context tower keeps its own LM head for drafting, verification, or AR scoring. Teams can run AR and diffusion decoding from a single checkpoint.
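As a rough illustration of that trade, the arithmetic below converts the reported speedup into wall-clock time for a batch job. The 100-hour job size is a made-up example, and the calculation assumes the reported 2.42× figure carries over unchanged to the workload and hardware in question, which may not hold in practice.

```python
def diffusion_wall_clock(ar_hours: float, speedup: float = 2.42) -> float:
    """Wall-clock time under diffusion decoding, given AR time and a speedup factor."""
    return ar_hours / speedup


ar_hours = 100.0  # hypothetical job that takes 100 hours with AR decoding
quality_retained = 98.7  # percent, at the default operating point

print(f"{diffusion_wall_clock(ar_hours):.1f} hours")    # 41.3 hours
print(f"{100.0 - quality_retained:.1f}% quality drop")  # 1.3% quality drop
```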
Strengths and Weaknesses
Strengths:
- Open weights under the NVIDIA Nemotron Open Model license; ready for commercial use
- Retains 98.7% of AR quality at 2.42× throughput at the default operating point
- A single checkpoint supports diffusion, mock-AR, and AR decoding
- Denoiser trained on ~2.1T tokens, not a full retraining
- Cache memory scales with sequence length as in the AR baseline
Weaknesses:
- A full two-tower deployment requires 2 GPUs and ~59GB per GPU in BF16
- Code and math degrade more than general knowledge (HumanEval 79.27 → 75.58)
- Keeping both towers resident raises the fixed weight memory
- The released checkpoint is a base model, before any instruction tuning or alignment
- Throughput past 3× comes with a significant loss of quality