NVIDIA Releases Nemotron-Labs-TwoTower: An Open-Weight Diffusion Language Model Built on the Frozen Autoregressive Nemotron-3-Nano-30B-A3B Backbone

NVIDIA has released Nemotron-Labs-TwoTower, a diffusion language model built on a frozen pretrained backbone. It ships as an open model under the NVIDIA Nemotron Open Model license. The release targets the throughput bottleneck of serial decoding.

Autoregressive (AR) models generate one token at a time. That serial process limits throughput. Diffusion language models take another route. They generate tokens in parallel and refine them iteratively.

Most diffusion language models use a single network for two jobs: it represents clean tokens and denoises corrupted ones at every step. TwoTower splits these jobs across two towers. It retains 98.7% of the AR baseline's aggregate benchmark quality. It also reports 2.42× higher wall-clock throughput.

The TL;DR

TwoTower splits diffusion into a frozen AR context tower and a trained denoiser tower.
It retains 98.7% of AR quality at 2.42× throughput (γ=0.8, S=16, 2×H100).
The denoiser was trained on ~2.1T tokens; the backbone used 25T.
One checkpoint supports diffusion, mock-AR, and AR modes.

Nemotron-Labs-TwoTower

TwoTower is a block-wise autoregressive diffusion model. It is built on Nemotron-3-Nano-30B-A3B, an open-weight hybrid backbone. That backbone combines Mamba-2, self-attention, and mixture-of-experts (MoE) layers.

Each tower has 52 layers: 23 Mamba-2, 6 self-attention, and 23 MoE. The released checkpoint ships both towers, about 60B parameters in total. Active parameters per token are about 3B per tower. Each MoE layer uses 128 routed experts, 6 of them active per token, plus 2 shared experts.

Both towers start as copies of the same backbone checkpoint. Only the denoiser tower is trained; the AR context tower stays frozen. The denoiser was trained on ~2.1T tokens, a fraction of the backbone's 25T-token pretraining.
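As a quick sanity check on these figures, the layer counts and a rough weight-memory estimate can be worked out directly. This is a back-of-the-envelope sketch, not a measurement: it assumes the ~60B total splits evenly into 30B per tower and counts only BF16 weights (2 bytes per parameter), ignoring caches and activations.

```python
BYTES_PER_BF16 = 2

# Layer mix per tower, as stated in the article.
layers = {"mamba2": 23, "self_attention": 6, "moe": 23}
total_layers = sum(layers.values())

# Assumption: ~60B total parameters split evenly across the two towers.
params_per_tower = 30e9
weights_gb_per_tower = params_per_tower * BYTES_PER_BF16 / 1e9

# Experts evaluated per token in one MoE layer: routed (active) plus shared.
routed_active, shared = 6, 2
experts_per_token = routed_active + shared

print(total_layers, weights_gb_per_tower, experts_per_token)
```

The weight estimate of about 60GB per tower is in the same range as the roughly 59GB per GPU quoted for the two-GPU deployment; the 60B total is itself a rounded figure.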

How the Two Towers Work

The AR context tower runs causally over the prompt and committed tokens. It produces per-layer KV caches and Mamba-2 final states. Because it stays frozen, it preserves the backbone's autoregressive capability.

The diffusion denoiser tower refines noisy blocks. Within a block, it uses bidirectional attention. It remains causal with respect to previous clean blocks.

The towers connect layer by layer: denoiser layer i cross-attends to context-tower layer i. This layer-aligned cross-attention gives the denoiser multi-level access to the backbone's representations. Earlier methods pass along only the last hidden state.

Two other denoiser modifications matter. Its Mamba-2 layers take their initial state from the context tower's Mamba state. The diffusion timestep conditions each layer through a single adaLN timestep module. That adaLN module adds only 1.5M parameters.
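For readers unfamiliar with adaLN, the idea can be sketched in a few lines of plain Python. The article does not give TwoTower's exact parameterization, so this follows the common formulation (normalize, then apply a timestep-dependent scale and shift); `timestep_modulation` is a hypothetical stand-in for the learned module, with made-up constants.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def adaln(x, scale, shift):
    """Adaptive LayerNorm: normalize x, then modulate with scale and shift."""
    return [(1.0 + s) * n + b for n, s, b in zip(layer_norm(x), scale, shift)]

def timestep_modulation(t, dim):
    """Hypothetical stand-in for the learned timestep module.

    Maps a diffusion timestep t in [0, 1] to per-channel (scale, shift).
    A real module would be a small trained network.
    """
    scale = [0.1 * t] * dim
    shift = [0.05 * t] * dim
    return scale, shift

hidden = [0.5, -1.0, 2.0, 0.0]

# t = 0: no modulation, so adaLN reduces to plain LayerNorm.
scale, shift = timestep_modulation(t=0.0, dim=len(hidden))
out_clean = adaln(hidden, scale, shift)

# t = 1: the same normalized activations, scaled and shifted.
scale, shift = timestep_modulation(t=1.0, dim=len(hidden))
out_noisy = adaln(hidden, scale, shift)
```

Because only the small module that produces the scale and shift is new, this kind of conditioning adds few parameters relative to the tower itself.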

Generation runs block by block. Each block starts with S [MASK] tokens. The denoiser refines it over up to T steps, then commits it. The context tower then processes the committed tokens to update its cache.

This explains why multiple denoising steps can still beat single-token decoding. Autoregressive decoding produces one token per step. TwoTower can commit several tokens per denoising step before the context tower updates.
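The block loop described above can be sketched as a toy simulation. This is not NVIDIA's implementation: the confidences are random numbers standing in for the denoiser's outputs, and `denoise_block`, its parameters, and the fallback that commits the single most confident position when nothing clears γ are assumptions made for illustration.

```python
import random

MASK = None  # stand-in for the [MASK] token

def denoise_block(block_size=16, max_steps=16, gamma=0.8, seed=0):
    """Toy confidence-threshold unmasking for one block.

    Returns, for each position, the step at which it was committed.
    """
    rng = random.Random(seed)
    cells = [MASK] * block_size
    commit_step = [None] * block_size
    for step in range(1, max_steps + 1):
        masked = [i for i, c in enumerate(cells) if c is MASK]
        if not masked:
            break  # block fully committed early
        # Random confidences; a real denoiser would produce these.
        conf = {i: rng.random() for i in masked}
        if step == max_steps:
            chosen = masked  # final step: commit everything left
        else:
            chosen = [i for i in masked if conf[i] >= gamma]
            if not chosen:
                # Guarantee progress: commit the most confident position.
                chosen = [max(masked, key=conf.get)]
        for i in chosen:
            cells[i] = f"tok{i}"
            commit_step[i] = step
    return commit_step

steps = denoise_block()
steps_used = max(steps)
print(f"{len(steps) / steps_used:.2f} tokens per denoiser step")
```

Lowering `gamma` in this toy lets more positions clear the threshold at each step, so the block finishes in fewer steps; that is the same lever the article describes for trading quality against throughput.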

Measurements

Evaluation uses BF16 on 2×H100 GPUs. The default operating point is confidence-based unmasking with threshold γ=0.8 and block size S=16. The table compares the AR baseline against TwoTower diffusion decoding.

Task	Nemotron-3-Nano-30B-A3B (AR)	Nemotron-Labs-TwoTower (diffusion)
MMLU (5-shot, acc)	78.56	78.24
MMLU-Pro (5-shot, CoT EM)	62.59	60.93
ARC-Challenge (25-shot, acc_norm)	91.72	92.66
WinoGrande (5-shot, acc)	76.09	76.09
RACE (0-shot, acc)	88.90	88.90
HumanEval (0-shot)	79.27	75.58
MBPP-Sanitized (3-shot)	74.71	74.28
GSM8K (8-shot, acc)	92.49	90.14
MATH-500 (4-shot)	84.40	80.60
Global-MMLU-Lite (5-shot)	73.97	73.94
MGSM (8-shot, avg acc)	80.80	80.40
Quality retained	100%	98.7%
Generation throughput (× AR)	1.0×	2.42×

General-knowledge scores stay close to the AR baseline: MMLU is within one point, while MMLU-Pro drops 1.66 points. Code and math show modest degradation. Commonsense and multilingual scores hold roughly steady or improve slightly. Lowering γ commits more tokens per step and raises throughput, at reduced quality.
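The aggregate figure can be cross-checked against the table. The article does not say how NVIDIA aggregates the tasks, so the sketch below assumes a plain average of per-task TwoTower/AR score ratios; under that assumption the result lands near the reported 98.7%.

```python
# (AR score, TwoTower score) per task, copied from the table above.
scores = {
    "MMLU": (78.56, 78.24),
    "MMLU-Pro": (62.59, 60.93),
    "ARC-Challenge": (91.72, 92.66),
    "WinoGrande": (76.09, 76.09),
    "RACE": (88.90, 88.90),
    "HumanEval": (79.27, 75.58),
    "MBPP-Sanitized": (74.71, 74.28),
    "GSM8K": (92.49, 90.14),
    "MATH-500": (84.40, 80.60),
    "Global-MMLU-Lite": (73.97, 73.94),
    "MGSM": (80.80, 80.40),
}

# Assumed aggregation: unweighted mean of per-task ratios.
ratios = {task: tt / ar for task, (ar, tt) in scores.items()}
retained = sum(ratios.values()) / len(ratios)

# Task with the largest relative drop.
worst = min(ratios, key=ratios.get)

print(f"quality retained: {retained:.1%}, largest relative drop: {worst}")
```

The per-task ratios also make the pattern in the paragraph above concrete: the largest relative drops are on the code and math benchmarks.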

Running It: Three Generation Modes

The checkpoint exposes three generation modes. Full two-tower diffusion uses 2 GPUs, about 59GB per GPU in BF16. AR-only mode runs on a single 80GB GPU.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, trust_remote_code=True,
)
# context tower -> GPU 0, denoiser tower -> GPU 1
model.place_towers_on_devices("cuda:0", "cuda:1")
model.eval()

prompt = "France is a country "
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

outputs = model.generate_mask_diffusion(
    inputs["input_ids"], max_new_tokens=128,
    block_size=16, steps_per_block=16, mask_token_id=3,
    temperature=0.1, confidence_threshold=0.8,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

The three methods are generate_mask_diffusion(), generate_mock_ar(), and generate_ar(). Mask diffusion can commit up to block_size tokens per step. Mock-AR and AR generate one token per step.

When It Fits: Use Cases

The clearest use case is high-throughput batch generation. A data team producing synthetic text can trade a small quality drop for throughput. At γ=0.8, that trade is 1.3% of quality for 2.42× speed.

A second use case is tuning the quality-speed trade-off. Raising γ preserves more quality, according to NVIDIA's paper. Lowering γ commits more tokens per step and runs faster.

The third use case is mixed AR and diffusion workflows. The context tower keeps its own LM head for speculative drafting, verification, or AR scoring. Teams can run AR and diffusion from a single checkpoint.

Strengths and Weaknesses

Strengths:

Open weights under the NVIDIA Nemotron Open Model license; ready for commercial use
98.7% of AR quality retained at 2.42× throughput at the default operating point
A single checkpoint supports diffusion, mock-AR, and AR decoding
Denoiser trained on ~2.1T tokens, not a full pretraining run
Cache memory scales with sequence length the same way as the AR baseline

Weaknesses:

A full two-tower deployment requires 2 GPUs and ~59GB per GPU in BF16
Code and math degrade more than general knowledge (HumanEval 79.27 → 75.58)
Keeping both towers resident raises the fixed weight memory
The released checkpoint is a base model, before any instruction tuning or alignment
Throughput past 3× comes with a significant loss of quality


nimda 3 hours ago
