NVIDIA Releases Nemotron-Labs-TwoTower Open-Weight Diffusion Language Model

Ethan Carter
16 hours ago

NVIDIA released Nemotron-Labs-TwoTower, a two-tower diffusion language model built on a frozen Nemotron-3-Nano-30B-A3B autoregressive backbone. The model retains 98.7 percent of the baseline's quality while delivering 2.42 times the generation throughput in BF16 on two H100 GPUs.

The announcement centers on a practical tradeoff. Instead of training an entirely new model, NVIDIA froze the original autoregressive tower and added a trainable denoiser tower. Layer-aligned cross-attention and state seeding let the two towers collaborate without retraining the entire 30-billion-parameter backbone. Total active parameters per token stay near 3 billion per tower even though the combined system reaches roughly 60 billion parameters.

Developers who need both high throughput and open weights now have a concrete option to test. The release comes at a moment when many teams face rising inference costs and seek alternatives that do not require changes to existing model checkpoints.

Model Architecture Uses Frozen Backbone Plus Separate Denoiser

The two-tower design splits responsibilities clearly. The context tower stays frozen at its original Nemotron-3-Nano-30B-A3B weights. Only the denoiser tower is trained, on about 2.1 trillion tokens. The towers exchange information through layer-wise cross-attention and a state-seeding step that passes hidden states from one tower to the other at selected layers.
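To make the layer-aligned idea concrete, here is a toy sketch in NumPy. It is not NVIDIA's implementation: the single-head attention, the tiny dimensions, the random weights, and the function names are all assumptions chosen for illustration, and the state-seeding step is left out because the article does not describe it in detail. The only point it shows is that denoiser layer i reads the hidden states of frozen context layer i.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(block_h, context_h, w_q, w_k, w_v):
    """Single-head cross-attention: block positions query the context states."""
    q = block_h @ w_q                       # (block_len, d_model)
    k = context_h @ w_k                     # (ctx_len, d_model)
    v = context_h @ w_v                     # (ctx_len, d_model)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v              # (block_len, d_model)

def denoiser_forward(context_states, block_h, denoiser_weights):
    """Layer i of the toy denoiser attends to layer i of the frozen tower.

    context_states: per-layer hidden states from the frozen context tower.
    They are only read here, never updated, which mirrors the frozen backbone.
    """
    h = block_h
    for ctx_h, (w_q, w_k, w_v) in zip(context_states, denoiser_weights):
        h = h + cross_attention(h, ctx_h, w_q, w_k, w_v)  # residual update
    return h

# Hypothetical toy sizes; the real model is far larger.
rng = np.random.default_rng(0)
n_layers, d_model, ctx_len, block_len = 4, 8, 32, 16

context_states = [rng.normal(size=(ctx_len, d_model)) for _ in range(n_layers)]
denoiser_weights = [
    tuple(rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
    for _ in range(n_layers)
]
block_h = rng.normal(size=(block_len, d_model))

out = denoiser_forward(context_states, block_h, denoiser_weights)
print(out.shape)  # (16, 8)
```

In this sketch only `denoiser_weights` would receive gradients during training, while `context_states` come from weights that never change.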

This separation matters because it avoids the cost of full pre-training. The backbone itself was trained on 25 trillion tokens before the diffusion project began. By keeping those weights untouched, NVIDIA limits total training compute while still producing a model that supports three decoding modes: full diffusion, autoregressive simulation inside the diffusion framework, and standard autoregressive decoding.

The reported tests use a block size of 16 tokens and a gamma value of 0.8. Under those exact settings, the model reaches the stated 2.42 times throughput gain while staying within 1.3 percent of the original autoregressive quality on the internal evaluation set.
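A small sketch shows what a 16-token block size means for a single response. The config class and function names are hypothetical, and gamma is carried only as an opaque setting because the article does not say what it controls. The number of denoising steps per block is likewise not stated, so the sketch counts blocks only.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class DiffusionDecodeConfig:
    """Hypothetical container for the two settings quoted in the article."""
    block_size: int = 16   # tokens per block, as reported
    gamma: float = 0.8     # reported value; its role is not specified

def num_blocks(num_tokens: int, block_size: int) -> int:
    """Blocks needed to cover num_tokens, with the last block possibly partial."""
    if num_tokens < 0 or block_size <= 0:
        raise ValueError("num_tokens must be >= 0 and block_size must be > 0")
    return math.ceil(num_tokens / block_size)

cfg = DiffusionDecodeConfig()
print(num_blocks(1000, cfg.block_size))  # 63
```

A 1,000-token response therefore spans 63 blocks, the last of which holds only 8 tokens.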

Throughput Gain Comes With a Measured Quality Tradeoff

The 2.42 times speedup figure is tied to a specific hardware and precision setup. On two H100 GPUs in BF16, the diffusion path processes blocks faster than the original autoregressive baseline. The quality retention of 98.7 percent is also measured on the same test suite that defined the baseline.
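The two headline figures translate into simple arithmetic. The sketch below assumes a hypothetical baseline of 1,000 tokens per second, which is not a number from the release; only the 2.42 times speedup and the 98.7 percent retention come from the announcement.

```python
SPEEDUP = 2.42                 # reported throughput multiple
QUALITY_RETENTION_PCT = 98.7   # reported quality retention

BASELINE_TOKENS_PER_S = 1000.0  # hypothetical, for illustration only
NUM_TOKENS = 1_000_000

def generation_time_s(num_tokens: int, tokens_per_s: float) -> float:
    """Wall-clock seconds to generate num_tokens at a steady throughput."""
    return num_tokens / tokens_per_s

baseline_time = generation_time_s(NUM_TOKENS, BASELINE_TOKENS_PER_S)
diffusion_time = generation_time_s(NUM_TOKENS, BASELINE_TOKENS_PER_S * SPEEDUP)

# The fraction of time saved depends only on the speedup: 1 - 1/2.42.
time_saved_fraction = 1 - diffusion_time / baseline_time
quality_gap_pct = 100.0 - QUALITY_RETENTION_PCT

print(f"time saved: {time_saved_fraction:.1%}, quality gap: {quality_gap_pct:.1f}%")
```

Whatever the real baseline throughput is, a 2.42 times speedup cuts generation time for a fixed token count by about 58.7 percent, in exchange for the 1.3 percent quality gap.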

Teams that prioritize raw speed over the last 1.3 percent of quality now have a documented path. Those who need that remaining accuracy can still fall back to the pure autoregressive mode the backbone already supports. The three-mode design therefore gives operators a single checkpoint that can serve both latency-sensitive and quality-sensitive workloads.

The denoiser training corpus of 2.1 trillion tokens is about 8.4 percent of the 25 trillion tokens used to pre-train the backbone. NVIDIA's results suggest that this is enough to adapt the model to diffusion dynamics without overwriting the backbone's knowledge. Whether the same ratio holds at larger backbone sizes remains an open question for future releases.

Three Decoding Modes Let Users Choose Speed or Fidelity

The model ships with explicit support for diffusion decoding, simulated autoregressive decoding, and direct autoregressive decoding. Operators can therefore route different requests through different paths inside the same deployment.

Diffusion mode uses the trained denoiser tower and produces the highest throughput. Simulated autoregressive mode runs the diffusion process in a way that mimics token-by-token generation. Direct autoregressive mode bypasses the denoiser entirely and uses only the frozen context tower.
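One way to picture the routing is a small dispatcher. The enum, the function names, and the policy itself are hypothetical; the release does not define a routing API, and treating simulated autoregressive mode as the default is purely an assumption made for this sketch.

```python
from enum import Enum

class DecodeMode(Enum):
    DIFFUSION = "diffusion"        # denoiser tower, highest throughput
    SIMULATED_AR = "simulated_ar"  # diffusion process mimicking token-by-token output
    DIRECT_AR = "direct_ar"        # frozen context tower only

def choose_mode(throughput_priority: bool, needs_baseline_quality: bool) -> DecodeMode:
    """Hypothetical routing policy, not part of the release."""
    if needs_baseline_quality:
        return DecodeMode.DIRECT_AR
    if throughput_priority:
        return DecodeMode.DIFFUSION
    return DecodeMode.SIMULATED_AR  # assumed default for everything else

def uses_denoiser(mode: DecodeMode) -> bool:
    """Direct autoregressive decoding bypasses the denoiser tower."""
    return mode is not DecodeMode.DIRECT_AR

for flags in [(True, False), (False, True), (False, False)]:
    mode = choose_mode(*flags)
    print(flags, mode.value, uses_denoiser(mode))
```

A real deployment would likely base the decision on measured latency and quality targets rather than two boolean flags.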

This flexibility reduces the need to maintain multiple separate models. A single set of weights can now handle both high-volume generation and lower-volume tasks that require the original quality ceiling.

Open Weights Lower the Barrier for Inference Research

NVIDIA made the weights available under an open-weight license. Researchers can therefore inspect the denoiser architecture, modify the cross-attention layers, or experiment with new block sizes without starting from random initialization.

The release also supplies the exact training token counts and the hardware configuration used for the reported numbers. Reproducibility at this level lets independent groups verify the 98.7 percent retention claim and test whether the 2.42 times speedup generalizes to other hardware.

Comparable two-tower experiments have appeared in smaller academic models, yet few reached the 30-billion-parameter scale with public weights. The combination of scale, documented training data volume, and open release gives the community a shared reference point.

Remaining Questions Focus on Scaling and Long-Context Behavior

The current numbers cover a single backbone size and a fixed block configuration. Larger backbones or longer contexts could change the throughput-to-quality curve. No public data yet shows how the same architecture behaves at 100 billion or 400 billion parameters.

Another open variable is multi-turn conversation length. The reported evaluation used standard benchmarks, but real deployments often involve longer histories. Whether the state-seeding mechanism maintains coherence across thousands of tokens requires further measurement.

Finally, the three decoding modes may need different serving stacks. Operators will have to decide how to expose each mode through APIs and whether to cache intermediate diffusion states. These engineering choices sit outside the model release itself.

Next Milestones to Watch Are Throughput Curves and Independent Benchmarks

Three signals will show whether the approach gains traction. First, any follow-up release that applies the same two-tower recipe to a larger Nemotron backbone will test whether the speedup scales. Second, independent labs publishing throughput and quality numbers on public leaderboards will confirm or adjust the 2.42 times and 98.7 percent figures. Third, adoption inside production inference services will reveal whether the three-mode design reduces overall serving cost enough to offset the added complexity.

Teams that track these three signals will know early whether frozen-backbone diffusion models become a standard pattern or remain a specialized option.