What Is Nemotron 3 Ultra
Nemotron 3 Ultra is NVIDIA's answer to a specific problem: frontier-class AI models are too slow and too expensive for long-running agent workflows. An agent that plans, calls tools, corrects errors, and validates results across dozens of sequential steps burns through compute at a rate that makes most frontier models economically impractical at scale.
NVIDIA's approach is a 550B-total-parameter Mixture-of-Experts (MoE) architecture in which only 55B parameters, about 10% of the total, are active per token. This sparse activation aims to provide the capacity of a much larger model while keeping per-token compute closer to that of a 55B-parameter model. The result, according to NVIDIA's benchmarks, is 5x higher throughput and 30% lower cost per agentic task completion versus comparable open models.
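To make the sparse-activation idea concrete, here is a minimal sketch of top-k expert routing in plain Python. The expert count, the value of `k`, and the router scores are hypothetical illustrations. The source states only the 550B total and 55B active parameter counts, not how Nemotron 3 Ultra actually routes tokens.

```python
import math

TOTAL_PARAMS_B = 550   # total parameters in billions, from the text above
ACTIVE_PARAMS_B = 55   # parameters active per token in billions, from the text above


def active_fraction(active_b, total_b):
    """Share of the model's weights that take part in one token's forward pass."""
    return active_b / total_b


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights.

    Generic MoE routing sketch (an assumption), not NVIDIA's actual router.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)   # subtract the max for stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


if __name__ == "__main__":
    share = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
    print(f"active share per token: {share:.0%}")
    # Hypothetical router scores for 4 experts; only 2 of them run for this token.
    print(top_k_route([0.1, 2.0, -1.0, 1.5], k=2))
```

The experts that are not selected contribute no compute for that token, which is where the gap between total and active parameters comes from.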
The model is entirely open (weights, training data, and recipes) and is available now through NVIDIA NIM microservices and on Hugging Face.
Four Architectural Innovations
Hybrid Mamba-Transformer layers. Mamba layers handle long-context efficiency: they process streaming information without quadratic attention overhead, which matters when agents need to track long task histories. Transformer layers preserve precise factual recall when the agent needs to retrieve specific information from a large context window. The combination avoids the typical trade-off between the two.
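The efficiency argument can be illustrated with a rough operation count. The sketch below contrasts a causal attention pass, where each token is compared against itself and every earlier token, with a recurrent update that carries a fixed-size state. The scalar recurrence is a deliberately simplified stand-in for a state-space layer (real Mamba layers use input-dependent, multi-dimensional state), and the coefficients `decay` and `gain` are hypothetical.

```python
def causal_attention_pairs(seq_len):
    """Token pairs scored by causal self-attention: grows quadratically."""
    return seq_len * (seq_len + 1) // 2


def recurrent_steps(seq_len):
    """State updates made by a recurrent scan: grows linearly."""
    return seq_len


def linear_scan(inputs, decay=0.9, gain=0.1):
    """Simplified scalar recurrence h_t = decay * h_{t-1} + gain * x_t.

    Each step touches only the previous state, never the full history.
    """
    state = 0.0
    outputs = []
    for x in inputs:
        state = decay * state + gain * x
        outputs.append(state)
    return outputs


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        ratio = causal_attention_pairs(n) / recurrent_steps(n)
        print(f"{n:>7} tokens: attention does ~{ratio:,.0f}x more pairwise work")
    print(linear_scan([1.0, 1.0, 1.0]))
```

The trade-off is that the recurrent state is a lossy summary of the history, which is consistent with the text's point that attention layers are retained for precise recall.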
NVFP4 quantization. A single model checkpoint runs on NVIDIA Hopper, Blackwell, and Ampere GPU architectures. This cross-generation compatibility removes the infrastructure headache of maintaining different model versions for different hardware generations. On Blackwell, NVFP4 delivers up to 5x higher throughput compared to BF16.
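As a back-of-the-envelope illustration of why a 4-bit format matters at this scale, the following sketch estimates raw weight storage. It assumes every one of the 550B parameters is stored at the stated width and ignores scale factors, activations, and the KV cache, so real checkpoint sizes will differ.

```python
def weight_storage_gb(num_params, bits_per_param):
    """Raw weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS = 550e9  # total parameter count from the text above

bf16_gb = weight_storage_gb(PARAMS, 16)  # 16 bits per weight
fp4_gb = weight_storage_gb(PARAMS, 4)    # 4 bits per weight, overheads ignored

if __name__ == "__main__":
    print(f"BF16 weights:  {bf16_gb:.0f} GB")
    print(f"4-bit weights: {fp4_gb:.0f} GB")
    print(f"storage ratio: {bf16_gb / fp4_gb:.0f}x")
```

This covers storage only. The throughput figure quoted above is NVIDIA's own measurement and does not follow from this arithmetic alone.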
LatentMoE. Improves expert routing efficiency across the diverse sub-tasks that agentic workflows require — reasoning, code generation, tool calls, and domain-specific logic often appear in the same session.
Multi-token prediction (MTP). Predicts multiple future tokens in a single forward pass, reducing generation time for long outputs and multi-turn conversations.
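A short sketch of why predicting several tokens per forward pass shortens generation. The source does not say how many tokens Nemotron 3 Ultra predicts per pass or whether the extra tokens are verified before being kept, so both `tokens_per_pass` and the acceptance model below are assumptions. The second function uses a common simplification in which each extra token is accepted independently with probability `accept_prob`, and a token is kept only if all tokens before it were accepted.

```python
import math


def forward_passes(num_tokens, tokens_per_pass):
    """Forward passes needed if every pass yields tokens_per_pass usable tokens."""
    return math.ceil(num_tokens / tokens_per_pass)


def expected_tokens_per_pass(extra_tokens, accept_prob):
    """Expected tokens kept per pass under the independent-acceptance assumption.

    One token is always produced; the i-th extra token survives only if
    all i extra tokens up to and including it are accepted.
    """
    return sum(accept_prob ** i for i in range(extra_tokens + 1))


if __name__ == "__main__":
    output_len = 1_000  # hypothetical response length in tokens
    for k in (1, 2, 4):
        print(f"{k} token(s) per pass -> {forward_passes(output_len, k)} passes")
    print(expected_tokens_per_pass(extra_tokens=3, accept_prob=0.5))
```

The first function gives the best case. The second shows that the practical gain depends on how often the extra predictions are good enough to keep.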
How It Was Trained: Multi-Teacher On-Policy Distillation
Nemotron 3 Ultra introduces a training method called MOPD (Multi-Teacher On-Policy Distillation). More than 10 domain-specific teacher models are trained, each with its own specialized pipeline. During training, the student model generates its own attempts at tasks, and each teacher scores the attempts that fall within its area of expertise. This co-evolution of student and teachers enables continuous improvement across domains without requiring distillation from third-party models.
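The data flow described above can be sketched in a few lines. This is a toy illustration of the on-policy pattern only: the student produces its own attempt, and the teacher for the matching domain scores it. All names (`Task`, `Feedback`, `collect_feedback`, the toy student and teachers) are hypothetical, the scoring rules are placeholders, and the step that would update the student from these scores is omitted because the source does not describe it.

```python
from typing import Callable, Dict, List, NamedTuple


class Task(NamedTuple):
    domain: str
    prompt: str


class Feedback(NamedTuple):
    domain: str
    prompt: str
    attempt: str
    score: float


Student = Callable[[str], str]
Teacher = Callable[[str, str], float]


def collect_feedback(student: Student,
                     teachers: Dict[str, Teacher],
                     tasks: List[Task]) -> List[Feedback]:
    """Score the student's own attempts with the teacher for each task's domain."""
    results = []
    for task in tasks:
        attempt = student(task.prompt)      # on-policy: the student's own output
        teacher = teachers[task.domain]     # only the matching domain teacher scores
        score = teacher(task.prompt, attempt)
        results.append(Feedback(task.domain, task.prompt, attempt, score))
    return results


# Toy stand-ins so the sketch runs end to end (placeholders, not real models).
def toy_student(prompt: str) -> str:
    return f"answer: {prompt.lower()}"


def legal_teacher(prompt: str, attempt: str) -> float:
    return 1.0 if "contract" in attempt else 0.0


def code_teacher(prompt: str, attempt: str) -> float:
    return 1.0 if "def " in attempt else 0.0


if __name__ == "__main__":
    teachers = {"legal": legal_teacher, "code": code_teacher}
    tasks = [Task("legal", "Summarize the CONTRACT"),
             Task("code", "Write a sort function")]
    for fb in collect_feedback(toy_student, teachers, tasks):
        print(fb.domain, fb.score)
```

The point of the pattern is that the teachers grade what the student actually produces, rather than the student imitating teacher outputs it might never have generated itself.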
The pretraining foundation is 10 trillion tokens. Three domain gaps were specifically targeted:
- Legal: 4B tokens of synthetic legal data — LegalBench average improved from 64.6% to 74.7%
- Knowledge: 35B synthesized Wikipedia-based tokens — SimpleQA improved from 40.2% to 50.2%
- Code: 173B refreshed GitHub tokens through September 30, 2025
The release also includes 10M new SFT samples, 1M new RL tasks, and 15 new RL environments.
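For scale, the following sketch puts the three targeted datasets next to the 10-trillion-token pretraining figure and computes the benchmark gains quoted above. The source does not say whether the targeted tokens are counted inside the 10 trillion, so the percentages are size comparisons only, not a statement about the training mix.

```python
PRETRAIN_TOKENS = 10e12  # 10 trillion tokens, from the text above

# Targeted dataset sizes in tokens, from the list above.
targeted_tokens = {"legal": 4e9, "knowledge": 35e9, "code": 173e9}

# (before, after) benchmark scores in percent, from the list above.
benchmark_scores = {"LegalBench": (64.6, 74.7), "SimpleQA": (40.2, 50.2)}


def size_relative_to_pretraining(tokens, total=PRETRAIN_TOKENS):
    """Dataset size as a fraction of the pretraining token count."""
    return tokens / total


def point_gain(before, after):
    """Absolute improvement in percentage points, rounded to one decimal."""
    return round(after - before, 1)


if __name__ == "__main__":
    for name, tokens in targeted_tokens.items():
        print(f"{name}: {size_relative_to_pretraining(tokens):.2%} of 10T")
    for name, (before, after) in benchmark_scores.items():
        print(f"{name}: +{point_gain(before, after)} points")
```

Both reported gains are roughly ten percentage points, from datasets that are each well under 1% of the size of the pretraining corpus.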
Benchmark Results
| Benchmark | Nemotron 3 Ultra | What It Measures |
|---|---|---|
| SWE-Bench Verified | 65–70.4% | Real-world software engineering on GitHub issues |
| Claw-Eval (pass@3) | 76.9% | Autonomous agent execution capability |
| GDPval | 72.9% | Real-world office and knowledge-work delivery |
| GPQA Diamond | Top tier | Advanced knowledge and complex reasoning |
Across the Pi, OpenHands, Hermes, OpenCode, and Mini SWE Agent harnesses, SWE-Bench Verified scores stay between 65% and 70.4%. This consistency across frameworks is practically significant: teams don't need to retune their deployment for a particular agent framework to preserve model quality.
Companion Models
Two additional models launch alongside Nemotron 3 Ultra:
Nemotron 3.5 Content Safety — A 4B-parameter guardrail model covering 23 safety categories across text, images, and combined inputs in 12 languages. Designed for use as an inference-time safety filter, an LLM safety judge, or a foundation for domain-specific safety policy training.
Nemotron 3.5 ASR — A multilingual streaming speech recognition model supporting 40+ languages in a single checkpoint. Sub-100ms latency enables natural real-time voice orchestration for agent swarms. The English predecessor already powers voice input in Microsoft GitHub Copilot CLI, used by more than 20 million developers.
Key Takeaways
- Nemotron 3 Ultra is a fully open 550B MoE model with 55B active parameters, delivering 5x higher throughput than comparable open models, according to NVIDIA's benchmarks.
- 30% lower cost per agentic task completion, validated on SWE-Bench and Terminal-Bench 2.0.
- Single NVFP4 checkpoint runs on Hopper, Blackwell, and Ampere — no per-generation model variants needed.
- Fully open: weights, data, and recipes are released under the permissive OpenMDW-1.1 license, which allows commercial fine-tuning.
- Companion releases: Nemotron 3.5 Content Safety (4B guardrail, 12 languages) and Nemotron 3.5 ASR (40+ language streaming speech).
Why This Matters
NVIDIA is establishing itself not just as the dominant hardware vendor but as a major open model provider. Nemotron 3 Ultra's combination of full openness, cross-generation hardware compatibility, and genuine agentic efficiency creates a credible alternative to closed API models for enterprise deployment.
For AI teams evaluating whether to build on closed frontier APIs or self-hosted open models, the efficiency and cost data from Nemotron 3 Ultra shift the calculus meaningfully. A 30% cost reduction and a 5x throughput gain are not marginal improvements; they change the economics of running production agentic systems at scale.
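To see what those two figures mean for a budget, here is a small worked example. The baseline cost per task, the monthly task volume, and the baseline processing time are hypothetical inputs chosen for illustration. Only the 30% cost reduction and the 5x throughput gain come from the figures NVIDIA reports, and the example assumes they apply uniformly to the whole workload.

```python
COST_REDUCTION = 0.30    # 30% lower cost per agentic task (NVIDIA's reported figure)
THROUGHPUT_GAIN = 5.0    # 5x higher throughput (NVIDIA's reported figure)


def monthly_cost(tasks_per_month, cost_per_task):
    """Total monthly spend for a given task volume."""
    return tasks_per_month * cost_per_task


def reduced_cost_per_task(baseline_cost, reduction=COST_REDUCTION):
    """Cost per task after applying the fractional reduction."""
    return baseline_cost * (1.0 - reduction)


def hours_at_higher_throughput(baseline_hours, gain=THROUGHPUT_GAIN):
    """Time for the same workload if throughput rises by the given factor."""
    return baseline_hours / gain


if __name__ == "__main__":
    tasks = 100_000          # hypothetical monthly task volume
    baseline_cost = 2.00     # hypothetical baseline cost per task, in dollars
    baseline_hours = 500.0   # hypothetical baseline processing time, in hours

    before = monthly_cost(tasks, baseline_cost)
    after = monthly_cost(tasks, reduced_cost_per_task(baseline_cost))
    print(f"monthly cost: ${before:,.0f} -> ${after:,.0f}")
    print(f"processing time: {baseline_hours:.0f} h -> "
          f"{hours_at_higher_throughput(baseline_hours):.0f} h")
```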
The broader signal is that the open model ecosystem is maturing fast enough to challenge closed models on the metrics that enterprise buyers care most about: cost, performance, and control.