Built for the New Reality of Agentic AI
AI is no longer just something you ask a question to — it is an agent that works on your behalf, planning, coding, testing, and iterating across long sessions. This new paradigm demands models that are not just accurate, but fast and cost-efficient across hundreds of turns. NVIDIA Nemotron 3 Ultra is purpose-built for this world.
- Total parameters: 550B (active parameters: 55B)
- NVFP4 inference speed on Blackwell: up to 5× faster than BF16
- Agentic task cost reduction: up to 30%
- GPU support: Hopper, Blackwell, Ampere (single checkpoint)
- License: Linux Foundation OpenMDW-1.1 (fully open)
Four Architectural Innovations Behind the Performance
| Innovation | What It Does |
|---|---|
| Hybrid Mamba-Transformer | Mamba layers handle long contexts efficiently where pure Transformers struggle |
| LatentMoE | 4× more experts at the same inference cost via improved routing |
| NVFP4 quantization | Runs one checkpoint across Hopper, Blackwell, and Ampere GPUs |
| Multi-token prediction | Faster generative speed, especially in multi-turn agent tasks |
Hybrid Mamba-Transformer: Standard Transformer architectures are memory- and compute-heavy over very long sequences. By mixing Mamba layers with Transformer layers, Nemotron 3 Ultra handles the extended context windows that coding and research agents typically require without the usual performance degradation.
LatentMoE: Traditional MoE models route tokens to a fixed set of experts. LatentMoE enables four times as many experts at the same inference cost, increasing effective model capacity without a proportional increase in compute. This directly supports the Nemotron team's design philosophy: faster models are smarter models, because faster training and inference enable more experience during post-training.
NVFP4 across all NVIDIA GPU architectures: A single NVFP4 checkpoint runs on Hopper, Blackwell, and Ampere GPUs — enterprises do not need to maintain separate weights for different hardware generations. On Blackwell, NVFP4 delivers up to 5× higher throughput per GPU compared to BF16 at equivalent interactivity levels.
Nemotron 3 Ultra integrates directly with NVIDIA's Hermes agentic harness and OpenCode. It is packaged as an NVIDIA NIM microservice and can be deployed on-premises, in cloud, or at the edge. Download BF16 or NVFP4 weights from HuggingFace, try it instantly on build.nvidia.com, or explore it via OpenRouter and Perplexity Pro. Cookbooks and training recipes are available on GitHub.
Multi-Teacher Distillation and the Nemotron Coalition
Rather than training in isolation, NVIDIA used Multi-Teacher On-Policy Distillation — gathering dense feedback from over ten domain-specific teacher models throughout training. This is done in partnership with the Nemotron Coalition, a community of companies that contribute to improving the model before it is released.
The result is strong performance across diverse domains: coding agents can plan, code, test, and debug end-to-end; research agents can search, evaluate, and synthesize across hundreds of sources. On SWE-bench and Terminal bench 2.0, Nemotron 3 Ultra completed benchmarks using fewer total tokens and fewer tokens per turn than comparable models — translating directly to a 30% reduction in cost for agentic task completion.
Immediately available on Perplexity Pro (subscription or API), OpenRouter, Anaconda, and build.nvidia.com. Download weights from HuggingFace. Deploy anywhere using NVIDIA NIM microservices. Full training recipes, cookbooks, and documentation are available on GitHub and the NVIDIA Developer portal.
Fully Open Under OpenMDW-1.1
Nemotron 3 Ultra is fully open — weights, training data, and recipes. NVIDIA is moving Nemotron releases to the Linux Foundation's OpenMDW-1.1 license, which covers the full set of model materials (architecture, parameters, documentation, software, and related artifacts) under a single permissive framework. This makes Nemotron 3 Ultra one of the most comprehensively open frontier-class models available, enabling enterprises to fine-tune and deploy on their own infrastructure without licensing complexity.
- 550B total / 55B active MoE architecture optimized for long-running agentic workflows
- NVFP4 enables up to 5× faster inference on Blackwell vs. BF16, single checkpoint across GPU generations
- Up to 30% lower cost per agentic task completion, verified on SWE-bench and Terminal bench 2.0
- Fully open: weights, data, and recipes under Linux Foundation OpenMDW-1.1
- Native integration with Hermes agentic harness and OpenCode; available on OpenRouter, Perplexity, and HuggingFace