What Is Nemotron 3 Ultra?
NVIDIA Nemotron 3 Ultra is the flagship of the Nemotron 3 family and NVIDIA's most capable open model to date. Unlike models optimized for single-turn chat, Nemotron 3 Ultra is explicitly built for long-running agentic AI — workflows where a model plans, calls tools, reads observations, delegates to sub-agents, validates outputs, and recovers from errors across many turns over extended periods.
Typical use cases: coding agents that plan, write, test, and debug across large codebases over hours; research agents that search, cross-reference, and synthesize findings from hundreds of sources.
- 🧠 550B total parameters (55B active, Sparse MoE)
- ⚡ 5x faster inference throughput vs. comparable open models
- 💰 30% lower cost per agentic task (SWE-bench / Terminal bench 2.0)
- 📏 1M token context (Ruler@1M: 95% accuracy)
- 📂 Fully open: weights + data + recipes all released
Four Architectural Innovations
1. Hybrid Mamba-Transformer Layers
Mamba layers handle long-context efficiency — critical for agentic tasks that accumulate hundreds of tool calls and observations over time. Transformer layers preserve precise fact retrieval from within large context windows. The hybrid combines the best of both architectures.
2. LatentMoE Expert Routing
LatentMoE allows Ultra to activate four times more experts at the same inference cost compared to standard MoE routing. This enables the model to fluidly handle diverse workflows spanning reasoning, code generation, tool use, and domain-specific logic within a single pipeline.
3. NVFP4 Quantization
A single NVFP4 checkpoint runs natively across NVIDIA Hopper, Blackwell, and Ampere GPUs. On Blackwell, NVFP4 delivers up to 5x higher throughput per GPU compared to BF16 at equivalent interactivity — with no per-architecture recompilation needed.
4. Multi-Token Prediction (MTP)
MTP predicts multiple future tokens in a single forward pass, reducing generation latency for long outputs. This is especially impactful in multi-turn agentic workflows where the model produces many sequential planning steps.
Try Nemotron 3 Ultra immediately: Perplexity Pro (API + UI), OpenRouter / Together AI / DeepInfra (inference platforms), Hugging Face (download weights), NVIDIA NIM microservice (on-premises deployment). Enterprise cloud options include AWS JumpStart, Amazon Bedrock, and Baseten. Cookbooks and tutorials are available via the NVIDIA developer blog.
Multi-Teacher On-Policy Distillation (MOPD)
The post-training methodology is as novel as the architecture. MOPD trains Ultra using feedback from 10+ specialized teacher models simultaneously. Each teacher is itself a domain-specific model trained on its own pipeline. As Ultra generates attempts, teachers score the outputs in their area of expertise. This co-evolution between student and teachers drives continuous improvement across domains.
| Training Data | Scale |
|---|---|
| Pre-training tokens | 10T + 212B new domain-targeted tokens |
| Cumulative SFT samples | 50M |
| Cumulative RL tasks | 2M |
| Cumulative RL environments | 55 |
| New SFT samples this release | 10M |
| New RL environments this release | 15 |
Nemotron 3 Ultra ships under the Linux Foundation's OpenMDW-1.1 (Model, Data, Weights) license — a new permissive license purpose-built for open AI distributions. It covers the full set of model materials under a single framework. Enterprises can fine-tune for domain-specific workflows and deploy commercially anywhere.
The Agentic Ecosystem: Hermes, NemoClaw, OpenShell
Ultra is designed to work within an ecosystem rather than in isolation. NVIDIA released three companion tools:
- Hermes Agent — an agentic harness providing planning loops, tool calls, and memory management
- NemoClaw — an open-source orchestration layer that connects Hermes and OpenCode with Ultra as the inference engine
- NVIDIA OpenShell — a secure sandboxed runtime that isolates agent-generated code and enforces governance policies
Together these enable end-to-end agentic workflows: a coding agent in OpenCode can iterate on a codebase for hours inside OpenShell, with NemoClaw managing orchestration and Ultra providing reasoning.
- 550B total / 55B active Hybrid Mamba-Transformer MoE — designed for long-running agent workflows
- 5x faster throughput and 30% lower cost vs. comparable open models on agentic benchmarks
- 1M token context (Ruler@1M: 95%), NVFP4 runs on Hopper/Blackwell/Ampere from one checkpoint
- Fully open: weights, data, recipes — commercially usable under OpenMDW-1.1
- Native integration with Hermes, NemoClaw, and OpenShell for production agentic deployment