Built for the Agentic Era
NVIDIA's framing for Nemotron 3 Ultra is unambiguous: "AI is no longer just a thing you ask a question to. Now it's an agent that works on your behalf." Coding agents plan, write, test, debug, and iterate across large codebases. Research agents search, evaluate, cross-reference, and synthesize across hundreds of sources. These workflows run for hours — and Nemotron 3 Ultra was designed from scratch to handle them efficiently.
Technical Architecture
Hybrid Mamba-Transformer Layers
The core architectural innovation is a hybrid of Transformer and Mamba (SSM) layers. State space models handle long-context sequences with significantly lower memory and compute than pure attention, making Nemotron 3 Ultra practical for agentic sessions that span millions of tokens.
LatentMoE for Expert Routing
LatentMoE enables four times as many experts to be available at the same inference cost as conventional MoE routing. The result: more domain-specialized capacity without sacrificing speed.
NVFP4 Cross-Architecture Quantization
A single NVFP4 checkpoint runs on NVIDIA Hopper, Blackwell, and Ampere GPUs. On Blackwell, it delivers up to 5x higher throughput versus BF16 at equivalent interactivity levels.
Multi-Teacher On-Policy Distillation
Nemotron 3 Ultra is trained with dense feedback from over ten domain-specific teacher models. The full data pipeline, training recipes, and weights are released openly, enabling fine-tuning for any specialized domain.
Nemotron 3 Ultra is available today on Perplexity Pro, OpenRouter, Anaconda, and build.nvidia.com. It is packaged as an NVIDIA NIM microservice for cloud, on-premises, or edge deployment.
Benchmark Performance
On SWE-bench and Terminal Bench 2.0, Nemotron 3 Ultra completed agentic benchmarks using fewer total tokens and fewer tokens per turn than comparable models.
| Attribute | Nemotron 3 Ultra | Comparable Open Models |
|---|---|---|
| Inference speed | 5× faster | Baseline |
| Agentic task cost | 30% lower | Baseline |
| GPU compatibility | Hopper, Blackwell, Ampere | Varies |
| License | Fully open (OpenMDW-1.1) | Mostly restricted |
| Weights open | ✅ | Mostly ❌ |
The Linux Foundation's permissive license purpose-built for AI model distributions. It covers architecture, parameters, documentation, software, and related artifacts under one framework, and permits commercial use and redistribution after fine-tuning.
Agent Framework Integration
Nemotron 3 Ultra integrates natively with NVIDIA's NemoClaw secure runtime and Hermes agent harness. Swapping in the model in OpenCode or Hermes requires a single-line JSON config change. NVIDIA also released cookbooks and Hugging Face model cards to help teams start in minutes.
Rather than building the model in isolation, NVIDIA formed the Nemotron Coalition — a group of partner companies that jointly contribute data and evaluations before each release. Nemotron 4 is already in development.
- Nemotron 3 Ultra is a 550B/55B hybrid Mamba-Transformer MoE model optimized for long-running agentic tasks.
- 5x faster inference and 30% lower cost versus comparable models on agentic benchmarks.
- NVFP4 quantization runs one checkpoint across Hopper, Blackwell, and Ampere GPUs.
- Fully open: weights, data, and training recipes released under OpenMDW-1.1.
- Available now on Perplexity Pro, OpenRouter, build.nvidia.com, and as an NVIDIA NIM microservice.