NVIDIA Launches Nemotron 3 Ultra: A Fully Open 550B Model Built for Long-Running AI Agents

NVIDIA has released Nemotron 3 Ultra, a fully open 550B-parameter Mixture-of-Experts model optimized for agentic AI workflows. It delivers 5x faster throughput and 30% lower cost than comparable open models, with weights, data, and training recipes all publicly released under the Linux Foundation's OpenMDW-1.1 license.

TL;DR — NVIDIA released Nemotron 3 Ultra on June 4, 2026. It's a 550B-parameter (55B active) Hybrid Mamba-Transformer MoE model designed specifically for long-running agentic workflows — not single-turn chat. It achieves 5x faster inference throughput and 30% lower task cost than comparable open models, with full open release of weights, data, and recipes under the Linux Foundation's OpenMDW-1.1 license.

What Is Nemotron 3 Ultra?

NVIDIA Nemotron 3 Ultra is the flagship of the Nemotron 3 family and NVIDIA's most capable open model to date. Unlike models optimized for single-turn chat, Nemotron 3 Ultra is explicitly built for long-running agentic AI — workflows where a model plans, calls tools, reads observations, delegates to sub-agents, validates outputs, and recovers from errors across many turns over extended periods.

Typical use cases: coding agents that plan, write, test, and debug across large codebases over hours; research agents that search, cross-reference, and synthesize findings from hundreds of sources.

Key Numbers

🧠 550B total parameters (55B active, Sparse MoE)
⚡ 5x faster inference throughput vs. comparable open models
💰 30% lower cost per agentic task (SWE-bench / Terminal bench 2.0)
📏 1M token context (Ruler@1M: 95% accuracy)
📂 Fully open: weights + data + recipes all released

Four Architectural Innovations

1. Hybrid Mamba-Transformer Layers

Mamba layers handle long-context efficiency — critical for agentic tasks that accumulate hundreds of tool calls and observations over time. Transformer layers preserve precise fact retrieval from within large context windows. The hybrid combines the best of both architectures.

2. LatentMoE Expert Routing

LatentMoE allows Ultra to activate four times more experts at the same inference cost compared to standard MoE routing. This enables the model to fluidly handle diverse workflows spanning reasoning, code generation, tool use, and domain-specific logic within a single pipeline.

3. NVFP4 Quantization

A single NVFP4 checkpoint runs natively across NVIDIA Hopper, Blackwell, and Ampere GPUs. On Blackwell, NVFP4 delivers up to 5x higher throughput per GPU compared to BF16 at equivalent interactivity — with no per-architecture recompilation needed.

4. Multi-Token Prediction (MTP)

MTP predicts multiple future tokens in a single forward pass, reducing generation latency for long outputs. This is especially impactful in multi-turn agentic workflows where the model produces many sequential planning steps.

💡

How to Get Started
Try Nemotron 3 Ultra immediately: Perplexity Pro (API + UI), OpenRouter / Together AI / DeepInfra (inference platforms), Hugging Face (download weights), NVIDIA NIM microservice (on-premises deployment). Enterprise cloud options include AWS JumpStart, Amazon Bedrock, and Baseten. Cookbooks and tutorials are available via the NVIDIA developer blog.

Multi-Teacher On-Policy Distillation (MOPD)

The post-training methodology is as novel as the architecture. MOPD trains Ultra using feedback from 10+ specialized teacher models simultaneously. Each teacher is itself a domain-specific model trained on its own pipeline. As Ultra generates attempts, teachers score the outputs in their area of expertise. This co-evolution between student and teachers drives continuous improvement across domains.

Training Data	Scale
Pre-training tokens	10T + 212B new domain-targeted tokens
Cumulative SFT samples	50M
Cumulative RL tasks	2M
Cumulative RL environments	55
New SFT samples this release	10M
New RL environments this release	15

ℹ️

OpenMDW-1.1 License
Nemotron 3 Ultra ships under the Linux Foundation's OpenMDW-1.1 (Model, Data, Weights) license — a new permissive license purpose-built for open AI distributions. It covers the full set of model materials under a single framework. Enterprises can fine-tune for domain-specific workflows and deploy commercially anywhere.

The Agentic Ecosystem: Hermes, NemoClaw, OpenShell

Ultra is designed to work within an ecosystem rather than in isolation. NVIDIA released three companion tools:

Hermes Agent — an agentic harness providing planning loops, tool calls, and memory management
NemoClaw — an open-source orchestration layer that connects Hermes and OpenCode with Ultra as the inference engine
NVIDIA OpenShell — a secure sandboxed runtime that isolates agent-generated code and enforces governance policies

Together these enable end-to-end agentic workflows: a coding agent in OpenCode can iterate on a codebase for hours inside OpenShell, with NemoClaw managing orchestration and Ultra providing reasoning.

Key Takeaways

550B total / 55B active Hybrid Mamba-Transformer MoE — designed for long-running agent workflows
5x faster throughput and 30% lower cost vs. comparable open models on agentic benchmarks
1M token context (Ruler@1M: 95%), NVFP4 runs on Hopper/Blackwell/Ampere from one checkpoint
Fully open: weights, data, recipes — commercially usable under OpenMDW-1.1
Native integration with Hermes, NemoClaw, and OpenShell for production agentic deployment

🔗

Official Sources & Download Links
— NVIDIA Technical Blog: Nemotron 3 Ultra Official Announcement
— NVIDIA Research: Nemotron 3 Family Papers & Resources
— Hugging Face: Download Nemotron 3 Ultra Weights
— build.nvidia.com: Try via API & NIM Microservice