A New Frontier for Open Agentic AI
NVIDIA completed its Nemotron 3 model family on June 4, 2026 with the release of Nemotron 3 Ultra — a 550B-parameter open language model designed from the ground up for the demands of real-world agentic AI systems. Unlike general-purpose models scaled up after the fact, Nemotron 3 Ultra was trained with an agent-focused post-training pipeline that prioritizes multi-step planning, tool use, code generation, long-context reasoning, and workflow orchestration.
The defining characteristic of Nemotron 3 Ultra is complete openness: NVIDIA releases the base model, post-trained checkpoint, NVFP4-quantized checkpoint, training data, and training recipe — all publicly available on Hugging Face under the Linux Foundation's new OpenMDW-1.1 license.
The Architecture: LatentMoE Hybrid Mamba-Transformer
The performance gains in Nemotron 3 Ultra come from a novel LatentMoE architecture that combines three components:
Mamba-2 SSM layers: State Space Models dramatically reduce KV cache memory requirements and improve throughput for long sequences — the primary bottleneck in long-running agent workflows that accumulate tens of thousands of tokens of context.
Mixture-of-Experts (MoE): Despite having 550B total parameters, only 55B are activated per forward pass, keeping per-token compute costs comparable to a ~55B dense model while benefiting from the knowledge of a 550B-scale model.
Multi-Token Prediction (MTP): Predicts multiple tokens per forward pass, improving generation speed for the multi-turn tasks common in agentic systems.
Combined with NVFP4 pre-training (4-bit floating-point quantization), this architecture achieves 5.9× higher throughput than GLM-5.1-754B-A40B on an 8K input / 64K output configuration.
Nemotron 3 Ultra supports configurable reasoning via the chat template. Set
enable_thinking=True to activate the internal chain-of-thought trace before the final response — ideal for complex planning tasks. Set it to False for lower-latency conversational use cases. The NVFP4 quantized checkpoint is recommended for production deployments to maximize throughput.Model Specifications
| Property | Value |
|---|---|
| Total Parameters | 550B (55B active) |
| Architecture | LatentMoE: Mamba-2 + MoE + Attention + MTP |
| Context Length | Up to 1,000,000 tokens |
| Pre-training Data | 20 trillion tokens |
| Languages | English, French, Spanish, Italian, German, Japanese, Korean, Hindi, Brazilian Portuguese, Chinese |
| Min GPU Requirement | 16× H100 or 8× H200/GB200 |
| License | OpenMDW-1.1 (Linux Foundation) |
| Release Date | June 4, 2026 |
Why It Matters for Agentic Workflows
Most frontier-scale models are too slow and expensive for the kind of long-horizon, multi-tool agent tasks that enterprise AI systems require. Nemotron 3 Ultra addresses this directly.
In benchmarks on SWE-bench and Terminal Bench 2.0, Nemotron 3 Ultra completed tasks using fewer total tokens and fewer tokens per turn than comparable models. This translates to up to 30% lower cost for agentic workflows — a meaningful saving when agents run hundreds of turns per task.
The model was also trained using Multi-Teacher On-Policy Distillation (MOPD), which funnels feedback from over ten domain-specific teacher models into a single unified model. This gives Nemotron 3 Ultra strong performance across diverse domains — code, math, science, RAG, tool use — without requiring separate specialized models.
Nemotron 3 Ultra is packaged as an NVIDIA NIM microservice, allowing the same checkpoint to run across Hopper, Blackwell, and Ampere GPU architectures. It's available on AWS JumpStart, Azure, CoreWeave, Together AI, OpenRouter, DeepInfra, and a dozen other inference platforms. Enterprise teams can also deploy on-premises using NVIDIA DGX systems.
Available Checkpoints on Hugging Face
Three checkpoints are publicly available:
- NVFP4 (
NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4): Post-trained + quantized — recommended for production - BF16 (
NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16): Full-precision post-trained model - Base BF16 (
NVIDIA-Nemotron-3-Ultra-550B-A55B-Base-BF16): Pre-trained base for custom fine-tuning
All three support the same single-checkpoint deployment approach across GPU architectures. Cookbooks for vLLM, SGLang, and TRT-LLM are available in the NVIDIA developer resources.
Key Takeaways
- 550B-parameter fully open MoE model with weights, data, and training recipe all public
- Up to 5.9× faster inference than comparable open models via hybrid Mamba-Transformer architecture
- 30% lower cost for agentic tasks (fewer total tokens per task completion)
- 1M-token context window ideal for long-form document and codebase analysis
- NVIDIA NIM packaging enables identical deployment across cloud, on-prem, and edge environments