TL;DR
  • NVIDIA launched Nemotron 3 Ultra (550B total / 55B active params) at Computex 2026, claiming the top spot among US open-weights AI models
  • Scores 48 on the Artificial Analysis Intelligence Index, ahead of every other US open-weights model but behind China's Kimi K2.6 at 54
  • Key differentiator: 300+ tokens/second inference speed, 3–6× faster than comparable Chinese models in production

NVIDIA CEO Jensen Huang unveiled Nemotron 3 Ultra during his Computex 2026 keynote on June 1, completing the Nemotron 3 family that began with the Nano variant in December 2025. The model was officially released on June 4, 2026, and is available on HuggingFace, OpenRouter, and NVIDIA NIM.

Architecture: Hybrid Mamba-Transformer MoE

Nemotron 3 Ultra uses a novel hybrid Mamba-2 / Transformer / Mixture-of-Experts architecture. With 550 billion total parameters but only 55 billion active per token (10% of the weights, or 90% sparsity), the model aims for intelligence comparable to much larger dense models while keeping per-token compute closer to that of a 55B-class system. It supports up to 1 million tokens of context, a meaningful advantage for long-running enterprise AI agents, where competing Chinese models often max out at 256K tokens.
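
The sparsity figure above is simple arithmetic, and it can be checked with a minimal Python sketch. The parameter counts are the ones quoted in this article; the weights-only memory estimate is an added assumption for illustration (it ignores KV cache, activations, and any quantization scale factors) and is not an NVIDIA figure. The function names are hypothetical.

```python
# Figures quoted in the article.
TOTAL_PARAMS_B = 550   # total parameters, in billions
ACTIVE_PARAMS_B = 55   # parameters active per token, in billions


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the weights that take part in processing a single token."""
    return active_b / total_b


def weight_memory_gb(params_b: float, bits_per_param: int) -> float:
    """Weights-only footprint in GB (10^9 bytes).

    Assumption for illustration: every parameter is stored at the same
    precision, with no overhead for scale factors, KV cache, or activations.
    """
    return params_b * bits_per_param / 8


frac = active_fraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B)
sparsity = 1 - frac
print(f"active per token: {frac:.0%}, sparsity: {sparsity:.0%}")
print(f"weights at 16 bits: {weight_memory_gb(TOTAL_PARAMS_B, 16):.0f} GB")
print(f"weights at 4 bits:  {weight_memory_gb(TOTAL_PARAMS_B, 4):.0f} GB")
```

One design point this makes visible: in a Mixture-of-Experts model, only the active parameters drive per-token compute, but all 550B weights generally still have to be held in memory, so the "55B-class" comparison applies to compute rather than to memory footprint.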

  • Total parameters: 550B
  • Intelligence Index score: 48 / 100
  • Inference speed: 300+ tokens/sec

Performance: US #1, But China Still Leads

On the Artificial Analysis Intelligence Index, Nemotron 3 Ultra scores 48, making it the most intelligent open-weights model released by a US lab. It comfortably leads Google's Gemma 4 31B (39), Nemotron 3 Super (36), and OpenAI's gpt-oss-120b (33).

However, the China-US open-weights gap persists. Moonshot's Kimi K2.6 leads at 54, followed by GLM-5.1 (51) and MiniMax-M2.7 (49). Nemotron 3 Ultra narrows the gap considerably compared to prior US models but does not close it.
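
To make the gaps concrete, the sketch below recomputes them from the Intelligence Index scores quoted in this section. The scores are the article's; the data layout and the helper name `best_by_origin` are hypothetical choices made for illustration.

```python
# Artificial Analysis Intelligence Index scores as quoted in the article,
# stored as model -> (origin, score).
scores = {
    "Kimi K2.6": ("China", 54),
    "GLM-5.1": ("China", 51),
    "MiniMax-M2.7": ("China", 49),
    "Nemotron 3 Ultra": ("US", 48),
    "Gemma 4 31B": ("US", 39),
    "Nemotron 3 Super": ("US", 36),
    "gpt-oss-120b": ("US", 33),
}


def best_by_origin(table):
    """Return origin -> (model, score) for the top-scoring model per origin."""
    best = {}
    for model, (origin, score) in table.items():
        if origin not in best or score > best[origin][1]:
            best[origin] = (model, score)
    return best


best = best_by_origin(scores)
gap_to_leader = best["China"][1] - best["US"][1]

# Lead of the best US model over the next-best US model.
us_scores = sorted((s for o, s in scores.values() if o == "US"), reverse=True)
lead_over_next_us = us_scores[0] - us_scores[1]

print(f"Top US model: {best['US'][0]} ({best['US'][1]})")
print(f"Top China model: {best['China'][0]} ({best['China'][1]})")
print(f"Gap to overall leader: {gap_to_leader} points")
print(f"Lead over next US model: {lead_over_next_us} points")
```

With the quoted scores, the gap to Kimi K2.6 is 6 points, while the lead over the next US model (Gemma 4 31B) is 9 points.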

Pro Tip: NVIDIA also offers an NVFP4 quantized version of Nemotron 3 Ultra for even higher inference throughput. On GB200 hardware with NVFP4, it achieves 5.9× the throughput of GLM-5.1 and 4.8× that of Kimi K2.6 — critical for cost-sensitive enterprise deployments.

Speed: The Real Competitive Edge

Where Nemotron 3 Ultra clearly dominates is inference speed. On a pre-release DeepInfra endpoint, it served over 300 tokens per second, roughly 3–6× the rate of comparable models from DeepSeek and Moonshot, which typically run at 50–100 tokens/sec in production. For enterprise agentic workflows, where latency directly affects user experience and cost, a gap of that size can be decisive.
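
The 3–6× range follows directly from the quoted throughput numbers, as the short sketch below shows. The throughput values come from this article; the 3,000-token response length is a hypothetical workload chosen only to turn tokens per second into wall-clock time, and the calculation ignores time-to-first-token, queuing, and network latency.

```python
# Throughput figures quoted in the article (tokens per second).
NEMOTRON_TPS = 300          # lower bound reported on the pre-release endpoint
BASELINE_TPS = (50, 100)    # typical production range for the compared models


def speedup(fast_tps: float, slow_tps: float) -> float:
    """How many times faster the first throughput is than the second."""
    return fast_tps / slow_tps


def generation_seconds(num_tokens: int, tps: float) -> float:
    """Time to stream num_tokens at a steady decode rate (decode only)."""
    return num_tokens / tps


speedup_low = speedup(NEMOTRON_TPS, max(BASELINE_TPS))   # against the faster baseline
speedup_high = speedup(NEMOTRON_TPS, min(BASELINE_TPS))  # against the slower baseline
print(f"speedup range: {speedup_low:.0f}x to {speedup_high:.0f}x")

OUTPUT_TOKENS = 3000  # hypothetical length of one long agent response
for tps in (NEMOTRON_TPS, *sorted(BASELINE_TPS, reverse=True)):
    seconds = generation_seconds(OUTPUT_TOKENS, tps)
    print(f"{tps:>3} tokens/sec -> {seconds:.0f} s for {OUTPUT_TOKENS} tokens")
```

Under these assumptions, the same response streams in about 10 seconds at 300 tokens/sec versus 30 to 60 seconds at 50–100 tokens/sec, and the difference compounds across multi-step agent runs.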

NVIDIA frames speed as the key selling point: even if intelligence benchmarks favor Chinese open models, Nemotron 3 Ultra is the fastest intelligent open model US developers can access commercially.

Benchmark Comparison

Benchmark                 Nemotron 3 Ultra  Kimi K2.6  GLM-5.1  Qwen3.5
Agent Productivity        91%               91%        84%      89%
Long-Horizon Planning     33%               29%        40%      30%
Coding                    54%               67%        64%      53%
Instruction Following     82%               74%        77%      78%
Professional Work         56%               56%        46%      53%
Long Context (1M tokens)  95%               N/A        N/A      90%

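Nemotron 3 Ultra leads outright on instruction following and long-context handling, and ties Kimi K2.6 for first on agent productivity and professional work. It trails on coding, where Kimi K2.6 and GLM-5.1 hold a clear edge, and on long-horizon planning, where it sits behind GLM-5.1 but ahead of Kimi K2.6 and Qwen3.5.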
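
For readers who want to check the per-benchmark standings themselves, here is a small Python sketch that finds the leader or leaders in each row. The percentages are the ones in the comparison above; N/A entries are represented as `None`, and the `leaders` helper is a hypothetical name introduced for this example.

```python
# Benchmark results as quoted in the article (percent); None means N/A.
benchmarks = {
    "Agent Productivity": {"Nemotron 3 Ultra": 91, "Kimi K2.6": 91, "GLM-5.1": 84, "Qwen3.5": 89},
    "Long-Horizon Planning": {"Nemotron 3 Ultra": 33, "Kimi K2.6": 29, "GLM-5.1": 40, "Qwen3.5": 30},
    "Coding": {"Nemotron 3 Ultra": 54, "Kimi K2.6": 67, "GLM-5.1": 64, "Qwen3.5": 53},
    "Instruction Following": {"Nemotron 3 Ultra": 82, "Kimi K2.6": 74, "GLM-5.1": 77, "Qwen3.5": 78},
    "Professional Work": {"Nemotron 3 Ultra": 56, "Kimi K2.6": 56, "GLM-5.1": 46, "Qwen3.5": 53},
    "Long Context (1M tokens)": {"Nemotron 3 Ultra": 95, "Kimi K2.6": None, "GLM-5.1": None, "Qwen3.5": 90},
}


def leaders(row):
    """Return the alphabetically sorted model(s) with the top score in a row.

    Models without a reported score (None) are skipped; ties return
    every model that shares the top score.
    """
    scored = {model: s for model, s in row.items() if s is not None}
    top = max(scored.values())
    return sorted(model for model, s in scored.items() if s == top)


for name, row in benchmarks.items():
    print(f"{name}: {', '.join(leaders(row))}")
```

Returning a list rather than a single name keeps ties visible, which matters here because two of the six rows have a shared top score.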

License Note: Nemotron 3 Ultra is released under the NVIDIA Open Model License, permitting commercial use. Unusually, NVIDIA also published training recipes and a substantial portion of the training data alongside the weights, going further than most US frontier model releases.

What This Means

Nemotron 3 Ultra is a genuine milestone for US open-weights AI. It narrows the gap with China's open-weights frontier and offers a commercially viable, blazing-fast alternative for enterprise teams that need to self-host. The remaining 6-point intelligence gap with Kimi K2.6 suggests the China-US open-weights race is far from over, but with Computex 2026 behind us, the next round of NVIDIA model releases may be just months away.

Key Takeaways

  • First US open-weights model to reach a score of 48 on the Artificial Analysis Intelligence Index
  • China's Kimi K2.6 still leads at 54 — the US-China open-weights intelligence gap remains real
  • Inference speed of 300+ tokens/sec is 3–6× faster than comparable Chinese models in production
  • Hybrid Mamba-2 / Transformer / MoE architecture with 1M token context at competitive cost
  • Commercially usable under NVIDIA Open Model License; training recipes and data also published