What Happened
On June 8, 2026, Xiaomi — in collaboration with inference partner TileRT — released MiMo-V2.5-Pro-UltraSpeed, achieving what they describe as an industry first: a 1-trillion-parameter large language model decoding at over 1,000 tokens per second on a single standard 8-GPU commodity node, with peak throughput reaching approximately 1,200 tokens per second.
To put those numbers in context: according to Artificial Analysis benchmarks, GPT-5.5 (the model most ChatGPT users interact with) decodes at about 68 tokens per second (TPS), Claude Opus 4.6 at around 71 TPS, and Gemini Flash, previously among the fastest available models, at 192 TPS. MiMo UltraSpeed exceeds 1,000 TPS on a model that reportedly matches Opus on coding benchmarks.
- 🚀 Decode speed: 1,000+ TPS (peak ~1,200 TPS in demos)
- 📊 Vs. competitors: ~15× faster than GPT-5.5 (68 TPS), ~14× faster than Opus 4.6 (71 TPS)
- 💻 Hardware: Single standard 8-GPU commodity node — no custom silicon
- ⚡ Speedup vs. baseline: ~10× over standard MiMo-V2.5-Pro API
- 💰 Pricing: 3× the standard MiMo-V2.5-Pro API rate
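The speedup multiples in the list above follow directly from the quoted decode speeds. A minimal sketch of the arithmetic, using the 1,000 TPS floor rather than the ~1,200 TPS peak (the function and variable names are illustrative only):

```python
# Decode speeds in tokens per second (TPS), as quoted in this article.
QUOTED_TPS = {
    "GPT-5.5": 68,
    "Claude Opus 4.6": 71,
    "Gemini Flash": 192,
}
ULTRASPEED_TPS = 1000  # the "1,000+" floor, not the ~1,200 peak


def speedup(baseline_tps: float, fast_tps: float = ULTRASPEED_TPS) -> float:
    """Return how many times faster fast_tps is than baseline_tps."""
    return fast_tps / baseline_tps


if __name__ == "__main__":
    for name, tps in QUOTED_TPS.items():
        print(f"{name}: {speedup(tps):.1f}x")
```

Using the peak figure instead of the floor would raise every multiple by about 20%.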
The Three-Layer Technique
Xiaomi calls the approach "extreme model-system codesign." No single technique gets to 1,000 TPS; all three layers working together do.
1. FP4 (MXFP4) Quantization
FP4 quantization is applied exclusively to the Mixture-of-Experts (MoE) expert layers, not to the full model. This shrinks the model's memory footprint and reduces the memory bandwidth needed to read weights during decoding. Crucially, the quantization uses Quantization-Aware Training (QAT), which means the model learned to operate in FP4 precision during training, preserving capability close to the full-precision baseline.
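To make the format concrete, here is a rough sketch of MXFP4-style block quantization as laid out in the OCP Microscaling (MX) specification: 32 elements share one power-of-two scale, and each element is a 4-bit E2M1 value. It is a pure-Python illustration, not Xiaomi's or TileRT's implementation; the function names are hypothetical, and whether the released checkpoint uses exactly this layout is an assumption.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (the sign is a separate bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2      # largest E2M1 exponent: 6.0 == 1.5 * 2**2
BLOCK_SIZE = 32    # elements sharing one 8-bit scale in MXFP4


def quantize_block(block):
    """Quantize one block to (shared_exponent, signed E2M1 values)."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # Power-of-two scale that places the largest magnitude near the top
    # of the E2M1 range.
    shared_exp = math.floor(math.log2(max_abs)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    quantized = []
    for x in block:
        target = min(abs(x) / scale, E2M1_VALUES[-1])  # clamp to 6.0
        # Nearest grid point; ties resolve to the smaller magnitude here,
        # whereas real implementations define an explicit rounding mode.
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - target))
        quantized.append(math.copysign(nearest, x))
    return shared_exp, quantized


def dequantize_block(shared_exp, quantized):
    """Recover approximate real values from a quantized block."""
    scale = 2.0 ** shared_exp
    return [q * scale for q in quantized]


def bits_per_weight(block_size=BLOCK_SIZE):
    """4 bits per element plus one 8-bit scale amortized over the block."""
    return 4 + 8 / block_size
```

Under this layout each quantized weight costs 4.25 bits on average, compared with 16 bits for a BF16 weight, which is where the footprint and bandwidth savings on the expert layers would come from.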
2. DFlash Speculative Decoding
DFlash is a speculative decoding method based on block-level masked parallel prediction. Instead of generating one token per forward pass, DFlash predicts and verifies an entire masked block of tokens in a single step. In coding benchmarks, it achieves an average accepted token length of 6.30 per step, so each step yields roughly six tokens on average where standard autoregressive generation yields one.
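The acceptance rule behind any draft-then-verify scheme can be sketched with toy models. The example below shows generic greedy speculative decoding only: it does not reproduce DFlash's masked parallel drafting, the toy "models" are hypothetical stand-ins, and the verification loop emulates what a real system would do in one batched forward pass of the target model.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]
DraftBlock = Callable[[List[int]], List[int]]


def speculative_step(prefix: List[int], draft_block: List[int],
                     target_next: NextToken) -> List[int]:
    """Verify a drafted block against the target; return the accepted tokens."""
    accepted: List[int] = []
    context = list(prefix)
    for drafted in draft_block:
        expected = target_next(context)
        if drafted != expected:
            accepted.append(expected)  # the target's token replaces the first miss
            return accepted
        accepted.append(drafted)
        context.append(drafted)
    accepted.append(target_next(context))  # bonus token when the whole block matches
    return accepted


def generate(prompt: List[int], draft_block_fn: DraftBlock,
             target_next: NextToken, max_new_tokens: int) -> Tuple[List[int], int]:
    """Decode with draft-and-verify; return (new tokens, verification steps)."""
    tokens = list(prompt)
    steps = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        block = draft_block_fn(tokens)
        tokens.extend(speculative_step(tokens, block, target_next))
        steps += 1
    return tokens[len(prompt):len(prompt) + max_new_tokens], steps


def toy_target(context: List[int]) -> int:
    """Hypothetical target model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


def toy_draft_block(context: List[int], block_len: int = 4) -> List[int]:
    """Hypothetical draft model: same rule, but it always gets 7 wrong."""
    block, last = [], context[-1]
    for _ in range(block_len):
        nxt = (last + 1) % 10
        if nxt == 7:
            nxt = 0  # deliberate draft error
        block.append(nxt)
        last = nxt
    return block
```

Because every accepted token is checked against the target model, the output is identical to plain greedy decoding with the target alone; the draft only changes how many verification steps are needed.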
3. TileRT Inference Engine
TileRT built a custom compilation engine and compute kernels specifically optimized for this combined FP4 + DFlash pipeline. The runtime adapts dynamically to the characteristics of speculative decoding, with the aim of keeping the GPUs fully utilized and eliminating common operator overhead.
| Component | Role | Benefit |
|---|---|---|
| FP4 (MXFP4) Quantization | Compress MoE Expert layers | Lower memory footprint, lower bandwidth demand |
| DFlash Speculative Decoding | Block-level parallel token prediction | More tokens accepted per step |
| TileRT Inference Engine | Custom compilation + kernels | Removes operator overhead |
| Combined Result | Single 8-GPU commodity node | 1,000+ TPS on 1T model |
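A back-of-envelope way to see how the layers combine: single-stream decode throughput is roughly the number of tokens accepted per step divided by the time each step takes. The sketch below assumes the 6.30 average accepted length from coding benchmarks also applies to the headline 1,000 TPS figure, which the article does not state; the function names are illustrative.

```python
def implied_step_latency_ms(tokens_per_second: float,
                            accepted_per_step: float) -> float:
    """Time budget per draft-and-verify step implied by a target throughput."""
    return accepted_per_step / tokens_per_second * 1000.0


def throughput_tps(accepted_per_step: float, step_latency_ms: float) -> float:
    """Tokens per second for a given acceptance length and step latency."""
    return accepted_per_step / (step_latency_ms / 1000.0)


if __name__ == "__main__":
    latency = implied_step_latency_ms(1000, 6.30)
    print(f"Implied step latency: {latency:.1f} ms")
    # Same step latency, but only one token per step (no speculation).
    print(f"Without speculation: {throughput_tps(1.0, latency):.0f} TPS")
```

Under that assumption, each step over the 1-trillion-parameter model has to finish in about 6.3 ms, which is the part that quantization and the custom runtime would have to deliver.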
At 1,000 TPS, inference speed stops being a bottleneck and becomes a capability multiplier. You can run dozens of reasoning paths in parallel, power real-time agentic loops, and tackle latency-sensitive workloads (fraud detection, trading signal generation, multi-agent coordination) that were impractical at 60–70 TPS. The paradigm shifts from "AI you ask a question" to "AI that works continuously in the background."
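To illustrate what this means for an agentic loop, consider a hypothetical workload of 20 sequential steps that each generate 500 tokens. The sketch counts decode time only and ignores prefill, network, and tool-call latency; the workload numbers are assumptions chosen for illustration, not measurements.

```python
def loop_decode_seconds(steps: int, tokens_per_step: int, tps: float) -> float:
    """Total decode time for a sequential loop, ignoring all other latency."""
    return steps * tokens_per_step / tps


if __name__ == "__main__":
    # Hypothetical agent run: 20 sequential steps of 500 generated tokens each.
    for label, tps in [("68 TPS", 68), ("1,000 TPS", 1000)]:
        seconds = loop_decode_seconds(20, 500, tps)
        print(f"{label}: {seconds:.0f} s of decoding")
```

At 68 TPS that run spends about two and a half minutes decoding; at 1,000 TPS it spends ten seconds.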
How It Compares to Custom Silicon
Companies like Cerebras (wafer-scale integration) and Groq (on-chip SRAM custom architecture) have achieved similar extreme inference speeds — but required purpose-built, multi-million-dollar hardware. Xiaomi's key claim is achieving comparable throughput on commodity GPUs through software co-design alone, making the approach deployable on standard cloud or on-premises infrastructure.
UltraSpeed is not a new or smaller model. It's a serving mode layered on top of the existing MiMo-V2.5-Pro flagship, accelerating the full 1-trillion-parameter architecture. The model's capabilities are meant to stay close to the full-precision baseline; what changes is how fast it generates output. The FP4-DFlash checkpoint, including quantized weights and DFlash parameters, is open-sourced on Hugging Face for community testing and verification.
API Trial Details
The UltraSpeed API is available as a limited, application-based trial:
- Trial window: June 9 – June 23, 2026 (Beijing time / UTC+8)
- Pricing: 3× the standard MiMo-V2.5-Pro API rate
- Access: API only; Token Plan not supported
- Priority: Enterprises and professional developers with genuine business needs
- Apply at: platform.xiaomimimo.com/ultraspeed
Key Takeaways
- What Xiaomi describes as the first 1,000+ TPS decode speed on a 1-trillion-parameter model, achieved on commodity GPUs rather than custom silicon
- Three co-designed techniques: FP4 quantization + DFlash speculative decoding + TileRT runtime
- ~15× faster than GPT-5.5 and ~14× faster than Claude Opus 4.6 in decode speed, on a model reported to match Opus on coding benchmarks
- FP4-DFlash checkpoint open-sourced on Hugging Face for community testing
- Developer API trial open June 9–23 (enterprise and professional dev priority)