TL;DR — Google DeepMind released DiffusionGemma under the Apache 2.0 license on June 10, 2026. Instead of predicting one token at a time, it simultaneously fills a 256-token "canvas" and iteratively refines it — a process borrowed from image diffusion models. The result: 1,000+ tokens per second on a single H100, up to 4x faster than comparable autoregressive LLMs. A quantized version runs on a consumer RTX 4090 (24 GB VRAM) fully offline.

Why Autoregressive LLMs Have a Speed Ceiling

Every mainstream language model (GPT, Claude, Gemini) generates text using an autoregressive loop: predict one token, commit it, then use everything generated so far to predict the next. Each step requires a full forward pass through the network, and most of the time in that pass goes to streaming model weights and cached attention state from GPU memory rather than to arithmetic. The GPU ends up memory-bandwidth-bound, leaving its raw compute largely idle.
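A rough way to see this ceiling is to divide memory bandwidth by the bytes that must be read for each token. The sketch below is a back-of-the-envelope estimate, not a measurement. It assumes a spec-sheet bandwidth of about 3.35 TB/s for an H100 SXM, 16-bit weights, and that only the ~4B active parameters described in the architecture section are read per token. It ignores KV-cache traffic and batching.

```python
# Back-of-the-envelope decode ceiling for a memory-bandwidth-bound model.
# Every constant here is an illustrative assumption, not a measured value.

def decode_ceiling_tok_per_s(bytes_read_per_token: float,
                             bandwidth_bytes_per_s: float) -> float:
    """Upper bound on tokens/sec when each token requires re-reading the weights."""
    return bandwidth_bytes_per_s / bytes_read_per_token

HBM_BANDWIDTH = 3.35e12   # assumed: ~3.35 TB/s (H100 SXM spec-sheet figure)
ACTIVE_PARAMS = 4e9       # ~4B active parameters per step (from the article)
BYTES_PER_PARAM = 2       # assumed: 16-bit weights

bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM
ceiling = decode_ceiling_tok_per_s(bytes_per_token, HBM_BANDWIDTH)
print(f"bandwidth-bound ceiling: ~{ceiling:.0f} tok/s")  # ~419 tok/s
```

Under these assumptions a single-stream autoregressive decoder cannot exceed a few hundred tokens per second, however much compute the GPU has to spare.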

DiffusionGemma addresses this by shifting the bottleneck from memory bandwidth to raw compute. Inspired by text-to-image diffusion models, it places all 256 token slots on a "canvas" at once, makes a rough guess for all of them simultaneously, then refines iteratively, locking in high-confidence tokens on each pass and re-sampling the rest. Simple answers converge in ~12 passes; complex reasoning uses up to 48. Because each pass works on all 256 positions at once instead of a single token, every read of the weights is amortized over far more arithmetic, and GPU utilization rises accordingly.
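The refinement loop described above can be sketched in a few lines. This is a toy illustration of confidence-based unmasking, not DiffusionGemma's actual sampler. `toy_model`, the 0.8 threshold, and the rule for committing leftover slots on the final pass are all hypothetical. A real model would condition its proposals on the tokens already committed, which this stand-in does not.

```python
import random

CANVAS_SIZE = 256   # canvas width, from the article
MASK = None         # marks a slot that has not been committed yet

def toy_model(canvas, rng):
    """Hypothetical stand-in for the network: one (token, confidence) per slot."""
    return [(rng.randrange(1000), rng.random()) for _ in canvas]

def fill_canvas(max_passes=48, threshold=0.8, seed=0):
    """Iteratively commit confident tokens until the canvas is full."""
    rng = random.Random(seed)
    canvas = [MASK] * CANVAS_SIZE
    for pass_idx in range(1, max_passes + 1):
        proposals = toy_model(canvas, rng)
        open_slots = [i for i, tok in enumerate(canvas) if tok is MASK]
        # Lock in every open slot whose proposal is confident enough.
        chosen = [i for i in open_slots if proposals[i][1] >= threshold]
        if not chosen:
            # Guarantee progress: commit the single most confident open slot.
            chosen = [max(open_slots, key=lambda i: proposals[i][1])]
        if pass_idx == max_passes:
            # Pass budget exhausted: commit whatever is still open.
            chosen = open_slots
        for i in chosen:
            canvas[i] = proposals[i][0]
        if all(tok is not MASK for tok in canvas):
            return canvas, pass_idx
    return canvas, max_passes

canvas, passes_used = fill_canvas()
print(f"filled {len(canvas)} slots in {passes_used} passes")
```

Slots that are not committed stay masked and get fresh proposals on the next pass, which corresponds to the "re-sampling the rest" step.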

Up to 4x max speed gain vs. autoregressive
1,000+ tokens/sec on NVIDIA H100
~18 GB VRAM (quantized, consumer GPU)

Architecture: 26B MoE, Only 4B Active Parameters

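DiffusionGemma builds on the Gemma 4 26B Mixture-of-Experts (MoE) backbone. Total parameters are 26B, but only ~4B activate per inference step, which cuts the compute needed for each pass. All 26B weights still have to be held in memory, so it is quantization (roughly 16 GB at Q4_K_M) that makes the model small enough to fit on consumer hardware.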
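The arithmetic behind the quantized footprint is simple enough to check by hand. The sketch below counts weight storage only, with no activations or runtime buffers. The figure of about 4.85 bits per weight for Q4_K_M is an assumed average, since the exact value varies by model.

```python
def weight_gigabytes(total_params: float, bits_per_weight: float) -> float:
    """Size of the stored weights in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9

TOTAL_PARAMS = 26e9  # total parameters, from the article

formats = [
    ("16-bit", 16.0),
    ("8-bit", 8.0),
    ("Q4_K_M (assumed ~4.85 bits/weight)", 4.85),
]
for label, bits in formats:
    print(f"{label}: {weight_gigabytes(TOTAL_PARAMS, bits):.1f} GB")
# 16-bit: 52.0 GB
# 8-bit: 26.0 GB
# Q4_K_M (assumed ~4.85 bits/weight): 15.8 GB
```

The last line is consistent with the ~16 GB download size quoted for the Q4_K_M file. The remaining gap up to ~18 GB of VRAM would be runtime overhead.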

Key technical highlights:

  • Bidirectional attention: Every token in the 256-slot canvas can attend to every other token simultaneously — the end of a sentence can be resolved before its middle.
  • Multi-canvas generation: Canvases chain sequentially, enabling long-form outputs beyond 256 tokens.
  • NVFP4 support: Native 4-bit floating-point on NVIDIA Blackwell GPUs for additional throughput on enterprise hardware.
  • Multimodal inputs: Accepts text, image, and video inputs (audio not yet supported).
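To get a feel for multi-canvas generation, the sketch below counts how many canvases and refinement passes a long output would need. It assumes every canvas is filled completely and uses the same number of passes. `plan_generation` is a hypothetical helper, not part of any released API.

```python
import math

CANVAS_SIZE = 256  # tokens per canvas, from the article

def plan_generation(n_tokens: int, passes_per_canvas: int):
    """Return (canvases, total forward passes) for an output of n_tokens."""
    canvases = math.ceil(n_tokens / CANVAS_SIZE)
    return canvases, canvases * passes_per_canvas

for passes in (12, 48):  # the ~12 and 48 pass figures quoted in the article
    canvases, total = plan_generation(2048, passes)
    print(f"{passes} passes/canvas: {canvases} canvases, {total} forward passes")
# 12 passes/canvas: 8 canvases, 96 forward passes
# 48 passes/canvas: 8 canvases, 384 forward passes
```

An autoregressive model would need 2,048 sequential forward passes for the same output. Each diffusion pass processes 256 positions and so costs more than one autoregressive step, so the pass count alone does not translate directly into wall-clock speedup.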
💡
Who should use DiffusionGemma?
It's optimized for workflows where generation speed matters most: inline code completion, real-time interactive editing, and non-linear text structures. For maximum accuracy (math reasoning, hard coding benchmarks), the standard autoregressive Gemma 4 models still win. Google itself describes DiffusionGemma as "experimental" and does not recommend it as a production replacement for high-quality tasks.

Benchmark Comparison

Metric                    | DiffusionGemma       | Gemma 4 (autoregressive)
Speed (H100)              | 1,000+ tok/s         | ~250 tok/s
Speed (RTX 5090)          | 700+ tok/s           | ~200 tok/s
MMLU (general knowledge)  | −5 pts vs. baseline  | Baseline
GPQA (grad-level science) | −9 pts vs. baseline  | Baseline
HumanEval (coding)        | −8 pts vs. baseline  | Baseline
VRAM (quantized)          | ~18 GB               | ~18 GB

The accuracy trade-off is real but modest (5 to 9 points across these three benchmarks), and for speed-critical applications a 3.5x to 4x gain in generation speed is often worth it.
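The speedup figures follow directly from the throughput rows in the table. The short sketch below recomputes them and converts throughput into waiting time for a 500-token response. The response length is an arbitrary example. The throughput values are the article's approximate figures, with "1,000+" and "700+" treated as 1,000 and 700.

```python
# (diffusion tok/s, autoregressive tok/s), taken from the benchmark table
THROUGHPUT = {
    "H100": (1000, 250),
    "RTX 5090": (700, 200),
}

def speedup(diffusion_tps: float, autoregressive_tps: float) -> float:
    """Ratio of diffusion throughput to autoregressive throughput."""
    return diffusion_tps / autoregressive_tps

def seconds_for(n_tokens: int, tok_per_s: float) -> float:
    """Time to generate n_tokens at a steady throughput."""
    return n_tokens / tok_per_s

for gpu, (diff_tps, ar_tps) in THROUGHPUT.items():
    print(f"{gpu}: {speedup(diff_tps, ar_tps):.1f}x, "
          f"500 tokens in {seconds_for(500, diff_tps):.2f}s "
          f"vs {seconds_for(500, ar_tps):.2f}s")
# H100: 4.0x, 500 tokens in 0.50s vs 2.00s
# RTX 5090: 3.5x, 500 tokens in 0.71s vs 2.50s
```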

ℹ️
Run it on a single RTX 4090 — 4 steps
1. Download unsloth/diffusiongemma-26B-A4B-it-GGUF (Q4_K_M, ~16 GB)
2. Check out llama.cpp diffusion sampler PR #24423 and build with CUDA
3. Run: llama-cli -m model.gguf -n 2048 -p "your prompt"
4. Expect ~100 tok/s, ~19 GB VRAM — fully offline, no API key needed.
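If you would rather drive step 3 from a script, a thin Python wrapper around the same command might look like the sketch below. It uses only the flags shown above (`-m`, `-n`, `-p`). `build_llama_command` and `run_if_available` are hypothetical helper names, and `model.gguf` is the placeholder path from step 3.

```python
import os
import shutil
import subprocess

def build_llama_command(model_path, prompt, n_predict=2048, binary="llama-cli"):
    """Assemble the argument list for the command shown in step 3."""
    return [binary, "-m", model_path, "-n", str(n_predict), "-p", prompt]

def run_if_available(cmd):
    """Run the command only when the binary and the model file are both present."""
    binary, model_path = cmd[0], cmd[2]
    if shutil.which(binary) is None or not os.path.exists(model_path):
        print("skipping: llama-cli or the model file was not found")
        return None
    return subprocess.run(cmd, check=True)

command = build_llama_command("model.gguf", "your prompt")
run_if_available(command)
```

Passing the arguments as a list avoids shell quoting problems when the prompt contains spaces or quotes.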

What New Applications Does This Unlock?

Google DeepMind's internal demos showed DiffusionGemma powering experiences that are impractical at autoregressive latencies: a fake Wikipedia where every page is generated on-the-fly as you click, a Reddit clone where AI comments and images appear instantly, an OS-like interface where each click renders the next screen, and a voice-built todo app completed in 15 seconds. These are latency-sensitive, interactive, local-first workflows that become viable when generation speed crosses ~1,000 tokens/sec.

Key Takeaways
  • DiffusionGemma is an open-weights text diffusion LLM that generates 256 tokens in parallel rather than one at a time.
  • Up to 4x faster than autoregressive equivalents: 1,000+ tok/s on H100, 700+ on RTX 5090.
  • Apache 2.0 license — free to download, run locally, and fine-tune.
  • Accuracy is modestly lower than same-size autoregressive models; best for speed-first use cases.
  • Quantized version fits on a single consumer RTX 4090 (24 GB VRAM) for fully offline deployment.
🔗
Official Sources & Documentation
Google Blog: Introducing DiffusionGemma
Google AI for Developers: DiffusionGemma Developer Docs
Google DeepMind: Model Overview Page
Hugging Face: Download Model Weights (Apache 2.0)