Why Autoregressive LLMs Have a Speed Ceiling
Every mainstream language model (GPT, Claude, Gemini) generates text using an autoregressive loop: predict one token, commit it, then use everything generated so far to predict the next. Each step requires a full forward pass through the network, and most of that pass's time goes to moving model weights and cached state through memory rather than to arithmetic. The GPU ends up memory-bandwidth-bound, leaving its raw compute largely idle.
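A back-of-the-envelope sketch of why bandwidth sets the ceiling: if every generated token requires streaming the active weights from memory once, throughput cannot exceed bandwidth divided by bytes read per token. The function name and the example figures below are illustrative assumptions, not measurements, and the estimate ignores KV-cache traffic and batching.

```python
def max_decode_tokens_per_sec(bandwidth_gb_s: float,
                              active_params_billions: float,
                              bits_per_weight: float) -> float:
    """Upper bound on single-stream decode speed when weight reads dominate."""
    # GB that must be streamed from memory for one forward pass (one token).
    gb_per_token = active_params_billions * bits_per_weight / 8
    return bandwidth_gb_s / gb_per_token


# Hypothetical device: 1,000 GB/s of memory bandwidth, 4B active params at 8 bits.
print(max_decode_tokens_per_sec(1000, 4, 8))  # 250.0
```

Adding more compute does not move this bound. Only reading fewer bytes per token, or producing more tokens per read, does.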
DiffusionGemma addresses this by shifting the bottleneck from memory bandwidth to raw compute. Inspired by text-to-image diffusion models, it places all 256 token slots on a "canvas" at once, makes a rough guess for every slot simultaneously, then refines iteratively: each pass locks in the high-confidence tokens and re-samples the rest. Simple answers converge in ~12 passes; complex reasoning uses up to 48. Because tokens within a canvas do not depend on one another sequentially, each pass processes all 256 slots in parallel, so the GPU spends far more of its time on arithmetic than on waiting for memory.
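The refine-and-lock loop can be sketched in a few lines. This is a toy illustration only: `toy_model` is a hypothetical stand-in that returns random tokens and confidences, and the 0.8 threshold is an assumption, not a documented DiffusionGemma setting. The canvas size (256) and pass budget (48) come from the description above.

```python
import random

CANVAS_SIZE = 256
MASK = None  # marks a slot that has not been committed yet


def toy_model(canvas, rng):
    """Stand-in for the denoiser: one (token, confidence) guess per slot."""
    return [(rng.randrange(1000), rng.random()) for _ in canvas]


def refine(max_passes=48, threshold=0.8, seed=0):
    rng = random.Random(seed)
    canvas = [MASK] * CANVAS_SIZE
    passes_used = 0
    for step in range(1, max_passes + 1):
        passes_used = step
        final_pass = step == max_passes
        guesses = toy_model(canvas, rng)  # all slots guessed in one pass
        for i, (token, confidence) in enumerate(guesses):
            # Lock in confident guesses; open slots are re-sampled next pass.
            # On the final pass, commit whatever is left.
            if canvas[i] is MASK and (confidence >= threshold or final_pass):
                canvas[i] = token
        if MASK not in canvas:
            break  # converged early
    return canvas, passes_used


canvas, passes = refine()
print(f"filled {len(canvas)} slots in {passes} passes")
```

The cost that matters is the number of passes, not the number of tokens: a canvas that converges early uses a fraction of the pass budget regardless of how many slots it holds.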
Architecture: 26B MoE, Only 4B Active Parameters
DiffusionGemma builds on the Gemma 4 26B Mixture-of-Experts (MoE) backbone. Total parameters are 26B, but only ~4B activate per inference step, which keeps per-step compute low. The full 26B weights still have to be resident in memory, so fitting on consumer hardware depends on quantization.
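To see why quantization decides whether the model fits on a consumer card, here is a rough weight-storage estimate for a 26B-parameter model at different precisions. The helper name is illustrative, and the figures cover weights only, not activations or caches.

```python
def weight_memory_gb(total_params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return total_params_billions * bits_per_weight / 8


for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(26, bits):.0f} GB")
# 16-bit: 52 GB
# 8-bit: 26 GB
# 4-bit: 13 GB
```

Real quantized files usually come out somewhat larger than the pure 4-bit figure, since formats such as Q4_K_M store scaling factors and keep some tensors at higher precision.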
Key technical highlights:
- Bidirectional attention: Every token in the 256-slot canvas can attend to every other token simultaneously — the end of a sentence can be resolved before its middle.
- Multi-canvas generation: Canvases chain sequentially, enabling long-form outputs beyond 256 tokens.
- NVFP4 support: Native 4-bit floating-point on NVIDIA Blackwell GPUs for additional throughput on enterprise hardware.
- Multimodal inputs: Accepts text, image, and video inputs (audio not yet supported).
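Two of the highlights above lend themselves to a small illustration: how a bidirectional attention mask differs from the causal mask used in autoregressive decoding, and how many 256-slot canvases a longer output needs. This is a conceptual sketch with hypothetical helper names. It does not show how DiffusionGemma conditions one canvas on the previous ones, which the description above does not specify.

```python
import math


def causal_mask(n):
    """Autoregressive: position i may attend only to positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Diffusion canvas: every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def canvases_needed(total_tokens, canvas_size=256):
    """Number of sequential canvases required to cover an output length."""
    return math.ceil(total_tokens / canvas_size)


for row in causal_mask(4):
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# xxx.
# xxxx

print(canvases_needed(2048))  # 8
```

With the bidirectional mask every row is fully filled, which is what allows a late token to be settled before an earlier one.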
It's optimized for workflows where generation speed matters most: inline code completion, real-time interactive editing, non-linear text structures. For maximum accuracy — math reasoning, hard coding benchmarks — the standard autoregressive Gemma 4 models still win. Google itself notes DiffusionGemma is "experimental" and not recommended as a production replacement for high-quality tasks.
Benchmark Comparison
| Metric | DiffusionGemma | Gemma 4 (Autoregressive) |
|---|---|---|
| Speed (H100) | 1,000+ tok/s | ~250 tok/s |
| Speed (RTX 5090) | 700+ tok/s | ~200 tok/s |
| MMLU (general knowledge) | −5 pts vs. baseline | Baseline |
| GPQA (grad-level science) | −9 pts vs. baseline | Baseline |
| HumanEval (coding) | −8 pts vs. baseline | Baseline |
| VRAM (quantized) | ~18 GB | ~18 GB |
The accuracy trade-off is real: 5 to 9 points across these benchmarks. For speed-critical applications, generation that is 3.5x to 4x faster is often worth it.
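To put the throughput figures in user-facing terms, the sketch below converts tokens per second into wait time for a response. The 500-token response length is a hypothetical example. The rates are the H100 figures from the table, and the estimate ignores prompt processing time.

```python
def generation_seconds(num_tokens: int, tokens_per_sec: float) -> float:
    """Time to generate a response at a steady throughput."""
    return num_tokens / tokens_per_sec


RATES = {
    "DiffusionGemma (H100)": 1000,
    "Gemma 4 autoregressive (H100)": 250,
}
for name, rate in RATES.items():
    print(f"{name}: {generation_seconds(500, rate):.1f} s for 500 tokens")
# DiffusionGemma (H100): 0.5 s for 500 tokens
# Gemma 4 autoregressive (H100): 2.0 s for 500 tokens
```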
Running It Locally
1. Download unsloth/diffusiongemma-26B-A4B-it-GGUF (Q4_K_M, ~16 GB).
2. Check out llama.cpp diffusion sampler PR #24423 and build with CUDA.
3. Run:
    llama-cli -m model.gguf -n 2048 -p "your prompt"
4. Expect ~100 tok/s and ~19 GB VRAM, fully offline, with no API key needed.
What New Applications Does This Unlock?
Google DeepMind's internal demos showed DiffusionGemma powering experiences impossible at autoregressive latencies: a fake Wikipedia where every page is generated on-the-fly as you click, a Reddit clone where AI comments and images appear instantly, an OS-like interface where each click renders the next screen, and a voice-built todo app completed in 15 seconds. These are latency-sensitive, interactive, local-first workflows that become viable when generation speed crosses ~2,000 tokens/sec.
Key Takeaways
- DiffusionGemma is an open-weights text diffusion LLM that generates 256 tokens in parallel rather than one at a time.
- Up to 4x faster than autoregressive equivalents: 1,000+ tok/s on H100, 700+ on RTX 5090.
- Apache 2.0 license — free to download, run locally, and fine-tune.
- Accuracy is modestly lower than same-size autoregressive models; best for speed-first use cases.
- Quantized version fits on a single consumer RTX 4090 (24 GB VRAM) for fully offline deployment.