Google Launches DiffusionGemma: 4× Faster Text Generation With Parallel Diffusion

Google's open experimental model abandons sequential token generation for parallel 256-token canvas denoising, hitting 1,000+ tokens/sec on an H100 GPU. Released under Apache 2.0 on Hugging Face.

TL;DR: Google DeepMind released DiffusionGemma on June 10, 2026 — an experimental open model that replaces token-by-token generation with parallel diffusion over a 256-token canvas. It achieves up to 4× faster text generation than conventional LLMs, with 1,000+ tokens/sec on an H100. Available under Apache 2.0.

What Is DiffusionGemma?

Google DeepMind introduced DiffusionGemma on June 10, 2026, as an experimental open model that rethinks how text is generated at a fundamental level. Built on the Gemma 4 26B Mixture-of-Experts architecture, it swaps the standard autoregressive (one-token-at-a-time) generation loop for discrete text diffusion — generating entire blocks of text simultaneously through iterative refinement.

4× faster than autoregressive LLMs

1,000+ tokens/sec on a single H100

700+ tokens/sec on RTX 5090

3.8B active parameters at inference (26B total)

Why Autoregressive LLMs Hit a Speed Wall

A standard autoregressive LLM must read its model weights from GPU memory for every single token it predicts. At small batch sizes this makes inference memory-bandwidth-bound: the GPU spends most of its time waiting for weights to arrive, so adding raw compute cannot fully overcome the bottleneck. You're limited by how fast you can shuttle data, not how fast you can crunch numbers.
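To make the bandwidth limit concrete, here is a rough back-of-the-envelope sketch. It assumes 16-bit weights and a memory bandwidth of about 3.35 TB/s (an H100 SXM class figure); both are assumptions for illustration and do not come from Google's announcement. It also ignores KV-cache reads, expert routing, and kernel overhead, so treat the result as a loose ceiling, not a prediction.

```python
# Rough ceiling for autoregressive decoding at batch size 1.
# Hardware and precision numbers are illustrative assumptions, not measurements.

def max_decode_tokens_per_sec(active_params: float,
                              bytes_per_param: float,
                              mem_bandwidth_bytes_per_sec: float) -> float:
    """Upper bound if every token needs one full read of the active weights."""
    bytes_per_token = active_params * bytes_per_param
    return mem_bandwidth_bytes_per_sec / bytes_per_token


ceiling = max_decode_tokens_per_sec(
    active_params=3.8e9,                   # active parameters quoted in the article
    bytes_per_param=2.0,                   # assumption: 16-bit weights
    mem_bandwidth_bytes_per_sec=3.35e12,   # assumption: ~3.35 TB/s of HBM bandwidth
)
print(f"Approximate ceiling: {ceiling:.0f} tokens/sec")
```

Under these assumptions the ceiling lands in the mid hundreds of tokens per second, regardless of how much arithmetic throughput the GPU has to spare.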

DiffusionGemma sidesteps much of this bottleneck. It initializes a canvas of 256 random tokens, then runs multiple denoising passes, each covering the entire canvas at once. Because a single forward pass processes 256 tokens, the number of memory transfers drops by roughly 10× compared to autoregressive generation. That shifts the bottleneck from memory bandwidth to raw compute, which scales much more favorably with modern hardware.
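The "roughly 10×" figure can be read as simple amortization: if each forward pass costs about one read of the weights, an autoregressive model needs 256 reads for 256 tokens, while a diffusion model needs one read per denoising pass. The article does not state how many passes a canvas takes, so the value of 25 below is a hypothetical chosen only because it reproduces the quoted ratio.

```python
def weight_read_reduction(canvas_tokens: int, denoising_passes: int) -> float:
    """Ratio of weight reads: one per token (autoregressive) vs one per pass (diffusion)."""
    if denoising_passes <= 0:
        raise ValueError("denoising_passes must be positive")
    return canvas_tokens / denoising_passes


CANVAS_TOKENS = 256       # canvas size from the article
ASSUMED_PASSES = 25       # hypothetical; not stated in the article

reduction = weight_read_reduction(CANVAS_TOKENS, ASSUMED_PASSES)
print(f"Weight reads reduced by about {reduction:.1f}x")
```

Fewer passes per canvas would push the ratio higher, and more passes would push it lower, which is one way to see why step count is the main speed and quality dial for this kind of model.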

How Uniform State Diffusion Works

Unlike image diffusion models that add continuous Gaussian noise, DiffusionGemma uses Uniform State Diffusion — a discrete approach where tokens are replaced with random vocabulary entries rather than blurred numerically.

The generation loop looks like this:

Initialize a 256-token canvas with random tokens
Run a full denoising pass over the entire canvas with bidirectional attention
Lock in the tokens the model is most confident about
Locked tokens become context for the next refinement pass
Repeat until the canvas converges to coherent text
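The steps above can be sketched as a toy loop. This is not DiffusionGemma's sampler: `toy_denoiser` is a hypothetical stand-in that returns random tokens and random confidences, and locking a fixed number of positions per pass is an assumed schedule (a real sampler might use a confidence threshold instead). The sketch only shows the control flow of "propose everywhere, lock the most confident, repeat".

```python
import random


def toy_denoiser(canvas, locked, vocab_size, rng):
    """Hypothetical stand-in for the model.

    Returns one (token_id, confidence) proposal per canvas position. A real
    model would compute these with bidirectional attention over the canvas.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in canvas]


def denoise_canvas(canvas_size=256, vocab_size=1000, lock_per_pass=16, seed=0):
    rng = random.Random(seed)
    # Step 1: start from a canvas of random tokens.
    canvas = [rng.randrange(vocab_size) for _ in range(canvas_size)]
    locked = [False] * canvas_size
    passes = 0

    while not all(locked):
        # Step 2: one denoising pass over the whole canvas.
        proposals = toy_denoiser(canvas, locked, vocab_size, rng)
        passes += 1

        # Step 3: pick the still-open positions with the highest confidence.
        open_idx = [i for i, is_locked in enumerate(locked) if not is_locked]
        open_idx.sort(key=lambda i: proposals[i][1], reverse=True)
        to_lock = set(open_idx[:lock_per_pass])

        # Step 4: open positions take the new proposal; locked ones never change
        # and act as fixed context for the next pass.
        for i in open_idx:
            canvas[i] = proposals[i][0]
            if i in to_lock:
                locked[i] = True

    # Step 5: the loop ends once every position is locked.
    return canvas, passes


canvas, passes = denoise_canvas()
print(f"Filled {len(canvas)} positions in {passes} passes")
```

With 256 positions and 16 locked per pass, the loop finishes in 16 passes, compared with 256 sequential steps for token-by-token generation.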

This bidirectional attention mechanism — where every token can attend to all other tokens — is what makes DiffusionGemma particularly powerful for non-linear tasks like code infilling, inline editing, and structured document generation.

Runs on Consumer Hardware
Quantized (Q4_K_M), DiffusionGemma requires only about 16–18 GB of VRAM, making it runnable on a single RTX 4090 (24 GB). With llama.cpp's diffusion sampler, users have reported around 100 tokens/sec fully offline. The quantized model weighs approximately 16 GB vs. ~50 GB for the full-precision version.
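The file sizes quoted here are consistent with simple arithmetic on the parameter count. The sketch below assumes an average of about 4.85 bits per weight for Q4_K_M (a commonly cited approximation that varies by model) and 16 bits per weight for the unquantized checkpoint; neither figure is taken from the announcement.

```python
def weights_size_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 26e9  # total parameter count from the article

q4_gb = weights_size_gb(TOTAL_PARAMS, 4.85)   # assumption: ~4.85 bits/weight for Q4_K_M
bf16_gb = weights_size_gb(TOTAL_PARAMS, 16)   # assumption: 16-bit unquantized weights

print(f"Q4_K_M weights: ~{q4_gb:.1f} GB")
print(f"16-bit weights: ~{bf16_gb:.1f} GB")
```

The gap between the weight size and the 16–18 GB VRAM figure would plausibly be taken up by activations and runtime buffers, which this estimate does not model.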

Ecosystem and Tooling

Google has launched DiffusionGemma with broad toolchain support from day one:

Tool	Role
Hugging Face Transformers	Standard inference
vLLM (Red Hat integration)	High-throughput production serving
MLX	Apple Silicon optimized inference
llama.cpp	Local quantized execution (PR in progress)
NVIDIA NIM	Enterprise microservice deployment
Unsloth / NVIDIA NeMo	Fine-tuning
Hackable Diffusion (JAX)	Official modular fine-tuning toolbox

NVIDIA collaborated on hardware-level optimizations that range from consumer GPUs (RTX 4090/5090 with quantization) to enterprise servers (Hopper and Blackwell using NVFP4 kernels), with the goal of higher compute throughput at near-lossless accuracy.

Quality Trade-off Is Real
DiffusionGemma is explicitly an experimental model optimized for speed. It scores approximately 5–9 points lower than its autoregressive Gemma 4 counterpart on standard benchmarks (general knowledge, graduate-level science, coding). For production workloads where output quality is the priority, Google recommends sticking with autoregressive Gemma 4 models.

Where DiffusionGemma Shines

The model's ultra-low latency opens up workflows that were impractical with slower autoregressive generation:

Real-time inline editing: suggest completions as the user types, with negligible delay
Code infilling: complete missing code blocks using bidirectional context from both before and after the gap
Rapid iteration: generate dozens of draft variations in the time a standard model produces one
Structured output: JSON, tables, and non-linear document formats where global consistency matters
Interactive local AI: chatbots and assistants that feel genuinely instant on personal hardware
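For the code-infilling case, one way to picture the setup is a canvas where the known text before and after the gap is locked from the start and only the gap is filled with random tokens to be denoised. This is a conceptual sketch with a hypothetical helper name, not DiffusionGemma's actual interface, which the article does not describe.

```python
import random


def build_infill_canvas(prefix, suffix, gap_len, vocab_size, seed=0):
    """Hypothetical helper: lock known context, randomize only the gap.

    prefix and suffix are lists of token ids that surround the missing span.
    Returns the canvas and a parallel list marking which positions are fixed.
    """
    rng = random.Random(seed)
    gap = [rng.randrange(vocab_size) for _ in range(gap_len)]
    canvas = list(prefix) + gap + list(suffix)
    locked = [True] * len(prefix) + [False] * gap_len + [True] * len(suffix)
    return canvas, locked


infill_canvas, infill_locked = build_infill_canvas(
    prefix=[11, 12, 13],   # tokens before the gap (illustrative ids)
    suffix=[91, 92],       # tokens after the gap (illustrative ids)
    gap_len=4,
    vocab_size=1000,
)
print(infill_canvas)
print(infill_locked)
```

Because attention is bidirectional, the unlocked gap positions can condition on both the prefix and the suffix during every pass, which a left-to-right model cannot do without special prompting.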

Key Takeaways

DiffusionGemma: 26B MoE (3.8B active at inference), Apache 2.0, open weights on Hugging Face
Up to 4× faster than autoregressive LLMs — 1,000+ tokens/sec on H100, 700+ on RTX 5090
Bidirectional attention enables strong performance on code infilling and non-linear generation
Runs on a single RTX 4090 (24 GB VRAM) when quantized (~16–18 GB)
Scores ~5–9 points lower on quality benchmarks vs. autoregressive Gemma 4 — experimental, not production-ready
Supported by vLLM, MLX, Hugging Face Transformers, and NVIDIA NIM from launch day

Access and Deployment

DiffusionGemma weights are available immediately on Hugging Face (google/diffusiongemma-26B-A4B-it) under Apache 2.0. The model is also accessible via Google's Vertex AI Model Garden and Kaggle. Detailed architecture documentation and inference guides are available through Google AI for Developers.

Resources · Official Sources · Getting Started
— Google DeepMind Gemma (GitHub) — Official repository with architecture docs and code
— llama.cpp (GitHub) — Local quantized inference engine with DiffusionGemma support
— vLLM (GitHub) — High-throughput serving engine with DiffusionGemma integration