Google has released Gemma 4 12B, the first medium-sized model in the Gemma family to support audio input and the first to use an encoder-free unified architecture for all modalities. It runs on developer laptops with 16GB VRAM (or Apple Silicon unified memory), ships with a native macOS desktop app for offline voice and visual interaction, and includes a `litert-lm serve` CLI for drop-in OpenAI-compatible local API servers. A dedicated multi-token prediction (MTP) model is also available for faster local inference.

Local AI just got a serious upgrade. Gemma 4 12B isn't a scaled-down Gemini — it's a purpose-built local model with an architectural choice that changes what's possible at the edge: by eliminating the heavy multi-stage vision and audio encoders that previous Gemma models required, the 12B parameter model stays small enough to run on a developer's own machine while handling the full multimodal range that enterprise applications actually need.

Four Milestones for Local AI

Milestone Detail Why It Matters
Encoder-free architecture No separate vision or audio encoders Reduces model size overhead for multimodal; first in Gemma family
Audio input support First medium-sized Gemma model with native audio Previously limited to small Edge models (e.g., E4B)
Local hardware target 16GB VRAM laptops or unified memory Standard MacBook Pro / developer GPU laptops
macOS desktop app First official Gemma desktop experience Full offline spoken and visual interaction on Apple Silicon
**Getting started now**: Gemma 4 12B is immediately available via LM Studio and Ollama with a few clicks. Download pretrained and instruction-tuned checkpoints from Hugging Face or Kaggle. Run `litert-lm serve` to spin up a local OpenAI-compatible API server that connects to Continue, Aider, Hermes, OpenCode, and OpenClaw without code changes. On Apple Silicon, the Google AI Edge Gallery app runs Gemma 4 12B completely offline.

LiteRT-LM: The Local Inference Stack

Google released two developer integrations powered by LiteRT-LM alongside Gemma 4 12B:

Native macOS apps: The Google AI Edge Gallery mobile app officially expands to desktop, running Gemma 4 12B offline on Apple Silicon GPUs. It includes a sandboxed Python execution loop to write, run, and plot scientific charts inside the chat interface. The Google AI Edge Eloquent app on Mac adds Gemma 12B support for Voice Edit conversational inputs.

Drop-in local API server: litert-lm serve runs Gemma 4 12B as a local, OpenAI-compatible API endpoint in a single command. Stateless prefix caching in memory matches conversation history and instantly bypasses prefill latency — making repeated queries within a session significantly faster.

**Multi-token prediction optimization**: A dedicated MTP model is released alongside the standard Gemma 4 12B checkpoint. MTP predicts multiple future tokens in a single forward pass, improving throughput on the long outputs and multi-turn exchanges that agentic workflows produce. On 16GB VRAM hardware, this optimization is the difference between a usable response time and waiting.

Gemma Skills Repository for Agentic Workflows

With the Gemma 4 12B launch, Google officially opened the Gemma Skills Repository — a library of skills designed to help agents build with the latest Gemma capabilities.

Gemma 4 12B deployment ecosystem

  • Local runtime: LM Studio, Ollama, Google AI Edge Gallery App, Google AI Edge Eloquent, LiteRT-LM CLI
  • Cloud deployment: Gemini Enterprise Agent Platform Model Garden, Cloud Run, GKE
  • Fine-tuning support: Hugging Face Transformers, llama.cpp, MLX, SGLang, vLLM, Unsloth
  • Agent integrations: litert-lm serve for Continue, Aider, Hermes, OpenCode, OpenClaw via standard API
  • License: Open source, commercial use permitted
Feature Gemma 4 12B Previous Gemma Models
Architecture Unified encoder-free Separate vision encoder required
Audio support Medium-sized model first Small Edge models only (e.g., E4B)
Local requirements 16GB VRAM Higher specs for comparable capability
Desktop app Official macOS app Not available
MTP optimization Dedicated MTP model included Not available

The release lands alongside Google's broader local AI push. For developers building applications where data privacy, offline operation, or low-latency inference matter — medical, legal, on-premise enterprise — Gemma 4 12B removes the cloud dependency at a parameter count that's actually deployable. The encoder-free design means developers get text, vision, and audio in a single model download rather than a pipeline of separately maintained components.