What Is Mellum2?
Mellum2 is an open-weight 12B-parameter Mixture-of-Experts language model released by JetBrains on June 1, 2026. The company behind PyCharm and IntelliJ originally launched Mellum in late 2024 as a compact, proprietary code-completion model for its own IDEs. Mellum2 is a full generational upgrade: built from scratch (not a fine-tune), covering both natural language and code, and open from day one under Apache 2.0.
The core premise of Mellum2 is efficient inference for production software engineering workflows. With 64 experts and only 8 activated per token, the model delivers intelligence at the level of a 12B dense model while computing at the cost of a 2.5B dense model — resulting in more than 2× faster inference compared to similarly sized open models.
Architecture Highlights
Mellum2 was pre-trained on approximately 10.6 trillion tokens through a three-phase curriculum that progressively shifts the data mix from diverse web content toward code and mathematics (code ratio: 23% → 42% → 59%).
Key architectural decisions, each validated by ablation with inference efficiency as a constraint:
- Grouped-Query Attention (GQA) with 4 KV heads — reduces memory footprint during inference
- Sliding Window Attention on 3 of every 4 layers — efficient long-context processing without full attention overhead
- Multi-Token Prediction (MTP) head — serves as both an auxiliary pretraining objective and a built-in draft model for speculative decoding
- Layer-selective YaRN — extends context to 128K tokens
- Muon optimizer + FP8 hybrid precision — training efficiency at scale
Two post-training variants ship alongside the base: an Instruct model for direct task completion and a Thinking model that emits an explicit reasoning trace before its final answer (trained with RL on verifiable rewards).
What Is It For?
Mellum2 is not a replacement for frontier coding models. It targets the infrastructure layer of agentic AI systems — the high-volume, latency-sensitive calls that don't need a GPT-5.4-class model but still need to be fast, accurate, and self-hosted.
Modern AI development pipelines involve far more than a single model call. Routing decisions, retrieval augmentation, summarization, planning steps, validation checks, and tool invocations all happen at scale. Many are latency-sensitive. Mellum2 is built for this tier.
| Use Case | What Mellum2 Does |
|---|---|
| Routing | Classify requests to the right agent or tool |
| RAG | Retrieve, read, and summarize documents |
| Sub-agents | Execution layer in multi-agent pipelines |
| Code generation | Autocompletion, editing, refactoring |
| Summarization | Long thread or document condensation |
| Function calling | Tool use in agentic workflows |
vllm serve JetBrains/Mellum-2-12B-instruct --enable-auto-tool-choice --tool-call-parser hermes --port 8000Then connect an MCP CLI or Hermes Agent to the localhost endpoint for file system tool use or full agentic workflows.
Why It Matters: Closing the Gap Claude Code Can't
JetBrains' blog describes Mellum2's positioning clearly: it goes "where Claude Code can't." Managed coding tools like Claude Code, GitHub Copilot, and Cursor all require third-party API calls. For teams with strict data residency requirements — healthcare systems, financial institutions, defense contractors — that's a blocker. Mellum2 runs entirely on infrastructure you control.
Beyond privacy, Mellum2's Apache 2.0 license means organizations can fine-tune the model on their own codebases and redeploy without royalties or usage restrictions. The full technical report (arXiv 2605.31268) publishes the complete architecture rationale, data pipeline details, and training recipe for reproducibility.
For developers who don't need the biggest model but do need something fast, private, and customizable at the routing/RAG/sub-agent layer, Mellum2 fills a gap that until now required expensive proprietary APIs or underperforming smaller open models.
- Mellum2: 12B MoE, only 2.5B active params per token — 2× faster inference than dense peers
- 64 experts, 8 active per token; GQA, Sliding Window Attention, MTP for speculative decoding
- 128K context window via layer-selective YaRN; 10.6T token three-phase pre-training
- Three checkpoints: Base, Instruct, Thinking — all Apache 2.0 on Hugging Face
- Targets routing, RAG, sub-agents, and private on-premises deployment where managed APIs can't go
— Mellum2 Open-Source Launch Post (JetBrains AI Blog)
— Mellum2 Models on Hugging Face
— Mellum2 Technical Report (arXiv)