What Is Holo3.1?
H Company's Holo3.1 is an updated family of Vision-Language Models (VLMs) designed specifically for computer-use agents: systems that perceive a screen and autonomously operate a keyboard and mouse to complete real-world tasks. The original Holo3 (released March 2026) set a new state of the art on OSWorld-Verified at 78.85%, outperforming GPT-5.4 and Opus 4.6 at a fraction of the cost. Holo3.1 builds on that foundation with three major production improvements.
Three Key Improvements
1. Local Quantized Inference
For the first time, H Company ships quantized checkpoints alongside the full-precision model. The 35B-A3B model is available in FP8, NVFP4, and Q4 GGUF formats. On an NVIDIA DGX Spark, NVFP4 (W4A16) cuts average agent step time from 6.8s (FP8) to 3.3s, roughly a 2× end-to-end speedup. On an RTX 4090, perception-to-action latency reaches 140ms, four times faster than typical cloud API round trips. With 4-bit quantization, the full agent stack fits on a 12GB VRAM GPU.
The privacy implication is significant: continuous desktop screenshots never leave the machine. This opens computer-use agents to finance, healthcare, and legal workflows where cloud API transmission is prohibited.
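The speedup and memory figures above can be sanity-checked with simple arithmetic. The following is a rough sketch, not H Company's methodology: `speedup` divides the reported step times, and `weight_gb` estimates weights-only memory as parameters × bits / 8. Both function names are hypothetical. The estimate ignores the KV cache, activations, the vision encoder's working memory, and any CPU offloading, so it says nothing about which configuration the 12GB figure refers to.

```python
def speedup(before_s: float, after_s: float) -> float:
    """Ratio of two step times; values above 1 mean the second is faster."""
    return before_s / after_s


def weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """Weights-only memory in GB (1 GB = 1e9 bytes): params * bits / 8."""
    return params_billion * bits_per_weight / 8


# Total parameter counts from the model family table. For the MoE model,
# all 35B parameters must be stored even though about 3B are active per token.
MODEL_SIZES_B = {
    "Holo3.1-0.8B": 0.8,
    "Holo3.1-4B": 4,
    "Holo3.1-9B": 9,
    "Holo3.1-35B-A3B": 35,
}

# Reported DGX Spark step times: 6.8 s (FP8) versus 3.3 s (NVFP4).
step_speedup = speedup(6.8, 3.3)
print(f"FP8 -> NVFP4 step-time speedup: {step_speedup:.2f}x")

# Assumption: 8 bits per weight for FP8, 4 bits per weight for NVFP4 / Q4.
for name, size_b in MODEL_SIZES_B.items():
    print(f"{name}: {weight_gb(size_b, 8):.1f} GB at 8-bit, "
          f"{weight_gb(size_b, 4):.1f} GB at 4-bit (weights only)")
```

Real quantized files are usually somewhat larger than this estimate because they store scaling factors and often keep some layers at higher precision.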
2. Mobile Automation
Holo3.1 adds first-class mobile support. Trained on Android UI traces, the 35B-A3B model jumps from 67% to 79.3% on the AndroidWorld benchmark. The 4B and 9B variants go from 58% to 72%. In H Company's own Holotab product harness, Holo3.1 shows more than 25% improvement over Holo3.
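Benchmark gains like these can be read either as percentage-point differences or as relative improvements, and the two give noticeably different numbers. A minimal sketch of both readings for the AndroidWorld scores quoted above (the `gains` helper is hypothetical, and the Holotab figure is left out because the underlying scores are not given):

```python
def gains(before: float, after: float) -> tuple[float, float]:
    """Return (percentage-point gain, relative gain in percent), rounded to 1 decimal."""
    points = after - before
    relative = points / before * 100
    return round(points, 1), round(relative, 1)


# AndroidWorld scores quoted in the text: (Holo3, Holo3.1), in percent.
ANDROIDWORLD = {
    "35B-A3B": (67.0, 79.3),
    "4B / 9B": (58.0, 72.0),
}

for model, (old, new) in ANDROIDWORLD.items():
    points, relative = gains(old, new)
    print(f"{model}: +{points} points, +{relative}% relative")
```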
3. Native Function Calling
Holo3.1 supports the OpenAI-compatible function-calling protocol alongside its existing structured JSON output. The two execution modes now perform at near-parity, eliminating the 10–15% accuracy gap that hampered Holo3 integration with third-party frameworks. This means Holo3.1 plugs directly into LangGraph, CrewAI, AutoGen, and custom harnesses without adapter layers.
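To illustrate what the OpenAI-compatible function-calling protocol looks like on the harness side, here is a minimal sketch. The tool names `click` and `type_text`, their parameters, and the simulated tool call are assumptions made for illustration; they are not Holo3.1's documented action space. Only the general message shape (a tool definition with a JSON Schema, and a tool call whose `arguments` field is a JSON string) follows the OpenAI format.

```python
import json

# Hypothetical tool definitions in the OpenAI function-calling format.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "click",
            "description": "Click at a pixel coordinate on the current screenshot.",
            "parameters": {
                "type": "object",
                "properties": {
                    "x": {"type": "integer"},
                    "y": {"type": "integer"},
                },
                "required": ["x", "y"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "type_text",
            "description": "Type a string into the focused element.",
            "parameters": {
                "type": "object",
                "properties": {"text": {"type": "string"}},
                "required": ["text"],
            },
        },
    },
]


# Stand-in handlers; a real harness would drive the mouse and keyboard here.
def click(x: int, y: int) -> str:
    return f"clicked ({x}, {y})"


def type_text(text: str) -> str:
    return f"typed {text!r}"


HANDLERS = {"click": click, "type_text": type_text}


def dispatch(tool_call: dict) -> str:
    """Execute one tool call shaped like an entry of an assistant message's tool_calls."""
    function = tool_call["function"]
    name = function["name"]
    if name not in HANDLERS:
        raise ValueError(f"unknown tool: {name}")
    # In the OpenAI format, arguments arrive as a JSON-encoded string.
    arguments = json.loads(function["arguments"])
    return HANDLERS[name](**arguments)


# Simulated model output, standing in for a response from an OpenAI-compatible server.
example_call = {
    "id": "call_0",
    "type": "function",
    "function": {"name": "click", "arguments": '{"x": 412, "y": 96}'},
}
result = dispatch(example_call)
print(result)
```

In a real integration, `TOOLS` would be passed as the `tools` parameter of a chat completion request, and `dispatch` would run on each tool call the model returns.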
Model Family
| Model | Parameters | Best For |
|---|---|---|
| Holo3.1-0.8B | 0.8B | Ultra-lightweight edge agents |
| Holo3.1-4B | 4B | Cost-efficient private deployment |
| Holo3.1-9B | 9B | Balanced performance and latency |
| Holo3.1-35B-A3B | 35B total / 3B active | State-of-the-art production |
All models are built on the Qwen 3.5 base and released under Apache 2.0.
Why Local Inference Matters Now
The timing of Holo3.1 is no coincidence. A widely shared Hacker News post (716 upvotes) demonstrated running Gemma 4 on a GPU-free 2016 Intel Xeon using aggressive speculative decoding. The message from the developer community is clear: the local inference era for agentic AI has arrived.
For teams evaluating computer-use vendors, Holo3.1's quantized variants lower the hidden integration cost that often outweighs raw benchmark scores in production decisions. Privacy-sensitive industries that cannot stream screenshots to a cloud API finally have a production-ready option.
H Company is actively developing a desktop agent harness based on Holo3.1, with more deployment environments planned. The goal is a universal computer-use agent that operates across environments, integrates into any agent stack, and runs wherever the workflow lives.
Key Takeaways
- H Company's first release to ship FP8, NVFP4, and Q4 GGUF quantized checkpoints for local inference
- AndroidWorld: 35B-A3B jumps from 67% to 79.3%; 4B/9B models from 58% to 72%
- NVFP4 cuts average step time on a DGX Spark from 6.8s (FP8) to 3.3s (~2×); perception-to-action latency reaches 140ms on an RTX 4090
- Native OpenAI-compatible function calling for seamless LangGraph, CrewAI, AutoGen integration
- Four model sizes (0.8B to 35B-A3B), Apache 2.0 license; with 4-bit quantization, the full agent stack fits on a 12GB VRAM GPU
Sources
- Holo3.1 Official Blog (H Company)
- Holo3.1 Launch Announcement (Hugging Face Blog)
- Holo3.1 Quickstart Guide (H Company Hub)