Local LLM Benchmark

Model Capability Star Map

How do small local models perform on real-world tasks?

Quality assessed by Claude Sonnet 4.5 (LLM-as-judge) · 126 tasks per model · GDPval dataset by OpenAI

Qwen2.5-3B: 67.4% vs Qwen3.5-2B: 75.0% vs Qwen3.5-0.8B: 48.4%
Avg Score: 67.4% across 9 categories
Strongest: Structured Output (92.9%)
Weakest: Reasoning (51.5%)
Tasks: 31 (20 passed)
Category Breakdown
The Headline
The newer Qwen3.5-2B scores 75% on practical tasks, beating the larger Qwen2.5-3B at 67%.
On professional GDPval tasks, all three converge: 52%, 52%, and 45%.
As judged by Claude Sonnet 4.5 · 126 tasks per model · Running locally on a $300 GPU
Practical: 31 tasks across 9 categories · GDPval: 95 professional tasks across 9 sectors, 44 occupations
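The task counts quoted on this page can be checked with a few lines of arithmetic. The sketch below is a minimal Python illustration that uses only figures stated here. The variable names are made up for the example, and the pass rate is a plain ratio of the "20 passed" figure to the 31 practical tasks. It is not necessarily how the judge-assigned Avg Score is computed, and the page does not say how that score is derived.

```python
# Figures copied from the page; variable names are illustrative only.
practical_tasks = 31   # practical tasks across 9 categories
gdpval_tasks = 95      # professional GDPval tasks
total_tasks = practical_tasks + gdpval_tasks

passed = 20            # "20 passed" out of the 31 practical tasks
pass_rate = passed / practical_tasks

print(f"total tasks per model: {total_tasks}")    # 126
print(f"practical pass rate: {pass_rate:.1%}")    # 64.5%
```

The total matches the "126 tasks per model" stated on the page. The pass rate is a separate quantity from the reported Avg Score, so the two figures need not agree.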