AI BENCHY Category Failures
Puzzle Solving: Wrong answer
See which AI models are most likely to return a wrong answer on Puzzle Solving, so you can spot weak points faster. Rows are sorted by average response time, ascending.
Failure Reasons
| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #135 | Kimi K2.5 none | Moonshot AI | 3 | 3.0 | 0/3 | 4.04s |
| #24 | GPT-5.2 Chat none | OpenAI | 1 | 7.7 | 2/3 | 4.10s |
| #121 | Owl Alpha none | OpenRouter | 1 | 5.4 | 1/3 | 4.18s |
| #85 | Gemma 4 31B none | | 1 | 6.5 | 1/3 | 4.23s |
| #156 | Hy3 preview none | Tencent | 2 | 3.1 | 0/3 | 4.56s |
| #69 | Claude Opus 4.6 medium | Anthropic | 1 | 7.7 | 2/3 | 4.71s |
| #118 | Qwen3.6 27B none | Qwen | 1 | 5.3 | 1/3 | 5.15s |
| #84 | Grok 4.20 Multi Agent Beta medium | X AI | 1 | 6.7 | 1/3 | 5.19s |
| #40 | Gemini 3.1 Flash Lite Preview medium | | 1 | 7.7 | 2/3 | 5.30s |
| #43 | MiMo-V2.5-Pro medium | Xiaomi | 1 | 6.7 | 1/3 | 5.31s |
| #79 | Hunter Alpha medium | OpenRouter | 1 | 6.1 | 1/3 | 5.35s |
| #159 | Ling-2.6-1T none | Inclusionai | 2 | 3.1 | 0/3 | 5.36s |
| #46 | Qwen3.6 35B A3B medium | Qwen | 1 | 8.0 | 2/3 | 5.95s |
| #22 | Step 3.7 Flash medium | Stepfun | 2 | 5.7 | 1/3 | 6.19s |
| #65 | Grok 4.20 medium | X AI | 1 | 7.7 | 2/3 | 6.22s |