AI BENCHY Category Failures
Puzzle Solving: Wrong answer
See which AI models are most likely to give a wrong answer on Puzzle Solving tests, so you can spot weak points faster. Rows are sorted by average response time, slowest first.
Failure Reasons
| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #62 | Step 3.5 Flash medium | Stepfun | 1 | 5.3 | 1/3 | 7.22s |
| #138 | Ling-2.6-flash none | Inclusionai | 2 | 2.9 | 0/3 | 6.51s |
| #65 | Grok 4.20 medium | X AI | 1 | 7.7 | 2/3 | 6.22s |
| #22 | Step 3.7 Flash medium | Stepfun | 2 | 5.7 | 1/3 | 6.19s |
| #46 | Qwen3.6 35B A3B medium | Qwen | 1 | 8.0 | 2/3 | 5.95s |
| #159 | Ling-2.6-1T none | Inclusionai | 2 | 3.1 | 0/3 | 5.36s |
| #79 | Hunter Alpha medium | OpenRouter | 1 | 6.1 | 1/3 | 5.35s |
| #43 | MiMo-V2.5-Pro medium | Xiaomi | 1 | 6.7 | 1/3 | 5.31s |
| #40 | Gemini 3.1 Flash Lite Preview medium | | 1 | 7.7 | 2/3 | 5.30s |
| #84 | Grok 4.20 Multi Agent Beta medium | X AI | 1 | 6.7 | 1/3 | 5.19s |
| #118 | Qwen3.6 27B none | Qwen | 1 | 5.3 | 1/3 | 5.15s |
| #69 | Claude Opus 4.6 medium | Anthropic | 1 | 7.7 | 2/3 | 4.71s |
| #156 | Hy3 preview none | Tencent | 2 | 3.1 | 0/3 | 4.56s |
| #85 | Gemma 4 31B none | | 1 | 6.5 | 1/3 | 4.23s |
| #121 | Owl Alpha none | OpenRouter | 1 | 5.4 | 1/3 | 4.18s |
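
The table is ordered by response time rather than by failure count. As a minimal sketch, assuming a few rows are copied by hand from the table into Python tuples, the same data can be re-ranked by wrong-answer count. The `rows` list and the `rank_by_wrong_answers` helper are hypothetical names used only for illustration; they are not part of AI BENCHY.

```python
# Sample rows copied by hand from the table:
# (model, wrong_answer_count, tests_correct, avg_response_seconds)
rows = [
    ("Step 3.5 Flash medium", 1, "1/3", 7.22),
    ("Ling-2.6-flash none", 2, "0/3", 6.51),
    ("Grok 4.20 medium", 1, "2/3", 6.22),
    ("Claude Opus 4.6 medium", 1, "2/3", 4.71),
    ("Hy3 preview none", 2, "0/3", 4.56),
]


def rank_by_wrong_answers(rows):
    """Sort by wrong-answer count (highest first), breaking ties
    by average response time (slowest first)."""
    return sorted(rows, key=lambda r: (-r[1], -r[3]))


ranked = rank_by_wrong_answers(rows)
for model, wrong, correct, secs in ranked:
    print(f"{model}: {wrong} wrong, {correct} correct, {secs:.2f}s")
```

Ties on the wrong-answer count keep the page's own ordering (slowest first), so the re-ranked list stays comparable with the table.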