AI BENCHY Category Failures
Puzzle Solving: Wrong answer
See which AI models most often fail Puzzle Solving tests with a Wrong answer, so you can spot weak points faster. The table is sorted by average response time, fastest first.
Failure Reasons
| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #138 | Ling-2.6-flash none | Inclusionai | 2 | 2.9 | 0/3 | 6.51s |
| #62 | Step 3.5 Flash medium | Stepfun | 1 | 5.3 | 1/3 | 7.22s |
| #86 | Grok 4.1 Fast medium | X AI | 1 | 5.3 | 1/3 | 7.40s |
| #89 | Hy3 preview low | Tencent | 1 | 5.3 | 1/3 | 7.51s |
| #126 | gpt-oss-120b none | OpenAI | 1 | 6.0 | 1/3 | 8.21s |
| #100 | Grok Build 0.1 none | X AI | 1 | 6.4 | 1/3 | 9.55s |
| #92 | Laguna M.1 medium | Poolside | 1 | 5.3 | 1/3 | 10.2s |
| #71 | Step 3.7 Flash high | Stepfun | 2 | 5.3 | 1/3 | 10.2s |
| #59 | GLM 5V Turbo medium | Z.ai | 1 | 7.7 | 2/3 | 10.2s |
| #108 | Qwen3.5-Flash none | Qwen | 3 | 3.1 | 0/3 | 10.9s |
| #129 | MiniMax M2.5 medium | Minimax | 1 | 5.3 | 1/3 | 11.2s |
| #119 | Cobuddy medium | Baidu | 2 | 3.6 | 0/3 | 12.8s |
| #158 | GLM 4.7 Flash medium | Z.ai | 2 | 2.9 | 0/3 | 12.9s |
| #54 | GPT-5 Mini medium | OpenAI | 1 | 5.6 | 1/3 | 15.2s |
| #36 | Qwen3.5 Plus 2026-04-20 medium | Qwen | 1 | 8.2 | 2/3 | 17.7s |
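The table is ordered by response time, but for this failure reason the Wrong answer Count is often the more useful sort key. A minimal sketch of re-sorting the rows, assuming they have been copied out as pipe-delimited text; the three sample rows come from the table above, while `parse_row` and the dictionary keys are hypothetical names chosen for illustration:

```python
# Three rows copied verbatim from the table above.
ROWS = """\
| #138 | Ling-2.6-flash none | Inclusionai | 2 | 2.9 | 0/3 | 6.51s |
| #62 | Step 3.5 Flash medium | Stepfun | 1 | 5.3 | 1/3 | 7.22s |
| #108 | Qwen3.5-Flash none | Qwen | 3 | 3.1 | 0/3 | 10.9s |
"""


def parse_row(line):
    """Turn one pipe-delimited table row into a dict with typed values."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    rank, model, company, wrong, score, correct, avg_time = cells
    return {
        "rank": int(rank.lstrip("#")),
        "model": model,
        "company": company,
        "wrong_answers": int(wrong),
        "category_score": float(score),
        "tests_correct": correct,                  # kept as text, e.g. "1/3"
        "avg_seconds": float(avg_time.rstrip("s")),
    }


rows = [parse_row(line) for line in ROWS.splitlines() if line.strip()]

# Most wrong answers first; ties broken by faster average response time.
by_wrong = sorted(rows, key=lambda r: (-r["wrong_answers"], r["avg_seconds"]))

for r in by_wrong:
    print(f'{r["wrong_answers"]} wrong  {r["avg_seconds"]:>5.2f}s  {r["model"]}')
```

Sorting on a tuple keeps the ordering stable and explicit: the negated count puts the weakest models at the top, and the response time only decides between models with the same count.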