AI BENCHY

Puzzle Solving: Wrong answer

See which AI models most often return a wrong answer on Puzzle Solving tests, so you can spot weak points faster. The table is sorted by average response time, slowest first.

Models Shown: 15
Total Failures: 147
Most Affected Model: Qwen3.6 27B (1)

| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #94 | GPT-5 Nano medium | OpenAI | 1 | 5.3 | 1/3 | 20.6s |
| #47 | Grok Build 0.1 medium | X AI | 1 | 7.7 | 2/3 | 18.3s |
| #36 | Qwen3.5 Plus 2026-04-20 medium | Qwen | 1 | 8.2 | 2/3 | 17.7s |
| #54 | GPT-5 Mini medium | OpenAI | 1 | 5.6 | 1/3 | 15.2s |
| #158 | GLM 4.7 Flash medium | Z.ai | 2 | 2.9 | 0/3 | 12.9s |
| #119 | Cobuddy medium | Baidu | 2 | 3.6 | 0/3 | 12.8s |
| #129 | MiniMax M2.5 medium | Minimax | 1 | 5.3 | 1/3 | 11.2s |
| #108 | Qwen3.5-Flash none | Qwen | 3 | 3.1 | 0/3 | 10.9s |
| #59 | GLM 5V Turbo medium | Z.ai | 1 | 7.7 | 2/3 | 10.2s |
| #71 | Step 3.7 Flash high | Stepfun | 2 | 5.3 | 1/3 | 10.2s |
| #92 | Laguna M.1 medium | Poolside | 1 | 5.3 | 1/3 | 10.2s |
| #100 | Grok Build 0.1 none | X AI | 1 | 6.4 | 1/3 | 9.55s |
| #126 | gpt-oss-120b none | OpenAI | 1 | 6.0 | 1/3 | 8.21s |
| #89 | Hy3 preview low | Tencent | 1 | 5.3 | 1/3 | 7.51s |
| #86 | Grok 4.1 Fast medium | X AI | 1 | 5.3 | 1/3 | 7.40s |
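
The table above is ordered by response time, which can hide the models with the most wrong answers. As a rough sketch, the rows could be re-ranked by Wrong answer Count as shown below. The tuple layout and the function name `rank_by_wrong_answers` are assumptions made for illustration and are not part of AI BENCHY; the figures are copied from the table.

```python
# Rows copied from the table above:
# (model, wrong_answer_count, tests_correct, avg_response_seconds)
rows = [
    ("GPT-5 Nano medium", 1, "1/3", 20.6),
    ("Grok Build 0.1 medium", 1, "2/3", 18.3),
    ("Qwen3.5 Plus 2026-04-20 medium", 1, "2/3", 17.7),
    ("GPT-5 Mini medium", 1, "1/3", 15.2),
    ("GLM 4.7 Flash medium", 2, "0/3", 12.9),
    ("Cobuddy medium", 2, "0/3", 12.8),
    ("MiniMax M2.5 medium", 1, "1/3", 11.2),
    ("Qwen3.5-Flash none", 3, "0/3", 10.9),
    ("GLM 5V Turbo medium", 1, "2/3", 10.2),
    ("Step 3.7 Flash high", 2, "1/3", 10.2),
    ("Laguna M.1 medium", 1, "1/3", 10.2),
    ("Grok Build 0.1 none", 1, "1/3", 9.55),
    ("gpt-oss-120b none", 1, "1/3", 8.21),
    ("Hy3 preview low", 1, "1/3", 7.51),
    ("Grok 4.1 Fast medium", 1, "1/3", 7.40),
]


def rank_by_wrong_answers(rows):
    """Hypothetical helper: most wrong answers first, slower models first on ties."""
    return sorted(rows, key=lambda r: (-r[1], -r[3]))


ranked = rank_by_wrong_answers(rows)

# Show the four entries with the most wrong answers.
for model, wrong, correct, secs in ranked[:4]:
    print(f"{model}: {wrong} wrong, {correct} correct, {secs}s avg")
```

Under this ordering, Qwen3.5-Flash none comes first with 3 wrong answers, followed by the three models with 2 each.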

Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.