AI BENCHY

AI BENCHY Category Failures

Puzzle Solving: Wrong answer

See which AI models are most likely to give a wrong answer on Puzzle Solving tests, so you can spot weak points faster. Models are sorted by Tests Correct, ascending.

Models Shown: 15
Total Failures: 147
Most Affected Model: GPT-5.4 Nano 2
Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg)
#47 | Grok Build 0.1 medium | X AI | 1 | 7.7 | 2/3 | 18.3s
#48 | Gemini 3 Flash Preview none | Google | 1 | 7.7 | 2/3 | 1.05s
#55 | GLM 5.1 medium | Z.ai | 1 | 8.2 | 2/3 | 31.6s
#59 | GLM 5V Turbo medium | Z.ai | 1 | 7.7 | 2/3 | 10.2s
#64 | MiMo-V2-Flash medium | Xiaomi | 1 | 7.7 | 2/3 | 3.87s
#65 | Grok 4.20 medium | X AI | 1 | 7.7 | 2/3 | 6.22s
#67 | MiniMax M3 medium | Minimax | 1 | 7.9 | 2/3 | 49.9s
#69 | Claude Opus 4.6 medium | Anthropic | 1 | 7.7 | 2/3 | 4.71s
#73 | Seed-2.0-Mini medium | Bytedance Seed | 1 | 8.2 | 2/3 | 31.8s
#78 | Qwen3.6 27B medium | Qwen | 1 | 7.7 | 2/3 | 61.1s
#88 | Qwen3.7 Plus none | Qwen | 1 | 7.7 | 2/3 | 1.71s
#91 | GPT-5.5 none | OpenAI | 1 | 7.7 | 2/3 | 1.29s
#95 | Qwen3.5 Plus 2026-02-15 none | Qwen | 1 | 7.7 | 2/3 | 2.71s
#97 | Gemini 2.5 Flash none | Google | 1 | 7.7 | 2/3 | 604ms
#98 | GLM 5 none | Z.ai | 1 | 7.7 | 2/3 | 1.91s

Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost