AI BENCHY

AI BENCHY Category Failures

Puzzle Solving: Wrong answer

See which AI models are most likely to return a wrong answer on Puzzle Solving tests, so you can spot weak points faster. Models are sorted by Tests Correct, descending.

Models Shown: 15
Total Failures: 147
Most Affected Model: Gemini 3.5 Flash 1

| Rank | Model | Company | Wrong Answer Count | Category Score | Tests Correct | Response Time (avg) |
|------|-------|---------|--------------------|----------------|---------------|---------------------|
| #73 | Seed-2.0-Mini medium | Bytedance Seed | 1 | 8.2 | 2/3 | 31.8s |
| #78 | Qwen3.6 27B medium | Qwen | 1 | 7.7 | 2/3 | 61.1s |
| #88 | Qwen3.7 Plus none | Qwen | 1 | 7.7 | 2/3 | 1.71s |
| #91 | GPT-5.5 none | OpenAI | 1 | 7.7 | 2/3 | 1.29s |
| #95 | Qwen3.5 Plus 2026-02-15 none | Qwen | 1 | 7.7 | 2/3 | 2.71s |
| #97 | Gemini 2.5 Flash none | Google | 1 | 7.7 | 2/3 | 604ms |
| #98 | GLM 5 none | Z.ai | 1 | 7.7 | 2/3 | 1.91s |
| #106 | Grok 4.20 Beta none | X AI | 1 | 7.7 | 2/3 | 586ms |
| #112 | GLM 5.1 none | Z.ai | 1 | 7.7 | 2/3 | 1.45s |
| #22 | Step 3.7 Flash medium | Stepfun | 2 | 5.7 | 1/3 | 6.19s |
| #38 | Grok 4.3 medium | X AI | 1 | 5.9 | 1/3 | 22.5s |
| #41 | Nemotron 3 Ultra 550b A55b medium | NVIDIA | 2 | 5.5 | 1/3 | 3.54s |
| #43 | MiMo-V2.5-Pro medium | Xiaomi | 1 | 6.7 | 1/3 | 5.31s |
| #54 | GPT-5 Mini medium | OpenAI | 1 | 5.6 | 1/3 | 15.2s |
| #57 | Step 3.7 Flash low | Stepfun | 2 | 5.5 | 1/3 | 1.84s |
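
For readers who want to re-sort or total the rows themselves, here is a minimal Python sketch. The values are copied from the table above. The names `ROWS`, `pass_rate`, `sort_rows` and `shown_wrong_answers` are hypothetical helpers for illustration and are not part of AI Benchy. The tie-break on Category Score is an assumption, since the page only states that rows are sorted by Tests Correct.

```python
from fractions import Fraction

# (model, wrong_answer_count, category_score, tests_correct), copied from the table.
ROWS = [
    ("Seed-2.0-Mini medium", 1, 8.2, "2/3"),
    ("Qwen3.6 27B medium", 1, 7.7, "2/3"),
    ("Qwen3.7 Plus none", 1, 7.7, "2/3"),
    ("GPT-5.5 none", 1, 7.7, "2/3"),
    ("Qwen3.5 Plus 2026-02-15 none", 1, 7.7, "2/3"),
    ("Gemini 2.5 Flash none", 1, 7.7, "2/3"),
    ("GLM 5 none", 1, 7.7, "2/3"),
    ("Grok 4.20 Beta none", 1, 7.7, "2/3"),
    ("GLM 5.1 none", 1, 7.7, "2/3"),
    ("Step 3.7 Flash medium", 2, 5.7, "1/3"),
    ("Grok 4.3 medium", 1, 5.9, "1/3"),
    ("Nemotron 3 Ultra 550b A55b medium", 2, 5.5, "1/3"),
    ("MiMo-V2.5-Pro medium", 1, 6.7, "1/3"),
    ("GPT-5 Mini medium", 1, 5.6, "1/3"),
    ("Step 3.7 Flash low", 2, 5.5, "1/3"),
]


def pass_rate(tests_correct: str) -> Fraction:
    """Turn a 'passed/total' string such as '2/3' into an exact fraction."""
    passed, total = tests_correct.split("/")
    return Fraction(int(passed), int(total))


def sort_rows(rows):
    """Highest pass rate first; ties broken by higher category score (assumed)."""
    return sorted(rows, key=lambda r: (-pass_rate(r[3]), -r[2]))


def shown_wrong_answers(rows):
    """Sum the Wrong Answer Count column over the rows that are listed."""
    return sum(r[1] for r in rows)


if __name__ == "__main__":
    for model, wrong, score, correct in sort_rows(ROWS):
        print(f"{model:36} {correct}  score={score}  wrong={wrong}")
    print("Wrong answers across shown models:", shown_wrong_answers(ROWS))
```

Summing the column over the 15 listed models gives 18 wrong answers, well below the reported total of 147, so that total presumably covers models beyond those shown here.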

Charts (not reproduced here): Top Models by Wrong Answer Count; Wrong Answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.