AI BENCHY

AI BENCHY Category Failures

Puzzle Solving: Wrong answer

See which AI models are most likely to return a wrong answer on Puzzle Solving tests, so you can spot weak points faster. The table is sorted by average response time, slowest first.

Models Shown: 15

Total Failures: 85

| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
|------|-------|---------|--------------------|----------------|---------------|---------------------|
| #52 | Grok 4.1 Fast (medium) | X AI | 1 | 5.3 | 1/3 | 8.08s |
| #30 | Step 3.5 Flash (medium) | Stepfun | 1 | 5.3 | 1/3 | 7.72s |
| #88 | Nemotron 3 Super (none) | NVIDIA | 1 | 5.7 | 1/3 | 7.50s |
| #64 | DeepSeek V3.2 (none) | DeepSeek | 1 | 8.5 | 2/3 | 7.37s |
| #84 | gpt-oss-120b (none) | OpenAI | 1 | 4.5 | 0/3 | 6.86s |
| #59 | Qwen3.5-Flash (none) | Qwen | 2 | 3.3 | 0/3 | 5.90s |
| #50 | Hunter Alpha (medium) | OpenRouter | 1 | 6.1 | 1/3 | 5.36s |
| #76 | Kimi K2.5 (none) | Moonshot AI | 3 | 3.1 | 0/3 | 4.73s |
| #37 | Claude Opus 4.6 (medium) | Anthropic | 1 | 7.7 | 2/3 | 4.60s |
| #28 | GPT-5.2 Chat (none) | OpenAI | 1 | 7.7 | 2/3 | 4.42s |
| #15 | Gemini 2.5 Flash (medium) | Google | 1 | 7.7 | 2/3 | 3.94s |
| #35 | MiMo-V2-Omni (medium) | Xiaomi | 1 | 6.5 | 1/3 | 3.88s |
| #41 | MiMo-V2-Flash (medium) | Xiaomi | 1 | 7.7 | 2/3 | 3.77s |
| #38 | GPT-5.4 Nano (medium) | OpenAI | 1 | 4.0 | 0/3 | 3.65s |
| #17 | Gemini 3.1 Flash Lite Preview (medium) | Google | 1 | 7.7 | 2/3 | 3.58s |
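
The page sorts these rows by average response time. As a rough sketch of how the same rows could be re-ranked by wrong-answer count instead, the Python below transcribes the table and re-sorts it. The tuple layout and the function names `by_wrong_answers` and `total_wrong_answers` are illustrative assumptions, not part of AI Benchy; the "medium"/"none" label attached to each model name is left out for brevity.

```python
# Rows transcribed from the table above (hypothetical tuple layout):
# (rank, model, wrong_answers, tests_correct, tests_total, avg_seconds)
ROWS = [
    (52, "Grok 4.1 Fast", 1, 1, 3, 8.08),
    (30, "Step 3.5 Flash", 1, 1, 3, 7.72),
    (88, "Nemotron 3 Super", 1, 1, 3, 7.50),
    (64, "DeepSeek V3.2", 1, 2, 3, 7.37),
    (84, "gpt-oss-120b", 1, 0, 3, 6.86),
    (59, "Qwen3.5-Flash", 2, 0, 3, 5.90),
    (50, "Hunter Alpha", 1, 1, 3, 5.36),
    (76, "Kimi K2.5", 3, 0, 3, 4.73),
    (37, "Claude Opus 4.6", 1, 2, 3, 4.60),
    (28, "GPT-5.2 Chat", 1, 2, 3, 4.42),
    (15, "Gemini 2.5 Flash", 1, 2, 3, 3.94),
    (35, "MiMo-V2-Omni", 1, 1, 3, 3.88),
    (41, "MiMo-V2-Flash", 1, 2, 3, 3.77),
    (38, "GPT-5.4 Nano", 1, 0, 3, 3.65),
    (17, "Gemini 3.1 Flash Lite Preview", 1, 2, 3, 3.58),
]


def by_wrong_answers(rows):
    """Sort by wrong-answer count (highest first); ties go to the faster model."""
    return sorted(rows, key=lambda r: (-r[2], r[5]))


def total_wrong_answers(rows):
    """Sum the wrong-answer counts over the given rows."""
    return sum(r[2] for r in rows)


if __name__ == "__main__":
    for rank, model, wrong, correct, total, secs in by_wrong_answers(ROWS)[:3]:
        print(f"#{rank} {model}: {wrong} wrong, {correct}/{total} correct, {secs:.2f}s")
    print("Wrong answers across listed rows:", total_wrong_answers(ROWS))
```

The 15 listed rows sum to 18 wrong answers. The page's Total Failures figure of 85 cannot be derived from these rows alone, and the page does not say what it covers.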

Charts (not reproduced here): Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.