AI BENCHY

AI BENCHY Category Failures

Puzzle Solving: Wrong answer

This page shows which AI models most often return a wrong answer on Puzzle Solving tests, so you can spot weak points faster. Rows are sorted by average response time, ascending.

Models Shown: 15
Total Failures: 85
Most Affected Model: Grok 4.20 1

Rank  Model                                 Company         Wrong answer Count  Category Score  Tests Correct  Response Time (avg)
#61   Seed-2.0-Lite none                    Bytedance Seed  2                   5.2             1/3            2.46s
#49   Qwen3.5 Plus 2026-02-15 none          Qwen            1                   7.7             2/3            2.82s
#48   Gemma 4 31B none                      Google          1                   5.5             1/3            2.95s
#72   Hunter Alpha none                     OpenRouter      1                   5.8             1/3            3.06s
#78   Trinity Large Preview none            Arcee AI        2                   5.4             1/3            3.30s
#17   Gemini 3.1 Flash Lite Preview medium  Google          1                   7.7             2/3            3.58s
#38   GPT-5.4 Nano medium                   OpenAI          1                   4.0             0/3            3.65s
#41   MiMo-V2-Flash medium                  Xiaomi          1                   7.7             2/3            3.77s
#35   MiMo-V2-Omni medium                   Xiaomi          1                   6.5             1/3            3.88s
#15   Gemini 2.5 Flash medium               Google          1                   7.7             2/3            3.94s
#28   GPT-5.2 Chat none                     OpenAI          1                   7.7             2/3            4.42s
#37   Claude Opus 4.6 medium                Anthropic       1                   7.7             2/3            4.60s
#76   Kimi K2.5 none                        Moonshot AI     3                   3.1             0/3            4.73s
#50   Hunter Alpha medium                   OpenRouter      1                   6.1             1/3            5.36s
#59   Qwen3.5-Flash none                    Qwen            2                   3.3             0/3            5.90s
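
For readers who want to work with these rows outside the page, here is a minimal Python sketch. It copies four rows from the table by hand and shows the ascending sort by average response time, a pass rate derived from the "Tests Correct" fraction, and a lookup of the model with the highest Wrong answer Count. The `Row` class, its field names, and the helper functions are hypothetical names chosen for this illustration. They are not part of any AI Benchy API or export format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One table row. Field names are illustrative, not an official schema."""
    rank: int
    model: str            # model name plus the "none"/"medium" label from the table
    company: str
    wrong_answers: int    # "Wrong answer Count" column
    score: float          # "Category Score" column
    tests_correct: str    # "Tests Correct" column, formatted "passed/total"
    avg_response_s: float  # "Response Time (avg)" column, in seconds


# Four rows copied by hand from the table (a subset, not the full listing).
ROWS = [
    Row(76, "Kimi K2.5 none", "Moonshot AI", 3, 3.1, "0/3", 4.73),
    Row(61, "Seed-2.0-Lite none", "Bytedance Seed", 2, 5.2, "1/3", 2.46),
    Row(59, "Qwen3.5-Flash none", "Qwen", 2, 3.3, "0/3", 5.90),
    Row(37, "Claude Opus 4.6 medium", "Anthropic", 1, 7.7, "2/3", 4.60),
]


def pass_rate(tests_correct: str) -> float:
    """Convert a 'passed/total' string such as '2/3' into a fraction."""
    passed, total = tests_correct.split("/")
    return int(passed) / int(total)


def sort_by_response_time(rows: list[Row]) -> list[Row]:
    """Return rows ordered by average response time, fastest first."""
    return sorted(rows, key=lambda r: r.avg_response_s)


def most_wrong_answers(rows: list[Row]) -> Row:
    """Return the row with the highest Wrong answer Count."""
    return max(rows, key=lambda r: r.wrong_answers)


if __name__ == "__main__":
    for row in sort_by_response_time(ROWS):
        print(f"#{row.rank:<3} {row.model:<24} "
              f"{row.avg_response_s:.2f}s  pass rate {pass_rate(row.tests_correct):.0%}")
    worst = most_wrong_answers(ROWS)
    print(f"Most wrong answers in this subset: {worst.model} ({worst.wrong_answers})")
```

Because the sketch holds only a subset of the listed rows, its totals will not match the page-level "Total Failures" figure.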

Charts on this page (not reproduced here): Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.