AI BENCHY

AI BENCHY Category Failures

Puzzle Solving: Wrong answer

See which AI models most often return a wrong answer on Puzzle Solving, so you can spot weak points faster. Rows are sorted by average response time, ascending.

Models Shown: 15

Total Failures: 147

Most Affected Model: Mistral Small 4 2
| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #149 | Nemotron 3 Nano Omni 30b A3b Reasoning medium | NVIDIA | 2 | 2.9 | 0/3 | 1.40s |
| #124 | Kimi K2.6 none | Moonshot AI | 2 | 3.1 | 0/3 | 1.40s |
| #125 | GPT-5.4 none | OpenAI | 1 | 5.6 | 1/3 | 1.44s |
| #112 | GLM 5.1 none | Z.ai | 1 | 7.7 | 2/3 | 1.45s |
| #120 | Mimo V2 PRO none | Xiaomi | 1 | 6.0 | 1/3 | 1.61s |
| #88 | Qwen3.7 Plus none | Qwen | 1 | 7.7 | 2/3 | 1.71s |
| #160 | LFM2-24B-A2B none | Liquid | 2 | 3.8 | 0/3 | 1.78s |
| #57 | Step 3.7 Flash low | Stepfun | 2 | 5.5 | 1/3 | 1.84s |
| #152 | MiMo-V2-Flash none | Xiaomi | 2 | 5.3 | 1/3 | 1.86s |
| #98 | GLM 5 none | Z.ai | 1 | 7.7 | 2/3 | 1.91s |
| #107 | Laguna Xs.2 medium | Poolside | 1 | 5.3 | 1/3 | 1.93s |
| #44 | Gemini 3.1 Flash Lite medium | Google | 1 | 7.6 | 2/3 | 1.95s |
| #151 | Trinity Large Preview none | Arcee AI | 2 | 3.6 | 0/3 | 1.97s |
| #114 | Qwen3.5 Plus 2026-04-20 none | Qwen | 2 | 6.7 | 1/3 | 1.97s |
| #143 | MiMo-V2.5 none | Xiaomi | 1 | 5.4 | 1/3 | 2.13s |
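
For readers who want to re-sort or filter these rows offline, the following is a minimal sketch of how the ascending response-time ordering and the "Tests Correct" fraction could be reproduced. The `Row` class and both helper functions are hypothetical names chosen for illustration and are not part of AI BENCHY; only four rows from the table are copied in, and the sketch assumes "Tests Correct" always has the form `passed/total`.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """One leaderboard row (hypothetical structure, not an AI BENCHY API)."""
    rank: int
    model: str
    wrong_answers: int
    tests_correct: str      # e.g. "1/3", as shown in the table
    avg_response_s: float   # average response time in seconds


# A few rows copied from the table, deliberately out of order.
ROWS = [
    Row(143, "MiMo-V2.5 none", 1, "1/3", 2.13),
    Row(125, "GPT-5.4 none", 1, "1/3", 1.44),
    Row(44, "Gemini 3.1 Flash Lite medium", 1, "2/3", 1.95),
    Row(149, "Nemotron 3 Nano Omni 30b A3b Reasoning medium", 2, "0/3", 1.40),
]


def sort_by_response_time(rows):
    """Return rows ordered by average response time, fastest first."""
    return sorted(rows, key=lambda r: r.avg_response_s)


def pass_rate(tests_correct):
    """Convert a 'passed/total' string into a fraction between 0 and 1."""
    passed, total = tests_correct.split("/")
    return int(passed) / int(total)


if __name__ == "__main__":
    for row in sort_by_response_time(ROWS):
        print(f"#{row.rank} {row.model}: {row.avg_response_s:.2f}s, "
              f"pass rate {pass_rate(row.tests_correct):.0%}")
```

Sorting with a different key, for example `lambda r: -r.wrong_answers`, would give a wrong-answer ordering instead; the page's own ranking logic is not documented here, so this only mirrors the visible sort.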

Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.