AI BENCHY


Puzzle Solving
Wrong answer

See which AI models most often return a wrong answer on Puzzle Solving tests, so you can spot weak points faster.

Models Shown: 15

Total Failures: 147

Most Affected Model: Qwen3.5-Flash 3

| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #124 | Kimi K2.6 none | Moonshot AI | 2 | 3.1 | 0/3 | 1.40s |
| #127 | Grok 4.20 none | X AI | 2 | 5.3 | 1/3 | 473ms |
| #128 | Qwen3.6 Flash none | Qwen | 2 | 3.5 | 0/3 | 1.21s |
| #131 | Qwen3.5-122B-A10B none | Qwen | 2 | 3.8 | 0/3 | 1.00s |
| #132 | Mistral Small 4 medium | Mistral | 2 | 3.4 | 0/3 | 2.17s |
| #137 | Elephant Alpha none | Openrouter | 2 | 4.2 | 0/3 | 807ms |
| #138 | Ling-2.6-flash none | Inclusionai | 2 | 2.9 | 0/3 | 6.51s |
| #142 | Mistral Small 4 none | Mistral | 2 | 3.1 | 0/3 | 399ms |
| #145 | Laguna M.1 none | Poolside | 2 | 3.0 | 0/3 | 891ms |
| #147 | GPT-4o-mini none | OpenAI | 2 | 3.5 | 0/3 | 1.21s |
| #149 | Nemotron 3 Nano Omni 30b A3b Reasoning medium | NVIDIA | 2 | 2.9 | 0/3 | 1.40s |
| #150 | Qwen3 Coder Next medium | Qwen | 2 | 3.0 | 0/3 | 1.25s |
| #151 | Trinity Large Preview none | Arcee AI | 2 | 3.6 | 0/3 | 1.97s |
| #152 | MiMo-V2-Flash none | Xiaomi | 2 | 5.3 | 1/3 | 1.86s |
| #154 | Qwen3.5-9B none | Qwen | 2 | 3.2 | 0/3 | 621ms |
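
The Tests Correct and Response Time (avg) columns mix formats (fractions such as 0/3, and times in both ms and s), so they are awkward to compare directly. The following is a minimal sketch of one way to normalize them, using three rows copied from the table above. The helper names `pass_rate` and `to_milliseconds` are hypothetical and are not part of AI BENCHY.

```python
# Three rows copied from the table: (model, tests_correct, avg_response_time).
rows = [
    ("Grok 4.20", "1/3", "473ms"),
    ("Kimi K2.6", "0/3", "1.40s"),
    ("Ling-2.6-flash", "0/3", "6.51s"),
]


def pass_rate(tests_correct: str) -> float:
    """Turn a 'passed/total' string such as '1/3' into a fraction in [0, 1]."""
    passed, total = tests_correct.split("/")
    return int(passed) / int(total)


def to_milliseconds(value: str) -> float:
    """Convert a time string ending in 'ms' or 's' to milliseconds."""
    if value.endswith("ms"):  # check 'ms' first, since it also ends in 's'
        return float(value[:-2])
    if value.endswith("s"):
        return float(value[:-1]) * 1000.0
    raise ValueError(f"unrecognized time format: {value!r}")


# Normalized view: (model, pass rate, response time in ms).
normalized = [(m, pass_rate(t), to_milliseconds(r)) for m, t, r in rows]

# Fastest of the sampled rows by average response time.
fastest = min(normalized, key=lambda row: row[2])

for model, rate, ms in normalized:
    print(f"{model}: pass rate {rate:.0%}, {ms:.0f} ms")
```

With both columns in common units, rows can be sorted or filtered on speed and accuracy together rather than read off one at a time.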

Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.