AI BENCHY

Puzzle Solving: Wrong answer

See which AI models are most likely to return a wrong answer on Puzzle Solving tests, so you can spot weak points faster. The table is sorted by Tests Correct, descending.

Models Shown: 15
Total Failures: 147
Most Affected Model: Gemini 3.5 Flash 1

| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
|------|-------|---------|--------------------|----------------|---------------|---------------------|
| #94 | GPT-5 Nano medium | OpenAI | 1 | 5.3 | 1/3 | 20.6s |
| #99 | gpt-oss-120b medium | OpenAI | 1 | 5.3 | 1/3 | 21.7s |
| #100 | Grok Build 0.1 none | X AI | 1 | 6.4 | 1/3 | 9.55s |
| #102 | Gemma 4 26B A4B none | Google | 1 | 6.2 | 1/3 | 744ms |
| #103 | DeepSeek V4 Pro high | DeepSeek | 1 | 5.9 | 1/3 | 34.8s |
| #104 | Nemotron 3 Ultra 550b A55b none | NVIDIA | 1 | 5.9 | 1/3 | 1.06s |
| #107 | Laguna Xs.2 medium | Poolside | 1 | 5.3 | 1/3 | 1.93s |
| #109 | GLM 5V Turbo none | Z.ai | 1 | 5.3 | 1/3 | 2.40s |
| #110 | Seed-2.0-Lite none | Bytedance Seed | 2 | 5.3 | 1/3 | 2.78s |
| #111 | Owl Alpha medium | OpenRouter | 1 | 5.3 | 1/3 | 3.40s |
| #114 | Qwen3.5 Plus 2026-04-20 none | Qwen | 2 | 6.7 | 1/3 | 1.97s |
| #115 | Qwen3.5-27B none | Qwen | 1 | 6.7 | 1/3 | 1.38s |
| #116 | Hunter Alpha none | OpenRouter | 1 | 5.8 | 1/3 | 3.71s |
| #118 | Qwen3.6 27B none | Qwen | 1 | 5.3 | 1/3 | 5.15s |
| #120 | Mimo V2 PRO none | Xiaomi | 1 | 6.0 | 1/3 | 1.61s |

[Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost]