AI BENCHY

Wrong answer Failures

See which AI models produce wrong answers most often, so you can spot reliability risks before choosing one.

Models Shown: 15

Total Failures: 1204

Most Affected Model: Mercury 2 (16)

| Rank | Model | Company | Wrong answer Count | Score | Tests Correct | Response Time (avg) |
|------|-------|---------|--------------------|-------|---------------|---------------------|
| #134 | GLM 5 Turbo (none) | Z.ai | 13 | 5.2 | 6/21 | 2.82s |
| #144 | GPT-5.4 Mini (none) | OpenAI | 13 | 4.9 | 5/21 | 1.13s |
| #150 | Qwen3 Coder Next (medium) | Qwen | 13 | 4.6 | 4/21 | 8.58s |
| #152 | MiMo-V2-Flash (none) | Xiaomi | 13 | 4.6 | 4/21 | 2.76s |
| #153 | Qwen3.6 35B A3B (none) | Qwen | 13 | 4.6 | 4/21 | 3.73s |
| #157 | Grok 4.1 Fast (none) | X AI | 13 | 4.4 | 3/19 | 1.62s |
| #163 | Granite 4.1 8B (none) | IBM Granite | 13 | 4.0 | 2/21 | 728ms |
| #95 | Qwen3.5 Plus 2026-02-15 (none) | Qwen | 12 | 6.3 | 9/21 | 2.31s |
| #97 | Gemini 2.5 Flash (none) | Google | 12 | 6.2 | 9/21 | 875ms |
| #98 | GLM 5 (none) | Z.ai | 12 | 6.1 | 9/21 | 4.03s |
| #104 | Nemotron 3 Ultra 550b A55b (none) | NVIDIA | 12 | 6.0 | 8/21 | 2.27s |
| #114 | Qwen3.5 Plus 2026-04-20 (none) | Qwen | 12 | 5.7 | 7/21 | 4.39s |
| #115 | Qwen3.5-27B (none) | Qwen | 12 | 5.7 | 7/21 | 1.68s |
| #117 | Qwen3.5-35B-A3B (none) | Qwen | 12 | 5.6 | 7/21 | 3.37s |
| #128 | Qwen3.6 Flash (none) | Qwen | 12 | 5.4 | 7/21 | 1.60s |

[Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg)]