Wrong answer Failure Ranking

AI BENCHY Failures

See which AI models run into Wrong answer most often, so you can spot reliability risks before choosing one. Sort by: Score ↑.

Models Shown

Total Failures

1204

Most Affected Model

Categories

In category Domain specific314 In category Anti-AI Tricks245 In category Coding194 In category Puzzle Solving147 In category Trivia130 In category Instructions following53 In category Combined52 In category Data parsing and extraction35 In category General Intelligence32 In category Tool Calling2

Rank	Model	Company	Wrong answer Count	Score	Tests Correct	Response Time (avg)
#163	Granite 4.1 8B none	IBM Granite	13	4.0	2/21	728ms
#162	Nemotron 3 Nano Omni 30b A3b Reasoning none	NVIDIA	9	4.1	2/19	728ms
#161	Qwen3.5-9B medium	Qwen	2	4.2	3/21	82.2s
#160	LFM2-24B-A2B none	Liquid	9	4.2	2/16	782ms
#159	Ling-2.6-1T none	Inclusionai	12	4.3	3/21	7.72s
#158	GLM 4.7 Flash medium	Z.ai	9	4.4	4/21	35.1s
#157	Grok 4.1 Fast none	X AI	13	4.4	3/19	1.62s
#156	Hy3 preview none	Tencent	8	4.4	4/21	12.9s
#155	Mercury 2 none	Inception	16	4.5	4/21	653ms
#154	Qwen3.5-9B none	Qwen	14	4.6	4/21	1.89s
#153	Qwen3.6 35B A3B none	Qwen	13	4.6	4/21	3.73s
#152	MiMo-V2-Flash none	Xiaomi	13	4.6	4/21	2.76s
#151	Trinity Large Preview none	Arcee AI	12	4.6	4/21	2.98s
#150	Qwen3 Coder Next medium	Qwen	13	4.6	4/21	8.58s
#149	Nemotron 3 Nano Omni 30b A3b Reasoning medium	NVIDIA	7	4.6	4/19	17.1s

1 2 11

→

Wrong answer Failures