Wrong answer Failure Ranking

AI BENCHY Failures

See which AI models run into Wrong answer most often, so you can spot reliability risks before choosing one. Sort by: Tests Correct ↑.

Models Shown

Total Failures

1204

Most Affected Model

Categories

In category Domain specific314 In category Anti-AI Tricks245 In category Coding194 In category Puzzle Solving147 In category Trivia130 In category Instructions following53 In category Combined52 In category Data parsing and extraction35 In category General Intelligence32 In category Tool Calling2

Rank	Model	Company	Wrong answer Count	Score	Tests Correct	Response Time (avg)
#163	Granite 4.1 8B none	IBM Granite	13	4.0	2/21	728ms
#162	Nemotron 3 Nano Omni 30b A3b Reasoning none	NVIDIA	9	4.1	2/19	728ms
#160	LFM2-24B-A2B none	Liquid	9	4.2	2/16	782ms
#159	Ling-2.6-1T none	Inclusionai	12	4.3	3/21	7.72s
#161	Qwen3.5-9B medium	Qwen	2	4.2	3/21	82.2s
#157	Grok 4.1 Fast none	X AI	13	4.4	3/19	1.62s
#148	GPT-5.4 Nano none	OpenAI	15	4.7	4/21	1.48s
#150	Qwen3 Coder Next medium	Qwen	13	4.6	4/21	8.58s
#151	Trinity Large Preview none	Arcee AI	12	4.6	4/21	2.98s
#152	MiMo-V2-Flash none	Xiaomi	13	4.6	4/21	2.76s
#153	Qwen3.6 35B A3B none	Qwen	13	4.6	4/21	3.73s
#154	Qwen3.5-9B none	Qwen	14	4.6	4/21	1.89s
#155	Mercury 2 none	Inception	16	4.5	4/21	653ms
#156	Hy3 preview none	Tencent	8	4.4	4/21	12.9s
#158	GLM 4.7 Flash medium	Z.ai	9	4.4	4/21	35.1s

1 2 11

→

Wrong answer Failures