Wrong answer Failure Ranking

AI BENCHY Failures

See which AI models run into Wrong answer most often, so you can spot reliability risks before choosing one. Sort by: Response Time (avg) ↑.

Models Shown

Total Failures

1204

Most Affected Model

Categories

In category Domain specific314 In category Anti-AI Tricks245 In category Coding194 In category Puzzle Solving147 In category Trivia130 In category Instructions following53 In category Combined52 In category Data parsing and extraction35 In category General Intelligence32 In category Tool Calling2

Rank	Model	Company	Wrong answer Count	Score	Tests Correct	Response Time (avg)
#142	Mistral Small 4 none	Mistral	15	4.9	5/21	630ms
#155	Mercury 2 none	Inception	16	4.5	4/21	653ms
#163	Granite 4.1 8B none	IBM Granite	13	4.0	2/21	728ms
#162	Nemotron 3 Nano Omni 30b A3b Reasoning none	NVIDIA	9	4.1	2/19	728ms
#160	LFM2-24B-A2B none	Liquid	9	4.2	2/16	782ms
#146	Laguna Xs.2 none	Poolside	8	4.8	5/19	806ms
#97	Gemini 2.5 Flash none	Google	12	6.2	9/21	875ms
#90	Gemini 3.1 Flash Lite none	Google	11	6.4	9/21	1.06s
#127	Grok 4.20 none	X AI	10	5.4	6/18	1.11s
#144	GPT-5.4 Mini none	OpenAI	13	4.9	5/21	1.13s
#106	Grok 4.20 Beta none	X AI	10	5.8	6/18	1.19s
#58	Gemini 3.1 Flash Lite Preview none	Google	7	7.2	12/21	1.21s
#137	Elephant Alpha none	Openrouter	9	5.1	5/21	1.22s
#136	Elephant Alpha medium	Openrouter	9	5.1	6/21	1.27s
#34	Qwen3.7 Max none	Qwen	7	7.7	14/21	1.30s

1 2 11

→

Wrong answer Failures