Wrong answer Failure Ranking

AI BENCHY Failures

See which AI models run into Wrong answer most often, so you can spot reliability risks before choosing one. Sort by: Failure Count ↑.

Models Shown

Total Failures

1204

Most Affected Model

Categories

In category Domain specific314 In category Anti-AI Tricks245 In category Coding194 In category Puzzle Solving147 In category Trivia130 In category Instructions following53 In category Combined52 In category Data parsing and extraction35 In category General Intelligence32 In category Tool Calling2

Rank	Model	Company	Wrong answer Count	Score	Tests Correct	Response Time (avg)
#1	Gemini 3 Flash Preview medium	Google	1	9.8	20/21	18.6s
#2	Gemini 3.5 Flash high	Google	1	9.6	20/21	8.84s
#83	Step 3.5 Flash none	Stepfun	1	6.6	6/12	39.0s
#3	Gemini 3.5 Flash low	Google	2	9.4	19/21	3.27s
#4	Gemini 3.1 Pro Preview medium	Google	2	9.4	19/21	20.1s
#7	Gemini 3.5 Flash medium	Google	2	9.0	18/21	4.94s
#12	Gemini 3.1 Flash Lite Preview high	Google	2	8.6	13/16	68.1s
#27	Gemma 4 31B medium	Google	2	7.8	14/21	56.5s
#66	Qwen3.5-35B-A3B medium	Qwen	2	7.1	11/21	72.6s
#93	Qwen3.6 Plus Preview medium	Qwen	2	6.3	9/19	15.2s
#161	Qwen3.5-9B medium	Qwen	2	4.2	3/21	82.2s
#5	Qwen3.7 Max medium	Qwen	3	9.1	18/21	16.0s
#6	GPT-5.5 low	OpenAI	3	9.0	18/21	9.76s
#8	Claude Opus 4.7 none	Anthropic	3	8.9	16/19	3.02s
#10	Claude Opus 4.8 medium	Anthropic	3	8.7	17/21	9.66s

1 2 11

→

Wrong answer Failures