Wrong answer Failure Ranking

AI BENCHY Failures

See which AI models hit the "Wrong answer" failure most often, so you can spot reliability risks before choosing one. Rows are sorted by Total Cost, ascending.

Models Shown: 169/169

Total Failures: 1243

Most Affected Model: North Mini Code 9

Categories

Domain specific: 325
Anti-AI Tricks: 250
Coding: 201
Puzzle Solving: 154
Trivia: 133
Instructions following: 54
Combined: 53
General Intelligence: 36
Data parsing and extraction: 35
Tool Calling: 2

Rank	Model	Company	Wrong answer Count	Score	Total Cost	Tests Correct	Wrong Tests	Response Time (avg)
#40	MiniMax M3 medium	Minimax	3	7.6	$0.131	11/21	10	68.2s
#75	Qwen3.6 35B A3B medium	Qwen	4	6.7	$0.146	13/21	8	18.1s
#41	DeepSeek V4 Pro high	DeepSeek	6	7.6	$0.157	9/21	12	77.2s
#26	Nemotron 3 Ultra 550b A55b medium	NVIDIA	7	8.1	$0.158	13/21	8	15.1s
#16	GPT-5 Mini medium	OpenAI	5	8.5	$0.159	12/21	9	23.6s
#18	Seed-2.0-Lite medium	Bytedance Seed	5	8.5	$0.175	14/21	7	47.1s
#25	Qwen3.7 Plus medium	Qwen	5	8.2	$0.177	15/21	6	38.9s
#15	GLM 5 medium	Z.ai	3	8.6	$0.228	15/21	6	33.5s
#90	GPT-5.5 none	OpenAI	11	6.3	$0.231	10/21	11	1.89s
#47	Qwen3.6 Flash medium	Qwen	8	7.5	$0.288	12/21	9	19.2s
#64	GLM 5.1 medium	Z.ai	4	7.1	$0.292	12/21	9	33.7s
#30	Qwen3.6 Plus medium	Qwen	5	7.8	$0.294	14/21	7	30.7s
#146	MiniMax M2.5 medium	Minimax	7	4.7	$0.303	5/21	16	65.4s
#28	Qwen3.5 Plus 2026-02-15 medium	Qwen	4	8.0	$0.310	14/21	7	73.8s
#55	Claude Sonnet 4.6 none	Anthropic	5	7.3	$0.316	11/21	10	5.04s
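
The table reports two different figures per model: the Wrong answer Count and the number of wrong tests, which is the total tests minus the tests correct. A minimal Python sketch of how the two relate is shown here, using three rows copied from the table. The `Row` class and the `wrong_answer_share` name are hypothetical, introduced only for illustration. The sketch also assumes that failed tests not counted under "Wrong answer" fall under other failure types, which the page does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One leaderboard row (hypothetical helper, not part of the site)."""
    model: str
    wrong_answer_count: int  # "Wrong answer Count" column
    tests_correct: str       # "Tests Correct" column, e.g. "11/21"

    @property
    def wrong_tests(self) -> int:
        # Wrong tests = total tests - correct tests.
        correct, total = (int(part) for part in self.tests_correct.split("/"))
        return total - correct

    @property
    def wrong_answer_share(self) -> float:
        # Assumption: the remaining failed tests are other failure types.
        if self.wrong_tests == 0:
            return 0.0
        return self.wrong_answer_count / self.wrong_tests


# Values copied from the table above.
rows = [
    Row("MiniMax M3 medium", 3, "11/21"),
    Row("GPT-5.5 none", 11, "10/21"),
    Row("MiniMax M2.5 medium", 7, "5/21"),
]

# List models by the share of their failed tests that were wrong answers.
for row in sorted(rows, key=lambda r: r.wrong_answer_share, reverse=True):
    print(f"{row.model}: {row.wrong_answer_count} of {row.wrong_tests} wrong tests")
```

Read this way, a low Wrong answer Count does not by itself mean few failures: MiniMax M3 medium has 3 wrong answers but 10 wrong tests overall.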

Wrong answer Failures

Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg).