Wrong answer Failure Ranking

AI BENCHY Failures

See which AI models give wrong answers most often, so you can spot reliability risks before choosing one. The table below is sorted by total cost, lowest first.

Models Shown

Total Failures

1243

Most Affected Model

North Mini Code 9

Categories

Domain specific: 325
Anti-AI Tricks: 250
Coding: 201
Puzzle Solving: 154
Trivia: 133
Instructions following: 54
Combined: 53
General Intelligence: 36
Data parsing and extraction: 35
Tool Calling: 2

169/169

Rank	Model	Company	Wrong answer Count	Score	Total Cost	Tests Correct	Response Time (avg)
#106	Qwen3.5 Plus 2026-02-15 none	Qwen	12	5.8	$0.016	9/21	2.31s
#119	MiMo-V2.5-Pro none	Xiaomi	11	5.5	$0.017	6/21	1.78s
#126	DeepSeek V3.2 none	DeepSeek	7	5.3	$0.017	6/21	13.8s
#84	Gemini 3.1 Flash Lite Preview none	Google	7	6.4	$0.018	12/21	1.21s
#86	Hy3 preview low	Tencent	4	6.4	$0.018	10/21	24.6s
#92	Seed-2.0-Lite none	Bytedance Seed	13	6.2	$0.019	8/21	2.49s
#125	Qwen3.5-122B-A10B none	Qwen	13	5.3	$0.020	6/21	3.41s
#168	Step 3.5 Flash none	Stepfun	1	2.6	$0.020	6/12	39.0s
#87	Nemotron 3 Super medium	NVIDIA	5	6.3	$0.021	8/21	32.0s
#114	Mimo V2 Omni none	Xiaomi	10	5.7	$0.021	8/21	2.44s
#54	Hy3 preview medium	Tencent	3	7.3	$0.021	14/21	16.3s
#60	Qwen3.7 Plus none	Qwen	10	7.2	$0.023	10/21	2.85s
#67	Gemini 3 Flash Preview none	Google	8	6.9	$0.025	13/21	1.65s
#159	MiMo-V2-Flash none	Xiaomi	13	4.3	$0.025	4/21	2.76s
#82	Gemini 3.1 Flash Lite Preview low	Google	7	6.5	$0.026	13/21	2.77s
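
The table reports both a Wrong answer Count and a Tests Correct fraction, and the two need not add up to the total: a model can fail a test for reasons other than a wrong answer. The sketch below shows one way to parse a tab-separated row and compare the two figures. It assumes the columns appear in the order of the header above; the `Row` type and the helper names (`parse_row`, `failed_tests`, `wrong_answer_share`) are hypothetical and are not part of the site. Reading Wrong answer Count as a subset of the failed tests is also an assumption, not something the page states.

```python
from dataclasses import dataclass


@dataclass
class Row:
    rank: int
    model: str
    company: str
    wrong_answers: int   # "Wrong answer Count" column
    score: float
    total_cost: float    # US dollars
    correct: int         # numerator of "Tests Correct"
    total: int           # denominator of "Tests Correct"
    avg_seconds: float   # "Response Time (avg)" in seconds


def parse_row(line: str) -> Row:
    """Parse one tab-separated leaderboard row (assumed column order)."""
    rank, model, company, wrong, score, cost, tests, secs = line.split("\t")
    correct, total = tests.split("/")
    return Row(
        rank=int(rank.lstrip("#")),
        model=model,
        company=company,
        wrong_answers=int(wrong),
        score=float(score),
        total_cost=float(cost.lstrip("$")),
        correct=int(correct),
        total=int(total),
        avg_seconds=float(secs.rstrip("s")),
    )


def failed_tests(row: Row) -> int:
    """Tests that were not marked correct, for any reason."""
    return row.total - row.correct


def wrong_answer_share(row: Row) -> float:
    """Fraction of failed tests attributed to a wrong answer (0.0 if none failed)."""
    failed = failed_tests(row)
    return row.wrong_answers / failed if failed else 0.0


# Two rows copied from the table above.
seed = parse_row("#92\tSeed-2.0-Lite none\tBytedance Seed\t13\t6.2\t$0.019\t8/21\t2.49s")
deepseek = parse_row("#126\tDeepSeek V3.2 none\tDeepSeek\t7\t5.3\t$0.017\t6/21\t13.8s")
```

Under these assumptions, every failed test for Seed-2.0-Lite is a wrong answer (13 of 13), while for DeepSeek V3.2 only 7 of 15 failed tests are, so the rest fall under other failure types.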

Wrong answer Failures

Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg)