Wrong answer Failure Ranking

See which AI models most often fail with a wrong answer, so you can spot reliability risks before choosing one.

Models Shown

Total Failures

1558

Most Affected Model

Categories

Domain specific: 412
Anti-AI Tricks: 293
Coding: 252
Puzzle Solving: 201
Trivia: 168
Combined: 68
Instructions following: 61
General Intelligence: 59
Data parsing and extraction: 41
Tool Calling: 3
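The per-category counts can be checked against the Total Failures figure of 1558. The snippet below is a minimal sketch of that arithmetic. It assumes each failure is counted in exactly one category, and the names `category_counts` and `shares` are illustrative, not part of the page.

```python
# Per-category Wrong answer counts as listed on the page.
category_counts = {
    "Domain specific": 412,
    "Anti-AI Tricks": 293,
    "Coding": 252,
    "Puzzle Solving": 201,
    "Trivia": 168,
    "Combined": 68,
    "Instructions following": 61,
    "General Intelligence": 59,
    "Data parsing and extraction": 41,
    "Tool Calling": 3,
}

# Sum of all categories; this should match the Total Failures card.
total = sum(category_counts.values())

# Share of all failures per category, in percent, rounded to one decimal.
shares = {
    name: round(100 * count / total, 1)
    for name, count in category_counts.items()
}

for name, share in shares.items():
    print(f"{name}: {category_counts[name]} ({share}%)")
print("Total:", total)
```

Under that assumption the categories add up to the reported total, with Domain specific accounting for roughly a quarter of all Wrong answer failures.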

209/209

Rank	Model	Company	Wrong answer Count	Score	Total Cost	Tests Correct	Wrong Tests	Response Time (avg)
#136	GPT-5.4 Mini none	OpenAI	13	5.9	$0.095	6/22	16	1.53s
#142	Qwen3.5-122B-A10B none	Qwen	13	5.7	$0.247	6/22	16	12.9s
#151	GLM 5.1 none	Z.ai	13	5.5	$0.164	7/22	15	6.70s
#161	Qwen3.6 35B A3B none	Qwen	13	5.3	$0.061	4/22	18	5.52s
#164	Inkling none	Thinkingmachines	13	5.2	$0.147	6/22	16	3.50s
#170	GLM 5 Turbo none	Z.ai	13	5.1	$0.047	6/21	15	2.82s
#176	GLM 4.7 Flash none	Z.ai	13	4.9	$0.016	6/22	16	9.15s
#182	KAT-Coder-Air V2.5 none	Kwaipilot	13	4.8	$0.067	5/22	17	12.2s
#187	Qwen3 Coder Next medium	Qwen	13	4.7	$0.032	4/22	18	9.61s
#200	MiMo-V2-Flash none	Xiaomi	13	4.0	$0.025	4/21	17	2.76s
#201	Granite 4.1 8B none	IBM Granite	13	4.0	$0.007	2/22	20	1.45s
#203	Grok 4.1 Fast none	X AI	13	3.8	$0.008	3/19	16	1.62s
#103	Qwen3.5-27B none	Qwen	12	6.5	$0.090	8/22	14	4.76s
#107	Qwen3.5 Plus 2026-02-15 none	Qwen	12	6.4	$0.073	10/22	12	9.85s
#118	Gemini 2.5 Flash none	Google	12	6.2	$0.017	9/22	13	6.20s
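
The Wrong Tests figure for each model follows from its Tests Correct value (total minus correct), and comparing it with the Wrong answer Count shows how much of a model's failures fall into this category. The sketch below illustrates that arithmetic for three rows from the table. It assumes every Wrong answer failure is also counted among the Wrong Tests, which the page does not state explicitly, and the function names are hypothetical.

```python
def wrong_tests(tests_correct: str) -> int:
    """Return the number of failed tests from a 'correct/total' string such as '6/22'."""
    correct, total = (int(part) for part in tests_correct.split("/"))
    return total - correct


def wrong_answer_share(wrong_answer_count: int, tests_correct: str) -> float:
    """Fraction of failed tests classified as Wrong answer (assumed to be a subset)."""
    failed = wrong_tests(tests_correct)
    return wrong_answer_count / failed if failed else 0.0


# (model, Wrong answer Count, Tests Correct) copied from the table.
rows = [
    ("GPT-5.4 Mini none", 13, "6/22"),
    ("Qwen3.5 Plus 2026-02-15 none", 12, "10/22"),
    ("Granite 4.1 8B none", 13, "2/22"),
]

for model, count, tests_correct in rows:
    failed = wrong_tests(tests_correct)
    share = wrong_answer_share(count, tests_correct)
    print(f"{model}: {failed} wrong tests, {share:.0%} are Wrong answer")
```

A share below 100% suggests the remaining failed tests belong to other failure types, so two models with the same Wrong answer Count can still differ in overall reliability.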

Wrong answer Failures

Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg)