Wrong answer Failure Ranking

See which AI models fail with a Wrong answer most often, so you can spot reliability risks before choosing one.

Models Shown

Total Failures: 1642

Categories

Domain specific: 433
Anti-AI Tricks: 306
Coding: 266
Puzzle Solving: 214
Trivia: 176
Combined: 71
General Intelligence: 66
Instructions following: 65
Data parsing and extraction: 41
Tool Calling: 4
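
The category counts above add up to the Total Failures figure, and it can help to see each category as a share of that total. The following is a minimal sketch of that arithmetic; the counts are copied from this page, while the names `category_counts` and `category_shares` are made up for illustration and are not part of the leaderboard.

```python
# Wrong answer counts per category, copied from the Categories list on this page.
category_counts = {
    "Domain specific": 433,
    "Anti-AI Tricks": 306,
    "Coding": 266,
    "Puzzle Solving": 214,
    "Trivia": 176,
    "Combined": 71,
    "General Intelligence": 66,
    "Instructions following": 65,
    "Data parsing and extraction": 41,
    "Tool Calling": 4,
}


def category_shares(counts):
    """Return each category's share of all failures, as a percentage (1 decimal)."""
    overall = sum(counts.values())
    return {name: round(100 * n / overall, 1) for name, n in counts.items()}


# The per-category counts should sum to the Total Failures figure (1642).
total = sum(category_counts.values())
shares = category_shares(category_counts)

print(f"Total failures: {total}")
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

Read this way, Domain specific, Anti-AI Tricks and Coding together account for well over half of all Wrong answer failures.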

219/219

Rank	Model	Company	Wrong answer Count	Score	Total Cost	Tests Correct	Response Time (avg)
#128	Gemini 3.1 Flash Lite none	Google	11	6.1	$0.046	9/22	1.75s
#138	GPT-5.6 Terra none	OpenAI	11	6.0	$0.349	8/22	1.65s
#144	Kimi K2.6 none	Moonshot AI	11	5.8	$0.184	7/22	19.6s
#151	GLM 5V Turbo none	Z.ai	11	5.6	$0.052	8/21	2.99s
#153	Mimo V2 PRO none	Xiaomi	11	5.6	$0.045	7/21	2.27s
#155	KAT-Coder-Air V2.5 medium	Kwaipilot	11	5.6	$0.048	8/22	8.42s
#158	Qwen3.6 27B none	Qwen	11	5.5	$0.087	7/22	10.7s
#160	MiMo-V2.5-Pro none	Xiaomi	11	5.5	$0.068	6/22	4.12s
#167	Laguna S 2.1 high	Poolside	11	5.4	$0.127	4/22	111.6s
#66	KAT-Coder-Pro V2.5 low	Kwaipilot	10	7.4	$0.387	11/22	19.5s
#73	KAT-Coder-Pro V2.5 high	Kwaipilot	10	7.2	$0.482	11/22	20.8s
#75	Qwen3.7 Plus none	Qwen	10	7.2	$0.106	11/22	12.1s
#87	GPT-5.6 Sol none	OpenAI	10	6.9	$0.524	11/22	2.16s
#97	KAT-Coder-Pro V2.5 none	Kwaipilot	10	6.7	$0.476	11/22	25.6s
#103	Qwen3.6 Max Preview none	Qwen	10	6.6	$0.231	12/22	7.82s
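
Each table row can be turned into a few derived figures. The sketch below parses tab-separated rows copied from the table and computes the number of failed tests (total minus correct, from the Tests Correct column) and the accuracy. It assumes the eight-column layout shown in the header row, and it reads Wrong answer Count as the subset of failed tests classified as wrong answers, which is an interpretation rather than something the page states. The helper name `parse_row` is hypothetical.

```python
def parse_row(line: str) -> dict:
    """Parse one tab-separated leaderboard row into typed fields."""
    rank, model, company, wrong, score, cost, correct, avg_time = line.split("\t")
    passed, total = (int(x) for x in correct.split("/"))
    return {
        "model": model,
        "company": company,
        "wrong_answers": int(wrong),
        "passed": passed,
        "total": total,
        "failed": total - passed,             # all failed tests, any failure type
        "cost_usd": float(cost.lstrip("$")),
        "avg_seconds": float(avg_time.rstrip("s")),
    }


# Three rows copied verbatim from the table.
rows = [
    "#128\tGemini 3.1 Flash Lite none\tGoogle\t11\t6.1\t$0.046\t9/22\t1.75s",
    "#167\tLaguna S 2.1 high\tPoolside\t11\t5.4\t$0.127\t4/22\t111.6s",
    "#103\tQwen3.6 Max Preview none\tQwen\t10\t6.6\t$0.231\t12/22\t7.82s",
]

parsed = [parse_row(r) for r in rows]

for r in parsed:
    accuracy = r["passed"] / r["total"]
    print(
        f'{r["model"]}: {r["failed"]} failed tests, '
        f'{r["wrong_answers"]} wrong answers, accuracy {accuracy:.0%}'
    )
```

Comparing `failed` with `wrong_answers` shows how many of a model's failures fall outside the Wrong answer category, for example 13 failed tests against 11 wrong answers for the first row.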

Wrong answer Failures

[Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg)]