Wrong answer Failure Ranking

See which AI models most often fail with a wrong answer, so you can spot reliability risks before choosing one. The table is sorted by Tests Correct, ascending.

Models Shown

Total Failures: 1642

Most Affected Model: Laguna S 2.1 (18)

Categories

Domain specific: 433
Anti-AI Tricks: 306
Coding: 266
Puzzle Solving: 214
Trivia: 176
Combined: 71
General Intelligence: 66
Instructions following: 65
Data parsing and extraction: 41
Tool Calling: 4
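The category counts above add up to the Total Failures figure of 1642. A minimal Python sketch of that check, which also derives each category's share of all failures; the names `category_counts` and `category_shares` are illustrative and not part of the page:

```python
# Per-category wrong-answer counts, copied from the category breakdown above.
category_counts = {
    "Domain specific": 433,
    "Anti-AI Tricks": 306,
    "Coding": 266,
    "Puzzle Solving": 214,
    "Trivia": 176,
    "Combined": 71,
    "General Intelligence": 66,
    "Instructions following": 65,
    "Data parsing and extraction": 41,
    "Tool Calling": 4,
}

# Sum of all categories; this should equal the Total Failures figure.
total = sum(category_counts.values())


def category_shares(counts):
    """Return each category's share of all failures as a percentage (one decimal)."""
    overall = sum(counts.values())
    return {name: round(100 * n / overall, 1) for name, n in counts.items()}


shares = category_shares(category_counts)
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

Under these numbers, Domain specific alone accounts for roughly a quarter of all wrong answers, while Tool Calling is negligible.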

219/219

Rank	Model	Company	Wrong answer Count	Score	Total Cost	Tests Correct	Wrong Tests	Response Time (avg)
#205	Elephant Alpha medium	Openrouter	9	4.3	$0.000	6/21	15	1.27s
#208	Laguna Xs.2 medium	Poolside	6	4.1	$0.015	6/19	13	6.73s
#216	gpt-oss-120b none	OpenAI	8	3.7	$0.010	6/19	13	21.6s
#117	LongCat 2.0 none	Meituan	14	6.3	$0.044	7/22	15	5.18s
#130	Qwen3.6 Flash none	Qwen	12	6.1	$0.062	7/22	15	3.74s
#133	Qwen3.5-35B-A3B none	Qwen	12	6.1	$0.106	7/22	15	12.7s
#144	Kimi K2.6 none	Moonshot AI	11	5.8	$0.184	7/22	15	19.6s
#145	GPT-5.4 none	OpenAI	14	5.8	$0.397	7/22	15	2.07s
#150	KAT-Coder-Air V2.5 high	Kwaipilot	9	5.6	$0.077	7/22	15	15.9s
#157	GLM 5.1 none	Z.ai	13	5.5	$0.164	7/22	15	6.70s
#158	Qwen3.6 27B none	Qwen	11	5.5	$0.087	7/22	15	10.7s
#165	KAT-Coder-Air V2.5 low	Kwaipilot	7	5.4	$0.041	7/22	15	10.1s
#153	Mimo V2 PRO none	Xiaomi	11	5.6	$0.045	7/21	14	2.27s
#154	Owl Alpha none	Openrouter	10	5.6	$0.000	7/21	14	9.88s
#197	Cobuddy medium	Baidu	9	4.7	$0.000	7/21	14	39.9s
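In every row shown, the Wrong Tests figure equals the total in Tests Correct minus the number correct (for example, 21 - 6 = 15). The sketch below reproduces that arithmetic for a few rows and, assuming that Wrong answer Count is a subset of Wrong Tests (an assumption the page does not state), estimates what fraction of a model's failed tests were wrong answers. The `Row` class and its method names are hypothetical helpers for illustration:

```python
from dataclasses import dataclass


@dataclass
class Row:
    model: str
    wrong_answer_count: int
    tests_correct: str  # formatted as "correct/total", as in the table

    def failed_tests(self) -> int:
        """Total tests minus correct tests, parsed from the 'correct/total' string."""
        correct, total = (int(part) for part in self.tests_correct.split("/"))
        return total - correct

    def wrong_answer_share(self) -> float:
        """Fraction of failed tests counted as wrong answers (assumes a subset relation)."""
        failed = self.failed_tests()
        return self.wrong_answer_count / failed if failed else 0.0


# A few rows copied from the table above.
rows = [
    Row("Elephant Alpha medium", 9, "6/21"),
    Row("Laguna Xs.2 medium", 6, "6/19"),
    Row("GPT-5.4 none", 14, "7/22"),
]

for row in rows:
    print(f"{row.model}: {row.failed_tests()} failed, "
          f"{row.wrong_answer_share():.0%} of failures were wrong answers")
```

Read this way, two models with the same Tests Correct can differ sharply in how many of their failures are wrong answers rather than some other failure type.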

Wrong answer Failures

[Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg)]