Wrong answer Failure Ranking

See which AI models run into Wrong answer failures most often, so you can spot reliability risks before choosing one. Sorted by Tests Correct, ascending.

Models Shown: 219/219

Total Failures: 1642

Most Affected Model: Laguna S 2.1 (18 Wrong answer failures)

Categories (Wrong answer count):

Domain specific: 433
Anti-AI Tricks: 306
Coding: 266
Puzzle Solving: 214
Trivia: 176
Combined: 71
General Intelligence: 66
Instructions following: 65
Data parsing and extraction: 41
Tool Calling: 4
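The per-category counts add up exactly to the Total Failures figure (1642), which suggests each Wrong answer failure is counted in exactly one category. Assuming that holds, a short Python sketch can check the total and compute each category's share. The names `CATEGORY_COUNTS`, `TOTAL_FAILURES` and `category_shares` are illustrative only, not part of the page.

```python
# Wrong answer counts per category, copied from the Categories list above.
CATEGORY_COUNTS = {
    "Domain specific": 433,
    "Anti-AI Tricks": 306,
    "Coding": 266,
    "Puzzle Solving": 214,
    "Trivia": 176,
    "Combined": 71,
    "General Intelligence": 66,
    "Instructions following": 65,
    "Data parsing and extraction": 41,
    "Tool Calling": 4,
}
TOTAL_FAILURES = 1642  # "Total Failures" figure from the summary


def category_shares(counts):
    """Return each category's fraction of all failures (assumes categories do not overlap)."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


if __name__ == "__main__":
    # Consistency check: the categories should account for every failure.
    assert sum(CATEGORY_COUNTS.values()) == TOTAL_FAILURES
    shares = category_shares(CATEGORY_COUNTS)
    # Print categories from largest to smallest share.
    for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{name:30s} {share:6.1%}")
```

If a future version of the page let a failure belong to several categories, the sum would exceed the total and the shares would need a different denominator.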

Rank	Model	Company	Wrong answer Count	Score	Total Cost	Tests Correct	Response Time (avg)
#82	Mercury 2 medium	Inception	8	7.0	$0.093	10/22	2.72s
#86	DeepSeek V4 Pro none	DeepSeek	8	6.9	$0.096	10/22	11.6s
#96	LongCat 2.0 low	Meituan	8	6.7	$0.391	10/22	100.3s
#105	Qwen3.6 27B medium	Qwen	6	6.5	$0.779	10/22	106.3s
#113	Qwen3.5 Plus 2026-02-15 none	Qwen	12	6.4	$0.073	10/22	9.85s
#121	Gemma 4 31B none	Google	9	6.2	$0.021	10/22	5.34s
#123	GPT-5.6 Luna low	OpenAI	10	6.2	$0.249	10/22	5.04s
#126	Gemini 3.1 Flash Lite minimal	Google	8	6.1	$0.047	10/22	1.86s
#129	Inkling low	Thinkingmachines	8	6.1	$0.187	10/22	5.15s
#184	Qwen3.6 Plus Preview medium	Qwen	2	4.9	$0.000	9/19	15.2s
#194	Grok 4.1 Fast medium	X AI	4	4.7	$0.069	9/19	23.8s
#195	Laguna M.1 medium	Poolside	4	4.7	$0.033	9/19	14.7s
#140	Mimo V2 Omni medium	Xiaomi	5	5.9	$0.683	10/21	41.2s
#159	Hy3 preview low	Tencent	4	5.5	$0.015	10/21	24.6s
#66	KAT-Coder-Pro V2.5 low	Kwaipilot	10	7.4	$0.387	11/22	19.5s
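The table is sorted by Tests Correct ascending, yet the raw correct counts run 10, 9, 10, 11 down the rows. The order shown is consistent with sorting by the fraction of tests passed (10/22 < 9/19 < 10/21 < 11/22), with ties kept in rank order. The Python sketch below reproduces that order under this assumption; the tie-break rule is inferred from the rows shown, not documented by the page, and `ROWS`, `sort_key`, `sorted_by_tests_correct` and `wrong_tests` are hypothetical helpers.

```python
from fractions import Fraction

# Rows from the table above, listed in rank order: (rank, model, tests correct, total tests).
ROWS = [
    (66, "KAT-Coder-Pro V2.5 low", 11, 22),
    (82, "Mercury 2 medium", 10, 22),
    (86, "DeepSeek V4 Pro none", 10, 22),
    (96, "LongCat 2.0 low", 10, 22),
    (105, "Qwen3.6 27B medium", 10, 22),
    (113, "Qwen3.5 Plus 2026-02-15 none", 10, 22),
    (121, "Gemma 4 31B none", 10, 22),
    (123, "GPT-5.6 Luna low", 10, 22),
    (126, "Gemini 3.1 Flash Lite minimal", 10, 22),
    (129, "Inkling low", 10, 22),
    (140, "Mimo V2 Omni medium", 10, 21),
    (159, "Hy3 preview low", 10, 21),
    (184, "Qwen3.6 Plus Preview medium", 9, 19),
    (194, "Grok 4.1 Fast medium", 9, 19),
    (195, "Laguna M.1 medium", 9, 19),
]


def sort_key(row):
    """Assumed ordering: fraction of tests correct (ascending), ties broken by rank."""
    rank, _model, correct, total = row
    return (Fraction(correct, total), rank)


def sorted_by_tests_correct(rows):
    return sorted(rows, key=sort_key)


def wrong_tests(row):
    """Tests not passed for any reason: total tests minus tests correct."""
    _rank, _model, correct, total = row
    return total - correct


if __name__ == "__main__":
    for row in sorted_by_tests_correct(ROWS):
        rank, model, correct, total = row
        print(f"#{rank}\t{model}\t{correct}/{total}\twrong={wrong_tests(row)}")
```

`Fraction` keeps the comparison exact, which matters for close ratios such as 9/19 and 10/21. Note that Wrong Tests (all failed tests) can exceed the Wrong answer Count, since the latter counts only one failure type.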

Charts (not reproduced): Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg).