Wrong answer Failure Ranking

See which AI models give wrong answers most often, so you can spot reliability risks before choosing one.

Models Shown

Total Failures: 1558

Categories

Category	Wrong answer Count
Domain specific	412
Anti-AI Tricks	293
Coding	252
Puzzle Solving	201
Trivia	168
Combined	68
Instructions following	61
General Intelligence	59
Data parsing and extraction	41
Tool Calling	3
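
The per-category counts can be checked against the reported total and turned into shares. The snippet below is a minimal sketch: the dictionary name `category_counts` and the helper `category_shares` are illustrative, not part of the leaderboard itself, and the figures are copied from the category list above.

```python
# Wrong answer counts per category, as listed on the page.
category_counts = {
    "Domain specific": 412,
    "Anti-AI Tricks": 293,
    "Coding": 252,
    "Puzzle Solving": 201,
    "Trivia": 168,
    "Combined": 68,
    "Instructions following": 61,
    "General Intelligence": 59,
    "Data parsing and extraction": 41,
    "Tool Calling": 3,
}


def category_shares(counts):
    """Return each category's share of all wrong answers, in percent."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


total_failures = sum(category_counts.values())
shares = category_shares(category_counts)

print(total_failures)  # matches the "Total Failures" figure of 1558
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

The category counts add up to the Total Failures figure, so each wrong answer appears to be counted in exactly one category.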

209/209

Rank	Model	Company	Wrong answer Count	Score	Total Cost	Tests Correct	Wrong Tests	Response Time (avg)
#124	Qwen3.6 Flash none	Qwen	12	6.1	$0.062	7/22	15	3.74s
#126	Qwen3.5 Plus 2026-04-20 none	Qwen	12	6.1	$0.122	8/22	14	13.6s
#127	Qwen3.5-35B-A3B none	Qwen	12	6.1	$0.106	7/22	15	12.7s
#129	Nemotron 3 Ultra none	NVIDIA	12	6.1	$0.095	8/22	14	3.87s
#141	GLM 5 none	Z.ai	12	5.7	$0.041	9/21	12	4.03s
#150	DeepSeek V4 Flash none	DeepSeek	12	5.6	$0.044	5/22	17	36.8s
#162	Ling-2.6-1T none	Inclusionai	12	5.3	$0.016	4/22	18	8.58s
#167	Mistral Small 4 medium	Mistral	12	5.1	$0.096	5/22	17	10.8s
#171	North Mini Code none	Cohere	12	5.1	$0.000	4/22	18	29.9s
#183	Trinity Large Preview none	Arcee AI	12	4.8	$0.008	4/21	17	2.98s
#87	GPT-5.5 none	OpenAI	11	6.9	$0.544	11/22	11	2.36s
#102	Laguna XS 2.1 medium	Poolside	11	6.5	$0.068	9/22	13	47.9s
#122	Gemini 3.1 Flash Lite none	Google	11	6.1	$0.046	9/22	13	1.75s
#132	GPT-5.6 Terra none	OpenAI	11	6.0	$0.349	8/22	14	1.65s
#138	Kimi K2.6 none	Moonshot AI	11	5.8	$0.184	7/22	15	19.6s
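
The Tests Correct column implies how many tests each model failed overall, and that number can be compared with its Wrong answer Count. The sketch below assumes that wrong answers are a subset of all failed tests, with the remainder being other failure types; the page does not state this, although every row shown is consistent with it. The names `Row` and `breakdown` are hypothetical helpers for illustration.

```python
from dataclasses import dataclass


@dataclass
class Row:
    model: str
    wrong_answer_count: int
    tests_correct: str  # formatted as "correct/total", e.g. "7/22"


def breakdown(row):
    """Derive failure figures for one leaderboard row."""
    correct, total = (int(part) for part in row.tests_correct.split("/"))
    wrong_tests = total - correct
    return {
        "total_tests": total,
        "wrong_tests": wrong_tests,
        # Assumption: failed tests that are not wrong answers are other failure types.
        "other_failures": wrong_tests - row.wrong_answer_count,
        "wrong_answer_rate": row.wrong_answer_count / total,
    }


rows = [
    Row("Qwen3.6 Flash none", 12, "7/22"),
    Row("GLM 5 none", 12, "9/21"),
    Row("GPT-5.5 none", 11, "11/22"),
]

for row in rows:
    print(row.model, breakdown(row))
```

Under that assumption, GPT-5.5 and GLM 5 fail only through wrong answers in this sample, while Qwen3.6 Flash has three failed tests of some other kind.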

Wrong answer Failures

[Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg)]