Wrong answer Failure Ranking

See which AI models most often fail with a Wrong answer, so you can spot reliability risks before choosing one. Sorted by Response Time (avg), descending.

Models Shown

Total Failures: 1585

Most Affected Model: Step 3.5 Flash 4

Categories

Domain specific: 421
Anti-AI Tricks: 293
Coding: 259
Puzzle Solving: 204
Trivia: 172
Combined: 69
General Intelligence: 62
Instructions following: 61
Data parsing and extraction: 41
Tool Calling: 3
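The category counts above can be cross-checked against the Total Failures figure and turned into percentage shares. The following is a minimal sketch, not part of the original page: the counts are copied from the Categories line, while the names `category_failures`, `TOTAL_FAILURES`, and `category_shares` are hypothetical and chosen only for illustration.

```python
# Category counts copied from the "Categories" line on this page.
category_failures = {
    "Domain specific": 421,
    "Anti-AI Tricks": 293,
    "Coding": 259,
    "Puzzle Solving": 204,
    "Trivia": 172,
    "Combined": 69,
    "General Intelligence": 62,
    "Instructions following": 61,
    "Data parsing and extraction": 41,
    "Tool Calling": 3,
}

# "Total Failures" figure shown on the page.
TOTAL_FAILURES = 1585


def category_shares(counts):
    """Return each category's share of all counted failures, in percent."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


shares = category_shares(category_failures)

# Check that the per-category counts add up to the page total.
print(sum(category_failures.values()) == TOTAL_FAILURES)
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

The per-category counts sum to the reported total, which suggests each Wrong answer failure is attributed to exactly one category.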

215/215

Rank	Model	Company	Wrong answer Count	Score	Total Cost	Tests Correct	Wrong Tests	Response Time (avg)
#161	Kimi K2.5 none	Moonshot AI	15	5.5	$0.127	6/22	16	19.2s
#48	GPT-5.6 Luna high	OpenAI	7	7.7	$1.017	15/22	7	18.7s
#179	DeepSeek V3.2 none	DeepSeek	7	5.0	$0.054	6/22	16	18.3s
#20	Claude Fable 5 medium	Anthropic	2	8.6	$3.478	17/22	5	17.2s
#213	Nemotron 3 Nano Omni 30b A3b Reasoning medium	NVIDIA	7	3.4	$0.000	4/19	15	17.1s
#16	GPT-5.3-Codex medium	OpenAI	4	8.9	$0.920	16/22	6	17.0s
#110	Gemini 3.1 Flash Lite Preview low	Google	7	6.5	$0.646	13/22	9	16.7s
#106	Hy3 preview medium	Tencent	3	6.5	$0.018	14/21	7	16.3s
#111	Gemini 3.1 Flash Lite low	Google	9	6.5	$0.621	12/22	10	16.3s
#36	Inkling medium	Thinkingmachines	4	8.0	$0.391	15/22	7	16.2s
#150	KAT-Coder-Air V2.5 high	Kwaipilot	9	5.6	$0.077	7/22	15	15.9s
#23	Grok 4.5 low	X AI	6	8.4	$0.935	16/22	6	15.6s
#181	Qwen3.6 Plus Preview medium	Qwen	2	4.9	$0.000	9/19	10	15.2s
#4	Gemini 3.5 Flash high	Google	1	9.5	$1.976	20/22	2	15.1s
#2	Gemini 3.6 Flash high	Google	1	9.7	$1.785	21/22	1	14.9s
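In every row above, Wrong Tests equals Total Tests minus Tests Correct, and Wrong answer Count never exceeds Wrong Tests. The sketch below illustrates that arithmetic for three rows copied from the table. It assumes, without confirmation from the page, that Wrong Tests counts every failed test regardless of failure type, so the ratio is the portion of a model's failures classified as Wrong answer. The `Row` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Row:
    model: str
    wrong_answer_count: int  # "Wrong answer Count" column
    tests_correct: int       # numerator of "Tests Correct"
    total_tests: int         # denominator of "Tests Correct"

    @property
    def wrong_tests(self) -> int:
        # Matches the "Wrong Tests" value shown for each row.
        return self.total_tests - self.tests_correct

    @property
    def wrong_answer_share(self) -> float:
        # Assumption: Wrong Tests covers all failure types, so this is the
        # fraction of failed tests classified as "Wrong answer".
        if self.wrong_tests == 0:
            return 0.0
        return self.wrong_answer_count / self.wrong_tests


# Three rows copied from the table above.
rows = [
    Row("Kimi K2.5 none", 15, 6, 22),
    Row("DeepSeek V3.2 none", 7, 6, 22),
    Row("Gemini 3.6 Flash high", 1, 21, 22),
]

for row in rows:
    print(f"{row.model}: {row.wrong_answer_count}/{row.wrong_tests} "
          f"failed tests are Wrong answer ({row.wrong_answer_share:.0%})")
```

Under that assumption, two models with the same number of failed tests (Kimi K2.5 and DeepSeek V3.2, 16 each) can differ sharply in how many of those failures are Wrong answer rather than another failure type.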

Wrong answer Failures

[Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg)]