Wrong answer Failure Ranking

See which AI models give wrong answers most often, so you can spot reliability risks before choosing one. Models are sorted by Score, highest first.

Models Shown: 219/219

Total Failures: 1642

Most Affected Model: Gemini 3.6 Flash 1

Categories

Domain specific: 433
Anti-AI Tricks: 306
Coding: 266
Puzzle Solving: 214
Trivia: 176
Combined: 71
General Intelligence: 66
Instructions following: 65
Data parsing and extraction: 41
Tool Calling: 4
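
The per-category counts above can be cross-checked against the Total Failures figure. The following is a minimal sketch, assuming the counts are copied by hand from this page; the names `category_counts` and `reported_total` are illustrative and not part of any published API.

```python
# Counts copied from the Categories list on this page (hand-entered, not fetched).
category_counts = {
    "Domain specific": 433,
    "Anti-AI Tricks": 306,
    "Coding": 266,
    "Puzzle Solving": 214,
    "Trivia": 176,
    "Combined": 71,
    "General Intelligence": 66,
    "Instructions following": 65,
    "Data parsing and extraction": 41,
    "Tool Calling": 4,
}

# Total Failures as reported on the page.
reported_total = 1642

computed_total = sum(category_counts.values())
print(f"computed={computed_total} matches_reported={computed_total == reported_total}")

# Share of all wrong-answer failures per category, largest first.
for name, count in sorted(category_counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {count / computed_total:.1%}")
```

The same dictionary can be reused to see which categories dominate; for instance, Domain specific accounts for roughly a quarter of all recorded failures.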

Rank	Model	Company	Wrong answer Count	Score	Total Cost	Tests Correct	Response Time (avg)
#31	Gemini 3.5 Flash-Lite high	Google	6	8.1	$0.584	14/22	9.48s
#32	Inkling high	Thinkingmachines	4	8.0	$1.006	15/22	64.2s
#33	Step 3.7 Flash medium	Stepfun	5	8.0	$0.515	14/22	26.4s
#34	GPT-5.2 Chat none	OpenAI	6	8.0	$0.604	14/22	7.65s
#35	GLM 5.2 high	Z.ai	3	8.0	$0.796	14/22	62.7s
#36	Inkling medium	Thinkingmachines	4	8.0	$0.391	15/22	16.2s
#38	GPT-5.6 Terra high	OpenAI	7	8.0	$1.055	14/22	11.3s
#39	Seed-2.0-Lite medium	Bytedance Seed	5	7.9	$0.234	14/22	48.5s
#40	Qwen3.7 Plus medium	Qwen	5	7.9	$0.267	15/22	51.5s
#41	Qwen3.6 Plus medium	Qwen	5	7.8	$0.405	15/22	43.1s
#42	GLM 5.2 medium	Z.ai	3	7.8	$0.182	15/21	23.3s
#43	GPT-5.6 Terra medium	OpenAI	8	7.8	$0.676	14/22	7.11s
#44	Claude Sonnet 4.6 medium	Anthropic	4	7.8	$2.057	14/22	25.9s
#45	Claude Opus 4.8 low	Anthropic	4	7.8	$2.077	16/22	12.7s
#46	GLM 5 medium	Z.ai	3	7.7	$0.307	15/21	33.5s
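
The Tests Correct column is written as correct/total, so the number of wrong tests and a pass rate follow by subtraction and division. The snippet below is a small sketch using three rows copied from the table; `parse_tests_correct` is a hypothetical helper, not something the site provides. Note that the Wrong answer Count can be lower than the number of wrong tests; one plausible reading, which is an assumption here, is that some failed tests are classified under other failure types.

```python
def parse_tests_correct(cell: str) -> tuple[int, int]:
    """Split a 'correct/total' cell such as '14/22' into two integers."""
    correct, total = cell.split("/")
    return int(correct), int(total)


# (model, Tests Correct, Wrong answer Count), copied from the table above.
rows = [
    ("Gemini 3.5 Flash-Lite high", "14/22", 6),
    ("GLM 5.2 medium", "15/21", 3),
    ("Claude Opus 4.8 low", "16/22", 4),
]

for model, cell, wrong_answer_count in rows:
    correct, total = parse_tests_correct(cell)
    wrong_tests = total - correct   # all failed tests, of any failure type
    pass_rate = correct / total     # fraction of tests answered correctly
    print(
        f"{model}: wrong_tests={wrong_tests}, "
        f"wrong_answer_count={wrong_answer_count}, pass_rate={pass_rate:.1%}"
    )
```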

Wrong answer Failures

Charts (not reproduced here): Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg).