Wrong answer Failure Ranking

See which AI models hit the "Wrong answer" failure most often, so you can spot reliability risks before choosing one.

Models Shown

Total Failures

1558

Most Affected Model

Gemini 3 Flash Preview 1

Categories

Domain specific: 412
Anti-AI Tricks: 293
Coding: 252
Puzzle Solving: 201
Trivia: 168
Combined: 68
Instructions following: 61
General Intelligence: 59
Data parsing and extraction: 41
Tool Calling: 3
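The per-category counts above add up to the Total Failures figure (1558). As a minimal sketch, the snippet below checks that sum and computes each category's share; the dictionary name `category_failures` and the helper `category_shares` are hypothetical names chosen for illustration, not part of the leaderboard.

```python
# Category counts copied from the "Categories" list above.
# `category_failures` is a hypothetical name used only for this sketch.
category_failures = {
    "Domain specific": 412,
    "Anti-AI Tricks": 293,
    "Coding": 252,
    "Puzzle Solving": 201,
    "Trivia": 168,
    "Combined": 68,
    "Instructions following": 61,
    "General Intelligence": 59,
    "Data parsing and extraction": 41,
    "Tool Calling": 3,
}


def category_shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each category's fraction of the overall failure count."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


total_failures = sum(category_failures.values())  # matches the page's 1558
shares = category_shares(category_failures)

if __name__ == "__main__":
    print(f"Total failures: {total_failures}")
    for name, share in shares.items():
        print(f"{name}: {share:.1%}")
```

Under these numbers, Domain specific is the largest bucket and Tool Calling the smallest.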

209/209

Rank	Model	Company	Wrong answer Count	Score	Total Cost	Tests Correct	Wrong Tests	Response Time (avg)
#14	Claude Opus 4.8 medium	Anthropic	3	8.8	$1.931	18/22	4	12.5s
#15	Claude Opus 4.7 medium	Anthropic	3	8.7	$1.477	18/22	4	7.61s
#21	GPT-5.2 medium	OpenAI	3	8.4	$0.951	14/22	8	22.6s
#31	GLM 5.2 high	Z.ai	3	8.0	$0.970	14/22	8	62.7s
#38	GLM 5.2 medium	Z.ai	3	7.8	$0.222	15/21	6	23.3s
#42	GLM 5 medium	Z.ai	3	7.7	$0.307	15/21	6	33.5s
#43	Claude Opus 4.6 medium	Anthropic	3	7.7	$3.059	13/22	9	34.3s
#47	MiniMax M3 medium	Minimax	3	7.6	$0.286	12/22	10	75.0s
#68	Kimi K2.6 medium	Moonshot AI	3	7.2	$1.036	12/22	10	110.0s
#79	Gemini 3.5 Flash none	Google	3	7.0	$1.079	15/22	7	9.93s
#84	MiMo-V2.5-Pro medium	Xiaomi	3	6.9	$0.187	12/22	10	33.9s
#94	Claude Opus 4.7 none	Anthropic	3	6.6	$0.505	16/19	3	3.02s
#95	Gemma 4 26B A4B medium	Google	3	6.6	$0.089	14/22	8	103.8s
#100	Hy3 preview medium	Tencent	3	6.5	$0.018	14/21	7	16.3s
#131	Grok 4.20 Beta medium	X AI	3	6.0	$0.750	14/18	4	9.75s
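In the rows above, the Wrong Tests figure equals the denominator of Tests Correct minus its numerator (for example, 18/22 gives 4 wrong tests). A minimal sketch of that arithmetic follows; the function names `parse_tests_correct`, `wrong_tests`, and `accuracy` are hypothetical helpers for illustration, and the input is assumed to be a "correct/total" string as shown in the table.

```python
def parse_tests_correct(cell: str) -> tuple[int, int]:
    """Split a 'correct/total' cell such as '18/22' into two integers."""
    correct_text, total_text = cell.strip().split("/")
    return int(correct_text), int(total_text)


def wrong_tests(cell: str) -> int:
    """Number of tests not answered correctly: total minus correct."""
    correct, total = parse_tests_correct(cell)
    return total - correct


def accuracy(cell: str) -> float:
    """Fraction of tests answered correctly."""
    correct, total = parse_tests_correct(cell)
    return correct / total


if __name__ == "__main__":
    # Values taken from the Tests Correct column above.
    for model, cell in [
        ("Claude Opus 4.8 medium", "18/22"),
        ("Claude Opus 4.7 none", "16/19"),
        ("Grok 4.20 Beta medium", "14/18"),
    ]:
        print(f"{model}: {wrong_tests(cell)} wrong, {accuracy(cell):.1%} correct")
```

Because models were run on different numbers of tests (18 to 22 in these rows), comparing accuracy rather than raw wrong-test counts may be the fairer view.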

Wrong answer Failures

Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg)