Combined x Wrong answer Ranking

See which AI models most often return a wrong answer in the Combined category, so you can spot weak points faster.

Failure Reasons

Invalid tool call: 91, Wrong answer: 68, No answer: 29, API error: 26, Timed out: 5, Did not follow instructions: 1, Extra formatting: 1

Categories

Domain specific: 412, Anti-AI Tricks: 293, Coding: 252, Puzzle Solving: 201, Trivia: 168, Combined: 68, Instructions following: 61, General Intelligence: 59, Data parsing and extraction: 41, Tool Calling: 3

63/63

Rank	Model	Company	Wrong answer Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#61	Gemini 3 Flash Preview low	Google	2	3.0	$0.177	0/2	10.2s
#139	GPT-5.4 none	OpenAI	2	3.0	$0.397	0/2	9.26s
#165	Mistral Small 4 none	Mistral	2	3.0	$0.022	0/2	7.44s
#177	Nemotron 3 Super none	NVIDIA	2	3.0	$0.008	0/2	18.2s
#189	Mercury 2 none	Inception	2	3.0	$0.030	0/2	2.56s
#20	Grok 4.5 low	X AI	1	6.5	$0.935	1/2	12.8s
#52	Kimi K2.7 Code medium	Moonshot AI	1	7.3	$0.751	1/2	66.0s
#59	Qwen3.7 Max none	Qwen	1	6.5	$0.197	1/2	37.2s
#83	GPT-5.6 Sol none	OpenAI	1	6.5	$0.524	1/2	8.37s
#87	GPT-5.5 none	OpenAI	1	6.5	$0.544	1/2	8.90s
#89	Gemini 3 Flash Preview none	Google	1	3.8	$0.085	0/2	12.4s
#92	KAT-Coder-Pro V2.5 none	Kwaipilot	1	4.1	$0.476	0/2	183.1s
#98	Qwen3.6 Max Preview none	Qwen	1	6.5	$0.231	1/2	61.6s
#103	Qwen3.5-27B none	Qwen	1	6.4	$0.090	1/2	39.4s
#104	Gemini 3.1 Flash Lite Preview low	Google	1	3.0	$0.646	0/2	160.6s


Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost
