Combined x Wrong answer Ranking

See which AI models most often return a wrong answer on Combined tests, so you can spot weak points faster.

Failure Reasons

Invalid tool call: 91
Wrong answer: 68
No answer: 29
API error: 26
Timed out: 5
Did not follow instructions: 1
Extra formatting: 1

Categories

Domain specific: 412
Anti-AI Tricks: 293
Coding: 252
Puzzle Solving: 201
Trivia: 168
Combined: 68
Instructions following: 61
General Intelligence: 59
Data parsing and extraction: 41
Tool Calling: 3

63/63

Rank	Model	Company	Wrong answer Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#164	Inkling none	Thinkingmachines	1	2.9	$0.147	0/2	25.7s
#166	Qwen3 Coder Next none	Qwen	1	3.0	$0.025	0/2	30.9s
#167	Mistral Small 4 medium	Mistral	1	3.0	$0.096	0/2	32.4s
#168	MiMo-V2.5 none	Xiaomi	1	3.0	$0.025	0/2	28.9s
#170	GLM 5 Turbo none	Z.ai	1	1.5	$0.047	0/1	4.89s
#174	GPT-4o-mini none	OpenAI	1	3.0	$0.010	0/2	6.32s
#180	GPT-5.4 Nano none	OpenAI	1	3.0	$0.041	0/2	14.7s
#182	KAT-Coder-Air V2.5 none	Kwaipilot	1	3.8	$0.067	0/2	73.0s
#183	Trinity Large Preview none	Arcee AI	1	1.5	$0.008	0/1	8.91s
#187	Qwen3 Coder Next medium	Qwen	1	3.0	$0.032	0/2	14.6s
#193	Elephant Alpha none	OpenRouter	1	1.5	$0.000	0/1	3.81s
#195	Elephant Alpha medium	OpenRouter	1	1.5	$0.000	0/1	3.70s
#196	Hunter Alpha none	OpenRouter	1	1.5	$0.000	0/1	15.2s
#198	Laguna Xs.2 medium	Poolside	1	1.5	$0.015	0/1	15.9s
#199	Hy3 preview none	Tencent	1	1.5	$0.003	0/1	35.8s

Top Models by Wrong answer Count

Wrong answer Count vs Score

Top Models by Response Time (avg)

Top Models by Estimated Wasted Cost
