Combined x Wrong answer Ranking

See which AI models most often fail with a Wrong answer on Combined tests, so you can spot weak points faster.

Failure Reasons

Invalid tool call: 91; Wrong answer: 68; No answer: 29; API error: 26; Timed out: 5; Did not follow instructions: 1; Extra formatting: 1

Categories

Domain specific: 412; Anti-AI Tricks: 293; Coding: 252; Puzzle Solving: 201; Trivia: 168; Combined: 68; Instructions following: 61; General Intelligence: 59; Data parsing and extraction: 41; Tool Calling: 3

63/63

Rank	Model	Company	Wrong answer Count	Category Score	Total Cost	Tests Correct	Wrong Tests	Response Time (avg)
#133	Gemini 3 PRO Preview medium	Google	1	1.5	$0.385	0/1	1	10.4s
#136	GPT-5.4 Mini none	OpenAI	1	6.5	$0.095	1/2	1	6.22s
#138	Kimi K2.6 none	Moonshot AI	1	3.0	$0.184	0/2	2	77.8s
#141	GLM 5 none	Z.ai	1	1.5	$0.041	0/1	1	4.98s
#142	Qwen3.5-122B-A10B none	Qwen	1	5.2	$0.247	0/2	2	129.3s
#145	GLM 5V Turbo none	Z.ai	1	1.5	$0.052	0/1	1	6.51s
#146	Owl Alpha medium	Openrouter	1	1.5	$0.000	0/1	1	10.0s
#147	Mimo V2 PRO none	Xiaomi	1	1.5	$0.045	0/1	1	6.58s
#148	Owl Alpha none	Openrouter	1	1.5	$0.000	0/1	1	21.7s
#155	Kimi K2.5 none	Moonshot AI	1	2.8	$0.127	0/2	2	61.0s
#156	Gemma 4 26B A4B none	Google	1	3.0	$0.015	0/2	2	37.2s
#157	Mimo V2 Omni none	Xiaomi	1	1.5	$0.021	0/1	1	5.96s
#159	GPT-5.6 Luna none	OpenAI	1	3.2	$0.142	0/2	2	6.68s
#160	Laguna XS 2.1 none	Poolside	1	3.0	$0.008	0/2	2	10.4s
#162	Ling-2.6-1T none	Inclusionai	1	6.5	$0.016	1/2	1	23.8s

Charts (Combined: Wrong answer): Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost