Combined x Wrong answer Ranking

See which AI models are most likely to hit Wrong answer on Combined, so you can spot weak points faster. Sort by: Tests Correct ↑.

Models Shown

Total Failures

Most Affected Model

Gemini 3 Flash Preview 2

Failure Reasons

Invalid tool call: 91, Wrong answer: 69, No answer: 32, API error: 26, Timed out: 5, Did not follow instructions: 1, Extra formatting: 1

Categories

Domain specific: 421, Anti-AI Tricks: 293, Coding: 259, Puzzle Solving: 204, Trivia: 172, Combined: 69, General Intelligence: 62, Instructions following: 61, Data parsing and extraction: 41, Tool Calling: 3

64/64

Rank	Model	Company	Wrong answer Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#133	Qwen3.5-35B-A3B none	Qwen	1	3.8	$0.106	0/2	128.3s
#135	Nemotron 3 Ultra none	NVIDIA	1	3.0	$0.095	0/2	21.1s
#139	Gemini 3 PRO Preview medium	Google	1	1.5	$0.385	0/1	10.4s
#144	Kimi K2.6 none	Moonshot AI	1	3.0	$0.184	0/2	77.8s
#145	GPT-5.4 none	OpenAI	2	3.0	$0.397	0/2	9.26s
#147	GLM 5 none	Z.ai	1	1.5	$0.041	0/1	4.98s
#148	Qwen3.5-122B-A10B none	Qwen	1	5.2	$0.247	0/2	129.3s
#151	GLM 5V Turbo none	Z.ai	1	1.5	$0.052	0/1	6.51s
#152	Owl Alpha medium	Openrouter	1	1.5	$0.000	0/1	10.0s
#153	Mimo V2 PRO none	Xiaomi	1	1.5	$0.045	0/1	6.58s
#154	Owl Alpha none	Openrouter	1	1.5	$0.000	0/1	21.7s
#161	Kimi K2.5 none	Moonshot AI	1	2.8	$0.127	0/2	61.0s
#162	Gemma 4 26B A4B none	Google	1	3.0	$0.015	0/2	37.2s
#163	Mimo V2 Omni none	Xiaomi	1	1.5	$0.021	0/1	5.96s
#165	GPT-5.6 Luna none	OpenAI	1	3.2	$0.142	0/2	6.68s


Top Models by Wrong answer Count

Wrong answer Count vs Score

Top Models by Response Time (avg)

Top Models by Estimated Wasted Cost
