Combined x Wrong answer Ranking

See which AI models are most likely to hit Wrong answer on Combined, so you can spot weak points faster. Sorted by Response Time (avg), ascending.

Models Shown

Total Failures

Most Affected Model

Laguna Xs.2 1

Failure Reasons

Invalid tool call: 91, Wrong answer: 68, No answer: 29, API error: 26, Timed out: 5, Did not follow instructions: 1, Extra formatting: 1

Categories

Domain specific: 412, Anti-AI Tricks: 293, Coding: 252, Puzzle Solving: 201, Trivia: 168, Combined: 68, Instructions following: 61, General Intelligence: 59, Data parsing and extraction: 41, Tool Calling: 3

63/63

Rank	Model	Company	Wrong answer Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#165	Mistral Small 4 none	Mistral	2	3.0	$0.022	0/2	7.44s
#120	Gemini 3.1 Flash Lite minimal	Google	1	3.0	$0.047	0/2	7.75s
#83	GPT-5.6 Sol none	OpenAI	1	6.5	$0.524	1/2	8.37s
#87	GPT-5.5 none	OpenAI	1	6.5	$0.544	1/2	8.90s
#183	Trinity Large Preview none	Arcee AI	1	1.5	$0.008	0/1	8.91s
#139	GPT-5.4 none	OpenAI	2	3.0	$0.397	0/2	9.26s
#122	Gemini 3.1 Flash Lite none	Google	1	3.0	$0.046	0/2	9.49s
#146	Owl Alpha medium	Openrouter	1	1.5	$0.000	0/1	10.0s
#61	Gemini 3 Flash Preview low	Google	2	3.0	$0.177	0/2	10.2s
#133	Gemini 3 PRO Preview medium	Google	1	1.5	$0.385	0/1	10.4s
#160	Laguna XS 2.1 none	Poolside	1	3.0	$0.008	0/2	10.4s
#89	Gemini 3 Flash Preview none	Google	1	3.8	$0.085	0/2	12.4s
#20	Grok 4.5 low	X AI	1	6.5	$0.935	1/2	12.8s
#117	GPT-5.6 Luna low	OpenAI	1	2.8	$0.249	0/2	13.7s
#187	Qwen3 Coder Next medium	Qwen	1	3.0	$0.032	0/2	14.6s

Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.
