Combined x Wrong answer Ranking

See which AI models most often return a wrong answer on Combined tests, so you can spot weak points faster. Rows are sorted by average response time, descending.

Models Shown

Total Failures

Most Affected Model: Qwen3.5-Flash 1

Failure Reasons

Invalid tool call 91, Wrong answer 69, No answer 32, API error 26, Timed out 5, Did not follow instructions 1, Extra formatting 1

Categories

Domain specific 421, Anti-AI Tricks 293, Coding 259, Puzzle Solving 204, Trivia 172, Combined 69, General Intelligence 62, Instructions following 61, Data parsing and extraction 41, Tool Calling 3

64/64

Rank	Model	Company	Wrong answer Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#183	Nemotron 3 Super none	NVIDIA	2	3.0	$0.008	0/2	18.2s
#204	Laguna Xs.2 medium	Poolside	1	1.5	$0.015	0/1	15.9s
#202	Hunter Alpha none	OpenRouter	1	1.5	$0.000	0/1	15.2s
#186	GPT-5.4 Nano none	OpenAI	1	3.0	$0.041	0/2	14.7s
#193	Qwen3 Coder Next medium	Qwen	1	3.0	$0.032	0/2	14.6s
#123	GPT-5.6 Luna low	OpenAI	1	2.8	$0.249	0/2	13.7s
#23	Grok 4.5 low	X AI	1	6.5	$0.935	1/2	12.8s
#93	Gemini 3 Flash Preview none	Google	1	3.8	$0.085	0/2	12.4s
#166	Laguna XS 2.1 none	Poolside	1	3.0	$0.008	0/2	10.4s
#139	Gemini 3 PRO Preview medium	Google	1	1.5	$0.385	0/1	10.4s
#65	Gemini 3 Flash Preview low	Google	2	3.0	$0.177	0/2	10.2s
#152	Owl Alpha medium	OpenRouter	1	1.5	$0.000	0/1	10.0s
#128	Gemini 3.1 Flash Lite none	Google	1	3.0	$0.046	0/2	9.49s
#145	GPT-5.4 none	OpenAI	2	3.0	$0.397	0/2	9.26s
#189	Trinity Large Preview none	Arcee AI	1	1.5	$0.008	0/1	8.91s

Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.
