Combined x Wrong answer Ranking

See which AI models are most likely to hit Wrong answer on Combined, so you can spot weak points faster.

Failure Reasons

Invalid tool call: 91, Wrong answer: 68, No answer: 29, API error: 26, Timed out: 5, Did not follow instructions: 1, Extra formatting: 1

Categories

Domain specific: 412, Anti-AI Tricks: 293, Coding: 252, Puzzle Solving: 201, Trivia: 168, Combined: 68, Instructions following: 61, General Intelligence: 59, Data parsing and extraction: 41, Tool Calling: 3

63/63

Rank	Model	Company	Wrong answer Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#105	Gemini 3.1 Flash Lite low	Google	1	3.2	$0.621	0/2	161.2s
#106	Gemini 3.1 Flash Lite Preview none	Google	1	3.0	$0.052	0/2	6.23s
#107	Qwen3.5 Plus 2026-02-15 none	Qwen	1	6.5	$0.073	1/2	64.8s
#109	Mimo V2 PRO medium	Xiaomi	1	2.3	$0.333	0/1	64.7s
#111	LongCat 2.0 none	Meituan	1	6.5	$0.044	1/2	28.4s
#115	Gemma 4 31B none	Google	1	3.8	$0.035	0/2	30.0s
#116	Seed-2.0-Lite none	Bytedance Seed	1	3.0	$0.066	0/2	25.6s
#117	GPT-5.6 Luna low	OpenAI	1	2.8	$0.249	0/2	13.7s
#118	Gemini 2.5 Flash none	Google	1	3.0	$0.017	0/2	61.2s
#120	Gemini 3.1 Flash Lite minimal	Google	1	3.0	$0.047	0/2	7.75s
#122	Gemini 3.1 Flash Lite none	Google	1	3.0	$0.046	0/2	9.49s
#125	Qwen3.5-Flash none	Qwen	1	2.9	$0.073	0/2	243.6s
#126	Qwen3.5 Plus 2026-04-20 none	Qwen	1	6.4	$0.122	1/2	109.7s
#127	Qwen3.5-35B-A3B none	Qwen	1	3.8	$0.106	0/2	128.3s
#129	Nemotron 3 Ultra none	NVIDIA	1	3.0	$0.095	0/2	21.1s

Charts (not reproduced here): Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.
