Combined Model Ranking

See which AI models perform best on Combined, which ones stay reliable, and where the biggest gaps appear.

Models Shown

Average Combined Score

5.5

Best Model

Gemini 3 PRO Preview 1.5

Failure Reasons

Invalid tool call: 96
Wrong answer: 71
No answer: 33
API error: 26
Timed out: 5
Did not follow instructions: 1
Extra formatting: 1
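
As a rough illustration, the failure counts above can be converted into shares of all recorded failures. This is only a sketch: it assumes each count is a number of failed tests and that every failure carries exactly one reason, which the page does not state. The names `failure_counts` and `failure_shares` are hypothetical and are not part of the leaderboard.

```python
# Failure counts copied from the "Failure Reasons" summary.
# Assumption: each failed test is counted under exactly one reason.
failure_counts = {
    "Invalid tool call": 96,
    "Wrong answer": 71,
    "No answer": 33,
    "API error": 26,
    "Timed out": 5,
    "Did not follow instructions": 1,
    "Extra formatting": 1,
}


def failure_shares(counts):
    """Return each reason's share of all recorded failures, in percent."""
    total = sum(counts.values())
    return {reason: round(100 * n / total, 1) for reason, n in counts.items()}


total_failures = sum(failure_counts.values())
shares = failure_shares(failure_counts)

# Print reasons from most to least frequent.
for reason, pct in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{reason}: {pct}%")
```

Under these assumptions, invalid tool calls and wrong answers together make up roughly seven in ten recorded failures.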

220/220

Rank	Model	Company	Combined Score	Score	Total Cost	Tests Correct	Response Time (avg)
#92	Gemini 3.5 Flash minimal	Google	3.0	6.8	$0.300	0/2	14.4s
#94	Qwen3.6 35B A3B medium	Qwen	3.0	6.7	$0.746	0/2	817.6s
#110	Gemini 3.1 Flash Lite Preview low	Google	3.0	6.5	$0.646	0/2	160.6s
#112	Gemini 3.1 Flash Lite Preview none	Google	3.0	6.4	$0.052	0/2	6.23s
#122	Seed-2.0-Lite none	Bytedance Seed	3.0	6.2	$0.066	0/2	25.6s
#124	Gemini 2.5 Flash none	Google	3.0	6.2	$0.017	0/2	61.2s
#126	Gemini 3.1 Flash Lite minimal	Google	3.0	6.1	$0.047	0/2	7.75s
#128	Gemini 3.1 Flash Lite none	Google	3.0	6.1	$0.046	0/2	9.49s
#135	Nemotron 3 Ultra none	NVIDIA	3.0	6.1	$0.095	0/2	21.1s
#144	Kimi K2.6 none	Moonshot AI	3.0	5.8	$0.184	0/2	77.8s
#145	GPT-5.4 none	OpenAI	3.0	5.8	$0.397	0/2	9.26s
#160	MiMo-V2.5-Pro none	Xiaomi	3.0	5.5	$0.068	0/2	28.3s
#162	Gemma 4 26B A4B none	Google	3.0	5.5	$0.015	0/2	37.2s
#168	Laguna XS 2.1 none	Poolside	3.0	5.3	$0.008	0/2	10.4s
#173	Mistral Small 4 none	Mistral	3.0	5.1	$0.022	0/2	7.44s

Combined Ranking

Charts: Top Models by Combined Score; Combined Score vs Total Cost; Top Models by Response Time (avg).