Combined Model Ranking

See which AI models achieve the highest Combined Score, which ones stay reliable, and where the biggest gaps appear.

Models Shown

Average Combined Score: 5.6

Best Model

Failure Reasons

Invalid tool call: 91
Wrong answer: 69
No answer: 32
API error: 26
Timed out: 5
Did not follow instructions: 1
Extra formatting: 1

216/216

Rank	Model	Company	Combined Score	Score	Total Cost	Tests Correct	Response Time (avg)
#122	Seed-2.0-Lite none	Bytedance Seed	3.0	6.2	$0.066	0/2	25.6s
#124	Gemini 2.5 Flash none	Google	3.0	6.2	$0.017	0/2	61.2s
#126	Gemini 3.1 Flash Lite minimal	Google	3.0	6.1	$0.047	0/2	7.75s
#128	Gemini 3.1 Flash Lite none	Google	3.0	6.1	$0.046	0/2	9.49s
#135	Nemotron 3 Ultra none	NVIDIA	3.0	6.1	$0.095	0/2	21.1s
#144	Kimi K2.6 none	Moonshot AI	3.0	5.8	$0.184	0/2	77.8s
#145	GPT-5.4 none	OpenAI	3.0	5.8	$0.397	0/2	9.26s
#160	MiMo-V2.5-Pro none	Xiaomi	3.0	5.5	$0.068	0/2	28.3s
#162	Gemma 4 26B A4B none	Google	3.0	5.5	$0.015	0/2	37.2s
#166	Laguna XS 2.1 none	Poolside	3.0	5.3	$0.008	0/2	10.4s
#171	Mistral Small 4 none	Mistral	3.0	5.1	$0.022	0/2	7.44s
#172	Qwen3 Coder Next none	Qwen	3.0	5.1	$0.025	0/2	30.9s
#173	Mistral Small 4 medium	Mistral	3.0	5.1	$0.096	0/2	32.4s
#174	MiMo-V2.5 none	Xiaomi	3.0	5.1	$0.025	0/2	28.9s
#175	Qwen3.5-9B none	Qwen	3.0	5.1	$0.021	0/2	194.0s

Combined Ranking

Top Models by Combined Score

Combined Score vs Total Cost

Top Models by Response Time (avg)