Combined Model Ranking

See which AI models rank highest by Combined Score, which ones stay reliable, and where the biggest gaps appear.

Models Shown

Average Combined Score: 5.6

Failure Reasons

Invalid tool call: 91
Wrong answer: 69
No answer: 32
API error: 26
Timed out: 5
Did not follow instructions: 1
Extra formatting: 1

216/216

Rank	Model	Company	Combined Score	Score	Total Cost	Tests Correct	Response Time (avg)
#27	Muse Spark 1.1 low	Meta	6.6	8.3	$0.647	1/2	29.4s
#20	Claude Fable 5 medium	Anthropic	6.5	8.6	$3.478	1/2	27.5s
#23	Grok 4.5 low	X AI	6.5	8.4	$0.935	1/2	12.8s
#37	Kimi K3 max	Moonshot AI	6.5	8.0	$3.112	1/2	223.0s
#63	Qwen3.7 Max none	Qwen	6.5	7.4	$0.197	1/2	37.2s
#74	Qwen3.5 Plus 2026-04-20 medium	Qwen	6.5	7.2	$0.317	1/2	92.4s
#77	Grok 4.3 medium	X AI	6.5	7.1	$0.779	1/2	55.1s
#87	GPT-5.6 Sol none	OpenAI	6.5	6.9	$0.524	1/2	8.37s
#89	Qwen3.6 Flash medium	Qwen	6.5	6.9	$0.738	1/2	299.2s
#91	GPT-5.5 none	OpenAI	6.5	6.9	$0.544	1/2	8.90s
#103	Qwen3.6 Max Preview none	Qwen	6.5	6.6	$0.231	1/2	61.6s
#113	Qwen3.5 Plus 2026-02-15 none	Qwen	6.5	6.4	$0.073	1/2	64.8s
#117	LongCat 2.0 none	Meituan	6.5	6.3	$0.044	1/2	28.4s
#118	Claude Sonnet 5 none	Anthropic	6.5	6.3	$0.548	1/2	31.4s
#127	gpt-oss-120b medium	OpenAI	6.5	6.1	$0.019	1/2	24.0s

Combined Ranking

Charts (not reproduced here): Top Models by Combined Score; Combined Score vs Total Cost; Top Models by Response Time (avg)