Tool Calling Model Ranking

See which AI models perform best on Tool Calling, which ones stay reliable, and where the biggest gaps appear. The table below is sorted by average response time, slowest first.

Models Shown

Average Tool Calling Score

8.8

Best Model

Ring-2.6-1T 10.0

Failure Reasons

API error: 17
Invalid tool call: 9
Did not follow instructions: 8
Wrong answer: 3
No answer: 2

216/216

Rank	Model	Company	Tool Calling Score	Score	Total Cost	Tests Correct	Response Time (avg)
#176	GLM 5 Turbo none	Z.ai	10.0	5.1	$0.047	1/1	8.21s
#115	Mimo V2 PRO medium	Xiaomi	10.0	6.3	$0.333	1/1	8.19s
#169	Gemini 3.1 Flash Lite Preview high	Google	10.0	5.3	$2.310	1/1	7.73s
#55	Nemotron 3 Ultra medium	NVIDIA	10.0	7.5	$0.774	1/1	7.72s
#57	GPT-5.4 Nano medium	OpenAI	10.0	7.5	$0.138	1/1	7.71s
#5	GPT-5.6 Sol low	OpenAI	10.0	9.5	$0.971	1/1	7.56s
#61	Qwen3.5 Plus 2026-02-15 medium	Qwen	10.0	7.5	$0.437	1/1	7.54s
#198	Laguna M.1 none	Poolside	10.0	4.4	$0.009	1/1	7.54s
#44	Claude Sonnet 4.6 medium	Anthropic	10.0	7.8	$2.057	1/1	7.48s
#62	Qwen3.5-27B medium	Qwen	10.0	7.4	$1.627	1/1	7.45s
#86	DeepSeek V4 Pro none	DeepSeek	10.0	6.9	$0.096	1/1	7.40s
#107	MiMo-V2.5 medium	Xiaomi	10.0	6.5	$0.082	1/1	7.29s
#8	GPT-5.6 Sol high	OpenAI	10.0	9.4	$1.234	1/1	7.08s
#182	GLM 4.7 Flash none	Z.ai	2.8	4.9	$0.016	0/1	7.05s
#19	Muse Spark 1.1 medium	Meta	9.8	8.6	$1.357	1/1	6.99s
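
For readers who want to re-sort the table themselves, here is a minimal Python sketch. It assumes each row is tab-separated in the column order of the header above. The names `Row`, `parse_row`, and `SAMPLE` are hypothetical and not part of the leaderboard. The sample uses four rows copied from the table, keeps the models that passed every test, and orders them by total cost.

```python
from dataclasses import dataclass


@dataclass
class Row:
    rank: int
    model: str
    company: str
    tool_score: float
    score: float          # the column labelled "Score" in the table
    total_cost: float     # US dollars
    correct: int
    total: int
    response_s: float     # average response time in seconds


def parse_row(line: str) -> Row:
    """Parse one tab-separated table row (assumed column order, see header)."""
    rank, model, company, tool, score, cost, tests, resp = line.split("\t")
    correct, total = tests.split("/")
    return Row(
        rank=int(rank.lstrip("#")),
        model=model,
        company=company,
        tool_score=float(tool),
        score=float(score),
        total_cost=float(cost.lstrip("$")),
        correct=int(correct),
        total=int(total),
        response_s=float(resp.rstrip("s")),
    )


# Four rows copied from the table above.
SAMPLE = [
    "#176\tGLM 5 Turbo none\tZ.ai\t10.0\t5.1\t$0.047\t1/1\t8.21s",
    "#198\tLaguna M.1 none\tPoolside\t10.0\t4.4\t$0.009\t1/1\t7.54s",
    "#44\tClaude Sonnet 4.6 medium\tAnthropic\t10.0\t7.8\t$2.057\t1/1\t7.48s",
    "#182\tGLM 4.7 Flash none\tZ.ai\t2.8\t4.9\t$0.016\t0/1\t7.05s",
]

rows = [parse_row(line) for line in SAMPLE]

# Keep models that answered every test correctly, cheapest first.
passed = [r for r in rows if r.correct == r.total]
cheapest_first = sorted(passed, key=lambda r: r.total_cost)

for r in cheapest_first:
    print(f"{r.model}: ${r.total_cost:.3f}, {r.response_s:.2f}s")
```

Swapping the sort key for `r.response_s` gives a fastest-first ordering instead. With only one test per model in this table, any such ordering is a rough indication, not a stable ranking.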

Charts (not reproduced here): Top Models by Tool Calling Score; Tool Calling Score vs Total Cost; Top Models by Response Time (avg).