Tool Calling Model Ranking

See which AI models perform best on Tool Calling, which ones stay reliable, and where the biggest gaps appear.

Models Shown

Average Tool Calling Score

8.8

Best Model

Failure Reasons

API error: 17
Invalid tool call: 9
Did not follow instructions: 8
Wrong answer: 3
No answer: 2

216/216

Rank	Model	Company	Tool Calling Score	Score	Total Cost	Tests Correct	Response Time (avg)
#188	KAT-Coder-Air V2.5 none	Kwaipilot	10.0	4.8	$0.067	1/1	5.13s
#189	Trinity Large Preview none	Arcee AI	10.0	4.8	$0.008	1/1	6.67s
#190	Hunter Alpha medium	OpenRouter	10.0	4.7	$0.000	1/1	17.3s
#192	Laguna M.1 medium	Poolside	10.0	4.7	$0.033	1/1	6.31s
#193	Qwen3 Coder Next medium	Qwen	10.0	4.7	$0.032	1/1	2.64s
#194	Cobuddy medium	Baidu	10.0	4.7	$0.000	1/1	11.2s
#195	Mercury 2 none	Inception	10.0	4.6	$0.030	1/1	1.27s
#196	MiniMax M2.5 medium	Minimax	10.0	4.6	$0.340	1/1	15.4s
#197	Grok 4.20 Beta none	X AI	10.0	4.4	$0.087	1/1	4.79s
#198	Laguna M.1 none	Poolside	10.0	4.4	$0.009	1/1	7.54s
#200	GLM 4.7 Flash medium	Z.ai	10.0	4.3	$0.166	1/1	15.9s
#202	Hunter Alpha none	OpenRouter	10.0	4.2	$0.000	1/1	6.02s
#203	Grok 4.20 none	X AI	10.0	4.1	$0.057	1/1	4.63s
#205	Hy3 preview none	Tencent	10.0	4.0	$0.003	1/1	33.8s
#206	MiMo-V2-Flash none	Xiaomi	10.0	4.0	$0.025	1/1	2.28s
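The ranking rows are tab-separated, so they can be loaded and re-sorted by any column. The following is a minimal sketch, assuming the header names shown in the table and using four rows copied from it; the `parse_rows` helper and the output field names are hypothetical and not part of the leaderboard itself.

```python
import csv
import io

# Header plus four rows copied from the ranking table (tab-separated).
TABLE = (
    "Rank\tModel\tCompany\tTool Calling Score\tScore\tTotal Cost\tTests Correct\tResponse Time (avg)\n"
    "#190\tHunter Alpha medium\tOpenRouter\t10.0\t4.7\t$0.000\t1/1\t17.3s\n"
    "#193\tQwen3 Coder Next medium\tQwen\t10.0\t4.7\t$0.032\t1/1\t2.64s\n"
    "#195\tMercury 2 none\tInception\t10.0\t4.6\t$0.030\t1/1\t1.27s\n"
    "#205\tHy3 preview none\tTencent\t10.0\t4.0\t$0.003\t1/1\t33.8s\n"
)


def parse_rows(text):
    """Parse tab-separated leaderboard rows into dicts with numeric fields.

    Hypothetical helper: strips the '#', '$' and 's' decorations and splits
    the 'Tests Correct' fraction into two integers.
    """
    rows = []
    for r in csv.DictReader(io.StringIO(text), delimiter="\t"):
        correct, total = r["Tests Correct"].split("/")
        rows.append({
            "rank": int(r["Rank"].lstrip("#")),
            "model": r["Model"],
            "company": r["Company"],
            "tool_score": float(r["Tool Calling Score"]),
            "score": float(r["Score"]),
            "cost_usd": float(r["Total Cost"].lstrip("$")),
            "tests_correct": int(correct),
            "tests_total": int(total),
            "response_s": float(r["Response Time (avg)"].rstrip("s")),
        })
    return rows


rows = parse_rows(TABLE)

# Re-sort the same rows by average response time, fastest first.
by_speed = sorted(rows, key=lambda r: r["response_s"])
for r in by_speed:
    print(f'{r["model"]:<26} {r["response_s"]:>6.2f}s  ${r["cost_usd"]:.3f}')
```

Every row in this excerpt is based on a single test (1/1), so a sort like this mainly separates models by speed and cost rather than by tool calling reliability.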

Tool Calling Ranking

[Charts: Top Models by Tool Calling Score; Tool Calling Score vs Total Cost; Top Models by Response Time (avg)]