Tool Calling Model Ranking

See which AI models perform best on Tool Calling, which ones stay reliable, and where the biggest gaps appear. The table is sorted by average response time, fastest first.

Models Shown

216/216

Average Tool Calling Score

8.8

Best Model

Kimi K3 3.0

Failure Reasons

API error: 17
Invalid tool call: 9
Did not follow instructions: 8
Wrong answer: 3
No answer: 2

Rank	Model	Company	Tool Calling Score	Overall Score	Total Cost	Tests Correct	Response Time (avg)
#41	Qwen3.6 Plus medium	Qwen	10.0	7.8	$0.405	1/1	5.87s
#181	Qwen3.6 Plus Preview medium	Qwen	10.0	4.9	$0.000	1/1	5.87s
#97	KAT-Coder-Pro V2.5 none	Kwaipilot	10.0	6.7	$0.476	1/1	5.93s
#27	Muse Spark 1.1 low	Meta	9.8	8.3	$0.647	1/1	5.98s
#54	GPT-5.6 Luna medium	OpenAI	10.0	7.6	$0.352	1/1	6.02s
#202	Hunter Alpha none	OpenRouter	10.0	4.2	$0.000	1/1	6.02s
#28	Gemini 2.5 Flash medium	Google	10.0	8.2	$0.643	1/1	6.20s
#7	GPT-5.6 Sol medium	OpenAI	10.0	9.4	$1.316	1/1	6.30s
#192	Laguna M.1 medium	Poolside	10.0	4.7	$0.033	1/1	6.31s
#16	GPT-5.3-Codex medium	OpenAI	10.0	8.9	$0.920	1/1	6.37s
#149	Gemini 3.1 Flash Lite high	Google	10.0	5.6	$2.044	1/1	6.44s
#32	Inkling high	Thinkingmachines	3.0	8.0	$1.006	0/1	6.52s
#25	Grok 4.5 medium	X AI	10.0	8.3	$1.928	1/1	6.57s
#11	Qwen3.7 Max medium	Qwen	10.0	9.2	$1.116	1/1	6.63s
#117	LongCat 2.0 none	Meituan	10.0	6.3	$0.044	1/1	6.64s
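
For readers who want to re-sort or filter the table themselves, the following is a minimal sketch, assuming the rows are available as tab-separated text in the column order shown above. The function name `parse_rows` and the dictionary keys are hypothetical, chosen for illustration only; the five sample rows are copied from the table. The second score column is treated here as the overall score, since the rank numbers follow it.

```python
import csv
import io

# Sample rows copied from the table above (tab-separated, same column order).
ROWS = (
    "#41\tQwen3.6 Plus medium\tQwen\t10.0\t7.8\t$0.405\t1/1\t5.87s\n"
    "#27\tMuse Spark 1.1 low\tMeta\t9.8\t8.3\t$0.647\t1/1\t5.98s\n"
    "#7\tGPT-5.6 Sol medium\tOpenAI\t10.0\t9.4\t$1.316\t1/1\t6.30s\n"
    "#32\tInkling high\tThinkingmachines\t3.0\t8.0\t$1.006\t0/1\t6.52s\n"
    "#117\tLongCat 2.0 none\tMeituan\t10.0\t6.3\t$0.044\t1/1\t6.64s\n"
)


def parse_rows(text):
    """Turn tab-separated leaderboard rows into dictionaries with typed fields.

    Hypothetical helper: the key names are assumptions, not part of the page.
    """
    records = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        rank, model, company, tool, overall, cost, correct, latency = row
        passed, total = correct.split("/")
        records.append({
            "rank": int(rank.lstrip("#")),
            "model": model,
            "company": company,
            "tool_calling_score": float(tool),
            "overall_score": float(overall),
            "total_cost_usd": float(cost.lstrip("$")),
            "tests_passed": int(passed),
            "tests_total": int(total),
            "response_time_s": float(latency.rstrip("s")),
        })
    return records


records = parse_rows(ROWS)

# Same ordering as the page: fastest average response time first.
by_latency = sorted(records, key=lambda r: r["response_time_s"])

# Alternative ordering: highest overall score first.
by_overall = sorted(records, key=lambda r: r["overall_score"], reverse=True)

# Models that did not pass every tool-calling test.
failed = [r["model"] for r in records if r["tests_passed"] < r["tests_total"]]

for r in by_latency:
    print(f'{r["response_time_s"]:.2f}s  {r["tool_calling_score"]:>4}  {r["model"]}')
```

Sorting by `overall_score` instead of `response_time_s` reproduces the order implied by the rank numbers, which is a quick way to check that the columns were parsed as intended.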

Tool Calling Ranking

[Charts: Top Models by Tool Calling Score; Tool Calling Score vs Total Cost; Top Models by Response Time (avg)]