Tool Calling Model Ranking

AI BENCHY Category

See which AI models perform best on Tool Calling, which ones stay reliable, and where the biggest gaps appear. Sort by: Metric ↑.

Models Shown

Average Tool Calling Score

8.7

Best Model

Failure Reasons

With failure reason API error15 With failure reason Invalid tool call7 With failure reason Did not follow instructions6 With failure reason No answer2 With failure reason Wrong answer2

Rank	Model	Company	Tool Calling Score	Score	Tests Correct	Response Time (avg)
#86	Grok 4.1 Fast medium	X AI	2.8	6.5	0/1	27.7s
#89	Hy3 preview low	Tencent	2.8	6.4	0/1	17.8s
#122	GLM 4.7 Flash none	Z.ai	2.8	5.5	0/1	7.05s
#157	Grok 4.1 Fast none	X AI	2.8	4.4	0/1	5.51s
#13	Grok 4.20 Beta medium	X AI	3.0	8.5	0/1	12.4s
#20	Gemini 3.5 Flash none	Google	3.0	8.1	0/1	0ms
#27	Gemma 4 31B medium	Google	3.0	7.8	0/1	0ms
#46	Qwen3.6 35B A3B medium	Qwen	3.0	7.4	0/1	0ms
#55	GLM 5.1 medium	Z.ai	3.0	7.3	0/1	0ms
#65	Grok 4.20 medium	X AI	3.0	7.1	0/1	13.7s
#83	Step 3.5 Flash none	Stepfun	3.0	6.6	0/1	0ms
#84	Grok 4.20 Multi Agent Beta medium	X AI	3.0	6.6	0/1	0ms
#85	Gemma 4 31B none	Google	3.0	6.5	0/1	0ms
#96	Ring-2.6-1T none	Inclusionai	3.0	6.2	0/1	0ms
#100	Grok Build 0.1 none	X AI	3.0	6.0	0/1	0ms

Tool Calling Ranking