AI BENCHY

Tool Calling Ranking

See which AI models perform best on Tool Calling, which ones stay reliable, and where the biggest gaps appear. Rows are sorted by Score, ascending.

Models Shown: 15

Average Tool Calling Score: 8.7

Rank  Model                    Company     Tool Calling Score  Score  Tests Correct  Response Time (avg)
#98   LFM2-24B-A2B none        Liquid      3.0                 4.1    0/1            0ms
#97   Qwen3.5-9B medium        Qwen        10.0                4.4    1/1            4.31s
#96   GPT-5.4 Nano none        OpenAI      10.0                4.5    1/1            3.40s
#95   Grok 4.1 Fast none       X AI        2.8                 4.5    0/1            5.51s
#94   MiMo-V2-Flash none       Xiaomi      10.0                4.5    1/1            2.28s
#93   GLM 4.7 Flash medium     Z.ai        10.0                4.6    1/1            15.9s
#92   Qwen3 Coder Next medium  Qwen        10.0                4.7    1/1            2.64s
#91   Mercury 2 none           Inception   10.0                4.8    1/1            1.27s
#90   Qwen3.5-9B none          Qwen        10.0                4.8    1/1            1.27s
#89   GPT-4o-mini none         OpenAI      10.0                4.9    1/1            2.51s
#88   Nemotron 3 Super none    NVIDIA      4.7                 5.1    0/1            16.0s
#87   Qwen3 Coder Next none    Qwen        10.0                5.1    1/1            2.47s
#86   GPT-5.4 Mini none        OpenAI      3.0                 5.1    0/1            2.32s
#85   Elephant none            Openrouter  3.0                 5.2    0/1            2.79s
#84   gpt-oss-120b none        OpenAI      3.0                 5.2    0/1            0ms

[Charts: Top Models by Tool Calling Score; Tool Calling Score vs Total Cost; Top Models by Response Time (avg)]