AI BENCHY

AI BENCHY Failures

Invalid tool call Failures

See which AI models most often produce invalid tool calls, so you can spot reliability risks before choosing one. Sorted by: Response Time (avg), ascending.

Models Shown: 9
Total Failures: 26
Most Affected Model: Granite 4.1 8B 1
| Rank | Model | Company | Invalid tool call Count | Score | Tests Correct | Response Time (avg) |
|------|-------|---------|-------------------------|-------|---------------|---------------------|
| #138 | Ling-2.6-flash none | Inclusionai | 2 | 5.0 | 6/21 | 9.34s |
| #133 | DeepSeek V3.2 none | DeepSeek | 1 | 5.2 | 6/21 | 13.8s |
| #59 | GLM 5V Turbo medium | Z.ai | 2 | 7.2 | 11/21 | 23.1s |
| #139 | DeepSeek V4 Flash none | DeepSeek | 1 | 5.0 | 5/21 | 26.8s |
| #158 | GLM 4.7 Flash medium | Z.ai | 1 | 4.4 | 4/21 | 35.1s |
| #130 | MiniMax M2.7 medium | Minimax | 1 | 5.3 | 5/21 | 38.2s |
| #119 | Cobuddy medium | Baidu | 1 | 5.6 | 7/21 | 39.9s |
| #78 | Qwen3.6 27B medium | Qwen | 1 | 6.8 | 10/21 | 59.7s |
| #129 | MiniMax M2.5 medium | Minimax | 1 | 5.3 | 5/21 | 65.4s |
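
For readers who want to re-sort or compare these rows offline, here is a minimal Python sketch. It is not part of the AI BENCHY site: the `Row` class, its field names, and the helper functions are hypothetical names chosen for illustration, and the values are transcribed by hand from the rows listed above. It assumes "Tests Correct" is a simple passed/total fraction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    # Hypothetical record type; field names are illustrative, not an AI BENCHY schema.
    rank: int
    model: str
    company: str
    invalid_tool_calls: int
    score: float
    tests_correct: int
    tests_total: int
    avg_response_s: float


# Values transcribed from the listing above.
ROWS = [
    Row(138, "Ling-2.6-flash none", "Inclusionai", 2, 5.0, 6, 21, 9.34),
    Row(133, "DeepSeek V3.2 none", "DeepSeek", 1, 5.2, 6, 21, 13.8),
    Row(59, "GLM 5V Turbo medium", "Z.ai", 2, 7.2, 11, 21, 23.1),
    Row(139, "DeepSeek V4 Flash none", "DeepSeek", 1, 5.0, 5, 21, 26.8),
    Row(158, "GLM 4.7 Flash medium", "Z.ai", 1, 4.4, 4, 21, 35.1),
    Row(130, "MiniMax M2.7 medium", "Minimax", 1, 5.3, 5, 21, 38.2),
    Row(119, "Cobuddy medium", "Baidu", 1, 5.6, 7, 21, 39.9),
    Row(78, "Qwen3.6 27B medium", "Qwen", 1, 6.8, 10, 21, 59.7),
    Row(129, "MiniMax M2.5 medium", "Minimax", 1, 5.3, 5, 21, 65.4),
]


def pass_rate(row: Row) -> float:
    """Fraction of tests answered correctly (assumes correct/total semantics)."""
    return row.tests_correct / row.tests_total


def by_response_time(rows: list[Row]) -> list[Row]:
    """Fastest first, matching the page's ascending Response Time (avg) sort."""
    return sorted(rows, key=lambda r: r.avg_response_s)


def by_invalid_calls(rows: list[Row]) -> list[Row]:
    """Most invalid tool calls first; ties broken by faster average response."""
    return sorted(rows, key=lambda r: (-r.invalid_tool_calls, r.avg_response_s))


if __name__ == "__main__":
    for r in by_invalid_calls(ROWS):
        print(f"{r.model:<24} invalid={r.invalid_tool_calls} "
              f"pass_rate={pass_rate(r):.2f} avg={r.avg_response_s}s")
```

Sorting on a tuple key keeps the tie-breaking rule explicit, which matters here because most of the listed models share the same invalid tool call count.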

Charts (not reproduced here): Top Models by Invalid tool call Count; Invalid tool call Count vs Score; Top Models by Response Time (avg).