AI BENCHY


Invalid Tool Call Failures

See which AI models most often make invalid tool calls, so you can spot reliability risks before choosing one. Sorted by: Tests Correct (descending).

Models shown: 9
Total failures: 26
Most affected model: Gemini 3.5 Flash (1)

Rank  Model                   Company      Invalid tool calls  Score  Tests correct  Avg. response time
#129  MiniMax M2.5 medium     Minimax      1                   5.3    5/21           65.4s
#130  MiniMax M2.7 medium     Minimax      1                   5.3    5/21           38.2s
#137  Elephant Alpha none     Openrouter   1                   5.1    5/21           1.22s
#139  DeepSeek V4 Flash none  DeepSeek     1                   5.0    5/21           26.8s
#145  Laguna M.1 none         Poolside     1                   4.8    4/19           2.89s
#154  Qwen3.5-9B none         Qwen         1                   4.6    4/21           1.89s
#158  GLM 4.7 Flash medium    Z.ai         1                   4.4    4/21           35.1s
#159  Ling-2.6-1T none        Inclusionai  1                   4.3    3/21           7.72s
#163  Granite 4.1 8B none     IBM Granite  1                   4.0    2/21           728ms
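
Two columns in the table above are not directly comparable as printed: Laguna M.1 was scored on 19 tests rather than 21, and average response times mix seconds ("65.4s") with milliseconds ("728ms"). The following is a minimal Python sketch, assuming the rows are transcribed by hand from the table, that normalizes both before sorting. The tuple layout, the helper names `pass_rate`, `to_seconds`, and `rank_by_pass_rate`, and the tie-break on faster response time are all hypothetical choices for illustration; AI BENCHY does not state how it breaks ties.

```python
# Rows transcribed by hand from the table above.
# Hypothetical tuple layout: (model, tests_correct, avg_response_time).
ROWS = [
    ("MiniMax M2.5 medium", "5/21", "65.4s"),
    ("MiniMax M2.7 medium", "5/21", "38.2s"),
    ("Elephant Alpha none", "5/21", "1.22s"),
    ("DeepSeek V4 Flash none", "5/21", "26.8s"),
    ("Laguna M.1 none", "4/19", "2.89s"),
    ("Qwen3.5-9B none", "4/21", "1.89s"),
    ("GLM 4.7 Flash medium", "4/21", "35.1s"),
    ("Ling-2.6-1T none", "3/21", "7.72s"),
    ("Granite 4.1 8B none", "2/21", "728ms"),
]


def pass_rate(tests_correct: str) -> float:
    """Turn a 'passed/total' string such as '4/19' into a fraction in [0, 1]."""
    passed, total = (int(x) for x in tests_correct.split("/"))
    return passed / total


def to_seconds(duration: str) -> float:
    """Normalize a duration like '728ms' or '65.4s' to seconds."""
    if duration.endswith("ms"):
        return float(duration[:-2]) / 1000
    if duration.endswith("s"):
        return float(duration[:-1])
    raise ValueError(f"unrecognized duration: {duration!r}")


def rank_by_pass_rate(rows):
    """Sort by pass rate (descending); assumed tie-break: faster response first."""
    return sorted(rows, key=lambda r: (-pass_rate(r[1]), to_seconds(r[2])))


if __name__ == "__main__":
    for model, correct, rt in rank_by_pass_rate(ROWS):
        print(f"{model:<24} {pass_rate(correct):6.1%} {to_seconds(rt):7.2f}s")
```

Normalizing to a pass rate lets the 19-test run sit on the same scale as the 21-test runs (4/19 is about 21.1%, slightly above 4/21 at about 19.0%).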

[Charts: top models by invalid tool call count; invalid tool call count vs. score; top models by average response time]