AI BENCHY

AI BENCHY Category Failures

Combined: Invalid tool call

See which AI models are most likely to fail with an Invalid tool call error in the Combined category, so you can spot weak points faster. Models are sorted by average response time, ascending.

Models Shown: 4
Total Failures: 19
Most Affected Model: Granite 4.1 8B 1

| Rank | Model | Company | Invalid tool call Count | Category Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #158 | GLM 4.7 Flash medium | Z.ai | 1 | 2.8 | 0/1 | 65.6s |
| #78 | Qwen3.6 27B medium | Qwen | 1 | 7.0 | 0/1 | 83.1s |
| #139 | DeepSeek V4 Flash none | DeepSeek | 1 | 4.5 | 0/1 | 112.0s |
| #133 | DeepSeek V3.2 none | DeepSeek | 1 | 6.5 | 0/1 | 115.9s |

Charts on this page: Top Models by Invalid tool call Count; Invalid tool call Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.