AI BENCHY Category Failures
Tool Calling: API error
See which AI models are most likely to hit an API error on Tool Calling, so you can spot weak points faster. Sorted by Tests Correct, descending.
Failure Reasons
| Rank | Model | Company | API error Count | Category Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #20 | Gemini 3.5 Flash none | | 1 | 3.0 | 0/1 | 0ms |
| #27 | Gemma 4 31B medium | | 1 | 3.0 | 0/1 | 0ms |
| #46 | Qwen3.6 35B A3B medium | Qwen | 1 | 3.0 | 0/1 | 0ms |
| #55 | GLM 5.1 medium | Z.ai | 1 | 3.0 | 0/1 | 0ms |
| #83 | Step 3.5 Flash none | Stepfun | 1 | 3.0 | 0/1 | 0ms |
| #84 | Grok 4.20 Multi Agent Beta medium | X AI | 1 | 3.0 | 0/1 | 0ms |
| #85 | Gemma 4 31B none | | 1 | 3.0 | 0/1 | 0ms |
| #89 | Hy3 preview low | Tencent | 1 | 2.8 | 0/1 | 17.8s |
| #96 | Ring-2.6-1T none | Inclusionai | 1 | 3.0 | 0/1 | 0ms |
| #100 | Grok Build 0.1 none | X AI | 1 | 3.0 | 0/1 | 0ms |
| #126 | gpt-oss-120b none | OpenAI | 1 | 3.0 | 0/1 | 0ms |
| #149 | Nemotron 3 Nano Omni 30b A3b Reasoning medium | NVIDIA | 1 | 3.0 | 0/1 | 0ms |
| #153 | Qwen3.6 35B A3B none | Qwen | 1 | 3.0 | 0/1 | 0ms |
| #160 | LFM2-24B-A2B none | Liquid | 1 | 3.0 | 0/1 | 0ms |
| #162 | Nemotron 3 Nano Omni 30b A3b Reasoning none | NVIDIA | 1 | 3.0 | 0/1 | 0ms |