AI BENCHY Failures
Invalid Tool Call Failures
See which AI models hit invalid tool call failures most often, so you can spot reliability risks before choosing one. Sorted by: Tests passed ↓.
| Rank | Model | Company | Invalid tool call count | Score | Tests passed | Response time (avg) |
|---|---|---|---|---|---|---|
| #32 | Gemini 3.5 Flash minimal | | 1 | 7.7 | 14/21 | 1.57s |
| #59 | GLM 5V Turbo medium | Z.ai | 2 | 7.2 | 11/21 | 23.1s |
| #78 | Qwen3.6 27B medium | Qwen | 1 | 6.8 | 10/21 | 59.7s |
| #106 | Grok 4.20 Beta none | X AI | 1 | 5.8 | 6/18 | 1.19s |
| #112 | GLM 5.1 none | Z.ai | 1 | 5.7 | 7/21 | 4.10s |
| #118 | Qwen3.6 27B none | Qwen | 1 | 5.6 | 7/21 | 3.72s |
| #119 | Cobuddy medium | Baidu | 1 | 5.6 | 7/21 | 39.9s |
| #127 | Grok 4.20 none | X AI | 1 | 5.4 | 6/18 | 1.11s |
| #128 | Qwen3.6 Flash none | Qwen | 1 | 5.4 | 7/21 | 1.60s |
| #107 | Laguna Xs.2 medium | Poolside | 1 | 5.8 | 6/19 | 6.73s |
| #122 | GLM 4.7 Flash none | Z.ai | 1 | 5.5 | 6/21 | 2.86s |
| #133 | DeepSeek V3.2 none | DeepSeek | 1 | 5.2 | 6/21 | 13.8s |
| #136 | Elephant Alpha medium | Openrouter | 1 | 5.1 | 6/21 | 1.27s |
| #138 | Ling-2.6-flash none | Inclusionai | 2 | 5.0 | 6/21 | 9.34s |
| #146 | Laguna Xs.2 none | Poolside | 1 | 4.8 | 5/19 | 0.806s |
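
The row order above is not a plain sort on the number of passed tests: 6/18 sits above 7/21, and 6/19 sits below it. The order looks consistent with sorting by pass rate (passed divided by total), with ties broken by score. The page does not state this, so the Python sketch below treats it as an assumption. The names `pass_rate` and `sort_rows` are illustrative, and the rows are a small sample copied from the table.

```python
from fractions import Fraction

# (model, score, "passed/total") for a few rows copied from the table above.
rows = [
    ("Qwen3.6 Flash none", 5.4, "7/21"),
    ("Laguna Xs.2 medium", 5.8, "6/19"),
    ("Grok 4.20 Beta none", 5.8, "6/18"),
    ("GLM 5.1 none", 5.7, "7/21"),
    ("Gemini 3.5 Flash minimal", 7.7, "14/21"),
]


def pass_rate(cell: str) -> Fraction:
    """Turn a 'passed/total' cell into an exact fraction."""
    passed, total = cell.split("/")
    return Fraction(int(passed), int(total))


def sort_rows(rows):
    # Assumed ordering (inferred, not documented by the page):
    # pass rate descending, then score descending.
    return sorted(rows, key=lambda r: (pass_rate(r[2]), r[1]), reverse=True)


for model, score, tests in sort_rows(rows):
    print(f"{model:26} {tests:>5}  rate={float(pass_rate(tests)):.3f}  score={score}")
```

`Fraction` keeps 6/18 and 7/21 exactly equal at 1/3, so the tie falls through to the score rather than depending on floating-point rounding.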