AI BENCHY Failures
Failures: Didn't follow instructions
See which AI models run into "Didn't follow instructions" most often, to spot reliability risks before choosing a model. Sorted by: Response time (average) ↓.
| Rank | Model | Company | "Didn't follow instructions" count | Avg. score | Tests passed | Response time (avg) |
|---|---|---|---|---|---|---|
| #24 | Qwen3.5-Flash medium | Qwen | 1 | 6.9 | 10/16 | 70.8s |
| #28 | Kimi K2.5 medium | Moonshot AI | 2 | 6.4 | 9/16 | 69.8s |
| #8 | Gemini 3.1 Flash Lite Preview high | Google | 1 | 8.2 | 12/16 | 68.8s |
| #23 | Seed-2.0-Mini medium | Bytedance Seed | 1 | 6.9 | 10/16 | 65.1s |
| #7 | Qwen3.5-27B medium | Qwen | 2 | 8.2 | 12/16 | 52.1s |
| #34 | GPT-5 Nano medium | OpenAI | 3 | 5.5 | 7/16 | 47.9s |
| #43 | MiniMax M2.5 medium | Minimax | 3 | 4.7 | 5/16 | 43.0s |
| #18 | DeepSeek V3.2 medium | DeepSeek | 1 | 7.3 | 11/16 | 39.5s |
| #52 | GLM 4.7 Flash medium | Z.ai | 2 | 3.1 | 4/16 | 36.8s |
| #13 | Step 3.5 Flash medium | Stepfun | 3 | 7.4 | 10/16 | 29.1s |
| #30 | Grok 4.1 Fast medium | X AI | 3 | 6.2 | 9/16 | 26.3s |
| #21 | MiMo-V2-Flash medium | Xiaomi | 1 | 7.2 | 11/16 | 25.3s |
| #32 | GPT-5 Mini medium | OpenAI | 4 | 6.0 | 8/16 | 25.1s |
| #9 | GPT-5.4 medium | OpenAI | 2 | 8.0 | 12/16 | 20.1s |
| #39 | gpt-oss-120b medium | OpenAI | 4 | 5.1 | 7/16 | 16.7s |
| #3 | GPT-5.3-Codex medium | OpenAI | 2 | 8.4 | 12/16 | 16.6s |
| #14 | GLM 5 medium | Z.ai | 1 | 7.4 | 11/16 | 16.2s |
| #27 | GPT-5.2 medium | OpenAI | 3 | 6.5 | 10/16 | 15.3s |
| #50 | Qwen3 Coder Next medium | Qwen | 5 | 3.5 | 3/16 | 12.5s |
| #16 | Gemini 2.5 Flash medium | Google | 1 | 7.4 | 11/16 | 12.4s |
| #48 | Qwen3 Coder Next none | Qwen | 1 | 4.0 | 4/16 | 11.7s |
| #15 | GPT-5.2 Chat none | OpenAI | 1 | 7.4 | 11/16 | 7.03s |
| #19 | GPT-5.3 Chat none | OpenAI | 2 | 7.3 | 10/16 | 5.96s |
| #25 | Claude Sonnet 4.6 none | Anthropic | 1 | 6.8 | 10/16 | 5.57s |
| #42 | Qwen3.5-35B-A3B none | Qwen | 2 | 4.7 | 6/16 | 4.10s |
| #12 | Gemini 3.1 Flash Lite Preview medium | Google | 1 | 7.5 | 11/16 | 3.83s |
| #40 | Qwen3.5-122B-A10B none | Qwen | 1 | 5.0 | 6/16 | 3.72s |
| #37 | Qwen3.5-Flash none | Qwen | 1 | 5.2 | 7/16 | 3.54s |
| #17 | Gemini 3.1 Flash Lite Preview low | Google | 1 | 7.3 | 11/16 | 3.36s |
| #45 | Trinity Large Preview none | Arcee AI | 2 | 4.2 | 5/16 | 3.15s |
| #49 | GLM 4.7 Flash none | Z.ai | 2 | 3.9 | 4/16 | 2.99s |
| #54 | MiMo-V2-Flash none | Xiaomi | 1 | 2.9 | 3/16 | 2.97s |
| #36 | Mercury 2 medium | Inception | 4 | 5.3 | 7/16 | 2.36s |
| #47 | GPT-4o-mini none | OpenAI | 1 | 4.0 | 4/16 | 2.07s |
| #53 | Grok 4.1 Fast none | X AI | 2 | 2.9 | 3/16 | 1.90s |
| #41 | Qwen3.5-27B none | Qwen | 2 | 4.9 | 5/16 | 1.75s |
| #44 | GPT-5.4 none | OpenAI | 1 | 4.5 | 6/16 | 1.48s |
| #22 | Gemini 3.1 Flash Lite Preview none | Google | 2 | 7.1 | 10/16 | 1.33s |
| #38 | Gemini 2.5 Flash none | Google | 1 | 5.2 | 6/16 | 923ms |
| #55 | LFM2-24B-A2B none | Liquid | 2 | 2.6 | 1/16 | 811ms |
| #51 | Mercury 2 none | Inception | 1 | 3.4 | 4/16 | 596ms |