AI BENCHY Failures
Failures by "Did not follow instructions"
See which AI models hit "Did not follow instructions" most often, to identify reliability risks before choosing one. Sorted by: Response time (average) ↓.
| Rank | Model | Company | "Did not follow instructions" count | Average score | Correct tests | Response time (avg) |
|---|---|---|---|---|---|---|
| #24 | Qwen3.5-Flash medium | Qwen | 1 | 6.9 | 10/16 | 70.8s |
| #28 | Kimi K2.5 medium | Moonshot AI | 2 | 6.4 | 9/16 | 69.8s |
| #8 | Gemini 3.1 Flash Lite Preview high | Google | 1 | 8.2 | 12/16 | 68.8s |
| #23 | Seed-2.0-Mini medium | Bytedance Seed | 1 | 6.9 | 10/16 | 65.1s |
| #7 | Qwen3.5-27B medium | Qwen | 2 | 8.2 | 12/16 | 52.1s |
| #34 | GPT-5 Nano medium | OpenAI | 3 | 5.5 | 7/16 | 47.9s |
| #43 | MiniMax M2.5 medium | Minimax | 3 | 4.7 | 5/16 | 43.0s |
| #18 | DeepSeek V3.2 medium | DeepSeek | 1 | 7.3 | 11/16 | 39.5s |
| #52 | GLM 4.7 Flash medium | Z.ai | 2 | 3.1 | 4/16 | 36.8s |
| #13 | Step 3.5 Flash medium | Stepfun | 3 | 7.4 | 10/16 | 29.1s |
| #30 | Grok 4.1 Fast medium | X AI | 3 | 6.2 | 9/16 | 26.3s |
| #21 | MiMo-V2-Flash medium | Xiaomi | 1 | 7.2 | 11/16 | 25.3s |
| #32 | GPT-5 Mini medium | OpenAI | 4 | 6.0 | 8/16 | 25.1s |
| #9 | GPT-5.4 medium | OpenAI | 2 | 8.0 | 12/16 | 20.1s |
| #39 | gpt-oss-120b medium | OpenAI | 4 | 5.1 | 7/16 | 16.7s |
| #3 | GPT-5.3-Codex medium | OpenAI | 2 | 8.4 | 12/16 | 16.6s |
| #14 | GLM 5 medium | Z.ai | 1 | 7.4 | 11/16 | 16.2s |
| #27 | GPT-5.2 medium | OpenAI | 3 | 6.5 | 10/16 | 15.3s |
| #50 | Qwen3 Coder Next medium | Qwen | 5 | 3.5 | 3/16 | 12.5s |
| #16 | Gemini 2.5 Flash medium | Google | 1 | 7.4 | 11/16 | 12.4s |
| #48 | Qwen3 Coder Next none | Qwen | 1 | 4.0 | 4/16 | 11.7s |
| #15 | GPT-5.2 Chat none | OpenAI | 1 | 7.4 | 11/16 | 7.03s |
| #19 | GPT-5.3 Chat none | OpenAI | 2 | 7.3 | 10/16 | 5.96s |
| #25 | Claude Sonnet 4.6 none | Anthropic | 1 | 6.8 | 10/16 | 5.57s |
| #42 | Qwen3.5-35B-A3B none | Qwen | 2 | 4.7 | 6/16 | 4.10s |
| #12 | Gemini 3.1 Flash Lite Preview medium | Google | 1 | 7.5 | 11/16 | 3.83s |
| #40 | Qwen3.5-122B-A10B none | Qwen | 1 | 5.0 | 6/16 | 3.72s |
| #37 | Qwen3.5-Flash none | Qwen | 1 | 5.2 | 7/16 | 3.54s |
| #17 | Gemini 3.1 Flash Lite Preview low | Google | 1 | 7.3 | 11/16 | 3.36s |
| #45 | Trinity Large Preview none | Arcee AI | 2 | 4.2 | 5/16 | 3.15s |
| #49 | GLM 4.7 Flash none | Z.ai | 2 | 3.9 | 4/16 | 2.99s |
| #54 | MiMo-V2-Flash none | Xiaomi | 1 | 2.9 | 3/16 | 2.97s |
| #36 | Mercury 2 medium | Inception | 4 | 5.3 | 7/16 | 2.36s |
| #47 | GPT-4o-mini none | OpenAI | 1 | 4.0 | 4/16 | 2.07s |
| #53 | Grok 4.1 Fast none | X AI | 2 | 2.9 | 3/16 | 1.90s |
| #41 | Qwen3.5-27B none | Qwen | 2 | 4.9 | 5/16 | 1.75s |
| #44 | GPT-5.4 none | OpenAI | 1 | 4.5 | 6/16 | 1.48s |
| #22 | Gemini 3.1 Flash Lite Preview none | Google | 2 | 7.1 | 10/16 | 1.33s |
| #38 | Gemini 2.5 Flash none | Google | 1 | 5.2 | 6/16 | 923ms |
| #55 | LFM2-24B-A2B none | Liquid | 2 | 2.6 | 1/16 | 811ms |
| #51 | Mercury 2 none | Inception | 1 | 3.4 | 4/16 | 596ms |
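The table is ordered by average response time, descending, with timings given in mixed units (`70.8s` vs. `923ms`). A minimal sketch of reproducing that ordering over a few sample rows from the table; the `parse_seconds` helper is hypothetical, not part of AI BENCHY:

```python
def parse_seconds(t: str) -> float:
    """Convert a timing string like '70.8s' or '923ms' to seconds."""
    if t.endswith("ms"):
        return float(t[:-2]) / 1000.0
    return float(t.rstrip("s"))

# A few (model, avg response time) pairs taken from the table above.
rows = [
    ("Mercury 2 none", "596ms"),
    ("Qwen3.5-Flash medium", "70.8s"),
    ("GPT-5.4 medium", "20.1s"),
    ("Gemini 2.5 Flash none", "923ms"),
]

# Sort slowest-first, matching the page's "Response time (average) ↓" order.
ordered = sorted(rows, key=lambda r: parse_seconds(r[1]), reverse=True)
for model, t in ordered:
    print(f"{model}: {t}")
```

Normalizing to a single unit before sorting avoids the trap of comparing the raw strings, where `"923ms"` would sort after `"70.8s"` lexicographically.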