Échecs par catégorie AI BENCHY
Intelligence générale
N'a pas suivi les instructions
Intelligence générale
N'a pas suivi les instructions
Voyez quels modèles d'IA ont le plus de chances de rencontrer N'a pas suivi les instructions sur Intelligence générale, pour repérer plus vite les points faibles. Trier par: Tests corrects ↑.
Raisons d'échec liées
| Rang | Modèle | Entreprise | Nombre de N'a pas suivi les instructions | Score de catégorie | Tests corrects | Temps de réponse (moy.) |
|---|---|---|---|---|---|---|
| #3 | GPT-5.3-Codex medium | OpenAI | 1 | 4.0 | 0/1 | 4.87s |
| #7 | Qwen3.5-27B medium | Qwen | 1 | 5.0 | 0/1 | 101.4s |
| #9 | GPT-5.4 medium | OpenAI | 1 | 5.0 | 0/1 | 4.92s |
| #13 | Step 3.5 Flash medium | Stepfun | 1 | 6.0 | 0/1 | 6.54s |
| #14 | GLM 5 medium | Z.ai | 1 | 5.0 | 0/1 | 14.7s |
| #15 | GPT-5.2 Chat none | OpenAI | 1 | 4.0 | 0/1 | 3.20s |
| #16 | Gemini 2.5 Flash medium | 1 | 4.0 | 0/1 | 4.86s | |
| #17 | Gemini 3.1 Flash Lite Preview low | 1 | 3.0 | 0/1 | 1.54s | |
| #18 | DeepSeek V3.2 medium | DeepSeek | 1 | 3.0 | 0/1 | 31.3s |
| #19 | GPT-5.3 Chat none | OpenAI | 1 | 4.0 | 0/1 | 1.99s |
| #21 | MiMo-V2-Flash medium | Xiaomi | 1 | 3.0 | 0/1 | 4.20s |
| #22 | Gemini 3.1 Flash Lite Preview none | 1 | 3.0 | 0/1 | 741ms | |
| #23 | Seed-2.0-Mini medium | Bytedance Seed | 1 | 6.0 | 0/1 | 36.7s |
| #24 | Qwen3.5-Flash medium | Qwen | 1 | 5.0 | 0/1 | 40.1s |
| #25 | Claude Sonnet 4.6 none | Anthropic | 1 | 5.0 | 0/1 | 2.56s |
| #27 | GPT-5.2 medium | OpenAI | 1 | 10.0 | 0/1 | 4.32s |
| #28 | Kimi K2.5 medium | Moonshot AI | 1 | 6.0 | 0/1 | 69.7s |
| #30 | Grok 4.1 Fast medium | X AI | 1 | 3.0 | 0/1 | 16.2s |
| #32 | GPT-5 Mini medium | OpenAI | 1 | 4.0 | 0/1 | 13.5s |
| #34 | GPT-5 Nano medium | OpenAI | 1 | 3.0 | 0/1 | 17.5s |
| #36 | Mercury 2 medium | Inception | 1 | 4.0 | 0/1 | 821ms |
| #39 | gpt-oss-120b medium | OpenAI | 1 | 3.0 | 0/1 | 7.90s |
| #40 | Qwen3.5-122B-A10B none | Qwen | 1 | 5.0 | 0/1 | 1.12s |
| #41 | Qwen3.5-27B none | Qwen | 1 | 5.0 | 0/1 | 2.51s |
| #42 | Qwen3.5-35B-A3B none | Qwen | 1 | 6.0 | 0/1 | 1.19s |
| #43 | MiniMax M2.5 medium | Minimax | 1 | 3.0 | 0/1 | 6.63s |
| #45 | Trinity Large Preview none | Arcee AI | 1 | 3.0 | 0/1 | 2.86s |
| #50 | Qwen3 Coder Next medium | Qwen | 1 | 6.0 | 0/1 | 1.39s |
| #51 | Mercury 2 none | Inception | 1 | 4.0 | 0/1 | 628ms |
| #53 | Grok 4.1 Fast none | X AI | 1 | 3.0 | 0/1 | 1.08s |
| #54 | MiMo-V2-Flash none | Xiaomi | 1 | 4.0 | 0/1 | 1.67s |
| #55 | LFM2-24B-A2B none | Liquid | 1 | 3.0 | 0/1 | 395ms |