Kegagalan kategori AI BENCHY
Kecerdasan umum
Tidak mengikuti instruksi
Kecerdasan umum
Tidak mengikuti instruksi
Lihat model AI mana yang paling mungkin mengalami Tidak mengikuti instruksi di Kecerdasan umum, agar Anda bisa menemukan titik lemahnya lebih cepat. Urutkan berdasarkan: Waktu respons (rata-rata) ↑.
Alasan kegagalan terkait
Kategori terkait
| Peringkat | Model | Perusahaan | Jumlah Tidak mengikuti instruksi | Skor kategori | Tes benar | Waktu respons (rata-rata) |
|---|---|---|---|---|---|---|
| #55 | LFM2-24B-A2B none | Liquid | 1 | 3.0 | 0/1 | 395ms |
| #51 | Mercury 2 none | Inception | 1 | 4.0 | 0/1 | 628ms |
| #22 | Gemini 3.1 Flash Lite Preview none | 1 | 3.0 | 0/1 | 741ms | |
| #36 | Mercury 2 medium | Inception | 1 | 4.0 | 0/1 | 821ms |
| #53 | Grok 4.1 Fast none | X AI | 1 | 3.0 | 0/1 | 1.08s |
| #40 | Qwen3.5-122B-A10B none | Qwen | 1 | 5.0 | 0/1 | 1.12s |
| #42 | Qwen3.5-35B-A3B none | Qwen | 1 | 6.0 | 0/1 | 1.19s |
| #50 | Qwen3 Coder Next medium | Qwen | 1 | 6.0 | 0/1 | 1.39s |
| #17 | Gemini 3.1 Flash Lite Preview low | 1 | 3.0 | 0/1 | 1.54s | |
| #54 | MiMo-V2-Flash none | Xiaomi | 1 | 4.0 | 0/1 | 1.67s |
| #19 | GPT-5.3 Chat none | OpenAI | 1 | 4.0 | 0/1 | 1.99s |
| #41 | Qwen3.5-27B none | Qwen | 1 | 5.0 | 0/1 | 2.51s |
| #25 | Claude Sonnet 4.6 none | Anthropic | 1 | 5.0 | 0/1 | 2.56s |
| #45 | Trinity Large Preview none | Arcee AI | 1 | 3.0 | 0/1 | 2.86s |
| #15 | GPT-5.2 Chat none | OpenAI | 1 | 4.0 | 0/1 | 3.20s |
| #21 | MiMo-V2-Flash medium | Xiaomi | 1 | 3.0 | 0/1 | 4.20s |
| #27 | GPT-5.2 medium | OpenAI | 1 | 10.0 | 0/1 | 4.32s |
| #16 | Gemini 2.5 Flash medium | 1 | 4.0 | 0/1 | 4.86s | |
| #3 | GPT-5.3-Codex medium | OpenAI | 1 | 4.0 | 0/1 | 4.87s |
| #9 | GPT-5.4 medium | OpenAI | 1 | 5.0 | 0/1 | 4.92s |
| #13 | Step 3.5 Flash medium | Stepfun | 1 | 6.0 | 0/1 | 6.54s |
| #43 | MiniMax M2.5 medium | Minimax | 1 | 3.0 | 0/1 | 6.63s |
| #39 | gpt-oss-120b medium | OpenAI | 1 | 3.0 | 0/1 | 7.90s |
| #32 | GPT-5 Mini medium | OpenAI | 1 | 4.0 | 0/1 | 13.5s |
| #14 | GLM 5 medium | Z.ai | 1 | 5.0 | 0/1 | 14.7s |
| #30 | Grok 4.1 Fast medium | X AI | 1 | 3.0 | 0/1 | 16.2s |
| #34 | GPT-5 Nano medium | OpenAI | 1 | 3.0 | 0/1 | 17.5s |
| #18 | DeepSeek V3.2 medium | DeepSeek | 1 | 3.0 | 0/1 | 31.3s |
| #23 | Seed-2.0-Mini medium | Bytedance Seed | 1 | 6.0 | 0/1 | 36.7s |
| #24 | Qwen3.5-Flash medium | Qwen | 1 | 5.0 | 0/1 | 40.1s |
| #28 | Kimi K2.5 medium | Moonshot AI | 1 | 6.0 | 0/1 | 69.7s |
| #7 | Qwen3.5-27B medium | Qwen | 1 | 5.0 | 0/1 | 101.4s |