AI BENCHY Category Failures
General Intelligence
Did not follow instructions
See which AI models are most likely to hit the "Did not follow instructions" failure on General Intelligence tests, so you can spot weak points faster. The table is sorted by average response time, fastest first.
| Rank | Model | Company | "Did not follow instructions" Count | Category Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #55 | LFM2-24B-A2B none | Liquid | 1 | 3.0 | 0/1 | 395ms |
| #51 | Mercury 2 none | Inception | 1 | 4.0 | 0/1 | 628ms |
| #22 | Gemini 3.1 Flash Lite Preview none | | 1 | 3.0 | 0/1 | 741ms |
| #36 | Mercury 2 medium | Inception | 1 | 4.0 | 0/1 | 821ms |
| #53 | Grok 4.1 Fast none | X AI | 1 | 3.0 | 0/1 | 1.08s |
| #40 | Qwen3.5-122B-A10B none | Qwen | 1 | 5.0 | 0/1 | 1.12s |
| #42 | Qwen3.5-35B-A3B none | Qwen | 1 | 6.0 | 0/1 | 1.19s |
| #50 | Qwen3 Coder Next medium | Qwen | 1 | 6.0 | 0/1 | 1.39s |
| #17 | Gemini 3.1 Flash Lite Preview low | | 1 | 3.0 | 0/1 | 1.54s |
| #54 | MiMo-V2-Flash none | Xiaomi | 1 | 4.0 | 0/1 | 1.67s |
| #19 | GPT-5.3 Chat none | OpenAI | 1 | 4.0 | 0/1 | 1.99s |
| #41 | Qwen3.5-27B none | Qwen | 1 | 5.0 | 0/1 | 2.51s |
| #25 | Claude Sonnet 4.6 none | Anthropic | 1 | 5.0 | 0/1 | 2.56s |
| #45 | Trinity Large Preview none | Arcee AI | 1 | 3.0 | 0/1 | 2.86s |
| #15 | GPT-5.2 Chat none | OpenAI | 1 | 4.0 | 0/1 | 3.20s |
| #21 | MiMo-V2-Flash medium | Xiaomi | 1 | 3.0 | 0/1 | 4.20s |
| #27 | GPT-5.2 medium | OpenAI | 1 | 10.0 | 0/1 | 4.32s |
| #16 | Gemini 2.5 Flash medium | | 1 | 4.0 | 0/1 | 4.86s |
| #3 | GPT-5.3-Codex medium | OpenAI | 1 | 4.0 | 0/1 | 4.87s |
| #9 | GPT-5.4 medium | OpenAI | 1 | 5.0 | 0/1 | 4.92s |
| #13 | Step 3.5 Flash medium | Stepfun | 1 | 6.0 | 0/1 | 6.54s |
| #43 | MiniMax M2.5 medium | Minimax | 1 | 3.0 | 0/1 | 6.63s |
| #39 | gpt-oss-120b medium | OpenAI | 1 | 3.0 | 0/1 | 7.90s |
| #32 | GPT-5 Mini medium | OpenAI | 1 | 4.0 | 0/1 | 13.5s |
| #14 | GLM 5 medium | Z.ai | 1 | 5.0 | 0/1 | 14.7s |
| #30 | Grok 4.1 Fast medium | X AI | 1 | 3.0 | 0/1 | 16.2s |
| #34 | GPT-5 Nano medium | OpenAI | 1 | 3.0 | 0/1 | 17.5s |
| #18 | DeepSeek V3.2 medium | DeepSeek | 1 | 3.0 | 0/1 | 31.3s |
| #23 | Seed-2.0-Mini medium | Bytedance Seed | 1 | 6.0 | 0/1 | 36.7s |
| #24 | Qwen3.5-Flash medium | Qwen | 1 | 5.0 | 0/1 | 40.1s |
| #28 | Kimi K2.5 medium | Moonshot AI | 1 | 6.0 | 0/1 | 69.7s |
| #7 | Qwen3.5-27B medium | Qwen | 1 | 5.0 | 0/1 | 101.4s |