AI BENCHY Failures
"Did not follow instructions" Failures
See which AI models most often fail with "Did not follow instructions", so you can spot reliability risks before choosing one.
| Rank | Model | Company | Did not follow instructions Count | Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #60 | GLM 5V Turbo medium | Z.ai | 1 | 7.4 | 11/20 | 20.3s |
| #63 | Claude Opus 4.6 medium | Anthropic | 1 | 7.2 | 12/20 | 25.4s |
| #67 | MiMo-V2-Flash medium | Xiaomi | 1 | 7.1 | 11/20 | 20.3s |
| #68 | Seed-2.0-Mini medium | Bytedance Seed | 1 | 7.1 | 11/20 | 79.2s |
| #69 | Claude Sonnet 4.6 none | Anthropic | 1 | 7.0 | 11/20 | 5.33s |
| #74 | Laguna M.1 medium | Poolside | 1 | 6.9 | 12/19 | 14.4s |
| #76 | Gemma 4 31B none | | 1 | 6.7 | 10/20 | 3.84s |
| #83 | Qwen3.6 27B medium | Qwen | 1 | 6.6 | 9/20 | 57.7s |
| #85 | Gemini 3.1 Flash Lite none | | 1 | 6.6 | 9/20 | 1.09s |
| #92 | Gemini 2.5 Flash none | | 1 | 6.2 | 8/20 | 893ms |
| #93 | MiMo-V2-Omni none | Xiaomi | 1 | 6.2 | 8/20 | 2.44s |
| #109 | GLM 4.7 Flash none | Z.ai | 1 | 5.6 | 6/20 | 2.98s |
| #112 | GPT-5.4 none | OpenAI | 1 | 5.6 | 7/20 | 1.46s |
| #113 | GLM 5.1 none | Z.ai | 1 | 5.6 | 6/20 | 4.16s |
| #116 | Qwen3.6 Flash none | Qwen | 1 | 5.5 | 7/20 | 1.64s |