AI BENCHY Failures
"Did not follow instructions" Failures
See which AI models hit the "Did not follow instructions" failure most often, so you can spot reliability risks before choosing one.
| Rank | Model | Company | "Did not follow instructions" Count | Score | Tests Correct | Avg. Response Time |
|---|---|---|---|---|---|---|
| #129 | GPT-4o-mini none | OpenAI | 1 | 4.9 | 5/19 | 1.90s |
| #133 | Mercury 2 none | Inception | 1 | 4.7 | 4/19 | 610ms |
| #136 | Nemotron 3 Nano Omni 30b A3b Reasoning none | NVIDIA | 1 | 4.6 | 8/19 | 726ms |
| #139 | MiMo-V2-Flash none | Xiaomi | 1 | 4.5 | 3/19 | 2.73s |
| #142 | Qwen3.5-9B medium | Qwen | 1 | 4.3 | 3/19 | 80.1s |
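
The table mixes units (milliseconds and seconds) and reports correctness as a fraction, which makes rows harder to compare at a glance. The sketch below is one possible way to normalize them, not part of AI BENCHY itself: the helper names `pass_rate`, `to_seconds`, and `summarize` are hypothetical, and the rows are copied by hand from the table above.

```python
import re

# Rows copied from the table above: (model, tests correct, avg response time).
ROWS = [
    ("GPT-4o-mini none", "5/19", "1.90s"),
    ("Mercury 2 none", "4/19", "610ms"),
    ("Nemotron 3 Nano Omni 30b A3b Reasoning none", "8/19", "726ms"),
    ("MiMo-V2-Flash none", "3/19", "2.73s"),
    ("Qwen3.5-9B medium", "3/19", "80.1s"),
]


def pass_rate(tests_correct: str) -> float:
    """Convert a 'passed/total' string such as '5/19' into a fraction."""
    passed, total = tests_correct.split("/")
    return int(passed) / int(total)


def to_seconds(duration: str) -> float:
    """Convert a duration such as '610ms' or '1.90s' into seconds."""
    match = re.fullmatch(r"([0-9.]+)\s*(ms|s)", duration.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {duration!r}")
    value, unit = match.groups()
    return float(value) / 1000 if unit == "ms" else float(value)


def summarize(rows):
    """Return (model, percent correct, seconds) for each row."""
    return [
        (model, round(pass_rate(correct) * 100, 1), to_seconds(duration))
        for model, correct, duration in rows
    ]


if __name__ == "__main__":
    for model, percent, seconds in summarize(ROWS):
        print(f"{model}: {percent}% correct, {seconds:.3f}s avg")
```

With everything in seconds and percentages, the spread is easier to see: the slowest row averages over 80 seconds per response, while the two fastest average under one second.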