AI BENCHY

AI BENCHY Failures

Did not follow instructions Failures

See which AI models hit the "Did not follow instructions" failure most often, so you can spot reliability risks before choosing one. The table is sorted by average response time, fastest first.

Models Shown: 15
Total Failures: 180
Most Affected Model: Mercury 2 1

Rank  Model                        Company  Did not follow instructions Count  Score  Tests Correct  Response Time (avg)
#38   GPT-5.4 Nano medium          OpenAI   3                                  7.6    11/18          11.2s
#84   gpt-oss-120b none            OpenAI   5                                  5.2    4/18           12.0s
#15   Gemini 2.5 Flash medium      Google   1                                  8.2    13/18          12.1s
#23   MiMo-V2-Pro medium           Xiaomi   1                                  8.1    12/18          12.3s
#9    Qwen3.6 Plus Preview medium  Qwen     1                                  8.5    13/17          13.9s
#40   GPT-5.2 medium               OpenAI   3                                  7.5    11/18          14.0s
#31   GLM 5V Turbo medium          Z.ai     2                                  7.8    11/18          15.0s
#44   GPT-5.4 Mini medium          OpenAI   5                                  7.3    9/18           15.2s
#20   Qwen3.6 Plus medium          Qwen     1                                  8.1    13/18          15.3s
#7    GPT-5.3-Codex medium         OpenAI   2                                  8.6    13/18          15.4s
#68   gpt-oss-120b medium          OpenAI   4                                  5.8    7/18           16.1s
#35   MiMo-V2-Omni medium          Xiaomi   2                                  7.7    11/18          16.8s
#18   GLM 5 Turbo medium           Z.ai     2                                  8.1    12/18          17.7s
#16   GPT-5.4 medium               OpenAI   2                                  8.2    13/18          18.6s
#51   Nemotron 3 Super medium      NVIDIA   4                                  6.7    9/18           19.1s
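
As a rough illustration of how the table above could be re-ranked by failure count or checked for a relationship between failure count and score, here is a minimal Python sketch. The row values are copied from the table. The names `rows`, `by_failures`, `shown_total`, and `pearson` are hypothetical helpers introduced for this example and are not part of AI Benchy.

```python
from math import sqrt

# (model, "Did not follow instructions" count, score, avg response time in seconds)
# Values copied from the table above; the tuple layout is an assumption of this sketch.
rows = [
    ("GPT-5.4 Nano medium", 3, 7.6, 11.2),
    ("gpt-oss-120b none", 5, 5.2, 12.0),
    ("Gemini 2.5 Flash medium", 1, 8.2, 12.1),
    ("MiMo-V2-Pro medium", 1, 8.1, 12.3),
    ("Qwen3.6 Plus Preview medium", 1, 8.5, 13.9),
    ("GPT-5.2 medium", 3, 7.5, 14.0),
    ("GLM 5V Turbo medium", 2, 7.8, 15.0),
    ("GPT-5.4 Mini medium", 5, 7.3, 15.2),
    ("Qwen3.6 Plus medium", 1, 8.1, 15.3),
    ("GPT-5.3-Codex medium", 2, 8.6, 15.4),
    ("gpt-oss-120b medium", 4, 5.8, 16.1),
    ("MiMo-V2-Omni medium", 2, 7.7, 16.8),
    ("GLM 5 Turbo medium", 2, 8.1, 17.7),
    ("GPT-5.4 medium", 2, 8.2, 18.6),
    ("Nemotron 3 Super medium", 4, 6.7, 19.1),
]

# Rank by failure count, highest first; ties go to the faster model.
by_failures = sorted(rows, key=lambda r: (-r[1], r[3]))

# Failures summed over the 15 rows shown here only.
shown_total = sum(r[1] for r in rows)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Correlation between failure count and score across the shown rows.
count_vs_score = pearson([r[1] for r in rows], [r[2] for r in rows])

if __name__ == "__main__":
    for model, count, score, seconds in by_failures[:5]:
        print(f"{model}: {count} failures, score {score}, {seconds}s")
    print("failures in shown rows:", shown_total)
    print("count vs score correlation:", round(count_vs_score, 2))
```

With these rows, `by_failures` starts with gpt-oss-120b none and `shown_total` is 38, so the 15 rows listed account for only part of the 180 total failures reported above.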

Charts: Top Models by Did not follow instructions Count; Did not follow instructions Count vs Score; Top Models by Response Time (avg).