AI BENCHY

AI BENCHY Failures

Failures: "Did not follow instructions"

See which AI models most often fail with "Did not follow instructions," so you can spot reliability risks before choosing one.

Models shown: 15
Total failures: 225
Most affected model: MiniMax M2.7 (5 failures)

Rank  Model                         Company         Failures  Score  Tests correct  Avg. response time (s)
#60   GLM 5V Turbo (medium)         Z.ai            1         7.4    11/20          20.3
#63   Claude Opus 4.6 (medium)      Anthropic       1         7.2    12/20          25.4
#67   MiMo-V2-Flash (medium)        Xiaomi          1         7.1    11/20          20.3
#68   Seed-2.0-Mini (medium)        Bytedance Seed  1         7.1    11/20          79.2
#69   Claude Sonnet 4.6 (none)      Anthropic       1         7.0    11/20          5.33
#74   Laguna M.1 (medium)           Poolside        1         6.9    12/19          14.4
#76   Gemma 4 31B (none)            Google          1         6.7    10/20          3.84
#83   Qwen3.6 27B (medium)          Qwen            1         6.6    9/20           57.7
#85   Gemini 3.1 Flash Lite (none)  Google          1         6.6    9/20           1.09
#92   Gemini 2.5 Flash (none)       Google          1         6.2    8/20           0.893
#93   MiMo-V2-Omni (none)           Xiaomi          1         6.2    8/20           2.44
#109  GLM 4.7 Flash (none)          Z.ai            1         5.6    6/20           2.98
#112  GPT-5.4 (none)                OpenAI          1         5.6    7/20           1.46
#113  GLM 5.1 (none)                Z.ai            1         5.6    6/20           4.16
#116  Qwen3.6 Flash (none)          Qwen            1         5.5    7/20           1.64

[Charts: top models by "Did not follow instructions" count; "Did not follow instructions" count vs. score; top models by average response time]