AI BENCHY

Did not follow instructions Failures

See which AI models most often fail with "Did not follow instructions", so you can spot reliability risks before choosing one. Sorted by score, descending.

Models Shown: 15

Total Failures: 215

Most Affected Model: Gemini 3.5 Flash 1

| Rank | Model | Company | Did not follow instructions Count | Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #118 | Qwen3.6 27B none | Qwen | 2 | 5.6 | 7/21 | 3.72s |
| #119 | Cobuddy medium | Baidu | 3 | 5.6 | 7/21 | 39.9s |
| #120 | Mimo V2 PRO none | Xiaomi | 2 | 5.6 | 7/21 | 2.27s |
| #121 | Owl Alpha none | Openrouter | 3 | 5.5 | 7/21 | 9.88s |
| #122 | GLM 4.7 Flash none | Z.ai | 1 | 5.5 | 6/21 | 2.86s |
| #123 | MiMo-V2.5-Pro none | Xiaomi | 4 | 5.5 | 6/21 | 1.78s |
| #124 | Kimi K2.6 none | Moonshot AI | 3 | 5.5 | 7/21 | 13.3s |
| #125 | GPT-5.4 none | OpenAI | 1 | 5.5 | 7/21 | 1.42s |
| #126 | gpt-oss-120b none | OpenAI | 2 | 5.4 | 6/19 | 21.6s |
| #128 | Qwen3.6 Flash none | Qwen | 1 | 5.4 | 7/21 | 1.60s |
| #129 | MiniMax M2.5 medium | Minimax | 3 | 5.3 | 5/21 | 65.4s |
| #130 | MiniMax M2.7 medium | Minimax | 5 | 5.3 | 5/21 | 38.2s |
| #131 | Qwen3.5-122B-A10B none | Qwen | 2 | 5.3 | 6/21 | 3.41s |
| #132 | Mistral Small 4 medium | Mistral | 2 | 5.3 | 5/21 | 9.40s |
| #133 | DeepSeek V3.2 none | DeepSeek | 1 | 5.2 | 6/21 | 13.8s |
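
The raw failure counts in the rows above are easier to compare once they are normalized by how many tests each model ran. The following is a minimal Python sketch of that comparison, not part of the AI BENCHY site. It assumes that the denominator of "Tests Correct" is the number of tests run and that each "Did not follow instructions" failure corresponds to one test; the page does not confirm either. The names `ROWS`, `failure_rate`, and `rank_by_failure_rate` are hypothetical.

```python
# Rows copied from the leaderboard:
# (model label, "Did not follow instructions" count, tests correct, tests run).
ROWS = [
    ("Qwen3.6 27B none", 2, 7, 21),
    ("Cobuddy medium", 3, 7, 21),
    ("Mimo V2 PRO none", 2, 7, 21),
    ("Owl Alpha none", 3, 7, 21),
    ("GLM 4.7 Flash none", 1, 6, 21),
    ("MiMo-V2.5-Pro none", 4, 6, 21),
    ("Kimi K2.6 none", 3, 7, 21),
    ("GPT-5.4 none", 1, 7, 21),
    ("gpt-oss-120b none", 2, 6, 19),
    ("Qwen3.6 Flash none", 1, 7, 21),
    ("MiniMax M2.5 medium", 3, 5, 21),
    ("MiniMax M2.7 medium", 5, 5, 21),
    ("Qwen3.5-122B-A10B none", 2, 6, 21),
    ("Mistral Small 4 medium", 2, 5, 21),
    ("DeepSeek V3.2 none", 1, 6, 21),
]


def failure_rate(failure_count, tests_run):
    """Fraction of tests that ended in this failure (assumes one failure per test)."""
    return failure_count / tests_run


def rank_by_failure_rate(rows):
    """Sort rows from highest to lowest failure rate, breaking ties by model label."""
    return sorted(rows, key=lambda r: (-failure_rate(r[1], r[3]), r[0]))


if __name__ == "__main__":
    # Print the five listed models with the highest failure rate.
    for model, count, _correct, run in rank_by_failure_rate(ROWS)[:5]:
        print(f"{model}: {count}/{run} = {failure_rate(count, run):.1%}")
```

Normalizing matters mainly for rows with a different test count, such as gpt-oss-120b at 19 tests. The 15 listed rows account for 35 failures in total, so the page's figure of 215 presumably covers models that are not shown here.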

[Chart: Top Models by Did not follow instructions Count]

[Chart: Did not follow instructions Count vs Score]

[Chart: Top Models by Response Time (avg)]