AI BENCHY
AI BENCHY Failures

"Did not follow instructions" Failures

See which AI models run into "Did not follow instructions" failures most often, so you can spot reliability risks before choosing one. Models are sorted by failure count, ascending.

Models Shown: 15

Total Failures: 215

Most Affected Model: Gemini 3.5 Flash 1

Rank | Model | Company | Did not follow instructions Count | Score | Tests Correct | Response Time (avg)
#114 | Qwen3.5 Plus 2026-04-20 (none) | Qwen | 2 | 5.7 | 7/21 | 4.39s
#115 | Qwen3.5-27B (none) | Qwen | 2 | 5.7 | 7/21 | 1.68s
#116 | Hunter Alpha (none) | OpenRouter | 2 | 5.7 | 6/18 | 4.70s
#117 | Qwen3.5-35B-A3B (none) | Qwen | 2 | 5.6 | 7/21 | 3.37s
#118 | Qwen3.6 27B (none) | Qwen | 2 | 5.6 | 7/21 | 3.72s
#120 | Mimo V2 PRO (none) | Xiaomi | 2 | 5.6 | 7/21 | 2.27s
#126 | gpt-oss-120b (none) | OpenAI | 2 | 5.4 | 6/19 | 21.6s
#131 | Qwen3.5-122B-A10B (none) | Qwen | 2 | 5.3 | 6/21 | 3.41s
#132 | Mistral Small 4 (medium) | Mistral | 2 | 5.3 | 5/21 | 9.40s
#134 | GLM 5 Turbo (none) | Z.ai | 2 | 5.2 | 6/21 | 2.82s
#136 | Elephant Alpha (medium) | OpenRouter | 2 | 5.1 | 6/21 | 1.27s
#138 | Ling-2.6-flash (none) | Inclusionai | 2 | 5.0 | 6/21 | 9.34s
#141 | Nemotron 3 Super (none) | NVIDIA | 2 | 4.9 | 5/21 | 5.30s
#148 | GPT-5.4 Nano (none) | OpenAI | 2 | 4.7 | 4/21 | 1.48s
#152 | MiMo-V2-Flash (none) | Xiaomi | 2 | 4.6 | 4/21 | 2.76s
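
Because the models were not all run on the same number of tests (18, 19, or 21), the raw counts are easier to compare as rates. The sketch below is illustrative only: it copies a few rows from the table and computes the pass rate from the Tests Correct column, plus the share of tests that ended in a "Did not follow instructions" failure. The helper names pass_rate and failure_share are hypothetical, and the second figure assumes each counted failure corresponds to exactly one test, which the page does not state.

```python
# Rows copied from the table: (model, "Did not follow instructions" count, Tests Correct)
ROWS = [
    ("Qwen3.5 Plus 2026-04-20", 2, "7/21"),
    ("Hunter Alpha", 2, "6/18"),
    ("gpt-oss-120b", 2, "6/19"),
    ("Mistral Small 4", 2, "5/21"),
    ("GPT-5.4 Nano", 2, "4/21"),
]


def pass_rate(tests_correct: str) -> float:
    """Fraction of tests answered correctly, parsed from a 'correct/total' string."""
    correct, total = (int(part) for part in tests_correct.split("/"))
    return correct / total


def failure_share(count: int, tests_correct: str) -> float:
    """Share of tests with this failure type (assumes one failure per test)."""
    total = int(tests_correct.split("/")[1])
    return count / total


# Print the models from highest to lowest pass rate.
for model, count, tests_correct in sorted(
    ROWS, key=lambda row: pass_rate(row[2]), reverse=True
):
    print(
        f"{model:<26} pass rate {pass_rate(tests_correct):.1%}, "
        f"instruction failures {failure_share(count, tests_correct):.1%}"
    )
```

Normalizing this way shows, for example, that 6/18 and 7/21 are the same pass rate, even though the raw Tests Correct values differ.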

[Charts not shown: Top Models by Did not follow instructions Count; Did not follow instructions Count vs Score; Top Models by Response Time (avg)]