AI BENCHY

AI BENCHY Failures

"Did not follow instructions" Failures

See which AI models most often fail with "Did not follow instructions", so you can spot reliability risks before choosing one. Sorted by average response time, slowest first.

Models Shown: 15

Total Failures: 215

Most Affected Model: Kimi K2.5 (2)

| Rank | Model | Company | Did not follow instructions Count | Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #39 | Qwen3.6 Flash medium | Qwen | 1 | 7.5 | 12/21 | 19.2s |
| #46 | Qwen3.6 35B A3B medium | Qwen | 1 | 7.4 | 13/21 | 18.1s |
| #149 | Nemotron 3 Nano Omni 30b A3b Reasoning medium | NVIDIA | 1 | 4.6 | 4/19 | 17.1s |
| #42 | GPT-5.2 medium | OpenAI | 3 | 7.5 | 13/21 | 16.9s |
| #33 | Hy3 preview medium | Tencent | 1 | 7.7 | 14/21 | 16.3s |
| #15 | GPT-5.3-Codex medium | OpenAI | 2 | 8.4 | 15/21 | 16.2s |
| #28 | Gemini 2.5 Flash medium | Google | 1 | 7.8 | 14/21 | 15.5s |
| #92 | Laguna M.1 medium | Poolside | 1 | 6.4 | 9/19 | 14.7s |
| #133 | DeepSeek V3.2 none | DeepSeek | 1 | 5.2 | 6/21 | 13.8s |
| #124 | Kimi K2.6 none | Moonshot AI | 3 | 5.5 | 7/21 | 13.3s |
| #156 | Hy3 preview none | Tencent | 4 | 4.4 | 4/21 | 12.9s |
| #113 | DeepSeek V4 Pro none | DeepSeek | 2 | 5.7 | 7/21 | 12.4s |
| #70 | GPT-5.4 Nano medium | OpenAI | 2 | 7.0 | 11/21 | 12.0s |
| #111 | Owl Alpha medium | OpenRouter | 2 | 5.7 | 8/21 | 11.9s |
| #79 | Hunter Alpha medium | OpenRouter | 2 | 6.7 | 8/18 | 10.3s |
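
The models above were not all run on the same number of tests (18, 19, or 21), so raw failure counts are not strictly comparable. A minimal sketch of normalizing them is shown here. It assumes each "Did not follow instructions" failure corresponds to exactly one test, which the page does not confirm. The `Row` class and `failure_rate` name are hypothetical, and only a handful of rows from the table are included.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """One leaderboard row (hypothetical structure, values copied from the table)."""
    model: str
    failures: int  # "Did not follow instructions Count"
    correct: int   # numerator of "Tests Correct"
    tests: int     # denominator of "Tests Correct"

    @property
    def failure_rate(self) -> float:
        # Assumption: each counted failure is one test out of `tests`.
        return self.failures / self.tests


rows = [
    Row("GPT-5.2 medium", 3, 13, 21),
    Row("Hy3 preview none", 4, 4, 21),
    Row("GPT-5.4 Nano medium", 2, 11, 21),
    Row("Hunter Alpha medium", 2, 8, 18),
    Row("Nemotron 3 Nano Omni 30b A3b Reasoning medium", 1, 4, 19),
]

# Rank by failures per test instead of by raw count.
ranked = sorted(rows, key=lambda r: r.failure_rate, reverse=True)

for r in ranked:
    print(f"{r.model}: {r.failures}/{r.tests} = {r.failure_rate:.1%}")
```

Under this assumption, Hunter Alpha medium (2 of 18) ranks above GPT-5.4 Nano medium (2 of 21) even though both have the same raw count.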

[Chart: Top Models by Did not follow instructions Count]

[Chart: Did not follow instructions Count vs Score]

[Chart: Top Models by Response Time (avg)]