AI BENCHY

AI BENCHY Category Failures

General Intelligence: Did not follow instructions


See which AI models are most likely to fail with "Did not follow instructions" on General Intelligence tests, so you can spot weak points faster. Models are sorted by average response time, slowest first.

Models Shown: 15
Total Failures: 58
Most Affected Model: Qwen3.5-27B (1)

| Rank | Model | Company | Did not follow instructions Count | Category Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #10 | Qwen3.5-27B medium | Qwen | 1 | 6.1 | 0/1 | 101.4s |
| #46 | Kimi K2.5 medium | Moonshot AI | 1 | 6.5 | 0/1 | 69.7s |
| #32 | Qwen3.5-Flash medium | Qwen | 1 | 6.1 | 0/1 | 40.1s |
| #80 | MiniMax M2.7 medium | Minimax | 1 | 3.9 | 0/1 | 38.7s |
| #39 | Seed-2.0-Mini medium | Bytedance Seed | 1 | 5.1 | 0/1 | 36.7s |
| #27 | DeepSeek V3.2 medium | DeepSeek | 1 | 5.4 | 0/1 | 31.3s |
| #51 | Nemotron 3 Super medium | NVIDIA | 1 | 3.8 | 0/1 | 27.9s |
| #9 | Qwen3.6 Plus Preview medium | Qwen | 1 | 5.1 | 0/1 | 27.1s |
| #20 | Qwen3.6 Plus medium | Qwen | 1 | 5.1 | 0/1 | 27.1s |
| #88 | Nemotron 3 Super none | NVIDIA | 1 | 4.2 | 0/1 | 25.0s |
| #6 | Seed-2.0-Lite medium | Bytedance Seed | 1 | 6.7 | 0/1 | 18.2s |
| #57 | GPT-5 Nano medium | OpenAI | 1 | 4.1 | 0/1 | 17.5s |
| #52 | Grok 4.1 Fast medium | X AI | 1 | 4.2 | 0/1 | 16.2s |
| #13 | GLM 5 medium | Z.ai | 1 | 6.1 | 0/1 | 14.7s |
| #45 | GPT-5 Mini medium | OpenAI | 1 | 4.5 | 0/1 | 13.5s |

Charts (not shown): Top Models by Did not follow instructions Count; Did not follow instructions Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.