AI BENCHY

General Intelligence: Did not follow instructions

See which AI models most often fail General Intelligence tests with a "Did not follow instructions" result, so you can spot weak points faster. Models are sorted by average response time, fastest first.

Models Shown: 15
Total Failures: 74
Most Affected Model: LFM2-24B-A2B (1)

Rank  Model                    Company         Did not follow instructions Count  Category Score  Tests Correct  Response Time (avg)
#129  MiniMax M2.5 medium      Minimax         1                                  3.8             0/1            6.63s
#105  Nemotron 3 Super medium  NVIDIA          1                                  4.1             0/1            6.91s
#99   gpt-oss-120b medium      OpenAI          1                                  4.3             0/1            7.90s
#46   Qwen3.6 35B A3B medium   Qwen            1                                  4.4             0/1            8.66s
#54   GPT-5 Mini medium        OpenAI          1                                  4.5             0/1            13.5s
#83   Step 3.5 Flash none      Stepfun         1                                  4.0             0/1            14.4s
#17   GLM 5 medium             Z.ai            1                                  6.1             0/1            14.7s
#156  Hy3 preview none         Tencent         1                                  4.1             0/1            16.1s
#86   Grok 4.1 Fast medium     X AI            1                                  4.2             0/1            16.2s
#94   GPT-5 Nano medium        OpenAI          1                                  4.1             0/1            17.5s
#19   Seed-2.0-Lite medium     Bytedance Seed  1                                  6.7             0/1            18.2s
#159  Ling-2.6-1T none         Inclusionai     1                                  5.0             0/1            20.3s
#62   Step 3.5 Flash medium    Stepfun         1                                  5.5             0/1            22.4s
#119  Cobuddy medium           Baidu           1                                  4.2             0/1            23.2s
#65   Grok 4.20 medium         X AI            1                                  3.9             0/1            24.5s

Charts (not reproduced here): Top Models by Did not follow instructions Count; Did not follow instructions Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.