AI BENCHY

General Intelligence: Did not follow instructions

See which AI models most often hit the "Did not follow instructions" failure in the General Intelligence category, so you can spot weak points faster. Sorted by: Response Time (avg), ascending.

Models Shown: 14
Total Failures: 74
Most Affected Model: LFM2-24B-A2B (1)

| Rank | Model | Company | Did not follow instructions Count | Category Score | Tests Correct | Response Time (avg) |
|------|-------|---------|-----------------------------------|----------------|---------------|---------------------|
| #38 | Grok 4.3 medium | X AI | 1 | 5.4 | 0/1 | 24.7s |
| #103 | DeepSeek V4 Pro high | DeepSeek | 1 | 6.1 | 0/1 | 25.1s |
| #31 | DeepSeek V4 Flash high | DeepSeek | 1 | 6.1 | 0/1 | 25.2s |
| #26 | Qwen3.6 Plus medium | Qwen | 1 | 5.1 | 0/1 | 27.1s |
| #67 | MiniMax M3 medium | Minimax | 1 | 5.1 | 0/1 | 33.3s |
| #73 | Seed-2.0-Mini medium | Bytedance Seed | 1 | 5.1 | 0/1 | 36.7s |
| #130 | MiniMax M2.7 medium | Minimax | 1 | 3.9 | 0/1 | 38.7s |
| #78 | Qwen3.6 27B medium | Qwen | 1 | 6.5 | 0/1 | 39.5s |
| #49 | Qwen3.5-Flash medium | Qwen | 1 | 6.1 | 0/1 | 40.1s |
| #53 | Gemini 3.1 Flash Lite high | Google | 1 | 5.0 | 0/1 | 45.7s |
| #75 | Ring-2.6-1T medium | Inclusionai | 1 | 4.1 | 0/1 | 58.3s |
| #111 | Owl Alpha medium | Openrouter | 1 | 4.3 | 0/1 | 58.6s |
| #76 | Kimi K2.5 medium | Moonshot AI | 1 | 6.5 | 0/1 | 69.7s |
| #30 | Qwen3.5-27B medium | Qwen | 1 | 6.1 | 0/1 | 101.4s |

[Charts: Top Models by Did not follow instructions Count; Did not follow instructions Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost]