AI BENCHY

AI BENCHY Category Failures

General Intelligence: Did not follow instructions

See which AI models are most likely to hit the "Did not follow instructions" failure in the General Intelligence category, so you can spot weak points faster. Results are sorted by average response time, ascending.

Models Shown: 15
Total Failures: 58
Most Affected Model: LFM2-24B-A2B 1

Rank | Model | Company | Did not follow instructions Count | Category Score | Tests Correct | Response Time (avg)
#69 | Kimi K2.6 none | Moonshot AI | 1 | 5.4 | 0/1 | 1.55s
#94 | MiMo-V2-Flash none | Xiaomi | 1 | 4.6 | 0/1 | 1.67s
#86 | GPT-5.4 Mini none | OpenAI | 1 | 4.8 | 0/1 | 1.82s
#36 | GPT-5.3 Chat none | OpenAI | 1 | 4.6 | 0/1 | 1.99s
#73 | Mistral Small 4 medium | Mistral | 1 | 4.8 | 0/1 | 2.05s
#77 | GLM 5 Turbo none | Z.ai | 1 | 4.2 | 0/1 | 2.18s
#58 | GLM 5V Turbo none | Z.ai | 1 | 4.6 | 0/1 | 2.22s
#65 | MiMo-V2-Pro none | Xiaomi | 1 | 4.3 | 0/1 | 2.44s
#67 | Qwen3.5-27B none | Qwen | 1 | 5.0 | 0/1 | 2.51s
#42 | Claude Sonnet 4.6 none | Anthropic | 1 | 6.1 | 0/1 | 2.56s
#72 | Hunter Alpha none | OpenRouter | 1 | 6.1 | 0/1 | 2.71s
#84 | gpt-oss-120b none | OpenAI | 1 | 4.6 | 0/1 | 2.83s
#78 | Trinity Large Preview none | Arcee AI | 1 | 4.4 | 0/1 | 2.86s
#28 | GPT-5.2 Chat none | OpenAI | 1 | 4.4 | 0/1 | 3.20s
#60 | Gemma 4 26B A4B none | Google | 1 | 4.0 | 0/1 | 3.54s

[Charts: Top Models by Did not follow instructions Count; Did not follow instructions Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost]