AI BENCHY

AI BENCHY Category Failures

General Intelligence: Did not follow instructions

See which AI models are most likely to fail with "Did not follow instructions" on General Intelligence tests, so you can spot weak points faster. Models are sorted by average response time, slowest first.

Models Shown: 15
Total Failures: 58
Most Affected Model: Qwen3.5-27B (1)

| Rank | Model | Company | Did not follow instructions Count | Category Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #78 | Trinity Large Preview (none) | Arcee AI | 1 | 4.4 | 0/1 | 2.86s |
| #84 | gpt-oss-120b (none) | OpenAI | 1 | 4.6 | 0/1 | 2.83s |
| #72 | Hunter Alpha (none) | OpenRouter | 1 | 6.1 | 0/1 | 2.71s |
| #42 | Claude Sonnet 4.6 (none) | Anthropic | 1 | 6.1 | 0/1 | 2.56s |
| #67 | Qwen3.5-27B (none) | Qwen | 1 | 5.0 | 0/1 | 2.51s |
| #65 | MiMo-V2-Pro (none) | Xiaomi | 1 | 4.3 | 0/1 | 2.44s |
| #58 | GLM 5V Turbo (none) | Z.ai | 1 | 4.6 | 0/1 | 2.22s |
| #77 | GLM 5 Turbo (none) | Z.ai | 1 | 4.2 | 0/1 | 2.18s |
| #73 | Mistral Small 4 (medium) | Mistral | 1 | 4.8 | 0/1 | 2.05s |
| #36 | GPT-5.3 Chat (none) | OpenAI | 1 | 4.6 | 0/1 | 1.99s |
| #86 | GPT-5.4 Mini (none) | OpenAI | 1 | 4.8 | 0/1 | 1.82s |
| #94 | MiMo-V2-Flash (none) | Xiaomi | 1 | 4.6 | 0/1 | 1.67s |
| #69 | Kimi K2.6 (none) | Moonshot AI | 1 | 5.4 | 0/1 | 1.55s |
| #22 | Gemini 3.1 Flash Lite Preview (low) | Google | 1 | 4.0 | 0/1 | 1.54s |
| #92 | Qwen3 Coder Next (medium) | Qwen | 1 | 6.3 | 0/1 | 1.39s |
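
The ordering used in the table (average response time, slowest first) can be reproduced offline with a few lines of Python. This is only a minimal sketch: the `rows` list is a hand-copied subset of the table, and the helper names `seconds` and `sort_by_response_time` are hypothetical, not part of any AI BENCHY API.

```python
# Hand-copied subset of the table: (model name, average response time).
# The list is deliberately out of order to show the effect of sorting.
rows = [
    ("Qwen3 Coder Next", "1.39s"),
    ("Trinity Large Preview", "2.86s"),
    ("Claude Sonnet 4.6", "2.56s"),
    ("gpt-oss-120b", "2.83s"),
]


def seconds(value: str) -> float:
    """Convert a string such as '2.86s' to a float number of seconds."""
    return float(value.rstrip("s"))


def sort_by_response_time(rows, descending=True):
    """Return rows ordered by average response time (slowest first by default)."""
    return sorted(rows, key=lambda row: seconds(row[1]), reverse=descending)


ranked = sort_by_response_time(rows)
for model, response_time in ranked:
    print(f"{model}: {response_time}")
```

Passing `descending=False` would list the fastest models first instead.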

Charts: Top Models by Did not follow instructions Count; Did not follow instructions Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.