AI BENCHY

AI BENCHY Category Failures

Puzzle Solving: Did not follow instructions

See which AI models most often hit the "Did not follow instructions" failure on Puzzle Solving, so you can spot weak points faster. Sorted by: Response Time (avg), descending.

Models Shown: 15
Total Failures: 78
Most Affected Model: Qwen3.5-27B (1)

Rank | Model | Company | Did not follow instructions Count | Category Score | Tests Correct | Response Time (avg)
#147 | GPT-4o-mini none | OpenAI | 1 | 3.5 | 0/3 | 1.21s
#122 | GLM 4.7 Flash none | Z.ai | 1 | 6.4 | 1/3 | 1.20s
#153 | Qwen3.6 35B A3B none | Qwen | 2 | 3.2 | 0/3 | 1.07s
#104 | Nemotron 3 Ultra 550b A55b none | NVIDIA | 1 | 5.9 | 1/3 | 1.06s
#131 | Qwen3.5-122B-A10B none | Qwen | 1 | 3.8 | 0/3 | 1.00s
#81 | Mercury 2 medium | Inception | 1 | 5.4 | 1/3 | 949ms
#136 | Elephant Alpha medium | Openrouter | 1 | 5.3 | 1/3 | 868ms
#144 | GPT-5.4 Mini none | OpenAI | 1 | 5.4 | 1/3 | 836ms
#137 | Elephant Alpha none | Openrouter | 1 | 4.2 | 0/3 | 807ms
#102 | Gemma 4 26B A4B none | Google | 1 | 6.2 | 1/3 | 744ms
#90 | Gemini 3.1 Flash Lite none | Google | 1 | 6.3 | 1/3 | 720ms
#154 | Qwen3.5-9B none | Qwen | 1 | 3.2 | 0/3 | 621ms
#163 | Granite 4.1 8B none | IBM Granite | 1 | 3.2 | 0/3 | 608ms
#162 | Nemotron 3 Nano Omni 30b A3b Reasoning none | NVIDIA | 1 | 3.0 | 0/3 | 532ms
#142 | Mistral Small 4 none | Mistral | 1 | 3.1 | 0/3 | 399ms
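
The Response Time (avg) column mixes seconds ("1.21s") and milliseconds ("949ms"), so the values cannot be compared as plain strings. The sketch below is one possible way to normalize them to milliseconds and confirm that the rows are in descending order. The helper name `to_milliseconds` is hypothetical and not part of AI BENCHY. The values are copied from the table above.

```python
# Response Time (avg) values copied from the table above, in table order.
response_times = [
    "1.21s", "1.20s", "1.07s", "1.06s", "1.00s",
    "949ms", "868ms", "836ms", "807ms", "744ms",
    "720ms", "621ms", "608ms", "532ms", "399ms",
]


def to_milliseconds(value: str) -> float:
    """Convert a string such as '1.21s' or '949ms' to milliseconds.

    Hypothetical helper: it assumes only the two unit suffixes seen
    in the table ('s' and 'ms') can occur.
    """
    text = value.strip().lower()
    # Check 'ms' first, because every 'ms' value also ends with 's'.
    if text.endswith("ms"):
        return float(text[:-2])
    if text.endswith("s"):
        return float(text[:-1]) * 1000.0
    raise ValueError(f"unrecognized response time: {value!r}")


normalized = [to_milliseconds(v) for v in response_times]

# True when each row is at least as slow as the row below it.
is_descending = all(a >= b for a, b in zip(normalized, normalized[1:]))
print(is_descending)
```

Checking the "ms" suffix before the "s" suffix matters here, since a value like "949ms" would otherwise be misread as seconds.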

[Charts: Top Models by "Did not follow instructions" Count; "Did not follow instructions" Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost]