AI BENCHY

AI BENCHY Category Failures

Instructions following: Did not follow instructions

See which AI models are most likely to hit the "Did not follow instructions" failure in the "Instructions following" category, so you can spot weak points faster. Sorted by: Response Time (avg), ascending.

Models Shown: 11
Total Failures: 11
Most Affected Model: Granite 4.1 8B (1)

| Rank | Model | Company | Did not follow instructions Count | Category Score | Tests Correct | Response Time (avg) |
|------|-------|---------|-----------------------------------|----------------|---------------|---------------------|
| #163 | Granite 4.1 8B none | IBM Granite | 1 | 3.6 | 0/2 | 344ms |
| #162 | Nemotron 3 Nano Omni 30b A3b Reasoning none | NVIDIA | 1 | 4.8 | 0/2 | 541ms |
| #129 | MiniMax M2.5 medium | Minimax | 1 | 7.5 | 1/2 | 621ms |
| #157 | Grok 4.1 Fast none | X AI | 1 | 3.0 | 0/2 | 685ms |
| #151 | Trinity Large Preview none | Arcee AI | 1 | 3.5 | 0/2 | 822ms |
| #149 | Nemotron 3 Nano Omni 30b A3b Reasoning medium | NVIDIA | 1 | 7.3 | 1/2 | 1.37s |
| #86 | Grok 4.1 Fast medium | X AI | 1 | 6.5 | 1/2 | 4.63s |
| #62 | Step 3.5 Flash medium | Stepfun | 1 | 8.3 | 1/2 | 4.78s |
| #80 | Mimo V2 Omni medium | Xiaomi | 1 | 8.3 | 1/2 | 4.99s |
| #105 | Nemotron 3 Super medium | NVIDIA | 1 | 7.3 | 1/2 | 6.97s |
| #130 | MiniMax M2.7 medium | Minimax | 1 | 3.8 | 0/2 | 12.8s |
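
The Response Time (avg) column mixes millisecond and second values, so checking the ascending sort order requires a common unit. The following is a minimal sketch, assuming every value ends in `ms` or `s` as in the rows above. The names `parse_response_time`, `ROWS`, and `is_sorted_ascending` are illustrative only and are not part of AI Benchy.

```python
# Hypothetical helper (not an AI Benchy API): convert "344ms" or "1.37s" to seconds.
def parse_response_time(text: str) -> float:
    text = text.strip()
    if text.endswith("ms"):          # check "ms" first, since it also ends in "s"
        return float(text[:-2]) / 1000.0
    if text.endswith("s"):
        return float(text[:-1])
    raise ValueError(f"unrecognized response time: {text!r}")


# (model, failure count, average response time), copied from the table above.
ROWS = [
    ("Granite 4.1 8B none", 1, "344ms"),
    ("Nemotron 3 Nano Omni 30b A3b Reasoning none", 1, "541ms"),
    ("MiniMax M2.5 medium", 1, "621ms"),
    ("Grok 4.1 Fast none", 1, "685ms"),
    ("Trinity Large Preview none", 1, "822ms"),
    ("Nemotron 3 Nano Omni 30b A3b Reasoning medium", 1, "1.37s"),
    ("Grok 4.1 Fast medium", 1, "4.63s"),
    ("Step 3.5 Flash medium", 1, "4.78s"),
    ("Mimo V2 Omni medium", 1, "4.99s"),
    ("Nemotron 3 Super medium", 1, "6.97s"),
    ("MiniMax M2.7 medium", 1, "12.8s"),
]


def is_sorted_ascending(rows) -> bool:
    """Return True if rows are ordered by response time, fastest first."""
    times = [parse_response_time(t) for _, _, t in rows]
    return all(a <= b for a, b in zip(times, times[1:]))


# Summing the per-model counts should reproduce the "Total Failures" figure.
total_failures = sum(count for _, count, _ in ROWS)

if __name__ == "__main__":
    print("sorted ascending:", is_sorted_ascending(ROWS))
    print("total failures:", total_failures)
```

Checking the `ms` suffix before the `s` suffix matters here, because a millisecond value would otherwise be read as seconds and fail to parse.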

Charts: Top Models by Did not follow instructions Count; Did not follow instructions Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.