AI BENCHY

AI BENCHY Category Failures

Coding: Did not follow instructions

See which AI models are most likely to fail Coding tests with "Did not follow instructions", so you can spot weak points faster. Sorted by Response Time (avg), ascending.

Models Shown: 15
Total Failures: 16
Most Affected Model: Granite 4.1 8B (count: 1)

| Rank | Model | Company | Did not follow instructions Count | Category Score | Tests Correct | Response Time (avg) |
|------|-------|---------|-----------------------------------|----------------|---------------|---------------------|
| #153 | Granite 4.1 8B (none) | IBM Granite | 1 | 5.2 | 0/2 | 706ms |
| #115 | MiMo-V2.5-Pro (none) | Xiaomi | 1 | 5.0 | 0/2 | 1.80s |
| #149 | MiMo-V2-Flash (none) | Xiaomi | 1 | 4.9 | 0/2 | 2.04s |
| #101 | Qwen3.5 Plus 2026-04-20 (none) | Qwen | 1 | 4.4 | 0/2 | 2.08s |
| #24 | Gemini 3.5 Flash (minimal) | Google | 1 | 7.0 | 1/2 | 3.39s |
| #6 | Gemini 3.5 Flash (medium) | Google | 1 | 6.8 | 1/2 | 9.91s |
| #100 | Owl Alpha (medium) | Openrouter | 1 | 6.6 | 1/2 | 19.1s |
| #114 | DeepSeek V3.2 (none) | DeepSeek | 1 | 3.1 | 0/2 | 20.9s |
| #87 | Grok 4.1 Fast (medium) | X AI | 1 | 2.3 | 0/1 | 23.6s |
| #63 | Claude Opus 4.6 (medium) | Anthropic | 1 | 7.2 | 1/2 | 29.4s |
| #74 | Laguna M.1 (medium) | Poolside | 1 | 4.3 | 0/1 | 35.6s |
| #80 | DeepSeek V4 Pro (high) | DeepSeek | 1 | 2.8 | 0/2 | 51.8s |
| #96 | Nemotron 3 Super (medium) | NVIDIA | 1 | 3.1 | 0/2 | 62.4s |
| #105 | Cobuddy (medium) | Baidu | 1 | 4.1 | 0/2 | 79.2s |
| #110 | Kimi K2.6 (none) | Moonshot AI | 1 | 6.8 | 1/2 | 122.8s |
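
The Response Time (avg) column mixes units (706ms alongside 1.80s and 122.8s), so the values need to be normalized before they can be sorted or compared outside the page. The following is a minimal Python sketch, assuming every value ends in either "ms" or "s". The helper name `to_seconds` is hypothetical and not part of AI Benchy, and the rows are a small sample copied from the table above.

```python
def to_seconds(value: str) -> float:
    """Convert a response-time string such as '706ms' or '1.80s' to seconds."""
    text = value.strip()
    # Check "ms" first, because "706ms" also ends in "s".
    if text.endswith("ms"):
        return float(text[:-2]) / 1000.0
    if text.endswith("s"):
        return float(text[:-1])
    raise ValueError(f"unrecognized time format: {value!r}")


# A few (model, response time) pairs copied from the table above,
# deliberately listed out of order.
rows = [
    ("Kimi K2.6", "122.8s"),
    ("Granite 4.1 8B", "706ms"),
    ("MiMo-V2.5-Pro", "1.80s"),
    ("Claude Opus 4.6", "29.4s"),
]

# Sort ascending by normalized response time, matching the page's sort order.
fastest_first = sorted(rows, key=lambda row: to_seconds(row[1]))

for model, time in fastest_first:
    print(f"{model}: {to_seconds(time):.3f}s")
```

Checking the "ms" suffix before the "s" suffix matters here, since a millisecond value would otherwise be parsed as seconds and fail on the trailing "m".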

[Chart: Top Models by Did not follow instructions Count]
[Chart: Did not follow instructions Count vs Score]
[Chart: Top Models by Response Time (avg)]
[Chart: Top Models by Estimated Wasted Cost]