AI BENCHY

Did not follow instructions Failures

See which AI models most often fail with "Did not follow instructions", so you can spot reliability risks before choosing one. Rows are sorted by average response time, slowest first.

Models Shown: 15
Total Failures: 215
Most Affected Model: Kimi K2.5 2
| Rank | Model | Company | Did not follow instructions Count | Score | Tests Correct | Response Time (avg) |
|------|-------|---------|-----------------------------------|-------|---------------|---------------------|
| #114 | Qwen3.5 Plus 2026-04-20 none | Qwen | 2 | 5.7 | 7/21 | 4.39s |
| #85 | Gemma 4 31B none | Google | 1 | 6.5 | 10/21 | 4.05s |
| #40 | Gemini 3.1 Flash Lite Preview medium | Google | 1 | 7.5 | 13/21 | 3.96s |
| #153 | Qwen3.6 35B A3B none | Qwen | 2 | 4.6 | 4/21 | 3.73s |
| #118 | Qwen3.6 27B none | Qwen | 2 | 5.6 | 7/21 | 3.72s |
| #68 | Claude Opus 4.8 none | Anthropic | 1 | 7.0 | 12/21 | 3.47s |
| #131 | Qwen3.5-122B-A10B none | Qwen | 2 | 5.3 | 6/21 | 3.41s |
| #117 | Qwen3.5-35B-A3B none | Qwen | 2 | 5.6 | 7/21 | 3.37s |
| #44 | Gemini 3.1 Flash Lite medium | Google | 1 | 7.5 | 13/21 | 3.23s |
| #109 | GLM 5V Turbo none | Z.ai | 2 | 5.8 | 8/21 | 2.99s |
| #151 | Trinity Large Preview none | Arcee AI | 3 | 4.6 | 4/21 | 2.98s |
| #122 | GLM 4.7 Flash none | Z.ai | 1 | 5.5 | 6/21 | 2.86s |
| #88 | Qwen3.7 Plus none | Qwen | 1 | 6.4 | 10/21 | 2.85s |
| #134 | GLM 5 Turbo none | Z.ai | 2 | 5.2 | 6/21 | 2.82s |
| #50 | Gemini 3.1 Flash Lite Preview low | Google | 1 | 7.4 | 13/21 | 2.77s |
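
As a rough sketch, the rows listed above can be re-ranked by failure count instead of response time. The tuple layout and the names `rows`, `by_failures`, `shown_failures` and `worst` are illustrative assumptions, not part of AI Benchy; the values are copied from the table.

```python
# Each tuple: (model, failure_count, score, tests_correct, tests_total, avg_seconds).
# This layout is a hypothetical convenience; the values come from the table above.
rows = [
    ("Qwen3.5 Plus 2026-04-20 none", 2, 5.7, 7, 21, 4.39),
    ("Gemma 4 31B none", 1, 6.5, 10, 21, 4.05),
    ("Gemini 3.1 Flash Lite Preview medium", 1, 7.5, 13, 21, 3.96),
    ("Qwen3.6 35B A3B none", 2, 4.6, 4, 21, 3.73),
    ("Qwen3.6 27B none", 2, 5.6, 7, 21, 3.72),
    ("Claude Opus 4.8 none", 1, 7.0, 12, 21, 3.47),
    ("Qwen3.5-122B-A10B none", 2, 5.3, 6, 21, 3.41),
    ("Qwen3.5-35B-A3B none", 2, 5.6, 7, 21, 3.37),
    ("Gemini 3.1 Flash Lite medium", 1, 7.5, 13, 21, 3.23),
    ("GLM 5V Turbo none", 2, 5.8, 8, 21, 2.99),
    ("Trinity Large Preview none", 3, 4.6, 4, 21, 2.98),
    ("GLM 4.7 Flash none", 1, 5.5, 6, 21, 2.86),
    ("Qwen3.7 Plus none", 1, 6.4, 10, 21, 2.85),
    ("GLM 5 Turbo none", 2, 5.2, 6, 21, 2.82),
    ("Gemini 3.1 Flash Lite Preview low", 1, 7.4, 13, 21, 2.77),
]

# Re-rank by failure count (highest first), breaking ties by lower score.
by_failures = sorted(rows, key=lambda r: (-r[1], r[2]))

# Failures summed over the listed rows only.
shown_failures = sum(r[1] for r in rows)

# The listed model with the most failures of this type.
worst = by_failures[0]

print(f"Failures across listed rows: {shown_failures}")
print(f"Most failures among listed rows: {worst[0]} ({worst[1]})")
```

These figures cover only the 15 rows listed, so they need not match the page-level summary numbers.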

[Charts: "Top Models by Did not follow instructions Count"; "Did not follow instructions Count vs Score"; "Top Models by Response Time (avg)"]