AI BENCHY

"Did not follow instructions" Failures

See which AI models hit the "Did not follow instructions" failure most often, so you can spot reliability risks before choosing one. The table is sorted by average response time, slowest first.

Models Shown: 15

Total Failures: 215

Most Affected Model: Kimi K2.5 (2)

| Rank | Model | Company | "Did not follow instructions" Count | Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #76 | Kimi K2.5 medium | Moonshot AI | 2 | 6.8 | 10/21 | 98.4s |
| #161 | Qwen3.5-9B medium | Qwen | 1 | 4.2 | 3/21 | 82.2s |
| #73 | Seed-2.0-Mini medium | Bytedance Seed | 1 | 6.9 | 11/21 | 80.2s |
| #62 | Step 3.5 Flash medium | Stepfun | 3 | 7.2 | 11/20 | 72.5s |
| #60 | Kimi K2.6 medium | Moonshot AI | 2 | 7.2 | 12/21 | 71.7s |
| #72 | DeepSeek V3.2 medium | DeepSeek | 1 | 7.0 | 11/21 | 68.7s |
| #30 | Qwen3.5-27B medium | Qwen | 2 | 7.8 | 13/21 | 68.4s |
| #67 | MiniMax M3 medium | Minimax | 2 | 7.1 | 11/21 | 68.2s |
| #12 | Gemini 3.1 Flash Lite Preview high | Google | 1 | 8.6 | 13/16 | 68.1s |
| #129 | MiniMax M2.5 medium | Minimax | 3 | 5.3 | 5/21 | 65.4s |
| #103 | DeepSeek V4 Pro high | DeepSeek | 1 | 6.0 | 8/21 | 65.2s |
| #49 | Qwen3.5-Flash medium | Qwen | 1 | 7.4 | 12/21 | 63.3s |
| #53 | Gemini 3.1 Flash Lite high | Google | 3 | 7.3 | 10/18 | 62.0s |
| #75 | Ring-2.6-1T medium | Inclusionai | 2 | 6.9 | 11/21 | 61.3s |
| #78 | Qwen3.6 27B medium | Qwen | 1 | 6.8 | 10/21 | 59.7s |
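
Because the models in the table were not all run on the same number of tests (16, 18, 20 or 21), raw failure counts are not directly comparable. The sketch below shows one possible way to normalize them. It assumes that the denominator of "Tests Correct" is the number of tests each model ran and that every "Did not follow instructions" failure comes from one of those tests. The page does not state either assumption, and the names `ROWS`, `failure_rate` and `rank_by_failure_rate` are illustrative only.

```python
# Rows copied from the table: (model, failure count, tests correct, tests run).
# Assumption: the "Tests Correct" denominator is the number of tests run.
ROWS = [
    ("Kimi K2.5 medium", 2, 10, 21),
    ("Qwen3.5-9B medium", 1, 3, 21),
    ("Seed-2.0-Mini medium", 1, 11, 21),
    ("Step 3.5 Flash medium", 3, 11, 20),
    ("Kimi K2.6 medium", 2, 12, 21),
    ("DeepSeek V3.2 medium", 1, 11, 21),
    ("Qwen3.5-27B medium", 2, 13, 21),
    ("MiniMax M3 medium", 2, 11, 21),
    ("Gemini 3.1 Flash Lite Preview high", 1, 13, 16),
    ("MiniMax M2.5 medium", 3, 5, 21),
    ("DeepSeek V4 Pro high", 1, 8, 21),
    ("Qwen3.5-Flash medium", 1, 12, 21),
    ("Gemini 3.1 Flash Lite high", 3, 10, 18),
    ("Ring-2.6-1T medium", 2, 11, 21),
    ("Qwen3.6 27B medium", 1, 10, 21),
]


def failure_rate(count, tests_run):
    """Share of a model's tests that ended in this failure type."""
    return count / tests_run


def rank_by_failure_rate(rows):
    """Sort rows by failure rate, highest first; ties broken by model name."""
    return sorted(rows, key=lambda r: (-failure_rate(r[1], r[3]), r[0]))


ranked = rank_by_failure_rate(ROWS)

if __name__ == "__main__":
    for model, count, _correct, tests_run in ranked:
        rate = failure_rate(count, tests_run)
        print(f"{model:36s} {count}/{tests_run} = {rate:.1%}")
```

Under these assumptions, Gemini 3.1 Flash Lite high ranks first (3 failures in 18 tests), ahead of Step 3.5 Flash (3 in 20) and MiniMax M2.5 (3 in 21), even though all three share the same raw count.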

[Charts: Top Models by "Did not follow instructions" Count; "Did not follow instructions" Count vs Score; Top Models by Response Time (avg)]