Did not follow instructions Failure Ranking

AI BENCHY Failures

See which AI models run into Did not follow instructions most often, so you can spot reliability risks before choosing one. Sort by: Response Time (avg) ↓.

Models Shown

Total Failures

215

Most Affected Model

Kimi K2.5 2

Categories

In category Puzzle Solving78 In category General Intelligence74 In category Anti-AI Tricks30 In category Coding14 In category Instructions following11 In category Tool Calling6 In category Combined1 In category Domain specific1

Rank	Model	Company	Did not follow instructions Count	Score	Tests Correct	Response Time (avg)
#76	Kimi K2.5 medium	Moonshot AI	2	6.8	10/21	98.4s
#161	Qwen3.5-9B medium	Qwen	1	4.2	3/21	82.2s
#73	Seed-2.0-Mini medium	Bytedance Seed	1	6.9	11/21	80.2s
#62	Step 3.5 Flash medium	Stepfun	3	7.2	11/20	72.5s
#60	Kimi K2.6 medium	Moonshot AI	2	7.2	12/21	71.7s
#72	DeepSeek V3.2 medium	DeepSeek	1	7.0	11/21	68.7s
#30	Qwen3.5-27B medium	Qwen	2	7.8	13/21	68.4s
#67	MiniMax M3 medium	Minimax	2	7.1	11/21	68.2s
#12	Gemini 3.1 Flash Lite Preview high	Google	1	8.6	13/16	68.1s
#129	MiniMax M2.5 medium	Minimax	3	5.3	5/21	65.4s
#103	DeepSeek V4 Pro high	DeepSeek	1	6.0	8/21	65.2s
#49	Qwen3.5-Flash medium	Qwen	1	7.4	12/21	63.3s
#53	Gemini 3.1 Flash Lite high	Google	3	7.3	10/18	62.0s
#75	Ring-2.6-1T medium	Inclusionai	2	6.9	11/21	61.3s
#78	Qwen3.6 27B medium	Qwen	1	6.8	10/21	59.7s

Did not follow instructions Failures

Top Models by Did not follow instructions Count

Did not follow instructions Count vs Score

Top Models by Response Time (avg)