Instructions following Model Ranking

AI BENCHY Category

See which AI models perform best on Instructions following, which ones stay reliable, and where the biggest gaps appear. Sort by: Response Time (avg) ↓.

Models Shown

Average Instructions following Score

8.5

Best Model

Kimi K2.5 10.0

Failure Reasons

With failure reason Wrong answer53 With failure reason Did not follow instructions11 With failure reason Extra formatting2 With failure reason No answer2 With failure reason API error1

Rank	Model	Company	Instructions following Score	Score	Tests Correct	Response Time (avg)
#62	Step 3.5 Flash medium	Stepfun	8.3	7.2	1/2	4.78s
#86	Grok 4.1 Fast medium	X AI	6.5	6.5	1/2	4.63s
#92	Laguna M.1 medium	Poolside	10.0	6.4	2/2	4.30s
#64	MiMo-V2-Flash medium	Xiaomi	10.0	7.2	2/2	4.28s
#65	Grok 4.20 medium	X AI	9.8	7.1	2/2	4.26s
#101	Mimo V2 Omni none	Xiaomi	6.5	6.0	1/2	4.26s
#79	Hunter Alpha medium	OpenRouter	9.9	6.7	2/2	4.18s
#1	Gemini 3 Flash Preview medium	Google	10.0	9.8	2/2	4.04s
#6	GPT-5.5 low	OpenAI	9.9	9.0	2/2	3.74s
#59	GLM 5V Turbo medium	Z.ai	9.9	7.2	2/2	3.74s
#84	Grok 4.20 Multi Agent Beta medium	X AI	9.8	6.6	2/2	3.52s
#63	GPT-5.3 Chat none	OpenAI	9.8	7.2	2/2	3.51s
#93	Qwen3.6 Plus Preview medium	Qwen	6.5	6.3	1/2	3.40s
#20	Gemini 3.5 Flash none	Google	9.8	8.1	2/2	3.38s
#9	GPT-5.5 medium	OpenAI	10.0	8.8	2/2	3.36s

Instructions following Ranking

Top Models by Instructions following Score

Instructions following Score vs Total Cost

Top Models by Response Time (avg)