Instructions following Model Ranking

AI BENCHY Category

See which AI models perform best on Instructions following, which ones stay reliable, and where the biggest gaps appear. Sort by: Metric ↑.

Models Shown

Average Instructions following Score

8.5

Best Model

Failure Reasons

With failure reason Wrong answer53 With failure reason Did not follow instructions11 With failure reason Extra formatting2 With failure reason No answer2 With failure reason API error1

Rank	Model	Company	Instructions following Score	Score	Tests Correct	Response Time (avg)
#71	Step 3.7 Flash high	Stepfun	9.8	7.0	2/2	1.52s
#74	Qwen3.6 Max Preview none	Qwen	9.8	6.9	2/2	1.40s
#75	Ring-2.6-1T medium	Inclusionai	9.8	6.9	2/2	11.8s
#84	Grok 4.20 Multi Agent Beta medium	X AI	9.8	6.6	2/2	3.52s
#94	GPT-5 Nano medium	OpenAI	9.8	6.3	2/2	15.6s
#96	Ring-2.6-1T none	Inclusionai	9.8	6.2	2/2	27.4s
#100	Grok Build 0.1 none	X AI	9.8	6.0	2/2	7.36s
#112	GLM 5.1 none	Z.ai	9.8	5.7	2/2	1.98s
#126	gpt-oss-120b none	OpenAI	9.8	5.4	2/2	5.06s
#20	Gemini 3.5 Flash none	Google	9.8	8.1	2/2	3.38s
#38	Grok 4.3 medium	X AI	9.8	7.6	2/2	18.6s
#47	Grok Build 0.1 medium	X AI	9.8	7.4	2/2	12.4s
#67	MiniMax M3 medium	Minimax	9.8	7.1	2/2	6.14s
#70	GPT-5.4 Nano medium	OpenAI	9.8	7.0	2/2	1.88s
#119	Cobuddy medium	Baidu	9.8	5.6	2/2	11.6s

Instructions following Ranking