Instructions following Model Ranking

AI BENCHY Category

See which AI models perform best on Instructions following, which ones stay reliable, and where the biggest gaps appear. Sort by: Metric ↑.

Models Shown

Average Instructions following Score

8.5

Best Model

Failure Reasons

With failure reason Wrong answer53 With failure reason Did not follow instructions11 With failure reason Extra formatting2 With failure reason No answer2 With failure reason API error1

Rank	Model	Company	Instructions following Score	Score	Tests Correct	Response Time (avg)
#121	Owl Alpha none	Openrouter	6.4	5.5	1/2	2.63s
#123	MiMo-V2.5-Pro none	Xiaomi	6.4	5.5	1/2	1.03s
#159	Ling-2.6-1T none	Inclusionai	6.4	4.3	1/2	5.36s
#32	Gemini 3.5 Flash minimal	Google	6.4	7.7	1/2	893ms
#48	Gemini 3 Flash Preview none	Google	6.4	7.4	1/2	1.58s
#55	GLM 5.1 medium	Z.ai	6.4	7.3	1/2	7.47s
#77	Claude Sonnet 4.6 none	Anthropic	6.5	6.8	1/2	1.96s
#85	Gemma 4 31B none	Google	6.5	6.5	1/2	2.84s
#86	Grok 4.1 Fast medium	X AI	6.5	6.5	1/2	4.63s
#93	Qwen3.6 Plus Preview medium	Qwen	6.5	6.3	1/2	3.40s
#101	Mimo V2 Omni none	Xiaomi	6.5	6.0	1/2	4.26s
#109	GLM 5V Turbo none	Z.ai	6.5	5.8	1/2	1.97s
#111	Owl Alpha medium	Openrouter	6.5	5.7	1/2	10.2s
#120	Mimo V2 PRO none	Xiaomi	6.5	5.6	1/2	2.51s
#122	GLM 4.7 Flash none	Z.ai	6.5	5.5	1/2	888ms

Instructions following Ranking