AI BENCHY
Instruction Following Ranking
See which AI models perform best at instruction following, which ones stay reliable, and where the biggest gaps appear. The table below is sorted by average response time, descending.
| Rank | Model | Company | Instruction Following Score | Overall Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #12 | Gemini 3 PRO Preview medium | Google | 9.8 | 8.4 | 2/2 | 3.26s |
| #40 | GPT-5.2 medium | OpenAI | 9.9 | 7.5 | 2/2 | 3.12s |
| #16 | GPT-5.4 medium | OpenAI | 10.0 | 8.2 | 2/2 | 3.11s |
| #7 | GPT-5.3-Codex medium | OpenAI | 10.0 | 8.6 | 2/2 | 3.04s |
| #93 | GLM 4.7 Flash medium | Z.ai | 6.2 | 4.6 | 1/2 | 2.97s |
| #48 | Gemma 4 31B none | Google | 6.5 | 6.9 | 1/2 | 2.84s |
| #72 | Hunter Alpha none | OpenRouter | 6.4 | 5.7 | 1/2 | 2.82s |
| #76 | Kimi K2.5 none | Moonshot AI | 6.5 | 5.5 | 1/2 | 2.67s |
| #15 | Gemini 2.5 Flash medium | Google | 9.8 | 8.2 | 2/2 | 2.62s |
| #26 | Claude Sonnet 4.6 medium | Anthropic | 10.0 | 8.0 | 2/2 | 2.61s |
| #65 | MiMo-V2-Pro none | Xiaomi | 6.5 | 6.0 | 1/2 | 2.51s |
| #44 | GPT-5.4 Mini medium | OpenAI | 7.4 | 7.3 | 1/2 | 2.50s |
| #37 | Claude Opus 4.6 medium | Anthropic | 10.0 | 7.6 | 2/2 | 2.43s |
| #77 | GLM 5 Turbo none | Z.ai | 6.5 | 5.5 | 1/2 | 2.13s |
| #58 | GLM 5V Turbo none | Z.ai | 6.5 | 6.2 | 1/2 | 1.97s |
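
The view above is one fixed sort; the same rows can be re-sorted or mined for the score gaps the intro mentions with a few lines of Python. Below is a minimal sketch, assuming the rows are transcribed by hand from the table; the field names (`rank`, `if_score`, `overall`, `avg_s`) are illustrative, not an official AI BENCHY schema, and only a handful of rows are included.

```python
# Sketch: re-sorting leaderboard rows and surfacing score gaps.
# Field names are made up for illustration; values are copied from the table above.
rows = [
    {"rank": 12, "model": "Gemini 3 PRO Preview medium", "if_score": 9.8,  "overall": 8.4, "avg_s": 3.26},
    {"rank": 7,  "model": "GPT-5.3-Codex medium",        "if_score": 10.0, "overall": 8.6, "avg_s": 3.04},
    {"rank": 93, "model": "GLM 4.7 Flash medium",        "if_score": 6.2,  "overall": 4.6, "avg_s": 2.97},
    {"rank": 37, "model": "Claude Opus 4.6 medium",      "if_score": 10.0, "overall": 7.6, "avg_s": 2.43},
    {"rank": 58, "model": "GLM 5V Turbo none",           "if_score": 6.5,  "overall": 6.2, "avg_s": 1.97},
]

# The sort shown above: average response time, descending.
by_latency = sorted(rows, key=lambda r: r["avg_s"], reverse=True)

# "Biggest gaps": models whose instruction-following score diverges most
# from their overall score, largest absolute gap first.
for r in sorted(rows, key=lambda r: abs(r["if_score"] - r["overall"]), reverse=True):
    print(f'{r["model"]:32} gap = {r["if_score"] - r["overall"]:+.1f}')
```

Using `sorted` rather than an in-place `sort` keeps the original table order intact, so several views (latency, rank, gap) can be derived from the same rows.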