Instructions following Model Ranking

AI BENCHY Category

See which AI models perform best on Instructions following, which ones stay reliable, and where the biggest gaps appear. Sort by: Tests Correct ↑.

Models Shown

Average Instructions following Score

8.5

Best Model

Failure Reasons

With failure reason Wrong answer53 With failure reason Did not follow instructions11 With failure reason Extra formatting2 With failure reason No answer2 With failure reason API error1

Rank	Model	Company	Instructions following Score	Score	Tests Correct	Response Time (avg)
#27	Gemma 4 31B medium	Google	10.0	7.8	2/2	12.8s
#28	Gemini 2.5 Flash medium	Google	9.8	7.8	2/2	2.62s
#29	Qwen3.5-122B-A10B medium	Qwen	10.0	7.8	2/2	9.88s
#30	Qwen3.5-27B medium	Qwen	10.0	7.8	2/2	19.7s
#31	DeepSeek V4 Flash high	DeepSeek	10.0	7.7	2/2	15.4s
#33	Hy3 preview medium	Tencent	10.0	7.7	2/2	6.16s
#34	Qwen3.7 Max none	Qwen	10.0	7.7	2/2	943ms
#35	Gemini 3 PRO Preview medium	Google	9.8	7.6	2/2	3.26s
#36	Qwen3.5 Plus 2026-04-20 medium	Qwen	10.0	7.6	2/2	20.2s
#37	Gemma 4 26B A4B medium	Google	10.0	7.6	2/2	17.5s
#38	Grok 4.3 medium	X AI	9.8	7.6	2/2	18.6s
#39	Qwen3.6 Flash medium	Qwen	10.0	7.5	2/2	6.05s
#40	Gemini 3.1 Flash Lite Preview medium	Google	10.0	7.5	2/2	1.91s
#41	Nemotron 3 Ultra 550b A55b medium	NVIDIA	9.8	7.5	2/2	6.35s
#42	GPT-5.2 medium	OpenAI	9.9	7.5	2/2	3.12s

Instructions following Ranking