Instructions following Model Ranking

See which AI models perform best on Instructions following, which ones stay reliable, and where the biggest gaps appear. Sort by: Tests Correct ↑.

Models Shown

Average Instructions following Score

8.6

Best Model

Laguna XS 2.1 3.8

Failure Reasons

Wrong answer: 61
Did not follow instructions: 19
Extra formatting: 3
No answer: 2
API error: 1
Timed out: 1

216/216

Rank	Model	Company	Instructions following Score	Score	Total Cost	Tests Correct	Response Time (avg)
#158	Qwen3.6 27B none	Qwen	6.2	5.5	$0.087	1/2	1.92s
#160	MiMo-V2.5-Pro none	Xiaomi	6.4	5.5	$0.068	1/2	1.03s
#161	Kimi K2.5 none	Moonshot AI	6.5	5.5	$0.127	1/2	2.67s
#162	Gemma 4 26B A4B none	Google	6.3	5.5	$0.015	1/2	690ms
#163	Mimo V2 Omni none	Xiaomi	6.5	5.5	$0.021	1/2	4.26s
#165	GPT-5.6 Luna none	OpenAI	7.1	5.4	$0.142	1/2	1.23s
#167	Qwen3.6 35B A3B none	Qwen	6.2	5.3	$0.061	1/2	1.86s
#168	Ling-2.6-1T none	Inclusionai	6.4	5.3	$0.016	1/2	5.36s
#170	Inkling none	Thinkingmachines	6.3	5.2	$0.147	1/2	1.72s
#171	Mistral Small 4 none	Mistral	6.5	5.1	$0.022	1/2	380ms
#172	Qwen3 Coder Next none	Qwen	6.3	5.1	$0.025	1/2	7.78s
#173	Mistral Small 4 medium	Mistral	7.3	5.1	$0.096	1/2	1.38s
#174	MiMo-V2.5 none	Xiaomi	6.5	5.1	$0.025	1/2	751ms
#175	Qwen3.5-9B none	Qwen	6.5	5.1	$0.021	1/2	514ms
#176	GLM 5 Turbo none	Z.ai	6.5	5.1	$0.047	1/2	2.13s

Instructions following Ranking


[Charts: Top Models by Instructions following Score; Instructions following Score vs Total Cost; Top Models by Response Time (avg)]