Instructions following x Wrong answer Ranking

See which AI models most often fail Instructions following tests with a Wrong answer, so you can spot weak points faster. Sorted by Tests Correct, descending.

Models Shown

Total Failures

Most Affected Model: LongCat 2.0 (1)

Failure Reasons

Wrong answer 61, Did not follow instructions 18, Extra formatting 3, No answer 2, API error 1, Timed out 1

Categories

Domain specific 412, Anti-AI Tricks 293, Coding 252, Puzzle Solving 201, Trivia 168, Combined 68, Instructions following 61, General Intelligence 59, Data parsing and extraction 41, Tool Calling 3

61/61

Rank	Model	Company	Wrong answer Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#187	Qwen3 Coder Next medium	Qwen	1	6.3	$0.032	1/2	7.49s
#189	Mercury 2 none	Inception	1	6.5	$0.030	1/2	551ms
#191	Grok 4.20 Beta none	X AI	1	6.3	$0.087	1/2	649ms
#192	Laguna M.1 none	Poolside	1	6.3	$0.009	1/2	683ms
#194	GLM 4.7 Flash medium	Z.ai	1	6.2	$0.166	1/2	2.97s
#196	Hunter Alpha none	OpenRouter	1	6.4	$0.000	1/2	2.82s
#197	Grok 4.20 none	X AI	1	6.3	$0.057	1/2	445ms
#200	MiMo-V2-Flash none	Xiaomi	1	6.5	$0.025	1/2	857ms
#205	Laguna Xs.2 none	Poolside	1	6.5	$0.004	1/2	439ms
#210	LFM2-24B-A2B none	Liquid	1	6.3	$0.001	1/2	752ms
#160	Laguna XS 2.1 none	Poolside	1	3.8	$0.008	0/2	364ms
#172	MiniMax M2.7 medium	Minimax	1	3.8	$0.163	0/2	12.8s
#183	Trinity Large Preview none	Arcee AI	1	3.5	$0.008	0/2	822ms
#201	Granite 4.1 8B none	IBM Granite	1	3.6	$0.007	0/2	344ms
#203	Grok 4.1 Fast none	X AI	1	3.0	$0.008	0/2	685ms


Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost
