Did not follow instructions Failure Ranking

See which AI models most often fail with "Did not follow instructions", so you can spot reliability risks before choosing one. The table is sorted by Tests Correct, descending.

Models Shown

Total Failures: 245

Most Affected Model: Gemini 3.5 Flash 1

Categories

Puzzle Solving: 90
General Intelligence: 78
Anti-AI Tricks: 33
Instructions following: 18
Coding: 16
Tool Calling: 8
Combined: 1
Domain specific: 1
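As a quick consistency check, the per-category counts listed above can be summed and compared with the Total Failures figure. The sketch below is illustrative only: the dictionary and variable names are made up for this example, and the percentage shares are derived here rather than taken from the page.

```python
# Per-category "Did not follow instructions" counts, copied from the list above.
category_failures = {
    "Puzzle Solving": 90,
    "General Intelligence": 78,
    "Anti-AI Tricks": 33,
    "Instructions following": 18,
    "Coding": 16,
    "Tool Calling": 8,
    "Combined": 1,
    "Domain specific": 1,
}

# The sum should agree with the Total Failures figure reported on the page.
total_failures = sum(category_failures.values())

# Share of all failures per category, in percent (derived, not from the page).
shares = {
    name: round(100 * count / total_failures, 1)
    for name, count in category_failures.items()
}

print(total_failures)
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

The sum comes to 245, which agrees with the Total Failures figure.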

140/140

Rank	Model	Company	Did not follow instructions Count	Score	Total Cost	Tests Correct	Response Time (avg)
#29	Step 3.7 Flash medium	Stepfun	1	8.0	$0.515	14/22	26.4s
#30	GPT-5.2 Chat none	OpenAI	1	8.0	$0.604	14/22	7.65s
#31	GLM 5.2 high	Z.ai	1	8.0	$0.970	14/22	62.7s
#35	Seed-2.0-Lite medium	Bytedance Seed	2	7.9	$0.234	14/22	48.5s
#88	Gemini 3.5 Flash minimal	Google	1	6.8	$0.300	14/22	2.65s
#24	Muse Spark 1.1 low	Meta	2	8.3	$0.647	13/22	11.5s
#43	Claude Opus 4.6 medium	Anthropic	1	7.7	$3.059	13/22	34.3s
#45	DeepSeek V4 Flash high	DeepSeek	2	7.7	$0.042	13/22	49.7s
#54	GPT-5.3 Chat none	OpenAI	2	7.5	$0.571	13/22	6.88s
#58	Qwen3.5-27B medium	Qwen	2	7.4	$1.627	13/22	111.9s
#64	Gemini 3.1 Flash Lite Preview medium	Google	1	7.3	$0.115	13/22	4.61s
#65	Gemini 3.1 Flash Lite medium	Google	1	7.3	$0.117	13/22	4.27s
#66	Claude Opus 4.8 none	Anthropic	1	7.3	$1.166	13/22	4.91s
#73	Grok 4.3 medium	X AI	2	7.1	$0.779	13/22	47.4s
#90	Qwen3.6 35B A3B medium	Qwen	1	6.7	$0.746	13/22	58.1s
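The row order above appears to follow Tests Correct descending, with ties broken by overall rank ascending. The tie-break is an inference from the visible rows, not something the page states. A minimal sketch of that ordering, using a few rows from the table and hypothetical names (`Row`, `sort_key`), might look like this:

```python
from typing import NamedTuple


class Row(NamedTuple):
    """One leaderboard row; field names are illustrative, not from the page."""
    rank: int
    model: str
    correct: int
    total: int


# A few rows copied from the table, deliberately shuffled.
rows = [
    Row(24, "Muse Spark 1.1 low", 13, 22),
    Row(88, "Gemini 3.5 Flash minimal", 14, 22),
    Row(29, "Step 3.7 Flash medium", 14, 22),
    Row(90, "Qwen3.6 35B A3B medium", 13, 22),
]


def sort_key(row: Row) -> tuple[int, int]:
    # Tests Correct descending, then rank ascending (assumed tie-break).
    return (-row.correct, row.rank)


ordered = sorted(rows, key=sort_key)

# Wrong tests are simply total tests minus correct tests.
wrong_tests = {row.model: row.total - row.correct for row in ordered}

for row in ordered:
    print(f"#{row.rank}\t{row.model}\t{row.correct}/{row.total}\twrong={wrong_tests[row.model]}")
```

With this key, the shuffled rows come back in the same relative order as in the table: #29, #88, #24, #90.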

Did not follow instructions Failures

Charts: Top Models by Did not follow instructions Count; Did not follow instructions Count vs Score; Top Models by Response Time (avg)