Extra formatting Failure Ranking

See which AI models hit the Extra formatting failure most often, so you can spot reliability risks before choosing one.

Categories

Anti-AI Tricks: 20
Coding: 18
Domain specific: 17
Puzzle Solving: 7
Data parsing and extraction: 6
Instructions following: 3
Combined: 1

41/41

Rank	Model	Company	Extra formatting Count	Score	Total Cost	Tests Correct	Response Time (avg)
#43	Claude Opus 4.6 medium	Anthropic	5	7.7	$3.059	13/22	34.3s
#62	Claude Sonnet 4.6 none	Anthropic	4	7.3	$0.661	12/22	8.12s
#108	Claude Sonnet 5 none	Anthropic	4	6.3	$0.548	8/22	6.04s
#154	KAT-Coder-Air V2.5 low	Kwaipilot	4	5.4	$0.041	7/22	10.1s
#40	Claude Sonnet 4.6 medium	Anthropic	3	7.8	$2.057	14/22	25.9s
#48	Grok Build 0.1 medium	X AI	3	7.6	$1.097	14/22	52.1s
#65	Claude Opus 4.8 none	Anthropic	3	7.3	$1.166	13/22	4.91s
#83	MiMo-V2.5-Pro medium	Xiaomi	3	6.9	$0.187	12/22	33.9s
#140	KAT-Coder-Air V2.5 high	Kwaipilot	3	5.6	$0.077	7/22	15.9s
#178	KAT-Coder-Air V2.5 none	Kwaipilot	3	4.8	$0.067	5/22	12.2s
#98	MiMo-V2.5 medium	Xiaomi	2	6.5	$0.082	12/22	32.2s
#133	North Mini Code medium	Cohere	2	5.9	$0.000	9/22	137.1s
#146	DeepSeek V4 Flash none	DeepSeek	2	5.6	$0.044	5/22	36.8s
#167	North Mini Code none	Cohere	2	5.1	$0.000	4/22	29.9s
#169	DeepSeek V3.2 none	DeepSeek	2	5.0	$0.054	6/22	18.3s
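
The table ranks models by raw Extra formatting count. Every model listed here ran 22 tests, so a per-test rate gives the same ordering, but a rate is easier to compare if test totals ever differ. The sketch below parses a few tab-separated rows copied from the table and derives the rate and the number of wrong tests. The names `Row`, `parse_row`, and `ROWS` are hypothetical, and treating the count as "tests affected out of the total" is an assumption, not something the page states.

```python
from dataclasses import dataclass

# Three rows copied from the ranking table (tab-separated columns):
# Rank, Model, Company, Extra formatting Count, Score, Total Cost,
# Tests Correct, Response Time (avg)
ROWS = """\
#43\tClaude Opus 4.6 medium\tAnthropic\t5\t7.7\t$3.059\t13/22\t34.3s
#62\tClaude Sonnet 4.6 none\tAnthropic\t4\t7.3\t$0.661\t12/22\t8.12s
#154\tKAT-Coder-Air V2.5 low\tKwaipilot\t4\t5.4\t$0.041\t7/22\t10.1s
"""


@dataclass
class Row:
    model: str
    extra_formatting: int  # "Extra formatting Count" column
    correct: int           # numerator of "Tests Correct"
    total: int             # denominator of "Tests Correct"

    @property
    def wrong(self) -> int:
        # Wrong tests are simply the tests that were not correct.
        return self.total - self.correct

    @property
    def rate(self) -> float:
        # Assumption: the count is comparable against the total number of tests.
        return self.extra_formatting / self.total


def parse_row(line: str) -> Row:
    """Split one tab-separated table row into the fields used here."""
    _rank, model, _company, count, _score, _cost, tests, _latency = line.split("\t")
    correct, total = tests.split("/")
    return Row(model, int(count), int(correct), int(total))


rows = [parse_row(line) for line in ROWS.splitlines() if line.strip()]

for row in rows:
    print(f"{row.model}: {row.extra_formatting}/{row.total} tests "
          f"({row.rate:.1%}), {row.wrong} wrong overall")
```

Reading the rate next to the wrong-test count helps separate models that fail often in general from models whose failures involve extra formatting in particular.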

Extra formatting Failures

[Charts: Top Models by Extra formatting Count; Extra formatting Count vs Score; Top Models by Response Time (avg)]