Puzzle Solving x Extra formatting Ranking

See which AI models are most likely to fail Puzzle Solving tests because of extra formatting, so you can spot weak points faster.

Models Shown

Total Failures

Most Affected Model

Failure Reasons

Wrong answer 201, Did not follow instructions 90, API error 12, Extra formatting 8, Timed out 5, No answer 3

Categories

Anti-AI Tricks 20, Coding 18, Domain specific 17, Puzzle Solving 8, Data parsing and extraction 6, Instructions following 3, Combined 1

8/8

Rank	Model	Company	Extra formatting Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#63	Claude Sonnet 4.6 none	Anthropic	1	7.7	$0.661	2/3	2.53s
#66	Claude Opus 4.8 none	Anthropic	1	7.7	$1.166	2/3	2.74s
#109	Mimo V2 PRO medium	Xiaomi	1	6.4	$0.333	1/3	5.08s
#111	LongCat 2.0 none	Meituan	1	4.0	$0.044	0/3	2.74s
#112	Claude Sonnet 5 none	Anthropic	1	6.0	$0.548	1/3	3.22s
#150	DeepSeek V4 Flash none	DeepSeek	1	3.1	$0.044	0/3	23.7s
#159	GPT-5.6 Luna none	OpenAI	1	5.3	$0.142	1/3	790ms
#164	Inkling none	Thinkingmachines	1	5.6	$0.147	1/3	931ms
