Puzzle Solving x Extra formatting Ranking

See which AI models most often fail Puzzle Solving tests because of extra formatting, so you can spot weak points faster.

Models Shown

Total Failures

Most Affected Model

Failure Reasons

Wrong answer: 201, Did not follow instructions: 90, API error: 12, Extra formatting: 8, Timed out: 5, No answer: 3

Categories

Anti-AI Tricks: 20, Coding: 18, Domain specific: 17, Puzzle Solving: 8, Data parsing and extraction: 6, Instructions following: 3, Combined: 1

8/8

Rank	Model	Company	Extra formatting Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#63	Claude Sonnet 4.6 none	Anthropic	1	7.7	$0.661	2/3	2.53s
#66	Claude Opus 4.8 none	Anthropic	1	7.7	$1.166	2/3	2.74s
#109	Mimo V2 PRO medium	Xiaomi	1	6.4	$0.333	1/3	5.08s
#111	LongCat 2.0 none	Meituan	1	4.0	$0.044	0/3	2.74s
#112	Claude Sonnet 5 none	Anthropic	1	6.0	$0.548	1/3	3.22s
#150	DeepSeek V4 Flash none	DeepSeek	1	3.1	$0.044	0/3	23.7s
#159	GPT-5.6 Luna none	OpenAI	1	5.3	$0.142	1/3	790ms
#164	Inkling none	Thinkingmachines	1	5.6	$0.147	1/3	931ms
