Puzzle Solving Model Ranking

See which AI models perform best on Puzzle Solving, which ones stay reliable, and where the biggest gaps appear.

Models Shown

Average Puzzle Solving Score

6.7

Best Model

Failure Reasons

Wrong answer: 193
Did not follow instructions: 88
API error: 12
Extra formatting: 7
Timed out: 5
No answer: 3

206/206

Rank	Model	Company	Puzzle Solving Score	Score	Total Cost	Tests Correct	Response Time (avg)
#1	Gemini 3 Flash Preview medium	Google	10.0	9.6	$0.742	3/3	4.05s
#2	Gemini 3.5 Flash high	Google	10.0	9.5	$1.976	3/3	3.23s
#5	GPT-5.6 Sol high	OpenAI	10.0	9.4	$1.234	3/3	4.10s
#6	GPT-5.5 low	OpenAI	10.0	9.3	$1.253	3/3	4.74s
#7	Gemini 3.1 Pro Preview medium	Google	10.0	9.2	$1.361	3/3	6.90s
#8	Qwen3.7 Max medium	Qwen	10.0	9.2	$1.116	3/3	8.84s
#10	GPT-5.5 medium	OpenAI	10.0	9.0	$4.137	3/3	6.76s
#11	Gemini 3.5 Flash low	Google	10.0	8.9	$0.433	3/3	2.35s
#12	Grok 4.5 high	X AI	10.0	8.9	$1.707	3/3	7.88s
#14	Claude Opus 4.8 medium	Anthropic	10.0	8.8	$1.931	3/3	3.95s
#15	Claude Opus 4.7 medium	Anthropic	10.0	8.7	$1.477	3/3	2.43s
#19	Qwen3.6 Max Preview medium	Qwen	10.0	8.4	$1.143	3/3	24.3s
#20	Grok 4.5 low	X AI	10.0	8.4	$0.935	3/3	3.20s
#22	Grok 4.5 medium	X AI	10.0	8.3	$1.928	3/3	7.75s
#32	Inkling medium	Thinkingmachines	10.0	8.0	$0.391	3/3	5.18s

Puzzle Solving Ranking

Top Models by Puzzle Solving Score

Puzzle Solving Score vs Total Cost

Top Models by Response Time (avg)