Coding x Wrong answer Ranking

AI BENCHY Category Failures

See which AI models most often give a wrong answer on Coding tests, so you can spot weak points faster. Sorted by Response Time (avg), ascending.

Models Shown

Total Failures

230

Most Affected Model

Laguna XS 2.1 (3)

Failure Reasons

Wrong answer 230, API error 43, Timed out 25, No answer 18, Did not follow instructions 16, Extra formatting 12

Categories

Domain specific 368, Anti-AI Tricks 270, Coding 230, Puzzle Solving 173, Trivia 150, Combined 58, Instructions following 56, General Intelligence 49, Data parsing and extraction 36, Tool Calling 3

134/134

Rank	Model	Company	Wrong answer Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#145	Qwen3.5-122B-A10B none	Qwen	3	3.7	$0.020	0/3	2.77s
#107	Seed-2.0-Lite none	Bytedance Seed	2	5.6	$0.019	1/3	2.83s
#170	Laguna M.1 none	Poolside	1	2.5	$0.009	0/1	2.93s
#116	Qwen3.6 Max Preview none	Qwen	3	3.8	$0.075	0/3	3.12s
#122	GLM 5V Turbo none	Z.ai	2	5.5	$0.052	1/3	3.13s
#154	MiMo-V2.5 none	Xiaomi	2	5.5	$0.006	1/3	3.24s
#70	Claude Opus 4.8 none	Anthropic	1	5.5	$0.539	1/3	3.29s
#132	Claude Sonnet 5 none	Anthropic	3	4.6	$0.287	0/3	3.67s
#43	Gemini 3.1 Flash Lite medium	Google	2	5.5	$0.071	1/3	3.81s
#41	Gemini 3.1 Flash Lite Preview medium	Google	2	5.5	$0.068	1/3	4.09s
#139	Gemma 4 26B A4B none	Google	2	3.7	$0.004	0/3	4.16s
#138	Qwen3.6 27B none	Qwen	2	5.5	$0.025	1/3	4.16s
#178	Hy3 preview none	Tencent	1	2.7	$0.003	0/3	4.56s
#108	GPT-5.6 Luna low	OpenAI	2	5.5	$0.141	1/3	4.61s
#134	GLM 5.1 none	Z.ai	3	3.9	$0.057	0/3	4.96s


Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost
