Coding x Wrong answer Ranking

AI BENCHY Category Failures

See which AI models most often fail Coding tests with a "Wrong answer" result, so you can spot weak points faster. Sorted by: Response Time (avg), ascending.

Models Shown

Total Failures: 230

Most Affected Model: Laguna XS 2.1 3

Failure Reasons

Wrong answer: 230, API error: 43, Timed out: 25, No answer: 18, Did not follow instructions: 16, Extra formatting: 12

Categories

Domain specific: 368, Anti-AI Tricks: 270, Coding: 230, Puzzle Solving: 173, Trivia: 150, Combined: 58, Instructions following: 56, General Intelligence: 49, Data parsing and extraction: 36, Tool Calling: 3

134/134

Rank	Model	Company	Wrong answer Count	Category Score	Total Cost	Tests Correct	Wrong Tests	Response Time (avg)
#58	GPT-5.3 Chat none	OpenAI	2	5.6	$0.433	1/3	2	10.5s
#167	Ling-2.6-1T none	Inclusionai	1	3.8	$0.005	0/3	3	10.6s
#75	MiMo-V2-Flash medium	Xiaomi	1	6.0	$0.043	1/3	2	10.7s
#114	Gemma 4 31B none	Google	2	5.5	$0.004	1/3	2	11.2s
#16	Claude Opus 4.7 medium	Anthropic	1	7.6	$0.679	2/3	1	13.0s
#71	DeepSeek V4 Pro none	DeepSeek	1	5.6	$0.034	1/3	2	13.4s
#157	Trinity Large Preview none	Arcee AI	1	3.7	$0.008	0/3	3	14.3s
#176	Laguna Xs.2 medium	Poolside	1	2.1	$0.015	0/1	1	14.4s
#146	DeepSeek V3.2 none	DeepSeek	2	3.1	$0.016	0/3	3	14.5s
#51	GPT-5.6 Luna high	OpenAI	2	5.5	$0.924	1/3	2	15.6s
#135	DeepSeek V4 Flash none	DeepSeek	3	4.2	$0.007	0/3	3	17.1s
#36	Claude Sonnet 5 medium	Anthropic	1	9.0	$0.550	2/3	1	17.3s
#125	Owl Alpha medium	Openrouter	1	5.4	$0.000	1/3	2	18.7s
#59	GPT-5.4 Nano medium	OpenAI	2	6.1	$0.107	1/3	2	19.1s
#151	North Mini Code none	Cohere	3	3.9	$0.000	0/3	3	22.0s

[Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost]
