Coding x Wrong answer Ranking

AI BENCHY Category Failures

See which AI models are most likely to hit Wrong answer on Coding, so you can spot weak points faster. Sort by: Response Time (avg) ↑.

Models Shown

Total Failures: 230

Most Affected Model: Laguna XS 2.1 (3)

Failure Reasons

Wrong answer: 230; API error: 43; Timed out: 25; No answer: 18; Did not follow instructions: 16; Extra formatting: 12

Categories

Domain specific: 368; Anti-AI Tricks: 270; Coding: 230; Puzzle Solving: 173; Trivia: 150; Combined: 58; Instructions following: 56; General Intelligence: 49; Data parsing and extraction: 36; Tool Calling: 3

134/134

Rank	Model	Company	Wrong answer Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#118	GLM 5 none	Z.ai	3	4.0	$0.027	0/3	5.12s
#68	Claude Sonnet 4.6 none	Anthropic	1	5.5	$0.316	1/3	5.19s
#155	Qwen3.5-9B none	Qwen	3	3.9	$0.006	0/3	5.60s
#65	Gemini 3 Flash Preview low	Google	2	5.8	$0.111	1/3	6.00s
#8	Gemini 3.5 Flash low	Google	1	7.8	$0.349	2/3	6.71s
#38	GPT-5.6 Terra medium	OpenAI	2	6.1	$0.496	1/3	7.19s
#74	GLM 5.2 none	Z.ai	2	3.7	$0.042	0/3	7.55s
#46	Claude Opus 4.8 low	Anthropic	1	6.6	$1.270	1/3	7.58s
#148	Qwen3.6 35B A3B none	Qwen	2	5.5	$0.031	1/3	8.77s
#29	GPT-5.6 Terra high	OpenAI	1	7.6	$0.852	2/3	9.14s
#50	Step 3.7 Flash low	Stepfun	1	8.2	$0.341	2/3	9.46s
#48	GPT-5.6 Terra low	OpenAI	2	6.6	$0.343	1/3	9.56s
#184	gpt-oss-120b none	OpenAI	1	1.5	$0.010	0/1	9.57s
#22	GPT-5.2 Chat none	OpenAI	1	8.8	$0.393	2/3	9.82s
#55	GPT-5.6 Luna medium	OpenAI	2	5.4	$0.258	1/3	10.4s
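
To work with these rows outside the page, the sketch below parses a few of the tab-separated rows and sorts them by average response time, ascending, matching the page's sort order. The names `Row`, `parse_row`, and `SAMPLE` are hypothetical and not part of AI BENCHY. It assumes that "Wrong Tests" means total tests minus correct tests, which agrees with the figures shown on this page but is not documented here.

```python
from dataclasses import dataclass


@dataclass
class Row:
    rank: int
    model: str
    company: str
    wrong_answers: int   # "Wrong answer Count" column
    score: float         # "Category Score" column
    cost_usd: float      # "Total Cost" column
    correct: int         # numerator of "Tests Correct"
    total: int           # denominator of "Tests Correct"
    avg_seconds: float   # "Response Time (avg)" column

    @property
    def wrong_tests(self) -> int:
        # Assumption: every test that is not correct counts as wrong,
        # whatever the failure reason.
        return self.total - self.correct


def parse_row(line: str) -> Row:
    """Parse one tab-separated leaderboard row (hypothetical helper)."""
    rank, model, company, wrong, score, cost, tests, secs = line.split("\t")
    correct, total = tests.split("/")
    return Row(
        rank=int(rank.lstrip("#")),
        model=model,
        company=company,
        wrong_answers=int(wrong),
        score=float(score),
        cost_usd=float(cost.lstrip("$")),
        correct=int(correct),
        total=int(total),
        avg_seconds=float(secs.rstrip("s")),
    )


# Three rows copied from the table above, deliberately out of order.
SAMPLE = [
    "#68\tClaude Sonnet 4.6 none\tAnthropic\t1\t5.5\t$0.316\t1/3\t5.19s",
    "#118\tGLM 5 none\tZ.ai\t3\t4.0\t$0.027\t0/3\t5.12s",
    "#8\tGemini 3.5 Flash low\tGoogle\t1\t7.8\t$0.349\t2/3\t6.71s",
]

# Sort ascending by average response time.
rows = sorted((parse_row(line) for line in SAMPLE), key=lambda r: r.avg_seconds)

for r in rows:
    print(f"{r.model}: {r.avg_seconds}s, wrong answers {r.wrong_answers}, "
          f"wrong tests {r.wrong_tests}/{r.total}")
```

A row's wrong-answer count can be lower than its wrong-test count, because a test can also fail for another reason such as an API error or a timeout.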


Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.
