Coding x Wrong answer Ranking

AI BENCHY Category Failures

See which AI models most often return a wrong answer on Coding tests, so you can spot weak points faster. Sorted by: Tests Correct ↓.

Models Shown

Total Failures

230

Most Affected Model

Gemini 3 Flash Preview 1

Failure Reasons

Wrong answer 230, API error 43, Timed out 25, No answer 18, Did not follow instructions 16, Extra formatting 12

Categories

Domain specific 368, Anti-AI Tricks 270, Coding 230, Puzzle Solving 173, Trivia 150, Combined 58, Instructions following 56, General Intelligence 49, Data parsing and extraction 36, Tool Calling 3

134/134

Rank	Model	Company	Wrong answer Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#45	Qwen3.5-122B-A10B medium	Qwen	1	6.0	$0.588	1/3	114.5s
#46	Claude Opus 4.8 low	Anthropic	1	6.6	$1.270	1/3	7.58s
#47	Grok 4.3 medium	X AI	1	5.9	$0.614	1/3	41.2s
#48	GPT-5.6 Terra low	OpenAI	2	6.6	$0.343	1/3	9.56s
#51	GPT-5.6 Luna high	OpenAI	2	5.5	$0.924	1/3	15.6s
#54	Grok Build 0.1 medium	X AI	1	5.7	$0.927	1/3	108.5s
#55	GPT-5.6 Luna medium	OpenAI	2	5.4	$0.258	1/3	10.4s
#58	GPT-5.3 Chat none	OpenAI	2	5.6	$0.433	1/3	10.5s
#59	GPT-5.4 Nano medium	OpenAI	2	6.1	$0.107	1/3	19.1s
#61	DeepSeek V3.2 medium	DeepSeek	1	6.0	$0.042	1/3	248.7s
#63	Seed-2.0-Mini medium	Bytedance Seed	1	5.5	$0.044	1/3	220.5s
#65	Gemini 3 Flash Preview low	Google	2	5.8	$0.111	1/3	6.00s
#66	Grok 4.20 medium	X AI	2	6.3	$0.609	1/3	109.9s
#68	Claude Sonnet 4.6 none	Anthropic	1	5.5	$0.316	1/3	5.19s
#69	GLM 5V Turbo medium	Z.ai	2	6.0	$0.457	1/3	63.4s


Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost
