Coding x Wrong answer Ranking

AI BENCHY Category Failures

See which AI models are most likely to hit Wrong answer on Coding, so you can spot weak points faster. Sort by: Tests Correct ↓.

Models Shown

Total Failures: 230

Most Affected Model: Gemini 3 Flash Preview 1

Failure Reasons

Wrong answer: 230, API error: 43, Timed out: 25, No answer: 18, Did not follow instructions: 16, Extra formatting: 12

Categories

Domain specific: 368, Anti-AI Tricks: 270, Coding: 230, Puzzle Solving: 173, Trivia: 150, Combined: 58, Instructions following: 56, General Intelligence: 49, Data parsing and extraction: 36, Tool Calling: 3

134/134

Rank	Model	Company	Wrong answer Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#102	GPT-5.6 Sol none	OpenAI	2	5.5	$0.225	1/3	1.39s
#105	GPT-5.5 none	OpenAI	2	5.5	$0.231	1/3	1.35s
#107	Seed-2.0-Lite none	Bytedance Seed	2	5.6	$0.019	1/3	2.83s
#108	GPT-5.6 Luna low	OpenAI	2	5.5	$0.141	1/3	4.61s
#109	Gemini 2.5 Flash none	Google	2	5.5	$0.016	1/3	736ms
#110	Gemini 3.1 Flash Lite minimal	Google	2	5.5	$0.013	1/3	831ms
#112	Gemini 3.1 Flash Lite none	Google	2	5.5	$0.013	1/3	938ms
#113	Qwen3.5-Flash none	Qwen	2	5.5	$0.005	1/3	850ms
#114	Gemma 4 31B none	Google	2	5.5	$0.004	1/3	11.2s
#115	Nemotron 3 Ultra 550b A55b none	NVIDIA	2	5.5	$0.027	1/3	1.02s
#117	GPT-5.6 Terra none	OpenAI	2	5.5	$0.130	1/3	1.00s
#119	Qwen3.6 Flash none	Qwen	2	5.4	$0.015	1/3	1.79s
#120	Qwen3.5-35B-A3B none	Qwen	2	5.5	$0.012	1/3	1.39s
#121	Qwen3.5-27B none	Qwen	2	5.8	$0.015	1/3	1.80s
#122	GLM 5V Turbo none	Z.ai	2	5.5	$0.052	1/3	3.13s

Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.