Coding x Wrong answer Ranking

AI BENCHY Category Failures

See which AI models most often give a wrong answer on Coding tests, so you can spot weak points faster.

Models Shown

Total Failures

230

Most Affected Model

Qwen3.6 Flash 3

Failure Reasons

Wrong answer: 230, API error: 43, Timed out: 23, No answer: 18, Did not follow instructions: 16, Extra formatting: 12

Categories

Domain specific: 367, Anti-AI Tricks: 270, Coding: 230, Puzzle Solving: 172, Trivia: 149, Combined: 58, Instructions following: 56, General Intelligence: 49, Data parsing and extraction: 36, Tool Calling: 3

134/134

Rank	Model	Company	Wrong answer Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#152	Mistral Small 4 medium	Mistral	3	4.4	$0.068	0/3	40.0s
#154	Qwen3.5-9B none	Qwen	3	3.9	$0.006	0/3	5.60s
#158	GPT-4o-mini none	OpenAI	3	3.2	$0.006	0/3	1.63s
#160	GLM 4.7 Flash none	Z.ai	3	4.3	$0.004	0/3	2.54s
#161	Nemotron 3 Super none	NVIDIA	3	3.3	$0.006	0/3	2.64s
#164	GPT-5.4 Nano none	OpenAI	3	4.6	$0.011	0/3	2.22s
#170	Mercury 2 none	Inception	3	3.4	$0.011	0/3	1.03s
#34	Qwen3.5-27B medium	Qwen	2	6.2	$0.536	1/3	160.7s
#37	GPT-5.6 Terra medium	OpenAI	2	6.1	$0.496	1/3	7.19s
#40	Gemini 3.1 Flash Lite Preview medium	Google	2	5.5	$0.068	1/3	4.09s
#41	Qwen3.5 Plus 2026-04-20 medium	Qwen	2	6.2	$0.317	1/3	125.3s
#42	Gemini 3.1 Flash Lite medium	Google	2	5.5	$0.071	1/3	3.81s
#47	GPT-5.6 Terra low	OpenAI	2	6.6	$0.343	1/3	9.56s
#50	GPT-5.6 Luna high	OpenAI	2	5.5	$0.924	1/3	15.6s
#54	GPT-5.6 Luna medium	OpenAI	2	5.4	$0.258	1/3	10.4s


Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost
