Coding x Wrong answer Ranking

AI BENCHY Category Failures

See which AI models are most likely to fail Coding tests with a "Wrong answer" result, so you can spot weak points faster. Sorted by Response Time (avg), ascending.

Models Shown

Total Failures: 230

Most Affected Model: Laguna XS 2.1 (3)

Failure Reasons

Wrong answer: 230, API error: 43, Timed out: 25, No answer: 18, Did not follow instructions: 16, Extra formatting: 12

Categories

Domain specific: 368, Anti-AI Tricks: 270, Coding: 230, Puzzle Solving: 173, Trivia: 150, Combined: 58, Instructions following: 56, General Intelligence: 49, Data parsing and extraction: 36, Tool Calling: 3

134/134

Rank	Model	Company	Wrong answer Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#121	Qwen3.5-27B none	Qwen	2	5.8	$0.015	1/3	1.80s
#81	Gemini 3 Flash Preview none	Google	2	5.5	$0.025	1/3	1.80s
#182	Laguna Xs.2 none	Poolside	1	8.3	$0.004	0/1	1.96s
#57	Mercury 2 medium	Inception	1	8.2	$0.058	2/3	2.04s
#123	Qwen3.5 Plus 2026-02-15 none	Qwen	3	4.3	$0.016	0/3	2.05s
#73	Qwen3.7 Plus none	Qwen	2	5.5	$0.023	1/3	2.15s
#165	GPT-5.4 Nano none	OpenAI	3	4.6	$0.011	0/3	2.22s
#150	Qwen3 Coder Next none	Qwen	3	4.6	$0.009	0/3	2.22s
#141	GLM 5 Turbo none	Z.ai	3	3.9	$0.047	0/3	2.41s
#161	GLM 4.7 Flash none	Z.ai	3	4.3	$0.004	0/3	2.54s
#179	MiMo-V2-Flash none	Xiaomi	2	4.3	$0.025	0/3	2.64s
#162	Nemotron 3 Super none	NVIDIA	3	3.3	$0.006	0/3	2.64s
#126	Mimo V2 PRO none	Xiaomi	1	5.5	$0.045	1/3	2.65s
#85	Gemini 3.5 Flash minimal	Google	1	5.6	$0.108	1/3	2.75s
#131	Mimo V2 Omni none	Xiaomi	1	4.4	$0.021	0/3	2.75s

Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.
