Coding x Wrong answer Ranking

AI BENCHY Category Failures

See which AI models most often return a Wrong answer on Coding tests, so you can spot weak points faster. Sort by: Tests Correct ↓.

Models Shown

Total Failures

230

Most Affected Model

Gemini 3 Flash Preview 1

Failure Reasons

Wrong answer (230), API error (43), Timed out (25), No answer (18), Did not follow instructions (16), Extra formatting (12)

Categories

Domain specific (368), Anti-AI Tricks (270), Coding (230), Puzzle Solving (173), Trivia (150), Combined (58), Instructions following (56), General Intelligence (49), Data parsing and extraction (36), Tool Calling (3)

134/134

Rank	Model	Company	Wrong answer Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#116	Qwen3.6 Max Preview none	Qwen	3	3.8	$0.075	0/3	3.12s
Total Tests 3 Wrong Tests 3 Total Cost $0.075 Response Time (avg) 3.12s
#118	GLM 5 none	Z.ai	3	4.0	$0.027	0/3	5.12s
Total Tests 3 Wrong Tests 3 Total Cost $0.027 Response Time (avg) 5.12s
#123	Qwen3.5 Plus 2026-02-15 none	Qwen	3	4.3	$0.016	0/3	2.05s
Total Tests 3 Wrong Tests 3 Total Cost $0.016 Response Time (avg) 2.05s
#124	North Mini Code medium	Cohere	3	4.5	$0.000	0/3	320.4s
Total Tests 3 Wrong Tests 3 Total Cost $0.000 Response Time (avg) 320.4s
#131	Mimo V2 Omni none	Xiaomi	1	4.4	$0.021	0/3	2.75s
Total Tests 3 Wrong Tests 3 Total Cost $0.021 Response Time (avg) 2.75s
#132	Claude Sonnet 5 none	Anthropic	3	4.6	$0.287	0/3	3.67s
Total Tests 3 Wrong Tests 3 Total Cost $0.287 Response Time (avg) 3.67s
#134	GLM 5.1 none	Z.ai	3	3.9	$0.057	0/3	4.96s
Total Tests 3 Wrong Tests 3 Total Cost $0.057 Response Time (avg) 4.96s
#135	DeepSeek V4 Flash none	DeepSeek	3	4.2	$0.007	0/3	17.1s
Total Tests 3 Wrong Tests 3 Total Cost $0.007 Response Time (avg) 17.1s
#137	MiMo-V2.5-Pro none	Xiaomi	2	4.3	$0.017	0/3	1.41s
Total Tests 3 Wrong Tests 3 Total Cost $0.017 Response Time (avg) 1.41s
#139	Gemma 4 26B A4B none	Google	2	3.7	$0.004	0/3	4.16s
Total Tests 3 Wrong Tests 3 Total Cost $0.004 Response Time (avg) 4.16s
#140	Qwen3.5 Plus 2026-04-20 none	Qwen	2	3.9	$0.032	0/3	1.69s
Total Tests 3 Wrong Tests 3 Total Cost $0.032 Response Time (avg) 1.69s
#141	GLM 5 Turbo none	Z.ai	3	3.9	$0.047	0/3	2.41s
Total Tests 3 Wrong Tests 3 Total Cost $0.047 Response Time (avg) 2.41s
#142	Laguna XS 2.1 none	Poolside	3	4.3	$0.003	0/3	623ms
Total Tests 3 Wrong Tests 3 Total Cost $0.003 Response Time (avg) 623ms
#143	GPT-5.6 Luna none	OpenAI	3	3.8	$0.047	0/3	980ms
Total Tests 3 Wrong Tests 3 Total Cost $0.047 Response Time (avg) 980ms
#145	Qwen3.5-122B-A10B none	Qwen	3	3.7	$0.020	0/3	2.77s
Total Tests 3 Wrong Tests 3 Total Cost $0.020 Response Time (avg) 2.77s
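
The ranking rows above are tab-separated, so they can be loaded for further analysis. The following is a minimal sketch, assuming the column order shown in the header row; the names `Row`, `parse_row`, and `parse_seconds` are hypothetical and not part of AI BENCHY. It also computes how many failed tests were not counted as Wrong answer, on the assumption that those fall under the other failure reasons listed earlier.

```python
from dataclasses import dataclass


@dataclass
class Row:
    rank: int
    model: str
    company: str
    wrong_answers: int
    category_score: float
    total_cost_usd: float
    tests_correct: int
    tests_total: int
    avg_response_s: float

    @property
    def other_failures(self) -> int:
        # Failed tests not counted as "Wrong answer" (assumption: these
        # belong to other failure reasons such as API error or timeout).
        return self.tests_total - self.tests_correct - self.wrong_answers


def parse_seconds(text: str) -> float:
    """Convert a duration such as '623ms' or '3.12s' to seconds."""
    if text.endswith("ms"):
        return float(text[:-2]) / 1000.0
    return float(text.rstrip("s"))


def parse_row(line: str) -> Row:
    """Parse one tab-separated ranking row into a Row."""
    rank, model, company, wrong, score, cost, correct, rt = line.split("\t")
    passed, total = correct.split("/")
    return Row(
        rank=int(rank.lstrip("#")),
        model=model,
        company=company,
        wrong_answers=int(wrong),
        category_score=float(score),
        total_cost_usd=float(cost.lstrip("$")),
        tests_correct=int(passed),
        tests_total=int(total),
        avg_response_s=parse_seconds(rt),
    )


# Two rows copied from the table above.
sample = [
    "#131\tMimo V2 Omni none\tXiaomi\t1\t4.4\t$0.021\t0/3\t2.75s",
    "#142\tLaguna XS 2.1 none\tPoolside\t3\t4.3\t$0.003\t0/3\t623ms",
]
rows = [parse_row(line) for line in sample]
for r in rows:
    print(r.rank, r.model, r.wrong_answers, r.other_failures, r.avg_response_s)
```

Normalizing response times to seconds keeps values such as 623ms and 320.4s comparable when sorting.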

Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.
