Coding x Wrong answer Ranking

AI BENCHY Category Failures

See which AI models most often return a wrong answer on Coding tests, so you can spot weak points faster. Rows are sorted by average response time, slowest first.

Models Shown

Total Failures: 230

Most Affected Model: North Mini Code 3

Failure Reasons

Wrong answer 230, API error 43, Timed out 25, No answer 18, Did not follow instructions 16, Extra formatting 12

Categories

Domain specific 368, Anti-AI Tricks 270, Coding 230, Puzzle Solving 173, Trivia 150, Combined 58, Instructions following 56, General Intelligence 49, Data parsing and extraction 36, Tool Calling 3

134/134

Rank	Model	Company	Wrong answer Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#30	Qwen3.7 Plus medium	Qwen	1	6.1	$0.177	1/3	108.6s
#54	Grok Build 0.1 medium	X AI	1	5.7	$0.927	1/3	108.5s
#90	MiMo-V2.5 medium	Xiaomi	2	6.2	$0.061	1/3	97.1s
#91	Mimo V2 PRO medium	Xiaomi	1	6.0	$0.333	1/3	94.2s
#2	Gemini 3 Flash Preview medium	Google	1	8.6	$0.667	2/3	84.4s
#128	Kimi K2.6 none	Moonshot AI	1	5.5	$0.078	1/3	82.6s
#160	Cobuddy medium	Baidu	1	3.7	$0.000	0/3	79.2s
#78	Laguna XS 2.1 medium	Poolside	2	5.5	$0.036	1/3	70.3s
#69	GLM 5V Turbo medium	Z.ai	2	6.0	$0.457	1/3	63.4s
#12	GPT-5.5 medium	OpenAI	1	8.8	$3.679	2/3	59.8s
#84	Qwen3.5-Flash medium	Qwen	2	3.7	$0.080	0/3	58.9s
#33	GPT-5.4 Mini medium	OpenAI	1	8.4	$0.526	2/3	57.9s
#27	DeepSeek V4 Flash high	DeepSeek	1	7.8	$0.027	2/3	50.6s
#20	GPT-5.4 medium	OpenAI	1	8.8	$1.210	2/3	44.4s
#60	Qwen3.6 Flash medium	Qwen	3	5.0	$0.288	0/3	42.9s
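For readers who want to work with the table outside the page, here is a minimal Python sketch that parses tab-separated rows in the layout shown above and sorts them by average response time, slowest first. The `Row` class, the `parse_row` helper, and the field names are hypothetical and chosen only for illustration; the site does not document an export format, so the sketch assumes the eight columns appear exactly as in the table. The number of failed tests is derived from the Tests Correct column (total minus correct), which can be larger than Wrong answer Count because other failure reasons also count.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """One leaderboard row (hypothetical structure, not an official schema)."""
    rank: int
    model: str
    company: str
    wrong_answer_count: int
    category_score: float
    total_cost_usd: float
    tests_correct: int
    tests_total: int
    response_time_s: float

    @property
    def failed_tests(self) -> int:
        # Tests that were not correct, for any failure reason.
        return self.tests_total - self.tests_correct


def parse_row(line: str) -> Row:
    """Parse one tab-separated table row into a Row."""
    rank, model, company, wrong, score, cost, correct, rt = line.split("\t")
    passed, total = correct.split("/")
    return Row(
        rank=int(rank.lstrip("#")),
        model=model,
        company=company,
        wrong_answer_count=int(wrong),
        category_score=float(score),
        total_cost_usd=float(cost.lstrip("$")),
        tests_correct=int(passed),
        tests_total=int(total),
        response_time_s=float(rt.rstrip("s")),
    )


# Three rows copied from the table above.
SAMPLE = [
    "#2\tGemini 3 Flash Preview medium\tGoogle\t1\t8.6\t$0.667\t2/3\t84.4s",
    "#30\tQwen3.7 Plus medium\tQwen\t1\t6.1\t$0.177\t1/3\t108.6s",
    "#160\tCobuddy medium\tBaidu\t1\t3.7\t$0.000\t0/3\t79.2s",
]

# Sort by average response time, slowest first, as on the page.
rows = sorted((parse_row(line) for line in SAMPLE),
              key=lambda r: r.response_time_s, reverse=True)

for r in rows:
    print(f"{r.model}: {r.response_time_s}s, failed {r.failed_tests}/{r.tests_total}")
```

With only three tests per model, a single failure moves the pass rate by a third, so the derived counts are better read as a rough signal than as a precise ranking.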


Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.
