Coding x Wrong answer Ranking

AI BENCHY Category Failures

See which AI models are most likely to hit Wrong answer on Coding, so you can spot weak points faster. Sort by: Tests Correct ↓.

Models Shown

Total Failures

230

Most Affected Model

Gemini 3 Flash Preview 1

Failure Reasons

Wrong answer: 230, API error: 43, Timed out: 25, No answer: 18, Did not follow instructions: 16, Extra formatting: 12

Categories

Domain specific: 368, Anti-AI Tricks: 270, Coding: 230, Puzzle Solving: 173, Trivia: 150, Combined: 58, Instructions following: 56, General Intelligence: 49, Data parsing and extraction: 36, Tool Calling: 3

134/134

Rank	Model	Company	Wrong answer Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#125	Owl Alpha medium	Openrouter	1	5.4	$0.000	1/3	18.7s
#126	Mimo V2 PRO none	Xiaomi	1	5.5	$0.045	1/3	2.65s
#127	Owl Alpha none	Openrouter	1	5.6	$0.000	1/3	36.9s
#128	Kimi K2.6 none	Moonshot AI	1	5.5	$0.078	1/3	82.6s
#129	GPT-5.4 none	OpenAI	2	5.5	$0.122	1/3	1.62s
#136	Kimi K2.5 none	Moonshot AI	2	5.5	$0.027	1/3	24.6s
#138	Qwen3.6 27B none	Qwen	2	5.5	$0.025	1/3	4.16s
#144	GPT-5.4 Mini none	OpenAI	2	5.5	$0.038	1/3	913ms
#148	Qwen3.6 35B A3B none	Qwen	2	5.5	$0.031	1/3	8.77s
#154	MiMo-V2.5 none	Xiaomi	2	5.5	$0.006	1/3	3.24s
#60	Qwen3.6 Flash medium	Qwen	3	5.0	$0.288	0/3	42.9s
#74	GLM 5.2 none	Z.ai	2	3.7	$0.042	0/3	7.55s
#76	Step 3.7 Flash high	Stepfun	1	4.0	$1.148	0/3	206.2s
#84	Qwen3.5-Flash medium	Qwen	2	3.7	$0.080	0/3	58.9s
#87	Mimo V2 Omni medium	Xiaomi	1	3.3	$0.683	0/3	183.9s
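The columns above can be cross-checked with a little arithmetic. The following is a minimal sketch, not the site's own method: the names `ModelRow` and `summarize` are hypothetical. It assumes that "Tests Correct" is a `correct/total` pair and that any failed test not counted under "Wrong answer Count" failed for one of the other listed reasons (API error, timed out, and so on).

```python
from dataclasses import dataclass


@dataclass
class ModelRow:
    """One leaderboard row, reduced to the fields needed for the check."""
    model: str
    wrong_answer_count: int
    tests_correct: int
    total_tests: int

    @property
    def wrong_tests(self) -> int:
        # Failed tests of any kind: total minus correct.
        return self.total_tests - self.tests_correct

    @property
    def other_failures(self) -> int:
        # Assumption: failures not labeled "Wrong answer" had another reason.
        return max(self.wrong_tests - self.wrong_answer_count, 0)


def summarize(model: str, wrong_answer_count: int, tests_correct: str) -> ModelRow:
    """Build a row from a 'correct/total' string such as '1/3'."""
    correct, total = (int(part) for part in tests_correct.split("/"))
    return ModelRow(model, wrong_answer_count, correct, total)


# Three rows copied from the table above.
rows = [
    summarize("Owl Alpha medium", 1, "1/3"),
    summarize("GPT-5.4 none", 2, "1/3"),
    summarize("Qwen3.6 Flash medium", 3, "0/3"),
]

for row in rows:
    print(
        f"{row.model}: {row.wrong_tests} failed of {row.total_tests}, "
        f"{row.wrong_answer_count} wrong answer, {row.other_failures} other"
    )
```

Under these assumptions, a row such as Owl Alpha medium (1 wrong answer, 1/3 correct) has two failed tests, only one of which was a wrong answer.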


Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.
