Coding x Wrong answer Ranking

AI BENCHY Category Failures

See which AI models most often fail Coding tests with a wrong answer, so you can spot weak points faster. Sorted by: Tests Correct ↓.

Models Shown

Total Failures

230

Most Affected Model

Gemini 3 Flash Preview 1

Failure Reasons

Wrong answer: 230, API error: 43, Timed out: 23, No answer: 18, Did not follow instructions: 16, Extra formatting: 12

Categories

Domain specific: 367, Anti-AI Tricks: 270, Coding: 230, Puzzle Solving: 172, Trivia: 149, Combined: 58, Instructions following: 56, General Intelligence: 49, Data parsing and extraction: 36, Tool Calling: 3

134/134

Rank	Model	Company	Wrong answer Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#31	Nemotron 3 Ultra 550b A55b medium	NVIDIA	1	8.4	$0.158	2/3	26.5s
#32	GPT-5.4 Mini medium	OpenAI	1	8.4	$0.526	2/3	57.9s
#35	Claude Sonnet 5 medium	Anthropic	1	9.0	$0.550	2/3	17.3s
#49	Step 3.7 Flash low	Stepfun	1	8.2	$0.341	2/3	9.46s
#56	Mercury 2 medium	Inception	1	8.2	$0.058	2/3	2.04s
#79	Gemini 3.5 Flash none	Google	1	8.8	$1.079	2/3	34.7s
#30	Qwen3.7 Plus medium	Qwen	1	6.1	$0.177	1/3	108.6s
#33	Qwen3.5 Plus 2026-02-15 medium	Qwen	1	6.6	$0.310	1/3	180.7s
#34	Qwen3.5-27B medium	Qwen	2	6.2	$0.536	1/3	160.7s
#36	Qwen3.6 Plus medium	Qwen	1	6.1	$0.294	1/3	153.1s
#37	GPT-5.6 Terra medium	OpenAI	2	6.1	$0.496	1/3	7.19s
#38	Claude Sonnet 4.6 medium	Anthropic	1	5.7	$1.418	1/3	33.3s
#40	Gemini 3.1 Flash Lite Preview medium	Google	2	5.5	$0.068	1/3	4.09s
#41	Qwen3.5 Plus 2026-04-20 medium	Qwen	2	6.2	$0.317	1/3	125.3s
#42	Gemini 3.1 Flash Lite medium	Google	2	5.5	$0.071	1/3	3.81s

Top Models by Wrong answer Count

Wrong answer Count vs Score

Top Models by Response Time (avg)

Top Models by Estimated Wasted Cost
