Coding x Wrong answer Ranking

AI BENCHY Category Failures

See which AI models most often give a wrong answer on Coding tests, so you can spot weak points faster. Sorted by: Response Time (avg), ascending.

Models Shown

Total Failures: 230

Most Affected Model: Laguna XS 2.1 (3 failures)

Failure Reasons

Wrong answer 230, API error 43, Timed out 25, No answer 18, Did not follow instructions 16, Extra formatting 12

Categories

Domain specific 368, Anti-AI Tricks 270, Coding 230, Puzzle Solving 173, Trivia 150, Combined 58, Instructions following 56, General Intelligence 49, Data parsing and extraction 36, Tool Calling 3

134/134

Rank	Model	Company	Wrong answer Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#136	Kimi K2.5 none	Moonshot AI	2	5.5	$0.027	1/3	24.6s
#31	Nemotron 3 Ultra 550b A55b medium	NVIDIA	1	8.4	$0.158	2/3	26.5s
#23	Step 3.7 Flash medium	Stepfun	1	8.8	$0.376	2/3	27.4s
#39	Claude Sonnet 4.6 medium	Anthropic	1	5.7	$1.418	1/3	33.3s
#80	Gemini 3.5 Flash none	Google	1	8.8	$1.079	2/3	34.7s
#127	Owl Alpha none	Openrouter	1	5.6	$0.000	1/3	36.9s
#186	Nemotron 3 Nano Omni 30b A3b Reasoning medium	NVIDIA	1	1.1	$0.000	0/1	38.1s
#92	gpt-oss-120b medium	OpenAI	2	5.9	$0.013	1/3	38.4s
#153	Mistral Small 4 medium	Mistral	3	4.4	$0.068	0/3	40.0s
#10	Gemini 3.1 Pro Preview medium	Google	1	7.9	$1.054	2/3	40.2s
#17	GLM 5.2 medium	Z.ai	1	8.2	$0.179	2/3	41.0s
#28	Gemini 2.5 Flash medium	Google	1	7.8	$0.379	2/3	41.0s
#47	Grok 4.3 medium	X AI	1	5.9	$0.614	1/3	41.2s
#93	GPT-5 Nano medium	OpenAI	2	7.0	$0.081	1/3	41.6s
#60	Qwen3.6 Flash medium	Qwen	3	5.0	$0.288	0/3	42.9s
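For readers who want to re-sort the table offline, here is a minimal sketch that parses a few of the tab-separated rows and orders them by average response time or by wrong-answer rate. The `Row` class, its field names, and the `wrong_answer_rate` metric (wrong answers divided by total tests) are assumptions made for illustration; they are not part of AI BENCHY.

```python
from dataclasses import dataclass

# Three rows copied from the table; columns are tab-separated.
ROWS = """\
#136\tKimi K2.5 none\tMoonshot AI\t2\t5.5\t$0.027\t1/3\t24.6s
#31\tNemotron 3 Ultra 550b A55b medium\tNVIDIA\t1\t8.4\t$0.158\t2/3\t26.5s
#153\tMistral Small 4 medium\tMistral\t3\t4.4\t$0.068\t0/3\t40.0s
"""


@dataclass
class Row:
    """One leaderboard row (hypothetical field names)."""
    rank: int
    model: str
    company: str
    wrong_answers: int
    score: float
    cost_usd: float
    correct: int
    total: int
    avg_seconds: float

    @property
    def wrong_answer_rate(self) -> float:
        # Assumed metric: share of tests that ended in a wrong answer.
        return self.wrong_answers / self.total


def parse_row(line: str) -> Row:
    """Split one tab-separated row and strip the '#', '$' and 's' decorations."""
    rank, model, company, wrong, score, cost, tests, secs = line.split("\t")
    correct, total = tests.split("/")
    return Row(
        rank=int(rank.lstrip("#")),
        model=model,
        company=company,
        wrong_answers=int(wrong),
        score=float(score),
        cost_usd=float(cost.lstrip("$")),
        correct=int(correct),
        total=int(total),
        avg_seconds=float(secs.rstrip("s")),
    )


rows = [parse_row(line) for line in ROWS.splitlines() if line.strip()]

# Same ordering as the page: fastest average response first.
by_time = sorted(rows, key=lambda r: r.avg_seconds)

# Alternative ordering: highest wrong-answer rate first.
by_rate = sorted(rows, key=lambda r: r.wrong_answer_rate, reverse=True)

for r in by_rate:
    print(f"{r.model}: {r.wrong_answers}/{r.total} wrong, {r.avg_seconds}s avg")
```

With only one to three tests per model, the rate is a coarse signal; it is shown here only to illustrate how the columns relate.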


Charts (not reproduced here): Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.
