Coding x Wrong answer Ranking

AI BENCHY Category Failures

See which AI models are most likely to hit Wrong answer on Coding, so you can spot weak points faster. Sort by: Failure Count ↑.

Models Shown

Total Failures

120

Most Affected Model

Failure Reasons

Wrong answer120 No answer18 Did not follow instructions16 Timed out12 Extra formatting7 API error6

Categories

Domain specific298 Anti-AI Tricks235 Puzzle Solving148 Trivia127 Coding120 Instructions following52 Combined51 Data parsing and extraction32 General Intelligence27 Tool Calling2

Rank	Model	Company	Wrong answer Count	Category Score	Tests Correct	Response Time (avg)
#1	Gemini 3 Flash Preview medium	Google	1	7.9	1/2	96.0s
#3	Gemini 3.5 Flash low	Google	1	6.8	1/2	5.54s
#4	Gemini 3.1 Pro Preview medium	Google	1	7.0	1/2	54.3s
#9	Gemini 3.5 Flash none	Google	1	8.2	1/2	39.6s
#11	GPT-5.5 medium	OpenAI	1	8.2	1/2	69.7s
#12	Gemini 3 Flash Preview low	Google	1	7.3	1/2	6.66s
#14	Qwen3.6 Max Preview medium	Qwen	1	8.2	1/2	178.0s
#20	Qwen3.5 Plus 2026-02-15 medium	Qwen	1	7.6	1/2	193.8s
#21	Seed-2.0-Lite medium	Bytedance Seed	1	7.0	1/2	107.7s
#25	Qwen3.5-27B medium	Qwen	1	7.0	1/2	123.9s
#26	Qwen3.7 Max none	Qwen	1	6.8	1/2	1.39s
#27	GPT-5.4 medium	OpenAI	1	8.2	1/2	55.0s
#33	Qwen3.6 Plus medium	Qwen	1	4.1	0/2	201.7s
#34	Gemini 3.1 Flash Lite Preview medium	Google	1	6.8	1/2	3.98s
#35	Gemini 3.1 Flash Lite medium	Google	1	6.8	1/2	3.59s

1 2 3 4 5 6 7

→

Top Models by Wrong answer Count