AI BENCHY Category Failures
Coding: Wrong answer
The table below lists the AI models that produced a Wrong answer failure in the Coding category, so you can spot weak points faster. Rows are sorted by average response time, ascending.
Failure Reasons
| Rank | Model | Company | Wrong Answer Count | Category Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #84 | Qwen3.5-9B none | Qwen | 1 | 5.2 | 0/1 | 5.69s |
| #61 | DeepSeek V3.2 none | DeepSeek | 1 | 2.4 | 0/1 | 7.63s |
| #50 | GLM 5 none | Z.ai | 1 | 5.6 | 0/1 | 8.84s |
| #79 | gpt-oss-120b none | OpenAI | 1 | 4.3 | 0/1 | 9.57s |
| #71 | GLM 5.1 none | Z.ai | 1 | 5.1 | 0/1 | 9.79s |
| #44 | Grok 4.20 medium | X AI | 1 | 4.3 | 0/1 | 24.3s |
| #65 | gpt-oss-120b medium | OpenAI | 1 | 4.3 | 0/1 | 26.3s |
| #69 | Mistral Small 4 medium | Mistral | 1 | 6.7 | 0/1 | 30.5s |
| #74 | Trinity Large Preview none | Arcee AI | 1 | 6.3 | 0/1 | 39.5s |
| #54 | GPT-5 Nano medium | OpenAI | 1 | 6.7 | 0/1 | 40.7s |
| #32 | MiMo-V2-Omni medium | Xiaomi | 1 | 4.0 | 0/1 | 68.5s |
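As a minimal sketch of the sort order used above, the rows can be re-sorted by average response time in plain Python. The model names and timings are taken from the table; the tuple-based row structure is an illustrative assumption, not part of AI BENCHY itself.

```python
# Leaderboard rows as (model, avg_response_time_seconds), copied from the table.
rows = [
    ("Qwen3.5-9B none", 5.69),
    ("DeepSeek V3.2 none", 7.63),
    ("GLM 5 none", 8.84),
    ("gpt-oss-120b none", 9.57),
    ("GLM 5.1 none", 9.79),
    ("Grok 4.20 medium", 24.3),
    ("gpt-oss-120b medium", 26.3),
    ("Mistral Small 4 medium", 30.5),
    ("Trinity Large Preview none", 39.5),
    ("GPT-5 Nano medium", 40.7),
    ("MiMo-V2-Omni medium", 68.5),
]

# Ascending sort on the response-time column, matching the table's order.
fastest_first = sorted(rows, key=lambda row: row[1])

print(fastest_first[0][0])   # fastest model to fail
print(fastest_first[-1][0])  # slowest model to fail
```

Sorting by a different column (e.g. Category Score) only requires changing the `key` function.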