Domain specific x Wrong answer Ranking

AI BENCHY Category Failures

See which AI models are most likely to give a wrong answer on Domain specific tests, so you can spot weak points faster. Rows are sorted by average response time, slowest first.

Total Failures: 314

Failure Reasons

Wrong answer: 314
Timed out: 34
Extra formatting: 12
API error: 6
No answer: 5
Did not follow instructions: 1

Categories

Domain specific: 314
Anti-AI Tricks: 245
Coding: 194
Puzzle Solving: 147
Trivia: 130
Instructions following: 53
Combined: 52
Data parsing and extraction: 35
General Intelligence: 32
Tool Calling: 2

Rank	Model	Company	Wrong answer Count	Category Score	Tests Correct	Response Time (avg)
#129	MiniMax M2.5 medium	Minimax	2	2.9	0/3	237.3s
#103	DeepSeek V4 Pro high	DeepSeek	1	2.9	0/3	205.7s
#94	GPT-5 Nano medium	OpenAI	1	5.2	1/3	204.0s
#38	Grok 4.3 medium	X AI	2	5.3	1/3	181.7s
#158	GLM 4.7 Flash medium	Z.ai	2	3.5	0/3	174.6s
#62	Step 3.5 Flash medium	Stepfun	2	5.3	1/3	170.5s
#9	GPT-5.5 medium	OpenAI	2	5.3	1/3	164.1s
#47	Grok Build 0.1 medium	X AI	1	5.3	1/3	158.0s
#71	Step 3.7 Flash high	Stepfun	2	4.1	0/3	149.6s
#49	Qwen3.5-Flash medium	Qwen	1	5.3	1/3	146.5s
#53	Gemini 3.1 Flash Lite high	Google	3	3.6	0/3	139.9s
#76	Kimi K2.5 medium	Moonshot AI	2	3.5	0/3	137.3s
#119	Cobuddy medium	Baidu	3	2.9	0/3	128.2s
#12	Gemini 3.1 Flash Lite Preview high	Google	2	5.3	1/3	127.6s
#86	Grok 4.1 Fast medium	X AI	1	5.8	1/3	121.8s

Top Models by Wrong answer Count