Domain specific x Wrong answer Ranking

AI BENCHY Category Failures

See which AI models are most likely to hit Wrong answer on Domain specific, so you can spot weak points faster. Sort by: Response Time (avg) ↑.

Models Shown

Total Failures

314

Most Affected Model

Failure Reasons

Wrong answer314 Timed out34 Extra formatting12 API error6 No answer5 Did not follow instructions1

Categories

Domain specific314 Anti-AI Tricks245 Coding194 Puzzle Solving147 Trivia130 Instructions following53 Combined52 Data parsing and extraction35 General Intelligence32 Tool Calling2

Rank	Model	Company	Wrong answer Count	Category Score	Tests Correct	Response Time (avg)
#17	GLM 5 medium	Z.ai	2	3.5	0/3	0ms
#52	Claude Sonnet 4.6 medium	Anthropic	1	2.9	0/3	0ms
#160	LFM2-24B-A2B none	Liquid	1	5.9	1/3	287ms
#163	Granite 4.1 8B none	IBM Granite	3	3.0	0/3	357ms
#142	Mistral Small 4 none	Mistral	2	5.3	1/3	367ms
#146	Laguna Xs.2 none	Poolside	2	5.3	1/3	371ms
#154	Qwen3.5-9B none	Qwen	3	3.0	0/3	464ms
#131	Qwen3.5-122B-A10B none	Qwen	2	5.3	1/3	465ms
#117	Qwen3.5-35B-A3B none	Qwen	1	7.7	2/3	485ms
#162	Nemotron 3 Nano Omni 30b A3b Reasoning none	NVIDIA	3	3.6	0/3	489ms
#97	Gemini 2.5 Flash none	Google	2	5.9	1/3	495ms
#155	Mercury 2 none	Inception	2	5.3	1/3	534ms
#115	Qwen3.5-27B none	Qwen	3	3.0	0/3	540ms
#152	MiMo-V2-Flash none	Xiaomi	2	5.3	1/3	564ms
#106	Grok 4.20 Beta none	X AI	3	3.0	0/3	611ms

Top Models by Wrong answer Count