Domain specific x Wrong answer Ranking

AI BENCHY Category Failures

See which AI models are most likely to hit Wrong answer on Domain specific, so you can spot weak points faster. Sort by: Response Time (avg) ↑.

Models Shown

Total Failures

314

Most Affected Model

Failure Reasons

Wrong answer314 Timed out34 Extra formatting12 API error6 No answer5 Did not follow instructions1

Categories

Domain specific314 Anti-AI Tricks245 Coding194 Puzzle Solving147 Trivia130 Instructions following53 Combined52 Data parsing and extraction35 General Intelligence32 Tool Calling2

Rank	Model	Company	Wrong answer Count	Category Score	Tests Correct	Response Time (avg)
#85	Gemma 4 31B none	Google	1	7.7	2/3	3.22s
#3	Gemini 3.5 Flash low	Google	1	7.7	2/3	3.39s
#77	Claude Sonnet 4.6 none	Anthropic	1	7.7	2/3	3.54s
#133	DeepSeek V3.2 none	DeepSeek	2	2.9	0/3	4.17s
#40	Gemini 3.1 Flash Lite Preview medium	Google	3	3.0	0/3	4.21s
#135	Kimi K2.5 none	Moonshot AI	2	5.3	1/3	4.38s
#114	Qwen3.5 Plus 2026-04-20 none	Qwen	2	5.3	1/3	4.43s
#138	Ling-2.6-flash none	Inclusionai	3	3.0	0/3	4.95s
#7	Gemini 3.5 Flash medium	Google	1	7.7	2/3	5.24s
#145	Laguna M.1 none	Poolside	3	3.6	0/3	5.50s
#132	Mistral Small 4 medium	Mistral	1	5.3	1/3	6.11s
#141	Nemotron 3 Super none	NVIDIA	3	3.6	0/3	6.23s
#81	Mercury 2 medium	Inception	3	2.9	0/3	6.48s
#35	Gemini 3 PRO Preview medium	Google	2	5.3	1/3	7.01s
#153	Qwen3.6 35B A3B none	Qwen	3	3.5	0/3	7.45s

Top Models by Wrong answer Count