Domain specific x Wrong answer Ranking

AI BENCHY Category Failures

See which AI models most often give a wrong answer on domain-specific tests, so you can spot weak points faster. Sorted by Tests Correct, descending.

Models Shown

Total Failures: 314

Most Affected Model

Failure Reasons

Wrong answer: 314, Timed out: 34, Extra formatting: 12, API error: 6, No answer: 5, Did not follow instructions: 1

Categories

Domain specific: 314, Anti-AI Tricks: 245, Coding: 194, Puzzle Solving: 147, Trivia: 130, Instructions following: 53, Combined: 52, Data parsing and extraction: 35, General Intelligence: 32, Tool Calling: 2

Rank	Model	Company	Wrong answer Count	Category Score	Tests Correct	Response Time (avg)
#2	Gemini 3.5 Flash high	Google	1	7.6	2/3	14.1s
#3	Gemini 3.5 Flash low	Google	1	7.7	2/3	3.39s
#4	Gemini 3.1 Pro Preview medium	Google	1	7.7	2/3	32.7s
#7	Gemini 3.5 Flash medium	Google	1	7.7	2/3	5.24s
#8	Claude Opus 4.7 none	Anthropic	1	7.7	2/3	1.19s
#20	Gemini 3.5 Flash none	Google	1	7.6	2/3	10.6s
#22	Step 3.7 Flash medium	Stepfun	1	7.7	2/3	48.3s
#27	Gemma 4 31B medium	Google	1	7.7	2/3	38.5s
#34	Qwen3.7 Max none	Qwen	1	7.7	2/3	975ms
#48	Gemini 3 Flash Preview none	Google	1	7.7	2/3	963ms
#74	Qwen3.6 Max Preview none	Qwen	1	7.7	2/3	1.22s
#77	Claude Sonnet 4.6 none	Anthropic	1	7.7	2/3	3.54s
#85	Gemma 4 31B none	Google	1	7.7	2/3	3.22s
#108	Qwen3.5-Flash none	Qwen	1	7.7	2/3	905ms
#117	Qwen3.5-35B-A3B none	Qwen	1	7.7	2/3	485ms
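
The Response Time column mixes units (for example 975ms and 14.1s), so the values cannot be compared as plain text. The sketch below is one possible way to load rows like those above and normalize the times to seconds. It assumes the table is tab-separated with exactly the header shown; the helper names `to_seconds` and `parse_rows` are hypothetical and not part of AI BENCHY.

```python
import csv
import io

# Three rows copied from the table above; fields are assumed to be tab-separated.
SAMPLE = (
    "Rank\tModel\tCompany\tWrong answer Count\tCategory Score\t"
    "Tests Correct\tResponse Time (avg)\n"
    "#2\tGemini 3.5 Flash high\tGoogle\t1\t7.6\t2/3\t14.1s\n"
    "#8\tClaude Opus 4.7 none\tAnthropic\t1\t7.7\t2/3\t1.19s\n"
    "#117\tQwen3.5-35B-A3B none\tQwen\t1\t7.7\t2/3\t485ms\n"
)


def to_seconds(text: str) -> float:
    """Convert a duration such as '975ms' or '14.1s' to seconds."""
    text = text.strip()
    if text.endswith("ms"):  # check 'ms' first, since it also ends with 's'
        return float(text[:-2]) / 1000.0
    if text.endswith("s"):
        return float(text[:-1])
    raise ValueError(f"unrecognized duration: {text!r}")


def parse_rows(tsv: str) -> list[dict]:
    """Parse leaderboard rows into dicts with numeric fields (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(tsv), delimiter="\t")
    rows = []
    for row in reader:
        passed, total = row["Tests Correct"].split("/")
        rows.append(
            {
                "rank": int(row["Rank"].lstrip("#")),
                "model": row["Model"],
                "wrong_answers": int(row["Wrong answer Count"]),
                "pass_rate": int(passed) / int(total),
                "seconds": to_seconds(row["Response Time (avg)"]),
            }
        )
    return rows


# Sort fastest first, now that all times share one unit.
rows = sorted(parse_rows(SAMPLE), key=lambda r: r["seconds"])
for r in rows:
    print(f"#{r['rank']:<4} {r['model']:<28} {r['seconds']:.3f}s")
```

Checking the `ms` suffix before the `s` suffix matters here, because every millisecond value also ends in `s`.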

Top Models by Wrong answer Count