Trivia x Wrong answer Ranking

AI BENCHY Category Failures

See which AI models most often give a wrong answer on Trivia tests, so you can spot weak points faster. Sorted by Tests Correct, descending.

Models Shown

Total Failures: 133

Most Affected Model: Qwen3.7 Max (1)

Failure Reasons

Wrong answer: 133
API error: 13
No answer: 8

Categories

Domain specific: 325
Anti-AI Tricks: 250
Coding: 201
Puzzle Solving: 154
Trivia: 133
Instructions following: 54
Combined: 53
General Intelligence: 36
Data parsing and extraction: 35
Tool Calling: 2

133/133

Rank	Model	Company	Wrong answer Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#59	Gemma 4 26B A4B medium	Google	1	3.0	$0.045	0/1	180.9s
#60	Qwen3.7 Plus none	Qwen	1	3.0	$0.023	0/1	1.21s
#61	GLM 5.2 none	Z.ai	1	3.0	$0.076	0/1	3.41s
#62	MiMo-V2-Flash medium	Xiaomi	1	3.0	$0.043	0/1	1.96s
#64	GLM 5.1 medium	Z.ai	1	3.0	$0.292	0/1	29.4s
#65	Kimi K2.7 Code medium	Moonshot AI	1	3.0	$0.583	0/1	341.8s
#66	Gemini 3.5 Flash none	Google	1	2.8	$1.079	0/1	4.87s
#67	Gemini 3 Flash Preview none	Google	1	3.0	$0.025	0/1	1.07s
#68	Qwen3.7 Max none	Qwen	1	3.0	$0.054	0/1	856ms
#70	Qwen3.5-Flash medium	Qwen	1	3.0	$0.080	0/1	49.0s
#71	Gemini 3.5 Flash minimal	Google	1	3.0	$0.108	0/1	1.76s
#72	Ring-2.6-1T medium	Inclusionai	1	3.0	$0.033	0/1	113.9s
#73	Mimo V2 Omni medium	Xiaomi	1	3.0	$0.683	0/1	234.2s
#74	Hy3 preview high	Tencent	1	3.0	$0.059	0/1	47.7s
#75	Qwen3.6 35B A3B medium	Qwen	1	3.0	$0.146	0/1	32.9s


Top Models by Wrong answer Count

Wrong answer Count vs Score

Top Models by Response Time (avg)

Top Models by Estimated Wasted Cost
