Trivia x Wrong answer Ranking

AI BENCHY Category Failures

See which AI models are most likely to give a wrong answer on Trivia tests, so you can spot weak points faster. The table is sorted by Total Cost, ascending.

Models Shown

Total Failures: 133

Most Affected Model: Owl Alpha 1

Failure Reasons

Wrong answer: 133, API error: 13, No answer: 8

Categories

Domain specific: 325, Anti-AI Tricks: 250, Coding: 201, Puzzle Solving: 154, Trivia: 133, Instructions following: 54, Combined: 53, General Intelligence: 36, Data parsing and extraction: 35, Tool Calling: 2

133/133

Rank	Model	Company	Wrong answer Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#93	Gemini 2.5 Flash none	Google	1	3.0	$0.016	0/1	1.15s
#106	Qwen3.5 Plus 2026-02-15 none	Qwen	1	3.0	$0.016	0/1	1.11s
#119	MiMo-V2.5-Pro none	Xiaomi	1	3.0	$0.017	0/1	1.89s
#126	DeepSeek V3.2 none	DeepSeek	1	3.0	$0.017	0/1	17.2s
#84	Gemini 3.1 Flash Lite Preview none	Google	1	3.0	$0.018	0/1	814ms
#86	Hy3 preview low	Tencent	1	3.0	$0.018	0/1	41.7s
#92	Seed-2.0-Lite none	Bytedance Seed	1	3.0	$0.019	0/1	1.96s
#125	Qwen3.5-122B-A10B none	Qwen	1	3.0	$0.020	0/1	295ms
#168	Step 3.5 Flash none	Stepfun	1	3.0	$0.020	0/1	114.1s
#87	Nemotron 3 Super medium	NVIDIA	1	3.0	$0.021	0/1	55.3s
#114	Mimo V2 Omni none	Xiaomi	1	3.0	$0.021	0/1	1.30s
#54	Hy3 preview medium	Tencent	1	3.0	$0.021	0/1	39.9s
#60	Qwen3.7 Plus none	Qwen	1	3.0	$0.023	0/1	1.21s
#67	Gemini 3 Flash Preview none	Google	1	3.0	$0.025	0/1	1.07s
#159	MiMo-V2-Flash none	Xiaomi	1	3.0	$0.025	0/1	1.82s


Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.
