Trivia x Wrong answer Ranking

AI BENCHY Category Failures

See which AI models most often return a wrong answer on Trivia tests, so you can spot weak points faster. Sorted by Response Time (avg), slowest first.

Models Shown

Total Failures

133

Most Affected Model

Kimi K2.7 Code (1)

Failure Reasons

Wrong answer: 133, API error: 13, No answer: 8

Categories

Domain specific: 325, Anti-AI Tricks: 250, Coding: 201, Puzzle Solving: 154, Trivia: 133, Instructions following: 54, Combined: 53, General Intelligence: 36, Data parsing and extraction: 35, Tool Calling: 2

133/133

Rank	Model	Company	Wrong answer Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#48	DeepSeek V3.2 medium	DeepSeek	1	3.0	$0.044	0/1	84.0s
#43	Kimi K2.5 medium	Moonshot AI	1	3.0	$0.348	0/1	83.9s
#77	Mimo V2 PRO medium	Xiaomi	1	3.0	$0.333	0/1	82.7s
#81	Qwen3.6 27B medium	Qwen	1	3.0	$0.440	0/1	81.0s
#146	MiniMax M2.5 medium	Minimax	1	3.0	$0.303	0/1	80.8s
#15	GLM 5 medium	Z.ai	1	3.0	$0.228	0/1	67.4s
#53	Grok 4.20 medium	X AI	1	3.0	$0.609	0/1	63.5s
#38	Claude Opus 4.6 medium	Anthropic	1	3.0	$2.053	0/1	63.2s
#11	Qwen3.6 Max Preview medium	Qwen	1	3.0	$0.960	0/1	60.6s
#50	Seed-2.0-Mini medium	Bytedance Seed	1	3.0	$0.044	0/1	56.8s
#87	Nemotron 3 Super medium	NVIDIA	1	3.0	$0.021	0/1	55.3s
#23	DeepSeek V4 Flash high	DeepSeek	1	3.0	$0.027	0/1	54.5s
#42	Grok Build 0.1 medium	X AI	1	3.0	$0.927	0/1	53.5s
#36	Qwen3.5-122B-A10B medium	Qwen	1	3.0	$0.588	0/1	52.9s
#76	MiMo-V2.5 medium	Xiaomi	1	3.0	$0.063	0/1	51.3s


[Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost]
