Trivia x Wrong answer Ranking

AI BENCHY Category Failures

See which AI models are most likely to give a wrong answer on Trivia tests, so you can spot weak points faster. Sorted by: Total Cost ↓.

Models Shown

Total Failures

133

Most Affected Model

GPT-5.5 1

Failure Reasons

Wrong answer: 133, API error: 13, No answer: 8

Categories

Domain specific: 325, Anti-AI Tricks: 250, Coding: 201, Puzzle Solving: 154, Trivia: 133, Instructions following: 54, Combined: 53, General Intelligence: 36, Data parsing and extraction: 35, Tool Calling: 2

133/133

Rank	Model	Company	Wrong answer Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#133	Mistral Small 4 medium	Mistral	1	3.0	$0.068	0/1	5.92s
#76	MiMo-V2.5 medium	Xiaomi	1	3.0	$0.063	0/1	51.3s
#74	Hy3 preview high	Tencent	1	3.0	$0.059	0/1	47.7s
#116	GLM 5.1 none	Z.ai	1	3.0	$0.058	0/1	2.34s
#44	Mercury 2 medium	Inception	1	3.0	$0.058	0/1	2.58s
#68	Qwen3.7 Max none	Qwen	1	3.0	$0.054	0/1	856ms
#157	GLM 4.7 Flash medium	Z.ai	1	3.0	$0.054	0/1	11.1s
#105	GLM 5V Turbo none	Z.ai	1	3.0	$0.052	0/1	2.23s
#123	GLM 5 Turbo none	Z.ai	1	3.0	$0.047	0/1	2.37s
#109	Mimo V2 PRO none	Xiaomi	1	3.0	$0.045	0/1	1.63s
#59	Gemma 4 26B A4B medium	Google	1	3.0	$0.045	0/1	180.9s
#48	DeepSeek V3.2 medium	DeepSeek	1	3.0	$0.044	0/1	84.0s
#50	Seed-2.0-Mini medium	Bytedance Seed	1	3.0	$0.044	0/1	56.8s
#62	MiMo-V2-Flash medium	Xiaomi	1	3.0	$0.043	0/1	1.96s
#124	GPT-5.4 Mini none	OpenAI	1	3.0	$0.038	0/1	1.33s



Top Models by Wrong answer Count

Wrong answer Count vs Score

Top Models by Response Time (avg)

Top Models by Estimated Wasted Cost
