Trivia x Wrong answer Ranking

AI BENCHY Category Failures

See which AI models most often give a wrong answer on Trivia tests, so you can spot weak points faster. Sorted by Response Time (avg), ascending.

Models Shown

Total Failures: 133

Most Affected Model: Qwen3.5-122B-A10B (1)

Failure Reasons

Wrong answer: 133, API error: 13, No answer: 8

Categories

Domain specific: 325, Anti-AI Tricks: 250, Coding: 201, Puzzle Solving: 154, Trivia: 133, Instructions following: 54, Combined: 53, General Intelligence: 36, Data parsing and extraction: 35, Tool Calling: 2

133/133

Rank	Model	Company	Wrong answer Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#45	GPT-5.3 Chat none	OpenAI	1	3.0	$0.433	0/1	4.38s
#55	Claude Sonnet 4.6 none	Anthropic	1	3.0	$0.316	0/1	4.67s
#46	GPT-5.4 Nano medium	OpenAI	1	3.0	$0.107	0/1	4.81s
#66	Gemini 3.5 Flash none	Google	1	2.8	$1.079	0/1	4.87s
#90	GPT-5.5 none	OpenAI	1	3.0	$0.231	0/1	5.01s
#58	DeepSeek V4 Pro none	DeepSeek	1	3.0	$0.034	0/1	5.76s
#133	Mistral Small 4 medium	Mistral	1	3.0	$0.068	0/1	5.92s
#19	GPT-5.2 Chat none	OpenAI	1	3.0	$0.393	0/1	6.89s
#142	Nemotron 3 Super none	NVIDIA	1	3.0	$0.007	0/1	8.94s
#16	GPT-5 Mini medium	OpenAI	1	3.0	$0.159	0/1	9.99s
#4	GPT-5.5 low	OpenAI	1	3.0	$0.907	0/1	10.1s
#157	GLM 4.7 Flash medium	Z.ai	1	3.0	$0.054	0/1	11.1s
#51	MiMo-V2.5-Pro medium	Xiaomi	1	3.0	$0.106	0/1	12.5s
#17	GPT-5.4 medium	OpenAI	1	3.0	$1.210	0/1	14.0s
#10	GPT-5.3-Codex medium	OpenAI	1	2.8	$0.740	0/1	14.4s
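
The table above is sorted by Response Time (avg), ascending. To re-sort it by another column offline, a minimal Python sketch follows. It assumes the rows are saved as tab-separated text with the header shown above; the names `SAMPLE`, `load_rows`, and `sort_rows` are hypothetical helpers for illustration, not part of AI BENCHY.

```python
import csv
import io

# Three rows copied from the table above (tab-separated), for illustration only.
SAMPLE = (
    "Rank\tModel\tCompany\tWrong answer Count\tCategory Score\t"
    "Total Cost\tTests Correct\tResponse Time (avg)\n"
    "#45\tGPT-5.3 Chat none\tOpenAI\t1\t3.0\t$0.433\t0/1\t4.38s\n"
    "#142\tNemotron 3 Super none\tNVIDIA\t1\t3.0\t$0.007\t0/1\t8.94s\n"
    "#10\tGPT-5.3-Codex medium\tOpenAI\t1\t2.8\t$0.740\t0/1\t14.4s\n"
)


def load_rows(text):
    """Parse tab-separated leaderboard text into dicts with numeric fields."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for row in reader:
        rows.append({
            "rank": int(row["Rank"].lstrip("#")),            # "#45"    -> 45
            "model": row["Model"],
            "company": row["Company"],
            "cost_usd": float(row["Total Cost"].lstrip("$")),  # "$0.433" -> 0.433
            "response_s": float(row["Response Time (avg)"].rstrip("s")),  # "4.38s" -> 4.38
        })
    return rows


def sort_rows(rows, key, descending=False):
    """Return a new list sorted by one of the parsed fields."""
    return sorted(rows, key=lambda r: r[key], reverse=descending)


rows = load_rows(SAMPLE)
for r in sort_rows(rows, "cost_usd"):
    print(f'{r["model"]:<24} ${r["cost_usd"]:.3f}  {r["response_s"]}s')
```

Sorting by `"response_s"` reproduces the order shown on the page; `"cost_usd"` or `"rank"` give the alternative views.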


Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.
