Domain specific x Wrong answer Ranking

See which AI models most often fail Domain specific tests with a Wrong answer, so you can spot weak points faster. Sort by: Tests Correct ↑.

Models Shown

Total Failures

421

Most Affected Model

Grok 4.5 2

Failure Reasons

Wrong answer: 421, Timed out: 43, Extra formatting: 17, No answer: 8, API error: 7, Did not follow instructions: 1

Categories

Domain specific: 421, Anti-AI Tricks: 293, Coding: 259, Puzzle Solving: 204, Trivia: 172, Combined: 69, General Intelligence: 62, Instructions following: 61, Data parsing and extraction: 41, Tool Calling: 3

202/202

Rank	Model	Company	Wrong answer Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#115	Mimo V2 PRO medium	Xiaomi	1	5.3	$0.333	1/3	8.82s
#118	Claude Sonnet 5 none	Anthropic	2	5.3	$0.548	1/3	3.28s
#119	MiMo-V2-Flash medium	Xiaomi	2	5.9	$0.043	1/3	96.0s
#120	Qwen3.5-Flash medium	Qwen	1	5.3	$0.139	1/3	146.5s
#124	Gemini 2.5 Flash none	Google	2	5.9	$0.017	1/3	495ms
#129	Inkling low	Thinkingmachines	2	5.3	$0.187	1/3	1.99s
#130	Qwen3.6 Flash none	Qwen	2	5.3	$0.062	1/3	1.11s
#132	Qwen3.5 Plus 2026-04-20 none	Qwen	2	5.3	$0.122	1/3	4.43s
#134	GPT-5 Nano medium	OpenAI	1	5.2	$0.114	1/3	204.0s
#135	Nemotron 3 Ultra none	NVIDIA	2	5.3	$0.095	1/3	698ms
#136	Step 3.5 Flash medium	Stepfun	2	5.3	$0.108	1/3	170.5s
#137	Grok 4.20 Beta medium	X AI	2	5.3	$0.750	1/3	21.3s
#138	GPT-5.6 Terra none	OpenAI	2	5.3	$0.349	1/3	757ms
#139	Gemini 3 PRO Preview medium	Google	2	5.3	$0.385	1/3	7.01s
#141	Hy3 preview high	Tencent	2	5.3	$0.048	1/3	109.0s
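The table reports Response Time (avg) in mixed units (ms and s) and Tests Correct as a "correct/total" string, so the rows cannot be compared directly as text. The following is a minimal sketch, not the site's own method, of how a few rows could be normalized and re-sorted. The `Row` class and the helpers `to_seconds` and `failed_tests` are hypothetical names introduced only for this illustration; the three sample rows are copied from the table above.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """One leaderboard row (hypothetical container, not part of the site)."""
    model: str
    wrong_answer_count: int
    tests_correct: str   # "correct/total", e.g. "1/3"
    response_time: str   # e.g. "495ms" or "8.82s"


def to_seconds(text: str) -> float:
    """Convert a response time such as '495ms' or '8.82s' to seconds."""
    text = text.strip()
    # Check "ms" before "s", since "ms" also ends with "s".
    if text.endswith("ms"):
        return float(text[:-2]) / 1000.0
    if text.endswith("s"):
        return float(text[:-1])
    raise ValueError(f"unrecognized time format: {text!r}")


def failed_tests(tests_correct: str) -> int:
    """Return total minus correct for a 'correct/total' string."""
    correct, total = (int(part) for part in tests_correct.split("/"))
    return total - correct


rows = [
    Row("Mimo V2 PRO medium", 1, "1/3", "8.82s"),
    Row("Gemini 2.5 Flash none", 2, "1/3", "495ms"),
    Row("GPT-5 Nano medium", 1, "1/3", "204.0s"),
]

# Sort by normalized response time, fastest first.
fastest_first = sorted(rows, key=lambda r: to_seconds(r.response_time))

for r in fastest_first:
    print(r.model, to_seconds(r.response_time), failed_tests(r.tests_correct))
```

Note that Wrong answer Count can be lower than the number of failed tests computed this way (for example 1 versus 2 for Mimo V2 PRO medium), presumably because the remaining failure falls under another failure reason such as Timed out.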


[Charts, Domain specific: Wrong answer: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost]