Domain specific x Wrong answer Ranking

See which AI models are most likely to give a wrong answer on Domain specific tests, so you can spot weak points faster. Sorted by: Tests Correct (descending).

Models Shown

Total Failures: 421

Most Affected Model: Gemini 3.6 Flash 1

Failure Reasons

Wrong answer 421, Timed out 43, Extra formatting 17, No answer 8, API error 7, Did not follow instructions 1

Categories

Domain specific 421, Anti-AI Tricks 293, Coding 259, Puzzle Solving 204, Trivia 172, Combined 69, General Intelligence 62, Instructions following 61, Data parsing and extraction 41, Tool Calling 3

202/202

Rank	Model	Company	Wrong answer Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#75	Qwen3.7 Plus none	Qwen	3	3.0	$0.106	0/3	868ms
#76	Qwen3.5-122B-A10B medium	Qwen	3	2.9	$1.046	0/3	63.4s
#80	DeepSeek V3.2 medium	DeepSeek	2	2.9	$0.078	0/3	24.3s
#81	Kimi K2.5 medium	Moonshot AI	2	3.5	$0.600	0/3	137.3s
#82	Mercury 2 medium	Inception	3	2.9	$0.093	0/3	6.48s
#85	KAT-Coder-Pro V2.5 medium	Kwaipilot	3	2.9	$0.467	0/3	29.0s
#87	GPT-5.6 Sol none	OpenAI	3	3.6	$0.524	0/3	1.43s
#89	Qwen3.6 Flash medium	Qwen	3	3.5	$0.738	0/3	14.6s
#90	Step 3.7 Flash high	Stepfun	2	4.1	$1.207	0/3	149.6s
#91	GPT-5.5 none	OpenAI	3	2.9	$0.544	0/3	1.31s
#95	Gemini 3.5 Flash-Lite low	Google	3	3.6	$0.145	0/3	3.63s
#96	LongCat 2.0 low	Meituan	2	3.0	$0.391	0/3	86.1s
#97	KAT-Coder-Pro V2.5 none	Kwaipilot	3	3.6	$0.476	0/3	21.6s
#100	Gemma 4 26B A4B medium	Google	2	2.9	$0.089	0/3	23.6s
#102	LongCat 2.0 high	Meituan	1	3.6	$0.469	0/3	400.3s


Charts (Domain specific: Wrong answer): Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.