General Intelligence x Wrong answer Ranking

See which AI models most often return a wrong answer on General Intelligence tests, so you can spot weak points faster. Sorted by: Response Time (avg), ascending.

Models Shown

Total Failures

Most Affected Model: Granite 4.1 8B (1)

Failure Reasons

Did not follow instructions (78), Wrong answer (62), API error (12), Timed out (4)

Categories

Domain specific (421), Anti-AI Tricks (293), Coding (259), Puzzle Solving (204), Trivia (172), Combined (69), General Intelligence (62), Instructions following (61), Data parsing and extraction (41), Tool Calling (3)

62/62

Rank	Model	Company	Wrong answer Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#182	GLM 4.7 Flash none	Z.ai	1	4.0	$0.016	0/1	1.59s
#103	Qwen3.6 Max Preview none	Qwen	1	4.3	$0.231	0/1	1.62s
#95	Gemini 3.5 Flash-Lite low	Google	1	6.1	$0.145	0/1	1.71s
#145	GPT-5.4 none	OpenAI	1	4.4	$0.397	0/1	1.78s
#86	DeepSeek V4 Pro none	DeepSeek	1	5.0	$0.096	0/1	2.05s
#123	GPT-5.6 Luna low	OpenAI	1	5.0	$0.249	0/1	2.25s
#113	Qwen3.5 Plus 2026-02-15 none	Qwen	1	4.4	$0.073	0/1	2.26s
#66	KAT-Coder-Pro V2.5 low	Kwaipilot	1	4.1	$0.387	0/1	2.32s
#163	Mimo V2 Omni none	Xiaomi	1	4.1	$0.021	0/1	2.33s
#43	GPT-5.6 Terra medium	OpenAI	1	5.5	$0.676	0/1	2.37s
#160	MiMo-V2.5-Pro none	Xiaomi	1	4.0	$0.068	0/1	2.58s
#117	LongCat 2.0 none	Meituan	1	5.0	$0.044	0/1	2.76s
#104	Gemini 3.5 Flash-Lite medium	Google	1	5.4	$0.369	0/1	2.93s
#38	GPT-5.6 Terra high	OpenAI	1	5.1	$1.055	0/1	3.03s
#73	KAT-Coder-Pro V2.5 high	Kwaipilot	1	5.1	$0.482	0/1	3.27s

Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost