General Intelligence x Wrong answer Ranking

See which AI models most often fail General Intelligence tests with a "Wrong answer" result, so you can spot weak points faster. Sorted by: Tests Correct ↓.

Models Shown

Total Failures

Most Affected Model

Grok 4.5 1

Failure Reasons

Did not follow instructions: 78
Wrong answer: 62
API error: 12
Timed out: 4

Categories

Domain specific: 421
Anti-AI Tricks: 293
Coding: 259
Puzzle Solving: 204
Trivia: 172
Combined: 69
General Intelligence: 62
Instructions following: 61
Data parsing and extraction: 41
Tool Calling: 3

62/62

Rank	Model	Company	Wrong answer Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#117	LongCat 2.0 none	Meituan	1	5.0	$0.044	0/1	2.76s
#123	GPT-5.6 Luna low	OpenAI	1	5.0	$0.249	0/1	2.25s
#124	Gemini 2.5 Flash none	Google	1	5.0	$0.017	0/1	615ms
#128	Gemini 3.1 Flash Lite none	Google	1	4.0	$0.046	0/1	992ms
#135	Nemotron 3 Ultra none	NVIDIA	1	5.0	$0.095	0/1	13.5s
#138	GPT-5.6 Terra none	OpenAI	1	5.0	$0.349	0/1	1.03s
#140	Mimo V2 Omni medium	Xiaomi	1	5.4	$0.683	0/1	3.61s
#143	North Mini Code medium	Cohere	1	5.1	$0.000	0/1	25.1s
#145	GPT-5.4 none	OpenAI	1	4.4	$0.397	0/1	1.78s
#150	KAT-Coder-Air V2.5 high	Kwaipilot	1	5.1	$0.077	0/1	7.10s
#156	DeepSeek V4 Flash none	DeepSeek	1	4.2	$0.042	0/1	23.7s
#157	GLM 5.1 none	Z.ai	1	5.0	$0.164	0/1	790ms
#160	MiMo-V2.5-Pro none	Xiaomi	1	4.0	$0.068	0/1	2.58s
#163	Mimo V2 Omni none	Xiaomi	1	4.1	$0.021	0/1	2.33s
#165	GPT-5.6 Luna none	OpenAI	1	5.0	$0.142	0/1	1.00s
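
The Response Time (avg) column in the table above mixes units (`615ms` next to `2.76s`), so the rows cannot be compared as plain strings. Below is a minimal sketch of one way to normalize them, assuming the rows are tab-separated in the column order of the header. The helper names `to_seconds` and `parse_row` are hypothetical and not part of the site; only three rows are copied in for brevity.

```python
import re

# Three rows copied from the table above (tab-separated, header column order).
ROWS = """\
#117\tLongCat 2.0 none\tMeituan\t1\t5.0\t$0.044\t0/1\t2.76s
#124\tGemini 2.5 Flash none\tGoogle\t1\t5.0\t$0.017\t0/1\t615ms
#143\tNorth Mini Code medium\tCohere\t1\t5.1\t$0.000\t0/1\t25.1s
"""


def to_seconds(text: str) -> float:
    """Convert a duration such as '615ms' or '2.76s' to seconds."""
    match = re.fullmatch(r"([\d.]+)(ms|s)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    value, unit = float(match.group(1)), match.group(2)
    return value / 1000 if unit == "ms" else value


def parse_row(line: str) -> dict:
    """Split one table row into typed fields (assumed column order)."""
    rank, model, company, wrong, score, cost, correct, latency = line.split("\t")
    passed, total = correct.split("/")
    return {
        "rank": int(rank.lstrip("#")),
        "model": model,
        "company": company,
        "wrong_answers": int(wrong),
        "category_score": float(score),
        "total_cost_usd": float(cost.lstrip("$")),
        "tests_passed": int(passed),
        "tests_total": int(total),
        "avg_response_s": to_seconds(latency),
    }


rows = [parse_row(line) for line in ROWS.splitlines()]

# Fastest model among the sampled rows, compared in seconds.
fastest = min(rows, key=lambda r: r["avg_response_s"])

# Combined cost of the sampled rows.
total_cost = sum(r["total_cost_usd"] for r in rows)

print(fastest["model"], fastest["avg_response_s"])
print(round(total_cost, 3))
```

Converting everything to seconds up front keeps later sorting and averaging simple. The same parsed fields could feed a cost comparison, although how the site derives "Estimated Wasted Cost" is not stated here.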

Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.
