General Intelligence x Wrong answer Ranking

See which AI models most often give a wrong answer on General Intelligence tests, so you can spot weak points faster. Sorted by Tests Correct, descending.

Models Shown

Total Failures

Most Affected Model: Grok 4.5 (1)

Failure Reasons

Did not follow instructions: 78
Wrong answer: 62
API error: 12
Timed out: 4

Categories

Domain specific: 421
Anti-AI Tricks: 293
Coding: 259
Puzzle Solving: 204
Trivia: 172
Combined: 69
General Intelligence: 62
Instructions following: 61
Data parsing and extraction: 41
Tool Calling: 3

62/62

Rank	Model	Company	Wrong answer Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#74	Qwen3.5 Plus 2026-04-20 medium	Qwen	1	4.9	$0.317	0/1	25.3s
#86	DeepSeek V4 Pro none	DeepSeek	1	5.0	$0.096	0/1	2.05s
#87	GPT-5.6 Sol none	OpenAI	1	6.5	$0.524	0/1	1.52s
#89	Qwen3.6 Flash medium	Qwen	1	4.8	$0.738	0/1	9.88s
#90	Step 3.7 Flash high	Stepfun	1	5.5	$1.207	0/1	4.17s
#95	Gemini 3.5 Flash-Lite low	Google	1	6.1	$0.145	0/1	1.71s
#96	LongCat 2.0 low	Meituan	1	3.4	$0.391	0/1	22.5s
#97	KAT-Coder-Pro V2.5 none	Kwaipilot	1	4.8	$0.476	0/1	5.16s
#101	GLM 5.2 none	Z.ai	1	6.1	$0.128	0/1	4.42s
#102	LongCat 2.0 high	Meituan	1	5.1	$0.469	0/1	17.0s
#103	Qwen3.6 Max Preview none	Qwen	1	4.3	$0.231	0/1	1.62s
#104	Gemini 3.5 Flash-Lite medium	Google	1	5.4	$0.369	0/1	2.93s
#108	Laguna XS 2.1 medium	Poolside	1	5.0	$0.068	0/1	4.15s
#111	Gemini 3.1 Flash Lite low	Google	1	4.0	$0.621	0/1	1.37s
#113	Qwen3.5 Plus 2026-02-15 none	Qwen	1	4.4	$0.073	0/1	2.26s


Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.
