Data parsing and extraction x Wrong answer Ranking

AI BENCHY Category Failures

See which AI models are most likely to return a wrong answer on data parsing and extraction tests, so you can spot weak points faster. Rows are sorted by average response time, fastest first.

Most Affected Model: Granite 4.1 8B (2)

Failure Reasons

Wrong answer: 35, API error: 16, No answer: 5, Extra formatting: 4, Timed out: 1

Categories

Domain specific: 314, Anti-AI Tricks: 245, Coding: 194, Puzzle Solving: 147, Trivia: 130, Instructions following: 53, Combined: 52, Data parsing and extraction: 35, General Intelligence: 32, Tool Calling: 2

Rank	Model	Company	Wrong answer Count	Category Score	Tests Correct	Response Time (avg)
#163	Granite 4.1 8B none	IBM Granite	2	3.0	0/2	575ms
#155	Mercury 2 none	Inception	1	7.3	1/2	667ms
#160	LFM2-24B-A2B none	Liquid	2	3.0	0/2	714ms
#136	Elephant Alpha medium	Openrouter	1	6.5	1/2	979ms
#137	Elephant Alpha none	Openrouter	1	6.5	1/2	1.04s
#81	Mercury 2 medium	Inception	1	7.3	1/2	1.11s
#148	GPT-5.4 Nano none	OpenAI	1	6.5	1/2	1.11s
#140	Qwen3 Coder Next none	Qwen	1	6.5	1/2	1.32s
#162	Nemotron 3 Nano Omni 30b A3b Reasoning none	NVIDIA	2	3.8	0/2	1.42s
#68	Claude Opus 4.8 none	Anthropic	1	7.3	1/2	1.77s
#99	gpt-oss-120b medium	OpenAI	1	6.4	1/2	1.98s
#118	Qwen3.6 27B none	Qwen	1	7.3	1/2	2.06s
#57	Step 3.7 Flash low	Stepfun	1	7.3	1/2	2.29s
#149	Nemotron 3 Nano Omni 30b A3b Reasoning medium	NVIDIA	1	7.3	1/2	2.72s
#122	GLM 4.7 Flash none	Z.ai	1	7.3	1/2	4.82s
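
The Response Time (avg) column mixes milliseconds and seconds, and Tests Correct is a fraction, so neither can be sorted or compared as plain text. The following is a minimal sketch of how the two columns might be normalized before sorting. The helper names `parse_response_time` and `parse_tests_correct` are hypothetical and not part of AI Benchy, and the sample rows are copied from the table above.

```python
import re


def parse_response_time(text: str) -> float:
    """Convert a string such as '575ms' or '1.04s' to milliseconds."""
    match = re.fullmatch(r"([0-9]*\.?[0-9]+)(ms|s)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized response time: {text!r}")
    value, unit = float(match.group(1)), match.group(2)
    return value if unit == "ms" else value * 1000.0


def parse_tests_correct(text: str) -> tuple[int, int]:
    """Split a fraction such as '1/2' into (passed, total)."""
    passed, total = text.strip().split("/")
    return int(passed), int(total)


# (model, tests correct, response time) taken from four rows of the table.
rows = [
    ("Granite 4.1 8B none", "0/2", "575ms"),
    ("Elephant Alpha none", "1/2", "1.04s"),
    ("GLM 4.7 Flash none", "1/2", "4.82s"),
    ("Mercury 2 none", "1/2", "667ms"),
]

# Sort fastest first, matching the page's "Response Time (avg) ↑" ordering.
by_speed = sorted(rows, key=lambda row: parse_response_time(row[2]))

for model, correct, latency in by_speed:
    passed, total = parse_tests_correct(correct)
    print(f"{model}: {passed}/{total} correct, {parse_response_time(latency):.0f} ms")
```

Converting everything to one unit first keeps the sort key numeric, which avoids "1.04s" sorting ahead of "575ms" as it would in a plain string comparison.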

[Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost]
