AI BENCHY

AI BENCHY Category Failures

Data parsing and extraction: Wrong answer

See which AI models most often return a wrong answer on Data parsing and extraction tests, so you can spot weak points faster. Models are sorted by Tests Correct, descending.

Models Shown: 15

Total Failures: 19

Most Affected Model: MiMo-V2-Pro (1)

| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
|------|-------|---------|--------------------|----------------|---------------|---------------------|
| #23 | MiMo-V2-Pro medium | Xiaomi | 1 | 7.3 | 1/2 | 17.2s |
| #54 | Mercury 2 medium | Inception | 1 | 7.3 | 1/2 | 1.11s |
| #64 | DeepSeek V3.2 none | DeepSeek | 1 | 6.3 | 1/2 | 9.42s |
| #68 | gpt-oss-120b medium | OpenAI | 1 | 6.4 | 1/2 | 1.98s |
| #74 | GLM 4.7 Flash none | Z.ai | 1 | 7.3 | 1/2 | 4.82s |
| #76 | Kimi K2.5 none | Moonshot AI | 1 | 7.3 | 1/2 | 42.1s |
| #80 | MiniMax M2.7 medium | Minimax | 1 | 6.3 | 1/2 | 21.9s |
| #81 | Elephant medium | Openrouter | 1 | 6.5 | 1/2 | 979ms |
| #85 | Elephant none | Openrouter | 1 | 6.5 | 1/2 | 1.04s |
| #87 | Qwen3 Coder Next none | Qwen | 1 | 6.5 | 1/2 | 1.32s |
| #91 | Mercury 2 none | Inception | 1 | 7.3 | 1/2 | 667ms |
| #92 | Qwen3 Coder Next medium | Qwen | 1 | 6.5 | 1/2 | 81.8s |
| #96 | GPT-5.4 Nano none | OpenAI | 1 | 6.5 | 1/2 | 1.11s |
| #57 | GPT-5 Nano medium | OpenAI | 2 | 3.7 | 0/2 | 21.4s |
| #71 | MiniMax M2.5 medium | Minimax | 2 | 4.6 | 0/2 | 7.48s |
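
The row order above can be reproduced with a short sketch. It assumes the page sorts by the number of tests passed, descending, and breaks ties by overall rank, ascending. The tie-break is inferred from the visible order, not stated by the site. The tuple layout and the helper names `passed` and `sort_rows` are hypothetical, and only four rows from the table are included.

```python
# Subset of rows copied from the table: (rank, model, wrong_answers, tests_correct).
rows = [
    (57, "GPT-5 Nano medium", 2, "0/2"),
    (23, "MiMo-V2-Pro medium", 1, "1/2"),
    (71, "MiniMax M2.5 medium", 2, "0/2"),
    (54, "Mercury 2 medium", 1, "1/2"),
]


def passed(tests_correct: str) -> int:
    """Return the number of passed tests from a 'passed/total' string."""
    return int(tests_correct.split("/")[0])


def sort_rows(rows):
    # Most tests correct first; ties broken by lower overall rank
    # (an assumption inferred from the order shown on the page).
    return sorted(rows, key=lambda r: (-passed(r[3]), r[0]))


ordered = sort_rows(rows)
for rank, model, wrong, correct in ordered:
    print(f"#{rank:<3} {model:<22} wrong={wrong} correct={correct}")
```

Negating the pass count in the sort key keeps the whole ordering in a single `sorted` call while leaving rank in ascending order.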

[Chart: Top Models by Wrong answer Count]

[Chart: Wrong answer Count vs Score]

[Chart: Top Models by Response Time (avg)]

[Chart: Top Models by Estimated Wasted Cost]