AI BENCHY

AI BENCHY Category Failures

General Intelligence: Wrong answer

See which AI models most often give a wrong answer on General Intelligence tests, so you can spot weak points faster.

Models Shown: 15
Total Failures: 32
Most Affected Model: Step 3.7 Flash 1

Rank  Model                            Company      Wrong answer Count  Category Score  Tests Correct  Response Time (avg)
#101  Mimo V2 Omni none                Xiaomi       1                   4.1             0/1            2.33s
#104  Nemotron 3 Ultra 550b A55b none  NVIDIA       1                   5.0             0/1            13.5s
#112  GLM 5.1 none                     Z.ai         1                   5.0             0/1            790ms
#122  GLM 4.7 Flash none               Z.ai         1                   4.0             0/1            1.59s
#123  MiMo-V2.5-Pro none               Xiaomi       1                   4.0             0/1            2.58s
#125  GPT-5.4 none                     OpenAI       1                   4.4             0/1            1.78s
#126  gpt-oss-120b none                OpenAI       1                   4.8             0/1            10.8s
#127  Grok 4.20 none                   X AI         1                   4.8             0/1            659ms
#138  Ling-2.6-flash none              Inclusionai  1                   4.0             0/1            1.45s
#139  DeepSeek V4 Flash none           DeepSeek     1                   4.2             0/1            23.7s
#141  Nemotron 3 Super none            NVIDIA       1                   4.6             0/1            950ms
#142  Mistral Small 4 none             Mistral      1                   4.0             0/1            729ms
#143  MiMo-V2.5 none                   Xiaomi       1                   4.4             0/1            6.86s
#147  GPT-4o-mini none                 OpenAI       1                   4.0             0/1            909ms
#153  Qwen3.6 35B A3B none             Qwen         1                   4.4             0/1            3.51s

Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost