AI BENCHY

AI BENCHY Category Failures

Domain specific: Wrong answer

See which AI models are most likely to produce a "Wrong answer" failure in the "Domain specific" category, so you can spot weak points faster.

Models Shown: 15
Total Failures: 182
Most Affected Model: Qwen3.6 Plus Preview 3

Rank | Model                              | Company     | Wrong answer Count | Category Score | Tests Correct | Response Time (avg)
#16  | GPT-5.4 medium                     | OpenAI      | 2                  | 5.3            | 1/3           | 74.3s
#18  | GLM 5 Turbo medium                 | Z.ai        | 2                  | 2.9            | 0/3           | 71.1s
#22  | Gemini 3.1 Flash Lite Preview low  | Google      | 2                  | 5.3            | 1/3           | 2.36s
#24  | Gemma 4 26B A4B medium             | Google      | 2                  | 2.9            | 0/3           | 23.6s
#25  | Grok 4.20 Beta medium              | X AI        | 2                  | 5.3            | 1/3           | 21.3s
#28  | GPT-5.2 Chat none                  | OpenAI      | 2                  | 5.3            | 1/3           | 17.8s
#29  | Gemini 3.1 Flash Lite Preview none | Google      | 2                  | 5.3            | 1/3           | 942ms
#30  | Step 3.5 Flash medium              | Stepfun     | 2                  | 5.3            | 1/3           | 170.5s
#31  | GLM 5V Turbo medium                | Z.ai        | 2                  | 5.3            | 1/3           | 38.1s
#38  | GPT-5.4 Nano medium                | OpenAI      | 2                  | 5.9            | 1/3           | 38.2s
#41  | MiMo-V2-Flash medium               | Xiaomi      | 2                  | 5.9            | 1/3           | 96.0s
#45  | GPT-5 Mini medium                  | OpenAI      | 2                  | 3.6            | 0/3           | 44.6s
#46  | Kimi K2.5 medium                   | Moonshot AI | 2                  | 3.5            | 0/3           | 137.3s
#49  | Qwen3.5 Plus 2026-02-15 none       | Qwen        | 2                  | 5.3            | 1/3           | 1.17s
#51  | Nemotron 3 Super medium            | NVIDIA      | 2                  | 2.9            | 0/3           | 16.2s

Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost