AI BENCHY

AI BENCHY Category Failures

Domain specific: Wrong answer

See which AI models are most likely to give a wrong answer on domain-specific tests, so you can spot weak points faster.

Models Shown: 15
Total Failures: 314
Most Affected Model: Qwen3.6 Max Preview 3

Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg)
#9 | GPT-5.5 medium | OpenAI | 2 | 5.3 | 1/3 | 164.1s
#10 | Claude Opus 4.8 medium | Anthropic | 2 | 5.3 | 1/3 | 14.2s
#12 | Gemini 3.1 Flash Lite Preview high | Google | 2 | 5.3 | 1/3 | 127.6s
#13 | Grok 4.20 Beta medium | X AI | 2 | 5.3 | 1/3 | 21.3s
#15 | GPT-5.3-Codex medium | OpenAI | 2 | 5.9 | 1/3 | 64.3s
#16 | Gemini 3 Flash Preview low | Google | 2 | 5.3 | 1/3 | 8.05s
#17 | GLM 5 medium | Z.ai | 2 | 3.5 | 0/3 | 0ms
#19 | Seed-2.0-Lite medium | Bytedance Seed | 2 | 5.9 | 1/3 | 88.7s
#21 | GPT-5.4 medium | OpenAI | 2 | 5.3 | 1/3 | 74.3s
#23 | GLM 5 Turbo medium | Z.ai | 2 | 2.9 | 0/3 | 71.1s
#24 | GPT-5.2 Chat none | OpenAI | 2 | 5.3 | 1/3 | 17.8s
#28 | Gemini 2.5 Flash medium | Google | 2 | 5.9 | 1/3 | 37.3s
#33 | Hy3 preview medium | Tencent | 2 | 5.3 | 1/3 | 22.3s
#35 | Gemini 3 PRO Preview medium | Google | 2 | 5.3 | 1/3 | 7.01s
#37 | Gemma 4 26B A4B medium | Google | 2 | 2.9 | 0/3 | 23.6s

Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost