AI BENCHY

AI BENCHY Category Failures

Anti-AI Tricks: Wrong answer

See which AI models most often return a wrong answer on Anti-AI Tricks tests, so you can spot weak points faster.

Models Shown: 15
Total Failures: 245
Most Affected Model: Gemini 2.5 Flash (4 wrong answers)

Rank  Model                    Reasoning  Company      Wrong answer Count  Category Score  Tests Correct  Response Time (avg)
#111  Owl Alpha                medium     Openrouter   3                   4.8             1/4            3.97s
#113  DeepSeek V4 Pro          none       DeepSeek     3                   3.5             0/4            14.0s
#114  Qwen3.5 Plus 2026-04-20  none       Qwen         3                   4.8             1/4            1.88s
#115  Qwen3.5-27B              none       Qwen         3                   4.8             1/4            788ms
#121  Owl Alpha                none       Openrouter   3                   3.4             0/4            2.78s
#122  GLM 4.7 Flash            none       Z.ai         3                   5.2             1/4            5.51s
#123  MiMo-V2.5-Pro            none       Xiaomi       3                   3.3             0/4            2.67s
#124  Kimi K2.6                none       Moonshot AI  3                   4.6             1/4            1.39s
#127  Grok 4.20                none       X AI         3                   4.8             1/4            501ms
#131  Qwen3.5-122B-A10B        none       Qwen         3                   4.8             1/4            1.59s
#132  Mistral Small 4          medium     Mistral      3                   5.6             1/4            2.67s
#141  Nemotron 3 Super         none       NVIDIA       3                   4.8             1/4            4.46s
#145  Laguna M.1               none       Poolside     3                   3.4             0/4            705ms
#147  GPT-4o-mini              none       OpenAI       3                   4.8             1/4            1.34s
#150  Qwen3 Coder Next         medium     Qwen         3                   3.5             0/4            8.64s
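
The Response Time (avg) column mixes milliseconds and seconds, and Tests Correct is given as a fraction, so the rows cannot be compared directly as text. The following is a minimal Python sketch of one way to normalize both fields. The helper names `parse_response_time` and `parse_fraction` are hypothetical (not part of AI Benchy), and the sample rows are a few entries copied from the table above.

```python
def parse_response_time(text: str) -> float:
    """Convert a value such as '788ms' or '3.97s' to seconds (hypothetical helper)."""
    text = text.strip()
    if text.endswith("ms"):  # check 'ms' first, since it also ends with 's'
        return float(text[:-2]) / 1000.0
    if text.endswith("s"):
        return float(text[:-1])
    raise ValueError(f"unrecognized duration: {text!r}")


def parse_fraction(text: str) -> float:
    """Convert a 'correct/total' string such as '1/4' to a ratio (hypothetical helper)."""
    correct, total = text.split("/")
    return int(correct) / int(total)


# (model, tests correct, average response time), copied from the table.
rows = [
    ("Owl Alpha", "1/4", "3.97s"),
    ("Qwen3.5-27B", "1/4", "788ms"),
    ("Grok 4.20", "1/4", "501ms"),
    ("DeepSeek V4 Pro", "0/4", "14.0s"),
]

# Sort by response time in seconds, fastest first.
fastest_first = sorted(rows, key=lambda r: parse_response_time(r[2]))

for model, tests, rt in fastest_first:
    seconds = parse_response_time(rt)
    print(f"{model}: {seconds:.3f}s, {parse_fraction(tests):.0%} correct")
```

Checking the `ms` suffix before the `s` suffix matters here, because a value like `788ms` would otherwise be read as seconds and fail to parse.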

Charts (not reproduced here): Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.