AI BENCHY

AI BENCHY Category Failures

Instructions following: Wrong answer

See which AI models most often return a wrong answer on Instructions following tests, so you can spot weak points faster.

Models Shown: 15
Total Failures: 53
Most Affected Model: Gemini 3.5 Flash 1

| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
|------|-------|---------|--------------------|----------------|---------------|---------------------|
| #134 | GLM 5 Turbo none | Z.ai | 1 | 6.5 | 1/2 | 2.13s |
| #135 | Kimi K2.5 none | Moonshot AI | 1 | 6.5 | 1/2 | 2.67s |
| #140 | Qwen3 Coder Next none | Qwen | 1 | 6.3 | 1/2 | 7.78s |
| #141 | Nemotron 3 Super none | NVIDIA | 1 | 6.3 | 1/2 | 804ms |
| #142 | Mistral Small 4 none | Mistral | 1 | 6.5 | 1/2 | 380ms |
| #143 | MiMo-V2.5 none | Xiaomi | 1 | 6.5 | 1/2 | 751ms |
| #144 | GPT-5.4 Mini none | OpenAI | 1 | 6.3 | 1/2 | 728ms |
| #145 | Laguna M.1 none | Poolside | 1 | 6.3 | 1/2 | 683ms |
| #146 | Laguna Xs.2 none | Poolside | 1 | 6.5 | 1/2 | 439ms |
| #147 | GPT-4o-mini none | OpenAI | 1 | 6.3 | 1/2 | 1.11s |
| #148 | GPT-5.4 Nano none | OpenAI | 1 | 6.3 | 1/2 | 784ms |
| #150 | Qwen3 Coder Next medium | Qwen | 1 | 6.3 | 1/2 | 7.49s |
| #151 | Trinity Large Preview none | Arcee AI | 1 | 3.5 | 0/2 | 822ms |
| #152 | MiMo-V2-Flash none | Xiaomi | 1 | 6.5 | 1/2 | 857ms |
| #153 | Qwen3.6 35B A3B none | Qwen | 1 | 6.2 | 1/2 | 1.86s |
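
The Response Time (avg) column mixes milliseconds and seconds, so the values cannot be compared as plain strings. The following is a minimal sketch of one way to normalize them, assuming only the two suffixes seen in the table (`ms` and `s`). The names `to_seconds` and `rank_by_speed` are hypothetical helpers written for illustration, not part of AI Benchy.

```python
# Response times copied from the table (model label -> time string).
RESPONSE_TIMES = {
    "GLM 5 Turbo none": "2.13s",
    "Kimi K2.5 none": "2.67s",
    "Qwen3 Coder Next none": "7.78s",
    "Nemotron 3 Super none": "804ms",
    "Mistral Small 4 none": "380ms",
    "MiMo-V2.5 none": "751ms",
    "GPT-5.4 Mini none": "728ms",
    "Laguna M.1 none": "683ms",
    "Laguna Xs.2 none": "439ms",
    "GPT-4o-mini none": "1.11s",
    "GPT-5.4 Nano none": "784ms",
    "Qwen3 Coder Next medium": "7.49s",
    "Trinity Large Preview none": "822ms",
    "MiMo-V2-Flash none": "857ms",
    "Qwen3.6 35B A3B none": "1.86s",
}


def to_seconds(value: str) -> float:
    """Convert a string such as '804ms' or '2.13s' to seconds (hypothetical helper)."""
    text = value.strip().lower()
    # Check 'ms' first, because every 'ms' string also ends with 's'.
    if text.endswith("ms"):
        return float(text[:-2]) / 1000.0
    if text.endswith("s"):
        return float(text[:-1])
    raise ValueError(f"unrecognized time format: {value!r}")


def rank_by_speed(times: dict[str, str]) -> list[tuple[str, float]]:
    """Return (model, seconds) pairs sorted from fastest to slowest."""
    pairs = [(name, to_seconds(raw)) for name, raw in times.items()]
    return sorted(pairs, key=lambda pair: pair[1])


ranked = rank_by_speed(RESPONSE_TIMES)
fastest, slowest = ranked[0], ranked[-1]
```

Here `fastest` and `slowest` hold the first and last entries of the sorted list. Sorting on the normalized value avoids ranking "804ms" above "7.78s" because of its larger leading number.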

Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.