AI BENCHY

AI BENCHY Failures

Wrong answer Failures

See which AI models hit the "Wrong answer" failure most often, so you can spot reliability risks before choosing one.

Models Shown: 3
Total Failures: 1092
Most Affected Model: Mercury 2 (15)

Rank  Model                          Company  Wrong answer Count  Score  Tests Correct  Response Time (avg)
#1    Gemini 3 Flash Preview medium  Google   1                   9.8    19/20          16.7s
#2    Gemini 3.5 Flash high          Google   1                   9.6    19/20          8.30s
#32   Step 3.5 Flash none            Stepfun  1                   7.8    9/12           39.0s
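
All three listed models have the same Wrong answer Count, but they were not run on the same number of tests, so the raw count can hide differences. The sketch below is one possible way to normalize the figures. It assumes the Wrong answer Count and the Tests Correct total come from the same run; the page does not confirm this. The function names `pass_rate` and `wrong_answer_rate` are made up for illustration and are not part of AI BENCHY.

```python
# Rows copied from the table: (model, wrong_answer_count, score, tests_correct)
rows = [
    ("Gemini 3 Flash Preview medium", 1, 9.8, "19/20"),
    ("Gemini 3.5 Flash high", 1, 9.6, "19/20"),
    ("Step 3.5 Flash none", 1, 7.8, "9/12"),
]


def pass_rate(tests_correct: str) -> float:
    """Turn a 'passed/total' string such as '19/20' into a fraction in [0, 1]."""
    passed, total = tests_correct.split("/")
    return int(passed) / int(total)


def wrong_answer_rate(count: int, tests_correct: str) -> float:
    """Wrong answers per test run (assumes count and total cover the same run)."""
    total = int(tests_correct.split("/")[1])
    return count / total


# Sort so the model with the highest share of wrong answers comes first.
ranked = sorted(rows, key=lambda r: wrong_answer_rate(r[1], r[3]), reverse=True)

for model, count, score, tests in ranked:
    print(f"{model}: pass rate {pass_rate(tests):.1%}, "
          f"wrong-answer rate {wrong_answer_rate(count, tests):.1%}")
```

Under that assumption, one wrong answer in 12 tests is a larger share than one in 20, so Step 3.5 Flash none sorts first even though its raw count matches the others.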

[Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg)]