AI BENCHY

AI BENCHY Failures

Wrong answer Failures

See which AI models produce wrong answers most often, so you can spot reliability risks before choosing one. Models are sorted by wrong-answer count, ascending.

Models Shown

15

Total Failures

572

Most Affected Model

Gemini 3.1 Pro Preview 1
| Rank | Model | Company | Wrong answer Count | Score | Tests Correct | Response Time (avg) |
|------|-------|---------|--------------------|-------|---------------|---------------------|
| #45 | GPT-5 Mini medium | OpenAI | 4 | 7.0 | 9/18 | 24.0s |
| #46 | Kimi K2.5 medium | Moonshot AI | 4 | 7.0 | 9/18 | 72.4s |
| #50 | Hunter Alpha medium | OpenRouter | 4 | 6.7 | 8/18 | 10.3s |
| #21 | Gemini 3 Flash Preview none | Google | 5 | 8.1 | 13/18 | 1.65s |
| #28 | GPT-5.2 Chat none | OpenAI | 5 | 7.9 | 12/18 | 6.84s |
| #36 | GPT-5.3 Chat none | OpenAI | 5 | 7.7 | 11/18 | 5.88s |
| #48 | Gemma 4 31B none | Google | 5 | 6.9 | 10/18 | 4.02s |
| #71 | MiniMax M2.5 medium | Minimax | 5 | 5.7 | 5/18 | 39.6s |
| #80 | MiniMax M2.7 medium | Minimax | 5 | 5.3 | 4/18 | 31.1s |
| #54 | Mercury 2 medium | Inception | 6 | 6.5 | 8/18 | 2.21s |
| #84 | gpt-oss-120b none | OpenAI | 6 | 5.2 | 4/18 | 12.0s |
| #57 | GPT-5 Nano medium | OpenAI | 7 | 6.3 | 7/18 | 44.1s |
| #60 | Gemma 4 26B A4B none | Google | 7 | 6.2 | 7/18 | 6.59s |
| #68 | gpt-oss-120b medium | OpenAI | 7 | 5.8 | 7/18 | 16.1s |
| #55 | MiMo-V2-Omni none | Xiaomi | 8 | 6.5 | 8/18 | 1.99s |
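
The raw wrong-answer count can be misleading on its own, because a model with few correct tests may fail mostly for other reasons. The sketch below is one possible way to compare the listed figures: it computes what share of each model's failed tests were wrong answers. It assumes that each wrong answer corresponds to exactly one failed test, which the page does not state. The `Row` class, the `wrong_share` property, and the `rank_by_wrong_share` function are hypothetical names introduced for illustration, and only four rows from the table are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One leaderboard row (hypothetical structure, values copied from the table)."""
    model: str
    wrong_answers: int
    correct: int
    total: int

    @property
    def failed(self) -> int:
        # Tests that were not counted as correct.
        return self.total - self.correct

    @property
    def wrong_share(self) -> float:
        # ASSUMPTION: each wrong answer accounts for exactly one failed test.
        return self.wrong_answers / self.failed if self.failed else 0.0


ROWS = [
    Row("GPT-5 Mini medium", 4, 9, 18),
    Row("Gemini 3 Flash Preview none", 5, 13, 18),
    Row("MiniMax M2.7 medium", 5, 4, 18),
    Row("MiMo-V2-Omni none", 8, 8, 18),
]


def rank_by_wrong_share(rows):
    """Return rows ordered from highest to lowest wrong-answer share."""
    return sorted(rows, key=lambda r: r.wrong_share, reverse=True)


if __name__ == "__main__":
    for r in rank_by_wrong_share(ROWS):
        print(f"{r.model}: {r.wrong_answers}/{r.failed} failed tests "
              f"were wrong answers ({r.wrong_share:.0%})")
```

Under that assumption, Gemini 3 Flash Preview fails only 5 of 18 tests and all 5 are wrong answers, while MiniMax M2.7 fails 14 tests of which only 5 are wrong answers, so the remainder would fall under other failure types.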

[Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg)]