AI BENCHY


Wrong answer Failures

See which AI models most often return a wrong answer, so you can spot reliability risks before choosing one.

Models Shown: 15
Total Failures: 572
Most Affected Model: GPT-4o-mini (13)

Rank | Model | Company | Wrong answer Count | Score | Tests Correct | Response Time (avg)
#41 | MiMo-V2-Flash medium | Xiaomi | 3 | 7.5 | 11/18 | 23.4s
#42 | Claude Sonnet 4.6 none | Anthropic | 3 | 7.4 | 11/18 | 4.98s
#47 | Grok 4.20 medium | X AI | 3 | 7.0 | 9/18 | 10.3s
#51 | Nemotron 3 Super medium | NVIDIA | 3 | 6.7 | 9/18 | 19.1s
#52 | Grok 4.1 Fast medium | X AI | 3 | 6.7 | 9/18 | 23.9s
#56 | Grok 4.20 Multi Agent Beta medium | X AI | 3 | 6.4 | 7/18 | 9.80s
#4 | Claude Opus 4.7 none | Anthropic | 2 | 9.2 | 16/18 | 3.13s
#8 | Qwen3.5 Plus 2026-02-15 medium | Qwen | 2 | 8.5 | 14/18 | 46.6s
#13 | GLM 5 medium | Z.ai | 2 | 8.4 | 13/18 | 23.3s
#24 | Gemma 4 26B A4B medium | Google | 2 | 8.0 | 13/18 | 25.0s
#26 | Claude Sonnet 4.6 medium | Anthropic | 2 | 8.0 | 13/18 | 12.7s
#34 | Kimi K2.6 medium | Moonshot AI | 2 | 7.7 | 11/18 | 45.2s
#37 | Claude Opus 4.6 medium | Anthropic | 2 | 7.6 | 12/18 | 21.1s
#39 | Seed-2.0-Mini medium | Bytedance Seed | 2 | 7.5 | 11/18 | 69.7s
#40 | GPT-5.2 medium | OpenAI | 2 | 7.5 | 11/18 | 14.0s
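
A raw wrong-answer count can be easier to interpret next to the number of tests a model failed overall. The sketch below is a rough illustration only, not something the page itself computes. It assumes that each wrong answer corresponds to one failed test out of the 18 listed under Tests Correct, which the page does not confirm. The `Row` class and the `wrong_answer_share` property are hypothetical names introduced here; the three sample rows are copied from the table.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """One leaderboard row (hypothetical structure, fields taken from the table)."""
    model: str
    wrong_answers: int
    tests_correct: int
    tests_total: int

    @property
    def failed_tests(self) -> int:
        # Tests not marked correct, e.g. 11/18 correct leaves 7 failed.
        return self.tests_total - self.tests_correct

    @property
    def wrong_answer_share(self) -> float:
        # ASSUMPTION: every wrong answer counts as exactly one failed test.
        failed = self.failed_tests
        return self.wrong_answers / failed if failed else 0.0


# Three rows copied from the table above.
ROWS = [
    Row("MiMo-V2-Flash medium", 3, 11, 18),
    Row("Claude Opus 4.7 none", 2, 16, 18),
    Row("GPT-5.2 medium", 2, 11, 18),
]


def rank_by_wrong_answer_share(rows):
    """Sort rows so the largest share of wrong answers among failures comes first."""
    return sorted(rows, key=lambda r: r.wrong_answer_share, reverse=True)


for row in rank_by_wrong_answer_share(ROWS):
    print(
        f"{row.model}: {row.wrong_answers} of {row.failed_tests} "
        f"failed tests were wrong answers ({row.wrong_answer_share:.0%})"
    )
```

Under that assumption, a model with few failures overall can still have all of them be wrong answers, while a lower-ranked model may fail mostly for other reasons.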

[Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg)]