AI BENCHY


Wrong answer Failures

See which AI models most often return a wrong answer, so you can spot reliability risks before choosing one.

Models Shown: 15

Total Failures: 1092

Most Affected Model: Mercury 2 (15)

| Rank | Model | Company | Wrong answer Count | Score | Tests Correct | Response Time (avg) |
|------|-------|---------|--------------------|-------|---------------|---------------------|
| #12 | Gemini 3 Flash Preview low | Google | 4 | 8.6 | 16/20 | 5.86s |
| #15 | GPT-5.3-Codex medium | OpenAI | 4 | 8.3 | 14/20 | 16.0s |
| #17 | Grok 4.20 Beta medium | X AI | 4 | 8.2 | 13/18 | 9.81s |
| #20 | Qwen3.5 Plus 2026-02-15 medium | Qwen | 4 | 8.1 | 14/20 | 67.9s |
| #24 | Gemini 3.5 Flash minimal | Google | 4 | 7.9 | 14/20 | 1.58s |
| #28 | GLM 5 Turbo medium | Z.ai | 4 | 7.9 | 13/20 | 22.7s |
| #29 | Hy3 preview medium | Tencent | 4 | 7.8 | 14/20 | 16.0s |
| #30 | Qwen3.6 35B A3B medium | Qwen | 4 | 7.8 | 14/20 | 17.3s |
| #31 | Grok 4.3 medium | X AI | 4 | 7.8 | 13/20 | 49.2s |
| #37 | Hy3 preview low | Tencent | 4 | 7.7 | 15/20 | 24.6s |
| #45 | Grok Build 0.1 medium | X AI | 4 | 7.6 | 12/20 | 26.4s |
| #47 | Gemma 4 26B A4B medium | Google | 4 | 7.5 | 13/20 | 51.4s |
| #51 | GLM 5.1 medium | Z.ai | 4 | 7.4 | 12/20 | 32.2s |
| #53 | MiMo-V2.5 medium | Xiaomi | 4 | 7.4 | 12/20 | 20.4s |
| #56 | Qwen3.5-Flash medium | Qwen | 4 | 7.4 | 11/20 | 65.6s |
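
Every row in the table shows the same Wrong answer Count (4), but the number of tests differs between rows (for example 13/18 versus 16/20), so the raw count alone may not be directly comparable. The sketch below is a minimal, illustrative way to normalize it. It assumes that the denominator of Tests Correct is the number of tests the model actually ran, which the page does not state. The function names are hypothetical, and the four rows are copied from the table.

```python
from fractions import Fraction

# (model, wrong_answer_count, tests_correct) copied from four rows of the table.
ROWS = [
    ("Gemini 3 Flash Preview low", 4, "16/20"),
    ("Grok 4.20 Beta medium", 4, "13/18"),
    ("GLM 5 Turbo medium", 4, "13/20"),
    ("Qwen3.5-Flash medium", 4, "11/20"),
]


def parse_tests_correct(cell):
    """Split a 'passed/total' cell into two integers."""
    passed, total = cell.split("/")
    return int(passed), int(total)


def pass_rate(cell):
    """Fraction of tests answered correctly."""
    passed, total = parse_tests_correct(cell)
    return Fraction(passed, total)


def wrong_answer_rate(wrong, cell):
    """Wrong answers per test run.

    Assumption: the denominator of Tests Correct is the number of tests run.
    """
    _, total = parse_tests_correct(cell)
    return Fraction(wrong, total)


def rank_by_wrong_answer_rate(rows):
    """Sort rows from highest to lowest wrong-answer rate (ties keep input order)."""
    return sorted(rows, key=lambda r: wrong_answer_rate(r[1], r[2]), reverse=True)


for model, wrong, cell in rank_by_wrong_answer_rate(ROWS):
    print(
        f"{model}: pass rate {float(pass_rate(cell)):.1%}, "
        f"wrong-answer rate {float(wrong_answer_rate(wrong, cell)):.1%}"
    )
```

Under that assumption, a model that ran 18 tests with 4 wrong answers has a slightly higher wrong-answer rate than one that ran 20, even though both show the same count in the table.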

[Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg)]