AI BENCHY

AI BENCHY Failures

Wrong answer Failures

See which AI models give wrong answers most often, so you can spot reliability risks before choosing one.

Models Shown: 15
Total Failures: 1204
Most Affected Model: Mercury 2 (16)

Rank | Model | Company | Wrong answer Count | Score | Tests Correct | Response Time (avg)
#71 | Step 3.7 Flash high | Stepfun | 6 | 7.0 | 11/21 | 64.5s
#75 | Ring-2.6-1T medium | Inclusionai | 6 | 6.9 | 11/21 | 61.3s
#78 | Qwen3.6 27B medium | Qwen | 6 | 6.8 | 10/21 | 59.7s
#107 | Laguna Xs.2 medium | Poolside | 6 | 5.8 | 6/19 | 6.73s
#130 | MiniMax M2.7 medium | Minimax | 6 | 5.3 | 5/21 | 38.2s
#14 | Qwen3.6 Max Preview medium | Qwen | 5 | 8.5 | 16/21 | 59.6s
#16 | Gemini 3 Flash Preview low | Google | 5 | 8.4 | 16/21 | 5.76s
#18 | Qwen3.7 Plus medium | Qwen | 5 | 8.2 | 15/21 | 38.9s
#19 | Seed-2.0-Lite medium | Bytedance Seed | 5 | 8.2 | 14/21 | 47.1s
#21 | GPT-5.4 medium | OpenAI | 5 | 8.0 | 14/21 | 22.3s
#22 | Step 3.7 Flash medium | Stepfun | 5 | 8.0 | 14/21 | 20.4s
#26 | Qwen3.6 Plus medium | Qwen | 5 | 7.9 | 14/21 | 30.7s
#29 | Qwen3.5-122B-A10B medium | Qwen | 5 | 7.8 | 14/21 | 42.5s
#32 | Gemini 3.5 Flash minimal | Google | 5 | 7.7 | 14/21 | 1.57s
#38 | Grok 4.3 medium | X AI | 5 | 7.6 | 13/21 | 47.5s

[Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg)]