AI BENCHY
AI BENCHY Failures

Wrong answer Failures

See which AI models produce Wrong answer failures most often, so you can spot reliability risks before choosing one.

Models Shown: 15
Total Failures: 1204
Most Affected Model: Mercury 2 (16)

Rank | Model                     | Company     | Wrong answer Count | Score | Tests Correct | Response Time (avg)
#106 | Grok 4.20 Beta none       | X AI        | 10                 | 5.8   | 6/18          | 1.19s
#111 | Owl Alpha medium          | Openrouter  | 10                 | 5.7   | 8/21          | 11.9s
#113 | DeepSeek V4 Pro none      | DeepSeek    | 10                 | 5.7   | 7/21          | 12.4s
#121 | Owl Alpha none            | Openrouter  | 10                 | 5.5   | 7/21          | 9.88s
#127 | Grok 4.20 none            | X AI        | 10                 | 5.4   | 6/18          | 1.11s
#145 | Laguna M.1 none           | Poolside    | 10                 | 4.8   | 4/19          | 2.89s
#61  | Gemini 3.1 Flash Lite low | Google      | 9                  | 7.2   | 12/21         | 1.89s
#94  | GPT-5 Nano medium         | OpenAI      | 9                  | 6.3   | 9/21          | 42.5s
#99  | gpt-oss-120b medium       | OpenAI      | 9                  | 6.1   | 9/21          | 22.3s
#116 | Hunter Alpha none         | OpenRouter  | 9                  | 5.7   | 6/18          | 4.70s
#119 | Cobuddy medium            | Baidu       | 9                  | 5.6   | 7/21          | 39.9s
#136 | Elephant Alpha medium     | Openrouter  | 9                  | 5.1   | 6/21          | 1.27s
#137 | Elephant Alpha none       | Openrouter  | 9                  | 5.1   | 5/21          | 1.22s
#138 | Ling-2.6-flash none       | Inclusionai | 9                  | 5.0   | 6/21          | 9.34s
#158 | GLM 4.7 Flash medium      | Z.ai        | 9                  | 4.4   | 4/21          | 35.1s

[Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg)]