AI BENCHY
AI BENCHY Category Failures

Coding: Wrong answer

See which AI models most often return a wrong answer on Coding tests, so you can spot weak points faster.

Models Shown: 11
Total Failures: 26
Most Affected Model: MiMo-V2-Omni (1)

| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
| --- | --- | --- | --- | --- | --- | --- |
| #77 | Grok 4.20 none | X AI | 1 | 3.4 | 0/1 | 1.22s |
| #78 | Mistral Small 4 none | Mistral | 1 | 4.5 | 0/1 | 1.28s |
| #79 | gpt-oss-120b none | OpenAI | 1 | 4.3 | 0/1 | 9.57s |
| #81 | Qwen3 Coder Next none | Qwen | 1 | 7.3 | 0/1 | 3.14s |
| #82 | Nemotron 3 Super none | NVIDIA | 1 | 3.3 | 0/1 | 2.99s |
| #83 | GPT-4o-mini none | OpenAI | 1 | 3.0 | 0/1 | 2.55s |
| #84 | Qwen3.5-9B none | Qwen | 1 | 5.2 | 0/1 | 5.69s |
| #85 | Mercury 2 none | Inception | 1 | 3.6 | 0/1 | 969ms |
| #88 | MiMo-V2-Flash none | Xiaomi | 1 | 6.3 | 0/1 | 2.79s |
| #89 | Grok 4.1 Fast none | X AI | 1 | 5.3 | 0/1 | 1.79s |
| #90 | GPT-5.4 Nano none | OpenAI | 1 | 7.1 | 0/1 | 1.43s |

Charts (not reproduced here): Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost