AI BENCHY

Coding: Wrong answer

See which AI models are most likely to return a wrong answer on Coding tests, so you can spot weak points faster. The table below is sorted by average response time, ascending.

Models Shown: 15

Total Failures: 120

Most Affected Model: Granite 4.1 8B (1)

| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
|------|-------|---------|--------------------|----------------|---------------|---------------------|
| #4 | Gemini 3.1 Pro Preview medium | Google | 1 | 7.0 | 1/2 | 54.3s |
| #36 | Gemini 2.5 Flash medium | Google | 1 | 6.6 | 1/2 | 54.6s |
| #60 | GLM 5V Turbo medium | Z.ai | 1 | 6.8 | 1/2 | 54.8s |
| #27 | GPT-5.4 medium | OpenAI | 1 | 8.2 | 1/2 | 55.0s |
| #55 | DeepSeek V4 Flash high | DeepSeek | 1 | 6.8 | 1/2 | 58.1s |
| #53 | MiMo-V2.5 medium | Xiaomi | 1 | 6.9 | 1/2 | 64.5s |
| #77 | Grok 4.20 medium | X AI | 2 | 4.1 | 0/2 | 65.1s |
| #45 | Grok Build 0.1 medium | X AI | 1 | 5.3 | 0/2 | 67.4s |
| #11 | GPT-5.5 medium | OpenAI | 1 | 8.2 | 1/2 | 69.7s |
| #65 | GPT-5.4 Mini medium | OpenAI | 1 | 7.5 | 1/2 | 73.3s |
| #105 | Cobuddy medium | Baidu | 1 | 4.1 | 0/2 | 79.2s |
| #44 | MiMo-V2-Pro medium | Xiaomi | 1 | 7.5 | 1/2 | 94.2s |
| #1 | Gemini 3 Flash Preview medium | Google | 1 | 7.9 | 1/2 | 96.0s |
| #21 | Seed-2.0-Lite medium | Bytedance Seed | 1 | 7.0 | 1/2 | 107.7s |
| #38 | Qwen3.5-122B-A10B medium | Qwen | 1 | 4.1 | 0/2 | 119.6s |
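
The rows above are ordered by response time, which can hide the models with the most failures. As a rough sketch (not part of the site itself), the snippet below re-sorts a handful of rows copied from the table by Wrong answer Count and lists the models that passed none of their tests. The tuple layout and the names `rows`, `passed`, `by_failures`, and `zero_pass` are assumptions made for illustration.

```python
# A few rows copied from the table:
# (model, wrong_answer_count, tests_correct, avg_response_seconds)
rows = [
    ("Gemini 3.1 Pro Preview medium", 1, "1/2", 54.3),
    ("GPT-5.4 medium", 1, "1/2", 55.0),
    ("Grok 4.20 medium", 2, "0/2", 65.1),
    ("Cobuddy medium", 1, "0/2", 79.2),
    ("Qwen3.5-122B-A10B medium", 1, "0/2", 119.6),
]


def passed(tests_correct: str) -> int:
    """Return the number of passed tests from a 'passed/total' string."""
    return int(tests_correct.split("/")[0])


# Sort by wrong-answer count (descending), then by response time (ascending).
by_failures = sorted(rows, key=lambda r: (-r[1], r[3]))

# Models that passed none of their Coding tests.
zero_pass = [model for model, _, tests, _ in rows if passed(tests) == 0]

for model, count, tests, seconds in by_failures:
    print(f"{model}: {count} wrong, {tests} correct, {seconds}s avg")
```

Sorting on a `(-count, seconds)` key keeps the page's fastest-first order as the tie-breaker among models with the same failure count.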

Charts (not reproduced here): Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.