AI BENCHY
Coding: Wrong answer

See which AI models most often return a wrong answer on Coding tests, so you can spot weak points faster. Rows are sorted by average response time, ascending.

Models Shown: 15
Total Failures: 120
Most Affected Model: Granite 4.1 8B 1

| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
|------|-------|---------|--------------------|----------------|---------------|---------------------|
| #91 | Gemma 4 26B A4B none | Google | 1 | 4.1 | 0/2 | 3.83s |
| #34 | Gemini 3.1 Flash Lite Preview medium | Google | 1 | 6.8 | 1/2 | 3.98s |
| #144 | Hy3 preview none | Tencent | 1 | 2.3 | 0/1 | 4.56s |
| #89 | GLM 5 none | Z.ai | 2 | 4.6 | 0/2 | 5.18s |
| #142 | Qwen3.5-9B none | Qwen | 2 | 4.4 | 0/2 | 5.39s |
| #3 | Gemini 3.5 Flash low | Google | 1 | 6.8 | 1/2 | 5.54s |
| #104 | Qwen3.6 27B none | Qwen | 1 | 6.8 | 1/2 | 5.75s |
| #113 | GLM 5.1 none | Z.ai | 2 | 4.3 | 0/2 | 6.33s |
| #12 | Gemini 3 Flash Preview low | Google | 1 | 7.3 | 1/2 | 6.66s |
| #67 | MiMo-V2-Flash medium | Xiaomi | 1 | 4.1 | 0/2 | 7.20s |
| #43 | GPT-5.2 Chat none | OpenAI | 1 | 8.2 | 1/2 | 8.05s |
| #95 | DeepSeek V4 Pro none | DeepSeek | 2 | 5.4 | 0/2 | 8.27s |
| #129 | gpt-oss-120b none | OpenAI | 1 | 4.3 | 0/1 | 9.57s |
| #52 | GPT-5.3 Chat none | OpenAI | 1 | 6.9 | 1/2 | 10.5s |
| #146 | Ling-2.6-1T none | Inclusionai | 1 | 5.5 | 0/1 | 10.6s |
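
For readers who want to work with these rows offline, the following is a minimal Python sketch of how a few of them could be loaded, sorted by average response time, and turned into a wrong-answer rate. The `Row` class, the `parse_tests` helper, and the rate definition (wrong answers divided by the total in the "Tests Correct" cell) are assumptions made for illustration; the page does not state how it derives its own figures.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """One leaderboard row (hypothetical structure, not the site's schema)."""
    model: str
    wrong_answers: int
    correct: int
    total: int
    avg_seconds: float

    @property
    def wrong_answer_rate(self) -> float:
        # Assumption: the denominator is the total from "Tests Correct".
        return self.wrong_answers / self.total


def parse_tests(cell: str) -> tuple[int, int]:
    """Split a 'Tests Correct' cell such as '1/2' into (correct, total)."""
    correct, total = cell.split("/")
    return int(correct), int(total)


# Three rows copied from the table above.
rows = [
    Row("GPT-5.2 Chat none", 1, *parse_tests("1/2"), 8.05),
    Row("GLM 5 none", 2, *parse_tests("0/2"), 5.18),
    Row("Gemma 4 26B A4B none", 1, *parse_tests("0/2"), 3.83),
]

# Sort ascending by average response time, matching the page's ordering.
by_time = sorted(rows, key=lambda r: r.avg_seconds)

for r in by_time:
    print(f"{r.model}: {r.wrong_answer_rate:.0%} wrong answers, {r.avg_seconds:.2f}s avg")
```

Under this definition, a row with 0/2 tests correct and a wrong-answer count of 1 has a rate of 50%, which suggests the other failed test fell into a different failure category.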

[Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost]