AI BENCHY

Coding: Wrong answer

See which AI models most often return a wrong answer on Coding tests, so you can spot weak points faster. Rows are sorted by Response Time (avg), ascending.

Models Shown: 15
Total Failures: 120
Most Affected Model: Granite 4.1 8B 1

| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #137 | Qwen3.6 35B A3B none | Qwen | 1 | 6.8 | 1/2 | 12.3s |
| #140 | Trinity Large Preview none | Arcee AI | 1 | 4.9 | 0/1 | 14.3s |
| #84 | Laguna Xs.2 medium | Poolside | 1 | 6.3 | 0/1 | 14.4s |
| #76 | Gemma 4 31B none | Google | 1 | 6.8 | 1/2 | 14.8s |
| #114 | DeepSeek V3.2 none | DeepSeek | 1 | 3.1 | 0/2 | 20.9s |
| #64 | GPT-5.4 Nano medium | OpenAI | 1 | 6.8 | 1/2 | 21.1s |
| #131 | DeepSeek V4 Flash none | DeepSeek | 2 | 4.8 | 0/2 | 24.5s |
| #126 | Kimi K2.5 none | Moonshot AI | 1 | 6.8 | 1/2 | 36.0s |
| #118 | Nemotron 3 Nano Omni 30b A3b Reasoning medium | NVIDIA | 1 | 3.3 | 0/1 | 38.1s |
| #9 | Gemini 3.5 Flash none | Google | 1 | 8.2 | 1/2 | 39.6s |
| #121 | Mistral Small 4 medium | Mistral | 2 | 5.1 | 0/2 | 44.8s |
| #111 | gpt-oss-120b medium | OpenAI | 2 | 3.9 | 0/2 | 47.2s |
| #94 | GPT-5 Nano medium | OpenAI | 2 | 5.4 | 0/2 | 47.8s |
| #59 | Qwen3.6 Flash medium | Qwen | 2 | 5.1 | 0/2 | 51.9s |
| #56 | Qwen3.5-Flash medium | Qwen | 1 | 4.1 | 0/2 | 54.2s |
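
For readers who want to re-sort or filter these rows offline, here is a minimal Python sketch. It copies four rows from the table above by hand; the `Row` class, its field names, and the `pass_rate` helper are hypothetical names chosen for illustration and are not part of any AI BENCHY export or API. The trailing "none"/"medium" label is kept as part of the model name, as it appears in the table.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """One leaderboard row (hypothetical structure, values copied from the table)."""
    model: str
    wrong_answers: int      # "Wrong answer Count" column
    tests_correct: str      # "Tests Correct" column, e.g. "1/2"
    avg_response_s: float   # "Response Time (avg)" column, in seconds

    @property
    def pass_rate(self) -> float:
        # Split "passed/total" and return the fraction of tests passed.
        passed, total = (int(part) for part in self.tests_correct.split("/"))
        return passed / total


rows = [
    Row("Gemini 3.5 Flash none", 1, "1/2", 39.6),
    Row("Qwen3.6 35B A3B none", 1, "1/2", 12.3),
    Row("DeepSeek V4 Flash none", 2, "0/2", 24.5),
    Row("GPT-5 Nano medium", 2, "0/2", 47.8),
]

# Same ordering as the page: average response time, ascending.
by_speed = sorted(rows, key=lambda r: r.avg_response_s)

# Models in this sample that passed none of their Coding tests.
zero_pass = [r.model for r in rows if r.pass_rate == 0.0]

for r in by_speed:
    print(f"{r.model}: {r.avg_response_s}s, pass rate {r.pass_rate:.0%}")
```

Sorting on a different column only requires changing the `key` function, for example `key=lambda r: r.wrong_answers`.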

[Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost]