AI BENCHY Category Failures

Coding: Wrong answer

See which AI models most often fail Coding tests with a Wrong answer, so you can spot weak points faster.

Models Shown: 15
Total Failures: 120
Most Affected Model: Qwen3.6 Flash 2

| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #112 | GPT-5.4 none | OpenAI | 1 | 6.8 | 1/2 | 1.99s |
| #114 | DeepSeek V3.2 none | DeepSeek | 1 | 3.1 | 0/2 | 20.9s |
| #115 | MiMo-V2.5-Pro none | Xiaomi | 1 | 5.0 | 0/2 | 1.80s |
| #116 | Qwen3.6 Flash none | Qwen | 1 | 6.6 | 1/2 | 2.34s |
| #117 | Grok 4.20 Beta none | X AI | 1 | 5.5 | 0/1 | 1.14s |
| #118 | Nemotron 3 Nano Omni 30b A3b Reasoning medium | NVIDIA | 1 | 3.3 | 0/1 | 38.1s |
| #119 | MiniMax M2.5 medium | Minimax | 1 | 3.5 | 0/2 | 125.8s |
| #120 | Grok 4.20 none | X AI | 1 | 3.4 | 0/1 | 1.22s |
| #122 | Elephant Alpha medium | Openrouter | 1 | 4.0 | 0/2 | 1.30s |
| #123 | Laguna M.1 none | Poolside | 1 | 7.5 | 0/1 | 2.93s |
| #126 | Kimi K2.5 none | Moonshot AI | 1 | 6.8 | 1/2 | 36.0s |
| #127 | Laguna Xs.2 none | Poolside | 1 | 2.5 | 0/1 | 1.96s |
| #129 | gpt-oss-120b none | OpenAI | 1 | 4.3 | 0/1 | 9.57s |
| #130 | Elephant Alpha none | Openrouter | 1 | 4.7 | 0/2 | 1.39s |
| #136 | GPT-5.4 Mini none | OpenAI | 1 | 6.8 | 1/2 | 1.01s |

[Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost]