AI BENCHY Category Failures
Coding: Wrong answer
See which AI models are most likely to give a wrong answer on Coding tests, so you can spot weak points faster. Rows are sorted by average response time, fastest first.
Failure Reasons
| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #91 | Gemma 4 26B A4B none | | 1 | 4.1 | 0/2 | 3.83s |
| #34 | Gemini 3.1 Flash Lite Preview medium | | 1 | 6.8 | 1/2 | 3.98s |
| #144 | Hy3 preview none | Tencent | 1 | 2.3 | 0/1 | 4.56s |
| #89 | GLM 5 none | Z.ai | 2 | 4.6 | 0/2 | 5.18s |
| #142 | Qwen3.5-9B none | Qwen | 2 | 4.4 | 0/2 | 5.39s |
| #3 | Gemini 3.5 Flash low | | 1 | 6.8 | 1/2 | 5.54s |
| #104 | Qwen3.6 27B none | Qwen | 1 | 6.8 | 1/2 | 5.75s |
| #113 | GLM 5.1 none | Z.ai | 2 | 4.3 | 0/2 | 6.33s |
| #12 | Gemini 3 Flash Preview low | | 1 | 7.3 | 1/2 | 6.66s |
| #67 | MiMo-V2-Flash medium | Xiaomi | 1 | 4.1 | 0/2 | 7.20s |
| #43 | GPT-5.2 Chat none | OpenAI | 1 | 8.2 | 1/2 | 8.05s |
| #95 | DeepSeek V4 Pro none | DeepSeek | 2 | 5.4 | 0/2 | 8.27s |
| #129 | gpt-oss-120b none | OpenAI | 1 | 4.3 | 0/1 | 9.57s |
| #52 | GPT-5.3 Chat none | OpenAI | 1 | 6.9 | 1/2 | 10.5s |
| #146 | Ling-2.6-1T none | Inclusionai | 1 | 5.5 | 0/1 | 10.6s |