AI BENCHY Category Failures
Coding: Wrong answer
See which AI models are most likely to fail Coding tests with a "Wrong answer" result, so you can spot weak points faster. Sorted by Tests Correct.
Failure Reasons
| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #89 | GLM 5 none | Z.ai | 2 | 4.6 | 0/2 | 5.18s |
| #91 | Gemma 4 26B A4B none | | 1 | 4.1 | 0/2 | 3.83s |
| #93 | MiMo-V2-Omni none | Xiaomi | 1 | 5.1 | 0/2 | 2.75s |
| #94 | GPT-5 Nano medium | OpenAI | 2 | 5.4 | 0/2 | 47.8s |
| #95 | DeepSeek V4 Pro none | DeepSeek | 2 | 5.4 | 0/2 | 8.27s |
| #101 | Qwen3.5 Plus 2026-04-20 none | Qwen | 1 | 4.4 | 0/2 | 2.08s |
| #105 | Cobuddy medium | Baidu | 1 | 4.1 | 0/2 | 79.2s |
| #109 | GLM 4.7 Flash none | Z.ai | 2 | 5.0 | 0/2 | 3.35s |
| #111 | gpt-oss-120b medium | OpenAI | 2 | 3.9 | 0/2 | 47.2s |
| #113 | GLM 5.1 none | Z.ai | 2 | 4.3 | 0/2 | 6.33s |
| #114 | DeepSeek V3.2 none | DeepSeek | 1 | 3.1 | 0/2 | 20.9s |
| #115 | MiMo-V2.5-Pro none | Xiaomi | 1 | 5.0 | 0/2 | 1.80s |
| #117 | Grok 4.20 Beta none | xAI | 1 | 5.5 | 0/1 | 1.14s |
| #118 | Nemotron 3 Nano Omni 30b A3b Reasoning medium | NVIDIA | 1 | 3.3 | 0/1 | 38.1s |
| #119 | MiniMax M2.5 medium | MiniMax | 1 | 3.5 | 0/2 | 125.8s |