AI BENCHY Failures
Wrong answer Failures
See which AI models give wrong answers most often, so you can spot reliability risks before choosing one.
| Rank | Model | Company | Wrong answer Count | Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #58 | Step 3.5 Flash medium | Stepfun | 4 | 7.4 | 11/19 | 43.3s |
| #68 | Seed-2.0-Mini medium | Bytedance Seed | 4 | 7.1 | 11/20 | 79.2s |
| #69 | Claude Sonnet 4.6 none | Anthropic | 4 | 7.0 | 11/20 | 5.33s |
| #74 | Laguna M.1 medium | Poolside | 4 | 6.9 | 12/19 | 14.4s |
| #75 | Hunter Alpha medium | OpenRouter | 4 | 6.7 | 8/18 | 10.3s |
| #81 | Grok 4.20 Multi Agent Beta medium | X AI | 4 | 6.6 | 8/18 | 9.80s |
| #87 | Grok 4.1 Fast medium | X AI | 4 | 6.5 | 9/19 | 24.0s |
| #5 | Qwen3.7 Max medium | Qwen | 3 | 9.0 | 17/20 | 13.8s |
| #8 | GPT-5.5 low | OpenAI | 3 | 8.9 | 17/20 | 9.43s |
| #9 | Gemini 3.5 Flash none | Google | 3 | 8.9 | 17/20 | 9.05s |
| #10 | Claude Opus 4.7 none | Anthropic | 3 | 8.9 | 16/19 | 3.04s |
| #13 | Gemini 3.1 Flash Lite Preview high | Google | 3 | 8.6 | 13/16 | 68.8s |
| #16 | Qwen3.6 Plus Preview medium | Qwen | 3 | 8.2 | 16/19 | 15.2s |
| #18 | GLM 5 medium | Z.ai | 3 | 8.2 | 14/20 | 33.4s |
| #19 | Gemini 3 PRO Preview medium | Google | 3 | 8.1 | 15/20 | 9.06s |