AI BENCHY Failures
Wrong answer Failures
See which AI models most often return a wrong answer, so you can spot reliability risks before choosing one. The table is sorted by total cost, ascending.
| Rank | Model | Company | Wrong answer Count | Score | Total Cost | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|---|
| #40 | MiniMax M3 medium | Minimax | 3 | 7.6 | $0.131 | 11/21 | 68.2s |
| #75 | Qwen3.6 35B A3B medium | Qwen | 4 | 6.7 | $0.146 | 13/21 | 18.1s |
| #41 | DeepSeek V4 Pro high | DeepSeek | 6 | 7.6 | $0.157 | 9/21 | 77.2s |
| #26 | Nemotron 3 Ultra 550b A55b medium | NVIDIA | 7 | 8.1 | $0.158 | 13/21 | 15.1s |
| #16 | GPT-5 Mini medium | OpenAI | 5 | 8.5 | $0.159 | 12/21 | 23.6s |
| #18 | Seed-2.0-Lite medium | Bytedance Seed | 5 | 8.5 | $0.175 | 14/21 | 47.1s |
| #25 | Qwen3.7 Plus medium | Qwen | 5 | 8.2 | $0.177 | 15/21 | 38.9s |
| #15 | GLM 5 medium | Z.ai | 3 | 8.6 | $0.228 | 15/21 | 33.5s |
| #90 | GPT-5.5 none | OpenAI | 11 | 6.3 | $0.231 | 10/21 | 1.89s |
| #47 | Qwen3.6 Flash medium | Qwen | 8 | 7.5 | $0.288 | 12/21 | 19.2s |
| #64 | GLM 5.1 medium | Z.ai | 4 | 7.1 | $0.292 | 12/21 | 33.7s |
| #30 | Qwen3.6 Plus medium | Qwen | 5 | 7.8 | $0.294 | 14/21 | 30.7s |
| #146 | MiniMax M2.5 medium | Minimax | 7 | 4.7 | $0.303 | 5/21 | 65.4s |
| #28 | Qwen3.5 Plus 2026-02-15 medium | Qwen | 4 | 8.0 | $0.310 | 14/21 | 73.8s |
| #55 | Claude Sonnet 4.6 none | Anthropic | 5 | 7.3 | $0.316 | 11/21 | 5.04s |