AI BENCHY Failures
Wrong answer Failures
See which AI models most often fail with a wrong answer, so you can spot reliability risks before choosing one. Sorted by total cost, ascending.
| Rank | Model | Company | Wrong answer Count | Score | Total Cost | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|---|
| #107 | North Mini Code medium | Cohere | 9 | 5.8 | $0.000 | 9/21 | 106.2s |
| #108 | Owl Alpha medium | OpenRouter | 10 | 5.8 | $0.000 | 8/21 | 11.9s |
| #110 | Owl Alpha none | OpenRouter | 10 | 5.8 | $0.000 | 7/21 | 9.88s |
| #113 | Qwen3.6 Plus Preview medium | Qwen | 2 | 5.8 | $0.000 | 9/19 | 15.2s |
| #131 | North Mini Code none | Cohere | 12 | 5.1 | $0.000 | 4/21 | 29.8s |
| #132 | Hunter Alpha medium | OpenRouter | 4 | 5.1 | $0.000 | 8/18 | 10.3s |
| #138 | Laguna M.1 medium | Poolside | 4 | 5.0 | $0.000 | 9/19 | 14.7s |
| #140 | Cobuddy medium | Baidu | 9 | 4.9 | $0.000 | 7/21 | 39.9s |
| #150 | Laguna M.1 none | Poolside | 10 | 4.6 | $0.000 | 4/19 | 2.89s |
| #152 | Elephant Alpha none | OpenRouter | 9 | 4.6 | $0.000 | 5/21 | 1.22s |
| #153 | Elephant Alpha medium | OpenRouter | 9 | 4.5 | $0.000 | 6/21 | 1.27s |
| #154 | Hunter Alpha none | OpenRouter | 9 | 4.5 | $0.000 | 6/18 | 4.70s |
| #156 | Laguna Xs.2 medium | Poolside | 6 | 4.3 | $0.000 | 6/19 | 6.73s |
| #162 | Laguna Xs.2 none | Poolside | 8 | 4.0 | $0.000 | 5/19 | 0.806s |
| #166 | Nemotron 3 Nano Omni 30b A3b Reasoning medium | NVIDIA | 7 | 3.6 | $0.000 | 4/19 | 17.1s |
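
The models in this table were scored on different numbers of tests (18, 19, or 21), so raw wrong-answer counts are not directly comparable. A minimal sketch of one way to normalize them is shown below. It assumes that the denominator in "Tests Correct" is the total number of tests run and that each wrong answer corresponds to one test; neither is stated on the page. The function names are hypothetical, and the three rows are copied from the table above.

```python
# Rows copied from the table: (model, wrong answer count, tests correct).
ROWS = [
    ("North Mini Code none", 12, "4/21"),
    ("Qwen3.6 Plus Preview medium", 2, "9/19"),
    ("Hunter Alpha none", 9, "6/18"),
]


def wrong_answer_rate(wrong_count: int, tests_correct: str) -> float:
    """Share of all tests that ended in a wrong answer.

    Assumption: the number after the slash in "Tests Correct" is the
    total number of tests the model was run on.
    """
    _, total = tests_correct.split("/")
    return wrong_count / int(total)


def rank_by_wrong_answer_rate(rows):
    """Return (model, rate) pairs, highest wrong-answer rate first."""
    rated = [(model, wrong_answer_rate(wrong, correct)) for model, wrong, correct in rows]
    return sorted(rated, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for model, rate in rank_by_wrong_answer_rate(ROWS):
        print(f"{model}: {rate:.1%}")
```

Under these assumptions, a model with 9 wrong answers out of 18 tests has a lower rate than one with 12 out of 21, even though both counts look high in absolute terms.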