AI BENCHY Failures
Wrong answer Failures
See which AI models most often return a wrong answer, so you can spot reliability risks before choosing one. Sorted by Total Cost, descending.
| Rank | Model | Company | Wrong answer Count | Score | Total Cost | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|---|
| #156 | Laguna Xs.2 medium | Poolside | 6 | 4.3 | $0.000 | 6/19 | 6.73s |
| #162 | Laguna Xs.2 none | Poolside | 8 | 4.0 | $0.000 | 5/19 | 806ms |
| #166 | Nemotron 3 Nano Omni 30b A3b Reasoning medium | NVIDIA | 7 | 3.6 | $0.000 | 4/19 | 17.1s |
| #167 | Nemotron 3 Nano Omni 30b A3b Reasoning none | NVIDIA | 9 | 3.5 | $0.000 | 2/19 | 728ms |