AI BENCHY Category Failures
Trivia: Wrong answer
See which AI models are most likely to give a wrong answer on Trivia tests, so you can spot weak points faster. Rows are sorted by average response time, ascending.
| Rank | Model | Company | Wrong answer Count | Category Score | Total Cost | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|---|
| #126 | DeepSeek V3.2 none | DeepSeek | 1 | 3.0 | $0.017 | 0/1 | 17.2s |
| #79 | GPT-5 Nano medium | OpenAI | 1 | 3.0 | $0.081 | 0/1 | 20.1s |
| #127 | MiniMax M2.7 medium | Minimax | 1 | 3.0 | $0.104 | 0/1 | 22.8s |
| #115 | Grok 4.1 Fast medium | X AI | 1 | 3.0 | $0.069 | 0/1 | 25.5s |
| #78 | gpt-oss-120b medium | OpenAI | 1 | 3.0 | $0.013 | 0/1 | 26.5s |
| #22 | GPT-5.2 medium | OpenAI | 1 | 3.0 | $0.548 | 0/1 | 28.2s |
| #64 | GLM 5.1 medium | Z.ai | 1 | 3.0 | $0.292 | 0/1 | 29.4s |
| #31 | Claude Sonnet 4.6 medium | Anthropic | 1 | 3.0 | $1.418 | 0/1 | 30.1s |
| #27 | GPT-5.4 Mini medium | OpenAI | 1 | 3.0 | $0.526 | 0/1 | 30.1s |
| #75 | Qwen3.6 35B A3B medium | Qwen | 1 | 3.0 | $0.146 | 0/1 | 32.9s |
| #122 | Qwen3.5 Plus 2026-04-20 none | Qwen | 1 | 3.0 | $0.032 | 0/1 | 33.3s |
| #3 | Qwen3.7 Max medium | Qwen | 1 | 3.0 | $0.523 | 0/1 | 33.4s |
| #41 | DeepSeek V4 Pro high | DeepSeek | 1 | 3.0 | $0.157 | 0/1 | 34.0s |
| #160 | Grok Build 0.1 none | X AI | 1 | 3.0 | $0.547 | 0/1 | 36.1s |
| #140 | Cobuddy medium | Baidu | 1 | 3.0 | $0.000 | 0/1 | 37.0s |