AI BENCHY Category Failures
Trivia: Wrong answer
See which AI models most often give a wrong answer on Trivia tests, so you can spot weak points faster. Models are sorted by Total Cost, ascending.
| Rank | Model | Company | Wrong answer Count | Category Score | Total Cost | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|---|
| #24 | Gemini 2.5 Flash medium | Google | 1 | 3.0 | $0.379 | 0/1 | 2.76s |
| #19 | GPT-5.2 Chat none | OpenAI | 1 | 3.0 | $0.393 | 0/1 | 6.89s |
| #89 | Qwen3.5-35B-A3B medium | Qwen | 1 | 3.0 | $0.401 | 0/1 | 177.4s |
| #45 | GPT-5.3 Chat none | OpenAI | 1 | 3.0 | $0.433 | 0/1 | 4.38s |
| #81 | Qwen3.6 27B medium | Qwen | 1 | 3.0 | $0.440 | 0/1 | 81.0s |
| #56 | GLM 5V Turbo medium | Z.ai | 1 | 3.0 | $0.457 | 0/1 | 41.0s |
| #49 | Claude Opus 4.7 none | Anthropic | 1 | 3.0 | $0.505 | 0/1 | 1.46s |
| #3 | Qwen3.7 Max medium | Qwen | 1 | 3.0 | $0.523 | 0/1 | 33.4s |
| #27 | GPT-5.4 Mini medium | OpenAI | 1 | 3.0 | $0.526 | 0/1 | 30.1s |
| #29 | Qwen3.5-27B medium | Qwen | 1 | 3.0 | $0.536 | 0/1 | 85.1s |
| #160 | Grok Build 0.1 none | X AI | 1 | 3.0 | $0.547 | 0/1 | 36.1s |
| #22 | GPT-5.2 medium | OpenAI | 1 | 3.0 | $0.548 | 0/1 | 28.2s |
| #65 | Kimi K2.7 Code medium | Moonshot AI | 1 | 3.0 | $0.583 | 0/1 | 341.8s |
| #36 | Qwen3.5-122B-A10B medium | Qwen | 1 | 3.0 | $0.588 | 0/1 | 52.9s |
| #53 | Grok 4.20 medium | X AI | 1 | 3.0 | $0.609 | 0/1 | 63.5s |