AI BENCHY Category Failures
Domain specific: Wrong answer
See which AI models most often return a wrong answer on Domain specific tests, so you can spot weak points faster. The table is sorted by average response time, ascending.
Failure Reasons
| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #16 | Gemini 3 Flash Preview low | Google | 2 | 5.3 | 1/3 | 8.05s |
| #111 | Owl Alpha medium | OpenRouter | 2 | 5.3 | 1/3 | 8.58s |
| #51 | Mimo V2 PRO medium | Xiaomi | 1 | 5.3 | 1/3 | 8.82s |
| #79 | Hunter Alpha medium | OpenRouter | 1 | 3.0 | 0/3 | 10.5s |
| #20 | Gemini 3.5 Flash none | Google | 1 | 7.6 | 2/3 | 10.6s |
| #107 | Laguna Xs.2 medium | Poolside | 2 | 4.1 | 0/3 | 11.1s |
| #63 | GPT-5.3 Chat none | OpenAI | 3 | 3.5 | 0/3 | 13.0s |
| #2 | Gemini 3.5 Flash high | Google | 1 | 7.6 | 2/3 | 14.1s |
| #10 | Claude Opus 4.8 medium | Anthropic | 2 | 5.3 | 1/3 | 14.2s |
| #39 | Qwen3.6 Flash medium | Qwen | 3 | 3.5 | 0/3 | 14.6s |
| #105 | Nemotron 3 Super medium | NVIDIA | 2 | 2.9 | 0/3 | 16.2s |
| #25 | Qwen3.5 Plus 2026-02-15 medium | Qwen | 1 | 5.3 | 1/3 | 17.5s |
| #156 | Hy3 preview none | Tencent | 2 | 3.6 | 0/3 | 17.6s |
| #24 | GPT-5.2 Chat none | OpenAI | 2 | 5.3 | 1/3 | 17.8s |
| #130 | MiniMax M2.7 medium | MiniMax | 1 | 3.0 | 0/3 | 19.0s |