AI BENCHY Category Failures
Domain specific: Wrong answer
See which AI models most often return a Wrong answer on Domain specific tests, so you can spot weak points faster. Sorted by Response Time (avg), descending.
Failure Reasons
| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #33 | Hy3 preview medium | Tencent | 2 | 5.3 | 1/3 | 22.3s |
| #93 | Qwen3.6 Plus Preview medium | Qwen | 2 | 3.0 | 0/3 | 22.1s |
| #13 | Grok 4.20 Beta medium | X AI | 2 | 5.3 | 1/3 | 21.3s |
| #139 | DeepSeek V4 Flash none | DeepSeek | 2 | 5.3 | 1/3 | 19.7s |
| #130 | MiniMax M2.7 medium | Minimax | 1 | 3.0 | 0/3 | 19.0s |
| #24 | GPT-5.2 Chat none | OpenAI | 2 | 5.3 | 1/3 | 17.8s |
| #156 | Hy3 preview none | Tencent | 2 | 3.6 | 0/3 | 17.6s |
| #25 | Qwen3.5 Plus 2026-02-15 medium | Qwen | 1 | 5.3 | 1/3 | 17.5s |
| #105 | Nemotron 3 Super medium | NVIDIA | 2 | 2.9 | 0/3 | 16.2s |
| #39 | Qwen3.6 Flash medium | Qwen | 3 | 3.5 | 0/3 | 14.6s |
| #10 | Claude Opus 4.8 medium | Anthropic | 2 | 5.3 | 1/3 | 14.2s |
| #2 | Gemini 3.5 Flash high | Google | 1 | 7.6 | 2/3 | 14.1s |
| #63 | GPT-5.3 Chat none | OpenAI | 3 | 3.5 | 0/3 | 13.0s |
| #107 | Laguna Xs.2 medium | Poolside | 2 | 4.1 | 0/3 | 11.1s |
| #20 | Gemini 3.5 Flash none | Google | 1 | 7.6 | 2/3 | 10.6s |