AI BENCHY Category Failures
Anti-AI Tricks: Wrong answer
See which AI models are most likely to fail Anti-AI Tricks tests with the "Wrong answer" failure reason, so you can spot weak points faster. Rows are sorted by average response time, slowest first.
Failure Reasons
| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #24 | GPT-5.2 Chat none | OpenAI | 1 | 8.7 | 3/4 | 3.40s |
| #140 | Qwen3 Coder Next none | Qwen | 2 | 3.6 | 0/4 | 3.31s |
| #13 | Grok 4.20 Beta medium | X AI | 1 | 8.7 | 3/4 | 3.16s |
| #109 | GLM 5V Turbo none | Z.ai | 3 | 4.8 | 1/4 | 3.13s |
| #52 | Claude Sonnet 4.6 medium | Anthropic | 1 | 6.5 | 2/4 | 2.98s |
| #77 | Claude Sonnet 4.6 none | Anthropic | 1 | 4.8 | 1/4 | 2.94s |
| #134 | GLM 5 Turbo none | Z.ai | 4 | 3.0 | 0/4 | 2.84s |
| #118 | Qwen3.6 27B none | Qwen | 4 | 3.8 | 0/4 | 2.83s |
| #121 | Owl Alpha none | Openrouter | 3 | 3.4 | 0/4 | 2.78s |
| #107 | Laguna Xs.2 medium | Poolside | 1 | 6.9 | 2/4 | 2.68s |
| #123 | MiMo-V2.5-Pro none | Xiaomi | 3 | 3.3 | 0/4 | 2.67s |
| #132 | Mistral Small 4 medium | Mistral | 3 | 5.6 | 1/4 | 2.67s |
| #74 | Qwen3.6 Max Preview none | Qwen | 3 | 5.2 | 1/4 | 2.63s |
| #110 | Seed-2.0-Lite none | Bytedance Seed | 4 | 3.0 | 0/4 | 2.43s |
| #98 | GLM 5 none | Z.ai | 3 | 4.8 | 1/4 | 2.37s |