AI BENCHY Category Failures
Anti-AI Tricks: Wrong answer
See which AI models most often fail Anti-AI Tricks tests with the failure reason "Wrong answer", so you can spot weak points faster. Rows are sorted by average response time, fastest first.
Failure Reasons
| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #112 | GLM 5.1 none | Z.ai | 4 | 4.0 | 0/4 | 2.11s |
| #50 | Gemini 3.1 Flash Lite Preview low | Google | 1 | 8.3 | 3/4 | 2.12s |
| #8 | Claude Opus 4.7 none | Anthropic | 1 | 8.3 | 3/4 | 2.12s |
| #143 | MiMo-V2.5 none | Xiaomi | 4 | 3.5 | 0/4 | 2.19s |
| #104 | Nemotron 3 Ultra 550b A55b none | NVIDIA | 4 | 3.5 | 0/4 | 2.35s |
| #98 | GLM 5 none | Z.ai | 3 | 4.8 | 1/4 | 2.37s |
| #110 | Seed-2.0-Lite none | Bytedance Seed | 4 | 3.0 | 0/4 | 2.43s |
| #74 | Qwen3.6 Max Preview none | Qwen | 3 | 5.2 | 1/4 | 2.63s |
| #132 | Mistral Small 4 medium | Mistral | 3 | 5.6 | 1/4 | 2.67s |
| #123 | MiMo-V2.5-Pro none | Xiaomi | 3 | 3.3 | 0/4 | 2.67s |
| #107 | Laguna Xs.2 medium | Poolside | 1 | 6.9 | 2/4 | 2.68s |
| #121 | Owl Alpha none | Openrouter | 3 | 3.4 | 0/4 | 2.78s |
| #118 | Qwen3.6 27B none | Qwen | 4 | 3.8 | 0/4 | 2.83s |
| #134 | GLM 5 Turbo none | Z.ai | 4 | 3.0 | 0/4 | 2.84s |
| #77 | Claude Sonnet 4.6 none | Anthropic | 1 | 4.8 | 1/4 | 2.94s |