AI BENCHY Category Failures
Anti-AI Tricks: Wrong answer
See which AI models are most likely to fail with "Wrong answer" on Anti-AI Tricks, so you can spot weak points faster. Rows are sorted by average response time, ascending.
| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #22 | Gemini 3.1 Flash Lite Preview low | Google | 1 | 8.3 | 3/4 | 2.12s |
| #4 | Claude Opus 4.7 none | Anthropic | 1 | 8.3 | 3/4 | 2.12s |
| #53 | GLM 5 none | Z.ai | 3 | 4.8 | 1/4 | 2.37s |
| #61 | Seed-2.0-Lite none | Bytedance Seed | 4 | 3.0 | 0/4 | 2.43s |
| #73 | Mistral Small 4 medium | Mistral | 3 | 5.6 | 1/4 | 2.67s |
| #77 | GLM 5 Turbo none | Z.ai | 4 | 3.0 | 0/4 | 2.84s |
| #42 | Claude Sonnet 4.6 none | Anthropic | 1 | 4.8 | 1/4 | 2.94s |
| #26 | Claude Sonnet 4.6 medium | Anthropic | 1 | 6.5 | 2/4 | 2.98s |
| #78 | Trinity Large Preview none | Arcee AI | 4 | 3.0 | 0/4 | 3.02s |
| #58 | GLM 5V Turbo none | Z.ai | 3 | 4.8 | 1/4 | 3.13s |
| #25 | Grok 4.20 Beta medium | X AI | 1 | 8.7 | 3/4 | 3.16s |
| #87 | Qwen3 Coder Next none | Qwen | 2 | 3.6 | 0/4 | 3.31s |
| #47 | Grok 4.20 medium | X AI | 1 | 8.2 | 3/4 | 3.36s |
| #28 | GPT-5.2 Chat none | OpenAI | 1 | 8.7 | 3/4 | 3.40s |
| #56 | Grok 4.20 Multi Agent Beta medium | X AI | 1 | 6.9 | 2/4 | 3.46s |