AI BENCHY Category Failures
Anti-AI Tricks: Wrong answer
See which AI models are most likely to fail Anti-AI Tricks tests with a "Wrong answer" result, so you can spot weak points faster. Rows are sorted by Tests Correct, descending.
Failure Reasons
| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #65 | Grok 4.20 medium | X AI | 1 | 8.2 | 3/4 | 3.95s |
| #70 | GPT-5.4 Nano medium | OpenAI | 1 | 8.3 | 3/4 | 4.52s |
| #78 | Qwen3.6 27B medium | Qwen | 1 | 8.3 | 3/4 | 12.6s |
| #86 | Grok 4.1 Fast medium | X AI | 1 | 8.7 | 3/4 | 3.81s |
| #87 | Gemini 3.1 Flash Lite minimal | Google | 1 | 8.3 | 3/4 | 1.10s |
| #100 | Grok Build 0.1 none | X AI | 1 | 8.7 | 3/4 | 6.30s |
| #102 | Gemma 4 26B A4B none | Google | 1 | 8.3 | 3/4 | 1.28s |
| #119 | Cobuddy medium | Baidu | 1 | 8.7 | 3/4 | 10.00s |
| #32 | Gemini 3.5 Flash minimal | Google | 2 | 6.5 | 2/4 | 0.892s |
| #34 | Qwen3.7 Max none | Qwen | 2 | 6.5 | 2/4 | 1.08s |
| #42 | GPT-5.2 medium | OpenAI | 1 | 6.5 | 2/4 | 7.81s |
| #52 | Claude Sonnet 4.6 medium | Anthropic | 1 | 6.5 | 2/4 | 2.98s |
| #54 | GPT-5 Mini medium | OpenAI | 1 | 7.1 | 2/4 | 13.9s |
| #58 | Gemini 3.1 Flash Lite Preview none | Google | 1 | 7.5 | 2/4 | 1.04s |
| #59 | GLM 5V Turbo medium | Z.ai | 1 | 7.2 | 2/4 | 10.8s |