AI BENCHY Category Failures
Anti-AI Tricks: Wrong answer
See which AI models are most likely to return a wrong answer on Anti-AI Tricks tests, so you can spot weak points faster. Rows are sorted by Tests Correct, ascending.
| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #91 | GPT-5.5 none | OpenAI | 2 | 6.9 | 2/4 | 1.31s |
| #92 | Laguna M.1 medium | Poolside | 1 | 6.5 | 2/4 | 4.87s |
| #94 | GPT-5 Nano medium | OpenAI | 2 | 6.5 | 2/4 | 25.5s |
| #99 | gpt-oss-120b medium | OpenAI | 1 | 6.7 | 2/4 | 10.2s |
| #103 | DeepSeek V4 Pro high | DeepSeek | 1 | 6.4 | 2/4 | 16.5s |
| #107 | Laguna Xs.2 medium | Poolside | 1 | 6.9 | 2/4 | 2.68s |
| #126 | gpt-oss-120b none | OpenAI | 1 | 6.5 | 2/4 | 32.8s |
| #130 | MiniMax M2.7 medium | MiniMax | 1 | 7.9 | 2/4 | 40.3s |
| #136 | Elephant Alpha medium | OpenRouter | 2 | 6.6 | 2/4 | 1.19s |
| #137 | Elephant Alpha none | OpenRouter | 1 | 6.6 | 2/4 | 0.963s |
| #138 | Ling-2.6-flash none | InclusionAI | 1 | 6.8 | 2/4 | 11.8s |
| #149 | Nemotron 3 Nano Omni 30b A3b Reasoning medium | NVIDIA | 1 | 6.4 | 2/4 | 1.20s |
| #8 | Claude Opus 4.7 none | Anthropic | 1 | 8.3 | 3/4 | 2.12s |
| #11 | Claude Opus 4.7 medium | Anthropic | 1 | 8.3 | 3/4 | 1.85s |
| #13 | Grok 4.20 Beta medium | xAI | 1 | 8.7 | 3/4 | 3.16s |