AI BENCHY Category Failures
Anti-AI Tricks: Wrong answer
See which AI models are most likely to return a wrong answer on Anti-AI Tricks tests, so you can spot weak points faster. Rows are sorted by Tests Correct, highest first.
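The ordering can be reproduced with a short script. This is a minimal sketch, not the site's implementation: it copies four rows from the table, and the tie-break on ascending rank is an assumption inferred from the visible rows rather than something the page states. The names `rows` and `sort_key` are hypothetical.

```python
from fractions import Fraction

# A few rows copied from the table: (rank, model, tests_correct)
rows = [
    (67, "MiniMax M3 medium", "1/4"),
    (137, "Elephant Alpha none", "2/4"),
    (103, "DeepSeek V4 Pro high", "2/4"),
    (77, "Claude Sonnet 4.6 none", "1/4"),
]


def sort_key(row):
    """Sort by Tests Correct descending, then by rank ascending (assumed tie-break)."""
    rank, _model, tests_correct = row
    passed, total = (int(part) for part in tests_correct.split("/"))
    # Negating the pass fraction makes higher fractions sort first.
    return (-Fraction(passed, total), rank)


ordered = sorted(rows, key=sort_key)
for rank, model, tests_correct in ordered:
    print(f"#{rank} {model} {tests_correct}")
```

Comparing the pass fraction rather than the raw string keeps the ordering correct if categories with a different number of tests are mixed in.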
| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #103 | DeepSeek V4 Pro high | DeepSeek | 1 | 6.4 | 2/4 | 16.5s |
| #107 | Laguna Xs.2 medium | Poolside | 1 | 6.9 | 2/4 | 2.68s |
| #126 | gpt-oss-120b none | OpenAI | 1 | 6.5 | 2/4 | 32.8s |
| #130 | MiniMax M2.7 medium | Minimax | 1 | 7.9 | 2/4 | 40.3s |
| #136 | Elephant Alpha medium | Openrouter | 2 | 6.6 | 2/4 | 1.19s |
| #137 | Elephant Alpha none | Openrouter | 1 | 6.6 | 2/4 | 0.963s |
| #138 | Ling-2.6-flash none | Inclusionai | 1 | 6.8 | 2/4 | 11.8s |
| #149 | Nemotron 3 Nano Omni 30b A3b Reasoning medium | NVIDIA | 1 | 6.4 | 2/4 | 1.20s |
| #67 | MiniMax M3 medium | Minimax | 2 | 5.5 | 1/4 | 14.9s |
| #74 | Qwen3.6 Max Preview none | Qwen | 3 | 5.2 | 1/4 | 2.63s |
| #77 | Claude Sonnet 4.6 none | Anthropic | 1 | 4.8 | 1/4 | 2.94s |
| #95 | Qwen3.5 Plus 2026-02-15 none | Qwen | 3 | 4.8 | 1/4 | 1.91s |
| #98 | GLM 5 none | Z.ai | 3 | 4.8 | 1/4 | 2.37s |
| #109 | GLM 5V Turbo none | Z.ai | 3 | 4.8 | 1/4 | 3.13s |
| #111 | Owl Alpha medium | Openrouter | 3 | 4.8 | 1/4 | 3.97s |