AI BENCHY

AI BENCHY Category Failures

Domain specific: Wrong answer

See which AI models most often fail Domain specific tests with a Wrong answer, so you can spot weak points faster. Rows are sorted by Tests Correct, ascending.

Models Shown: 15
Total Failures: 182
Most Affected Model: Qwen3.6 Plus Preview 3

| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
|------|-------|---------|--------------------|----------------|---------------|---------------------|
| #77 | GLM 5 Turbo none | Z.ai | 2 | 5.3 | 1/3 | 1.97s |
| #78 | Trinity Large Preview none | Arcee AI | 2 | 5.3 | 1/3 | 877ms |
| #83 | Mistral Small 4 none | Mistral | 2 | 5.3 | 1/3 | 367ms |
| #87 | Qwen3 Coder Next none | Qwen | 2 | 5.3 | 1/3 | 962ms |
| #91 | Mercury 2 none | Inception | 2 | 5.3 | 1/3 | 534ms |
| #92 | Qwen3 Coder Next medium | Qwen | 2 | 5.3 | 1/3 | 638ms |
| #94 | MiMo-V2-Flash none | Xiaomi | 2 | 5.3 | 1/3 | 564ms |
| #95 | Grok 4.1 Fast none | X AI | 2 | 5.9 | 1/3 | 1.06s |
| #98 | LFM2-24B-A2B none | Liquid | 1 | 5.9 | 1/3 | 287ms |
| #2 | Gemini 3.1 Pro Preview medium | Google | 1 | 7.7 | 2/3 | 32.7s |
| #4 | Claude Opus 4.7 none | Anthropic | 1 | 7.7 | 2/3 | 1.19s |
| #14 | Gemma 4 31B medium | Google | 1 | 7.7 | 2/3 | 38.5s |
| #21 | Gemini 3 Flash Preview none | Google | 1 | 7.7 | 2/3 | 963ms |
| #42 | Claude Sonnet 4.6 none | Anthropic | 1 | 7.7 | 2/3 | 3.54s |
| #48 | Gemma 4 31B none | Google | 1 | 7.7 | 2/3 | 3.22s |

Charts (not reproduced here): Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.