AI BENCHY
Domain specific: Wrong answer

See which AI models most often give a wrong answer on Domain specific tests, so you can spot weak points faster.

Models Shown: 15
Total Failures: 314
Most Affected Model: Qwen3.6 Max Preview 3

| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #38 | Grok 4.3 medium | X AI | 2 | 5.3 | 1/3 | 181.7s |
| #46 | Qwen3.6 35B A3B medium | Qwen | 2 | 5.3 | 1/3 | 22.5s |
| #50 | Gemini 3.1 Flash Lite Preview low | Google | 2 | 5.3 | 1/3 | 2.36s |
| #54 | GPT-5 Mini medium | OpenAI | 2 | 3.6 | 0/3 | 44.6s |
| #57 | Step 3.7 Flash low | Stepfun | 2 | 5.3 | 1/3 | 43.3s |
| #58 | Gemini 3.1 Flash Lite Preview none | Google | 2 | 5.3 | 1/3 | 942ms |
| #59 | GLM 5V Turbo medium | Z.ai | 2 | 5.3 | 1/3 | 38.1s |
| #61 | Gemini 3.1 Flash Lite low | Google | 2 | 5.3 | 1/3 | 1.52s |
| #62 | Step 3.5 Flash medium | Stepfun | 2 | 5.3 | 1/3 | 170.5s |
| #64 | MiMo-V2-Flash medium | Xiaomi | 2 | 5.9 | 1/3 | 96.0s |
| #68 | Claude Opus 4.8 none | Anthropic | 2 | 5.3 | 1/3 | 1.66s |
| #70 | GPT-5.4 Nano medium | OpenAI | 2 | 5.9 | 1/3 | 38.2s |
| #71 | Step 3.7 Flash high | Stepfun | 2 | 4.1 | 0/3 | 149.6s |
| #72 | DeepSeek V3.2 medium | DeepSeek | 2 | 2.9 | 0/3 | 24.3s |
| #76 | Kimi K2.5 medium | Moonshot AI | 2 | 3.5 | 0/3 | 137.3s |

Charts (not reproduced here): Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.