AI BENCHY

AI BENCHY Category Failures

Domain specific: Wrong answer

See which AI models are most likely to hit the Wrong answer failure in the Domain specific category, so you can spot weak points faster. The listing is sorted by Tests Correct, descending.

Models Shown: 15
Total Failures: 182
Most Affected Model: Gemini 3.1 Pro Preview 1

| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #12 | Gemini 3 PRO Preview medium | Google | 2 | 5.3 | 1/3 | 7.01s |
| #15 | Gemini 2.5 Flash medium | Google | 2 | 5.9 | 1/3 | 37.3s |
| #16 | GPT-5.4 medium | OpenAI | 2 | 5.3 | 1/3 | 74.3s |
| #22 | Gemini 3.1 Flash Lite Preview low | Google | 2 | 5.3 | 1/3 | 2.36s |
| #23 | MiMo-V2-Pro medium | Xiaomi | 1 | 5.3 | 1/3 | 6.00s |
| #25 | Grok 4.20 Beta medium | X AI | 2 | 5.3 | 1/3 | 21.3s |
| #27 | DeepSeek V3.2 medium | DeepSeek | 1 | 5.3 | 1/3 | 39.3s |
| #28 | GPT-5.2 Chat none | OpenAI | 2 | 5.3 | 1/3 | 17.8s |
| #29 | Gemini 3.1 Flash Lite Preview none | Google | 2 | 5.3 | 1/3 | 942ms |
| #30 | Step 3.5 Flash medium | Stepfun | 2 | 5.3 | 1/3 | 170.5s |
| #31 | GLM 5V Turbo medium | Z.ai | 2 | 5.3 | 1/3 | 38.1s |
| #32 | Qwen3.5-Flash medium | Qwen | 1 | 5.3 | 1/3 | 146.5s |
| #33 | GLM 5.1 medium | Z.ai | 1 | 5.3 | 1/3 | 29.8s |
| #38 | GPT-5.4 Nano medium | OpenAI | 2 | 5.9 | 1/3 | 38.2s |
| #40 | GPT-5.2 medium | OpenAI | 1 | 5.9 | 1/3 | 77.8s |

Charts (not reproduced here): Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.