AI BENCHY

AI BENCHY Category Failures

Domain specific: Wrong answer

See which AI models most often give a wrong answer on Domain specific tests, so you can spot weak points faster. Sorted by Tests Correct, descending.

Models Shown: 15
Total Failures: 314
Most Affected Model: Gemini 3.5 Flash 1

| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #130 | MiniMax M2.7 medium | Minimax | 1 | 3.0 | 0/3 | 19.0s |
| #133 | DeepSeek V3.2 none | DeepSeek | 2 | 2.9 | 0/3 | 4.17s |
| #136 | Elephant Alpha medium | Openrouter | 3 | 3.0 | 0/3 | 925ms |
| #137 | Elephant Alpha none | Openrouter | 3 | 3.0 | 0/3 | 927ms |
| #138 | Ling-2.6-flash none | Inclusionai | 3 | 3.0 | 0/3 | 4.95s |
| #141 | Nemotron 3 Super none | NVIDIA | 3 | 3.6 | 0/3 | 6.23s |
| #143 | MiMo-V2.5 none | Xiaomi | 3 | 3.0 | 0/3 | 756ms |
| #144 | GPT-5.4 Mini none | OpenAI | 3 | 3.5 | 0/3 | 937ms |
| #145 | Laguna M.1 none | Poolside | 3 | 3.6 | 0/3 | 5.50s |
| #147 | GPT-4o-mini none | OpenAI | 3 | 3.0 | 0/3 | 637ms |
| #148 | GPT-5.4 Nano none | OpenAI | 3 | 2.9 | 0/3 | 926ms |
| #149 | Nemotron 3 Nano Omni 30b A3b Reasoning medium | NVIDIA | 2 | 2.9 | 0/3 | 56.7s |
| #153 | Qwen3.6 35B A3B none | Qwen | 3 | 3.5 | 0/3 | 7.45s |
| #154 | Qwen3.5-9B none | Qwen | 3 | 3.0 | 0/3 | 464ms |
| #156 | Hy3 preview none | Tencent | 2 | 3.6 | 0/3 | 17.6s |

Charts (not shown): Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost