AI BENCHY

AI BENCHY Category Failures

Domain specific: Wrong answer

See which AI models most often return a wrong answer on Domain specific tests, so you can spot weak points faster.

Models Shown: 15

Total Failures: 314

Most Affected Model: Qwen3.6 Max Preview 3

| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
| --- | --- | --- | --- | --- | --- | --- |
| #137 | Elephant Alpha none | Openrouter | 3 | 3.0 | 0/3 | 927ms |
| #138 | Ling-2.6-flash none | Inclusionai | 3 | 3.0 | 0/3 | 4.95s |
| #141 | Nemotron 3 Super none | NVIDIA | 3 | 3.6 | 0/3 | 6.23s |
| #143 | MiMo-V2.5 none | Xiaomi | 3 | 3.0 | 0/3 | 756ms |
| #144 | GPT-5.4 Mini none | OpenAI | 3 | 3.5 | 0/3 | 937ms |
| #145 | Laguna M.1 none | Poolside | 3 | 3.6 | 0/3 | 5.50s |
| #147 | GPT-4o-mini none | OpenAI | 3 | 3.0 | 0/3 | 637ms |
| #148 | GPT-5.4 Nano none | OpenAI | 3 | 2.9 | 0/3 | 926ms |
| #153 | Qwen3.6 35B A3B none | Qwen | 3 | 3.5 | 0/3 | 7.45s |
| #154 | Qwen3.5-9B none | Qwen | 3 | 3.0 | 0/3 | 464ms |
| #159 | Ling-2.6-1T none | Inclusionai | 3 | 3.0 | 0/3 | 1.04s |
| #162 | Nemotron 3 Nano Omni 30b A3b Reasoning none | NVIDIA | 3 | 3.6 | 0/3 | 489ms |
| #163 | Granite 4.1 8B none | IBM Granite | 3 | 3.0 | 0/3 | 357ms |
| #5 | Qwen3.7 Max medium | Qwen | 2 | 5.9 | 1/3 | 24.9s |
| #6 | GPT-5.5 low | OpenAI | 2 | 5.3 | 1/3 | 28.1s |

Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost