AI BENCHY
AI BENCHY Category Failures

Domain specific: Wrong answer

This page lists the AI models most likely to return a wrong answer on Domain specific tests, so you can spot weak points faster. Sorted by: Tests Correct (ascending).

Models Shown: 15
Total Failures: 314
Most Affected Model: Qwen3.6 Max Preview 3

| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
|------|-------|---------|--------------------|----------------|---------------|---------------------|
| #119 | Cobuddy medium | Baidu | 3 | 2.9 | 0/3 | 128.2s |
| #126 | gpt-oss-120b none | OpenAI | 3 | 3.0 | 0/3 | 35.0s |
| #127 | Grok 4.20 none | X AI | 2 | 3.0 | 0/3 | 687ms |
| #129 | MiniMax M2.5 medium | Minimax | 2 | 2.9 | 0/3 | 237.3s |
| #130 | MiniMax M2.7 medium | Minimax | 1 | 3.0 | 0/3 | 19.0s |
| #133 | DeepSeek V3.2 none | DeepSeek | 2 | 2.9 | 0/3 | 4.17s |
| #136 | Elephant Alpha medium | Openrouter | 3 | 3.0 | 0/3 | 925ms |
| #137 | Elephant Alpha none | Openrouter | 3 | 3.0 | 0/3 | 927ms |
| #138 | Ling-2.6-flash none | Inclusionai | 3 | 3.0 | 0/3 | 4.95s |
| #141 | Nemotron 3 Super none | NVIDIA | 3 | 3.6 | 0/3 | 6.23s |
| #143 | MiMo-V2.5 none | Xiaomi | 3 | 3.0 | 0/3 | 756ms |
| #144 | GPT-5.4 Mini none | OpenAI | 3 | 3.5 | 0/3 | 937ms |
| #145 | Laguna M.1 none | Poolside | 3 | 3.6 | 0/3 | 5.50s |
| #147 | GPT-4o-mini none | OpenAI | 3 | 3.0 | 0/3 | 637ms |
| #148 | GPT-5.4 Nano none | OpenAI | 3 | 2.9 | 0/3 | 926ms |
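
The Response Time (avg) column mixes milliseconds and seconds, so the values cannot be compared as plain strings. The sketch below shows one way to normalize them before sorting. It uses a handful of rows copied from the table; the names `ROWS`, `to_seconds`, and `fastest_first` are hypothetical and not part of AI BENCHY, and the parser assumes only the `ms` and `s` suffixes seen here.

```python
import re

# A few rows copied from the table: (model, wrong_answer_count, response_time).
ROWS = [
    ("Cobuddy medium", 3, "128.2s"),
    ("Grok 4.20 none", 2, "687ms"),
    ("MiniMax M2.7 medium", 1, "19.0s"),
    ("DeepSeek V3.2 none", 2, "4.17s"),
    ("GPT-4o-mini none", 3, "637ms"),
]


def to_seconds(text: str) -> float:
    """Convert a response time such as '687ms' or '4.17s' to seconds.

    Assumption: only the 'ms' and 's' suffixes appear, as in the table.
    """
    match = re.fullmatch(r"([0-9.]+)(ms|s)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized response time: {text!r}")
    value, unit = float(match.group(1)), match.group(2)
    return value / 1000.0 if unit == "ms" else value


def fastest_first(rows):
    """Return the rows ordered by average response time, fastest first."""
    return sorted(rows, key=lambda row: to_seconds(row[2]))


if __name__ == "__main__":
    for model, wrong, rt in fastest_first(ROWS):
        print(f"{model:<22} wrong={wrong} time={to_seconds(rt):.3f}s")
```

The same key-function approach would work for any other column, for example sorting by Wrong answer Count with `key=lambda row: row[1]`.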

Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.