AI BENCHY
Domain specific: Wrong answer

See which AI models most often return a wrong answer on Domain specific tests, so you can spot weak points faster. Sorted by average response time, ascending.

Models Shown: 15
Total Failures: 314
Most Affected Model: GLM 5 2

| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #147 | GPT-4o-mini (none) | OpenAI | 3 | 3.0 | 0/3 | 637ms |
| #150 | Qwen3 Coder Next (medium) | Qwen | 2 | 5.3 | 1/3 | 638ms |
| #127 | Grok 4.20 (none) | X AI | 2 | 3.0 | 0/3 | 687ms |
| #104 | Nemotron 3 Ultra 550b A55b (none) | NVIDIA | 2 | 5.3 | 1/3 | 698ms |
| #122 | GLM 4.7 Flash (none) | Z.ai | 1 | 7.7 | 2/3 | 744ms |
| #143 | MiMo-V2.5 (none) | Xiaomi | 3 | 3.0 | 0/3 | 756ms |
| #90 | Gemini 3.1 Flash Lite (none) | Google | 3 | 2.9 | 0/3 | 762ms |
| #88 | Qwen3.7 Plus (none) | Qwen | 3 | 3.0 | 0/3 | 868ms |
| #151 | Trinity Large Preview (none) | Arcee AI | 2 | 5.3 | 1/3 | 877ms |
| #123 | MiMo-V2.5-Pro (none) | Xiaomi | 2 | 5.3 | 1/3 | 877ms |
| #108 | Qwen3.5-Flash (none) | Qwen | 1 | 7.7 | 2/3 | 905ms |
| #136 | Elephant Alpha (medium) | Openrouter | 3 | 3.0 | 0/3 | 925ms |
| #148 | GPT-5.4 Nano (none) | OpenAI | 3 | 2.9 | 0/3 | 926ms |
| #137 | Elephant Alpha (none) | Openrouter | 3 | 3.0 | 0/3 | 927ms |
| #144 | GPT-5.4 Mini (none) | OpenAI | 3 | 3.5 | 0/3 | 937ms |

Charts (not reproduced here): Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.