AI BENCHY

Domain specific: Wrong answer

See which AI models are most likely to fail with "Wrong answer" in the "Domain specific" category, so you can spot weak points faster. Rows are sorted by average response time, slowest first.

Models Shown: 15
Total Failures: 314
Most Affected Model: MiniMax M2.5 2
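
| Rank | Model | Reasoning | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
|------|-------|-----------|---------|--------------------|----------------|---------------|---------------------|
| #34 | Qwen3.7 Max | none | Qwen | 1 | 7.7 | 2/3 | 975ms |
| #48 | Gemini 3 Flash Preview | none | Google | 1 | 7.7 | 2/3 | 963ms |
| #140 | Qwen3 Coder Next | none | Qwen | 2 | 5.3 | 1/3 | 962ms |
| #58 | Gemini 3.1 Flash Lite Preview | none | Google | 2 | 5.3 | 1/3 | 942ms |
| #144 | GPT-5.4 Mini | none | OpenAI | 3 | 3.5 | 0/3 | 937ms |
| #137 | Elephant Alpha | none | Openrouter | 3 | 3.0 | 0/3 | 927ms |
| #148 | GPT-5.4 Nano | none | OpenAI | 3 | 2.9 | 0/3 | 926ms |
| #136 | Elephant Alpha | medium | Openrouter | 3 | 3.0 | 0/3 | 925ms |
| #108 | Qwen3.5-Flash | none | Qwen | 1 | 7.7 | 2/3 | 905ms |
| #123 | MiMo-V2.5-Pro | none | Xiaomi | 2 | 5.3 | 1/3 | 877ms |
| #151 | Trinity Large Preview | none | Arcee AI | 2 | 5.3 | 1/3 | 877ms |
| #88 | Qwen3.7 Plus | none | Qwen | 3 | 3.0 | 0/3 | 868ms |
| #90 | Gemini 3.1 Flash Lite | none | Google | 3 | 2.9 | 0/3 | 762ms |
| #143 | MiMo-V2.5 | none | Xiaomi | 3 | 3.0 | 0/3 | 756ms |
| #122 | GLM 4.7 Flash | none | Z.ai | 1 | 7.7 | 2/3 | 744ms |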
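
The columns in the table above appear to be related in a simple way: each model's Wrong answer Count equals its total tests minus its correct tests, and the rows run from slowest to fastest average response time. The Python sketch below checks both properties against the 15 rows shown. It is only an illustration: the tuple layout and the helper names (`parse_fraction`, `wrong_matches_tests`, `is_sorted_desc`) are hypothetical, and reading the `none`/`medium` value as a reasoning setting is an assumption, since the page does not label that field.

```python
# Rows transcribed from the table above.
# Layout (hypothetical): (model, setting, wrong_count, tests_correct, avg_ms)
# "setting" is the unlabeled none/medium value; its meaning is assumed.
ROWS = [
    ("Qwen3.7 Max", "none", 1, "2/3", 975),
    ("Gemini 3 Flash Preview", "none", 1, "2/3", 963),
    ("Qwen3 Coder Next", "none", 2, "1/3", 962),
    ("Gemini 3.1 Flash Lite Preview", "none", 2, "1/3", 942),
    ("GPT-5.4 Mini", "none", 3, "0/3", 937),
    ("Elephant Alpha", "none", 3, "0/3", 927),
    ("GPT-5.4 Nano", "none", 3, "0/3", 926),
    ("Elephant Alpha", "medium", 3, "0/3", 925),
    ("Qwen3.5-Flash", "none", 1, "2/3", 905),
    ("MiMo-V2.5-Pro", "none", 2, "1/3", 877),
    ("Trinity Large Preview", "none", 2, "1/3", 877),
    ("Qwen3.7 Plus", "none", 3, "0/3", 868),
    ("Gemini 3.1 Flash Lite", "none", 3, "0/3", 762),
    ("MiMo-V2.5", "none", 3, "0/3", 756),
    ("GLM 4.7 Flash", "none", 1, "2/3", 744),
]


def parse_fraction(text):
    """Split a 'correct/total' string such as '2/3' into two integers."""
    correct, total = text.split("/")
    return int(correct), int(total)


def wrong_matches_tests(row):
    """True when the wrong-answer count equals total tests minus correct tests."""
    _, _, wrong, tests, _ = row
    correct, total = parse_fraction(tests)
    return wrong == total - correct


def is_sorted_desc(values):
    """True when every value is greater than or equal to the one after it."""
    return all(a >= b for a, b in zip(values, values[1:]))


# Every row should satisfy wrong = total - correct.
all_consistent = all(wrong_matches_tests(row) for row in ROWS)

# The listing should run from slowest to fastest average response time.
sorted_by_time = is_sorted_desc([row[4] for row in ROWS])

# Wrong answers summed over the shown rows only.
shown_wrong_total = sum(row[2] for row in ROWS)

print(all_consistent, sorted_by_time, shown_wrong_total)
```

The sum computed here covers only the rows shown on this page.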

Charts (not reproduced here): Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.