AI BENCHY

Instructions following: Wrong answer

See which AI models most often fail with "Wrong answer" in the Instructions following category, so you can spot weak points faster.

Models Shown: 15
Total Failures: 44
Most Affected Model: Qwen3.5-27B (2)

Rank | Model                       | Company     | Wrong answer Count | Category Score | Tests Correct | Response Time (avg)
-----|-----------------------------|-------------|--------------------|----------------|---------------|--------------------
#62  | Gemini 2.5 Flash none       | Google      | 1                  | 8.0            | 1/2           | 672ms
#63  | Qwen3.5-35B-A3B none        | Qwen        | 1                  | 6.3            | 1/2           | 809ms
#65  | MiMo-V2-Pro none            | Xiaomi      | 1                  | 6.5            | 1/2           | 2.51s
#66  | GPT-5.4 none                | OpenAI      | 1                  | 6.5            | 1/2           | 1.07s
#69  | Kimi K2.6 none              | Moonshot AI | 1                  | 6.5            | 1/2           | 1.64s
#72  | Hunter Alpha none           | OpenRouter  | 1                  | 6.4            | 1/2           | 2.82s
#73  | Mistral Small 4 medium      | Mistral     | 1                  | 7.3            | 1/2           | 1.38s
#74  | GLM 4.7 Flash none          | Z.ai        | 1                  | 6.5            | 1/2           | 888ms
#76  | Kimi K2.5 none              | Moonshot AI | 1                  | 6.5            | 1/2           | 2.67s
#77  | GLM 5 Turbo none            | Z.ai        | 1                  | 6.5            | 1/2           | 2.13s
#78  | Trinity Large Preview none  | Arcee AI    | 1                  | 4.1            | 0/2           | 1.09s
#79  | Grok 4.20 Beta none         | X AI        | 1                  | 4.8            | 0/2           | 687ms
#80  | MiniMax M2.7 medium         | Minimax     | 1                  | 3.7            | 0/2           | 12.6s
#82  | Grok 4.20 none              | X AI        | 1                  | 4.8            | 0/2           | 455ms
#83  | Mistral Small 4 none        | Mistral     | 1                  | 6.5            | 1/2           | 380ms

[Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost]