AI BENCHY

Puzzle Solving: Wrong answer

See which AI models most often give a wrong answer on Puzzle Solving tests, so you can spot weak points faster.

Models Shown: 15
Total Failures: 147
Most Affected Model: Qwen3.5-Flash (3)

| Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg) |
|------|-------|---------|--------------------|----------------|---------------|---------------------|
| #122 | GLM 4.7 Flash none | Z.ai | 1 | 6.4 | 1/3 | 1.20s |
| #123 | MiMo-V2.5-Pro none | Xiaomi | 1 | 6.7 | 1/3 | 1.30s |
| #125 | GPT-5.4 none | OpenAI | 1 | 5.6 | 1/3 | 1.44s |
| #126 | gpt-oss-120b none | OpenAI | 1 | 6.0 | 1/3 | 8.21s |
| #129 | MiniMax M2.5 medium | Minimax | 1 | 5.3 | 1/3 | 11.2s |
| #130 | MiniMax M2.7 medium | Minimax | 1 | 5.9 | 1/3 | 24.9s |
| #134 | GLM 5 Turbo none | Z.ai | 1 | 5.5 | 1/3 | 2.65s |
| #136 | Elephant Alpha medium | Openrouter | 1 | 5.3 | 1/3 | 868ms |
| #139 | DeepSeek V4 Flash none | DeepSeek | 1 | 3.1 | 0/3 | 23.7s |
| #141 | Nemotron 3 Super none | NVIDIA | 1 | 5.5 | 1/3 | 2.36s |
| #143 | MiMo-V2.5 none | Xiaomi | 1 | 5.4 | 1/3 | 2.13s |
| #144 | GPT-5.4 Mini none | OpenAI | 1 | 5.4 | 1/3 | 836ms |
| #146 | Laguna Xs.2 none | Poolside | 1 | 5.3 | 1/3 | 650ms |
| #148 | GPT-5.4 Nano none | OpenAI | 1 | 5.4 | 1/3 | 1.25s |
| #153 | Qwen3.6 35B A3B none | Qwen | 1 | 3.2 | 0/3 | 1.07s |
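
The Response Time (avg) column mixes milliseconds and seconds, so the values need to be put on one unit before rows can be compared or sorted. The following is a minimal Python sketch of that normalization, assuming a handful of rows copied by hand from the listing above. The helper name `to_seconds` and the tuple layout are hypothetical and are not part of AI BENCHY.

```python
def to_seconds(value: str) -> float:
    """Convert a response-time string such as '868ms' or '1.20s' to seconds."""
    text = value.strip()
    # Check the 'ms' suffix first, because 'ms' also ends with 's'.
    if text.endswith("ms"):
        return float(text[:-2]) / 1000.0
    if text.endswith("s"):
        return float(text[:-1])
    raise ValueError(f"unrecognized time format: {value!r}")


# (model, tests correct, average response time), copied from a few listed rows.
rows = [
    ("GLM 4.7 Flash", "1/3", "1.20s"),
    ("Elephant Alpha", "1/3", "868ms"),
    ("Laguna Xs.2", "1/3", "650ms"),
    ("MiniMax M2.7", "1/3", "24.9s"),
]

# Sort by normalized response time, fastest first.
fastest_first = sorted(rows, key=lambda row: to_seconds(row[2]))

for model, tests_correct, time_text in fastest_first:
    print(f"{model}: {to_seconds(time_text):.3f}s ({tests_correct} correct)")
```

Sorting on the raw strings would order "868ms" after "24.9s", which is why the conversion happens inside the sort key.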

Charts (not reproduced here): Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.