AI BENCHY

AI BENCHY Failures

Wrong answer Failures

See which AI models hit the Wrong answer failure most often, so you can spot reliability risks before choosing one. Rows are sorted by Tests Correct, descending.

Models Shown: 15

Total Failures: 572

Most Affected Model: Gemini 3.1 Pro Preview 1

| Rank | Model | Company | Wrong answer Count | Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #2 | Gemini 3.1 Pro Preview medium | Google | 1 | 9.6 | 17/18 | 16.0s |
| #3 | Claude Opus 4.7 medium | Anthropic | 1 | 9.2 | 16/18 | 3.53s |
| #4 | Claude Opus 4.7 none | Anthropic | 2 | 9.2 | 16/18 | 3.13s |
| #5 | Gemini 3 Flash Preview low | Google | 3 | 8.8 | 15/18 | 6.01s |
| #8 | Qwen3.5 Plus 2026-02-15 medium | Qwen | 2 | 8.5 | 14/18 | 46.6s |
| #12 | Gemini 3 PRO Preview medium | Google | 3 | 8.4 | 14/18 | 9.06s |
| #9 | Qwen3.6 Plus Preview medium | Qwen | 3 | 8.5 | 13/17 | 13.9s |
| #11 | Gemini 3.1 Flash Lite Preview high | Google | 3 | 8.4 | 12/16 | 68.8s |
| #6 | Seed-2.0-Lite medium | Bytedance Seed | 3 | 8.6 | 13/18 | 30.4s |
| #7 | GPT-5.3-Codex medium | OpenAI | 3 | 8.6 | 13/18 | 15.4s |
| #10 | Qwen3.5-27B medium | Qwen | 1 | 8.4 | 13/18 | 53.0s |
| #13 | GLM 5 medium | Z.ai | 2 | 8.4 | 13/18 | 23.3s |
| #14 | Gemma 4 31B medium | Google | 1 | 8.3 | 13/18 | 24.9s |
| #15 | Gemini 2.5 Flash medium | Google | 4 | 8.2 | 13/18 | 12.1s |
| #16 | GPT-5.4 medium | OpenAI | 3 | 8.2 | 13/18 | 18.6s |
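
The row order above appears to follow the fraction of tests passed, not the raw count: 13/17 and 12/16 sit above the 13/18 rows. The page does not document its sort rule, so the following is only a minimal sketch under that assumption. The `pass_rate` helper and the `rows` list are hypothetical names introduced for illustration; the values are copied from four rows of the table.

```python
from fractions import Fraction

# (model, tests_correct) pairs copied from four rows of the table above.
rows = [
    ("Seed-2.0-Lite medium", "13/18"),
    ("Gemini 3.1 Flash Lite Preview high", "12/16"),
    ("Qwen3.6 Plus Preview medium", "13/17"),
    ("Qwen3.5 Plus 2026-02-15 medium", "14/18"),
]


def pass_rate(tests_correct: str) -> Fraction:
    """Turn a 'passed/total' string into an exact fraction (hypothetical helper)."""
    passed, total = tests_correct.split("/")
    return Fraction(int(passed), int(total))


# Assumed sort rule: highest pass fraction first.
ranked = sorted(rows, key=lambda row: pass_rate(row[1]), reverse=True)

for model, tests_correct in ranked:
    print(f"{model}: {tests_correct} = {float(pass_rate(tests_correct)):.3f}")
```

Using `Fraction` avoids floating-point rounding when comparing ratios with different denominators. Under this rule the four sample rows come out in the same relative order as in the table.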

[Charts: Top Models by Wrong answer Count; Wrong answer Count vs Score; Top Models by Response Time (avg)]