AI BENCHY
❤️ Made by XCS

AI BENCHY Category Failures

Category: Domain specific
Failure type: Wrong answer

See which AI models most often return a wrong answer on Domain specific tests, so you can spot weak points faster. Sorted by Response Time (avg), slowest first.

Models Shown: 53
Total Failures: 98
Most Affected Model: MiniMax M2.5 (2)

Rank | Model | Company | Wrong answer Count | Category Score | Tests Correct | Response Time (avg)
#43 | MiniMax M2.5 medium | Minimax | 2 | 10.0 | 0/3 | 237.3s
#34 | GPT-5 Nano medium | OpenAI | 1 | 4.0 | 1/3 | 204.0s
#52 | GLM 4.7 Flash medium | Z.ai | 2 | 10.0 | 0/3 | 174.6s
#13 | Step 3.5 Flash medium | Stepfun | 2 | 4.0 | 1/3 | 170.5s
#24 | Qwen3.5-Flash medium | Qwen | 1 | 4.0 | 1/3 | 146.5s
#28 | Kimi K2.5 medium | Moonshot AI | 2 | 10.0 | 0/3 | 137.3s
#8 | Gemini 3.1 Flash Lite Preview high | Google | 2 | 4.0 | 1/3 | 127.6s
#30 | Grok 4.1 Fast medium | X AI | 1 | 4.0 | 1/3 | 121.8s
#21 | MiMo-V2-Flash medium | Xiaomi | 2 | 4.0 | 1/3 | 96.0s
#35 | Qwen3.5-35B-A3B medium | Qwen | 1 | 10.0 | 0/3 | 88.3s
#26 | Claude Opus 4.6 medium | Anthropic | 1 | 10.0 | 0/3 | 83.4s
#7 | Qwen3.5-27B medium | Qwen | 1 | 4.0 | 1/3 | 79.5s
#27 | GPT-5.2 medium | OpenAI | 1 | 4.0 | 1/3 | 77.8s
#9 | GPT-5.4 medium | OpenAI | 2 | 4.0 | 1/3 | 74.3s
#3 | GPT-5.3-Codex medium | OpenAI | 2 | 4.0 | 1/3 | 64.3s
#10 | Qwen3.5-122B-A10B medium | Qwen | 3 | 10.0 | 0/3 | 63.4s
#39 | gpt-oss-120b medium | OpenAI | 3 | 10.0 | 0/3 | 50.9s
#32 | GPT-5 Mini medium | OpenAI | 2 | 10.0 | 0/3 | 44.6s
#18 | DeepSeek V3.2 medium | DeepSeek | 1 | 4.0 | 1/3 | 39.3s
#16 | Gemini 2.5 Flash medium | Google | 2 | 4.0 | 1/3 | 37.3s
#2 | Gemini 3.1 Pro Preview medium | Google | 1 | 7.0 | 2/3 | 32.7s
#15 | GPT-5.2 Chat none | OpenAI | 2 | 4.0 | 1/3 | 17.8s
#4 | Qwen3.5 Plus 2026-02-15 medium | Qwen | 1 | 4.0 | 1/3 | 17.5s
#19 | GPT-5.3 Chat none | OpenAI | 3 | 10.0 | 0/3 | 13.0s
#5 | Gemini 3 Flash Preview low | Google | 2 | 4.0 | 1/3 | 8.05s
#6 | Gemini 3 Pro Preview medium | Google | 2 | 4.0 | 1/3 | 7.01s
#36 | Mercury 2 medium | Inception | 3 | 10.0 | 0/3 | 6.48s
#46 | Kimi K2.5 none | Moonshot AI | 2 | 4.0 | 1/3 | 4.38s
#12 | Gemini 3.1 Flash Lite Preview medium | Google | 3 | 10.0 | 0/3 | 4.21s
#25 | Claude Sonnet 4.6 none | Anthropic | 1 | 7.0 | 2/3 | 3.54s
#17 | Gemini 3.1 Flash Lite Preview low | Google | 2 | 4.0 | 1/3 | 2.36s
#31 | GLM 5 none | Z.ai | 3 | 10.0 | 0/3 | 2.24s
#33 | DeepSeek V3.2 none | DeepSeek | 3 | 10.0 | 0/3 | 1.61s
#29 | Qwen3.5 Plus 2026-02-15 none | Qwen | 2 | 4.0 | 1/3 | 1.17s
#44 | GPT-5.4 none | OpenAI | 2 | 4.0 | 1/3 | 1.07s
#53 | Grok 4.1 Fast none | X AI | 2 | 4.0 | 1/3 | 1.06s
#20 | Gemini 3 Flash Preview none | Google | 1 | 7.0 | 2/3 | 963ms
#48 | Qwen3 Coder Next none | Qwen | 2 | 4.0 | 1/3 | 962ms
#22 | Gemini 3.1 Flash Lite Preview none | Google | 2 | 4.0 | 1/3 | 942ms
#37 | Qwen3.5-Flash none | Qwen | 1 | 7.0 | 2/3 | 905ms
#45 | Trinity Large Preview none | Arcee AI | 2 | 4.0 | 1/3 | 877ms
#49 | GLM 4.7 Flash none | Z.ai | 1 | 7.0 | 2/3 | 744ms
#50 | Qwen3 Coder Next medium | Qwen | 2 | 4.0 | 1/3 | 638ms
#47 | GPT-4o-mini none | OpenAI | 3 | 10.0 | 0/3 | 637ms
#54 | MiMo-V2-Flash none | Xiaomi | 2 | 4.0 | 1/3 | 564ms
#41 | Qwen3.5-27B none | Qwen | 3 | 10.0 | 0/3 | 540ms
#51 | Mercury 2 none | Inception | 2 | 4.0 | 1/3 | 534ms
#38 | Gemini 2.5 Flash none | Google | 2 | 4.0 | 1/3 | 495ms
#42 | Qwen3.5-35B-A3B none | Qwen | 1 | 7.0 | 2/3 | 485ms
#40 | Qwen3.5-122B-A10B none | Qwen | 2 | 4.0 | 1/3 | 465ms
#55 | LFM2-24B-A2B none | Liquid | 1 | 4.0 | 1/3 | 287ms
#11 | Claude Sonnet 4.6 medium | Anthropic | 1 | 10.0 | 0/3 | 0ms
#14 | GLM 5 medium | Z.ai | 2 | 10.0 | 0/3 | 0ms
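
The Response Time (avg) column mixes seconds (`237.3s`) and milliseconds (`963ms`), so reproducing the descending sort requires converting both to one unit first. The following is a minimal sketch of that step, not the site's actual implementation: the helper name `to_seconds` and the tuple layout are assumptions made for illustration, and the sample rows are copied from the table above.

```python
def to_seconds(value: str) -> float:
    """Convert a response-time string such as '963ms' or '8.05s' to seconds."""
    value = value.strip()
    if value.endswith("ms"):          # check 'ms' first, since it also ends in 's'
        return float(value[:-2]) / 1000.0
    if value.endswith("s"):
        return float(value[:-1])
    raise ValueError(f"unrecognized time format: {value!r}")


# A few rows copied from the table: (model, wrong-answer count, response time).
rows = [
    ("Gemini 3 Flash Preview none", 1, "963ms"),
    ("MiniMax M2.5 medium", 2, "237.3s"),
    ("Gemini 3 Flash Preview low", 2, "8.05s"),
    ("Grok 4.1 Fast none", 2, "1.06s"),
]

# Slowest first, matching the "Response Time (avg) ↓" ordering of the page.
by_time_desc = sorted(rows, key=lambda r: to_seconds(r[2]), reverse=True)

# Sum of wrong answers for the sampled rows only (not the page-wide total).
total_wrong = sum(count for _, count, _ in rows)

for model, count, time_str in by_time_desc:
    print(f"{model}: {count} wrong, {to_seconds(time_str):.3f}s")
```

Applying the same sum to the Wrong answer Count column of all 53 rows gives the Total Failures figure shown at the top of the page.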

Charts: Top Models by Wrong answer Count; Wrong answer Count vs Avg Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.