AI BENCHY
❤️ Made by XCS

AI BENCHY Failures

Extra formatting Failures

See which AI models produce Extra formatting failures most often, so you can spot reliability risks before choosing one. The table is sorted by Avg Score, ascending.

Models Shown: 6
Total Failures: 13

Most Affected Model: Claude Opus 4.6 medium (4)
Rank  Model                     Company    Extra formatting Count  Avg Score  Tests Correct  Response Time (avg)
#54   MiMo-V2-Flash none        Xiaomi     1                       2.9        3/16           2.97s
#48   Qwen3 Coder Next none     Qwen       1                       4.0        4/16           11.7s
#33   DeepSeek V3.2 none        DeepSeek   2                       5.5        7/16           12.9s
#26   Claude Opus 4.6 medium    Anthropic  4                       6.6        10/16          22.9s
#25   Claude Sonnet 4.6 none    Anthropic  3                       6.8        10/16          5.57s
#11   Claude Sonnet 4.6 medium  Anthropic  2                       7.7        12/16          11.2s
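
As a quick sanity check, the summary figures can be recomputed from the table rows. The sketch below is illustrative only: the `rows` list and variable names are assumptions made for this example, not part of AI BENCHY, and the values are copied by hand from the table above.

```python
# Rows copied from the table above: (model, extra_formatting_count, avg_score).
# The tuple layout is an assumption made for this example.
rows = [
    ("MiMo-V2-Flash none", 1, 2.9),
    ("Qwen3 Coder Next none", 1, 4.0),
    ("DeepSeek V3.2 none", 2, 5.5),
    ("Claude Opus 4.6 medium", 4, 6.6),
    ("Claude Sonnet 4.6 none", 3, 6.8),
    ("Claude Sonnet 4.6 medium", 2, 7.7),
]

# Number of models listed and the sum of their Extra formatting counts.
models_shown = len(rows)
total_failures = sum(count for _, count, _ in rows)

# Re-sort by Avg Score ascending to confirm the table's ordering.
sorted_by_score = sorted(rows, key=lambda r: r[2])

print(models_shown)             # 6
print(total_failures)           # 13
print(sorted_by_score == rows)  # True
```

The counts sum to 13 across 6 models, and the rows are already in ascending Avg Score order.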

Charts: Top Models by Extra formatting Count; Extra formatting Count vs Avg Score; Top Models by Response Time (avg).