AI BENCHY
❤️ Made by XCS

AI BENCHY Failures

Extra formatting Failures

See which AI models run into the Extra formatting failure most often, so you can spot reliability risks before choosing one. The table is sorted by failure count in ascending order.

Models Shown: 6
Total Failures: 13
Most Affected Model: Claude Opus 4.6 medium (4)

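| Rank | Model | Company | Extra formatting Count | Avg Score | Tests Correct | Response Time (avg) |
|------|-------|---------|------------------------|-----------|---------------|---------------------|
| #48 | Qwen3 Coder Next (none) | Qwen | 1 | 4.0 | 4/16 | 11.7s |
| #54 | MiMo-V2-Flash (none) | Xiaomi | 1 | 2.9 | 3/16 | 2.97s |
| #11 | Claude Sonnet 4.6 (medium) | Anthropic | 2 | 7.7 | 12/16 | 11.2s |
| #33 | DeepSeek V3.2 (none) | DeepSeek | 2 | 5.5 | 7/16 | 12.9s |
| #25 | Claude Sonnet 4.6 (none) | Anthropic | 3 | 6.8 | 10/16 | 5.57s |
| #26 | Claude Opus 4.6 (medium) | Anthropic | 4 | 6.6 | 10/16 | 22.9s |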
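
As a rough cross-check, the summary figures can be recomputed from the table rows. The sketch below is illustrative only: `rows` and `summarize` are hypothetical names, the values are hand-copied from the table, and nothing here is claimed to match how the site itself computes its statistics.

```python
# Rows hand-copied from the table: (model, extra_formatting_count).
# The variable and function names are assumptions made for this example.
rows = [
    ("Qwen3 Coder Next none", 1),
    ("MiMo-V2-Flash none", 1),
    ("Claude Sonnet 4.6 medium", 2),
    ("DeepSeek V3.2 none", 2),
    ("Claude Sonnet 4.6 none", 3),
    ("Claude Opus 4.6 medium", 4),
]


def summarize(rows):
    """Return the model count, the failure total, and the row with the highest count."""
    total = sum(count for _, count in rows)
    top_model, top_count = max(rows, key=lambda row: row[1])
    return {
        "models_shown": len(rows),
        "total_failures": total,
        "highest_count_model": top_model,
        "highest_count": top_count,
    }


summary = summarize(rows)
print(summary)
```

Taking the maximum rather than the first row keeps the result independent of the table's sort order.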

[Charts: Top Models by Extra formatting Count; Extra formatting Count vs Avg Score; Top Models by Response Time (avg)]