AI BENCHY


Extra formatting Failures

See which AI models hit Extra formatting failures most often, so you can spot reliability risks before choosing one. Models are sorted by average response time, slowest first.

Models Shown: 15

Total Failures: 23

Most Affected Model: Claude Opus 4.6 medium (4)

| Rank | Model | Company | Extra formatting Count | Score | Tests Correct | Response Time (avg) |
|------|-------|---------|------------------------|-------|---------------|---------------------|
| #97 | Qwen3.5-9B medium | Qwen | 1 | 4.4 | 3/18 | 73.6s |
| #10 | Qwen3.5-27B medium | Qwen | 1 | 8.4 | 13/18 | 53.0s |
| #41 | MiMo-V2-Flash medium | Xiaomi | 1 | 7.5 | 11/18 | 23.4s |
| #37 | Claude Opus 4.6 medium | Anthropic | 4 | 7.6 | 12/18 | 21.1s |
| #35 | MiMo-V2-Omni medium | Xiaomi | 1 | 7.7 | 11/18 | 16.8s |
| #26 | Claude Sonnet 4.6 medium | Anthropic | 2 | 8.0 | 13/18 | 12.7s |
| #23 | MiMo-V2-Pro medium | Xiaomi | 1 | 8.1 | 12/18 | 12.3s |
| #64 | DeepSeek V3.2 none | DeepSeek | 2 | 6.1 | 7/18 | 12.1s |
| #50 | Hunter Alpha medium | OpenRouter | 1 | 6.7 | 8/18 | 10.3s |
| #47 | Grok 4.20 medium | X AI | 1 | 7.0 | 9/18 | 10.3s |
| #87 | Qwen3 Coder Next none | Qwen | 1 | 5.1 | 4/18 | 10.2s |
| #56 | Grok 4.20 Multi Agent Beta medium | X AI | 2 | 6.4 | 7/18 | 9.80s |
| #42 | Claude Sonnet 4.6 none | Anthropic | 3 | 7.4 | 11/18 | 4.98s |
| #94 | MiMo-V2-Flash none | Xiaomi | 1 | 4.5 | 3/18 | 2.79s |
| #82 | Grok 4.20 none | X AI | 1 | 5.2 | 5/18 | 1.11s |
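
The summary figures can be rechecked against the table rows. The snippet below is a minimal sketch, not part of the site: it copies the Model and Extra formatting Count columns by hand into a hypothetical `rows` list and recomputes the number of models, the total failure count, and the model with the highest count.

```python
# Rows copied by hand from the table: (model, extra-formatting count).
# The variable names here are illustrative, not from the site.
rows = [
    ("Qwen3.5-9B medium", 1),
    ("Qwen3.5-27B medium", 1),
    ("MiMo-V2-Flash medium", 1),
    ("Claude Opus 4.6 medium", 4),
    ("MiMo-V2-Omni medium", 1),
    ("Claude Sonnet 4.6 medium", 2),
    ("MiMo-V2-Pro medium", 1),
    ("DeepSeek V3.2 none", 2),
    ("Hunter Alpha medium", 1),
    ("Grok 4.20 medium", 1),
    ("Qwen3 Coder Next none", 1),
    ("Grok 4.20 Multi Agent Beta medium", 2),
    ("Claude Sonnet 4.6 none", 3),
    ("MiMo-V2-Flash none", 1),
    ("Grok 4.20 none", 1),
]

models_shown = len(rows)                          # number of table rows
total_failures = sum(count for _, count in rows)  # sum of the count column
most_affected = max(rows, key=lambda r: r[1])     # row with the highest count

print(models_shown, total_failures, most_affected)
# 15 23 ('Claude Opus 4.6 medium', 4)
```

Ranking by raw count ignores response time and score, so the row with the highest count need not be the first row in the table's current sort order.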

Charts: Top Models by Extra formatting Count; Extra formatting Count vs Score; Top Models by Response Time (avg).