AI BENCHY
AI BENCHY Failures

Extra formatting Failures

See which AI models produce extra formatting most often, so you can spot reliability risks before choosing one. Models are sorted by failure count, ascending.

Models Shown

15

Total Failures

23

Most Affected Model

Claude Opus 4.6 4

Rank  Model                              Company     Extra formatting Count  Score  Tests Correct  Response Time (avg)
#10   Qwen3.5-27B medium                 Qwen        1                       8.4    13/18          53.0s
#23   MiMo-V2-Pro medium                 Xiaomi      1                       8.1    12/18          12.3s
#35   MiMo-V2-Omni medium                Xiaomi      1                       7.7    11/18          16.8s
#41   MiMo-V2-Flash medium               Xiaomi      1                       7.5    11/18          23.4s
#47   Grok 4.20 medium                   X AI        1                       7.0    9/18           10.3s
#50   Hunter Alpha medium                OpenRouter  1                       6.7    8/18           10.3s
#82   Grok 4.20 none                     X AI        1                       5.2    5/18           1.11s
#87   Qwen3 Coder Next none              Qwen        1                       5.1    4/18           10.2s
#94   MiMo-V2-Flash none                 Xiaomi      1                       4.5    3/18           2.79s
#97   Qwen3.5-9B medium                  Qwen        1                       4.4    3/18           73.6s
#26   Claude Sonnet 4.6 medium           Anthropic   2                       8.0    13/18          12.7s
#56   Grok 4.20 Multi Agent Beta medium  X AI        2                       6.4    7/18           9.80s
#64   DeepSeek V3.2 none                 DeepSeek    2                       6.1    7/18           12.1s
#42   Claude Sonnet 4.6 none             Anthropic   3                       7.4    11/18          4.98s
#37   Claude Opus 4.6 medium             Anthropic   4                       7.6    12/18          21.1s
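
The summary figures can be cross-checked against the table rows. The following is a minimal sketch, not part of the site: the `ROWS` list is copied by hand from the table above, and `summarize` is a hypothetical helper name. It recomputes the number of models shown and the total failure count, and groups the counts by company as one possible way to read the data.

```python
from collections import Counter

# (model label as listed, company, extra-formatting count),
# copied by hand from the table; not fetched from the site.
ROWS = [
    ("Qwen3.5-27B medium", "Qwen", 1),
    ("MiMo-V2-Pro medium", "Xiaomi", 1),
    ("MiMo-V2-Omni medium", "Xiaomi", 1),
    ("MiMo-V2-Flash medium", "Xiaomi", 1),
    ("Grok 4.20 medium", "X AI", 1),
    ("Hunter Alpha medium", "OpenRouter", 1),
    ("Grok 4.20 none", "X AI", 1),
    ("Qwen3 Coder Next none", "Qwen", 1),
    ("MiMo-V2-Flash none", "Xiaomi", 1),
    ("Qwen3.5-9B medium", "Qwen", 1),
    ("Claude Sonnet 4.6 medium", "Anthropic", 2),
    ("Grok 4.20 Multi Agent Beta medium", "X AI", 2),
    ("DeepSeek V3.2 none", "DeepSeek", 2),
    ("Claude Sonnet 4.6 none", "Anthropic", 3),
    ("Claude Opus 4.6 medium", "Anthropic", 4),
]


def summarize(rows):
    """Return (models shown, total failures, failures per company)."""
    per_company = Counter()
    for _model, company, count in rows:
        per_company[company] += count
    total = sum(count for _model, _company, count in rows)
    return len(rows), total, dict(per_company)


if __name__ == "__main__":
    shown, total, per_company = summarize(ROWS)
    print("Models shown:", shown)
    print("Total failures:", total)
    for company, count in sorted(per_company.items(), key=lambda kv: -kv[1]):
        print(f"{company}: {count}")
```

Grouping by company is an illustrative choice only. Each row was run on the same 18 tests, so the per-company totals depend on how many model variants each company has in the list.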

[Charts: Top Models by Extra formatting Count; Extra formatting Count vs Score; Top Models by Response Time (avg)]