AI BENCHY
Extra formatting Failures

See which AI models produce Extra formatting failures most often, so you can spot reliability risks before choosing one.

Models Shown: 15
Total Failures: 48
Most Affected Model: Claude Opus 4.6 (5 failures)

| Rank | Model | Company | Extra formatting Count | Score | Tests Correct | Response Time (avg) |
|------|-------|---------|------------------------|-------|---------------|---------------------|
| #69 | Claude Opus 4.6 medium | Anthropic | 5 | 7.0 | 12/21 | 25.9s |
| #77 | Claude Sonnet 4.6 none | Anthropic | 4 | 6.8 | 11/21 | 5.04s |
| #43 | MiMo-V2.5-Pro medium | Xiaomi | 3 | 7.5 | 12/21 | 26.1s |
| #47 | Grok Build 0.1 medium | X AI | 3 | 7.4 | 13/21 | 49.9s |
| #52 | Claude Sonnet 4.6 medium | Anthropic | 3 | 7.4 | 13/21 | 17.1s |
| #68 | Claude Opus 4.8 none | Anthropic | 3 | 7.0 | 12/21 | 3.47s |
| #56 | MiMo-V2.5 medium | Xiaomi | 2 | 7.3 | 12/21 | 27.1s |
| #84 | Grok 4.20 Multi Agent Beta medium | X AI | 2 | 6.6 | 8/18 | 9.69s |
| #133 | DeepSeek V3.2 none | DeepSeek | 2 | 5.2 | 6/21 | 13.8s |
| #139 | DeepSeek V4 Flash none | DeepSeek | 2 | 5.0 | 5/21 | 26.8s |
| #30 | Qwen3.5-27B medium | Qwen | 1 | 7.8 | 13/21 | 68.4s |
| #38 | Grok 4.3 medium | X AI | 1 | 7.6 | 13/21 | 47.5s |
| #51 | Mimo V2 PRO medium | Xiaomi | 1 | 7.4 | 12/21 | 22.2s |
| #55 | GLM 5.1 medium | Z.ai | 1 | 7.3 | 12/21 | 33.7s |
| #64 | MiMo-V2-Flash medium | Xiaomi | 1 | 7.2 | 12/21 | 20.1s |
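
The counts in the table above are raw totals, and not every model ran the same number of tests (one row lists 18 tests where the others list 21). A rough way to compare models on equal footing is to divide each count by the number of tests run. The sketch below does that for a few rows copied from the table. The names `Row`, `failure_rate`, and `rank_by_rate` are hypothetical, and it assumes the denominator of "Tests Correct" is the number of tests run and that a test contributes at most one Extra formatting failure; the page does not state either.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One leaderboard row (hypothetical structure, values copied from the table)."""
    model: str
    extra_formatting: int  # "Extra formatting Count" column
    tests_total: int       # denominator of the "Tests Correct" column


def failure_rate(row: Row) -> float:
    """Extra formatting failures per test run (assumed normalization)."""
    return row.extra_formatting / row.tests_total


def rank_by_rate(rows: list[Row]) -> list[Row]:
    """Sort rows from highest to lowest failure rate."""
    return sorted(rows, key=failure_rate, reverse=True)


# A subset of rows from the table, for illustration only.
ROWS = [
    Row("Claude Opus 4.6 medium", 5, 21),
    Row("Claude Sonnet 4.6 none", 4, 21),
    Row("Grok 4.20 Multi Agent Beta medium", 2, 18),
    Row("DeepSeek V3.2 none", 2, 21),
]

if __name__ == "__main__":
    for row in rank_by_rate(ROWS):
        print(f"{row.model}: {failure_rate(row):.1%}")
```

Under these assumptions, two models with the same raw count can rank differently once the number of tests is taken into account, which is why the row with 18 tests sorts ahead of a 21-test row with the same count.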

[Charts: Top Models by Extra formatting Count; Extra formatting Count vs Score; Top Models by Response Time (avg)]