AI BENCHY Failures
Extra formatting failures
See which AI models run into extra formatting failures most often, so you can gauge reliability risk before choosing one. The table is sorted by total cost, highest first.
Models shown: 15
Total failures: 53
Most affected model: Grok 4.20 Multi Agent Beta 2 (32/32)
| Rank | Model | Company | Extra formatting failures | Score | Total cost | Tests passed | Response time (avg) |
|---|---|---|---|---|---|---|---|
| #136 | Grok 4.20 Multi Agent Beta medium | X AI | 2 | 5.0 | $5.599 | 8/18 | 9.69s |
| #38 | Claude Opus 4.6 medium | Anthropic | 5 | 7.7 | $2.053 | 12/21 | 25.9s |
| #31 | Claude Sonnet 4.6 medium | Anthropic | 3 | 7.8 | $1.418 | 13/21 | 17.1s |
| #42 | Grok Build 0.1 medium | X AI | 3 | 7.6 | $0.927 | 13/21 | 49.9s |
| #73 | Mimo V2 Omni medium | Xiaomi | 1 | 6.8 | $0.683 | 10/21 | 41.2s |
| #37 | Grok 4.3 medium | X AI | 1 | 7.7 | $0.614 | 13/21 | 47.5s |
| #53 | Grok 4.20 medium | X AI | 1 | 7.3 | $0.609 | 12/21 | 27.7s |
| #57 | Claude Opus 4.8 none | Anthropic | 3 | 7.2 | $0.539 | 12/21 | 3.47s |
| #29 | Qwen3.5-27B medium | Qwen | 1 | 7.9 | $0.536 | 13/21 | 68.4s |
| #77 | Mimo V2 PRO medium | Xiaomi | 1 | 6.7 | $0.333 | 12/21 | 22.2s |
| #55 | Claude Sonnet 4.6 none | Anthropic | 4 | 7.3 | $0.316 | 11/21 | 5.04s |
| #64 | GLM 5.1 medium | Z.ai | 1 | 7.1 | $0.292 | 12/21 | 33.7s |
| #41 | DeepSeek V4 Pro high | DeepSeek | 1 | 7.6 | $0.157 | 9/21 | 77.2s |
| #40 | MiniMax M3 medium | Minimax | 1 | 7.6 | $0.131 | 11/21 | 68.2s |
| #51 | MiMo-V2.5-Pro medium | Xiaomi | 3 | 7.4 | $0.106 | 12/21 | 26.1s |
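
Because the table is ordered by total cost rather than by failure count, the models with the most extra formatting failures are scattered through it. The sketch below re-ranks the same rows by failure count. It is only an illustration: the `ROWS` list and the `rank_by_failures` helper are hypothetical names introduced here, the values are copied by hand from the table, and breaking ties by lower cost is an arbitrary choice, not something the page specifies.

```python
# Rows copied from the table above:
# (model, extra formatting failures, total cost in USD).
ROWS = [
    ("Grok 4.20 Multi Agent Beta medium", 2, 5.599),
    ("Claude Opus 4.6 medium", 5, 2.053),
    ("Claude Sonnet 4.6 medium", 3, 1.418),
    ("Grok Build 0.1 medium", 3, 0.927),
    ("Mimo V2 Omni medium", 1, 0.683),
    ("Grok 4.3 medium", 1, 0.614),
    ("Grok 4.20 medium", 1, 0.609),
    ("Claude Opus 4.8 none", 3, 0.539),
    ("Qwen3.5-27B medium", 1, 0.536),
    ("Mimo V2 PRO medium", 1, 0.333),
    ("Claude Sonnet 4.6 none", 4, 0.316),
    ("GLM 5.1 medium", 1, 0.292),
    ("DeepSeek V4 Pro high", 1, 0.157),
    ("MiniMax M3 medium", 1, 0.131),
    ("MiMo-V2.5-Pro medium", 3, 0.106),
]


def rank_by_failures(rows):
    """Sort by failure count (descending); ties go to the cheaper model.

    The tie-break rule is an assumption made for this example.
    """
    return sorted(rows, key=lambda r: (-r[1], r[2]))


ranked = rank_by_failures(ROWS)

# Show the five rows with the most extra formatting failures.
for model, failures, cost in ranked[:5]:
    print(f"{failures}  ${cost:.3f}  {model}")
```

Ranked this way, Claude Opus 4.6 medium (5 failures) and Claude Sonnet 4.6 none (4 failures) come first, ahead of the four models with 3 failures each.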