AI BENCHY Failures
Extra formatting failures
See which AI models most often produce Extra formatting failures, so you can weigh reliability risk before choosing one.
| Rank | Model | Company | Extra formatting failures | Score | Total cost | Tests passed | Response time (average) |
|---|---|---|---|---|---|---|---|
| #42 | Claude Opus 4.6 medium | Anthropic | 5 | 7.7 | $2.053 | 12/21 | 25.9s |
| #57 | Claude Sonnet 4.6 none | Anthropic | 4 | 7.3 | $0.316 | 11/21 | 5.04s |
| #35 | Claude Sonnet 4.6 medium | Anthropic | 3 | 7.8 | $1.418 | 13/21 | 17.1s |
| #45 | Grok Build 0.1 medium | X AI | 3 | 7.6 | $0.927 | 13/21 | 49.9s |
| #53 | MiMo-V2.5-Pro medium | Xiaomi | 3 | 7.4 | $0.106 | 12/21 | 26.1s |
| #60 | Claude Opus 4.8 none | Anthropic | 3 | 7.2 | $0.539 | 12/21 | 3.47s |
| #58 | Grok 4.20 Multi Agent Beta medium | X AI | 2 | 7.3 | $5.599 | 8/18 | 9.69s |
| #78 | MiMo-V2.5 medium | Xiaomi | 2 | 6.7 | $0.063 | 12/21 | 27.1s |
| #120 | DeepSeek V4 Flash none | DeepSeek | 2 | 5.5 | $0.008 | 5/21 | 26.8s |
| #133 | DeepSeek V3.2 none | DeepSeek | 2 | 5.3 | $0.017 | 6/21 | 13.8s |
| #30 | DeepSeek V4 Pro high | DeepSeek | 1 | 8.1 | $0.098 | 10/21 | 72.2s |
| #33 | Qwen3.5-27B medium | Qwen | 1 | 7.9 | $0.536 | 13/21 | 68.4s |
| #41 | Grok 4.3 medium | X AI | 1 | 7.7 | $0.614 | 13/21 | 47.5s |
| #44 | MiniMax M3 medium | Minimax | 1 | 7.6 | $0.131 | 11/21 | 68.2s |
| #55 | Grok 4.20 medium | X AI | 1 | 7.3 | $0.609 | 12/21 | 27.7s |