AI BENCHY

AI BENCHY failures

Wrong-answer failures

See which AI models give wrong answers most often, so you can gauge reliability risk before choosing one.

Models shown

15

Total failures

1104

Most affected model

Mercury 2 15
Rank | Model | Company | Wrong answers | Score | Tests passed | Response time (avg.)
#21 | Hy3 preview medium | Tencent | 3 | 8.1 | 15/20 | 16.3s
#22 | Gemini 3 PRO Preview medium | Google | 3 | 8.1 | 15/20 | 9.05s
#28 | Qwen3.5-27B medium | Qwen | 3 | 7.9 | 13/20 | 60.1s
#34 | Gemma 4 26B A4B medium | Google | 3 | 7.8 | 14/20 | 50.9s
#48 | MiMo-V2.5-Pro medium | Xiaomi | 3 | 7.6 | 12/20 | 21.8s
#49 | Gemini 3.1 Flash Lite high | Google | 3 | 7.6 | 11/18 | 62.0s
#51 | Qwen3.5-Flash medium | Qwen | 3 | 7.6 | 12/20 | 63.0s
#53 | Claude Sonnet 4.6 medium | Anthropic | 3 | 7.6 | 13/20 | 15.8s
#59 | Kimi K2.6 medium | Moonshot AI | 3 | 7.4 | 12/20 | 54.0s
#63 | GPT-5.2 medium | OpenAI | 3 | 7.3 | 12/20 | 16.5s
#65 | Claude Opus 4.8 none | Anthropic | 3 | 7.3 | 12/20 | 3.51s
#71 | Claude Opus 4.6 medium | Anthropic | 3 | 7.2 | 12/20 | 25.5s
#156 | Qwen3.5-9B medium | Qwen | 3 | 4.2 | 3/20 | 83.3s
#3 | Gemini 3.5 Flash low | Google | 2 | 9.3 | 18/20 | 2.98s
#4 | Gemini 3.1 Pro Preview medium | Google | 2 | 9.3 | 18/20 | 20.8s

Top models by wrong-answer count

Wrong-answer count vs. score

Top models by response time (avg.)