AI BENCHY
❤️ Made by XCS

AI BENCHY Failures

Failure: Not following instructions

See which AI models most often fail to follow instructions, so you can gauge reliability risk before choosing one. Sorted by: Response time (average), descending.

Models shown: 41

Total failures: 77

Most affected model: Qwen3.5-Flash (1)
| Rank | Model | Company | Not-following-instructions count | Average score | Tests correct | Response time (average) |
|---|---|---|---|---|---|---|
| #24 | Qwen3.5-Flash medium | Qwen | 1 | 6.9 | 10/16 | 70.8s |
| #28 | Kimi K2.5 medium | Moonshot AI | 2 | 6.4 | 9/16 | 69.8s |
| #8 | Gemini 3.1 Flash Lite Preview high | Google | 1 | 8.2 | 12/16 | 68.8s |
| #23 | Seed-2.0-Mini medium | Bytedance Seed | 1 | 6.9 | 10/16 | 65.1s |
| #7 | Qwen3.5-27B medium | Qwen | 2 | 8.2 | 12/16 | 52.1s |
| #34 | GPT-5 Nano medium | OpenAI | 3 | 5.5 | 7/16 | 47.9s |
| #43 | MiniMax M2.5 medium | Minimax | 3 | 4.7 | 5/16 | 43.0s |
| #18 | DeepSeek V3.2 medium | DeepSeek | 1 | 7.3 | 11/16 | 39.5s |
| #52 | GLM 4.7 Flash medium | Z.ai | 2 | 3.1 | 4/16 | 36.8s |
| #13 | Step 3.5 Flash medium | Stepfun | 3 | 7.4 | 10/16 | 29.1s |
| #30 | Grok 4.1 Fast medium | X AI | 3 | 6.2 | 9/16 | 26.3s |
| #21 | MiMo-V2-Flash medium | Xiaomi | 1 | 7.2 | 11/16 | 25.3s |
| #32 | GPT-5 Mini medium | OpenAI | 4 | 6.0 | 8/16 | 25.1s |
| #9 | GPT-5.4 medium | OpenAI | 2 | 8.0 | 12/16 | 20.1s |
| #39 | gpt-oss-120b medium | OpenAI | 4 | 5.1 | 7/16 | 16.7s |
| #3 | GPT-5.3-Codex medium | OpenAI | 2 | 8.4 | 12/16 | 16.6s |
| #14 | GLM 5 medium | Z.ai | 1 | 7.4 | 11/16 | 16.2s |
| #27 | GPT-5.2 medium | OpenAI | 3 | 6.5 | 10/16 | 15.3s |
| #50 | Qwen3 Coder Next medium | Qwen | 5 | 3.5 | 3/16 | 12.5s |
| #16 | Gemini 2.5 Flash medium | Google | 1 | 7.4 | 11/16 | 12.4s |
| #48 | Qwen3 Coder Next none | Qwen | 1 | 4.0 | 4/16 | 11.7s |
| #15 | GPT-5.2 Chat none | OpenAI | 1 | 7.4 | 11/16 | 7.03s |
| #19 | GPT-5.3 Chat none | OpenAI | 2 | 7.3 | 10/16 | 5.96s |
| #25 | Claude Sonnet 4.6 none | Anthropic | 1 | 6.8 | 10/16 | 5.57s |
| #42 | Qwen3.5-35B-A3B none | Qwen | 2 | 4.7 | 6/16 | 4.10s |
| #12 | Gemini 3.1 Flash Lite Preview medium | Google | 1 | 7.5 | 11/16 | 3.83s |
| #40 | Qwen3.5-122B-A10B none | Qwen | 1 | 5.0 | 6/16 | 3.72s |
| #37 | Qwen3.5-Flash none | Qwen | 1 | 5.2 | 7/16 | 3.54s |
| #17 | Gemini 3.1 Flash Lite Preview low | Google | 1 | 7.3 | 11/16 | 3.36s |
| #45 | Trinity Large Preview none | Arcee AI | 2 | 4.2 | 5/16 | 3.15s |
| #49 | GLM 4.7 Flash none | Z.ai | 2 | 3.9 | 4/16 | 2.99s |
| #54 | MiMo-V2-Flash none | Xiaomi | 1 | 2.9 | 3/16 | 2.97s |
| #36 | Mercury 2 medium | Inception | 4 | 5.3 | 7/16 | 2.36s |
| #47 | GPT-4o-mini none | OpenAI | 1 | 4.0 | 4/16 | 2.07s |
| #53 | Grok 4.1 Fast none | X AI | 2 | 2.9 | 3/16 | 1.90s |
| #41 | Qwen3.5-27B none | Qwen | 2 | 4.9 | 5/16 | 1.75s |
| #44 | GPT-5.4 none | OpenAI | 1 | 4.5 | 6/16 | 1.48s |
| #22 | Gemini 3.1 Flash Lite Preview none | Google | 2 | 7.1 | 10/16 | 1.33s |
| #38 | Gemini 2.5 Flash none | Google | 1 | 5.2 | 6/16 | 923ms |
| #55 | LFM2-24B-A2B none | Liquid | 2 | 2.6 | 1/16 | 811ms |
| #51 | Mercury 2 none | Inception | 1 | 3.4 | 4/16 | 596ms |
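
The response-time column mixes units (seconds and milliseconds), so sorting it as text or as bare numbers would misplace the sub-second rows. The sketch below shows one way to normalize the values to seconds before sorting. The helper name `parse_response_time` and the tuple layout are assumptions for illustration, not part of AI BENCHY; the sample rows are copied from the table.

```python
import re


def parse_response_time(text: str) -> float:
    """Convert a response-time string such as '70.8s' or '923ms' to seconds."""
    match = re.fullmatch(r"([0-9]*\.?[0-9]+)(ms|s)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized response time: {text!r}")
    value, unit = float(match.group(1)), match.group(2)
    return value / 1000.0 if unit == "ms" else value


# A few rows from the table: (model, failure count, response time).
rows = [
    ("Mercury 2 none", 1, "596ms"),
    ("Qwen3.5-Flash medium", 1, "70.8s"),
    ("Gemini 2.5 Flash none", 1, "923ms"),
    ("Qwen3 Coder Next medium", 5, "12.5s"),
    ("GPT-5.4 none", 1, "1.48s"),
]

# Slowest first, matching the page's descending response-time order.
by_time = sorted(rows, key=lambda r: parse_response_time(r[2]), reverse=True)

# Most failures first, which answers a different question than response time.
by_failures = sorted(rows, key=lambda r: r[1], reverse=True)

for model, failures, time_text in by_time:
    print(f"{model}: {failures} failure(s), {parse_response_time(time_text):.3f}s")
```

Sorting by failure count instead of response time puts Qwen3 Coder Next medium (5 failures) first among these sample rows, which shows that the two orderings rank models quite differently.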

Chart: Top models by not-following-instructions count

Chart: Not-following-instructions count vs. average score

Chart: Top models by response time (average)