AI BENCHY Failures
Failures: Did not follow instructions
See which AI models most often run into "Did not follow instructions" so you can spot reliability risks before choosing one. Sorted by: failure count ↑.
Models shown: 41
Total failures: 77
Most affected model: Gemini 3.1 Flash Lite Preview 1
| Rank | Model | Company | "Did not follow instructions" count | Average score | Correct tests | Response time (avg) |
|---|---|---|---|---|---|---|
| #8 | Gemini 3.1 Flash Lite Preview high | Google | 1 | 8.2 | 12/16 | 68.8s |
| #12 | Gemini 3.1 Flash Lite Preview medium | Google | 1 | 7.5 | 11/16 | 3.83s |
| #14 | GLM 5 medium | Z.ai | 1 | 7.4 | 11/16 | 16.2s |
| #15 | GPT-5.2 Chat none | OpenAI | 1 | 7.4 | 11/16 | 7.03s |
| #16 | Gemini 2.5 Flash medium | Google | 1 | 7.4 | 11/16 | 12.4s |
| #17 | Gemini 3.1 Flash Lite Preview low | Google | 1 | 7.3 | 11/16 | 3.36s |
| #18 | DeepSeek V3.2 medium | DeepSeek | 1 | 7.3 | 11/16 | 39.5s |
| #21 | MiMo-V2-Flash medium | Xiaomi | 1 | 7.2 | 11/16 | 25.3s |
| #23 | Seed-2.0-Mini medium | Bytedance Seed | 1 | 6.9 | 10/16 | 65.1s |
| #24 | Qwen3.5-Flash medium | Qwen | 1 | 6.9 | 10/16 | 70.8s |
| #25 | Claude Sonnet 4.6 none | Anthropic | 1 | 6.8 | 10/16 | 5.57s |
| #37 | Qwen3.5-Flash none | Qwen | 1 | 5.2 | 7/16 | 3.54s |
| #38 | Gemini 2.5 Flash none | Google | 1 | 5.2 | 6/16 | 923ms |
| #40 | Qwen3.5-122B-A10B none | Qwen | 1 | 5.0 | 6/16 | 3.72s |
| #44 | GPT-5.4 none | OpenAI | 1 | 4.5 | 6/16 | 1.48s |
| #47 | GPT-4o-mini none | OpenAI | 1 | 4.0 | 4/16 | 2.07s |
| #48 | Qwen3 Coder Next none | Qwen | 1 | 4.0 | 4/16 | 11.7s |
| #51 | Mercury 2 none | Inception | 1 | 3.4 | 4/16 | 596ms |
| #54 | MiMo-V2-Flash none | Xiaomi | 1 | 2.9 | 3/16 | 2.97s |
| #3 | GPT-5.3-Codex medium | OpenAI | 2 | 8.4 | 12/16 | 16.6s |
| #7 | Qwen3.5-27B medium | Qwen | 2 | 8.2 | 12/16 | 52.1s |
| #9 | GPT-5.4 medium | OpenAI | 2 | 8.0 | 12/16 | 20.1s |
| #19 | GPT-5.3 Chat none | OpenAI | 2 | 7.3 | 10/16 | 5.96s |
| #22 | Gemini 3.1 Flash Lite Preview none | Google | 2 | 7.1 | 10/16 | 1.33s |
| #28 | Kimi K2.5 medium | Moonshot AI | 2 | 6.4 | 9/16 | 69.8s |
| #41 | Qwen3.5-27B none | Qwen | 2 | 4.9 | 5/16 | 1.75s |
| #42 | Qwen3.5-35B-A3B none | Qwen | 2 | 4.7 | 6/16 | 4.10s |
| #45 | Trinity Large Preview none | Arcee AI | 2 | 4.2 | 5/16 | 3.15s |
| #49 | GLM 4.7 Flash none | Z.ai | 2 | 3.9 | 4/16 | 2.99s |
| #52 | GLM 4.7 Flash medium | Z.ai | 2 | 3.1 | 4/16 | 36.8s |
| #53 | Grok 4.1 Fast none | X AI | 2 | 2.9 | 3/16 | 1.90s |
| #55 | LFM2-24B-A2B none | Liquid | 2 | 2.6 | 1/16 | 811ms |
| #13 | Step 3.5 Flash medium | Stepfun | 3 | 7.4 | 10/16 | 29.1s |
| #27 | GPT-5.2 medium | OpenAI | 3 | 6.5 | 10/16 | 15.3s |
| #30 | Grok 4.1 Fast medium | X AI | 3 | 6.2 | 9/16 | 26.3s |
| #34 | GPT-5 Nano medium | OpenAI | 3 | 5.5 | 7/16 | 47.9s |
| #43 | MiniMax M2.5 medium | Minimax | 3 | 4.7 | 5/16 | 43.0s |
| #32 | GPT-5 Mini medium | OpenAI | 4 | 6.0 | 8/16 | 25.1s |
| #36 | Mercury 2 medium | Inception | 4 | 5.3 | 7/16 | 2.36s |
| #39 | gpt-oss-120b medium | OpenAI | 4 | 5.1 | 7/16 | 16.7s |
| #50 | Qwen3 Coder Next medium | Qwen | 5 | 3.5 | 3/16 | 12.5s |
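
The summary figures at the top of the page can be cross-checked against the table rows. The sketch below is only an illustration: the dictionary `models_per_failure_count` and the helper `summarize` are hypothetical names, and the per-count tallies were counted by hand from the rows above, not taken from any AI BENCHY export.

```python
# Hypothetical tally, counted by hand from the table rows above:
# failure count -> number of models listed with that count.
models_per_failure_count = {1: 19, 2: 13, 3: 5, 4: 3, 5: 1}


def summarize(distribution):
    """Return (models shown, total failures) for a {failure_count: n_models} map."""
    models = sum(distribution.values())
    failures = sum(count * n for count, n in distribution.items())
    return models, failures


models, failures = summarize(models_per_failure_count)
print(models, failures)  # prints: 41 77
```

Under these assumptions the tallies agree with the "Models shown" and "Total failures" figures reported above the table.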