AI BENCHY category failures
Puzzle solving: Did not follow instructions
See which AI models are most likely to receive a "Did not follow instructions" failure in Puzzle solving, so you can spot weaknesses quickly.
Failure reasons
| Rank | Model | Company | "Did not follow instructions" count | Category score | Correct attempts | Response time (average) |
|---|---|---|---|---|---|---|
| #53 | Gemini 3.1 Flash Lite high | Google | 2 | 5.7 | 1/3 | 50.8s |
| #87 | Gemini 3.1 Flash Lite minimal | Google | 2 | 6.0 | 1/3 | 2.15s |
| #153 | Qwen3.6 35B A3B none | Qwen | 2 | 3.2 | 0/3 | 1.07s |
| #12 | Gemini 3.1 Flash Lite Preview high | Google | 1 | 7.7 | 2/3 | 46.7s |
| #15 | GPT-5.3-Codex medium | OpenAI | 1 | 9.0 | 2/3 | 5.05s |
| #19 | Seed-2.0-Lite medium | Bytedance Seed | 1 | 9.0 | 2/3 | 10.2s |
| #21 | GPT-5.4 medium | OpenAI | 1 | 8.2 | 2/3 | 9.14s |
| #23 | GLM 5 Turbo medium | Z.ai | 1 | 8.7 | 2/3 | 5.23s |
| #30 | Qwen3.5-27B medium | Qwen | 1 | 8.2 | 2/3 | 59.6s |
| #31 | DeepSeek V4 Flash high | DeepSeek | 1 | 8.2 | 2/3 | 26.1s |
| #33 | Hy3 preview medium | Tencent | 1 | 7.7 | 2/3 | 11.1s |
| #38 | Grok 4.3 medium | X AI | 1 | 5.9 | 1/3 | 22.5s |
| #39 | Qwen3.6 Flash medium | Qwen | 1 | 8.2 | 2/3 | 6.29s |
| #42 | GPT-5.2 medium | OpenAI | 1 | 7.5 | 2/3 | 5.80s |
| #43 | MiMo-V2.5-Pro medium | Xiaomi | 1 | 6.7 | 1/3 | 5.31s |