AI BENCHY Failures
"Did not follow instructions" failures
See which AI models most often fail to follow instructions, so you can spot reliability risks before choosing one. Sorted by: average response time, ascending (↑).
| Rank | Model | Company | "Did not follow instructions" count | Score | Correct tests | Avg. response time |
|---|---|---|---|---|---|---|
| #126 | gpt-oss-120b none | OpenAI | 2 | 5.4 | 6/19 | 21.6s |
| #51 | Mimo V2 PRO medium | Xiaomi | 1 | 7.4 | 12/21 | 22.2s |
| #99 | gpt-oss-120b medium | OpenAI | 3 | 6.1 | 9/21 | 22.3s |
| #45 | GPT-5.4 Mini medium | OpenAI | 3 | 7.5 | 12/21 | 22.3s |
| #21 | GPT-5.4 medium | OpenAI | 2 | 8.0 | 14/21 | 22.3s |
| #23 | GLM 5 Turbo medium | Z.ai | 1 | 8.0 | 14/21 | 23.0s |
| #59 | GLM 5V Turbo medium | Z.ai | 1 | 7.2 | 11/21 | 23.1s |
| #54 | GPT-5 Mini medium | OpenAI | 3 | 7.3 | 12/21 | 23.6s |
| #86 | Grok 4.1 Fast medium | X AI | 4 | 6.5 | 9/19 | 23.8s |
| #69 | Claude Opus 4.6 medium | Anthropic | 1 | 7.0 | 12/21 | 25.9s |
| #43 | MiMo-V2.5-Pro medium | Xiaomi | 2 | 7.5 | 12/21 | 26.1s |
| #139 | DeepSeek V4 Flash none | DeepSeek | 1 | 5.0 | 5/21 | 26.8s |
| #56 | MiMo-V2.5 medium | Xiaomi | 1 | 7.3 | 12/21 | 27.1s |
| #65 | Grok 4.20 medium | X AI | 2 | 7.1 | 12/21 | 27.7s |
| #100 | Grok Build 0.1 none | X AI | 2 | 6.0 | 7/19 | 28.7s |
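
The table is ordered by response time, so the models with the most instruction-following failures are not grouped together. As a rough sketch, rows copied from the table can be re-sorted by failure rate. This assumes that the denominator of "Correct tests" is the number of tests each model ran and that the failure count is drawn from those same tests; the page does not state either. `Row`, `parse_row`, and `failure_rate` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Row:
    model: str
    failures: int       # "Did not follow instructions" count
    passed: int         # numerator of "Correct tests"
    total: int          # denominator of "Correct tests" (assumed: tests run)
    avg_seconds: float  # average response time in seconds

    @property
    def failure_rate(self) -> float:
        # Assumption: failures are counted out of the same `total` tests.
        return self.failures / self.total


def parse_row(line: str) -> Row:
    """Parse one markdown table row in the column order used above."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    _rank, model, _company, failures, _score, tests, seconds = cells
    passed, total = tests.split("/")
    return Row(model, int(failures), int(passed), int(total),
               float(seconds.rstrip("s")))


# Three rows copied verbatim from the table.
TABLE = """\
| #126 | gpt-oss-120b none | OpenAI | 2 | 5.4 | 6/19 | 21.6s |
| #86 | Grok 4.1 Fast medium | X AI | 4 | 6.5 | 9/19 | 23.8s |
| #69 | Claude Opus 4.6 medium | Anthropic | 1 | 7.0 | 12/21 | 25.9s |
"""

rows = [parse_row(line) for line in TABLE.splitlines()]

# Highest failure rate first.
by_failure_rate = sorted(rows, key=lambda r: r.failure_rate, reverse=True)

for r in by_failure_rate:
    print(f"{r.model}: {r.failures}/{r.total} = {r.failure_rate:.1%}")
```

Dividing by the number of tests matters here because some models were scored on 19 tests and others on 21, so raw counts are not directly comparable.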