AI BENCHY category failures
Puzzle solving: Did not follow instructions
See which AI models are most likely to fail with "Did not follow instructions" in Puzzle solving, so you can spot weaknesses quickly.
Failure reasons
| Rank | Model | Company | "Did not follow instructions" count | Category score | Correct attempts | Response time (average) |
|---|---|---|---|---|---|---|
| #18 | GLM 5 Turbo medium | Z.ai | 2 | 7.3 | 1/3 | 5.44s |
| #34 | Kimi K2.6 medium | Moonshot AI | 2 | 5.0 | 0/3 | 25.6s |
| #38 | GPT-5.4 Nano medium | OpenAI | 2 | 4.0 | 0/3 | 3.65s |
| #44 | GPT-5.4 Mini medium | OpenAI | 2 | 6.8 | 1/3 | 4.33s |
| #47 | Grok 4.20 medium | X AI | 2 | 6.4 | 1/3 | 3.89s |
| #51 | Nemotron 3 Super medium | NVIDIA | 2 | 3.5 | 0/3 | 8.39s |
| #54 | Mercury 2 medium | Inception | 2 | 3.9 | 0/3 | 934ms |
| #56 | Grok 4.20 Multi Agent Beta medium | X AI | 2 | 7.2 | 1/3 | 5.01s |
| #68 | gpt-oss-120b medium | OpenAI | 2 | 3.2 | 0/3 | 11.8s |
| #69 | Kimi K2.6 none | Moonshot AI | 2 | 3.4 | 0/3 | 1.66s |
| #73 | Mistral Small 4 medium | Mistral | 2 | 3.4 | 0/3 | 2.00s |
| #74 | GLM 4.7 Flash none | Z.ai | 2 | 4.4 | 0/3 | 1.00s |
| #80 | MiniMax M2.7 medium | Minimax | 2 | 3.8 | 0/3 | 25.6s |
| #81 | Elephant medium | Openrouter | 2 | 3.7 | 0/3 | 867ms |
| #83 | Mistral Small 4 none | Mistral | 2 | 3.1 | 0/3 | 589ms |