AI BENCHY category failures
General intelligence: Did not follow instructions
See which AI models are most likely to fail by not following instructions in the General intelligence category, so you can spot weaknesses quickly.
Failure reasons
| Rank | Model | Company | "Did not follow instructions" count | Category score | Correct attempts | Response time (average) |
|---|---|---|---|---|---|---|
| #78 | Qwen3.6 27B medium | Qwen | 1 | 6.5 | 0/1 | 39.5s |
| #79 | Hunter Alpha medium | OpenRouter | 1 | 7.0 | 0/1 | 6.44s |
| #81 | Mercury 2 medium | Inception | 1 | 4.8 | 0/1 | 821ms |
| #83 | Step 3.5 Flash none | Stepfun | 1 | 4.0 | 0/1 | 14.4s |
| #84 | Grok 4.20 Multi Agent Beta medium | X AI | 1 | 5.8 | 0/1 | 6.40s |
| #86 | Grok 4.1 Fast medium | X AI | 1 | 4.2 | 0/1 | 16.2s |
| #87 | Gemini 3.1 Flash Lite minimal | Google | 1 | 4.0 | 0/1 | 791ms |
| #88 | Qwen3.7 Plus none | Qwen | 1 | 5.3 | 0/1 | 1.33s |
| #94 | GPT-5 Nano medium | OpenAI | 1 | 4.1 | 0/1 | 17.5s |
| #99 | gpt-oss-120b medium | OpenAI | 1 | 4.3 | 0/1 | 7.90s |
| #102 | Gemma 4 26B A4B none | Google | 1 | 4.0 | 0/1 | 3.54s |
| #103 | DeepSeek V4 Pro high | DeepSeek | 1 | 6.1 | 0/1 | 25.1s |
| #105 | Nemotron 3 Super medium | NVIDIA | 1 | 4.1 | 0/1 | 6.91s |
| #106 | Grok 4.20 Beta none | X AI | 1 | 5.0 | 0/1 | 541ms |
| #109 | GLM 5V Turbo none | Z.ai | 1 | 4.6 | 0/1 | 2.22s |