AI BENCHY category failures
Puzzle solving: Wrong answer
See which AI models are most likely to give a Wrong answer in Puzzle solving, so you can spot weaknesses quickly.
Failure reasons
| Rank | Model | Company | Wrong answer count | Category score | Correct attempts | Response time (average) |
|---|---|---|---|---|---|---|
| #46 | Kimi K2.5 medium | Moonshot AI | 1 | 5.3 | 1/3 | 45.4s |
| #48 | Gemma 4 31B none | | 1 | 5.5 | 1/3 | 2.95s |
| #49 | Qwen3.5 Plus 2026-02-15 none | Qwen | 1 | 7.7 | 2/3 | 2.82s |
| #50 | Hunter Alpha medium | OpenRouter | 1 | 6.1 | 1/3 | 5.36s |
| #51 | Nemotron 3 Super medium | NVIDIA | 1 | 3.5 | 0/3 | 8.39s |
| #52 | Grok 4.1 Fast medium | X AI | 1 | 5.3 | 1/3 | 8.08s |
| #53 | GLM 5 none | Z.ai | 1 | 7.7 | 2/3 | 2.05s |
| #54 | Mercury 2 medium | Inception | 1 | 3.9 | 0/3 | 934ms |
| #57 | GPT-5 Nano medium | OpenAI | 1 | 5.3 | 1/3 | 19.8s |
| #58 | GLM 5V Turbo none | Z.ai | 1 | 5.3 | 1/3 | 2.22s |
| #60 | Gemma 4 26B A4B none | | 1 | 5.7 | 1/3 | 739ms |
| #62 | Gemini 2.5 Flash none | | 1 | 5.7 | 1/3 | 576ms |
| #64 | DeepSeek V3.2 none | DeepSeek | 1 | 8.5 | 2/3 | 7.37s |
| #65 | MiMo-V2-Pro none | Xiaomi | 1 | 6.0 | 1/3 | 1.83s |
| #66 | GPT-5.4 none | OpenAI | 1 | 5.6 | 1/3 | 1.52s |