AI BENCHY Failures
Wrong Answer Failures
See which AI models produce a Wrong Answer most often, so you can spot reliability risks before choosing one. Sorted by: Response time (average), descending.
| Rank | Model | Company | Wrong Answer count | Average score | Correct tests | Response time (average) |
|---|---|---|---|---|---|---|
| #24 | Qwen3.5-Flash medium | Qwen | 1 | 6.9 | 10/16 | 70.8s |
| #28 | Kimi K2.5 medium | Moonshot AI | 3 | 6.4 | 9/16 | 69.8s |
| #8 | Gemini 3.1 Flash Lite Preview high | Google | 3 | 8.2 | 12/16 | 68.8s |
| #23 | Seed-2.0-Mini medium | Bytedance Seed | 1 | 6.9 | 10/16 | 65.1s |
| #7 | Qwen3.5-27B medium | Qwen | 1 | 8.2 | 12/16 | 52.1s |
| #34 | GPT-5 Nano medium | OpenAI | 5 | 5.5 | 7/16 | 47.9s |
| #35 | Qwen3.5-35B-A3B medium | Qwen | 2 | 5.5 | 8/16 | 43.9s |
| #43 | MiniMax M2.5 medium | Minimax | 5 | 4.7 | 5/16 | 43.0s |
| #18 | DeepSeek V3.2 medium | DeepSeek | 3 | 7.3 | 11/16 | 39.5s |
| #52 | GLM 4.7 Flash medium | Z.ai | 7 | 3.1 | 4/16 | 36.8s |
| #4 | Qwen3.5 Plus 2026-02-15 medium | Qwen | 1 | 8.3 | 13/16 | 34.5s |
| #10 | Qwen3.5-122B-A10B medium | Qwen | 3 | 7.7 | 12/16 | 29.7s |
| #13 | Step 3.5 Flash medium | Stepfun | 3 | 7.4 | 10/16 | 29.1s |
| #30 | Grok 4.1 Fast medium | X AI | 2 | 6.2 | 9/16 | 26.3s |
| #21 | MiMo-V2-Flash medium | Xiaomi | 3 | 7.2 | 11/16 | 25.3s |
| #32 | GPT-5 Mini medium | OpenAI | 3 | 6.0 | 8/16 | 25.1s |
| #26 | Claude Opus 4.6 medium | Anthropic | 2 | 6.6 | 10/16 | 22.9s |
| #9 | GPT-5.4 medium | OpenAI | 2 | 8.0 | 12/16 | 20.1s |
| #39 | gpt-oss-120b medium | OpenAI | 5 | 5.1 | 7/16 | 16.7s |
| #2 | Gemini 3.1 Pro Preview medium | Google | 1 | 9.4 | 15/16 | 16.6s |
| #3 | GPT-5.3-Codex medium | OpenAI | 2 | 8.4 | 12/16 | 16.6s |
| #14 | GLM 5 medium | Z.ai | 2 | 7.4 | 11/16 | 16.2s |
| #27 | GPT-5.2 medium | OpenAI | 1 | 6.5 | 10/16 | 15.3s |
| #33 | DeepSeek V3.2 none | DeepSeek | 6 | 5.5 | 7/16 | 12.9s |
| #50 | Qwen3 Coder Next medium | Qwen | 8 | 3.5 | 3/16 | 12.5s |
| #16 | Gemini 2.5 Flash medium | Google | 4 | 7.4 | 11/16 | 12.4s |
| #46 | Kimi K2.5 none | Moonshot AI | 11 | 4.1 | 5/16 | 11.9s |
| #48 | Qwen3 Coder Next none | Qwen | 10 | 4.0 | 4/16 | 11.7s |
| #11 | Claude Sonnet 4.6 medium | Anthropic | 1 | 7.7 | 12/16 | 11.2s |
| #6 | Gemini 3 Pro Preview medium | Google | 3 | 8.2 | 13/16 | 7.15s |
| #15 | GPT-5.2 Chat none | OpenAI | 4 | 7.4 | 11/16 | 7.03s |
| #5 | Gemini 3 Flash Preview low | Google | 3 | 8.2 | 13/16 | 6.11s |
| #19 | GPT-5.3 Chat none | OpenAI | 4 | 7.3 | 10/16 | 5.96s |
| #25 | Claude Sonnet 4.6 none | Anthropic | 2 | 6.8 | 10/16 | 5.57s |
| #42 | Qwen3.5-35B-A3B none | Qwen | 8 | 4.7 | 6/16 | 4.10s |
| #31 | GLM 5 none | Z.ai | 7 | 6.0 | 9/16 | 4.03s |
| #12 | Gemini 3.1 Flash Lite Preview medium | Google | 4 | 7.5 | 11/16 | 3.83s |
| #40 | Qwen3.5-122B-A10B none | Qwen | 9 | 5.0 | 6/16 | 3.72s |
| #37 | Qwen3.5-Flash none | Qwen | 8 | 5.2 | 7/16 | 3.54s |
| #17 | Gemini 3.1 Flash Lite Preview low | Google | 4 | 7.3 | 11/16 | 3.36s |
| #45 | Trinity Large Preview none | Arcee AI | 9 | 4.2 | 5/16 | 3.15s |
| #49 | GLM 4.7 Flash none | Z.ai | 9 | 3.9 | 4/16 | 2.99s |
| #54 | MiMo-V2-Flash none | Xiaomi | 10 | 2.9 | 3/16 | 2.97s |
| #29 | Qwen3.5 Plus 2026-02-15 none | Qwen | 7 | 6.2 | 9/16 | 2.65s |
| #36 | Mercury 2 medium | Inception | 5 | 5.3 | 7/16 | 2.36s |
| #47 | GPT-4o-mini none | OpenAI | 11 | 4.0 | 4/16 | 2.07s |
| #53 | Grok 4.1 Fast none | X AI | 11 | 2.9 | 3/16 | 1.90s |
| #41 | Qwen3.5-27B none | Qwen | 9 | 4.9 | 5/16 | 1.75s |
| #20 | Gemini 3 Flash Preview none | Google | 5 | 7.2 | 11/16 | 1.75s |
| #44 | GPT-5.4 none | OpenAI | 9 | 4.5 | 6/16 | 1.48s |
| #22 | Gemini 3.1 Flash Lite Preview none | Google | 4 | 7.1 | 10/16 | 1.33s |
| #38 | Gemini 2.5 Flash none | Google | 9 | 5.2 | 6/16 | 923ms |
| #55 | LFM2-24B-A2B none | Liquid | 9 | 2.6 | 1/16 | 811ms |
| #51 | Mercury 2 none | Inception | 11 | 3.4 | 4/16 | 596ms |
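
The response-time column mixes seconds and milliseconds, so re-sorting the table by another criterion requires normalizing the units first. The following is a minimal sketch, not part of the benchmark itself: the helper names `parse_duration` and `rank_by_reliability` are hypothetical, and the sample rows are copied from the table above. It orders models by fewest wrong answers, breaking ties by faster average response time.

```python
def parse_duration(text: str) -> float:
    """Convert a latency string such as '70.8s' or '923ms' to seconds."""
    text = text.strip()
    # Check "ms" before "s", because "923ms" also ends with "s".
    if text.endswith("ms"):
        return float(text[:-2]) / 1000.0
    if text.endswith("s"):
        return float(text[:-1])
    raise ValueError(f"unrecognized duration: {text!r}")


# (model, wrong answer count, average response time) taken from the table.
rows = [
    ("Qwen3.5-Flash medium", 1, "70.8s"),
    ("Gemini 3.1 Pro Preview medium", 1, "16.6s"),
    ("Claude Sonnet 4.6 medium", 1, "11.2s"),
    ("Gemini 2.5 Flash none", 9, "923ms"),
    ("Mercury 2 none", 11, "596ms"),
]


def rank_by_reliability(rows):
    """Sort by fewest wrong answers, then by shortest response time."""
    return sorted(rows, key=lambda r: (r[1], parse_duration(r[2])))


if __name__ == "__main__":
    for model, wrong, latency in rank_by_reliability(rows):
        print(f"{model}: {wrong} wrong, {parse_duration(latency):.3f}s")
```

Sorting on a tuple keeps the tie-break explicit; a different trade-off between accuracy and speed would only need a different key function.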