AI BENCHY Failures
Incorrect Answer Failures
See which AI models hit Incorrect Answer failures most often, so you can spot reliability risks before choosing a model.
| Rank | Model | Company | Incorrect answers | Avg. score | Correct attempts | Avg. response time |
|---|---|---|---|---|---|---|
| #46 | Kimi K2.5 none | Moonshot AI | 11 | 4.1 | 5/16 | 11.9s |
| #47 | GPT-4o-mini none | OpenAI | 11 | 4.0 | 4/16 | 2.07s |
| #51 | Mercury 2 none | Inception | 11 | 3.4 | 4/16 | 596ms |
| #53 | Grok 4.1 Fast none | X AI | 11 | 2.9 | 3/16 | 1.90s |
| #48 | Qwen3 Coder Next none | Qwen | 10 | 4.0 | 4/16 | 11.7s |
| #54 | MiMo-V2-Flash none | Xiaomi | 10 | 2.9 | 3/16 | 2.97s |
| #38 | Gemini 2.5 Flash none | Google | 9 | 5.2 | 6/16 | 923ms |
| #40 | Qwen3.5-122B-A10B none | Qwen | 9 | 5.0 | 6/16 | 3.72s |
| #41 | Qwen3.5-27B none | Qwen | 9 | 4.9 | 5/16 | 1.75s |
| #44 | GPT-5.4 none | OpenAI | 9 | 4.5 | 6/16 | 1.48s |
| #45 | Trinity Large Preview none | Arcee AI | 9 | 4.2 | 5/16 | 3.15s |
| #49 | GLM 4.7 Flash none | Z.ai | 9 | 3.9 | 4/16 | 2.99s |
| #55 | LFM2-24B-A2B none | Liquid | 9 | 2.6 | 1/16 | 811ms |
| #37 | Qwen3.5-Flash none | Qwen | 8 | 5.2 | 7/16 | 3.54s |
| #42 | Qwen3.5-35B-A3B none | Qwen | 8 | 4.7 | 6/16 | 4.10s |
| #50 | Qwen3 Coder Next medium | Qwen | 8 | 3.5 | 3/16 | 12.5s |
| #29 | Qwen3.5 Plus 2026-02-15 none | Qwen | 7 | 6.2 | 9/16 | 2.65s |
| #31 | GLM 5 none | Z.ai | 7 | 6.0 | 9/16 | 4.03s |
| #52 | GLM 4.7 Flash medium | Z.ai | 7 | 3.1 | 4/16 | 36.8s |
| #33 | DeepSeek V3.2 none | DeepSeek | 6 | 5.5 | 7/16 | 12.9s |
| #20 | Gemini 3 Flash Preview none | Google | 5 | 7.2 | 11/16 | 1.75s |
| #34 | GPT-5 Nano medium | OpenAI | 5 | 5.5 | 7/16 | 47.9s |
| #36 | Mercury 2 medium | Inception | 5 | 5.3 | 7/16 | 2.36s |
| #39 | gpt-oss-120b medium | OpenAI | 5 | 5.1 | 7/16 | 16.7s |
| #43 | MiniMax M2.5 medium | Minimax | 5 | 4.7 | 5/16 | 43.0s |
| #12 | Gemini 3.1 Flash Lite Preview medium | Google | 4 | 7.5 | 11/16 | 3.83s |
| #15 | GPT-5.2 Chat none | OpenAI | 4 | 7.4 | 11/16 | 7.03s |
| #16 | Gemini 2.5 Flash medium | Google | 4 | 7.4 | 11/16 | 12.4s |
| #17 | Gemini 3.1 Flash Lite Preview low | Google | 4 | 7.3 | 11/16 | 3.36s |
| #19 | GPT-5.3 Chat none | OpenAI | 4 | 7.3 | 10/16 | 5.96s |
| #22 | Gemini 3.1 Flash Lite Preview none | Google | 4 | 7.1 | 10/16 | 1.33s |
| #5 | Gemini 3 Flash Preview low | Google | 3 | 8.2 | 13/16 | 6.11s |
| #6 | Gemini 3 Pro Preview medium | Google | 3 | 8.2 | 13/16 | 7.15s |
| #8 | Gemini 3.1 Flash Lite Preview high | Google | 3 | 8.2 | 12/16 | 68.8s |
| #10 | Qwen3.5-122B-A10B medium | Qwen | 3 | 7.7 | 12/16 | 29.7s |
| #13 | Step 3.5 Flash medium | Stepfun | 3 | 7.4 | 10/16 | 29.1s |
| #18 | DeepSeek V3.2 medium | DeepSeek | 3 | 7.3 | 11/16 | 39.5s |
| #21 | MiMo-V2-Flash medium | Xiaomi | 3 | 7.2 | 11/16 | 25.3s |
| #28 | Kimi K2.5 medium | Moonshot AI | 3 | 6.4 | 9/16 | 69.8s |
| #32 | GPT-5 Mini medium | OpenAI | 3 | 6.0 | 8/16 | 25.1s |
| #3 | GPT-5.3-Codex medium | OpenAI | 2 | 8.4 | 12/16 | 16.6s |
| #9 | GPT-5.4 medium | OpenAI | 2 | 8.0 | 12/16 | 20.1s |
| #14 | GLM 5 medium | Z.ai | 2 | 7.4 | 11/16 | 16.2s |
| #25 | Claude Sonnet 4.6 none | Anthropic | 2 | 6.8 | 10/16 | 5.57s |
| #26 | Claude Opus 4.6 medium | Anthropic | 2 | 6.6 | 10/16 | 22.9s |
| #30 | Grok 4.1 Fast medium | X AI | 2 | 6.2 | 9/16 | 26.3s |
| #35 | Qwen3.5-35B-A3B medium | Qwen | 2 | 5.5 | 8/16 | 43.9s |
| #2 | Gemini 3.1 Pro Preview medium | Google | 1 | 9.4 | 15/16 | 16.6s |
| #4 | Qwen3.5 Plus 2026-02-15 medium | Qwen | 1 | 8.3 | 13/16 | 34.5s |
| #7 | Qwen3.5-27B medium | Qwen | 1 | 8.2 | 12/16 | 52.1s |
| #11 | Claude Sonnet 4.6 medium | Anthropic | 1 | 7.7 | 12/16 | 11.2s |
| #23 | Seed-2.0-Mini medium | Bytedance Seed | 1 | 6.9 | 10/16 | 65.1s |
| #24 | Qwen3.5-Flash medium | Qwen | 1 | 6.9 | 10/16 | 70.8s |
| #27 | GPT-5.2 medium | OpenAI | 1 | 6.5 | 10/16 | 15.3s |
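Since each model was run 16 times (per the x/16 "Correct attempts" column), the raw incorrect-answer counts above can be turned into failure rates for easier comparison. A minimal sketch, using two example rows from the table; the 16-trial total is an assumption read off the table's columns:

```python
# Sketch: convert the leaderboard's incorrect-answer counts into rates.
# Assumes 16 trials per model, as implied by the "Correct attempts" column.
TRIALS = 16

# (model, incorrect-answer count, correct attempts) — example rows from the table
rows = [
    ("Kimi K2.5 none", 11, 5),
    ("Gemini 3.1 Pro Preview medium", 1, 15),
]

for model, incorrect, correct in rows:
    rate = incorrect / TRIALS
    print(f"{model}: {rate:.1%} incorrect answers, {correct}/{TRIALS} correct")
```

Note that incorrect answers and correct attempts need not sum to 16, since a run can also fail in other ways (e.g. errors or timeouts) that are tracked on separate failure pages.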