AI BENCHY Failures
Wrong-answer failures
See which AI models give wrong answers most often, so you can weigh reliability risks before choosing one. Sorted by: average response time, descending.
| Rank | Model | Company | Wrong answers | Avg. score | Tests passed | Response time (avg) |
|---|---|---|---|---|---|---|
| #24 | Qwen3.5-Flash medium | Qwen | 1 | 6.9 | 10/16 | 70.8s |
| #28 | Kimi K2.5 medium | Moonshot AI | 3 | 6.4 | 9/16 | 69.8s |
| #8 | Gemini 3.1 Flash Lite Preview high | Google | 3 | 8.2 | 12/16 | 68.8s |
| #23 | Seed-2.0-Mini medium | Bytedance Seed | 1 | 6.9 | 10/16 | 65.1s |
| #7 | Qwen3.5-27B medium | Qwen | 1 | 8.2 | 12/16 | 52.1s |
| #34 | GPT-5 Nano medium | OpenAI | 5 | 5.5 | 7/16 | 47.9s |
| #35 | Qwen3.5-35B-A3B medium | Qwen | 2 | 5.5 | 8/16 | 43.9s |
| #43 | MiniMax M2.5 medium | Minimax | 5 | 4.7 | 5/16 | 43.0s |
| #18 | DeepSeek V3.2 medium | DeepSeek | 3 | 7.3 | 11/16 | 39.5s |
| #52 | GLM 4.7 Flash medium | Z.ai | 7 | 3.1 | 4/16 | 36.8s |
| #4 | Qwen3.5 Plus 2026-02-15 medium | Qwen | 1 | 8.3 | 13/16 | 34.5s |
| #10 | Qwen3.5-122B-A10B medium | Qwen | 3 | 7.7 | 12/16 | 29.7s |
| #13 | Step 3.5 Flash medium | Stepfun | 3 | 7.4 | 10/16 | 29.1s |
| #30 | Grok 4.1 Fast medium | X AI | 2 | 6.2 | 9/16 | 26.3s |
| #21 | MiMo-V2-Flash medium | Xiaomi | 3 | 7.2 | 11/16 | 25.3s |
| #32 | GPT-5 Mini medium | OpenAI | 3 | 6.0 | 8/16 | 25.1s |
| #26 | Claude Opus 4.6 medium | Anthropic | 2 | 6.6 | 10/16 | 22.9s |
| #9 | GPT-5.4 medium | OpenAI | 2 | 8.0 | 12/16 | 20.1s |
| #39 | gpt-oss-120b medium | OpenAI | 5 | 5.1 | 7/16 | 16.7s |
| #2 | Gemini 3.1 Pro Preview medium | Google | 1 | 9.4 | 15/16 | 16.6s |
| #3 | GPT-5.3-Codex medium | OpenAI | 2 | 8.4 | 12/16 | 16.6s |
| #14 | GLM 5 medium | Z.ai | 2 | 7.4 | 11/16 | 16.2s |
| #27 | GPT-5.2 medium | OpenAI | 1 | 6.5 | 10/16 | 15.3s |
| #33 | DeepSeek V3.2 none | DeepSeek | 6 | 5.5 | 7/16 | 12.9s |
| #50 | Qwen3 Coder Next medium | Qwen | 8 | 3.5 | 3/16 | 12.5s |
| #16 | Gemini 2.5 Flash medium | Google | 4 | 7.4 | 11/16 | 12.4s |
| #46 | Kimi K2.5 none | Moonshot AI | 11 | 4.1 | 5/16 | 11.9s |
| #48 | Qwen3 Coder Next none | Qwen | 10 | 4.0 | 4/16 | 11.7s |
| #11 | Claude Sonnet 4.6 medium | Anthropic | 1 | 7.7 | 12/16 | 11.2s |
| #6 | Gemini 3 Pro Preview medium | Google | 3 | 8.2 | 13/16 | 7.15s |
| #15 | GPT-5.2 Chat none | OpenAI | 4 | 7.4 | 11/16 | 7.03s |
| #5 | Gemini 3 Flash Preview low | Google | 3 | 8.2 | 13/16 | 6.11s |
| #19 | GPT-5.3 Chat none | OpenAI | 4 | 7.3 | 10/16 | 5.96s |
| #25 | Claude Sonnet 4.6 none | Anthropic | 2 | 6.8 | 10/16 | 5.57s |
| #42 | Qwen3.5-35B-A3B none | Qwen | 8 | 4.7 | 6/16 | 4.10s |
| #31 | GLM 5 none | Z.ai | 7 | 6.0 | 9/16 | 4.03s |
| #12 | Gemini 3.1 Flash Lite Preview medium | Google | 4 | 7.5 | 11/16 | 3.83s |
| #40 | Qwen3.5-122B-A10B none | Qwen | 9 | 5.0 | 6/16 | 3.72s |
| #37 | Qwen3.5-Flash none | Qwen | 8 | 5.2 | 7/16 | 3.54s |
| #17 | Gemini 3.1 Flash Lite Preview low | Google | 4 | 7.3 | 11/16 | 3.36s |
| #45 | Trinity Large Preview none | Arcee AI | 9 | 4.2 | 5/16 | 3.15s |
| #49 | GLM 4.7 Flash none | Z.ai | 9 | 3.9 | 4/16 | 2.99s |
| #54 | MiMo-V2-Flash none | Xiaomi | 10 | 2.9 | 3/16 | 2.97s |
| #29 | Qwen3.5 Plus 2026-02-15 none | Qwen | 7 | 6.2 | 9/16 | 2.65s |
| #36 | Mercury 2 medium | Inception | 5 | 5.3 | 7/16 | 2.36s |
| #47 | GPT-4o-mini none | OpenAI | 11 | 4.0 | 4/16 | 2.07s |
| #53 | Grok 4.1 Fast none | X AI | 11 | 2.9 | 3/16 | 1.90s |
| #41 | Qwen3.5-27B none | Qwen | 9 | 4.9 | 5/16 | 1.75s |
| #20 | Gemini 3 Flash Preview none | Google | 5 | 7.2 | 11/16 | 1.75s |
| #44 | GPT-5.4 none | OpenAI | 9 | 4.5 | 6/16 | 1.48s |
| #22 | Gemini 3.1 Flash Lite Preview none | Google | 4 | 7.1 | 10/16 | 1.33s |
| #38 | Gemini 2.5 Flash none | Google | 9 | 5.2 | 6/16 | 923ms |
| #55 | LFM2-24B-A2B none | Liquid | 9 | 2.6 | 1/16 | 811ms |
| #51 | Mercury 2 none | Inception | 11 | 3.4 | 4/16 | 596ms |
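
The table is ordered by response time, while the page's stated purpose is to find the models with the most wrong answers. As a rough sketch, the rows can be re-ranked by wrong-answer count once the mixed `s` and `ms` response times are converted to seconds. The helper names `parse_seconds` and `rank_by_wrong_answers` are hypothetical, and only a handful of rows are copied from the table for illustration.

```python
import re


def parse_seconds(text: str) -> float:
    """Convert a response-time string such as '70.8s' or '923ms' to seconds."""
    match = re.fullmatch(r"\s*([0-9]+(?:\.[0-9]+)?)\s*(ms|s)\s*", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    value, unit = float(match.group(1)), match.group(2)
    return value / 1000.0 if unit == "ms" else value


# A few rows copied from the table: (model, wrong answers, avg response time).
rows = [
    ("Qwen3.5-Flash medium", 1, "70.8s"),
    ("Kimi K2.5 none", 11, "11.9s"),
    ("Claude Sonnet 4.6 medium", 1, "11.2s"),
    ("Mercury 2 none", 11, "596ms"),
    ("LFM2-24B-A2B none", 9, "811ms"),
]


def rank_by_wrong_answers(rows):
    """Most wrong answers first; ties are broken by faster response time."""
    return sorted(rows, key=lambda r: (-r[1], parse_seconds(r[2])))


for model, wrong, latency in rank_by_wrong_answers(rows):
    print(f"{wrong:>2} wrong  {parse_seconds(latency):7.3f}s  {model}")
```

Breaking ties by response time is an arbitrary choice made here for the example; average score or tests passed would work equally well as a secondary key.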