AI BENCHY Failures
Wrong-Answer Failures
See which AI models give wrong answers most often, so you can understand the reliability risks before choosing one. Sorted by: number of failures, ascending (↑).
| Rank | Model | Company | Wrong answers | Average score | Correct tests | Response time (avg) |
|---|---|---|---|---|---|---|
| #2 | Gemini 3.1 Pro Preview medium | Google | 1 | 9.4 | 15/16 | 16.6s |
| #4 | Qwen3.5 Plus 2026-02-15 medium | Qwen | 1 | 8.3 | 13/16 | 34.5s |
| #7 | Qwen3.5-27B medium | Qwen | 1 | 8.2 | 12/16 | 52.1s |
| #11 | Claude Sonnet 4.6 medium | Anthropic | 1 | 7.7 | 12/16 | 11.2s |
| #23 | Seed-2.0-Mini medium | Bytedance Seed | 1 | 6.9 | 10/16 | 65.1s |
| #24 | Qwen3.5-Flash medium | Qwen | 1 | 6.9 | 10/16 | 70.8s |
| #27 | GPT-5.2 medium | OpenAI | 1 | 6.5 | 10/16 | 15.3s |
| #3 | GPT-5.3-Codex medium | OpenAI | 2 | 8.4 | 12/16 | 16.6s |
| #9 | GPT-5.4 medium | OpenAI | 2 | 8.0 | 12/16 | 20.1s |
| #14 | GLM 5 medium | Z.ai | 2 | 7.4 | 11/16 | 16.2s |
| #25 | Claude Sonnet 4.6 none | Anthropic | 2 | 6.8 | 10/16 | 5.57s |
| #26 | Claude Opus 4.6 medium | Anthropic | 2 | 6.6 | 10/16 | 22.9s |
| #30 | Grok 4.1 Fast medium | X AI | 2 | 6.2 | 9/16 | 26.3s |
| #35 | Qwen3.5-35B-A3B medium | Qwen | 2 | 5.5 | 8/16 | 43.9s |
| #5 | Gemini 3 Flash Preview low | Google | 3 | 8.2 | 13/16 | 6.11s |
| #6 | Gemini 3 Pro Preview medium | Google | 3 | 8.2 | 13/16 | 7.15s |
| #8 | Gemini 3.1 Flash Lite Preview high | Google | 3 | 8.2 | 12/16 | 68.8s |
| #10 | Qwen3.5-122B-A10B medium | Qwen | 3 | 7.7 | 12/16 | 29.7s |
| #13 | Step 3.5 Flash medium | Stepfun | 3 | 7.4 | 10/16 | 29.1s |
| #18 | DeepSeek V3.2 medium | DeepSeek | 3 | 7.3 | 11/16 | 39.5s |
| #21 | MiMo-V2-Flash medium | Xiaomi | 3 | 7.2 | 11/16 | 25.3s |
| #28 | Kimi K2.5 medium | Moonshot AI | 3 | 6.4 | 9/16 | 69.8s |
| #32 | GPT-5 Mini medium | OpenAI | 3 | 6.0 | 8/16 | 25.1s |
| #12 | Gemini 3.1 Flash Lite Preview medium | Google | 4 | 7.5 | 11/16 | 3.83s |
| #15 | GPT-5.2 Chat none | OpenAI | 4 | 7.4 | 11/16 | 7.03s |
| #16 | Gemini 2.5 Flash medium | Google | 4 | 7.4 | 11/16 | 12.4s |
| #17 | Gemini 3.1 Flash Lite Preview low | Google | 4 | 7.3 | 11/16 | 3.36s |
| #19 | GPT-5.3 Chat none | OpenAI | 4 | 7.3 | 10/16 | 5.96s |
| #22 | Gemini 3.1 Flash Lite Preview none | Google | 4 | 7.1 | 10/16 | 1.33s |
| #20 | Gemini 3 Flash Preview none | Google | 5 | 7.2 | 11/16 | 1.75s |
| #34 | GPT-5 Nano medium | OpenAI | 5 | 5.5 | 7/16 | 47.9s |
| #36 | Mercury 2 medium | Inception | 5 | 5.3 | 7/16 | 2.36s |
| #39 | gpt-oss-120b medium | OpenAI | 5 | 5.1 | 7/16 | 16.7s |
| #43 | MiniMax M2.5 medium | Minimax | 5 | 4.7 | 5/16 | 43.0s |
| #33 | DeepSeek V3.2 none | DeepSeek | 6 | 5.5 | 7/16 | 12.9s |
| #29 | Qwen3.5 Plus 2026-02-15 none | Qwen | 7 | 6.2 | 9/16 | 2.65s |
| #31 | GLM 5 none | Z.ai | 7 | 6.0 | 9/16 | 4.03s |
| #52 | GLM 4.7 Flash medium | Z.ai | 7 | 3.1 | 4/16 | 36.8s |
| #37 | Qwen3.5-Flash none | Qwen | 8 | 5.2 | 7/16 | 3.54s |
| #42 | Qwen3.5-35B-A3B none | Qwen | 8 | 4.7 | 6/16 | 4.10s |
| #50 | Qwen3 Coder Next medium | Qwen | 8 | 3.5 | 3/16 | 12.5s |
| #38 | Gemini 2.5 Flash none | Google | 9 | 5.2 | 6/16 | 923ms |
| #40 | Qwen3.5-122B-A10B none | Qwen | 9 | 5.0 | 6/16 | 3.72s |
| #41 | Qwen3.5-27B none | Qwen | 9 | 4.9 | 5/16 | 1.75s |
| #44 | GPT-5.4 none | OpenAI | 9 | 4.5 | 6/16 | 1.48s |
| #45 | Trinity Large Preview none | Arcee AI | 9 | 4.2 | 5/16 | 3.15s |
| #49 | GLM 4.7 Flash none | Z.ai | 9 | 3.9 | 4/16 | 2.99s |
| #55 | LFM2-24B-A2B none | Liquid | 9 | 2.6 | 1/16 | 811ms |
| #48 | Qwen3 Coder Next none | Qwen | 10 | 4.0 | 4/16 | 11.7s |
| #54 | MiMo-V2-Flash none | Xiaomi | 10 | 2.9 | 3/16 | 2.97s |
| #46 | Kimi K2.5 none | Moonshot AI | 11 | 4.1 | 5/16 | 11.9s |
| #47 | GPT-4o-mini none | OpenAI | 11 | 4.0 | 4/16 | 2.07s |
| #51 | Mercury 2 none | Inception | 11 | 3.4 | 4/16 | 596ms |
| #53 | Grok 4.1 Fast none | X AI | 11 | 2.9 | 3/16 | 1.90s |