AI BENCHY Failures
Wrong Answer Failures
See which AI models give wrong answers most often, so you can understand the reliability risks before choosing one. The table is sorted by average response time, ascending.
| Rank | Model | Company | Wrong answers | Avg. score | Tests passed | Avg. response time |
|---|---|---|---|---|---|---|
| #51 | Mercury 2 none | Inception | 11 | 3.4 | 4/16 | 596ms |
| #55 | LFM2-24B-A2B none | Liquid | 9 | 2.6 | 1/16 | 811ms |
| #38 | Gemini 2.5 Flash none | Google | 9 | 5.2 | 6/16 | 923ms |
| #22 | Gemini 3.1 Flash Lite Preview none | Google | 4 | 7.1 | 10/16 | 1.33s |
| #44 | GPT-5.4 none | OpenAI | 9 | 4.5 | 6/16 | 1.48s |
| #20 | Gemini 3 Flash Preview none | Google | 5 | 7.2 | 11/16 | 1.75s |
| #41 | Qwen3.5-27B none | Qwen | 9 | 4.9 | 5/16 | 1.75s |
| #53 | Grok 4.1 Fast none | X AI | 11 | 2.9 | 3/16 | 1.90s |
| #47 | GPT-4o-mini none | OpenAI | 11 | 4.0 | 4/16 | 2.07s |
| #36 | Mercury 2 medium | Inception | 5 | 5.3 | 7/16 | 2.36s |
| #29 | Qwen3.5 Plus 2026-02-15 none | Qwen | 7 | 6.2 | 9/16 | 2.65s |
| #54 | MiMo-V2-Flash none | Xiaomi | 10 | 2.9 | 3/16 | 2.97s |
| #49 | GLM 4.7 Flash none | Z.ai | 9 | 3.9 | 4/16 | 2.99s |
| #45 | Trinity Large Preview none | Arcee AI | 9 | 4.2 | 5/16 | 3.15s |
| #17 | Gemini 3.1 Flash Lite Preview low | Google | 4 | 7.3 | 11/16 | 3.36s |
| #37 | Qwen3.5-Flash none | Qwen | 8 | 5.2 | 7/16 | 3.54s |
| #40 | Qwen3.5-122B-A10B none | Qwen | 9 | 5.0 | 6/16 | 3.72s |
| #12 | Gemini 3.1 Flash Lite Preview medium | Google | 4 | 7.5 | 11/16 | 3.83s |
| #31 | GLM 5 none | Z.ai | 7 | 6.0 | 9/16 | 4.03s |
| #42 | Qwen3.5-35B-A3B none | Qwen | 8 | 4.7 | 6/16 | 4.10s |
| #25 | Claude Sonnet 4.6 none | Anthropic | 2 | 6.8 | 10/16 | 5.57s |
| #19 | GPT-5.3 Chat none | OpenAI | 4 | 7.3 | 10/16 | 5.96s |
| #5 | Gemini 3 Flash Preview low | Google | 3 | 8.2 | 13/16 | 6.11s |
| #15 | GPT-5.2 Chat none | OpenAI | 4 | 7.4 | 11/16 | 7.03s |
| #6 | Gemini 3 Pro Preview medium | Google | 3 | 8.2 | 13/16 | 7.15s |
| #11 | Claude Sonnet 4.6 medium | Anthropic | 1 | 7.7 | 12/16 | 11.2s |
| #48 | Qwen3 Coder Next none | Qwen | 10 | 4.0 | 4/16 | 11.7s |
| #46 | Kimi K2.5 none | Moonshot AI | 11 | 4.1 | 5/16 | 11.9s |
| #16 | Gemini 2.5 Flash medium | Google | 4 | 7.4 | 11/16 | 12.4s |
| #50 | Qwen3 Coder Next medium | Qwen | 8 | 3.5 | 3/16 | 12.5s |
| #33 | DeepSeek V3.2 none | DeepSeek | 6 | 5.5 | 7/16 | 12.9s |
| #27 | GPT-5.2 medium | OpenAI | 1 | 6.5 | 10/16 | 15.3s |
| #14 | GLM 5 medium | Z.ai | 2 | 7.4 | 11/16 | 16.2s |
| #3 | GPT-5.3-Codex medium | OpenAI | 2 | 8.4 | 12/16 | 16.6s |
| #2 | Gemini 3.1 Pro Preview medium | Google | 1 | 9.4 | 15/16 | 16.6s |
| #39 | gpt-oss-120b medium | OpenAI | 5 | 5.1 | 7/16 | 16.7s |
| #9 | GPT-5.4 medium | OpenAI | 2 | 8.0 | 12/16 | 20.1s |
| #26 | Claude Opus 4.6 medium | Anthropic | 2 | 6.6 | 10/16 | 22.9s |
| #32 | GPT-5 Mini medium | OpenAI | 3 | 6.0 | 8/16 | 25.1s |
| #21 | MiMo-V2-Flash medium | Xiaomi | 3 | 7.2 | 11/16 | 25.3s |
| #30 | Grok 4.1 Fast medium | X AI | 2 | 6.2 | 9/16 | 26.3s |
| #13 | Step 3.5 Flash medium | Stepfun | 3 | 7.4 | 10/16 | 29.1s |
| #10 | Qwen3.5-122B-A10B medium | Qwen | 3 | 7.7 | 12/16 | 29.7s |
| #4 | Qwen3.5 Plus 2026-02-15 medium | Qwen | 1 | 8.3 | 13/16 | 34.5s |
| #52 | GLM 4.7 Flash medium | Z.ai | 7 | 3.1 | 4/16 | 36.8s |
| #18 | DeepSeek V3.2 medium | DeepSeek | 3 | 7.3 | 11/16 | 39.5s |
| #43 | MiniMax M2.5 medium | Minimax | 5 | 4.7 | 5/16 | 43.0s |
| #35 | Qwen3.5-35B-A3B medium | Qwen | 2 | 5.5 | 8/16 | 43.9s |
| #34 | GPT-5 Nano medium | OpenAI | 5 | 5.5 | 7/16 | 47.9s |
| #7 | Qwen3.5-27B medium | Qwen | 1 | 8.2 | 12/16 | 52.1s |
| #23 | Seed-2.0-Mini medium | Bytedance Seed | 1 | 6.9 | 10/16 | 65.1s |
| #8 | Gemini 3.1 Flash Lite Preview high | Google | 3 | 8.2 | 12/16 | 68.8s |
| #28 | Kimi K2.5 medium | Moonshot AI | 3 | 6.4 | 9/16 | 69.8s |
| #24 | Qwen3.5-Flash medium | Qwen | 1 | 6.9 | 10/16 | 70.8s |
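
To compare reliability against speed, the rows above can be reduced to a wrong answer rate per model. The following is a minimal sketch, assuming the denominator in the tests column (16) is the total number of tests each model ran and that the rate is simply wrong answers divided by that total; the page itself does not define such a metric. `Row`, `parse_latency`, `parse_row`, and `rank_by_wrong_rate` are hypothetical helper names, not part of any AI Benchy tooling.

```python
from dataclasses import dataclass


@dataclass
class Row:
    model: str
    wrong: int        # number of wrong answers
    passed: int       # tests passed
    total: int        # total tests (assumed from the "x/16" column)
    latency_s: float  # average response time in seconds

    @property
    def wrong_rate(self) -> float:
        # Assumption: rate = wrong answers / total tests.
        return self.wrong / self.total


def parse_latency(text: str) -> float:
    """Convert a value such as '596ms' or '1.33s' to seconds."""
    text = text.strip()
    if text.endswith("ms"):
        return float(text[:-2]) / 1000.0
    if text.endswith("s"):
        return float(text[:-1])
    raise ValueError(f"unrecognized latency: {text!r}")


def parse_row(line: str) -> Row:
    """Parse one well-formed seven-column table row."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    _rank, model, _company, wrong, _score, tests, latency = cells
    passed, total = tests.split("/")
    return Row(model, int(wrong), int(passed), int(total), parse_latency(latency))


def rank_by_wrong_rate(rows: list[Row]) -> list[Row]:
    """Sort by lowest wrong answer rate, breaking ties by lower latency."""
    return sorted(rows, key=lambda r: (r.wrong_rate, r.latency_s))


# Three rows copied from the table above.
sample = [
    "| #51 | Mercury 2 none | Inception | 11 | 3.4 | 4/16 | 596ms |",
    "| #25 | Claude Sonnet 4.6 none | Anthropic | 2 | 6.8 | 10/16 | 5.57s |",
    "| #27 | GPT-5.2 medium | OpenAI | 1 | 6.5 | 10/16 | 15.3s |",
]

if __name__ == "__main__":
    for row in rank_by_wrong_rate([parse_row(s) for s in sample]):
        print(f"{row.model}: {row.wrong_rate:.1%} wrong, {row.latency_s:.2f}s avg")
```

Note that wrong answers and passed tests do not always sum to 16 (Mercury 2 none has 11 wrong and 4 passed), so the remaining tests presumably failed for some other reason; this sketch does not try to account for them.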