AI BENCHY زمرہ
ہدایات کی پیروی درجہ بندی
دیکھیں کہ ہدایات کی پیروی میں کون سے AI ماڈلز بہترین کارکردگی دکھاتے ہیں، کون سے قابلِ اعتماد رہتے ہیں، اور سب سے بڑے فرق کہاں نظر آتے ہیں۔ ترتیب دیں حسب: ردِعمل کا وقت (اوسط) ↑.
متعلقہ ناکامی کی وجوہات
| درجہ | ماڈل | کمپنی | ہدایات کی پیروی اسکور | اوسط اسکور | درست ٹیسٹس | ردِعمل کا وقت (اوسط) |
|---|---|---|---|---|---|---|
| #51 | Mercury 2 none | Inception | 5.5 | 3.4 | 1/2 | 551ms |
| #40 | Qwen3.5-122B-A10B none | Qwen | 4.5 | 5.0 | 0/2 | 585ms |
| #38 | Gemini 2.5 Flash none | 9.0 | 5.2 | 1/2 | 672ms | |
| #42 | Qwen3.5-35B-A3B none | Qwen | 5.0 | 4.7 | 1/2 | 809ms |
| #41 | Qwen3.5-27B none | Qwen | 4.5 | 4.9 | 0/2 | 815ms |
| #54 | MiMo-V2-Flash none | Xiaomi | 5.5 | 2.9 | 1/2 | 857ms |
| #49 | GLM 4.7 Flash none | Z.ai | 5.5 | 3.9 | 1/2 | 888ms |
| #53 | Grok 4.1 Fast none | X AI | 10.0 | 2.9 | 0/2 | 923ms |
| #36 | Mercury 2 medium | Inception | 10.0 | 5.3 | 2/2 | 1.07s |
| #44 | GPT-5.4 none | OpenAI | 5.5 | 4.5 | 1/2 | 1.07s |
| #55 | LFM2-24B-A2B none | Liquid | 4.5 | 2.6 | 0/2 | 1.09s |
| #45 | Trinity Large Preview none | Arcee AI | 3.5 | 4.2 | 0/2 | 1.09s |
| #22 | Gemini 3.1 Flash Lite Preview none | 10.0 | 7.1 | 2/2 | 1.13s | |
| #47 | GPT-4o-mini none | OpenAI | 4.5 | 4.0 | 0/2 | 1.27s |
| #31 | GLM 5 none | Z.ai | 10.0 | 6.0 | 2/2 | 1.48s |
| #17 | Gemini 3.1 Flash Lite Preview low | 10.0 | 7.3 | 2/2 | 1.49s | |
| #33 | DeepSeek V3.2 none | DeepSeek | 10.0 | 5.5 | 2/2 | 1.52s |
| #20 | Gemini 3 Flash Preview none | 5.5 | 7.2 | 1/2 | 1.58s | |
| #29 | Qwen3.5 Plus 2026-02-15 none | Qwen | 10.0 | 6.2 | 2/2 | 1.67s |
| #12 | Gemini 3.1 Flash Lite Preview medium | 10.0 | 7.5 | 2/2 | 1.91s | |
| #25 | Claude Sonnet 4.6 none | Anthropic | 5.5 | 6.8 | 1/2 | 1.96s |
| #26 | Claude Opus 4.6 medium | Anthropic | 10.0 | 6.6 | 2/2 | 2.43s |
| #11 | Claude Sonnet 4.6 medium | Anthropic | 10.0 | 7.7 | 2/2 | 2.61s |
| #16 | Gemini 2.5 Flash medium | 9.5 | 7.4 | 2/2 | 2.62s | |
| #46 | Kimi K2.5 none | Moonshot AI | 5.5 | 4.1 | 1/2 | 2.67s |
| #52 | GLM 4.7 Flash medium | Z.ai | 5.0 | 3.1 | 1/2 | 2.97s |
| #3 | GPT-5.3-Codex medium | OpenAI | 10.0 | 8.4 | 2/2 | 3.04s |
| #9 | GPT-5.4 medium | OpenAI | 10.0 | 8.0 | 2/2 | 3.11s |
| #27 | GPT-5.2 medium | OpenAI | 9.5 | 6.5 | 2/2 | 3.12s |
| #6 | Gemini 3 Pro Preview medium | 9.5 | 8.2 | 2/2 | 3.26s | |
| #19 | GPT-5.3 Chat none | OpenAI | 9.0 | 7.3 | 1/2 | 3.29s |
| #21 | MiMo-V2-Flash medium | Xiaomi | 10.0 | 7.2 | 2/2 | 4.28s |
| #43 | MiniMax M2.5 medium | Minimax | 8.0 | 4.7 | 1/2 | 4.64s |
| #13 | Step 3.5 Flash medium | Stepfun | 9.0 | 7.4 | 1/2 | 4.98s |
| #30 | Grok 4.1 Fast medium | X AI | 5.5 | 6.2 | 1/2 | 5.30s |
| #15 | GPT-5.2 Chat none | OpenAI | 6.0 | 7.4 | 1/2 | 5.46s |
| #1 | Gemini 3 Flash Preview medium | 10.0 | 10.0 | 2/2 | 6.10s | |
| #5 | Gemini 3 Flash Preview low | 9.5 | 8.2 | 2/2 | 7.02s | |
| #14 | GLM 5 medium | Z.ai | 10.0 | 7.4 | 2/2 | 7.25s |
| #50 | Qwen3 Coder Next medium | Qwen | 4.5 | 3.5 | 0/2 | 7.34s |
| #39 | gpt-oss-120b medium | OpenAI | 9.5 | 5.1 | 2/2 | 7.63s |
| #48 | Qwen3 Coder Next none | Qwen | 4.5 | 4.0 | 0/2 | 7.71s |
| #37 | Qwen3.5-Flash none | Qwen | 5.0 | 5.2 | 1/2 | 8.81s |
| #2 | Gemini 3.1 Pro Preview medium | 10.0 | 9.4 | 2/2 | 9.56s | |
| #10 | Qwen3.5-122B-A10B medium | Qwen | 10.0 | 7.7 | 2/2 | 9.88s |
| #34 | GPT-5 Nano medium | OpenAI | 9.0 | 5.5 | 1/2 | 11.9s |
| #32 | GPT-5 Mini medium | OpenAI | 7.5 | 6.0 | 1/2 | 15.7s |
| #23 | Seed-2.0-Mini medium | Bytedance Seed | 10.0 | 6.9 | 2/2 | 17.5s |
| #7 | Qwen3.5-27B medium | Qwen | 10.0 | 8.2 | 2/2 | 19.7s |
| #35 | Qwen3.5-35B-A3B medium | Qwen | 10.0 | 5.5 | 2/2 | 24.4s |
| #4 | Qwen3.5 Plus 2026-02-15 medium | Qwen | 10.0 | 8.3 | 2/2 | 31.9s |
| #18 | DeepSeek V3.2 medium | DeepSeek | 10.0 | 7.3 | 2/2 | 35.8s |
| #24 | Qwen3.5-Flash medium | Qwen | 10.0 | 6.9 | 2/2 | 63.5s |
| #8 | Gemini 3.1 Flash Lite Preview high | 9.0 | 8.2 | 1/2 | 70.1s | |
| #28 | Kimi K2.5 medium | Moonshot AI | 10.0 | 6.4 | 2/2 | 92.5s |