AI BENCHY Category Failures
General Intelligence
Instructions not followed
See which AI models are most likely to fail to follow instructions in the General Intelligence category, so that weak spots can be identified quickly. Sorted by: correct tests ↑.
| Rank | Model | Company | Instructions-not-followed count | Category score | Correct tests | Response time (avg) |
|---|---|---|---|---|---|---|
| #3 | GPT-5.3-Codex medium | OpenAI | 1 | 4.0 | 0/1 | 4.87s |
| #7 | Qwen3.5-27B medium | Qwen | 1 | 5.0 | 0/1 | 101.4s |
| #9 | GPT-5.4 medium | OpenAI | 1 | 5.0 | 0/1 | 4.92s |
| #13 | Step 3.5 Flash medium | Stepfun | 1 | 6.0 | 0/1 | 6.54s |
| #14 | GLM 5 medium | Z.ai | 1 | 5.0 | 0/1 | 14.7s |
| #15 | GPT-5.2 Chat none | OpenAI | 1 | 4.0 | 0/1 | 3.20s |
| #16 | Gemini 2.5 Flash medium | Google | 1 | 4.0 | 0/1 | 4.86s |
| #17 | Gemini 3.1 Flash Lite Preview low | Google | 1 | 3.0 | 0/1 | 1.54s |
| #18 | DeepSeek V3.2 medium | DeepSeek | 1 | 3.0 | 0/1 | 31.3s |
| #19 | GPT-5.3 Chat none | OpenAI | 1 | 4.0 | 0/1 | 1.99s |
| #21 | MiMo-V2-Flash medium | Xiaomi | 1 | 3.0 | 0/1 | 4.20s |
| #22 | Gemini 3.1 Flash Lite Preview none | Google | 1 | 3.0 | 0/1 | 741ms |
| #23 | Seed-2.0-Mini medium | Bytedance Seed | 1 | 6.0 | 0/1 | 36.7s |
| #24 | Qwen3.5-Flash medium | Qwen | 1 | 5.0 | 0/1 | 40.1s |
| #25 | Claude Sonnet 4.6 none | Anthropic | 1 | 5.0 | 0/1 | 2.56s |
| #27 | GPT-5.2 medium | OpenAI | 1 | 10.0 | 0/1 | 4.32s |
| #28 | Kimi K2.5 medium | Moonshot AI | 1 | 6.0 | 0/1 | 69.7s |
| #30 | Grok 4.1 Fast medium | X AI | 1 | 3.0 | 0/1 | 16.2s |
| #32 | GPT-5 Mini medium | OpenAI | 1 | 4.0 | 0/1 | 13.5s |
| #34 | GPT-5 Nano medium | OpenAI | 1 | 3.0 | 0/1 | 17.5s |
| #36 | Mercury 2 medium | Inception | 1 | 4.0 | 0/1 | 821ms |
| #39 | gpt-oss-120b medium | OpenAI | 1 | 3.0 | 0/1 | 7.90s |
| #40 | Qwen3.5-122B-A10B none | Qwen | 1 | 5.0 | 0/1 | 1.12s |
| #41 | Qwen3.5-27B none | Qwen | 1 | 5.0 | 0/1 | 2.51s |
| #42 | Qwen3.5-35B-A3B none | Qwen | 1 | 6.0 | 0/1 | 1.19s |
| #43 | MiniMax M2.5 medium | Minimax | 1 | 3.0 | 0/1 | 6.63s |
| #45 | Trinity Large Preview none | Arcee AI | 1 | 3.0 | 0/1 | 2.86s |
| #50 | Qwen3 Coder Next medium | Qwen | 1 | 6.0 | 0/1 | 1.39s |
| #51 | Mercury 2 none | Inception | 1 | 4.0 | 0/1 | 628ms |
| #53 | Grok 4.1 Fast none | X AI | 1 | 3.0 | 0/1 | 1.08s |
| #54 | MiMo-V2-Flash none | Xiaomi | 1 | 4.0 | 0/1 | 1.67s |
| #55 | LFM2-24B-A2B none | Liquid | 1 | 3.0 | 0/1 | 395ms |