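AI BENCHY Failure Analysis
Wrong Answer Failures
See which AI models hit Wrong Answer failures most often, so you can spot stability risks before choosing a model.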
| Rank | Model | Company | Wrong Answer count | Score | Tests passed | Response time (avg) |
|---|---|---|---|---|---|---|
| #21 | Hy3 preview medium | Tencent | 3 | 8.1 | 15/20 | 16.3s |
| #22 | Gemini 3 PRO Preview medium | Google | 3 | 8.1 | 15/20 | 9.05s |
| #28 | Qwen3.5-27B medium | Qwen | 3 | 7.9 | 13/20 | 60.1s |
| #34 | Gemma 4 26B A4B medium | Google | 3 | 7.8 | 14/20 | 50.9s |
| #48 | MiMo-V2.5-Pro medium | Xiaomi | 3 | 7.6 | 12/20 | 21.8s |
| #49 | Gemini 3.1 Flash Lite high | Google | 3 | 7.6 | 11/18 | 62.0s |
| #51 | Qwen3.5-Flash medium | Qwen | 3 | 7.6 | 12/20 | 63.0s |
| #53 | Claude Sonnet 4.6 medium | Anthropic | 3 | 7.6 | 13/20 | 15.8s |
| #59 | Kimi K2.6 medium | Moonshot AI | 3 | 7.4 | 12/20 | 54.0s |
| #63 | GPT-5.2 medium | OpenAI | 3 | 7.3 | 12/20 | 16.5s |
| #65 | Claude Opus 4.8 none | Anthropic | 3 | 7.3 | 12/20 | 3.51s |
| #71 | Claude Opus 4.6 medium | Anthropic | 3 | 7.2 | 12/20 | 25.5s |
| #156 | Qwen3.5-9B medium | Qwen | 3 | 4.2 | 3/20 | 83.3s |
| #3 | Gemini 3.5 Flash low | Google | 2 | 9.3 | 18/20 | 2.98s |
| #4 | Gemini 3.1 Pro Preview medium | Google | 2 | 9.3 | 18/20 | 20.8s |