AI BENCHY Failure Analysis
Incorrect Answer Failures
See which AI models are most prone to incorrect answers, so you can gauge reliability risks before choosing one.
| Rank | Model | Company | Incorrect answers | Score | Tests passed | Avg. response time |
|---|---|---|---|---|---|---|
| #12 | Gemini 3 Flash Preview low | | 4 | 8.6 | 16/20 | 5.86s |
| #15 | GPT-5.3-Codex medium | OpenAI | 4 | 8.3 | 14/20 | 16.0s |
| #17 | Grok 4.20 Beta medium | X AI | 4 | 8.2 | 13/18 | 9.81s |
| #20 | Qwen3.5 Plus 2026-02-15 medium | Qwen | 4 | 8.1 | 14/20 | 67.9s |
| #24 | Gemini 3.5 Flash minimal | | 4 | 7.9 | 14/20 | 1.58s |
| #28 | GLM 5 Turbo medium | Z.ai | 4 | 7.9 | 13/20 | 22.7s |
| #29 | Hy3 preview medium | Tencent | 4 | 7.8 | 14/20 | 16.0s |
| #30 | Qwen3.6 35B A3B medium | Qwen | 4 | 7.8 | 14/20 | 17.3s |
| #31 | Grok 4.3 medium | X AI | 4 | 7.8 | 13/20 | 49.2s |
| #37 | Hy3 preview low | Tencent | 4 | 7.7 | 15/20 | 24.6s |
| #45 | Grok Build 0.1 medium | X AI | 4 | 7.6 | 12/20 | 26.4s |
| #47 | Gemma 4 26B A4B medium | | 4 | 7.5 | 13/20 | 51.4s |
| #51 | GLM 5.1 medium | Z.ai | 4 | 7.4 | 12/20 | 32.2s |
| #53 | MiMo-V2.5 medium | Xiaomi | 4 | 7.4 | 12/20 | 20.4s |
| #56 | Qwen3.5-Flash medium | Qwen | 4 | 7.4 | 11/20 | 65.6s |