AI BENCHY Category
Tool Calling Rankings
See which AI models perform best at tool calling, which stay reliable, and where the biggest gaps appear.
Models shown
15
Average tool calling score
8.7
Top model
Gemini 3 Flash Preview 10.0

| Rank | Model | Company | Tool calling score | Score | Tests passed | Response time (avg) |
|---|---|---|---|---|---|---|
| #53 | Gemini 3.1 Flash Lite high | Google | 10.0 | 7.3 | 1/1 | 6.44s |
| #54 | GPT-5 Mini medium | OpenAI | 10.0 | 7.3 | 1/1 | 18.6s |
| #56 | MiMo-V2.5 medium | Xiaomi | 10.0 | 7.3 | 1/1 | 7.29s |
| #57 | Step 3.7 Flash low | Stepfun | 10.0 | 7.3 | 1/1 | 3.25s |
| #58 | Gemini 3.1 Flash Lite Preview none | Google | 10.0 | 7.2 | 1/1 | 3.39s |
| #60 | Kimi K2.6 medium | Moonshot AI | 10.0 | 7.2 | 1/1 | 8.92s |
| #61 | Gemini 3.1 Flash Lite low | Google | 10.0 | 7.2 | 1/1 | 5.66s |
| #62 | Step 3.5 Flash medium | Stepfun | 10.0 | 7.2 | 1/1 | 11.9s |
| #63 | GPT-5.3 Chat none | OpenAI | 10.0 | 7.2 | 1/1 | 8.36s |
| #64 | MiMo-V2-Flash medium | Xiaomi | 10.0 | 7.2 | 1/1 | 27.8s |
| #66 | Qwen3.5-35B-A3B medium | Qwen | 10.0 | 7.1 | 1/1 | 4.65s |
| #67 | MiniMax M3 medium | Minimax | 10.0 | 7.1 | 1/1 | 11.9s |
| #68 | Claude Opus 4.8 none | Anthropic | 10.0 | 7.0 | 1/1 | 5.35s |
| #69 | Claude Opus 4.6 medium | Anthropic | 10.0 | 7.0 | 1/1 | 9.73s |
| #70 | GPT-5.4 Nano medium | OpenAI | 10.0 | 7.0 | 1/1 | 7.71s |