AI BENCHY Category
Instructions following Ranking
See which AI models perform best on Instructions following, which ones stay reliable, and where the biggest gaps appear. Rows are sorted by average response time, fastest first.
| Rank | Model | Company | Instructions following Score | Overall Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #76 | Kimi K2.5 none | Moonshot AI | 6.5 | 5.5 | 1/2 | 2.67s |
| #72 | Hunter Alpha none | OpenRouter | 6.4 | 5.7 | 1/2 | 2.82s |
| #48 | Gemma 4 31B none | Google | 6.5 | 6.9 | 1/2 | 2.84s |
| #93 | GLM 4.7 Flash medium | Z.ai | 6.2 | 4.6 | 1/2 | 2.97s |
| #7 | GPT-5.3-Codex medium | OpenAI | 10.0 | 8.6 | 2/2 | 3.04s |
| #16 | GPT-5.4 medium | OpenAI | 10.0 | 8.2 | 2/2 | 3.11s |
| #40 | GPT-5.2 medium | OpenAI | 9.9 | 7.5 | 2/2 | 3.12s |
| #12 | Gemini 3 PRO Preview medium | Google | 9.8 | 8.4 | 2/2 | 3.26s |
| #36 | GPT-5.3 Chat none | OpenAI | 8.3 | 7.7 | 1/2 | 3.29s |
| #23 | MiMo-V2-Pro medium | Xiaomi | 9.9 | 8.1 | 2/2 | 3.36s |
| #31 | GLM 5V Turbo medium | Z.ai | 9.9 | 7.8 | 2/2 | 3.74s |
| #50 | Hunter Alpha medium | OpenRouter | 9.9 | 6.7 | 2/2 | 4.18s |
| #55 | MiMo-V2-Omni none | Xiaomi | 6.5 | 6.5 | 1/2 | 4.18s |
| #41 | MiMo-V2-Flash medium | Xiaomi | 10.0 | 7.5 | 2/2 | 4.28s |
| #47 | Grok 4.20 medium | X AI | 7.3 | 7.0 | 1/2 | 4.42s |