AI BENCHY
Instruction Following Ranking
See which AI models perform best at instruction following, which ones stay reliable, and where the biggest gaps appear. The table below is sorted by average response time, descending.
| Rank | Model | Company | Instruction Following Score | Overall Score | Tests Correct | Response Time (avg) |
|---|---|---|---|---|---|---|
| #12 | Gemini 3 PRO Preview medium | Google | 9.8 | 8.4 | 2/2 | 3.26s |
| #40 | GPT-5.2 medium | OpenAI | 9.9 | 7.5 | 2/2 | 3.12s |
| #16 | GPT-5.4 medium | OpenAI | 10.0 | 8.2 | 2/2 | 3.11s |
| #7 | GPT-5.3-Codex medium | OpenAI | 10.0 | 8.6 | 2/2 | 3.04s |
| #93 | GLM 4.7 Flash medium | Z.ai | 6.2 | 4.6 | 1/2 | 2.97s |
| #48 | Gemma 4 31B none | Google | 6.5 | 6.9 | 1/2 | 2.84s |
| #72 | Hunter Alpha none | OpenRouter | 6.4 | 5.7 | 1/2 | 2.82s |
| #76 | Kimi K2.5 none | Moonshot AI | 6.5 | 5.5 | 1/2 | 2.67s |
| #15 | Gemini 2.5 Flash medium | Google | 9.8 | 8.2 | 2/2 | 2.62s |
| #26 | Claude Sonnet 4.6 medium | Anthropic | 10.0 | 8.0 | 2/2 | 2.61s |
| #65 | MiMo-V2-Pro none | Xiaomi | 6.5 | 6.0 | 1/2 | 2.51s |
| #44 | GPT-5.4 Mini medium | OpenAI | 7.4 | 7.3 | 1/2 | 2.50s |
| #37 | Claude Opus 4.6 medium | Anthropic | 10.0 | 7.6 | 2/2 | 2.43s |
| #77 | GLM 5 Turbo none | Z.ai | 6.5 | 5.5 | 1/2 | 2.13s |
| #58 | GLM 5V Turbo none | Z.ai | 6.5 | 6.2 | 1/2 | 1.97s |
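
The view above is one fixed sort; the same rows can be re-sorted or mined for the score gaps the intro mentions with a few lines of Python. Below is a minimal sketch, assuming the rows are transcribed by hand from the table; the field names (`rank`, `if_score`, `overall`, `avg_s`) are illustrative, not an official AI BENCHY schema, and only a handful of rows are included.

```python
# Sketch: re-sorting leaderboard rows and surfacing score gaps.
# Field names are made up for illustration; values are copied from the table above.
rows = [
    {"rank": 12, "model": "Gemini 3 PRO Preview medium", "if_score": 9.8,  "overall": 8.4, "avg_s": 3.26},
    {"rank": 7,  "model": "GPT-5.3-Codex medium",        "if_score": 10.0, "overall": 8.6, "avg_s": 3.04},
    {"rank": 93, "model": "GLM 4.7 Flash medium",        "if_score": 6.2,  "overall": 4.6, "avg_s": 2.97},
    {"rank": 37, "model": "Claude Opus 4.6 medium",      "if_score": 10.0, "overall": 7.6, "avg_s": 2.43},
    {"rank": 58, "model": "GLM 5V Turbo none",           "if_score": 6.5,  "overall": 6.2, "avg_s": 1.97},
]

# The sort shown above: average response time, descending.
by_latency = sorted(rows, key=lambda r: r["avg_s"], reverse=True)

# "Biggest gaps": models whose instruction-following score diverges most
# from their overall score, largest absolute gap first.
for r in sorted(rows, key=lambda r: abs(r["if_score"] - r["overall"]), reverse=True):
    print(f'{r["model"]:32} gap = {r["if_score"] - r["overall"]:+.1f}')
```

Using `sorted` rather than an in-place `sort` keeps the original table order intact, so several views (latency, rank, gap) can be derived from the same rows.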