AI BENCHY
❤️ Made by XCS

Instructions following Ranking

See which AI models perform best on Instructions following, which stay reliable, and where the biggest gaps appear. The table is sorted by Avg Score, ascending.

Models Shown: 55

Average Instructions following Score: 8.1

Rank | Model | Company | Instructions following Score | Avg Score | Tests Correct | Response Time (avg)
#55 | LFM2-24B-A2B none | Liquid | 4.5 | 2.6 | 0/2 | 1.09s
#53 | Grok 4.1 Fast none | X AI | 10.0 | 2.9 | 0/2 | 923ms
#54 | MiMo-V2-Flash none | Xiaomi | 5.5 | 2.9 | 1/2 | 857ms
#52 | GLM 4.7 Flash medium | Z.ai | 5.0 | 3.1 | 1/2 | 2.97s
#51 | Mercury 2 none | Inception | 5.5 | 3.4 | 1/2 | 551ms
#50 | Qwen3 Coder Next medium | Qwen | 4.5 | 3.5 | 0/2 | 7.34s
#49 | GLM 4.7 Flash none | Z.ai | 5.5 | 3.9 | 1/2 | 888ms
#47 | GPT-4o-mini none | OpenAI | 4.5 | 4.0 | 0/2 | 1.27s
#48 | Qwen3 Coder Next none | Qwen | 4.5 | 4.0 | 0/2 | 7.71s
#46 | Kimi K2.5 none | Moonshot AI | 5.5 | 4.1 | 1/2 | 2.67s
#45 | Trinity Large Preview none | Arcee AI | 3.5 | 4.2 | 0/2 | 1.09s
#44 | GPT-5.4 none | OpenAI | 5.5 | 4.5 | 1/2 | 1.07s
#42 | Qwen3.5-35B-A3B none | Qwen | 5.0 | 4.7 | 1/2 | 809ms
#43 | MiniMax M2.5 medium | Minimax | 8.0 | 4.7 | 1/2 | 4.64s
#41 | Qwen3.5-27B none | Qwen | 4.5 | 4.9 | 0/2 | 815ms
#40 | Qwen3.5-122B-A10B none | Qwen | 4.5 | 5.0 | 0/2 | 585ms
#39 | gpt-oss-120b medium | OpenAI | 9.5 | 5.1 | 2/2 | 7.63s
#37 | Qwen3.5-Flash none | Qwen | 5.0 | 5.2 | 1/2 | 8.81s
#38 | Gemini 2.5 Flash none | Google | 9.0 | 5.2 | 1/2 | 672ms
#36 | Mercury 2 medium | Inception | 10.0 | 5.3 | 2/2 | 1.07s
#33 | DeepSeek V3.2 none | DeepSeek | 10.0 | 5.5 | 2/2 | 1.52s
#34 | GPT-5 Nano medium | OpenAI | 9.0 | 5.5 | 1/2 | 11.9s
#35 | Qwen3.5-35B-A3B medium | Qwen | 10.0 | 5.5 | 2/2 | 24.4s
#31 | GLM 5 none | Z.ai | 10.0 | 6.0 | 2/2 | 1.48s
#32 | GPT-5 Mini medium | OpenAI | 7.5 | 6.0 | 1/2 | 15.7s
#29 | Qwen3.5 Plus 2026-02-15 none | Qwen | 10.0 | 6.2 | 2/2 | 1.67s
#30 | Grok 4.1 Fast medium | X AI | 5.5 | 6.2 | 1/2 | 5.30s
#28 | Kimi K2.5 medium | Moonshot AI | 10.0 | 6.4 | 2/2 | 92.5s
#27 | GPT-5.2 medium | OpenAI | 9.5 | 6.5 | 2/2 | 3.12s
#26 | Claude Opus 4.6 medium | Anthropic | 10.0 | 6.6 | 2/2 | 2.43s
#25 | Claude Sonnet 4.6 none | Anthropic | 5.5 | 6.8 | 1/2 | 1.96s
#23 | Seed-2.0-Mini medium | Bytedance Seed | 10.0 | 6.9 | 2/2 | 17.5s
#24 | Qwen3.5-Flash medium | Qwen | 10.0 | 6.9 | 2/2 | 63.5s
#22 | Gemini 3.1 Flash Lite Preview none | Google | 10.0 | 7.1 | 2/2 | 1.13s
#20 | Gemini 3 Flash Preview none | Google | 5.5 | 7.2 | 1/2 | 1.58s
#21 | MiMo-V2-Flash medium | Xiaomi | 10.0 | 7.2 | 2/2 | 4.28s
#17 | Gemini 3.1 Flash Lite Preview low | Google | 10.0 | 7.3 | 2/2 | 1.49s
#18 | DeepSeek V3.2 medium | DeepSeek | 10.0 | 7.3 | 2/2 | 35.8s
#19 | GPT-5.3 Chat none | OpenAI | 9.0 | 7.3 | 1/2 | 3.29s
#13 | Step 3.5 Flash medium | Stepfun | 9.0 | 7.4 | 1/2 | 4.98s
#14 | GLM 5 medium | Z.ai | 10.0 | 7.4 | 2/2 | 7.25s
#15 | GPT-5.2 Chat none | OpenAI | 6.0 | 7.4 | 1/2 | 5.46s
#16 | Gemini 2.5 Flash medium | Google | 9.5 | 7.4 | 2/2 | 2.62s
#12 | Gemini 3.1 Flash Lite Preview medium | Google | 10.0 | 7.5 | 2/2 | 1.91s
#10 | Qwen3.5-122B-A10B medium | Qwen | 10.0 | 7.7 | 2/2 | 9.88s
#11 | Claude Sonnet 4.6 medium | Anthropic | 10.0 | 7.7 | 2/2 | 2.61s
#9 | GPT-5.4 medium | OpenAI | 10.0 | 8.0 | 2/2 | 3.11s
#5 | Gemini 3 Flash Preview low | Google | 9.5 | 8.2 | 2/2 | 7.02s
#6 | Gemini 3 Pro Preview medium | Google | 9.5 | 8.2 | 2/2 | 3.26s
#7 | Qwen3.5-27B medium | Qwen | 10.0 | 8.2 | 2/2 | 19.7s
#8 | Gemini 3.1 Flash Lite Preview high | Google | 9.0 | 8.2 | 1/2 | 70.1s
#4 | Qwen3.5 Plus 2026-02-15 medium | Qwen | 10.0 | 8.3 | 2/2 | 31.9s
#3 | GPT-5.3-Codex medium | OpenAI | 10.0 | 8.4 | 2/2 | 3.04s
#2 | Gemini 3.1 Pro Preview medium | Google | 10.0 | 9.4 | 2/2 | 9.56s
#1 | Gemini 3 Flash Preview medium | Google | 10.0 | 10.0 | 2/2 | 6.10s
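
The rank numbers in the table do not count down strictly: wherever several rows share the same displayed Avg Score (for example #53 and #54 at 2.9), the better rank is listed first. One plausible explanation, offered here only as an assumption and not confirmed by the page, is a stable ascending sort on Avg Score applied to a list that was already in rank order. The sketch below reproduces that ordering for four rows copied from the table; the `Row` type and the variable names are illustrative.

```python
from typing import NamedTuple


class Row(NamedTuple):
    rank: int
    model: str        # model name plus the none/low/medium/high label from the table
    avg_score: float  # the Avg Score value as displayed (one decimal place)


# Four rows copied from the table, listed in rank order (best rank first).
rows_by_rank = [
    Row(52, "GLM 4.7 Flash medium", 3.1),
    Row(53, "Grok 4.1 Fast none", 2.9),
    Row(54, "MiMo-V2-Flash none", 2.9),
    Row(55, "LFM2-24B-A2B none", 2.6),
]

# Python's sorted() is stable: rows with equal keys keep their input order,
# so tied rows stay in rank order while the rest are ordered by Avg Score.
ascending = sorted(rows_by_rank, key=lambda r: r.avg_score)

print([r.rank for r in ascending])  # [55, 53, 54, 52]
```

The printed order matches the first four rows of the table. If the site actually breaks ties on an unrounded score or another column, the result would look the same for these rows, so this sketch cannot distinguish between those cases.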

[Chart: Top Models by Instructions following Score]

[Chart: Instructions following Score vs Total Cost]

[Chart: Top Models by Response Time (avg)]