AI Benchy Leaderboard

Name: AI BENCHY Model Benchmark Results
Creator: AI BENCHY
License: https://aibenchy.com/methodology/

Last updated: 2026-07-24
Models evaluated: 222


Rank	Model	Score	Company	Total Cost	Response Time (avg)
#71	Step 3.7 Flash (low)	7.3	Stepfun	$0.454	20.68s
Total Tests 22 Wrong Tests 10 Reliability 10.0 Attempt pass rate 68.2% Flaky tests 5 Input Tokens 103,833 Output Tokens 376,581 Reasoning Tokens 0 Response Time (avg) 20.68s Response Time (total) 455.01s Response Time (max) 124.75s Wrong answer: 8 Invalid tool call: 1 No answer: 1 Anti-AI Tricks : 8.7 Coding : 8.2 Combined : 7.3 Data parsing and extraction : 7.3 Domain specific : 5.3 General Intelligence : 3.4 Instructions following : 9.8 Puzzle Solving : 5.5 Tool Calling : 10.0 Trivia : 3.0
#73	KAT-Coder-Pro V2.5 (high)	7.2	Kwaipilot	$0.482	20.83s
Total Tests 22 Wrong Tests 11 Reliability 10.0 Attempt pass rate 63.6% Flaky tests 6 Input Tokens 106,076 Output Tokens 9,071 Reasoning Tokens 127,093 Response Time (avg) 20.83s Response Time (total) 458.31s Response Time (max) 199.97s Wrong answer: 10 Invalid tool call: 1 Anti-AI Tricks : 7.0 Coding : 6.4 Combined : 7.3 Data parsing and extraction : 10.0 Domain specific : 2.9 General Intelligence : 5.1 Instructions following : 9.9 Puzzle Solving : 8.2 Tool Calling : 10.0 Trivia : 3.0
#75	Qwen3.7 Plus (none)	7.2	Qwen	$0.106 ↓	12.09s
Total Tests 22 Wrong Tests 11 Reliability 10.0 Attempt pass rate 50.0% Flaky tests 0 Input Tokens 98,824 Output Tokens 58,097 Reasoning Tokens 0 Response Time (avg) 12.09s Response Time (total) 265.89s Response Time (max) 206.03s Wrong answer: 10 Did not follow instructions: 1 Anti-AI Tricks : 6.5 Coding : 5.5 Combined : 10.0 Data parsing and extraction : 10.0 Domain specific : 3.0 General Intelligence : 5.3 Instructions following : 6.3 Puzzle Solving : 7.7 Tool Calling : 10.0 Trivia : 3.0
#76	Qwen3.5-122B-A10B (medium)	7.1	Qwen	$1.046 ↓	64.16s
Total Tests 22 Wrong Tests 8 Reliability 10.0 Attempt pass rate 71.2% Flaky tests 4 Input Tokens 124,771 Output Tokens 44,077 Reasoning Tokens 443,141 Response Time (avg) 64.16s Response Time (total) 1411.60s Response Time (max) 519.30s Wrong answer: 5 Timed out: 2 Invalid tool call: 1 Anti-AI Tricks : 10.0 Coding : 6.0 Combined : 6.4 Data parsing and extraction : 10.0 Domain specific : 2.9 General Intelligence : 3.4 Instructions following : 10.0 Puzzle Solving : 10.0 Tool Calling : 10.0 Trivia : 3.0
#77	Grok 4.3 (medium)	7.1	X AI	$0.779	47.45s
Total Tests 22 Wrong Tests 9 Reliability 10.0 Attempt pass rate 68.2% Flaky tests 4 Input Tokens 140,031 Output Tokens 13,739 Reasoning Tokens 227,682 Response Time (avg) 47.45s Response Time (total) 1043.83s Response Time (max) 216.69s Wrong answer: 5 Did not follow instructions: 2 Extra formatting: 1 No answer: 1 Anti-AI Tricks : 10.0 Coding : 5.9 Combined : 6.5 Data parsing and extraction : 10.0 Domain specific : 5.3 General Intelligence : 5.4 Instructions following : 9.8 Puzzle Solving : 5.9 Tool Calling : 10.0 Trivia : 3.0
#79	Grok 4.20 (medium)	7.1	X AI	$0.777 ↓	29.47s
Total Tests 22 Wrong Tests 10 Reliability 10.0 Attempt pass rate 63.6% Flaky tests 4 Input Tokens 102,791 Output Tokens 5,363 Reasoning Tokens 253,977 Response Time (avg) 29.47s Response Time (total) 648.35s Response Time (max) 199.66s Wrong answer: 6 Did not follow instructions: 2 Extra formatting: 1 Invalid tool call: 1 Anti-AI Tricks : 8.2 Coding : 6.3 Combined : 8.7 Data parsing and extraction : 10.0 Domain specific : 5.3 General Intelligence : 3.9 Instructions following : 9.8 Puzzle Solving : 7.7 Tool Calling : 3.0 Trivia : 3.0
#80	DeepSeek V3.2 (medium)	7.0	DeepSeek	$0.078 ↑	68.62s
Total Tests 22 Wrong Tests 11 Reliability 10.0 Attempt pass rate 65.2% Flaky tests 7 Input Tokens 101,047 Output Tokens 11,834 Reasoning Tokens 117,014 Response Time (avg) 68.62s Response Time (total) 1509.53s Response Time (max) 376.10s Wrong answer: 5 API error: 2 Timed out: 2 Did not follow instructions: 1 Invalid tool call: 1 Anti-AI Tricks : 8.2 Coding : 6.0 Combined : 7.3 Data parsing and extraction : 10.0 Domain specific : 2.9 General Intelligence : 3.4 Instructions following : 10.0 Puzzle Solving : 7.0 Tool Calling : 10.0 Trivia : 3.0
#81	Kimi K2.5 (medium)	7.0	Moonshot AI	$0.600 ↑	99.00s
Total Tests 22 Wrong Tests 12 Reliability 10.0 Attempt pass rate 65.2% Flaky tests 8 Input Tokens 118,448 Output Tokens 62,124 Reasoning Tokens 165,243 Response Time (avg) 99.00s Response Time (total) 1485.04s Response Time (max) 281.00s Wrong answer: 5 Did not follow instructions: 2 No answer: 2 Timed out: 2 Invalid tool call: 1 Anti-AI Tricks : 7.3 Coding : 6.1 Combined : 6.7 Data parsing and extraction : 10.0 Domain specific : 3.5 General Intelligence : 6.5 Instructions following : 10.0 Puzzle Solving : 5.3 Tool Calling : 10.0 Trivia : 3.0
#82	Mercury 2 (medium)	7.0	Inception	$0.093	2.72s
Total Tests 22 Wrong Tests 12 Reliability 10.0 Attempt pass rate 51.5% Flaky tests 3 Input Tokens 109,572 Output Tokens 10,313 Reasoning Tokens 76,806 Response Time (avg) 2.72s Response Time (total) 57.12s Response Time (max) 14.63s Wrong answer: 8 Did not follow instructions: 3 Invalid tool call: 1 Anti-AI Tricks : 6.9 Coding : 8.2 Combined : 6.7 Data parsing and extraction : 7.3 Domain specific : 2.9 General Intelligence : 4.8 Instructions following : 10.0 Puzzle Solving : 5.4 Tool Calling : 10.0 Trivia : 3.0
#83	Gemini 3.5 Flash (none)	7.0	Google	$1.079	9.93s
Total Tests 22 Wrong Tests 7 Reliability 10.0 Attempt pass rate 74.2% Flaky tests 3 Input Tokens 13,843 Output Tokens 117,518 Reasoning Tokens 0 Response Time (avg) 9.93s Response Time (total) 178.68s Response Time (max) 64.36s API error: 4 Wrong answer: 3 Anti-AI Tricks : 10.0 Coding : 8.8 Combined : 3.0 Data parsing and extraction : 6.5 Domain specific : 7.6 General Intelligence : 10.0 Instructions following : 9.8 Puzzle Solving : 10.0 Tool Calling : 3.0 Trivia : 2.8
#85	KAT-Coder-Pro V2.5 (medium)	6.9	Kwaipilot	$0.467	24.04s
Total Tests 22 Wrong Tests 11 Reliability 10.0 Attempt pass rate 63.6% Flaky tests 7 Input Tokens 87,907 Output Tokens 7,213 Reasoning Tokens 128,251 Response Time (avg) 24.04s Response Time (total) 528.92s Response Time (max) 257.00s Wrong answer: 9 API error: 1 Did not follow instructions: 1 Anti-AI Tricks : 8.2 Coding : 7.8 Combined : 6.4 Data parsing and extraction : 7.3 Domain specific : 2.9 General Intelligence : 4.7 Instructions following : 9.9 Puzzle Solving : 5.9 Tool Calling : 10.0 Trivia : 3.0
#86	DeepSeek V4 Pro (none)	6.9	DeepSeek	$0.096	11.55s
Total Tests 22 Wrong Tests 12 Reliability 10.0 Attempt pass rate 51.5% Flaky tests 4 Input Tokens 148,069 Output Tokens 35,551 Reasoning Tokens 0 Response Time (avg) 11.55s Response Time (total) 254.11s Response Time (max) 119.44s Wrong answer: 8 Did not follow instructions: 2 Extra formatting: 1 Invalid tool call: 1 Anti-AI Tricks : 3.2 Coding : 5.6 Combined : 7.9 Data parsing and extraction : 10.0 Domain specific : 5.3 General Intelligence : 5.0 Instructions following : 6.3 Puzzle Solving : 10.0 Tool Calling : 10.0 Trivia : 3.0
#87	GPT-5.6 Sol (none)	6.9	OpenAI	$0.524	2.16s
Total Tests 22 Wrong Tests 11 Reliability 10.0 Attempt pass rate 59.1% Flaky tests 3 Input Tokens 78,593 Output Tokens 4,357 Reasoning Tokens 0 Response Time (avg) 2.16s Response Time (total) 47.62s Response Time (max) 12.81s Wrong answer: 10 Did not follow instructions: 1 Anti-AI Tricks : 8.3 Coding : 5.5 Combined : 6.5 Data parsing and extraction : 10.0 Domain specific : 3.6 General Intelligence : 6.5 Instructions following : 8.5 Puzzle Solving : 7.7 Tool Calling : 10.0 Trivia : 3.0
#88	MiMo-V2.5-Pro (medium)	6.9	Xiaomi	$0.187 ↓	33.92s
Total Tests 22 Wrong Tests 10 Reliability 10.0 Attempt pass rate 66.7% Flaky tests 5 Input Tokens 139,883 Output Tokens 15,521 Reasoning Tokens 130,992 Response Time (avg) 33.92s Response Time (total) 746.19s Response Time (max) 197.54s Extra formatting: 3 Wrong answer: 3 Did not follow instructions: 2 API error: 1 Invalid tool call: 1 Anti-AI Tricks : 10.0 Coding : 6.2 Combined : 6.9 Data parsing and extraction : 7.3 Domain specific : 5.3 General Intelligence : 5.5 Instructions following : 9.9 Puzzle Solving : 6.7 Tool Calling : 10.0 Trivia : 3.0
#89	Qwen3.6 Flash (medium)	6.9	Qwen	$0.738 ↓	44.65s
Total Tests 22 Wrong Tests 10 Reliability 10.0 Attempt pass rate 68.2% Flaky tests 5 Input Tokens 129,041 Output Tokens 20,026 Reasoning Tokens 614,312 Response Time (avg) 44.65s Response Time (total) 982.32s Response Time (max) 578.13s Wrong answer: 8 Did not follow instructions: 1 Invalid tool call: 1 Anti-AI Tricks : 10.0 Coding : 5.0 Combined : 6.5 Data parsing and extraction : 10.0 Domain specific : 3.5 General Intelligence : 4.8 Instructions following : 10.0 Puzzle Solving : 8.2 Tool Calling : 10.0 Trivia : 3.0
#90	Step 3.7 Flash (high)	6.9	Stepfun	$1.207	64.68s
Total Tests 22 Wrong Tests 11 Reliability 10.0 Attempt pass rate 63.6% Flaky tests 5 Input Tokens 98,691 Output Tokens 1,032,395 Reasoning Tokens 0 Response Time (avg) 64.68s Response Time (total) 1423.01s Response Time (max) 364.99s Wrong answer: 6 No answer: 4 Invalid tool call: 1 Anti-AI Tricks : 10.0 Coding : 4.0 Combined : 8.7 Data parsing and extraction : 10.0 Domain specific : 4.1 General Intelligence : 5.5 Instructions following : 9.8 Puzzle Solving : 5.3 Tool Calling : 10.0 Trivia : 3.0
#91	GPT-5.5 (none)	6.9	OpenAI	$0.544	2.36s
Total Tests 22 Wrong Tests 11 Reliability 10.0 Attempt pass rate 56.1% Flaky tests 3 Input Tokens 79,285 Output Tokens 4,915 Reasoning Tokens 0 Response Time (avg) 2.36s Response Time (total) 51.88s Response Time (max) 12.24s Wrong answer: 11 Anti-AI Tricks : 6.9 Coding : 5.5 Combined : 6.5 Data parsing and extraction : 10.0 Domain specific : 2.9 General Intelligence : 10.0 Instructions following : 6.2 Puzzle Solving : 7.7 Tool Calling : 10.0 Trivia : 3.0
#92	Gemini 3.5 Flash (minimal)	6.8	Google	$0.300	2.65s
Total Tests 22 Wrong Tests 8 Reliability 10.0 Attempt pass rate 65.2% Flaky tests 1 Input Tokens 100,753 Output Tokens 16,454 Reasoning Tokens 0 Response Time (avg) 2.65s Response Time (total) 58.27s Response Time (max) 25.26s Wrong answer: 5 Invalid tool call: 2 Did not follow instructions: 1 Anti-AI Tricks : 6.5 Coding : 5.6 Combined : 3.0 Data parsing and extraction : 10.0 Domain specific : 10.0 General Intelligence : 10.0 Instructions following : 6.4 Puzzle Solving : 10.0 Tool Calling : 10.0 Trivia : 3.0
#93	Gemini 3 Flash Preview (none)	6.8	Google	$0.085	2.95s
Total Tests 22 Wrong Tests 9 Reliability 10.0 Attempt pass rate 65.2% Flaky tests 3 Input Tokens 104,210 Output Tokens 10,710 Reasoning Tokens 0 Response Time (avg) 2.95s Response Time (total) 44.26s Response Time (max) 21.19s Wrong answer: 8 No answer: 1 Anti-AI Tricks : 8.3 Coding : 5.5 Combined : 3.8 Data parsing and extraction : 10.0 Domain specific : 7.7 General Intelligence : 10.0 Instructions following : 6.4 Puzzle Solving : 7.7 Tool Calling : 10.0 Trivia : 3.0
#94	Qwen3.6 35B A3B (medium)	6.7	Qwen	$0.746 ↑	58.06s
Total Tests 22 Wrong Tests 9 Reliability 10.0 Attempt pass rate 60.6% Flaky tests 1 Input Tokens 85,139 Output Tokens 61,819 Reasoning Tokens 678,766 Response Time (avg) 58.06s Response Time (total) 1161.18s Response Time (max) 817.57s Wrong answer: 4 API error: 2 Did not follow instructions: 1 Invalid tool call: 1 No answer: 1 Anti-AI Tricks : 10.0 Coding : 7.7 Combined : 3.0 Data parsing and extraction : 10.0 Domain specific : 5.3 General Intelligence : 4.4 Instructions following : 10.0 Puzzle Solving : 8.0 Tool Calling : 3.0 Trivia : 3.0
#95	Gemini 3.5 Flash-Lite (low)	6.7	Google	$0.145	2.25s
Total Tests 22 Wrong Tests 10 Reliability 10.0 Attempt pass rate 66.7% Flaky tests 5 Input Tokens 144,622 Output Tokens 15,302 Reasoning Tokens 24,971 Response Time (avg) 2.25s Response Time (total) 49.58s Response Time (max) 13.50s Wrong answer: 9 No answer: 1 Anti-AI Tricks : 10.0 Coding : 4.1 Combined : 6.3 Data parsing and extraction : 10.0 Domain specific : 3.6 General Intelligence : 6.1 Instructions following : 9.8 Puzzle Solving : 7.8 Tool Calling : 9.8 Trivia : 4.7
#97	KAT-Coder-Pro V2.5 (none)	6.7	Kwaipilot	$0.476	25.56s
Total Tests 22 Wrong Tests 11 Reliability 10.0 Attempt pass rate 68.2% Flaky tests 7 Input Tokens 98,499 Output Tokens 135,861 Reasoning Tokens 0 Response Time (avg) 25.56s Response Time (total) 562.43s Response Time (max) 335.41s Wrong answer: 10 Invalid tool call: 1 Anti-AI Tricks : 8.7 Coding : 6.1 Combined : 4.1 Data parsing and extraction : 10.0 Domain specific : 3.6 General Intelligence : 4.8 Instructions following : 9.8 Puzzle Solving : 8.2 Tool Calling : 10.0 Trivia : 3.0
#98	GLM 5V Turbo (medium)	6.7	Z.ai	$0.457	23.08s
Total Tests 21 Wrong Tests 10 Reliability 10.0 Attempt pass rate 65.2% Flaky tests 6 Input Tokens 44,615 Output Tokens 2,347 Reasoning Tokens 98,415 Response Time (avg) 23.08s Response Time (total) 484.63s Response Time (max) 95.88s Wrong answer: 7 Invalid tool call: 2 Did not follow instructions: 1 Anti-AI Tricks : 7.2 Coding : 6.0 Combined : 3.4 Data parsing and extraction : 10.0 Domain specific : 5.3 General Intelligence : 10.0 Instructions following : 9.9 Puzzle Solving : 7.7 Tool Calling : 7.0 Trivia : 3.0
#99	Claude Opus 4.7 (none)	6.6	Anthropic	$0.505	3.02s
Total Tests 19 Wrong Tests 3 Reliability 10.0 Attempt pass rate 72.7% Flaky tests 0 Input Tokens 69,576 Output Tokens 6,265 Reasoning Tokens 0 Response Time (avg) 3.02s Response Time (total) 57.44s Response Time (max) 18.27s Wrong answer: 3 Anti-AI Tricks : 8.3 Coding : 3.3 Combined : 4.8 Data parsing and extraction : 10.0 Domain specific : 7.7 General Intelligence : 10.0 Instructions following : 10.0 Puzzle Solving : 10.0 Tool Calling : 10.0 Trivia : 3.0
#101	GLM 5.2 (none)	6.6	Z.ai	$0.125 ↓	9.34s
Total Tests 22 Wrong Tests 10 Reliability 10.0 Attempt pass rate 59.1% Flaky tests 2 Input Tokens 112,359 Output Tokens 14,340 Reasoning Tokens 0 Response Time (avg) 9.34s Response Time (total) 205.46s Response Time (max) 79.65s Wrong answer: 8 Did not follow instructions: 1 Invalid tool call: 1 Anti-AI Tricks : 8.3 Coding : 3.7 Combined : 6.9 Data parsing and extraction : 10.0 Domain specific : 5.3 General Intelligence : 6.1 Instructions following : 9.8 Puzzle Solving : 7.7 Tool Calling : 10.0 Trivia : 3.0
#104	Gemini 3.5 Flash-Lite (medium)	6.5	Google	$0.369	6.01s
Total Tests 22 Wrong Tests 10 Reliability 10.0 Attempt pass rate 69.7% Flaky tests 7 Input Tokens 118,818 Output Tokens 11,677 Reasoning Tokens 121,611 Response Time (avg) 6.01s Response Time (total) 132.30s Response Time (max) 49.03s Wrong answer: 9 No answer: 1 Anti-AI Tricks : 10.0 Coding : 5.5 Combined : 3.8 Data parsing and extraction : 10.0 Domain specific : 2.9 General Intelligence : 5.4 Instructions following : 9.8 Puzzle Solving : 8.4 Tool Calling : 10.0 Trivia : 3.0
#105	Qwen3.6 27B (medium)	6.5	Qwen	$1.038 ↑	106.32s
Total Tests 22 Wrong Tests 12 Reliability 10.0 Attempt pass rate 59.1% Flaky tests 6 Input Tokens 106,167 Output Tokens 32,889 Reasoning Tokens 241,303 Response Time (avg) 106.32s Response Time (total) 2339.12s Response Time (max) 1085.11s Wrong answer: 6 No answer: 3 Invalid tool call: 2 Did not follow instructions: 1 Anti-AI Tricks : 8.3 Coding : 7.7 Combined : 6.7 Data parsing and extraction : 3.5 Domain specific : 2.9 General Intelligence : 6.5 Instructions following : 10.0 Puzzle Solving : 7.7 Tool Calling : 10.0 Trivia : 3.0
#106	Hy3 preview (medium)	6.5	Tencent	$0.018 ↕	16.28s
Total Tests 21 Wrong Tests 7 Reliability 10.0 Attempt pass rate 63.6% Flaky tests 0 Input Tokens 27,030 Output Tokens 73,544 Reasoning Tokens 0 Response Time (avg) 16.28s Response Time (total) 293.12s Response Time (max) 46.04s API error: 3 Wrong answer: 3 Did not follow instructions: 1 Anti-AI Tricks : 10.0 Coding : 5.3 Combined : 5.0 Data parsing and extraction : 6.5 Domain specific : 5.3 General Intelligence : 10.0 Instructions following : 10.0 Puzzle Solving : 7.7 Tool Calling : 10.0 Trivia : 3.0
#107	MiMo-V2.5 (medium)	6.5	Xiaomi	$0.082 ↓	32.20s
Total Tests 22 Wrong Tests 10 Reliability 10.0 Attempt pass rate 69.7% Flaky tests 6 Input Tokens 105,447 Output Tokens 7,120 Reasoning Tokens 230,682 Response Time (avg) 32.20s Response Time (total) 708.46s Response Time (max) 162.44s Wrong answer: 5 Extra formatting: 2 Did not follow instructions: 1 Invalid tool call: 1 No answer: 1 Anti-AI Tricks : 10.0 Coding : 6.2 Combined : 8.7 Data parsing and extraction : 2.7 Domain specific : 5.3 General Intelligence : 5.4 Instructions following : 9.9 Puzzle Solving : 8.2 Tool Calling : 10.0 Trivia : 3.0
#108	Laguna XS 2.1 (medium)	6.5	Poolside	$0.068	47.93s
Total Tests 22 Wrong Tests 13 Reliability 10.0 Attempt pass rate 42.4% Flaky tests 1 Input Tokens 118,989 Output Tokens 30,750 Reasoning Tokens 491,833 Response Time (avg) 47.93s Response Time (total) 1054.49s Response Time (max) 422.72s Wrong answer: 11 Invalid tool call: 1 No answer: 1 Anti-AI Tricks : 4.8 Coding : 5.5 Combined : 6.3 Data parsing and extraction : 10.0 Domain specific : 2.9 General Intelligence : 5.0 Instructions following : 9.8 Puzzle Solving : 5.3 Tool Calling : 10.0 Trivia : 3.0
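
In the rows above, the "Wrong Tests" figure appears to equal the sum of the per-category failure counts (for rank #71: 8 + 1 + 1 = 10). The source does not state this rule, so treat it as an observation about the listed data rather than documented methodology. A minimal Python sketch of that check follows; the helper names `failure_counts` and `wrong_tests` and the label list are illustrative assumptions, not part of any AI BENCHY tooling.

```python
import re

# Failure-category labels as they appear in the detail lines above.
# This list is an assumption drawn from the visible rows; other pages may use more.
FAILURE_LABELS = [
    "Wrong answer", "Invalid tool call", "No answer",
    "Did not follow instructions", "Timed out",
    "Extra formatting", "API error",
]

def failure_counts(detail: str) -> dict:
    """Extract 'Label: N' failure counts from one detail line (hypothetical helper)."""
    counts = {}
    for label in FAILURE_LABELS:
        match = re.search(re.escape(label) + r":\s*(\d+)", detail)
        if match:
            counts[label] = int(match.group(1))
    return counts

def wrong_tests(detail: str) -> int:
    """Read the 'Wrong Tests N' field from one detail line (hypothetical helper)."""
    match = re.search(r"Wrong Tests\s+(\d+)", detail)
    if match is None:
        raise ValueError("no 'Wrong Tests' field found")
    return int(match.group(1))

# Abridged detail text for Step 3.7 Flash (low), rank #71 above.
detail = (
    "Total Tests 22 Wrong Tests 10 Reliability 10.0 "
    "Wrong answer: 8 Invalid tool call: 1 No answer: 1"
)
counts = failure_counts(detail)
consistent = sum(counts.values()) == wrong_tests(detail)
print(counts, consistent)
```

The same check can be run over any other detail line by pasting its text into `detail`; a mismatch would suggest either a failure category missing from `FAILURE_LABELS` or a row where the assumed rule does not hold.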
