AI Benchy Leaderboard

Name: AI BENCHY Model Benchmark Results
Creator: AI BENCHY
License: https://aibenchy.com/methodology/

Last updated at: 2026-07-24
Models Evaluated: 222

Rank	Model	Score	Company	Total Cost	Response Time (avg)
#1 🥇	Gemini 3.6 Flash (medium)	9.9	Google	$0.831	10.11s
Total Tests: 22; Wrong Tests: 1; Reliability: 10.0; Attempt pass rate: 98.5%; Flaky tests: 1; Input Tokens: 66,293; Output Tokens: 2,000; Reasoning Tokens: 95,464; Response Time (avg): 10.11s; Response Time (total): 222.33s; Response Time (max): 68.03s
Wrong answer: 1
Anti-AI Tricks: 10.0; Coding: 10.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 8.2; General Intelligence: 10.0; Instructions following: 9.9; Puzzle Solving: 10.0; Tool Calling: 10.0; Trivia: 10.0
#2 🥈	Gemini 3.6 Flash (high)	9.7	Google	$1.785	14.88s
Total Tests: 22; Wrong Tests: 1; Reliability: 10.0; Attempt pass rate: 98.5%; Flaky tests: 1; Input Tokens: 87,819; Output Tokens: 5,750; Reasoning Tokens: 214,596; Response Time (avg): 14.88s; Response Time (total): 327.37s; Response Time (max): 88.00s
Wrong answer: 1
Anti-AI Tricks: 10.0; Coding: 10.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 10.0; General Intelligence: 10.0; Instructions following: 9.9; Puzzle Solving: 10.0; Tool Calling: 10.0; Trivia: 4.7
#3 🥉	Gemini 3 Flash Preview (medium)	9.6	Google	$0.742	19.20s
Total Tests: 22; Wrong Tests: 1; Reliability: 10.0; Attempt pass rate: 98.5%; Flaky tests: 1; Input Tokens: 87,861; Output Tokens: 5,486; Reasoning Tokens: 227,164; Response Time (avg): 19.20s; Response Time (total): 422.42s; Response Time (max): 117.26s
Wrong answer: 1
Anti-AI Tricks: 10.0; Coding: 8.6; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 10.0; General Intelligence: 10.0; Instructions following: 10.0; Puzzle Solving: 10.0; Tool Calling: 10.0; Trivia: 10.0
#4	Gemini 3.5 Flash (high)	9.5	Google	$1.976	15.07s
Total Tests: 22; Wrong Tests: 2; Reliability: 10.0; Attempt pass rate: 93.9%; Flaky tests: 2; Input Tokens: 107,137; Output Tokens: 8,777; Reasoning Tokens: 192,900; Response Time (avg): 15.07s; Response Time (total): 331.48s; Response Time (max): 145.92s
Invalid tool call: 1; Wrong answer: 1
Anti-AI Tricks: 10.0; Coding: 10.0; Combined: 8.2; Data parsing and extraction: 10.0; Domain specific: 7.6; General Intelligence: 10.0; Instructions following: 10.0; Puzzle Solving: 10.0; Tool Calling: 9.8; Trivia: 10.0
#5	GPT-5.6 Sol (low)	9.5	OpenAI	$0.971	8.79s
Total Tests: 22; Wrong Tests: 4; Reliability: 10.0; Attempt pass rate: 86.4%; Flaky tests: 2; Input Tokens: 78,571; Output Tokens: 4,476; Reasoning Tokens: 14,770; Response Time (avg): 8.79s; Response Time (total): 193.33s; Response Time (max): 53.91s
Wrong answer: 4
Anti-AI Tricks: 8.3; Coding: 10.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 10.0; Instructions following: 10.0; Puzzle Solving: 8.2; Tool Calling: 10.0; Trivia: 10.0
#6	Gemini 3.6 Flash (low)	9.4	Google	$0.517	4.42s
Total Tests: 22; Wrong Tests: 1; Reliability: 10.0; Attempt pass rate: 97.0%; Flaky tests: 1; Input Tokens: 82,715; Output Tokens: 5,729; Reasoning Tokens: 46,633; Response Time (avg): 4.42s; Response Time (total): 97.13s; Response Time (max): 28.92s
Wrong answer: 1
Anti-AI Tricks: 10.0; Coding: 7.8; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 10.0; General Intelligence: 10.0; Instructions following: 10.0; Puzzle Solving: 10.0; Tool Calling: 10.0; Trivia: 10.0
#7	GPT-5.6 Sol (medium)	9.4	OpenAI	$1.316	11.35s
Total Tests: 22; Wrong Tests: 4; Reliability: 10.0; Attempt pass rate: 90.9%; Flaky tests: 3; Input Tokens: 78,997; Output Tokens: 4,696; Reasoning Tokens: 26,002; Response Time (avg): 11.35s; Response Time (total): 249.73s; Response Time (max): 79.40s
Wrong answer: 4
Anti-AI Tricks: 10.0; Coding: 10.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 5.9; General Intelligence: 10.0; Instructions following: 10.0; Puzzle Solving: 8.2; Tool Calling: 10.0; Trivia: 4.7
#8	GPT-5.6 Sol (high)	9.4	OpenAI	$1.234	11.73s
Total Tests: 22; Wrong Tests: 4; Reliability: 10.0; Attempt pass rate: 89.4%; Flaky tests: 3; Input Tokens: 79,249; Output Tokens: 4,855; Reasoning Tokens: 23,044; Response Time (avg): 11.73s; Response Time (total): 257.99s; Response Time (max): 54.79s
Wrong answer: 4
Anti-AI Tricks: 8.7; Coding: 10.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 10.0; Instructions following: 10.0; Puzzle Solving: 10.0; Tool Calling: 10.0; Trivia: 4.7
#9	GPT-5.5 (low)	9.3	OpenAI	$1.253	10.13s
Total Tests: 22; Wrong Tests: 3; Reliability: 10.0; Attempt pass rate: 86.4%; Flaky tests: 0; Input Tokens: 80,058; Output Tokens: 5,378; Reasoning Tokens: 23,040; Response Time (avg): 10.13s; Response Time (total): 222.82s; Response Time (max): 56.19s
Wrong answer: 3
Anti-AI Tricks: 10.0; Coding: 10.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 10.0; Instructions following: 9.9; Puzzle Solving: 10.0; Tool Calling: 10.0; Trivia: 3.0
#10	Gemini 3.1 Pro Preview (medium)	9.2	Google	$1.361	21.47s
Total Tests: 22; Wrong Tests: 2; Reliability: 10.0; Attempt pass rate: 90.9%; Flaky tests: 0; Input Tokens: 92,287; Output Tokens: 5,232; Reasoning Tokens: 92,726; Response Time (avg): 21.47s; Response Time (total): 322.08s; Response Time (max): 88.68s
Wrong answer: 2
Anti-AI Tricks: 10.0; Coding: 7.9; Combined: 9.8; Data parsing and extraction: 10.0; Domain specific: 7.7; General Intelligence: 10.0; Instructions following: 10.0; Puzzle Solving: 10.0; Tool Calling: 10.0; Trivia: 10.0
#11	Qwen3.7 Max (medium)	9.2	Qwen	$1.116 ↓	40.57s
Total Tests: 22; Wrong Tests: 4; Reliability: 10.0; Attempt pass rate: 87.9%; Flaky tests: 2; Input Tokens: 106,020; Output Tokens: 5,748; Reasoning Tokens: 211,004; Response Time (avg): 40.57s; Response Time (total): 892.57s; Response Time (max): 556.06s
Wrong answer: 3; Invalid tool call: 1
Anti-AI Tricks: 10.0; Coding: 10.0; Combined: 8.7; Data parsing and extraction: 10.0; Domain specific: 5.9; General Intelligence: 10.0; Instructions following: 10.0; Puzzle Solving: 10.0; Tool Calling: 10.0; Trivia: 3.0
#12	Gemini 3.5 Flash (medium)	9.1	Google	$0.642	8.20s
Total Tests: 22; Wrong Tests: 3; Reliability: 10.0; Attempt pass rate: 87.9%; Flaky tests: 1; Input Tokens: 69,747; Output Tokens: 2,166; Reasoning Tokens: 57,436; Response Time (avg): 8.20s; Response Time (total): 180.47s; Response Time (max): 76.68s
Wrong answer: 2; Did not follow instructions: 1
Anti-AI Tricks: 10.0; Coding: 7.9; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 7.7; General Intelligence: 10.0; Instructions following: 9.9; Puzzle Solving: 7.7; Tool Calling: 10.0; Trivia: 10.0
#13	GPT-5.5 (medium)	9.0	OpenAI	$4.137	38.42s
Total Tests: 22; Wrong Tests: 4; Reliability: 10.0; Attempt pass rate: 87.9%; Flaky tests: 3; Input Tokens: 80,659; Output Tokens: 5,617; Reasoning Tokens: 118,819; Response Time (avg): 38.42s; Response Time (total): 845.35s; Response Time (max): 332.10s
Wrong answer: 4
Anti-AI Tricks: 10.0; Coding: 8.8; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 10.0; Instructions following: 10.0; Puzzle Solving: 10.0; Tool Calling: 10.0; Trivia: 2.8
#14	Gemini 3.5 Flash (low)	8.9	Google	$0.433	5.55s
Total Tests: 22; Wrong Tests: 3; Reliability: 10.0; Attempt pass rate: 87.9%; Flaky tests: 1; Input Tokens: 87,817; Output Tokens: 2,239; Reasoning Tokens: 31,182; Response Time (avg): 5.55s; Response Time (total): 122.19s; Response Time (max): 53.55s
Wrong answer: 2; Invalid tool call: 1
Anti-AI Tricks: 10.0; Coding: 7.8; Combined: 8.2; Data parsing and extraction: 10.0; Domain specific: 7.7; General Intelligence: 10.0; Instructions following: 9.9; Puzzle Solving: 10.0; Tool Calling: 10.0; Trivia: 10.0
#15	Grok 4.5 (high)	8.9	X AI	$1.707	76.50s
Total Tests: 22; Wrong Tests: 5; Reliability: 10.0; Attempt pass rate: 83.3%; Flaky tests: 2; Input Tokens: 151,562; Output Tokens: 5,655; Reasoning Tokens: 247,540; Response Time (avg): 76.50s; Response Time (total): 1683.07s; Response Time (max): 676.83s
No answer: 2; Wrong answer: 2; Did not follow instructions: 1
Anti-AI Tricks: 10.0; Coding: 10.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 3.6; General Intelligence: 4.7; Instructions following: 9.8; Puzzle Solving: 10.0; Tool Calling: 10.0; Trivia: 3.0
#16	GPT-5.3-Codex (medium)	8.9	OpenAI	$0.920	16.96s
Total Tests: 22; Wrong Tests: 6; Reliability: 10.0; Attempt pass rate: 83.3%; Flaky tests: 4; Input Tokens: 81,268; Output Tokens: 6,251; Reasoning Tokens: 49,274; Response Time (avg): 16.96s; Response Time (total): 373.19s; Response Time (max): 100.93s
Wrong answer: 4; Did not follow instructions: 2
Anti-AI Tricks: 8.7; Coding: 10.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 5.9; General Intelligence: 4.6; Instructions following: 10.0; Puzzle Solving: 9.0; Tool Calling: 10.0; Trivia: 2.8
#17	Claude Opus 4.8 (medium)	8.8	Anthropic	$1.931	12.49s
Total Tests: 22; Wrong Tests: 4; Reliability: 10.0; Attempt pass rate: 84.9%; Flaky tests: 1; Input Tokens: 138,451; Output Tokens: 40,766; Reasoning Tokens: 9,075; Response Time (avg): 12.49s; Response Time (total): 274.72s; Response Time (max): 70.54s
Wrong answer: 3; No answer: 1
Anti-AI Tricks: 10.0; Coding: 10.0; Combined: 9.9; Data parsing and extraction: 7.1; Domain specific: 5.3; General Intelligence: 10.0; Instructions following: 10.0; Puzzle Solving: 10.0; Tool Calling: 10.0; Trivia: 3.0
#18	Claude Opus 4.7 (medium)	8.7	Anthropic	$1.477	7.61s
Total Tests: 22; Wrong Tests: 4; Reliability: 10.0; Attempt pass rate: 83.3%; Flaky tests: 1; Input Tokens: 145,252; Output Tokens: 24,948; Reasoning Tokens: 5,042; Response Time (avg): 7.61s; Response Time (total): 159.91s; Response Time (max): 65.40s
Wrong answer: 3; Timed out: 1
Anti-AI Tricks: 8.3; Coding: 7.6; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 7.7; General Intelligence: 10.0; Instructions following: 10.0; Puzzle Solving: 10.0; Tool Calling: 10.0; Trivia: 3.0
#19	Muse Spark 1.1 (medium)	8.6	Meta	$1.357	24.97s
Total Tests: 22; Wrong Tests: 7; Reliability: 10.0; Attempt pass rate: 72.7%; Flaky tests: 2; Input Tokens: 142,567; Output Tokens: 7,905; Reasoning Tokens: 269,225; Response Time (avg): 24.97s; Response Time (total): 549.31s; Response Time (max): 165.38s
Wrong answer: 4; Did not follow instructions: 2; Invalid tool call: 1
Anti-AI Tricks: 10.0; Coding: 10.0; Combined: 8.3; Data parsing and extraction: 10.0; Domain specific: 3.5; General Intelligence: 10.0; Instructions following: 6.5; Puzzle Solving: 7.9; Tool Calling: 9.8; Trivia: 3.0
#20	Claude Fable 5 (medium)	8.6	Anthropic	$3.478	17.20s
Total Tests: 22; Wrong Tests: 5; Reliability: 10.0; Attempt pass rate: 78.8%; Flaky tests: 1; Input Tokens: 89,643; Output Tokens: 41,360; Reasoning Tokens: 10,269; Response Time (avg): 17.20s; Response Time (total): 378.41s; Response Time (max): 80.80s
No answer: 2; Wrong answer: 2; Invalid tool call: 1
Anti-AI Tricks: 10.0; Coding: 10.0; Combined: 6.5; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 10.0; Instructions following: 10.0; Puzzle Solving: 7.7; Tool Calling: 10.0; Trivia: 3.0
#21	GPT-5.4 (medium)	8.5	OpenAI	$1.533	23.10s
Total Tests: 22; Wrong Tests: 7; Reliability: 10.0; Attempt pass rate: 77.3%; Flaky tests: 4; Input Tokens: 81,127; Output Tokens: 6,155; Reasoning Tokens: 82,515; Response Time (avg): 23.10s; Response Time (total): 508.26s; Response Time (max): 100.41s
Wrong answer: 5; Did not follow instructions: 2
Anti-AI Tricks: 8.3; Coding: 8.8; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 4.7; Instructions following: 10.0; Puzzle Solving: 8.2; Tool Calling: 10.0; Trivia: 3.0
#23	Grok 4.5 (low)	8.4	X AI	$0.935	15.56s
Total Tests: 22; Wrong Tests: 6; Reliability: 10.0; Attempt pass rate: 75.8%; Flaky tests: 1; Input Tokens: 125,596; Output Tokens: 7,505; Reasoning Tokens: 106,446; Response Time (avg): 15.56s; Response Time (total): 342.32s; Response Time (max): 205.28s
Wrong answer: 6
Anti-AI Tricks: 10.0; Coding: 10.0; Combined: 6.5; Data parsing and extraction: 10.0; Domain specific: 3.0; General Intelligence: 6.1; Instructions following: 9.8; Puzzle Solving: 10.0; Tool Calling: 10.0; Trivia: 3.0
#24	GPT-5.2 (medium)	8.4	OpenAI	$0.951	22.62s
Total Tests: 22; Wrong Tests: 8; Reliability: 10.0; Attempt pass rate: 72.7%; Flaky tests: 4; Input Tokens: 105,004; Output Tokens: 9,914; Reasoning Tokens: 44,868; Response Time (avg): 22.62s; Response Time (total): 339.28s; Response Time (max): 102.93s
Did not follow instructions: 3; Wrong answer: 3; No answer: 1; Timed out: 1
Anti-AI Tricks: 6.5; Coding: 10.0; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 5.9; General Intelligence: 3.7; Instructions following: 9.9; Puzzle Solving: 7.5; Tool Calling: 4.7; Trivia: 3.0
#25	Grok 4.5 (medium)	8.3	X AI	$1.928	61.71s
Total Tests: 22; Wrong Tests: 6; Reliability: 10.0; Attempt pass rate: 78.8%; Flaky tests: 3; Input Tokens: 122,146; Output Tokens: 5,514; Reasoning Tokens: 275,053; Response Time (avg): 61.71s; Response Time (total): 1357.56s; Response Time (max): 436.38s
Wrong answer: 6
Anti-AI Tricks: 10.0; Coding: 7.6; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 2.9; General Intelligence: 6.5; Instructions following: 9.8; Puzzle Solving: 10.0; Tool Calling: 10.0; Trivia: 3.0
#26	Claude Sonnet 5 (medium)	8.3	Anthropic	$0.922	12.52s
Total Tests: 22; Wrong Tests: 6; Reliability: 10.0; Attempt pass rate: 80.3%; Flaky tests: 3; Input Tokens: 145,956; Output Tokens: 52,333; Reasoning Tokens: 10,874; Response Time (avg): 12.52s; Response Time (total): 275.42s; Response Time (max): 66.71s
Wrong answer: 4; Did not follow instructions: 1; Invalid tool call: 1
Anti-AI Tricks: 10.0; Coding: 9.0; Combined: 7.3; Data parsing and extraction: 10.0; Domain specific: 7.7; General Intelligence: 4.8; Instructions following: 9.9; Puzzle Solving: 7.7; Tool Calling: 10.0; Trivia: 3.0
#27	Muse Spark 1.1 (low)	8.3	Meta	$0.647	11.45s
Total Tests: 22; Wrong Tests: 9; Reliability: 10.0; Attempt pass rate: 69.7%; Flaky tests: 4; Input Tokens: 142,298; Output Tokens: 10,847; Reasoning Tokens: 99,467; Response Time (avg): 11.45s; Response Time (total): 251.92s; Response Time (max): 54.15s
Wrong answer: 6; Did not follow instructions: 2; Invalid tool call: 1
Anti-AI Tricks: 7.9; Coding: 10.0; Combined: 6.6; Data parsing and extraction: 10.0; Domain specific: 2.9; General Intelligence: 10.0; Instructions following: 7.3; Puzzle Solving: 8.3; Tool Calling: 9.8; Trivia: 3.0
#28	Gemini 2.5 Flash (medium)	8.2	Google	$0.643	21.18s
Total Tests: 22; Wrong Tests: 7; Reliability: 10.0; Attempt pass rate: 71.2%; Flaky tests: 1; Input Tokens: 132,498; Output Tokens: 12,739; Reasoning Tokens: 228,464; Response Time (avg): 21.18s; Response Time (total): 465.89s; Response Time (max): 140.50s
Wrong answer: 6; Did not follow instructions: 1
Anti-AI Tricks: 8.4; Coding: 7.8; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 5.9; General Intelligence: 4.8; Instructions following: 9.8; Puzzle Solving: 7.7; Tool Calling: 10.0; Trivia: 3.0
#29	GPT-5 Mini (medium)	8.1	OpenAI	$0.237	27.63s
Total Tests: 22; Wrong Tests: 10; Reliability: 10.0; Attempt pass rate: 63.6%; Flaky tests: 3; Input Tokens: 98,374; Output Tokens: 14,434; Reasoning Tokens: 91,498; Response Time (avg): 27.63s; Response Time (total): 607.92s; Response Time (max): 111.48s
Wrong answer: 5; Did not follow instructions: 3; No answer: 1; Timed out: 1
Anti-AI Tricks: 7.1; Coding: 10.0; Combined: 7.3; Data parsing and extraction: 10.0; Domain specific: 3.6; General Intelligence: 4.5; Instructions following: 10.0; Puzzle Solving: 5.6; Tool Calling: 10.0; Trivia: 3.0
#30	Muse Spark 1.1 (high)	8.1	Meta	$1.694	31.49s
Total Tests: 22; Wrong Tests: 10; Reliability: 10.0; Attempt pass rate: 69.7%; Flaky tests: 6; Input Tokens: 129,423; Output Tokens: 8,077; Reasoning Tokens: 352,421; Response Time (avg): 31.49s; Response Time (total): 661.28s; Response Time (max): 196.03s
Wrong answer: 4; Did not follow instructions: 2; Invalid tool call: 2; API error: 1; No answer: 1
Anti-AI Tricks: 7.5; Coding: 10.0; Combined: 5.9; Data parsing and extraction: 10.0; Domain specific: 3.5; General Intelligence: 10.0; Instructions following: 6.4; Puzzle Solving: 7.8; Tool Calling: 9.6; Trivia: 3.0
#31	Gemini 3.5 Flash-Lite (high)	8.1	Google	$0.584	9.48s
Total Tests: 22; Wrong Tests: 8; Reliability: 10.0; Attempt pass rate: 81.8%; Flaky tests: 7; Input Tokens: 105,138; Output Tokens: 8,315; Reasoning Tokens: 212,507; Response Time (avg): 9.48s; Response Time (total): 208.52s; Response Time (max): 43.93s
Wrong answer: 6; Did not follow instructions: 1; No answer: 1
Anti-AI Tricks: 10.0; Coding: 8.6; Combined: 7.3; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 5.4; Instructions following: 8.5; Puzzle Solving: 8.2; Tool Calling: 10.0; Trivia: 2.8
