AI Benchy Leaderboard

Name: AI BENCHY Model Benchmark Results
Creator: AI BENCHY
License: https://aibenchy.com/methodology/

Last updated at: 2026-07-24
Models Evaluated: 222

Rank	Model	Score	Company	Total Cost	Response Time (avg)
#109	Qwen3.5-27B (none)	6.5	Qwen	$0.058 ↓	4.76s
  Total Tests: 22; Wrong Tests: 14; Reliability: 10.0; Attempt pass rate: 40.9%; Flaky tests: 2; Input Tokens: 102,316; Output Tokens: 24,321; Reasoning Tokens: 0; Response Time (avg): 4.76s; Response Time (total): 104.71s; Response Time (max): 69.46s
  Failure reasons: Wrong answer: 12; Did not follow instructions: 2
  Category scores: Anti-AI Tricks: 4.8; Coding: 5.8; Combined: 6.4; Data parsing and extraction: 10.0; Domain specific: 3.0; General Intelligence: 5.0; Instructions following: 6.3; Puzzle Solving: 6.7; Tool Calling: 10.0; Trivia: 3.0
#110	Gemini 3.1 Flash Lite Preview (low)	6.5	Google	$0.646	16.70s
  Total Tests: 22; Wrong Tests: 9; Reliability: 10.0; Attempt pass rate: 59.1%; Flaky tests: 0; Input Tokens: 110,185; Output Tokens: 14,717; Reasoning Tokens: 397,483; Response Time (avg): 16.70s; Response Time (total): 367.47s; Response Time (max): 309.35s
  Failure reasons: Wrong answer: 7; Did not follow instructions: 1; Invalid tool call: 1
  Category scores: Anti-AI Tricks: 8.3; Coding: 5.5; Combined: 3.0; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 4.0; Instructions following: 10.0; Puzzle Solving: 10.0; Tool Calling: 10.0; Trivia: 3.0
#111	Gemini 3.1 Flash Lite (low)	6.5	Google	$0.621	16.26s
  Total Tests: 22; Wrong Tests: 10; Reliability: 10.0; Attempt pass rate: 59.1%; Flaky tests: 2; Input Tokens: 94,224; Output Tokens: 7,759; Reasoning Tokens: 390,126; Response Time (avg): 16.26s; Response Time (total): 357.64s; Response Time (max): 318.02s
  Failure reasons: Wrong answer: 9; Invalid tool call: 1
  Category scores: Anti-AI Tricks: 7.3; Coding: 5.5; Combined: 3.2; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 4.0; Instructions following: 10.0; Puzzle Solving: 10.0; Tool Calling: 10.0; Trivia: 3.0
#112	Gemini 3.1 Flash Lite Preview (none)	6.4	Google	$0.052	1.58s
  Total Tests: 22; Wrong Tests: 10; Reliability: 10.0; Attempt pass rate: 57.6%; Flaky tests: 1; Input Tokens: 120,942; Output Tokens: 14,292; Reasoning Tokens: 0; Response Time (avg): 1.58s; Response Time (total): 34.72s; Response Time (max): 9.27s
  Failure reasons: Wrong answer: 7; Did not follow instructions: 2; No answer: 1
  Category scores: Anti-AI Tricks: 7.5; Coding: 5.5; Combined: 3.0; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 4.0; Instructions following: 10.0; Puzzle Solving: 10.0; Tool Calling: 10.0; Trivia: 3.0
#113	Qwen3.5 Plus 2026-02-15 (none)	6.4	Qwen	$0.073 ↓	9.85s
  Total Tests: 22; Wrong Tests: 12; Reliability: 10.0; Attempt pass rate: 48.5%; Flaky tests: 2; Input Tokens: 102,646; Output Tokens: 29,370; Reasoning Tokens: 0; Response Time (avg): 9.85s; Response Time (total): 157.63s; Response Time (max): 123.00s
  Failure reasons: Wrong answer: 12
  Category scores: Anti-AI Tricks: 4.8; Coding: 4.3; Combined: 6.5; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 4.4; Instructions following: 10.0; Puzzle Solving: 7.7; Tool Calling: 10.0; Trivia: 3.0
#115	Ring-2.6-1T (medium)	6.3	Inclusionai	$0.103 ↑	68.74s
  Total Tests: 22; Wrong Tests: 11; Reliability: 10.0; Attempt pass rate: 60.6%; Flaky tests: 4; Input Tokens: 113,604; Output Tokens: 123,079; Reasoning Tokens: 42,754; Response Time (avg): 68.74s; Response Time (total): 1374.86s; Response Time (max): 304.19s
  Failure reasons: Wrong answer: 6; API error: 2; Did not follow instructions: 2; Invalid tool call: 1
  Category scores: Anti-AI Tricks: 10.0; Coding: 5.3; Combined: 7.3; Data parsing and extraction: 6.5; Domain specific: 3.5; General Intelligence: 4.1; Instructions following: 9.8; Puzzle Solving: 5.9; Tool Calling: 10.0; Trivia: 3.0
#117	Gemma 4 31B (medium)	6.3	Google	$0.102 ↓	75.38s
  Total Tests: 22; Wrong Tests: 8; Reliability: 10.0; Attempt pass rate: 68.2%; Flaky tests: 2; Input Tokens: 94,992; Output Tokens: 34,468; Reasoning Tokens: 223,278; Response Time (avg): 75.38s; Response Time (total): 1507.52s; Response Time (max): 437.40s
  Failure reasons: API error: 2; Timed out: 2; Wrong answer: 2; Invalid tool call: 1; No answer: 1
  Category scores: Anti-AI Tricks: 10.0; Coding: 4.3; Combined: 2.9; Data parsing and extraction: 10.0; Domain specific: 7.7; General Intelligence: 10.0; Instructions following: 10.0; Puzzle Solving: 9.9; Tool Calling: 3.0; Trivia: 3.0
#118	LongCat 2.0 (none)	6.3	Meituan	$0.044	5.18s
  Total Tests: 22; Wrong Tests: 15; Reliability: 10.0; Attempt pass rate: 36.4%; Flaky tests: 2; Input Tokens: 108,743; Output Tokens: 9,372; Reasoning Tokens: 0; Response Time (avg): 5.18s; Response Time (total): 113.95s; Response Time (max): 48.38s
  Failure reasons: Wrong answer: 14; Extra formatting: 1
  Category scores: Anti-AI Tricks: 4.8; Coding: 5.5; Combined: 6.5; Data parsing and extraction: 10.0; Domain specific: 3.0; General Intelligence: 5.0; Instructions following: 6.5; Puzzle Solving: 4.0; Tool Calling: 10.0; Trivia: 3.0
#119	Claude Sonnet 5 (none)	6.3	Anthropic	$0.548	6.04s
  Total Tests: 22; Wrong Tests: 14; Reliability: 10.0; Attempt pass rate: 45.5%; Flaky tests: 4; Input Tokens: 161,035; Output Tokens: 22,511; Reasoning Tokens: 0; Response Time (avg): 6.04s; Response Time (total): 132.85s; Response Time (max): 33.39s
  Failure reasons: Wrong answer: 7; Extra formatting: 4; No answer: 2; Did not follow instructions: 1
  Category scores: Anti-AI Tricks: 5.3; Coding: 4.6; Combined: 6.5; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 4.7; Instructions following: 6.4; Puzzle Solving: 6.0; Tool Calling: 10.0; Trivia: 3.0
#120	MiMo-V2-Flash (medium)	6.3	Xiaomi	$0.043 ↑	20.11s
  Total Tests: 21; Wrong Tests: 9; Reliability: 10.0; Attempt pass rate: 62.1%; Flaky tests: 3; Input Tokens: 40,111; Output Tokens: 12,476; Reasoning Tokens: 125,039; Response Time (avg): 20.11s; Response Time (total): 301.59s; Response Time (max): 96.01s
  Failure reasons: Wrong answer: 5; API error: 1; Extra formatting: 1; Did not follow instructions: 1; Timed out: 1
  Category scores: Anti-AI Tricks: 8.1; Coding: 6.0; Combined: 4.9; Data parsing and extraction: 6.5; Domain specific: 5.9; General Intelligence: 4.0; Instructions following: 10.0; Puzzle Solving: 7.7; Tool Calling: 10.0; Trivia: 3.0
#121	Qwen3.5-Flash (medium)	6.2	Qwen	$0.139 ↓	84.82s
  Total Tests: 22; Wrong Tests: 10; Reliability: 10.0; Attempt pass rate: 69.7%; Flaky tests: 6; Input Tokens: 118,499; Output Tokens: 12,284; Reasoning Tokens: 490,610; Response Time (avg): 84.82s; Response Time (total): 1781.22s; Response Time (max): 515.38s
  Failure reasons: Wrong answer: 4; Timed out: 3; API error: 1; Did not follow instructions: 1; Invalid tool call: 1
  Category scores: Anti-AI Tricks: 10.0; Coding: 3.7; Combined: 6.4; Data parsing and extraction: 7.3; Domain specific: 5.3; General Intelligence: 6.1; Instructions following: 10.0; Puzzle Solving: 8.2; Tool Calling: 10.0; Trivia: 3.0
#122	Gemma 4 31B (none)	6.2	Google	$0.020 ↓	5.34s
  Total Tests: 22; Wrong Tests: 12; Reliability: 10.0; Attempt pass rate: 48.5%; Flaky tests: 1; Input Tokens: 125,728; Output Tokens: 13,317; Reasoning Tokens: 0; Response Time (avg): 5.34s; Response Time (total): 106.82s; Response Time (max): 29.95s
  Failure reasons: Wrong answer: 9; API error: 2; Did not follow instructions: 1
  Category scores: Anti-AI Tricks: 6.5; Coding: 5.5; Combined: 3.8; Data parsing and extraction: 10.0; Domain specific: 7.7; General Intelligence: 10.0; Instructions following: 6.5; Puzzle Solving: 6.5; Tool Calling: 3.0; Trivia: 3.0
#123	Seed-2.0-Lite (none)	6.2	Bytedance Seed	$0.066	4.40s
  Total Tests: 22; Wrong Tests: 14; Reliability: 10.0; Attempt pass rate: 43.9%; Flaky tests: 4; Input Tokens: 142,197; Output Tokens: 14,746; Reasoning Tokens: 0; Response Time (avg): 4.40s; Response Time (total): 96.84s; Response Time (max): 44.58s
  Failure reasons: Wrong answer: 13; No answer: 1
  Category scores: Anti-AI Tricks: 3.0; Coding: 5.6; Combined: 3.0; Data parsing and extraction: 10.0; Domain specific: 3.6; General Intelligence: 10.0; Instructions following: 10.0; Puzzle Solving: 5.3; Tool Calling: 10.0; Trivia: 3.0
#124	GPT-5.6 Luna (low)	6.2	OpenAI	$0.249	5.04s
  Total Tests: 22; Wrong Tests: 12; Reliability: 10.0; Attempt pass rate: 56.1%; Flaky tests: 5; Input Tokens: 96,346; Output Tokens: 8,211; Reasoning Tokens: 17,227; Response Time (avg): 5.04s; Response Time (total): 110.88s; Response Time (max): 19.44s
  Failure reasons: Wrong answer: 10; Did not follow instructions: 1; Invalid tool call: 1
  Category scores: Anti-AI Tricks: 8.3; Coding: 5.5; Combined: 2.8; Data parsing and extraction: 10.0; Domain specific: 3.6; General Intelligence: 5.0; Instructions following: 8.5; Puzzle Solving: 7.6; Tool Calling: 10.0; Trivia: 3.0
#125	Gemini 2.5 Flash (none)	6.2	Google	$0.017	6.20s
  Total Tests: 22; Wrong Tests: 13; Reliability: 10.0; Attempt pass rate: 43.9%; Flaky tests: 1; Input Tokens: 39,877; Output Tokens: 1,890; Reasoning Tokens: 0; Response Time (avg): 6.20s; Response Time (total): 136.37s; Response Time (max): 118.00s
  Failure reasons: Wrong answer: 12; Invalid tool call: 1
  Category scores: Anti-AI Tricks: 3.0; Coding: 5.5; Combined: 3.0; Data parsing and extraction: 10.0; Domain specific: 5.9; General Intelligence: 5.0; Instructions following: 10.0; Puzzle Solving: 7.7; Tool Calling: 10.0; Trivia: 3.0
#126	Qwen3.5-35B-A3B (medium)	6.2	Qwen	$0.837 ↓	112.47s
  Total Tests: 22; Wrong Tests: 11; Reliability: 10.0; Attempt pass rate: 66.7%; Flaky tests: 6; Input Tokens: 130,388; Output Tokens: 40,630; Reasoning Tokens: 786,040; Response Time (avg): 112.47s; Response Time (total): 2474.28s; Response Time (max): 950.25s
  Failure reasons: Timed out: 5; No answer: 2; Wrong answer: 2; API error: 1; Invalid tool call: 1
  Category scores: Anti-AI Tricks: 10.0; Coding: 5.9; Combined: 3.8; Data parsing and extraction: 7.3; Domain specific: 4.1; General Intelligence: 2.8; Instructions following: 10.0; Puzzle Solving: 8.2; Tool Calling: 10.0; Trivia: 3.0
#127	Gemini 3.1 Flash Lite (minimal)	6.1	Google	$0.047	1.86s
  Total Tests: 22; Wrong Tests: 12; Reliability: 10.0; Attempt pass rate: 51.5%; Flaky tests: 3; Input Tokens: 119,065; Output Tokens: 11,118; Reasoning Tokens: 0; Response Time (avg): 1.86s; Response Time (total): 40.88s; Response Time (max): 12.97s
  Failure reasons: Wrong answer: 8; Did not follow instructions: 3; No answer: 1
  Category scores: Anti-AI Tricks: 8.3; Coding: 5.5; Combined: 3.0; Data parsing and extraction: 10.0; Domain specific: 2.9; General Intelligence: 4.0; Instructions following: 10.0; Puzzle Solving: 6.0; Tool Calling: 10.0; Trivia: 3.0
#128	gpt-oss-120b (medium)	6.1	OpenAI	$0.019 ↓	21.91s
  Total Tests: 22; Wrong Tests: 13; Reliability: 10.0; Attempt pass rate: 50.0%; Flaky tests: 5; Input Tokens: 108,747; Output Tokens: 29,772; Reasoning Tokens: 68,044; Response Time (avg): 21.91s; Response Time (total): 328.70s; Response Time (max): 68.16s
  Failure reasons: Wrong answer: 9; Did not follow instructions: 3; Invalid tool call: 1
  Category scores: Anti-AI Tricks: 6.7; Coding: 5.9; Combined: 6.5; Data parsing and extraction: 6.4; Domain specific: 2.9; General Intelligence: 4.3; Instructions following: 9.9; Puzzle Solving: 5.3; Tool Calling: 9.8; Trivia: 3.0
#129	Gemini 3.1 Flash Lite (none)	6.1	Google	$0.046	1.75s
  Total Tests: 22; Wrong Tests: 13; Reliability: 10.0; Attempt pass rate: 50.0%; Flaky tests: 4; Input Tokens: 118,050; Output Tokens: 10,723; Reasoning Tokens: 0; Response Time (avg): 1.75s; Response Time (total): 38.60s; Response Time (max): 16.25s
  Failure reasons: Wrong answer: 11; Did not follow instructions: 1; No answer: 1
  Category scores: Anti-AI Tricks: 7.5; Coding: 5.5; Combined: 3.0; Data parsing and extraction: 10.0; Domain specific: 2.9; General Intelligence: 4.0; Instructions following: 10.0; Puzzle Solving: 6.3; Tool Calling: 10.0; Trivia: 3.0
#131	Qwen3.6 Flash (none)	6.1	Qwen	$0.062 ↓	3.74s
  Total Tests: 22; Wrong Tests: 15; Reliability: 10.0; Attempt pass rate: 34.9%; Flaky tests: 1; Input Tokens: 139,788; Output Tokens: 30,947; Reasoning Tokens: 0; Response Time (avg): 3.74s; Response Time (total): 82.38s; Response Time (max): 48.79s
  Failure reasons: Wrong answer: 12; Invalid tool call: 2; Did not follow instructions: 1
  Category scores: Anti-AI Tricks: 3.1; Coding: 5.4; Combined: 3.8; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 10.0; Instructions following: 6.3; Puzzle Solving: 3.5; Tool Calling: 10.0; Trivia: 3.0
#132	Qwen3.5-Flash (none)	6.1	Qwen	$0.073 ↓	25.28s
  Total Tests: 22; Wrong Tests: 14; Reliability: 10.0; Attempt pass rate: 39.4%; Flaky tests: 2; Input Tokens: 282,347; Output Tokens: 209,201; Reasoning Tokens: 0; Response Time (avg): 25.28s; Response Time (total): 556.24s; Response Time (max): 480.96s
  Failure reasons: Wrong answer: 13; Invalid tool call: 1
  Category scores: Anti-AI Tricks: 3.5; Coding: 5.5; Combined: 2.9; Data parsing and extraction: 10.0; Domain specific: 7.7; General Intelligence: 10.0; Instructions following: 6.3; Puzzle Solving: 3.1; Tool Calling: 10.0; Trivia: 3.0
#133	Qwen3.5 Plus 2026-04-20 (none)	6.1	Qwen	$0.122 ↓	13.56s
  Total Tests: 22; Wrong Tests: 14; Reliability: 10.0; Attempt pass rate: 43.9%; Flaky tests: 4; Input Tokens: 94,468; Output Tokens: 51,487; Reasoning Tokens: 0; Response Time (avg): 13.56s; Response Time (total): 298.31s; Response Time (max): 206.05s
  Failure reasons: Wrong answer: 12; Did not follow instructions: 2
  Category scores: Anti-AI Tricks: 4.8; Coding: 3.9; Combined: 6.4; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 4.8; Instructions following: 6.2; Puzzle Solving: 6.7; Tool Calling: 10.0; Trivia: 3.0
#134	Qwen3.5-35B-A3B (none)	6.1	Qwen	$0.106 ↓	12.72s
  Total Tests: 22; Wrong Tests: 15; Reliability: 10.0; Attempt pass rate: 43.9%; Flaky tests: 4; Input Tokens: 134,521; Output Tokens: 86,614; Reasoning Tokens: 0; Response Time (avg): 12.72s; Response Time (total): 279.90s; Response Time (max): 209.15s
  Failure reasons: Wrong answer: 12; Did not follow instructions: 2; Invalid tool call: 1
  Category scores: Anti-AI Tricks: 3.4; Coding: 5.5; Combined: 3.8; Data parsing and extraction: 10.0; Domain specific: 7.7; General Intelligence: 6.5; Instructions following: 6.3; Puzzle Solving: 3.7; Tool Calling: 10.0; Trivia: 3.0
#135	GPT-5 Nano (medium)	6.1	OpenAI	$0.114	54.87s
  Total Tests: 22; Wrong Tests: 13; Reliability: 10.0; Attempt pass rate: 56.1%; Flaky tests: 8; Input Tokens: 94,935; Output Tokens: 12,042; Reasoning Tokens: 261,056; Response Time (avg): 54.87s; Response Time (total): 822.99s; Response Time (max): 227.89s
  Failure reasons: Wrong answer: 9; Did not follow instructions: 2; No answer: 1; Timed out: 1
  Category scores: Anti-AI Tricks: 6.5; Coding: 7.0; Combined: 6.4; Data parsing and extraction: 3.7; Domain specific: 5.2; General Intelligence: 4.1; Instructions following: 9.8; Puzzle Solving: 5.3; Tool Calling: 10.0; Trivia: 3.0
#136	Nemotron 3 Ultra (none)	6.1	NVIDIA	$0.072 ↕	3.87s
  Total Tests: 22; Wrong Tests: 14; Reliability: 10.0; Attempt pass rate: 42.4%; Flaky tests: 2; Input Tokens: 101,275; Output Tokens: 9,474; Reasoning Tokens: 0; Response Time (avg): 3.87s; Response Time (total): 85.15s; Response Time (max): 37.50s
  Failure reasons: Wrong answer: 12; API error: 1; Did not follow instructions: 1
  Category scores: Anti-AI Tricks: 3.5; Coding: 5.5; Combined: 3.0; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 5.0; Instructions following: 10.0; Puzzle Solving: 5.9; Tool Calling: 10.0; Trivia: 3.0
#139	GPT-5.6 Terra (none)	6.0	OpenAI	$0.349	1.65s
  Total Tests: 22; Wrong Tests: 14; Reliability: 10.0; Attempt pass rate: 42.4%; Flaky tests: 3; Input Tokens: 102,259; Output Tokens: 6,203; Reasoning Tokens: 0; Response Time (avg): 1.65s; Response Time (total): 36.28s; Response Time (max): 10.07s
  Failure reasons: Wrong answer: 11; Did not follow instructions: 1; Invalid tool call: 1; No answer: 1
  Category scores: Anti-AI Tricks: 4.8; Coding: 5.5; Combined: 2.9; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 5.0; Instructions following: 8.5; Puzzle Solving: 5.3; Tool Calling: 9.6; Trivia: 3.0
#141	Mimo V2 Omni (medium)	5.9	Xiaomi	$0.683 ↓	41.16s
  Total Tests: 21; Wrong Tests: 11; Reliability: 10.0; Attempt pass rate: 53.0%; Flaky tests: 3; Input Tokens: 37,007; Output Tokens: 1,952; Reasoning Tokens: 357,306; Response Time (avg): 41.16s; Response Time (total): 823.26s; Response Time (max): 299.23s
  Failure reasons: Wrong answer: 5; Did not follow instructions: 2; No answer: 2; API error: 1; Extra formatting: 1
  Category scores: Anti-AI Tricks: 10.0; Coding: 3.3; Combined: 5.0; Data parsing and extraction: 10.0; Domain specific: 3.0; General Intelligence: 5.4; Instructions following: 8.3; Puzzle Solving: 5.9; Tool Calling: 10.0; Trivia: 3.0
#142	Hy3 preview (high)	5.9	Tencent	$0.048 ↕	56.57s
  Total Tests: 21; Wrong Tests: 10; Reliability: 10.0; Attempt pass rate: 53.0%; Flaky tests: 2; Input Tokens: 25,987; Output Tokens: 216,719; Reasoning Tokens: 0; Response Time (avg): 56.57s; Response Time (total): 848.59s; Response Time (max): 149.94s
  Failure reasons: API error: 7; Wrong answer: 3
  Category scores: Anti-AI Tricks: 6.4; Coding: 5.3; Combined: 5.0; Data parsing and extraction: 6.5; Domain specific: 5.3; General Intelligence: 3.0; Instructions following: 10.0; Puzzle Solving: 7.7; Tool Calling: 10.0; Trivia: 3.0
#143	GPT-5.4 Mini (none)	5.9	OpenAI	$0.095	1.53s
  Total Tests: 22; Wrong Tests: 16; Reliability: 10.0; Attempt pass rate: 33.3%; Flaky tests: 3; Input Tokens: 79,067; Output Tokens: 7,880; Reasoning Tokens: 0; Response Time (avg): 1.53s; Response Time (total): 33.74s; Response Time (max): 9.92s
  Failure reasons: Wrong answer: 13; Did not follow instructions: 3
  Category scores: Anti-AI Tricks: 3.1; Coding: 5.5; Combined: 6.5; Data parsing and extraction: 10.0; Domain specific: 3.5; General Intelligence: 4.8; Instructions following: 6.3; Puzzle Solving: 5.4; Tool Calling: 3.0; Trivia: 3.0
#145	Kimi K2.6 (none)	5.8	Moonshot AI	$0.184 ↓	19.58s
  Total Tests: 22; Wrong Tests: 15; Reliability: 10.0; Attempt pass rate: 34.9%; Flaky tests: 2; Input Tokens: 116,970; Output Tokens: 30,253; Reasoning Tokens: 0; Response Time (avg): 19.58s; Response Time (total): 430.85s; Response Time (max): 238.89s
  Failure reasons: Wrong answer: 11; Did not follow instructions: 3; No answer: 1
  Category scores: Anti-AI Tricks: 4.6; Coding: 5.5; Combined: 3.0; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 5.4; Instructions following: 6.5; Puzzle Solving: 3.1; Tool Calling: 10.0; Trivia: 3.0
