AI Benchy Leaderboard

Name: AI BENCHY Model Benchmark Results
Creator: AI BENCHY
License: https://aibenchy.com/methodology/

Last updated at: 2026-07-24
Models Evaluated: 222

Rank	Model	Score	Company	Total Cost	Response Time (avg)
#34	GPT-5.2 Chat (none)	8.0	OpenAI	$0.604	7.65s
Total Tests: 22; Wrong Tests: 8; Reliability: 10.0; Attempt pass rate: 74.2%; Flaky tests: 4; Input Tokens: 101,248; Output Tokens: 30,424; Reasoning Tokens: 0; Response Time (avg): 7.65s; Response Time (total): 168.39s; Response Time (max): 38.52s; Wrong answer: 6; Did not follow instructions: 1; No answer: 1; Anti-AI Tricks: 8.7; Coding: 8.8; Combined: 7.3; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 4.4; Instructions following: 9.8; Puzzle Solving: 7.7; Tool Calling: 10.0; Trivia: 3.0
#35	GLM 5.2 (high)	8.0	Z.ai	$0.800 ↓	62.65s
Total Tests: 22; Wrong Tests: 8; Reliability: 10.0; Attempt pass rate: 71.2%; Flaky tests: 3; Input Tokens: 83,813; Output Tokens: 69,688; Reasoning Tokens: 225,659; Response Time (avg): 62.65s; Response Time (total): 1378.34s; Response Time (max): 599.43s; Timed out: 3; Wrong answer: 3; Did not follow instructions: 1; No answer: 1; Anti-AI Tricks: 10.0; Coding: 6.4; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 3.7; General Intelligence: 10.0; Instructions following: 10.0; Puzzle Solving: 6.0; Tool Calling: 10.0; Trivia: 3.0
#38	GPT-5.6 Terra (high)	8.0	OpenAI	$1.055	11.32s
Total Tests: 22; Wrong Tests: 8; Reliability: 10.0; Attempt pass rate: 68.2%; Flaky tests: 2; Input Tokens: 81,047; Output Tokens: 5,055; Reasoning Tokens: 51,736; Response Time (avg): 11.32s; Response Time (total): 249.14s; Response Time (max): 91.49s; Wrong answer: 7; Invalid tool call: 1; Anti-AI Tricks: 8.3; Coding: 7.6; Combined: 8.7; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 5.1; Instructions following: 10.0; Puzzle Solving: 7.7; Tool Calling: 10.0; Trivia: 3.0
#39	Seed-2.0-Lite (medium)	7.9	Bytedance Seed	$0.234	48.53s
Total Tests: 22; Wrong Tests: 8; Reliability: 10.0; Attempt pass rate: 74.2%; Flaky tests: 4; Input Tokens: 129,897; Output Tokens: 12,533; Reasoning Tokens: 88,047; Response Time (avg): 48.53s; Response Time (total): 1067.74s; Response Time (max): 254.92s; Wrong answer: 5; Did not follow instructions: 2; No answer: 1; Anti-AI Tricks: 8.3; Coding: 8.0; Combined: 6.4; Data parsing and extraction: 10.0; Domain specific: 5.9; General Intelligence: 6.7; Instructions following: 10.0; Puzzle Solving: 9.0; Tool Calling: 10.0; Trivia: 3.0
#40	Qwen3.7 Plus (medium)	7.9	Qwen	$0.267 ↓	51.51s
Total Tests: 22; Wrong Tests: 7; Reliability: 10.0; Attempt pass rate: 75.8%; Flaky tests: 3; Input Tokens: 115,233; Output Tokens: 6,162; Reasoning Tokens: 173,267; Response Time (avg): 51.51s; Response Time (total): 1133.15s; Response Time (max): 315.30s; Wrong answer: 5; Invalid tool call: 1; Timed out: 1; Anti-AI Tricks: 10.0; Coding: 6.1; Combined: 8.2; Data parsing and extraction: 10.0; Domain specific: 3.6; General Intelligence: 10.0; Instructions following: 10.0; Puzzle Solving: 10.0; Tool Calling: 10.0; Trivia: 3.0
#41	Qwen3.6 Plus (medium)	7.8	Qwen	$0.405 ↑	43.12s
Total Tests: 22; Wrong Tests: 7; Reliability: 10.0; Attempt pass rate: 71.2%; Flaky tests: 2; Input Tokens: 97,689; Output Tokens: 6,412; Reasoning Tokens: 184,825; Response Time (avg): 43.12s; Response Time (total): 905.53s; Response Time (max): 291.55s; Wrong answer: 5; API error: 1; Did not follow instructions: 1; Anti-AI Tricks: 10.0; Coding: 6.1; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 2.9; General Intelligence: 5.1; Instructions following: 10.0; Puzzle Solving: 10.0; Tool Calling: 10.0; Trivia: 3.0
#43	GPT-5.6 Terra (medium)	7.8	OpenAI	$0.676	7.11s
Total Tests: 22; Wrong Tests: 8; Reliability: 10.0; Attempt pass rate: 69.7%; Flaky tests: 2; Input Tokens: 79,175; Output Tokens: 4,878; Reasoning Tokens: 26,952; Response Time (avg): 7.11s; Response Time (total): 156.42s; Response Time (max): 41.68s; Wrong answer: 8; Anti-AI Tricks: 8.3; Coding: 6.1; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 5.5; Instructions following: 10.0; Puzzle Solving: 8.4; Tool Calling: 10.0; Trivia: 3.0
#44	Claude Sonnet 4.6 (medium)	7.8	Anthropic	$2.057	25.91s
Total Tests: 22; Wrong Tests: 8; Reliability: 10.0; Attempt pass rate: 66.7%; Flaky tests: 2; Input Tokens: 106,292; Output Tokens: 80,748; Reasoning Tokens: 35,117; Response Time (avg): 25.91s; Response Time (total): 362.78s; Response Time (max): 140.96s; Wrong answer: 4; Extra formatting: 3; Timed out: 1; Anti-AI Tricks: 6.5; Coding: 5.7; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 2.9; General Intelligence: 10.0; Instructions following: 10.0; Puzzle Solving: 10.0; Tool Calling: 10.0; Trivia: 3.0
#45	Claude Opus 4.8 (low)	7.8	Anthropic	$2.077	12.74s
Total Tests: 22; Wrong Tests: 6; Reliability: 10.0; Attempt pass rate: 80.3%; Flaky tests: 3; Input Tokens: 156,525; Output Tokens: 43,141; Reasoning Tokens: 8,617; Response Time (avg): 12.74s; Response Time (total): 280.29s; Response Time (max): 127.97s; Wrong answer: 4; Extra formatting: 1; No answer: 1; Anti-AI Tricks: 10.0; Coding: 6.6; Combined: 9.9; Data parsing and extraction: 6.3; Domain specific: 5.3; General Intelligence: 10.0; Instructions following: 9.8; Puzzle Solving: 10.0; Tool Calling: 10.0; Trivia: 3.0
#46	GLM 5 (medium)	7.7	Z.ai	$0.307 ↑	33.54s
Total Tests: 21; Wrong Tests: 6; Reliability: 10.0; Attempt pass rate: 78.8%; Flaky tests: 4; Input Tokens: 35,224; Output Tokens: 21,570; Reasoning Tokens: 102,996; Response Time (avg): 33.54s; Response Time (total): 435.99s; Response Time (max): 99.85s; Wrong answer: 3; Did not follow instructions: 1; No answer: 1; Timed out: 1; Anti-AI Tricks: 10.0; Coding: 10.0; Combined: 5.0; Data parsing and extraction: 7.1; Domain specific: 3.5; General Intelligence: 6.1; Instructions following: 10.0; Puzzle Solving: 10.0; Tool Calling: 10.0; Trivia: 3.0
#47	Claude Opus 4.6 (medium)	7.7	Anthropic	$3.059	34.27s
Total Tests: 22; Wrong Tests: 9; Reliability: 10.0; Attempt pass rate: 63.6%; Flaky tests: 3; Input Tokens: 108,615; Output Tokens: 72,286; Reasoning Tokens: 28,315; Response Time (avg): 34.27s; Response Time (total): 513.99s; Response Time (max): 151.51s; Extra formatting: 5; Wrong answer: 3; Did not follow instructions: 1; Anti-AI Tricks: 6.4; Coding: 5.7; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 3.0; General Intelligence: 10.0; Instructions following: 10.0; Puzzle Solving: 7.7; Tool Calling: 10.0; Trivia: 3.0
#48	GPT-5.6 Luna (high)	7.7	OpenAI	$1.017	18.68s
Total Tests: 22; Wrong Tests: 7; Reliability: 10.0; Attempt pass rate: 72.7%; Flaky tests: 3; Input Tokens: 80,918; Output Tokens: 5,088; Reasoning Tokens: 150,910; Response Time (avg): 18.68s; Response Time (total): 411.05s; Response Time (max): 111.09s; Wrong answer: 7; Anti-AI Tricks: 8.3; Coding: 5.5; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 7.7; General Intelligence: 5.0; Instructions following: 9.9; Puzzle Solving: 7.6; Tool Calling: 10.0; Trivia: 3.0
#49	DeepSeek V4 Flash (high)	7.7	DeepSeek	$0.042 ↓	49.75s
Total Tests: 22; Wrong Tests: 9; Reliability: 10.0; Attempt pass rate: 72.7%; Flaky tests: 5; Input Tokens: 108,392; Output Tokens: 14,478; Reasoning Tokens: 153,687; Response Time (avg): 49.75s; Response Time (total): 1094.41s; Response Time (max): 218.13s; Wrong answer: 6; Did not follow instructions: 2; Invalid tool call: 1; Anti-AI Tricks: 8.3; Coding: 7.8; Combined: 6.4; Data parsing and extraction: 10.0; Domain specific: 4.1; General Intelligence: 6.1; Instructions following: 10.0; Puzzle Solving: 8.2; Tool Calling: 10.0; Trivia: 3.0
#50	DeepSeek V4 Pro (high)	7.7	DeepSeek	$0.200	79.14s
Total Tests: 22; Wrong Tests: 12; Reliability: 10.0; Attempt pass rate: 63.6%; Flaky tests: 6; Input Tokens: 90,748; Output Tokens: 10,462; Reasoning Tokens: 178,719; Response Time (avg): 79.14s; Response Time (total): 1740.97s; Response Time (max): 416.76s; Wrong answer: 6; Did not follow instructions: 2; API error: 1; Extra formatting: 1; No answer: 1; Timed out: 1; Anti-AI Tricks: 5.7; Coding: 6.3; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 3.6; General Intelligence: 10.0; Instructions following: 7.8; Puzzle Solving: 6.9; Tool Calling: 9.8; Trivia: 3.0
#52	Grok Build 0.1 (medium)	7.6	X AI	$1.097	52.06s
Total Tests: 22; Wrong Tests: 8; Reliability: 10.0; Attempt pass rate: 63.6%; Flaky tests: 0; Input Tokens: 106,751; Output Tokens: 7,993; Reasoning Tokens: 486,670; Response Time (avg): 52.06s; Response Time (total): 1145.27s; Response Time (max): 252.69s; Wrong answer: 5; Extra formatting: 3; Anti-AI Tricks: 8.3; Coding: 5.7; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 4.4; Instructions following: 9.8; Puzzle Solving: 7.7; Tool Calling: 10.0; Trivia: 3.0
#53	GLM 5 Turbo (medium)	7.6	Z.ai	$0.323 ↑	23.00s
Total Tests: 21; Wrong Tests: 7; Reliability: 10.0; Attempt pass rate: 71.2%; Flaky tests: 4; Input Tokens: 35,593; Output Tokens: 12,245; Reasoning Tokens: 62,277; Response Time (avg): 23.00s; Response Time (total): 482.97s; Response Time (max): 194.23s; Wrong answer: 4; Did not follow instructions: 1; No answer: 1; Timed out: 1; Anti-AI Tricks: 10.0; Coding: 8.2; Combined: 5.0; Data parsing and extraction: 10.0; Domain specific: 2.9; General Intelligence: 6.1; Instructions following: 10.0; Puzzle Solving: 8.7; Tool Calling: 10.0; Trivia: 3.0
#54	GPT-5.6 Luna (medium)	7.6	OpenAI	$0.352	7.28s
Total Tests: 22; Wrong Tests: 8; Reliability: 10.0; Attempt pass rate: 65.2%; Flaky tests: 1; Input Tokens: 89,676; Output Tokens: 5,699; Reasoning Tokens: 37,980; Response Time (avg): 7.28s; Response Time (total): 160.27s; Response Time (max): 29.85s; Wrong answer: 8; Anti-AI Tricks: 8.3; Coding: 5.4; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 5.1; Instructions following: 9.9; Puzzle Solving: 7.8; Tool Calling: 10.0; Trivia: 3.0
#56	Kimi K2.7 Code (medium)	7.5	Moonshot AI	$0.740 ↓	84.25s
Total Tests: 22; Wrong Tests: 10; Reliability: 10.0; Attempt pass rate: 65.2%; Flaky tests: 4; Input Tokens: 72,073; Output Tokens: 83,714; Reasoning Tokens: 178,793; Response Time (avg): 84.25s; Response Time (total): 1769.22s; Response Time (max): 365.80s; Wrong answer: 5; Timed out: 3; API error: 1; Did not follow instructions: 1; Anti-AI Tricks: 7.3; Coding: 7.8; Combined: 7.3; Data parsing and extraction: 10.0; Domain specific: 5.5; General Intelligence: 10.0; Instructions following: 9.9; Puzzle Solving: 5.9; Tool Calling: 3.0; Trivia: 3.0
#57	GPT-5.4 Nano (medium)	7.5	OpenAI	$0.138	13.24s
Total Tests: 22; Wrong Tests: 10; Reliability: 10.0; Attempt pass rate: 65.2%; Flaky tests: 4; Input Tokens: 82,819; Output Tokens: 7,100; Reasoning Tokens: 90,022; Response Time (avg): 13.24s; Response Time (total): 291.33s; Response Time (max): 94.06s; Wrong answer: 8; Did not follow instructions: 2; Anti-AI Tricks: 8.3; Coding: 6.1; Combined: 9.9; Data parsing and extraction: 10.0; Domain specific: 5.9; General Intelligence: 4.5; Instructions following: 9.8; Puzzle Solving: 4.1; Tool Calling: 10.0; Trivia: 3.0
#58	GPT-5.3 Chat (none)	7.5	OpenAI	$0.571	6.88s
Total Tests: 22; Wrong Tests: 9; Reliability: 10.0; Attempt pass rate: 68.2%; Flaky tests: 5; Input Tokens: 78,990; Output Tokens: 30,854; Reasoning Tokens: 0; Response Time (avg): 6.88s; Response Time (total): 151.31s; Response Time (max): 18.33s; Wrong answer: 7; Did not follow instructions: 2; Anti-AI Tricks: 6.7; Coding: 5.6; Combined: 10.0; Data parsing and extraction: 10.0; Domain specific: 3.5; General Intelligence: 4.6; Instructions following: 9.8; Puzzle Solving: 10.0; Tool Calling: 10.0; Trivia: 3.0
#59	GPT-5.6 Terra (low)	7.5	OpenAI	$0.519	5.31s
Total Tests: 22; Wrong Tests: 9; Reliability: 10.0; Attempt pass rate: 74.2%; Flaky tests: 6; Input Tokens: 80,295; Output Tokens: 4,714; Reasoning Tokens: 16,469; Response Time (avg): 5.31s; Response Time (total): 116.82s; Response Time (max): 19.85s; Wrong answer: 8; Invalid tool call: 1; Anti-AI Tricks: 8.3; Coding: 6.6; Combined: 8.7; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 4.8; Instructions following: 10.0; Puzzle Solving: 10.0; Tool Calling: 4.7; Trivia: 3.0
#60	GPT-5.4 Mini (medium)	7.5	OpenAI	$0.756	25.94s
Total Tests: 22; Wrong Tests: 10; Reliability: 10.0; Attempt pass rate: 71.2%; Flaky tests: 6; Input Tokens: 97,155; Output Tokens: 6,211; Reasoning Tokens: 145,544; Response Time (avg): 25.94s; Response Time (total): 570.66s; Response Time (max): 138.75s; Wrong answer: 6; Did not follow instructions: 3; Invalid tool call: 1; Anti-AI Tricks: 8.6; Coding: 8.4; Combined: 6.9; Data parsing and extraction: 10.0; Domain specific: 4.1; General Intelligence: 4.5; Instructions following: 9.8; Puzzle Solving: 7.8; Tool Calling: 4.7; Trivia: 3.0
#61	Qwen3.5 Plus 2026-02-15 (medium)	7.5	Qwen	$0.437 ↓	89.19s
Total Tests: 22; Wrong Tests: 8; Reliability: 10.0; Attempt pass rate: 71.2%; Flaky tests: 4; Input Tokens: 113,560; Output Tokens: 9,823; Reasoning Tokens: 250,881; Response Time (avg): 89.19s; Response Time (total): 1337.92s; Response Time (max): 304.85s; Wrong answer: 4; Timed out: 2; API error: 1; Invalid tool call: 1; Anti-AI Tricks: 8.2; Coding: 6.6; Combined: 6.9; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 4.7; Instructions following: 10.0; Puzzle Solving: 10.0; Tool Calling: 10.0; Trivia: 3.0
#62	Qwen3.5-27B (medium)	7.4	Qwen	$0.981 ↓	111.94s
Total Tests: 22; Wrong Tests: 9; Reliability: 10.0; Attempt pass rate: 72.7%; Flaky tests: 5; Input Tokens: 111,635; Output Tokens: 15,999; Reasoning Tokens: 598,430; Response Time (avg): 111.94s; Response Time (total): 2462.67s; Response Time (max): 1026.43s; Wrong answer: 4; Did not follow instructions: 2; Extra formatting: 1; Invalid tool call: 1; Timed out: 1; Anti-AI Tricks: 8.7; Coding: 6.2; Combined: 7.3; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 6.1; Instructions following: 10.0; Puzzle Solving: 8.2; Tool Calling: 10.0; Trivia: 3.0
#65	Gemini 3 Flash Preview (low)	7.4	Google	$0.177	6.28s
Total Tests: 22; Wrong Tests: 6; Reliability: 10.0; Attempt pass rate: 75.8%; Flaky tests: 2; Input Tokens: 123,684; Output Tokens: 9,572; Reasoning Tokens: 28,518; Response Time (avg): 6.28s; Response Time (total): 138.06s; Response Time (max): 17.13s; Wrong answer: 6; Anti-AI Tricks: 10.0; Coding: 5.8; Combined: 3.0; Data parsing and extraction: 10.0; Domain specific: 5.3; General Intelligence: 10.0; Instructions following: 9.9; Puzzle Solving: 10.0; Tool Calling: 10.0; Trivia: 10.0
#66	KAT-Coder-Pro V2.5 (low)	7.4	Kwaipilot	$0.387	19.47s
Total Tests: 22; Wrong Tests: 11; Reliability: 10.0; Attempt pass rate: 69.7%; Flaky tests: 8; Input Tokens: 87,673; Output Tokens: 7,166; Reasoning Tokens: 101,474; Response Time (avg): 19.47s; Response Time (total): 428.31s; Response Time (max): 209.15s; Wrong answer: 10; API error: 1; Anti-AI Tricks: 6.9; Coding: 7.8; Combined: 6.4; Data parsing and extraction: 10.0; Domain specific: 4.1; General Intelligence: 4.1; Instructions following: 10.0; Puzzle Solving: 6.4; Tool Calling: 10.0; Trivia: 3.0
#67	Claude Sonnet 4.6 (none)	7.3	Anthropic	$0.661	8.12s
Total Tests: 22; Wrong Tests: 10; Reliability: 10.0; Attempt pass rate: 57.6%; Flaky tests: 1; Input Tokens: 123,264; Output Tokens: 19,362; Reasoning Tokens: 0; Response Time (avg): 8.12s; Response Time (total): 121.78s; Response Time (max): 51.18s; Wrong answer: 5; Extra formatting: 4; Did not follow instructions: 1; Anti-AI Tricks: 4.8; Coding: 5.5; Combined: 9.8; Data parsing and extraction: 10.0; Domain specific: 7.7; General Intelligence: 6.1; Instructions following: 6.5; Puzzle Solving: 7.7; Tool Calling: 10.0; Trivia: 3.0
#68	Gemini 3.1 Flash Lite Preview (medium)	7.3	Google	$0.115	4.61s
Total Tests: 22; Wrong Tests: 9; Reliability: 10.0; Attempt pass rate: 59.1%; Flaky tests: 0; Input Tokens: 117,480; Output Tokens: 10,589; Reasoning Tokens: 46,394; Response Time (avg): 4.61s; Response Time (total): 101.39s; Response Time (max): 18.34s; Wrong answer: 7; Did not follow instructions: 1; Invalid tool call: 1; Anti-AI Tricks: 9.1; Coding: 5.5; Combined: 7.2; Data parsing and extraction: 10.0; Domain specific: 3.0; General Intelligence: 10.0; Instructions following: 10.0; Puzzle Solving: 7.7; Tool Calling: 10.0; Trivia: 3.0
#69	Gemini 3.1 Flash Lite (medium)	7.3	Google	$0.117	4.27s
Total Tests: 22; Wrong Tests: 9; Reliability: 10.0; Attempt pass rate: 62.1%; Flaky tests: 2; Input Tokens: 104,918; Output Tokens: 9,168; Reasoning Tokens: 51,130; Response Time (avg): 4.27s; Response Time (total): 94.02s; Response Time (max): 26.22s; Wrong answer: 7; Did not follow instructions: 1; Invalid tool call: 1; Anti-AI Tricks: 9.1; Coding: 5.5; Combined: 7.2; Data parsing and extraction: 10.0; Domain specific: 2.9; General Intelligence: 10.0; Instructions following: 9.9; Puzzle Solving: 7.6; Tool Calling: 10.0; Trivia: 3.0
#70	Claude Opus 4.8 (none)	7.3	Anthropic	$1.166	4.91s
Total Tests: 22; Wrong Tests: 9; Reliability: 10.0; Attempt pass rate: 63.6%; Flaky tests: 2; Input Tokens: 149,206; Output Tokens: 16,797; Reasoning Tokens: 0; Response Time (avg): 4.91s; Response Time (total): 108.03s; Response Time (max): 35.03s; Wrong answer: 4; Extra formatting: 3; Did not follow instructions: 1; No answer: 1; Anti-AI Tricks: 6.5; Coding: 5.5; Combined: 9.8; Data parsing and extraction: 7.3; Domain specific: 5.3; General Intelligence: 10.0; Instructions following: 9.9; Puzzle Solving: 7.7; Tool Calling: 10.0; Trivia: 3.0
