AI Benchy Leaderboard

Name: AI BENCHY Model Benchmark Results
Creator: AI BENCHY
License: https://aibenchy.com/methodology/

Last updated at: 2026-07-20
Models Evaluated: 210


Rank	Model	Score	Company	Total Cost	Response Time (avg)
#151	GLM 5.1 (none)	5.5	Z.ai	$0.164 ↓	6.70s
View model card Total Tests 22 Wrong Tests 15 Reliability 10.0 Attempt pass rate 40.9% Flaky tests 5 Input Tokens 124,209 Output Tokens 14,393 Reasoning Tokens 0 Response Time (avg) 6.70s Response Time (total) 147.38s Response Time (max) 61.20s Wrong answer: 13 Invalid tool call: 1 No answer: 1 Anti-AI Tricks : 4.0 Coding : 3.9 Combined : 2.8 Data parsing and extraction : 10.0 Domain specific : 2.9 General Intelligence : 5.0 Instructions following : 9.8 Puzzle Solving : 7.7 Tool Calling : 10.0 Trivia : 3.0
#152	Qwen3.6 27B (none)	5.5	Qwen	$0.087 ↑	10.65s
View model card Total Tests 22 Wrong Tests 15 Reliability 10.0 Attempt pass rate 45.5% Flaky tests 6 Input Tokens 95,796 Output Tokens 16,155 Reasoning Tokens 0 Response Time (avg) 10.65s Response Time (total) 234.39s Response Time (max) 156.31s Wrong answer: 11 Did not follow instructions: 2 Invalid tool call: 2 Anti-AI Tricks : 3.8 Coding : 5.5 Combined : 3.2 Data parsing and extraction : 7.3 Domain specific : 7.7 General Intelligence : 5.2 Instructions following : 6.2 Puzzle Solving : 5.3 Tool Calling : 9.5 Trivia : 3.0
#153	Hy3 preview (low)	5.5	Tencent	$0.015 ↕	24.56s
View model card Total Tests 21 Wrong Tests 11 Reliability 10.0 Attempt pass rate 50.0% Flaky tests 2 Input Tokens 21,045 Output Tokens 63,460 Reasoning Tokens 0 Response Time (avg) 24.56s Response Time (total) 368.35s Response Time (max) 78.74s API error: 7 Wrong answer: 4 Anti-AI Tricks : 8.3 Coding : 5.3 Combined : 5.0 Data parsing and extraction : 6.5 Domain specific : 5.9 General Intelligence : 3.0 Instructions following : 10.0 Puzzle Solving : 5.3 Tool Calling : 2.8 Trivia : 3.0
#154	MiMo-V2.5-Pro (none)	5.5	Xiaomi	$0.068 ↓	4.12s
View model card Total Tests 22 Wrong Tests 16 Reliability 10.0 Attempt pass rate 37.9% Flaky tests 4 Input Tokens 124,799 Output Tokens 15,362 Reasoning Tokens 0 Response Time (avg) 4.12s Response Time (total) 90.55s Response Time (max) 53.13s Wrong answer: 11 Did not follow instructions: 4 No answer: 1 Anti-AI Tricks : 3.3 Coding : 4.3 Combined : 3.0 Data parsing and extraction : 10.0 Domain specific : 5.3 General Intelligence : 4.0 Instructions following : 6.4 Puzzle Solving : 6.7 Tool Calling : 10.0 Trivia : 3.0
#155	Kimi K2.5 (none)	5.5	Moonshot AI	$0.127 ↑	19.15s
View model card Total Tests 22 Wrong Tests 16 Reliability 10.0 Attempt pass rate 34.9% Flaky tests 4 Input Tokens 89,322 Output Tokens 26,638 Reasoning Tokens 0 Response Time (avg) 19.15s Response Time (total) 287.30s Response Time (max) 102.83s Wrong answer: 15 No answer: 1 Anti-AI Tricks : 3.6 Coding : 5.5 Combined : 2.8 Data parsing and extraction : 7.3 Domain specific : 5.3 General Intelligence : 10.0 Instructions following : 6.5 Puzzle Solving : 3.0 Tool Calling : 10.0 Trivia : 3.0
#156	Gemma 4 26B A4B (none)	5.5	Google	$0.015 ↓	7.64s
View model card Total Tests 22 Wrong Tests 14 Reliability 10.0 Attempt pass rate 42.4% Flaky tests 2 Input Tokens 131,282 Output Tokens 15,781 Reasoning Tokens 0 Response Time (avg) 7.64s Response Time (total) 167.98s Response Time (max) 57.10s Wrong answer: 10 Did not follow instructions: 2 Invalid tool call: 1 Timed out: 1 Anti-AI Tricks : 8.3 Coding : 3.7 Combined : 3.0 Data parsing and extraction : 10.0 Domain specific : 3.6 General Intelligence : 4.0 Instructions following : 6.3 Puzzle Solving : 6.2 Tool Calling : 10.0 Trivia : 3.0
#157	Mimo V2 Omni (none)	5.5	Xiaomi	$0.021 ↓	2.44s
View model card Total Tests 21 Wrong Tests 13 Reliability 10.0 Attempt pass rate 37.9% Flaky tests 1 Input Tokens 40,852 Output Tokens 3,314 Reasoning Tokens 0 Response Time (avg) 2.44s Response Time (total) 48.81s Response Time (max) 6.81s Wrong answer: 10 API error: 1 Extra formatting: 1 Did not follow instructions: 1 Anti-AI Tricks : 3.6 Coding : 4.4 Combined : 1.5 Data parsing and extraction : 10.0 Domain specific : 5.3 General Intelligence : 4.1 Instructions following : 6.5 Puzzle Solving : 10.0 Tool Calling : 10.0 Trivia : 3.0
#158	KAT-Coder-Air V2.5 (low)	5.4	Kwaipilot	$0.041	10.09s
View model card Total Tests 22 Wrong Tests 15 Reliability 9.9 Attempt pass rate 45.5% Flaky tests 7 Input Tokens 61,085 Output Tokens 5,905 Reasoning Tokens 46,990 Response Time (avg) 10.09s Response Time (total) 222.03s Response Time (max) 86.23s Wrong answer: 7 Extra formatting: 4 API error: 2 Did not follow instructions: 2 Anti-AI Tricks : 7.3 Coding : 3.5 Combined : 6.4 Data parsing and extraction : 6.5 Domain specific : 2.9 General Intelligence : 5.0 Instructions following : 9.8 Puzzle Solving : 3.1 Tool Calling : 10.0 Trivia : 3.0
#159	GPT-5.6 Luna (none)	5.4	OpenAI	$0.142	1.50s
View model card Total Tests 22 Wrong Tests 16 Reliability 10.0 Attempt pass rate 34.9% Flaky tests 3 Input Tokens 101,323 Output Tokens 6,709 Reasoning Tokens 0 Response Time (avg) 1.50s Response Time (total) 32.91s Response Time (max) 10.57s Wrong answer: 14 Extra formatting: 1 Invalid tool call: 1 Anti-AI Tricks : 4.8 Coding : 3.8 Combined : 3.2 Data parsing and extraction : 10.0 Domain specific : 2.9 General Intelligence : 5.0 Instructions following : 7.1 Puzzle Solving : 5.3 Tool Calling : 10.0 Trivia : 3.0
#160	Laguna XS 2.1 (none)	5.3	Poolside	$0.008	1.55s
View model card Total Tests 22 Wrong Tests 17 Reliability 10.0 Attempt pass rate 30.3% Flaky tests 3 Input Tokens 91,598 Output Tokens 13,377 Reasoning Tokens 0 Response Time (avg) 1.55s Response Time (total) 34.19s Response Time (max) 19.02s Wrong answer: 14 Did not follow instructions: 1 Invalid tool call: 1 Timed out: 1 Anti-AI Tricks : 5.3 Coding : 4.3 Combined : 3.0 Data parsing and extraction : 10.0 Domain specific : 5.3 General Intelligence : 5.0 Instructions following : 3.8 Puzzle Solving : 3.0 Tool Calling : 10.0 Trivia : 3.0
#161	Qwen3.6 35B A3B (none)	5.3	Qwen	$0.061 ↑	5.52s
View model card Total Tests 22 Wrong Tests 18 Reliability 10.0 Attempt pass rate 31.8% Flaky tests 6 Input Tokens 93,979 Output Tokens 46,957 Reasoning Tokens 0 Response Time (avg) 5.52s Response Time (total) 110.40s Response Time (max) 39.54s Wrong answer: 13 API error: 2 Did not follow instructions: 2 No answer: 1 Anti-AI Tricks : 3.6 Coding : 5.5 Combined : 3.8 Data parsing and extraction : 10.0 Domain specific : 3.5 General Intelligence : 4.4 Instructions following : 6.2 Puzzle Solving : 3.2 Tool Calling : 3.0 Trivia : 3.0
#162	Ling-2.6-1T (none)	5.3	Inclusionai	$0.016 ↑	8.58s
View model card Total Tests 22 Wrong Tests 18 Reliability 10.0 Attempt pass rate 18.2% Flaky tests 0 Input Tokens 106,414 Output Tokens 11,555 Reasoning Tokens 0 Response Time (avg) 8.58s Response Time (total) 163.06s Response Time (max) 25.72s Wrong answer: 12 API error: 3 Did not follow instructions: 2 Invalid tool call: 1 Anti-AI Tricks : 3.4 Coding : 3.8 Combined : 6.5 Data parsing and extraction : 10.0 Domain specific : 3.0 General Intelligence : 5.0 Instructions following : 6.4 Puzzle Solving : 3.1 Tool Calling : 3.0 Trivia : 3.0
#163	Gemini 3.1 Flash Lite Preview (high)	5.3	Google	$2.310	68.14s
View model card Total Tests 16 Wrong Tests 3 Reliability N/A Attempt pass rate 59.1% Flaky tests 0 Input Tokens 28,980 Output Tokens 1,283 Reasoning Tokens 1,533,310 Response Time (avg) 68.14s Response Time (total) 1090.28s Response Time (max) 280.52s Wrong answer: 2 Did not follow instructions: 1 Anti-AI Tricks : 7.5 Coding : 0.0 Combined : 5.0 Data parsing and extraction : 10.0 Domain specific : 5.3 General Intelligence : 10.0 Instructions following : 9.8 Puzzle Solving : 7.7 Tool Calling : 10.0 Trivia : 0.0
#164	Inkling (none)	5.2	Thinkingmachines	$0.147	3.50s
View model card Total Tests 22 Wrong Tests 16 Reliability 9.9 Attempt pass rate 28.8% Flaky tests 1 Input Tokens 104,111 Output Tokens 10,551 Reasoning Tokens 0 Response Time (avg) 3.50s Response Time (total) 77.09s Response Time (max) 48.02s Wrong answer: 13 Extra formatting: 1 Did not follow instructions: 1 Invalid tool call: 1 Anti-AI Tricks : 4.8 Coding : 4.5 Combined : 2.9 Data parsing and extraction : 10.0 Domain specific : 5.3 General Intelligence : 5.0 Instructions following : 6.3 Puzzle Solving : 5.6 Tool Calling : 3.0 Trivia : 3.0
#165	Mistral Small 4 (none)	5.1	Mistral	$0.022	1.20s
View model card Total Tests 22 Wrong Tests 17 Reliability 10.0 Attempt pass rate 25.8% Flaky tests 1 Input Tokens 104,708 Output Tokens 9,812 Reasoning Tokens 0 Response Time (avg) 1.20s Response Time (total) 26.38s Response Time (max) 13.16s Wrong answer: 16 Did not follow instructions: 1 Anti-AI Tricks : 3.4 Coding : 3.7 Combined : 3.0 Data parsing and extraction : 10.0 Domain specific : 5.3 General Intelligence : 4.0 Instructions following : 6.5 Puzzle Solving : 3.1 Tool Calling : 10.0 Trivia : 3.0
#166	Qwen3 Coder Next (none)	5.1	Qwen	$0.025 ↓	9.12s
View model card Total Tests 22 Wrong Tests 17 Reliability 10.0 Attempt pass rate 25.8% Flaky tests 1 Input Tokens 134,218 Output Tokens 11,808 Reasoning Tokens 0 Response Time (avg) 9.12s Response Time (total) 145.94s Response Time (max) 45.14s Wrong answer: 14 Extra formatting: 1 Did not follow instructions: 1 No answer: 1 Anti-AI Tricks : 3.6 Coding : 4.6 Combined : 3.0 Data parsing and extraction : 6.5 Domain specific : 5.3 General Intelligence : 10.0 Instructions following : 6.3 Puzzle Solving : 3.0 Tool Calling : 10.0 Trivia : 3.0
#167	Mistral Small 4 (medium)	5.1	Mistral	$0.096	10.77s
View model card Total Tests 22 Wrong Tests 17 Reliability 10.0 Attempt pass rate 42.4% Flaky tests 8 Input Tokens 140,494 Output Tokens 39,462 Reasoning Tokens 92,362 Response Time (avg) 10.77s Response Time (total) 236.94s Response Time (max) 59.15s Wrong answer: 12 API error: 2 Did not follow instructions: 2 No answer: 1 Anti-AI Tricks : 5.6 Coding : 4.4 Combined : 3.0 Data parsing and extraction : 7.3 Domain specific : 5.3 General Intelligence : 4.8 Instructions following : 7.3 Puzzle Solving : 3.4 Tool Calling : 10.0 Trivia : 3.0
#168	MiMo-V2.5 (none)	5.1	Xiaomi	$0.025 ↓	4.62s
View model card Total Tests 22 Wrong Tests 17 Reliability 10.0 Attempt pass rate 25.8% Flaky tests 1 Input Tokens 141,043 Output Tokens 16,464 Reasoning Tokens 0 Response Time (avg) 4.62s Response Time (total) 101.57s Response Time (max) 55.36s Wrong answer: 14 Extra formatting: 1 Did not follow instructions: 1 No answer: 1 Anti-AI Tricks : 3.5 Coding : 5.5 Combined : 3.0 Data parsing and extraction : 6.5 Domain specific : 3.0 General Intelligence : 4.4 Instructions following : 6.5 Puzzle Solving : 5.4 Tool Calling : 10.0 Trivia : 3.0
#169	Qwen3.5-9B (none)	5.1	Qwen	$0.021 ↑	19.17s
View model card Total Tests 22 Wrong Tests 18 Reliability 10.0 Attempt pass rate 19.7% Flaky tests 1 Input Tokens 144,407 Output Tokens 37,484 Reasoning Tokens 0 Response Time (avg) 19.17s Response Time (total) 421.74s Response Time (max) 382.06s Wrong answer: 14 Did not follow instructions: 2 Invalid tool call: 2 Anti-AI Tricks : 3.1 Coding : 3.9 Combined : 3.0 Data parsing and extraction : 10.0 Domain specific : 3.0 General Intelligence : 4.4 Instructions following : 6.5 Puzzle Solving : 3.2 Tool Calling : 10.0 Trivia : 3.0
#170	GLM 5 Turbo (none)	5.1	Z.ai	$0.047 ↑	2.82s
View model card Total Tests 21 Wrong Tests 15 Reliability 10.0 Attempt pass rate 30.3% Flaky tests 2 Input Tokens 32,525 Output Tokens 1,815 Reasoning Tokens 0 Response Time (avg) 2.82s Response Time (total) 59.29s Response Time (max) 8.21s Wrong answer: 13 Did not follow instructions: 2 Anti-AI Tricks : 3.0 Coding : 3.9 Combined : 1.5 Data parsing and extraction : 10.0 Domain specific : 5.3 General Intelligence : 4.2 Instructions following : 6.5 Puzzle Solving : 5.5 Tool Calling : 10.0 Trivia : 3.0
#171	North Mini Code (none)	5.1	Cohere	$0.000	29.95s
View model card Total Tests 22 Wrong Tests 18 Reliability 8.7 Attempt pass rate 18.2% Flaky tests 0 Input Tokens 130,492 Output Tokens 26,786 Reasoning Tokens 0 Response Time (avg) 29.95s Response Time (total) 658.82s Response Time (max) 159.85s Wrong answer: 12 Extra formatting: 2 Did not follow instructions: 2 Invalid tool call: 2 Anti-AI Tricks : 3.0 Coding : 3.9 Combined : 3.2 Data parsing and extraction : 10.0 Domain specific : 3.0 General Intelligence : 3.9 Instructions following : 6.5 Puzzle Solving : 3.5 Tool Calling : 9.5 Trivia : 3.0
#172	MiniMax M2.7 (medium)	5.0	Minimax	$0.163 ↓	41.28s
View model card Total Tests 22 Wrong Tests 17 Reliability 10.0 Attempt pass rate 45.5% Flaky tests 9 Input Tokens 114,518 Output Tokens 18,558 Reasoning Tokens 119,036 Response Time (avg) 41.28s Response Time (total) 866.81s Response Time (max) 196.21s Wrong answer: 6 Did not follow instructions: 5 No answer: 2 Timed out: 2 API error: 1 Invalid tool call: 1 Anti-AI Tricks : 7.9 Coding : 5.7 Combined : 3.8 Data parsing and extraction : 6.3 Domain specific : 3.0 General Intelligence : 3.9 Instructions following : 3.8 Puzzle Solving : 5.9 Tool Calling : 4.7 Trivia : 3.0
#173	DeepSeek V3.2 (none)	5.0	DeepSeek	$0.054 ↑	18.25s
View model card Total Tests 22 Wrong Tests 16 Reliability 10.0 Attempt pass rate 37.9% Flaky tests 6 Input Tokens 135,780 Output Tokens 42,097 Reasoning Tokens 0 Response Time (avg) 18.25s Response Time (total) 401.60s Response Time (max) 115.89s Wrong answer: 7 API error: 4 Extra formatting: 2 Invalid tool call: 2 Did not follow instructions: 1 Anti-AI Tricks : 3.2 Coding : 3.1 Combined : 4.8 Data parsing and extraction : 6.3 Domain specific : 2.9 General Intelligence : 4.7 Instructions following : 10.0 Puzzle Solving : 7.6 Tool Calling : 10.0 Trivia : 3.0
#174	GPT-4o-mini (none)	5.0	OpenAI	$0.010	1.99s
View model card Total Tests 22 Wrong Tests 17 Reliability 10.0 Attempt pass rate 22.7% Flaky tests 0 Input Tokens 53,136 Output Tokens 2,911 Reasoning Tokens 0 Response Time (avg) 1.99s Response Time (total) 29.86s Response Time (max) 7.58s Wrong answer: 15 Did not follow instructions: 1 No answer: 1 Anti-AI Tricks : 4.8 Coding : 3.2 Combined : 3.0 Data parsing and extraction : 10.0 Domain specific : 3.0 General Intelligence : 4.0 Instructions following : 6.3 Puzzle Solving : 3.5 Tool Calling : 10.0 Trivia : 3.0
#175	Qwen3.6 Plus Preview (medium)	4.9	Qwen	$0.000	15.25s
View model card Total Tests 19 Wrong Tests 10 Reliability N/A Attempt pass rate 40.9% Flaky tests 0 Input Tokens 32,639 Output Tokens 1,153 Reasoning Tokens 62,197 Response Time (avg) 15.25s Response Time (total) 182.96s Response Time (max) 43.55s API error: 8 Wrong answer: 2 Anti-AI Tricks : 8.3 Coding : 9.8 Combined : 5.0 Data parsing and extraction : 10.0 Domain specific : 3.0 General Intelligence : 3.0 Instructions following : 6.5 Puzzle Solving : 5.3 Tool Calling : 10.0 Trivia : 3.0
#176	GLM 4.7 Flash (none)	4.9	Z.ai	$0.016	9.15s
View model card Total Tests 22 Wrong Tests 16 Reliability 10.0 Attempt pass rate 34.9% Flaky tests 3 Input Tokens 101,504 Output Tokens 22,992 Reasoning Tokens 0 Response Time (avg) 9.15s Response Time (total) 137.18s Response Time (max) 97.15s Wrong answer: 13 Invalid tool call: 2 Did not follow instructions: 1 Anti-AI Tricks : 5.2 Coding : 4.3 Combined : 3.0 Data parsing and extraction : 7.3 Domain specific : 7.7 General Intelligence : 4.0 Instructions following : 6.5 Puzzle Solving : 6.4 Tool Calling : 2.8 Trivia : 3.0
#177	Nemotron 3 Super (none)	4.9	NVIDIA	$0.008 ↑	5.97s
View model card Total Tests 22 Wrong Tests 17 Reliability 8.5 Attempt pass rate 30.3% Flaky tests 3 Input Tokens 63,519 Output Tokens 6,434 Reasoning Tokens 0 Response Time (avg) 5.97s Response Time (total) 131.31s Response Time (max) 20.00s Wrong answer: 15 Did not follow instructions: 2 Anti-AI Tricks : 4.8 Coding : 3.3 Combined : 3.0 Data parsing and extraction : 10.0 Domain specific : 3.6 General Intelligence : 4.6 Instructions following : 6.3 Puzzle Solving : 5.5 Tool Calling : 4.7 Trivia : 3.0
#178	Ling-2.6-flash (none)	4.9	Inclusionai	$0.002 ↑	10.68s
View model card Total Tests 22 Wrong Tests 16 Reliability 10.0 Attempt pass rate 30.3% Flaky tests 2 Input Tokens 114,375 Output Tokens 14,903 Reasoning Tokens 0 Response Time (avg) 10.68s Response Time (total) 213.51s Response Time (max) 36.03s Wrong answer: 9 Invalid tool call: 3 API error: 2 Did not follow instructions: 2 Anti-AI Tricks : 6.8 Coding : 5.3 Combined : 3.0 Data parsing and extraction : 6.5 Domain specific : 3.0 General Intelligence : 4.0 Instructions following : 9.8 Puzzle Solving : 2.9 Tool Calling : 3.0 Trivia : 3.0
#179	Ring-2.6-1T (none)	4.8	Inclusionai	$0.026 ↕	55.10s
View model card Total Tests 22 Wrong Tests 13 Reliability 10.0 Attempt pass rate 45.5% Flaky tests 2 Input Tokens 7,599 Output Tokens 39,954 Reasoning Tokens 0 Response Time (avg) 55.10s Response Time (total) 881.55s Response Time (max) 143.82s API error: 6 Wrong answer: 5 Did not follow instructions: 2 Anti-AI Tricks : 9.2 Coding : 5.3 Combined : 3.0 Data parsing and extraction : 3.0 Domain specific : 5.3 General Intelligence : 4.3 Instructions following : 9.8 Puzzle Solving : 7.7 Tool Calling : 3.0 Trivia : 3.0
#180	GPT-5.4 Nano (none)	4.8	OpenAI	$0.041	2.57s
View model card Total Tests 22 Wrong Tests 18 Reliability 10.0 Attempt pass rate 28.8% Flaky tests 5 Input Tokens 115,924 Output Tokens 13,794 Reasoning Tokens 0 Response Time (avg) 2.57s Response Time (total) 56.51s Response Time (max) 25.50s Wrong answer: 15 Did not follow instructions: 2 No answer: 1 Anti-AI Tricks : 3.5 Coding : 4.6 Combined : 3.0 Data parsing and extraction : 6.5 Domain specific : 2.9 General Intelligence : 3.8 Instructions following : 6.3 Puzzle Solving : 5.4 Tool Calling : 10.0 Trivia : 3.0
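Each model's detail line above packs its test counts and per-category scores into one run of text. For readers who want to work with these numbers, the following is a minimal parsing sketch, assuming the "Label : value" and "Total Tests N" patterns seen on this page hold for every row. The function names are hypothetical and not part of any AI BENCHY tooling.

```python
import re

# Category labels exactly as they appear in the detail lines on this page.
CATEGORIES = [
    "Anti-AI Tricks", "Coding", "Combined", "Data parsing and extraction",
    "Domain specific", "General Intelligence", "Instructions following",
    "Puzzle Solving", "Tool Calling", "Trivia",
]


def parse_category_scores(detail_line):
    """Return {category label: score} for the labels found in one detail line."""
    scores = {}
    for label in CATEGORIES:
        # Scores are written as "<label> : <number>", e.g. "Coding : 3.9".
        match = re.search(re.escape(label) + r" : (\d+(?:\.\d+)?)", detail_line)
        if match:
            scores[label] = float(match.group(1))
    return scores


def parse_test_counts(detail_line):
    """Return (total_tests, wrong_tests) read from one detail line."""
    total = int(re.search(r"Total Tests (\d+)", detail_line).group(1))
    wrong = int(re.search(r"Wrong Tests (\d+)", detail_line).group(1))
    return total, wrong


# Shortened excerpt of the GLM 5.1 detail line, for illustration only.
sample = (
    "Total Tests 22 Wrong Tests 15 Reliability 10.0 "
    "Anti-AI Tricks : 4.0 Coding : 3.9 Tool Calling : 10.0 Trivia : 3.0"
)
scores = parse_category_scores(sample)
total_tests, wrong_tests = parse_test_counts(sample)
```

The sketch only extracts values. It does not attempt to reproduce the overall Score column, whose weighting is described in the methodology rather than on this page.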

