Gemma 4 31B vs Gemini 3 Flash Preview vs Gemini 3 PRO Preview vs Gemini 3.1 Pro Preview benchmark comparison
Gemini 3 Flash Preview leads on Score with 9.6. Gemma 4 31B, Gemini 3 Flash Preview, and Gemini 3.1 Pro Preview tie on Reliability at 10.0. Gemma 4 31B has the lowest Total Cost at $0.033. Gemini 3 PRO Preview is fastest, with an average response time of 9.05s.
Recommended model: Gemini 3 Flash Preview. It has the best score here (9.6) and responds about 1.5x faster than the other models do on average, although Gemini 3 PRO Preview alone is faster (9.05s vs. 18.64s).
Score
Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.
Gemma 4 31B: 6.3
Gemini 3 Flash Preview: 9.6
Gemini 3 PRO Preview: 6.2
Gemini 3.1 Pro Preview: 9.2
Rank
Gemma 4 31B: #91
Gemini 3 Flash Preview: #2
Gemini 3 PRO Preview: #94
Gemini 3.1 Pro Preview: #7
Reliability
First-attempt success score: 10.0 means no retryable target API or rate-limit failures before successful calls; tracked failures lower the score.
Gemma 4 31B: 10.0
Gemini 3 Flash Preview: 10.0
Gemini 3 PRO Preview: N/A
Gemini 3.1 Pro Preview: 10.0
Consistency
Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).
Gemma 4 31B: 9.4
Gemini 3 Flash Preview: 9.7
Gemini 3 PRO Preview: 10.0
Gemini 3.1 Pro Preview: 10.0
Tests Correct
A test is fully passed only if every run passed for that test. Failed tests by cause:
Gemma 4 31B: API error 2, Timed out 2, Wrong answer 2, No answer 1
Gemini 3 Flash Preview: Wrong answer 1
Gemini 3 PRO Preview: API error 4, Wrong answer 3
Gemini 3.1 Pro Preview: Wrong answer 2
Attempt pass rate
Attempt pass rate = passed attempts / total attempts across runs.
Gemma 4 31B: 69.8%
Gemini 3 Flash Preview: 98.4%
Gemini 3 PRO Preview: 66.7%
Gemini 3.1 Pro Preview: 90.5%
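The ratio defined above can be sketched in a few lines of Python. This is a minimal illustration, assuming 63 attempts per model (the Total Runs figure reported on this page). The passed-attempt counts are not published; the ones used here are inferred from the reported percentages and should be read as assumptions.

```python
def attempt_pass_rate(passed_attempts: int, total_attempts: int) -> float:
    """Return passed attempts / total attempts as a percentage (one decimal)."""
    if total_attempts <= 0:
        raise ValueError("total_attempts must be positive")
    return round(100 * passed_attempts / total_attempts, 1)


# Inferred (not published) passed-attempt counts: assumptions for illustration.
inferred_passed = {
    "Gemma 4 31B": 44,
    "Gemini 3 Flash Preview": 62,
    "Gemini 3 PRO Preview": 42,
    "Gemini 3.1 Pro Preview": 57,
}

TOTAL_ATTEMPTS = 63  # Total Runs reported for every model

rates = {
    model: attempt_pass_rate(passed, TOTAL_ATTEMPTS)
    for model, passed in inferred_passed.items()
}

for model, rate in rates.items():
    print(f"{model}: {rate}%")
```

With these inferred counts the function reproduces the four percentages listed above.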
Flaky tests
Flaky tests had mixed outcomes across runs (at least one pass and one fail).
Gemma 4 31B: 1
Gemini 3 Flash Preview: 1
Gemini 3 PRO Preview: 0
Gemini 3.1 Pro Preview: 0
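The "fully passed" and "flaky" definitions on this page both depend on the per-run outcomes of a single test. A minimal Python sketch of that classification follows; the function name `classify_test` and the label "failed" for a test with no passing run are hypothetical and not taken from the benchmark's own code.

```python
from typing import Sequence


def classify_test(run_results: Sequence[bool]) -> str:
    """Classify one test from its per-run outcomes (True means the run passed)."""
    if not run_results:
        raise ValueError("a test needs at least one run")
    if all(run_results):
        return "fully passed"  # every run passed
    if any(run_results):
        return "flaky"  # at least one pass and at least one fail
    return "failed"  # no run passed (label is an assumption)


print(classify_test([True, True, True]))    # fully passed
print(classify_test([True, False, True]))   # flaky
print(classify_test([False, False, False])) # failed
```

Under these definitions a flaky test never counts as fully passed, which is why a model can have a high attempt pass rate and still lose a test.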
Total Runs
Gemma 4 31B: 63
Gemini 3 Flash Preview: 63
Gemini 3 PRO Preview: 63
Gemini 3.1 Pro Preview: 63
Cost per result
Shows the average cost per correct benchmark answer in cents (lower is better).
Gemma 4 31B: 0.257
Gemini 3 Flash Preview: 3.335
Gemini 3 PRO Preview: 1.406
Gemini 3.1 Pro Preview: 5.546
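As a worked example of "average cost per correct benchmark answer in cents", the sketch below divides a total cost by a count of correct results. It rests on assumptions: the page does not state the denominator, and the figure of 20 correct tests for Gemini 3 Flash Preview is inferred (63 runs at an assumed three runs per test gives 21 tests, minus the one test with a wrong answer). The $0.667 total cost is the value reported on this page.

```python
def cost_per_result_cents(total_cost_usd: float, correct_results: int) -> float:
    """Average cost per correct result, in cents, rounded to three decimals."""
    if correct_results <= 0:
        raise ValueError("correct_results must be positive")
    return round(100 * total_cost_usd / correct_results, 3)


# Gemini 3 Flash Preview: reported total cost, inferred correct-test count.
flash_cost_per_result = cost_per_result_cents(total_cost_usd=0.667, correct_results=20)
print(flash_cost_per_result)
```

Under these assumptions the result matches the 3.335 listed for Gemini 3 Flash Preview. The other models' figures do not all reproduce exactly from the rounded totals shown here, so treat the formula as a plausible reading rather than the site's definition.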
Total Cost
Total cost at current price.
Gemma 4 31B: $0.033
Gemini 3 Flash Preview: $0.667
Gemini 3 PRO Preview: $0.385
Gemini 3.1 Pro Preview: $1.054
Input Price
Gemma 4 31B: $0.120 / 1M tokens
Gemini 3 Flash Preview: $0.500 / 1M tokens
Gemini 3 PRO Preview: $9.506 / 1M tokens
Gemini 3.1 Pro Preview: $2.000 / 1M tokens
Output Price
Gemma 4 31B: $0.350 / 1M tokens
Gemini 3 Flash Preview: $3.000 / 1M tokens
Gemini 3 PRO Preview: $9.506 / 1M tokens
Gemini 3.1 Pro Preview: $12.000 / 1M tokens
Total Input Tokens
Gemma 4 31B: 17,957
Gemini 3 Flash Preview: 37,017
Gemini 3 PRO Preview: 28,848
Gemini 3.1 Pro Preview: 41,617
Output Tokens
Gemma 4 31B: 22,356
Gemini 3 Flash Preview: 2,006
Gemini 3 PRO Preview: 1,490
Gemini 3.1 Pro Preview: 1,977
Reasoning Tokens
Gemma 4 31B: 65,726
Gemini 3 Flash Preview: 214,153
Gemini 3 PRO Preview: 10,102
Gemini 3.1 Pro Preview: 78,896
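The Total Cost figures on this page appear consistent with a simple token-times-price calculation. The sketch below assumes that reasoning tokens are billed at the output price; the page does not say so, but that assumption reproduces the reported totals for three of the four models. The function name and parameter names are hypothetical.

```python
def total_cost_usd(
    input_tokens: int,
    output_tokens: int,
    reasoning_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Estimate total cost in USD from token counts and per-1M-token prices."""
    # Assumption: reasoning tokens are charged at the output-token price.
    billed_output = output_tokens + reasoning_tokens
    micro_usd = input_tokens * input_price_per_m + billed_output * output_price_per_m
    return micro_usd / 1_000_000


# Token counts and prices as reported on this page.
gemma_cost = total_cost_usd(17_957, 22_356, 65_726, 0.120, 0.350)
flash_cost = total_cost_usd(37_017, 2_006, 214_153, 0.500, 3.000)
pro_31_cost = total_cost_usd(41_617, 1_977, 78_896, 2.000, 12.000)

for name, cost in [
    ("Gemma 4 31B", gemma_cost),
    ("Gemini 3 Flash Preview", flash_cost),
    ("Gemini 3.1 Pro Preview", pro_31_cost),
]:
    print(f"{name}: ${cost:.3f}")
```

Rounded to three decimals, these estimates match the reported $0.033, $0.667, and $1.054. The same formula gives about $0.384 for the archived Gemini 3 PRO Preview against the reported $0.385, so its listed prices may themselves be rounded or derived.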
Response Time (avg)
Gemma 4 31B: 56.55s
Gemini 3 Flash Preview: 18.64s
Gemini 3 PRO Preview: 9.05s
Gemini 3.1 Pro Preview: 20.14s
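One reading of the "about 1.5x faster" statement in the recommendation that fits these averages is a comparison against the mean of the other three models. The sketch below computes that ratio; the interpretation is an assumption, since the page does not say how the figure was derived.

```python
from statistics import mean

# Average response times in seconds, as reported on this page.
avg_response_s = {
    "Gemma 4 31B": 56.55,
    "Gemini 3 Flash Preview": 18.64,
    "Gemini 3 PRO Preview": 9.05,
    "Gemini 3.1 Pro Preview": 20.14,
}


def speedup_vs_others(times: dict[str, float], model: str) -> float:
    """Ratio of the other models' mean response time to the given model's time."""
    others = [t for name, t in times.items() if name != model]
    return mean(others) / times[model]


flash_speedup = speedup_vs_others(avg_response_s, "Gemini 3 Flash Preview")
print(f"{flash_speedup:.2f}x")
```

The ratio comes out slightly above 1.5. Compared model by model, Gemini 3 Flash Preview is faster than Gemma 4 31B and Gemini 3.1 Pro Preview but slower than Gemini 3 PRO Preview.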
Response Time (max)
Gemma 4 31B: 437.40s
Gemini 3 Flash Preview: 117.26s
Gemini 3 PRO Preview: 26.24s
Gemini 3.1 Pro Preview: 88.68s
Response Time (total)
Gemma 4 31B: 1074.41s
Gemini 3 Flash Preview: 391.35s
Gemini 3 PRO Preview: 90.53s
Gemini 3.1 Pro Preview: 281.92s
Generation showcase
Hamster playing table tennis
Prompt: Create a detailed SVG illustration of a hamster playing table tennis.
#91 Gemma 4 31B (medium): Cost $0.002, Time 45.7s, Tokens 2,696
#2 Gemini 3 Flash Preview (medium): Cost $0.010, Time 17.9s, Tokens 3,236
#94 Gemini 3 PRO Preview (medium): No endpoints found for google/gemini-3-pro-preview.

Gemma 4 31B: Score 10.0; Consistency 10.0; Attempt pass rate 100.0%; Flaky tests 0; No failed answers; Response Time avg 12.89s, max 26.66s, total 51.55s
Gemini 3 Flash Preview: Score 10.0; Consistency 10.0; Attempt pass rate 100.0%; Flaky tests 0; No failed answers; Response Time avg 3.88s, max 5.73s, total 15.53s; Total Input Tokens 494, Output Tokens 330, Reasoning Tokens 3,216
Gemini 3 PRO Preview (archived model: this model is no longer updated or tested on new tests): Score 10.0; Consistency 10.0; Attempt pass rate 100.0%; Flaky tests 0; No failed answers; Response Time avg 14.99s, max 26.24s, total 29.99s
Gemini 3.1 Pro Preview: Score 10.0; Consistency 10.0; Attempt pass rate 100.0%; Flaky tests 0; No failed answers; Response Time avg 7.90s, max 9.52s, total 15.80s

Gemma 4 31B: Score 4.3; Consistency 5.8; Attempt pass rate 22.2%; Flaky tests 1; Timed out 2, No answer 1; Response Time avg 219.76s, max 437.40s, total 659.27s
Gemini 3 Flash Preview: Score 8.6; Consistency 7.6; Attempt pass rate 88.9%; Flaky tests 1; Wrong answer 1; Response Time avg 84.40s, max 117.26s, total 253.21s; Total Input Tokens 8,122, Output Tokens 462, Reasoning Tokens 161,084
Gemini 3 PRO Preview: Score 3.0; Consistency 10.0; Attempt pass rate 0.0%; Flaky tests 0; API error 3; Response Time avg 0ms, max 0ms, total 0ms
Gemini 3.1 Pro Preview: Score 7.9; Consistency 9.9; Attempt pass rate 66.7%; Flaky tests 0; Wrong answer 1; Response Time avg 40.17s, max 88.68s, total 120.52s

Gemma 4 31B: Score 3.0; Consistency 10.0; Attempt pass rate 0.0%; Flaky tests 0; API error 1; Response Time avg 0ms, max 0ms, total 0ms
Gemini 3 Flash Preview: Score 10.0; Consistency 10.0; Attempt pass rate 100.0%; Flaky tests 0; No failed answers; Response Time avg 22.42s, max 22.42s, total 22.42s; Total Input Tokens 12,873, Output Tokens 351, Reasoning Tokens 10,485
Gemini 3 PRO Preview: Score 3.0; Consistency 10.0; Attempt pass rate 0.0%; Flaky tests 0; Wrong answer 1; Response Time avg 10.37s, max 10.37s, total 10.37s
Gemini 3.1 Pro Preview: Score 9.5; Consistency 10.0; Attempt pass rate 100.0%; Flaky tests 0; No failed answers; Response Time avg 40.61s, max 40.61s, total 40.61s

Gemma 4 31B: Score 10.0; Consistency 10.0; Attempt pass rate 100.0%; Flaky tests 0; No failed answers; Response Time avg 21.11s, max 21.94s, total 42.21s
Gemini 3 Flash Preview: Score 10.0; Consistency 10.0; Attempt pass rate 100.0%; Flaky tests 0; No failed answers; Response Time avg 5.43s, max 6.18s, total 10.86s; Total Input Tokens 7,548, Output Tokens 279, Reasoning Tokens 4,893
Gemini 3 PRO Preview: Score 10.0; Consistency 10.0; Attempt pass rate 100.0%; Flaky tests 0; No failed answers; Response Time avg 10.84s, max 10.84s, total 10.84s
Gemini 3.1 Pro Preview: Score 10.0; Consistency 10.0; Attempt pass rate 100.0%; Flaky tests 0; No failed answers; Response Time avg 7.72s, max 7.72s, total 7.72s

Gemma 4 31B: Score 7.7; Consistency 10.0; Attempt pass rate 66.7%; Flaky tests 0; Wrong answer 1; Response Time avg 38.48s, max 68.92s, total 115.43s
Gemini 3 Flash Preview: Score 10.0; Consistency 10.0; Attempt pass rate 100.0%; Flaky tests 0; No failed answers; Response Time avg 15.27s, max 34.09s, total 45.80s; Total Input Tokens 633, Output Tokens 12, Reasoning Tokens 21,684
Gemini 3 PRO Preview: Score 5.3; Consistency 10.0; Attempt pass rate 33.3%; Flaky tests 0; Wrong answer 2; Response Time avg 7.01s, max 7.01s, total 7.01s
Gemini 3.1 Pro Preview: Score 7.7; Consistency 10.0; Attempt pass rate 66.7%; Flaky tests 0; Wrong answer 1; Response Time avg 32.73s, max 32.73s, total 32.73s

Gemma 4 31B: Score 10.0; Consistency 10.0; Attempt pass rate 100.0%; Flaky tests 0; No failed answers; Response Time avg 9.57s, max 9.57s, total 9.57s
Gemini 3 Flash Preview: Score 10.0; Consistency 10.0; Attempt pass rate 100.0%; Flaky tests 0; No failed answers; Response Time avg 5.19s, max 5.19s, total 5.19s; Total Input Tokens 486, Output Tokens 72, Reasoning Tokens 1,905
Gemini 3 PRO Preview: Score 10.0; Consistency 10.0; Attempt pass rate 100.0%; Flaky tests 0; No failed answers; Response Time avg 9.34s, max 9.34s, total 9.34s
Gemini 3.1 Pro Preview: Score 10.0; Consistency 10.0; Attempt pass rate 100.0%; Flaky tests 0; No failed answers; Response Time avg 11.77s, max 11.77s, total 11.77s

Gemma 4 31B: Score 10.0; Consistency 10.0; Attempt pass rate 100.0%; Flaky tests 0; No failed answers; Response Time avg 12.76s, max 17.53s, total 25.52s
Gemini 3 Flash Preview: Score 10.0; Consistency 10.0; Attempt pass rate 100.0%; Flaky tests 0; No failed answers; Response Time avg 4.04s, max 4.70s, total 8.08s; Total Input Tokens 615, Output Tokens 72, Reasoning Tokens 2,709
Gemini 3 PRO Preview: Score 9.8; Consistency 10.0; Attempt pass rate 100.0%; Flaky tests 0; No failed answers; Response Time avg 3.26s, max 3.26s, total 3.26s
Gemini 3.1 Pro Preview: Score 10.0; Consistency 10.0; Attempt pass rate 100.0%; Flaky tests 0; No failed answers; Response Time avg 9.56s, max 9.56s, total 9.56s

Gemma 4 31B: Score 9.9; Consistency 10.0; Attempt pass rate 100.0%; Flaky tests 0; No failed answers; Response Time avg 26.91s, max 61.08s, total 80.72s
Gemini 3 Flash Preview: Score 10.0; Consistency 10.0; Attempt pass rate 100.0%; Flaky tests 0; No failed answers; Response Time avg 4.05s, max 5.64s, total 12.15s; Total Input Tokens 558, Output Tokens 183, Reasoning Tokens 4,365
Gemini 3 PRO Preview: Score 10.0; Consistency 10.0; Attempt pass rate 100.0%; Flaky tests 0; No failed answers; Response Time avg 3.88s, max 4.23s, total 7.77s
Gemini 3.1 Pro Preview: Score 10.0; Consistency 10.0; Attempt pass rate 100.0%; Flaky tests 0; No failed answers; Response Time avg 6.90s, max 8.49s, total 13.79s

Gemma 4 31B: Score 3.0; Consistency 10.0; Attempt pass rate 0.0%; Flaky tests 0; API error 1; Response Time avg 0ms, max 0ms, total 0ms
Gemini 3 Flash Preview: Score 10.0; Consistency 10.0; Attempt pass rate 100.0%; Flaky tests 0; No failed answers; Response Time avg 12.60s, max 12.60s, total 12.60s; Total Input Tokens 5,532, Output Tokens 234, Reasoning Tokens 1,487
Gemini 3 PRO Preview: Score 10.0; Consistency 10.0; Attempt pass rate 100.0%; Flaky tests 0; No failed answers; Response Time avg 11.96s, max 11.96s, total 11.96s
Gemini 3.1 Pro Preview: Score 10.0; Consistency 10.0; Attempt pass rate 100.0%; Flaky tests 0; No failed answers; Response Time avg 23.15s, max 23.15s, total 23.15s

Gemma 4 31B: Score 3.0; Consistency 10.0; Attempt pass rate 0.0%; Flaky tests 0; Wrong answer 1; Response Time avg 90.14s, max 90.14s, total 90.14s
Gemini 3 Flash Preview: Score 10.0; Consistency 10.0; Attempt pass rate 100.0%; Flaky tests 0; No failed answers; Response Time avg 5.50s, max 5.50s, total 5.50s; Total Input Tokens 156, Output Tokens 11, Reasoning Tokens 2,325
Gemini 3 PRO Preview: Score 3.0; Consistency 10.0; Attempt pass rate 0.0%; Flaky tests 0; API error 1; Response Time avg 0ms, max 0ms, total 0ms
Gemini 3.1 Pro Preview: Score 10.0; Consistency 10.0; Attempt pass rate 100.0%; Flaky tests 0; No failed answers; Response Time avg 6.27s, max 6.27s, total 6.27s