8.0 | 6.4 | 6.7 | 10.0
Score summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.
Rank
#35 | #72 | #68 | #1
Reliability
N/A | N/A | N/A | N/A
First-attempt success score: 10.0 means no retryable target API or rate-limit failures before successful calls; tracked failures lower the score.
Consistency
9.1 | 7.4 | 7.2 | 10.0
Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).
Tests Correct
A test is fully passed only if every run passed for that test.
Did not follow instructions: 3, Wrong answer: 3 (response time: avg 9.81s, max 31.36s, total 176.62s)
No failed answers (response time: avg 12.11s, max 82.37s, total 217.93s)
Attempt pass rate
74.1% | 57.4% | 64.8% | 100.0%
Attempt pass rate = passed attempts / total attempts across runs.
Flaky tests
2 | 6 | 6 | 0
Flaky tests had mixed outcomes across runs (at least one pass and one fail).
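The three run-level definitions above (attempt pass rate, fully passed tests, flaky tests) can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `summarize`, the `(test_id, passed)` input shape, and the sample data are hypothetical and are not taken from the benchmark's own tooling.

```python
from collections import defaultdict


def summarize(attempts):
    """Summarize run outcomes.

    attempts: iterable of (test_id, passed) pairs, one pair per run of a test.
    (This input shape is an assumption made for the sketch.)
    """
    by_test = defaultdict(list)
    for test_id, passed in attempts:
        by_test[test_id].append(bool(passed))

    total_attempts = sum(len(runs) for runs in by_test.values())
    passed_attempts = sum(sum(runs) for runs in by_test.values())

    return {
        # passed attempts / total attempts across runs
        "attempt_pass_rate": passed_attempts / total_attempts if total_attempts else 0.0,
        # a test is fully passed only if every run passed
        "fully_passed": sum(1 for runs in by_test.values() if all(runs)),
        # flaky: at least one pass and at least one fail
        "flaky": sum(1 for runs in by_test.values() if any(runs) and not all(runs)),
    }


# Hypothetical data: three tests, three runs each.
example = [
    ("t1", True), ("t1", True), ("t1", True),     # fully passed
    ("t2", True), ("t2", False), ("t2", True),    # flaky
    ("t3", False), ("t3", False), ("t3", False),  # consistently failing
]
result = summarize(example)
```

With this sample, one test is fully passed, one is flaky, and five of nine attempts pass. A consistently failing test counts as neither flaky nor passed, which matches the note that a model can be very consistent while being consistently wrong.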
Total Runs
52 | 52 | 54 | 18
Cost per result
5.269 | 72.473 | 0.613 | 0.600
Shows the average cost per correct benchmark answer in cents (lower is better).
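As a rough check of how a cost-per-result figure in cents relates to a total cost in dollars, the sketch below divides total cost by the number of correct answers. The helper name `cost_per_correct_cents` is hypothetical. The count of 18 correct answers for the last column is an assumption, inferred from its 18 total runs and 100.0% attempt pass rate; the page does not state how it counts correct answers.

```python
def cost_per_correct_cents(total_cost_usd, correct_answers):
    """Average cost per correct answer, in cents.

    Returns None when there are no correct answers, since the
    average is undefined in that case.
    """
    if correct_answers <= 0:
        return None
    return total_cost_usd * 100 / correct_answers


# Last column of the table: Total Cost $0.108.
# 18 correct answers is an assumption (18 total runs, 100.0% pass rate).
value = cost_per_correct_cents(0.108, 18)
```

Under that assumption the result is about 0.600 cents, which agrees with the figure listed for that column.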
Total Cost
$0.633 | $5.074 | $0.056 | $0.108
Input Price
$0.000 / 1M | $0.000 / 1M | $0.200 / 1M | $0.500 / 1M
Output Price
$0.000 / 1M | $0.000 / 1M | $0.500 / 1M | $3.000 / 1M
Output Tokens
1,568 | 299,034 | 2,010 | 655
Reasoning Tokens
91,909 | 309,670 | 91,298 | 33,749
Response Time (avg)
9.81s | 9.80s | 23.88s | 12.11s
Response Time (max)
31.36s | 35.28s | 121.79s | 82.37s
Response Time (total)
176.62s | 156.75s | 262.66s | 217.93s
[Charts: Top Models by Score; Score vs Total Cost; Response Time (avg); Score vs Response Time (avg); Total Output Tokens; Score vs Total Output Tokens]
Category Breakdown
Anti-AI Tricks
Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct
Grok 4.20 Beta (archived model: this model is no longer updated or tested on new tests)
8.7 | 7.9 | 91.7% | 1 | Wrong answer: 1 (response time: avg 3.16s, max 3.44s, total 12.65s)
6.9 | 5.8 | 75.0% | 2 | Extra formatting: 1, Wrong answer: 1 (response time: avg 3.46s, max 4.38s, total 13.86s)
8.7 | 7.9 | 91.7% | 1 | Wrong answer: 1 (response time: avg 3.81s, max 5.65s, total 7.62s)
10.0 | 10.0 | 100.0% | 0 | No failed answers (response time: avg 3.26s, max 5.01s, total 13.04s)
Response Time (avg): 3.26s | Output Tokens: 110 | Reasoning Tokens: 1,076
Coding
Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct
Grok 4.20 Beta
10.0 | 10.0 | 100.0% | 0 | No failed answers (response time: avg 31.36s, max 31.36s, total 31.36s)
10.0 | 10.0 | 100.0% | 0 | No failed answers (response time: avg 27.11s, max 27.11s, total 27.11s)
2.3 | 1.1 | 33.3% | 1 | Did not follow instructions: 1 (response time: avg 23.58s, max 23.58s, total 23.58s)
10.0 | 10.0 | 100.0% | 0 | No failed answers (response time: avg 82.37s, max 82.37s, total 82.37s)
Response Time (avg): 82.37s | Output Tokens: 144 | Reasoning Tokens: 16,257
Combined
Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct
Grok 4.20 Beta
10.0 | 10.0 | 100.0% | 0 | No failed answers (response time: avg 20.93s, max 20.93s, total 20.93s)
3.0 | 10.0 | 0.0% | 0 | API error: 1 (response time: avg 0ms, max 0ms, total 0ms)
10.0 | 10.0 | 100.0% | 0 | No failed answers (response time: avg 37.64s, max 37.64s, total 37.64s)
10.0 | 10.0 | 100.0% | 0 | No failed answers (response time: avg 23.58s, max 23.58s, total 23.58s)
Response Time (avg): 23.58s | Output Tokens: 117 | Reasoning Tokens: 3,495
Data parsing and extraction
Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct
Grok 4.20 Beta
10.0 | 10.0 | 100.0% | 0 | No failed answers (response time: avg 4.01s, max 4.27s, total 8.02s)
10.0 | 10.0 | 100.0% | 0 | No failed answers (response time: avg 5.54s, max 7.51s, total 11.08s)
10.0 | 10.0 | 100.0% | 0 | No failed answers (response time: avg 6.63s, max 6.63s, total 6.63s)
10.0 | 10.0 | 100.0% | 0 | No failed answers (response time: avg 7.62s, max 8.37s, total 15.24s)
Response Time (avg): 7.62s | Output Tokens: 93 | Reasoning Tokens: 2,197
Domain specific
Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct
Grok 4.20 Beta
5.3 | 10.0 | 33.3% | 0 | Wrong answer: 2 (response time: avg 21.33s, max 24.21s, total 64.00s)
2.9 | 7.2 | 11.1% | 1 | Wrong answer: 2, Extra formatting: 1 (response time: avg 24.67s, max 35.28s, total 74.02s)
5.8 | 4.4 | 66.7% | 2 | Timed out: 1, Wrong answer: 1 (response time: avg 121.79s, max 121.79s, total 121.79s)
10.0 | 10.0 | 100.0% | 0 | No failed answers (response time: avg 14.81s, max 32.44s, total 44.43s)
Response Time (avg): 14.81s | Output Tokens: 4 | Reasoning Tokens: 7,228
General Intelligence
Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct
Grok 4.20 Beta
10.0 | 10.0 | 100.0% | 0 | No failed answers (response time: avg 5.78s, max 5.78s, total 5.78s)
5.8 | 2.8 | 66.7% | 1 | Did not follow instructions: 1 (response time: avg 6.40s, max 6.40s, total 6.40s)
4.2 | 9.9 | 0.0% | 0 | Did not follow instructions: 1 (response time: avg 16.25s, max 16.25s, total 16.25s)
10.0 | 10.0 | 100.0% | 0 | No failed answers (response time: avg 6.34s, max 6.34s, total 6.34s)
Response Time (avg): 6.34s | Output Tokens: 24 | Reasoning Tokens: 635
Instructions following
Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct
Grok 4.20 Beta
8.3 | 10.0 | 50.0% | 0 | Did not follow instructions: 1 (response time: avg 4.97s, max 6.05s, total 9.94s)
8.3 | 10.0 | 50.0% | 0 | Did not follow instructions: 1 (response time: avg 4.63s, max 5.46s, total 9.26s)
6.6 | 10.0 | 50.0% | 0 | Did not follow instructions: 1 (response time: avg 5.30s, max 5.30s, total 5.30s)
10.0 | 10.0 | 100.0% | 0 | No failed answers (response time: avg 4.30s, max 5.19s, total 8.59s)
Response Time (avg): 4.30s | Output Tokens: 24 | Reasoning Tokens: 903
Puzzle Solving
Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct
Grok 4.20 Beta
8.2 | 7.2 | 88.9% | 1 | Did not follow instructions: 1 (response time: avg 3.85s, max 4.53s, total 11.55s)
7.2 | 5.1 | 77.8% | 2 | Did not follow instructions: 2 (response time: avg 5.01s, max 5.49s, total 15.03s)
5.3 | 7.2 | 44.4% | 1 | Did not follow instructions: 1, Wrong answer: 1 (response time: avg 8.08s, max 8.38s, total 16.17s)
10.0 | 10.0 | 100.0% | 0 | No failed answers (response time: avg 4.86s, max 7.59s, total 14.57s)
Response Time (avg): 4.86s | Output Tokens: 61 | Reasoning Tokens: 1,455
Tool Calling
Score
Consistency
Attempt pass rate
Flaky tests
Tests Correct
Response Time (avg)
Output Tokens
Reasoning Tokens
Grok 4.20 Beta
3.0 | 10.0 | 0.0% | 0 | Did not follow instructions: 1 (response time: avg 12.39s, max 12.39s, total 12.39s)
3.0 | 10.0 | 0.0% | 0 | API error: 1 (response time: avg 0ms, max 0ms, total 0ms)
2.8 | 1.6 | 33.3% | 1 | No answer: 1 (response time: avg 27.71s, max 27.71s, total 27.71s)
10.0 | 10.0 | 100.0% | 0 | No failed answers (response time: avg 9.78s, max 9.78s, total 9.78s)