Models compared:
Grok 4.20 Beta (medium), release 2026-03-12
Grok 4.1 Fast (medium), release 2025-11-19
Hunter Alpha (medium), release 2026-03-11
All three are archived models: they are no longer updated or tested on new tests.

| Metric | Grok 4.20 Beta | Grok 4.1 Fast | Hunter Alpha |
| --- | --- | --- | --- |
| Score | 8.5 | 6.5 | 6.7 |
| Rank | #14 | #88 | #76 |
| Reliability | N/A | 10.0 | N/A |
| Consistency | 9.5 | 7.3 | 7.4 |
| Tests Correct (attempt pass rate) | 81.5% | 61.4% | 64.8% |
| Flaky tests | 1 | 6 | 6 |
| Total Runs | 128 | 133 | 90 |
| Cost per result (cents) | 8.557 | 0.926 | 0.000 |
| Total Cost | $1.198 | $0.084 | $0.000 |
| Input Price | $0.000 / 1M | $0.000 / 1M | $0.000 / 1M |
| Output Price | $0.000 / 1M | $0.000 / 1M | $0.000 / 1M |
| Output Tokens | 4,915 | 3,298 | 6,506 |
| Reasoning Tokens | 177,787 | 139,122 | 24,809 |
| Response Time (avg) | 9.75s | 23.85s | 10.33s |
| Response Time (max) | 31.36s | 121.79s | 30.53s |
| Response Time (total) | 175.48s | 286.16s | 175.58s |

Failed tests (one breakdown is listed; its response times match the Grok 4.20 Beta column): Wrong answer: 3; Did not follow instructions: 1.

Metric definitions:
Score: summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.
Reliability: first-attempt success score. 10.0 means no retryable target API or rate-limit failures before successful calls; tracked failures lower the score.
Consistency: reflects run-to-run stability (10 = very consistent, even if consistently wrong).
Tests Correct: a test is fully passed only if every run passed for that test.
Attempt pass rate: passed attempts / total attempts across runs.
Flaky tests: tests that had mixed outcomes across runs (at least one pass and one fail).
Cost per result: the average cost per correct benchmark answer in cents (lower is better).
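Cost per result is described on this page as the average cost per correct benchmark answer, in cents. A minimal Python sketch of that arithmetic follows. The function name is hypothetical, and the count of 14 correct answers is an illustrative assumption chosen so the numbers line up with the Grok 4.20 Beta column; the page itself does not report a count of correct answers.

```python
def cost_per_result_cents(total_cost_usd: float, correct_answers: int) -> float:
    """Average cost per correct answer, in cents (lower is better)."""
    if correct_answers <= 0:
        raise ValueError("need at least one correct answer to average over")
    # Convert dollars to cents, then divide by the number of correct answers.
    return total_cost_usd * 100 / correct_answers


# Total cost taken from the table; the count of 14 is an assumption.
value = cost_per_result_cents(1.198, 14)
print(f"{value:.3f}")  # 8.557
```

A model with a total cost of $0.000 yields 0.000 cents per result regardless of how many answers were correct, which is consistent with the Hunter Alpha column.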
[Charts: Top Models by Score; Score vs Total Cost; Response Time (avg); Score vs Response Time (avg); Total Output Tokens; Score vs Total Output Tokens]
Category Breakdown
Anti-AI Tricks

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failed answers) | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Grok 4.20 Beta | 8.7 | 7.9 | 91.7% | 1 | Wrong answer: 1 | 3.16s | 3.44s | 12.65s | 268 | 7,583 |
| Grok 4.1 Fast | 8.7 | 7.9 | 91.7% | 1 | Wrong answer: 1 | 3.81s | 5.65s | 7.62s | 108 | 4,741 |
| Hunter Alpha | 7.3 | 5.8 | 83.3% | 2 | Wrong answer: 2 | 4.75s | 7.62s | 19.00s | 479 | 1,103 |
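The page defines attempt pass rate as passed attempts / total attempts across runs, counts a test as fully passed only if every run passed, and calls a test flaky when its runs had mixed outcomes. The Python sketch below shows how those three figures could be derived from per-run results. The function name, the test names, and the run data are all hypothetical; the data is chosen only so the pass rate comes out at 91.7% with one flaky test, and it is not the benchmark's actual test set.

```python
from typing import Dict, List


def summarize_runs(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Summarize per-test run outcomes (True = pass, False = fail)."""
    total_attempts = sum(len(runs) for runs in results.values())
    if total_attempts == 0:
        raise ValueError("no attempts recorded")
    passed_attempts = sum(sum(runs) for runs in results.values())
    # Fully passed: every run of the test passed.
    fully_passed = sum(1 for runs in results.values() if runs and all(runs))
    # Flaky: at least one pass and at least one fail.
    flaky = sum(1 for runs in results.values() if any(runs) and not all(runs))
    return {
        "attempt_pass_rate": passed_attempts / total_attempts,
        "fully_passed": fully_passed,
        "flaky": flaky,
    }


# Hypothetical data: 4 tests, 3 runs each, a single failed attempt.
example = {
    "test_a": [True, True, True],
    "test_b": [True, True, True],
    "test_c": [True, False, True],
    "test_d": [True, True, True],
}
summary = summarize_runs(example)
print(f"{summary['attempt_pass_rate']:.1%}")  # 91.7%
print(summary["fully_passed"], summary["flaky"])  # 3 1
```

Under these definitions, a test that fails every run lowers the attempt pass rate but is not flaky, which is how a category can show a 0.0% pass rate alongside 0 flaky tests.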
Coding

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failed answers) | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Grok 4.20 Beta | 10.0 | 10.0 | 100.0% | 0 | No failed answers | 31.36s | 31.36s | 31.36s | 81 | 3,987 |
| Grok 4.1 Fast | 2.3 | 1.1 | 33.3% | 1 | Did not follow instructions: 1 | 23.58s | 23.58s | 23.58s | 821 | 6,703 |
| Hunter Alpha | 3.0 | 10.0 | 0.0% | 0 | API error: 1 | 0ms | 0ms | 0ms | 0 | 0 |
Combined

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failed answers) | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Grok 4.20 Beta | 10.0 | 10.0 | 100.0% | 0 | No failed answers | 20.93s | 20.93s | 20.93s | 227 | 12,212 |
| Grok 4.1 Fast | 10.0 | 10.0 | 100.0% | 0 | No failed answers | 37.64s | 37.64s | 37.64s | 261 | 12,272 |
| Hunter Alpha | 4.7 | 1.6 | 66.7% | 1 | Timed out: 1 | 30.53s | 30.53s | 30.53s | 792 | 3,456 |
Data parsing and extraction

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failed answers) | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Grok 4.20 Beta | 10.0 | 10.0 | 100.0% | 0 | No failed answers | 4.01s | 4.27s | 8.02s | 180 | 5,281 |
| Grok 4.1 Fast | 10.0 | 10.0 | 100.0% | 0 | No failed answers | 6.63s | 6.63s | 6.63s | 180 | 5,409 |
| Hunter Alpha | 10.0 | 10.0 | 100.0% | 0 | No failed answers | 23.16s | 26.55s | 46.33s | 1,488 | 8,017 |
Domain specific

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failed answers) | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Grok 4.20 Beta | 5.3 | 10.0 | 33.3% | 0 | Wrong answer: 2 | 21.33s | 24.21s | 64.00s | 251 | 40,255 |
| Grok 4.1 Fast | 5.8 | 4.4 | 66.7% | 2 | Timed out: 1; Wrong answer: 1 | 121.79s | 121.79s | 121.79s | 11 | 37,657 |
| Hunter Alpha | 3.0 | 10.0 | 0.0% | 0 | Extra formatting: 1; Timed out: 1; Wrong answer: 1 | 10.52s | 18.68s | 31.56s | 892 | 2,406 |
General Intelligence

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failed answers) | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Grok 4.20 Beta | 10.0 | 10.0 | 100.0% | 0 | No failed answers | 5.78s | 5.78s | 5.78s | 72 | 3,440 |
| Grok 4.1 Fast | 4.2 | 9.9 | 0.0% | 0 | Did not follow instructions: 1 | 16.25s | 16.25s | 16.25s | 127 | 3,456 |
| Hunter Alpha | 7.0 | 3.7 | 66.7% | 1 | Did not follow instructions: 1 | 6.44s | 6.44s | 6.44s | 116 | 260 |
Instructions following

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failed answers) | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Grok 4.20 Beta | 9.8 | 10.0 | 100.0% | 0 | No failed answers | 4.89s | 5.89s | 9.78s | 703 | 67,771 |
| Grok 4.1 Fast | 6.5 | 10.0 | 50.0% | 0 | Did not follow instructions: 1 | 4.63s | 4.63s | 4.63s | 662 | 21,680 |
| Hunter Alpha | 9.9 | 10.0 | 100.0% | 0 | No failed answers | 4.18s | 4.46s | 8.36s | 208 | 465 |
Puzzle Solving

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failed answers) | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Grok 4.20 Beta | 10.0 | 10.0 | 100.0% | 0 | No failed answers | 3.52s | 4.53s | 10.57s | 2,950 | 31,874 |
| Grok 4.1 Fast | 5.3 | 7.2 | 44.4% | 1 | Did not follow instructions: 1; Wrong answer: 1 | 7.40s | 7.79s | 14.81s | 853 | 30,338 |
| Hunter Alpha | 6.1 | 4.7 | 66.7% | 2 | Did not follow instructions: 1; Wrong answer: 1 | 5.35s | 6.20s | 16.06s | 2,223 | 8,198 |
Tool Calling

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failed answers) | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Grok 4.20 Beta | 3.0 | 10.0 | 0.0% | 0 | Did not follow instructions: 1 | 12.39s | 12.39s | 12.39s | 183 | 5,384 |
| Grok 4.1 Fast | 2.8 | 1.6 | 33.3% | 1 | No answer: 1 | 27.71s | 27.71s | 27.71s | 260 | 11,485 |
| Hunter Alpha | 10.0 | 10.0 | 100.0% | 0 | No failed answers | 17.33s | 17.33s | 17.33s | 308 | 904 |
Trivia

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failed answers) | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Grok 4.20 Beta | - | - | - | - | - | - | - | - | - | - |
| Grok 4.1 Fast | 3.0 | 10.0 | 0.0% | 0 | Wrong answer: 1 | 25.52s | 25.52s | 25.52s | 15 | 5,381 |
Hunter Alpha