Grok 4.20 vs Grok 4.20 Beta vs Grok 4.3 benchmark comparison
Grok 4.3 leads on Score with 7.7. Grok 4.20 and Grok 4.3 both score 10.0 on Reliability. Grok 4.20 has the lowest Total Cost at $0.609. Grok 4.20 Beta is fastest, with an average response time of 9.75s.
Recommended model: Grok 4.20 Beta. It offers the best overall trade-off: a competitive score (6.8), the fastest average response time in this comparison (9.75s), and the lowest cost per result (4.505 cents), although its total cost ($0.750) is the highest of the three.
Score
Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.
Grok 4.20: 7.3
Grok 4.20 Beta: 6.8
Grok 4.3: 7.7
Rank
Grok 4.20: #53
Grok 4.20 Beta: #69
Grok 4.3: #37
Reliability
First-attempt success score: 10.0 means no retryable target API or rate-limit failures before successful calls; tracked failures lower the score.
Grok 4.20: 10.0
Grok 4.20 Beta: N/A
Grok 4.3: 10.0
Consistency
Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).
Grok 4.20: 8.8
Grok 4.20 Beta: 8.2
Grok 4.3: 8.5
Tests Correct
A test is fully passed only if every run passed for that test.
Grok 4.20: wrong answer 6; did not follow instructions 2; extra formatting 1. Response time: avg 27.68s, max 199.66s, total 581.26s.
Grok 4.20 Beta: wrong answer 3; did not follow instructions 1. Response time: avg 9.75s, max 31.36s, total 175.48s.
Grok 4.3: wrong answer 5; did not follow instructions 2; extra formatting 1. Response time: avg 47.51s, max 216.69s, total 997.68s.
Attempt pass rate
Attempt pass rate = passed attempts / total attempts across runs.
Grok 4.20: 63.5%
Grok 4.20 Beta: 69.8%
Grok 4.3: 71.4%
Flaky tests
Flaky tests had mixed outcomes across runs (at least one pass and one fail).
Grok 4.20: 3
Grok 4.20 Beta: 1
Grok 4.3: 4
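The run-level definitions above (fully passed tests, attempt pass rate, and flaky tests) can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual implementation: the `summarize_runs` function and the sample outcomes are hypothetical, and each test is assumed to be recorded as a list of pass/fail booleans, one per run.

```python
from typing import Dict, List


def summarize_runs(outcomes: Dict[str, List[bool]]) -> Dict[str, float]:
    """Aggregate per-test run outcomes (True = pass) into summary metrics."""
    total_attempts = sum(len(runs) for runs in outcomes.values())
    passed_attempts = sum(sum(runs) for runs in outcomes.values())
    # Flaky: mixed outcomes, i.e. at least one pass and at least one fail.
    flaky = sum(1 for runs in outcomes.values() if any(runs) and not all(runs))
    # Fully passed: the test has runs and every one of them passed.
    fully_passed = sum(1 for runs in outcomes.values() if runs and all(runs))
    rate = 100.0 * passed_attempts / total_attempts if total_attempts else 0.0
    return {
        "attempt_pass_rate_pct": round(rate, 1),
        "flaky_tests": flaky,
        "fully_passed_tests": fully_passed,
    }


# Hypothetical outcomes for three tests, three runs each.
sample = {
    "test_a": [True, True, True],     # fully passed
    "test_b": [True, False, True],    # flaky
    "test_c": [False, False, False],  # consistently wrong, not flaky
}
summary = summarize_runs(sample)
print(summary)
```

In this hypothetical sample, `test_c` fails every run, so it lowers the attempt pass rate without raising the flaky count, which only tracks mixed outcomes.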
Total Runs
Grok 4.20: 63
Grok 4.20 Beta: 52
Grok 4.3: 63
Cost per result
Shows the average cost per correct benchmark answer in cents (lower is better).
Grok 4.20: 8.309 cents
Grok 4.20 Beta: 4.505 cents
Grok 4.3: 4.724 cents
Total Cost
Grok 4.20: $0.609 (current price)
Grok 4.20 Beta: $0.750 (current price)
Grok 4.3: $0.614 (current price)
Input Price
Grok 4.20: $1.250 / 1M tokens
Grok 4.20 Beta: $5.805 / 1M tokens
Grok 4.3: $1.250 / 1M tokens
Output Price
Grok 4.20: $2.500 / 1M tokens
Grok 4.20 Beta: $5.805 / 1M tokens
Grok 4.3: $2.500 / 1M tokens
Total Input Tokens
Grok 4.20: 44,433
Grok 4.20 Beta: 35,955
Grok 4.3: 44,472
Output Tokens
Grok 4.20: 1,819
Grok 4.20 Beta: 1,647
Grok 4.3: 1,981
Reasoning Tokens
Grok 4.20: 219,524
Grok 4.20 Beta: 91,565
Grok 4.3: 221,382
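The Total Cost figures appear to be reproducible from the token counts and per-million prices listed above, provided reasoning tokens are billed at the output price. That billing rule is an assumption inferred from the numbers, not something the page states, and the `total_cost_usd` helper below is a hypothetical name used only for this sketch.

```python
def total_cost_usd(
    input_tokens: int,
    output_tokens: int,
    reasoning_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Estimate total cost in USD from token counts and per-1M-token prices."""
    # Assumption: reasoning tokens are charged at the output-token rate.
    billed_output = output_tokens + reasoning_tokens
    cost = input_tokens * input_price_per_m + billed_output * output_price_per_m
    return cost / 1_000_000


# Token counts and prices as reported in the comparison.
costs = {
    "Grok 4.20": total_cost_usd(44_433, 1_819, 219_524, 1.250, 2.500),
    "Grok 4.20 Beta": total_cost_usd(35_955, 1_647, 91_565, 5.805, 5.805),
    "Grok 4.3": total_cost_usd(44_472, 1_981, 221_382, 1.250, 2.500),
}
for model, cost in costs.items():
    print(f"{model}: ${cost:.3f}")
```

Under that assumption the estimates round to $0.609, $0.750, and $0.614, matching the reported totals. Reasoning tokens account for most of each bill, since they outnumber visible output tokens by more than 50 to 1 for every model.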
Response Time (avg)
Grok 4.20: 27.68s
Grok 4.20 Beta: 9.75s
Grok 4.3: 47.51s
Response Time (max)
Grok 4.20: 199.66s
Grok 4.20 Beta: 31.36s
Grok 4.3: 216.69s
Response Time (total)
Grok 4.20: 581.26s
Grok 4.20 Beta: 175.48s
Grok 4.3: 997.68s
Generation showcase
Hamster playing table tennis
Prompt: Create a detailed SVG illustration of a hamster playing table tennis.
Grok 4.20: score 8.2; consistency 7.9; attempt pass rate 83.3%; flaky tests 1; failures: wrong answer 1; response time avg 3.95s, max 5.68s, total 15.80s; total input tokens 2,010; output tokens 287; reasoning tokens 8,312.
Grok 4.20 Beta (archived model: this model is no longer updated or tested on new tests): score 8.7; consistency 7.9; attempt pass rate 91.7%; flaky tests 1; failures: wrong answer 1; response time avg 3.16s, max 3.44s, total 12.65s.
Grok 4.3: score 10.0; consistency 10.0; attempt pass rate 100.0%; flaky tests 0; no failed answers; response time avg 8.83s, max 11.20s, total 35.31s.
Grok 4.20: score 6.3; consistency 6.6; attempt pass rate 55.6%; flaky tests 1; failures: wrong answer 2; response time avg 109.93s, max 199.66s, total 329.79s; total input tokens 8,307; output tokens 268; reasoning tokens 103,150.
Grok 4.20 Beta (archived): score 3.3; consistency 3.3; attempt pass rate 33.3%; flaky tests 0; no failed answers; response time avg 31.36s, max 31.36s, total 31.36s.
Grok 4.3: score 5.9; consistency 7.7; attempt pass rate 44.4%; flaky tests 1; failures: extra formatting 1, wrong answer 1; response time avg 41.23s, max 64.81s, total 123.69s.
Grok 4.20: score 10.0; consistency 10.0; attempt pass rate 100.0%; flaky tests 0; no failed answers; response time avg 17.40s, max 17.40s, total 17.40s; total input tokens 12,909; output tokens 232; reasoning tokens 9,556.
Grok 4.20 Beta (archived): score 10.0; consistency 10.0; attempt pass rate 100.0%; flaky tests 0; no failed answers; response time avg 20.93s, max 20.93s, total 20.93s.
Grok 4.3: score 10.0; consistency 10.0; attempt pass rate 100.0%; flaky tests 0; no failed answers; response time avg 63.99s, max 63.99s, total 63.99s.
Grok 4.20: score 10.0; consistency 10.0; attempt pass rate 100.0%; flaky tests 0; no failed answers; response time avg 4.17s, max 5.02s, total 8.34s; total input tokens 7,761; output tokens 180; reasoning tokens 5,333.
Grok 4.20 Beta (archived): score 10.0; consistency 10.0; attempt pass rate 100.0%; flaky tests 0; no failed answers; response time avg 4.01s, max 4.27s, total 8.02s.
Grok 4.3: score 10.0; consistency 10.0; attempt pass rate 100.0%; flaky tests 0; no failed answers; response time avg 18.97s, max 26.99s, total 37.93s.
Grok 4.20: score 5.3; consistency 10.0; attempt pass rate 33.3%; flaky tests 0; failures: extra formatting 1, wrong answer 1; response time avg 27.03s, max 29.87s, total 81.10s; total input tokens 1,764; output tokens 375; reasoning tokens 49,339.
Grok 4.20 Beta (archived): score 5.3; consistency 10.0; attempt pass rate 33.3%; flaky tests 0; failures: wrong answer 2; response time avg 21.33s, max 24.21s, total 64.00s.
Grok 4.3: score 5.3; consistency 7.2; attempt pass rate 44.4%; flaky tests 1; failures: wrong answer 2; response time avg 181.74s, max 216.69s, total 545.21s.
Grok 4.20: score 3.9; consistency 2.6; attempt pass rate 33.3%; flaky tests 1; failures: did not follow instructions 1; response time avg 24.48s, max 24.48s, total 24.48s; total input tokens 825; output tokens 65; reasoning tokens 6,440.
Grok 4.20 Beta (archived): score 10.0; consistency 10.0; attempt pass rate 100.0%; flaky tests 0; no failed answers; response time avg 5.78s, max 5.78s, total 5.78s.
Grok 4.3: score 5.4; consistency 2.5; attempt pass rate 66.7%; flaky tests 1; failures: did not follow instructions 1; response time avg 24.70s, max 24.70s, total 24.70s.
Grok 4.20: score 9.8; consistency 10.0; attempt pass rate 100.0%; flaky tests 0; no failed answers; response time avg 4.26s, max 4.46s, total 8.52s; total input tokens 1,362; output tokens 57; reasoning tokens 6,419.
Grok 4.20 Beta (archived): score 9.8; consistency 10.0; attempt pass rate 100.0%; flaky tests 0; no failed answers; response time avg 4.89s, max 5.89s, total 9.78s.
Grok 4.3: score 9.8; consistency 10.0; attempt pass rate 100.0%; flaky tests 0; no failed answers; response time avg 18.58s, max 31.48s, total 37.15s.
Grok 4.20: score 7.7; consistency 10.0; attempt pass rate 66.7%; flaky tests 0; failures: wrong answer 1; response time avg 6.22s, max 11.63s, total 18.66s; total input tokens 1,689; output tokens 149; reasoning tokens 7,913.
Grok 4.20 Beta (archived): score 10.0; consistency 10.0; attempt pass rate 100.0%; flaky tests 0; no failed answers; response time avg 3.52s, max 4.53s, total 10.57s.
Grok 4.3: score 5.9; consistency 7.2; attempt pass rate 55.6%; flaky tests 1; failures: did not follow instructions 1, wrong answer 1; response time avg 22.52s, max 51.75s, total 67.57s.
Grok 4.20: score 3.0; consistency 10.0; attempt pass rate 0.0%; flaky tests 0; failures: did not follow instructions 1; response time avg 13.68s, max 13.68s, total 13.68s; total input tokens 7,275; output tokens 197; reasoning tokens 6,620.
Grok 4.20 Beta (archived): score 3.0; consistency 10.0; attempt pass rate 0.0%; flaky tests 0; failures: did not follow instructions 1; response time avg 12.39s, max 12.39s, total 12.39s.
Grok 4.3: score 10.0; consistency 10.0; attempt pass rate 100.0%; flaky tests 0; no failed answers; response time avg 17.66s, max 17.66s, total 17.66s.
Grok 4.20: score 3.0; consistency 10.0; attempt pass rate 0.0%; flaky tests 0; failures: wrong answer 1; response time avg 63.48s, max 63.48s, total 63.48s; total input tokens 531; output tokens 9; reasoning tokens 16,442.
Grok 4.20 Beta (archived): score 0.0; consistency 0.0; attempt pass rate 0.0%; flaky tests 0; no failed answers; response time avg 0ms, max 0ms, total 0ms.
Grok 4.3: score 3.0; consistency 10.0; attempt pass rate 0.0%; flaky tests 0; failures: wrong answer 1; response time avg 44.47s, max 44.47s, total 44.47s.