Inception: Mercury 2
Score: 3.4
Rank: #50
Tests Correct (failures): Wrong answer: 11
Consistency: 8.9
Cost per result: 0.147 cents
Total Cost: $0.006
Attempt pass rate: 33.3%
Flaky tests: 2
Total runs: 45 (15 x 3)
Output Tokens: 1,144
Reasoning Tokens: 0
Response Time (avg): 594ms
Response Time (max): 1.27s
Response Time (total): 8.91s

xAI: Grok 4.1 Fast
Score: 6.4
Rank: #28
Tests Correct (failures): Did not follow instructions: 2, Wrong answer: 2, No answer: 1, Timed out: 1
Consistency: 7.8
Cost per result: 0.541 cents
Total Cost: $0.049
Attempt pass rate: 71.1%
Flaky tests: 4
Total runs: 45 (15 x 3)
Output Tokens: 1,056
Reasoning Tokens: 80,419
Response Time (avg): 27.61s
Response Time (max): 121.79s
Response Time (total): 220.87s

Metric notes
Score: summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.
Tests Correct: a test is fully passed only if every run passed for that test.
Consistency: reflects run-to-run stability (10 = very consistent, even if consistently wrong).
Cost per result: the average cost per correct benchmark answer in cents (lower is better).
Attempt pass rate: passed attempts / total attempts across runs.
Flaky tests: tests with mixed outcomes across runs (at least one pass and one fail).
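The run-based metrics defined above (attempt pass rate, fully passed tests, flaky tests, cost per result) can be illustrated with a minimal Python sketch. The function names `summarize_runs` and `cost_per_correct_cents`, the input layout, and the sample data are hypothetical and are not taken from the benchmark's own code. In particular, dividing total cost by the number of fully passed tests is an assumption about how "cost per correct benchmark answer" is counted.

```python
from typing import Dict, List


def summarize_runs(outcomes: Dict[str, List[bool]]) -> Dict[str, float]:
    """Summarize per-test run outcomes (hypothetical input layout).

    `outcomes` maps a test name to one pass/fail flag per run.
    """
    total_attempts = sum(len(runs) for runs in outcomes.values())
    passed_attempts = sum(sum(runs) for runs in outcomes.values())
    # A test is fully passed only if every run passed for that test.
    fully_passed = sum(1 for runs in outcomes.values() if runs and all(runs))
    # A flaky test has at least one pass and at least one fail.
    flaky = sum(1 for runs in outcomes.values() if any(runs) and not all(runs))
    rate = passed_attempts / total_attempts if total_attempts else 0.0
    return {
        "attempt_pass_rate": rate,
        "tests_fully_passed": fully_passed,
        "flaky_tests": flaky,
    }


def cost_per_correct_cents(total_cost_usd: float, correct: int) -> float:
    """Average cost in cents per correct answer (assumed counting rule)."""
    if correct == 0:
        return float("inf")  # no correct answers, so the ratio is unbounded
    return total_cost_usd * 100 / correct


# Made-up sample: three tests, three runs each.
sample = {
    "t1": [True, True, True],     # fully passed
    "t2": [True, False, True],    # flaky
    "t3": [False, False, False],  # consistently failed
}
summary = summarize_runs(sample)
```

With this sample, 5 of 9 attempts pass, one test is fully passed, and one is flaky. A test that fails every run is not flaky, which is why a model can be very consistent while still being consistently wrong.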
[Charts: Top Models by Score; Score vs Total Cost; Response Time (avg); Avg Score vs Response Time (avg)]
Category Breakdown
Anti-AI Tricks
Inception: Mercury 2
Score: 10.0
Consistency: 10.0
Attempt pass rate: 0.0%
Flaky tests: 0
Tests Correct (failures): Wrong answer: 3
Response Time (avg): 466ms
Response Time (max): 716ms
Response Time (total): 1.40s
Output Tokens: 274
Reasoning Tokens: 0

xAI: Grok 4.1 Fast
Score: 10.0
Consistency: 10.0
Attempt pass rate: 100.0%
Flaky tests: 0
Tests Correct (failures): No failed answers
Response Time (avg): 5.65s
Response Time (max): 5.65s
Response Time (total): 5.65s
Output Tokens: 102
Reasoning Tokens: 4,021
Combined
Inception: Mercury 2
Score: 10.0
Consistency: 10.0
Attempt pass rate: 0.0%
Flaky tests: 0
Tests Correct (failures): Wrong answer: 1
Response Time (avg): 606ms
Response Time (max): 606ms
Response Time (total): 606ms
Output Tokens: 131
Reasoning Tokens: 0

xAI: Grok 4.1 Fast
Score: 10.0
Consistency: 10.0
Attempt pass rate: 100.0%
Flaky tests: 0
Tests Correct (failures): No failed answers
Response Time (avg): 37.64s
Response Time (max): 37.64s
Response Time (total): 37.64s
Output Tokens: 261
Reasoning Tokens: 12,272
Data parsing and extraction
Inception: Mercury 2
Score: 5.5
Consistency: 5.9
Attempt pass rate: 83.3%
Flaky tests: 1
Tests Correct (failures): Wrong answer: 1
Response Time (avg): 667ms
Response Time (max): 819ms
Response Time (total): 1.33s
Output Tokens: 180
Reasoning Tokens: 0

xAI: Grok 4.1 Fast
Score: 9.9
Consistency: 10.0
Attempt pass rate: 100.0%
Flaky tests: 0
Tests Correct (failures): No failed answers
Response Time (avg): 6.63s
Response Time (max): 6.63s
Response Time (total): 6.63s
Output Tokens: 180
Reasoning Tokens: 5,409
Domain specific
Inception: Mercury 2
Score: 4.0
Consistency: 7.2
Attempt pass rate: 44.4%
Flaky tests: 1
Tests Correct (failures): Wrong answer: 2
Response Time (avg): 534ms
Response Time (max): 733ms
Response Time (total): 1.60s
Output Tokens: 46
Reasoning Tokens: 0

xAI: Grok 4.1 Fast
Score: 4.0
Consistency: 4.4
Attempt pass rate: 66.7%
Flaky tests: 2
Tests Correct (failures): Timed out: 1, Wrong answer: 1
Response Time (avg): 121.79s
Response Time (max): 121.79s
Response Time (total): 121.79s
Output Tokens: 11
Reasoning Tokens: 37,657
Instructions following
Inception: Mercury 2
Score: 5.5
Consistency: 10.0
Attempt pass rate: 50.0%
Flaky tests: 0
Tests Correct (failures): Wrong answer: 1
Response Time (avg): 551ms
Response Time (max): 622ms
Response Time (total): 1.10s
Output Tokens: 82
Reasoning Tokens: 0

xAI: Grok 4.1 Fast
Score: 5.5
Consistency: 10.0
Attempt pass rate: 50.0%
Flaky tests: 0
Tests Correct (failures): Did not follow instructions: 1
Response Time (avg): 5.30s
Response Time (max): 5.30s
Response Time (total): 5.30s
Output Tokens: 55
Reasoning Tokens: 3,489
Puzzle Solving
Inception: Mercury 2
Score: 10.0
Consistency: 10.0
Attempt pass rate: 0.0%
Flaky tests: 0
Tests Correct (failures): Wrong answer: 3
Response Time (avg): 533ms
Response Time (max): 637ms
Response Time (total): 1.60s
Output Tokens: 234
Reasoning Tokens: 0

xAI: Grok 4.1 Fast
Score: 4.0
Consistency: 7.2
Attempt pass rate: 44.4%
Flaky tests: 1
Tests Correct (failures): Did not follow instructions: 1, Wrong answer: 1
Response Time (avg): 8.08s
Response Time (max): 8.38s
Response Time (total): 16.17s
Output Tokens: 187
Reasoning Tokens: 6,086
Tool Calling
Inception: Mercury 2
Score: 10.0
Consistency: 10.0
Attempt pass rate: 100.0%
Flaky tests: 0
Tests Correct (failures): No failed answers
Response Time (avg): 1.27s
Response Time (max): 1.27s
Response Time (total): 1.27s
Output Tokens: 197
Reasoning Tokens: 0

xAI: Grok 4.1 Fast
Score: 10.0
Consistency: 1.6
Attempt pass rate: 33.3%
Flaky tests: 1
Tests Correct (failures): No answer: 1
Response Time (avg): 27.71s
Response Time (max): 27.71s
Response Time (total): 27.71s