Score summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.
7.38
7.12
7.92
7.87
Consistency
Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).
10.00
10.00
9.99
9.44
Cost per result
Average cost per correct benchmark answer, in cents (lower is better).
0.162
0.403
17.455
0.624
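The cost-per-result definition above can be sketched as a small calculation. This is only an illustration: the page does not say whether "correct answers" counts passed attempts or fully passed tests, so the function name, its parameters, and the sample figures below are assumptions rather than the benchmark's actual method.

```python
def cost_per_correct_cents(total_cost_usd: float, correct_answers: int) -> float:
    """Average cost per correct answer, in cents (lower is better).

    Hypothetical helper: the denominator used by the benchmark is not documented.
    """
    if correct_answers <= 0:
        # With no correct answers the cost per result is unbounded.
        return float("inf")
    return total_cost_usd * 100.0 / correct_answers


# Made-up figures: $0.25 spent in total, 10 correct answers.
print(cost_per_correct_cents(0.25, 10))  # 2.5
```

Returning infinity for zero correct answers keeps a model that never succeeds at the bottom of any "lower is better" ranking instead of raising a division error.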
Total Cost
$0.017
$0.037
$1.920
$0.069
Tests Correct
A test is fully passed only if every run passed for that test.
Wrong answer: 4; Did not follow instructions: 1; Response Time (avg) 2.89s, (max) 9.54s, (total) 43.35s
Wrong answer: 4; Did not follow instructions: 2; Response Time (avg) 3.74s, (max) 12.98s, (total) 56.15s
Wrong answer: 3; Did not follow instructions: 1; Response Time (avg) 69.85s, (max) 232.25s, (total) 1047.79s
Wrong answer: 3; Did not follow instructions: 1; Response Time (avg) 6.32s, (max) 14.72s, (total) 94.86s
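The "fully passed" rule can be illustrated with a short sketch. The data layout (a mapping from test name to a list of per-run pass/fail outcomes) and the test names are assumptions made for this example; the page does not describe how results are stored.

```python
from typing import Dict, List


def fully_passed_tests(runs: Dict[str, List[bool]]) -> List[str]:
    """Return the names of tests for which every recorded run passed."""
    # A test with no recorded runs is not counted as passed.
    return [name for name, outcomes in runs.items() if outcomes and all(outcomes)]


# Hypothetical outcomes: three tests, three runs each.
example_runs = {
    "test_a": [True, True, True],
    "test_b": [True, False, True],
    "test_c": [False, False, False],
}
print(fully_passed_tests(example_runs))  # ['test_a']
```

Under this rule a single failed run is enough to disqualify a test, which is why the attempt-level pass rate can be higher than the share of fully passed tests.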
Attempt pass rate
Attempt pass rate = passed attempts / total attempts across runs.
66.7%
60.0%
73.3%
75.6%
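A minimal sketch of the attempt pass rate formula, again assuming a hypothetical mapping from test name to per-run outcomes (the page does not publish the underlying counts):

```python
from typing import Dict, List


def attempt_pass_rate(runs: Dict[str, List[bool]]) -> float:
    """Passed attempts divided by total attempts across all runs, as a percentage."""
    total = sum(len(outcomes) for outcomes in runs.values())
    if total == 0:
        return 0.0  # no attempts recorded
    passed = sum(sum(outcomes) for outcomes in runs.values())
    return 100.0 * passed / total


# Hypothetical outcomes: 5 passed attempts out of 9.
example_runs = {
    "test_a": [True, True, True],
    "test_b": [True, False, True],
    "test_c": [False, False, False],
}
print(round(attempt_pass_rate(example_runs), 1))  # 55.6
```

As a plausibility check only, a reported rate of 75.6% would be consistent with 34 passed attempts out of 45, although the attempt counts themselves are not shown.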
Flaky tests
Flaky tests had mixed outcomes across runs (at least one pass and one fail).
0
0
0
1
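The flaky-test definition translates directly into a check for mixed outcomes. The sketch below assumes the same hypothetical per-run outcome mapping as a stand-in for the benchmark's real data:

```python
from typing import Dict, List


def flaky_tests(runs: Dict[str, List[bool]]) -> List[str]:
    """Return tests with at least one passing run and at least one failing run."""
    return [
        name
        for name, outcomes in runs.items()
        if any(outcomes) and not all(outcomes)
    ]


# Hypothetical outcomes: only test_b mixes passes and failures.
example_runs = {
    "test_a": [True, True, True],
    "test_b": [True, False, True],
    "test_c": [False, False, False],
}
print(flaky_tests(example_runs))  # ['test_b']
```

A test that fails on every run is not flaky under this definition, which matches the note that a model can be very consistent while being consistently wrong.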
Output Tokens
1,392
1,417
943
1,274
Reasoning Tokens
6,379
19,435
1,275,768
18,372
Top Models by Score
Score vs Total Cost
Category Breakdown
Anti-AI Tricks
| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failures) | Response Time (avg / max / total) | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|---|---|---|
| Google: Gemini 3.1 Flash Lite Preview | 7.00 | 10.00 | 66.7% | 0 | Wrong answer: 1 | 2.18s / 3.18s / 6.53s | 456 | 1,224 |
| Google: Gemini 3.1 Flash Lite Preview | 9.00 | 9.99 | 66.7% | 0 | Did not follow instructions: 1 | 2.53s / 3.89s / 7.58s | 564 | 3,780 |
| Google: Gemini 3.1 Flash Lite Preview | 10.00 | 10.00 | 100.0% | 0 | No failed answers | 43.87s / 121.88s / 131.62s | 144 | 193,077 |
| Google: Gemini 3 Flash Preview | 10.00 | 10.00 | 100.0% | 0 | No failed answers | 3.50s / 4.31s / 10.49s | 275 | 2,476 |
Combined
| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failures) | Response Time (avg / max / total) | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|---|---|---|
| Google: Gemini 3.1 Flash Lite Preview | 1.00 | 10.00 | 0.0% | 0 | Wrong answer: 1 | 2.96s / 2.96s / 2.96s | 75 | 253 |
| Google: Gemini 3.1 Flash Lite Preview | 10.00 | 10.00 | 100.0% | 0 | No failed answers | 12.98s / 12.98s / 12.98s | 109 | 2,449 |
| Google: Gemini 3.1 Flash Lite Preview | 10.00 | 10.00 | 100.0% | 0 | No failed answers | 232.25s / 232.25s / 232.25s | 112 | 126,813 |
| Google: Gemini 3 Flash Preview | 1.00 | 10.00 | 0.0% | 0 | Wrong answer: 1 | 2.96s / 2.96s / 2.96s | 104 | 0 |
Data parsing and extraction
| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failures) | Response Time (avg / max / total) | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|---|---|---|
| Google: Gemini 3.1 Flash Lite Preview | 9.88 | 10.00 | 100.0% | 0 | No failed answers | 3.00s / 3.74s / 5.99s | 291 | 696 |
| Google: Gemini 3.1 Flash Lite Preview | 9.88 | 10.00 | 100.0% | 0 | No failed answers | 2.29s / 2.31s / 4.59s | 279 | 2,952 |
| Google: Gemini 3.1 Flash Lite Preview | 9.88 | 10.00 | 100.0% | 0 | No failed answers | 7.16s / 8.54s / 14.31s | 279 | 6,186 |
| Google: Gemini 3 Flash Preview | 10.00 | 10.00 | 100.0% | 0 | No failed answers | 9.46s / 14.72s / 18.92s | 305 | 3,004 |
Domain specific
| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failures) | Response Time (avg / max / total) | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|---|---|---|
| Google: Gemini 3.1 Flash Lite Preview | 4.00 | 10.00 | 33.3% | 0 | Wrong answer: 2 | 2.36s / 3.51s / 7.07s | 18 | 1,212 |
| Google: Gemini 3.1 Flash Lite Preview | 1.00 | 10.00 | 0.0% | 0 | Wrong answer: 3 | 4.21s / 5.86s / 12.62s | 18 | 5,325 |
| Google: Gemini 3.1 Flash Lite Preview | 4.00 | 10.00 | 33.3% | 0 | Wrong answer: 2 | 127.58s / 133.93s / 382.74s | 18 | 566,202 |
| Google: Gemini 3 Flash Preview | 4.00 | 7.21 | 44.4% | 1 | Wrong answer: 2 | 8.05s / 14.40s / 24.15s | 12 | 6,410 |
Instructions following
| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failures) | Response Time (avg / max / total) | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|---|---|---|
| Google: Gemini 3.1 Flash Lite Preview | 8.50 | 10.00 | 50.0% | 0 | Did not follow instructions: 1 | 1.49s / 1.66s / 2.99s | 72 | 753 |
| Google: Gemini 3.1 Flash Lite Preview | 8.00 | 9.99 | 50.0% | 0 | Did not follow instructions: 1 | 1.91s / 1.93s / 3.82s | 72 | 2,121 |
| Google: Gemini 3.1 Flash Lite Preview | 8.00 | 9.96 | 50.0% | 0 | Did not follow instructions: 1 | 70.07s / 136.53s / 140.14s | 69 | 190,053 |
| Google: Gemini 3 Flash Preview | 7.50 | 9.99 | 50.0% | 0 | Did not follow instructions: 1 | 7.02s / 7.35s / 14.03s | 71 | 2,752 |
Puzzle Solving
| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failures) | Response Time (avg / max / total) | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|---|---|---|
| Google: Gemini 3.1 Flash Lite Preview | 10.00 | 10.00 | 100.0% | 0 | No failed answers | 2.76s / 5.08s / 8.27s | 243 | 1,248 |
| Google: Gemini 3.1 Flash Lite Preview | 7.00 | 10.00 | 66.7% | 0 | Wrong answer: 1 | 3.58s / 4.41s / 10.75s | 141 | 1,896 |
| Google: Gemini 3.1 Flash Lite Preview | 7.00 | 10.00 | 66.7% | 0 | Wrong answer: 1 | 46.33s / 134.22s / 139.00s | 87 | 190,953 |
| Google: Gemini 3 Flash Preview | 10.00 | 10.00 | 100.0% | 0 | No failed answers | 6.44s / 10.27s / 19.32s | 273 | 3,315 |
Tool Calling
| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failures) | Response Time (avg / max / total) | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|---|---|---|
| Google: Gemini 3.1 Flash Lite Preview | 10.00 | 10.00 | 100.0% | 0 | No failed answers | 9.54s / 9.54s / 9.54s | 237 | 993 |
| Google: Gemini 3.1 Flash Lite Preview | 10.00 | 10.00 | 100.0% | 0 | No failed answers | 3.80s / 3.80s / 3.80s | 234 | 912 |
| Google: Gemini 3.1 Flash Lite Preview | 10.00 | 10.00 | 100.0% | 0 | No failed answers | 7.73s / 7.73s / 7.73s | 234 | 2,484 |
Google: Gemini 3 Flash Preview
Score: 10.00
Consistency: 10.00
Attempt pass rate: 100.0%
Flaky tests: 0
Tests Correct: No failed answers. Response Time (avg) 4.99s, (max) 4.99s, (total) 4.99s