7.6
8.1
Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.
Rank
#49
#23
Reliability
10.0
10.0
First-attempt success score: 10.0 means no retryable target API or rate-limit failures before successful calls; tracked failures lower the score.
Consistency
8.1
8.8
Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).
Tests Correct
Failed answers: Did not follow instructions: 3, Wrong answer: 3, No answer: 1
Failed answers: Wrong answer: 4, Timed out: 2
A test is fully passed only if every run passed for that test.
Attempt pass rate
74.1%
76.7%
Attempt pass rate = passed attempts / total attempts across runs.
Flaky tests
4
3
Flaky tests had mixed outcomes across runs (at least one pass and one fail).
Total Runs
54
60
Cost per result
18.579
2.251
Shows the average cost per correct benchmark answer in cents (lower is better).
Total Cost
$2.044 (current price)
$0.316 (current price)
Input Price
$0.250 / 1M
$0.260 / 1M
Output Price
$1.500 / 1M
$1.560 / 1M
Output Tokens
1,984
2,145
Reasoning Tokens
1,355,583
172,563
Response Time (avg)
61.96s
67.58s
Response Time (max)
149.23s
266.69s
Response Time (total)
1115.31s
878.57s
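The three run-outcome definitions in the summary (attempt pass rate, fully passed tests, flaky tests) can be sketched in a few lines of Python. This is only an illustration of the stated definitions, not the benchmark's actual code: the `RunOutcomes` mapping, the function names, and the sample data are all hypothetical.

```python
from typing import Dict, List

# Hypothetical structure: test name -> outcome of each run (True = passed).
RunOutcomes = Dict[str, List[bool]]


def attempt_pass_rate(outcomes: RunOutcomes) -> float:
    """Passed attempts / total attempts across runs, as a fraction."""
    attempts = [ok for runs in outcomes.values() for ok in runs]
    return sum(attempts) / len(attempts)


def fully_passed(outcomes: RunOutcomes) -> List[str]:
    """A test is fully passed only if every run passed for that test."""
    return [name for name, runs in outcomes.items() if all(runs)]


def flaky(outcomes: RunOutcomes) -> List[str]:
    """Flaky tests have at least one passing and one failing run."""
    return [name for name, runs in outcomes.items()
            if any(runs) and not all(runs)]


# Made-up sample data, three runs per test.
example: RunOutcomes = {
    "test_a": [True, True, True],     # fully passed
    "test_b": [True, False, True],    # flaky
    "test_c": [False, False, False],  # consistently failing, not flaky
}

print(f"{attempt_pass_rate(example):.1%}")  # 5 of 9 attempts passed
print(fully_passed(example))
print(flaky(example))
```

Under these definitions, the reported 74.1% over 54 total runs is consistent with 40 passed attempts (40/54 ≈ 74.1%), and 76.7% over 60 runs with 46 passed attempts (46/60 ≈ 76.7%).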
[Charts: Top Models by Score; Score vs Total Cost; Response Time (avg); Score vs Response Time (avg); Total Output Tokens; Score vs Total Output Tokens]
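The cost figures in the summary can be cross-checked with simple arithmetic. The sketch below inverts the stated "cost per correct benchmark answer" definition to see how many correct answers the totals imply, and estimates the output-side share of the total cost. The page does not say how the total is composed, so two things here are assumptions: that reasoning tokens are billed at the output-token price, and that the remainder comes from input tokens (which are not listed). All function names are hypothetical.

```python
def cost_per_result_cents(total_cost_usd: float, correct_results: int) -> float:
    """Average cost per correct answer, in cents (lower is better)."""
    if correct_results <= 0:
        raise ValueError("need at least one correct result")
    return total_cost_usd * 100 / correct_results


def implied_correct_results(total_cost_usd: float, cents_per_result: float) -> float:
    """Invert the formula: how many correct answers do the figures imply?"""
    return total_cost_usd * 100 / cents_per_result


def output_side_cost_usd(output_tokens: int, reasoning_tokens: int,
                         price_per_million_usd: float) -> float:
    """ASSUMPTION: reasoning tokens are billed at the output-token rate."""
    return (output_tokens + reasoning_tokens) * price_per_million_usd / 1_000_000


# Figures copied from the summary (first model, then second model).
print(round(implied_correct_results(2.044, 18.579), 2))
print(round(implied_correct_results(0.316, 2.251), 2))
print(round(output_side_cost_usd(1_984, 1_355_583, 1.500), 3))
print(round(output_side_cost_usd(2_145, 172_563, 1.560), 3))
```

With the listed figures this implies roughly 11 and 14 correct answers, and an output-side cost of about $2.04 and $0.27 against reported totals of $2.044 and $0.316. If the billing assumption holds, the gap in total cost comes almost entirely from the difference in reasoning tokens rather than from the per-token prices, which are nearly identical.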
Category Breakdown
Anti-AI Tricks
Score
Consistency
Attempt pass rate
Flaky tests
Tests Correct
Response Time (avg)
Output Tokens
Reasoning Tokens
Gemini 3.1 Flash Lite (archived model: this model is no longer updated or tested on new tests)
10.0
10.0
100.0%
0
No failed answers. Response Time: 37.16s (avg), 140.53s (max), 148.65s (total)
8.2
7.9
83.3%
1
Wrong answer: 1. Response Time: 45.78s (avg), 81.20s (max), 91.57s (total)
45.78s
205
21,236
Coding
Score
Consistency
Attempt pass rate
Flaky tests
Tests Correct
Response Time (avg)
Output Tokens
Reasoning Tokens
Gemini 3.1 Flash Lite
10.0
10.0
100.0%
0
No failed answers. Response Time: 137.63s (avg), 137.63s (max), 137.63s (total)
7.6
6.7
66.7%
1
Wrong answer: 1. Response Time: 193.80s (avg), 266.69s (max), 387.60s (total)
193.80s
406
63,554
Combined
Score
Consistency
Attempt pass rate
Flaky tests
Tests Correct
Response Time (avg)
Output Tokens
Reasoning Tokens
Gemini 3.1 Flash Lite
10.0
10.0
100.0%
0
No failed answers. Response Time: 149.23s (avg), 149.23s (max), 149.23s (total)
10.0
10.0
100.0%
0
No failed answers. Response Time: 46.85s (avg), 46.85s (max), 46.85s (total)
46.85s
421
7,906
Data parsing and extraction
Score
Consistency
Attempt pass rate
Flaky tests
Tests Correct
Response Time (avg)
Output Tokens
Reasoning Tokens
Gemini 3.1 Flash Lite
10.0
10.0
100.0%
0
No failed answers. Response Time: 4.49s (avg), 4.96s (max), 8.98s (total)
10.0
10.0
100.0%
0
No failed answers. Response Time: 46.91s (avg), 46.91s (max), 46.91s (total)
46.91s
270
14,916
Domain specific
Score
Consistency
Attempt pass rate
Flaky tests
Tests Correct
Response Time (avg)
Output Tokens
Reasoning Tokens
Gemini 3.1 Flash Lite
3.6
7.2
22.2%
1
Wrong answer: 3. Response Time: 139.90s (avg), 141.40s (max), 419.69s (total)
5.3
10.0
33.3%
0
Timed out: 1, Wrong answer: 1. Response Time: 17.50s (avg), 17.50s (max), 17.50s (total)
17.50s
35
16,680
General Intelligence
Score
Consistency
Attempt pass rate
Flaky tests
Tests Correct
Response Time (avg)
Output Tokens
Reasoning Tokens
Gemini 3.1 Flash Lite
5.0
2.1
66.7%
1
Did not follow instructions: 1. Response Time: 45.69s (avg), 45.69s (max), 45.69s (total)
4.7
1.6
66.7%
1
Timed out: 1. Response Time: 79.86s (avg), 79.86s (max), 79.86s (total)
79.86s
73
8,675
Instructions following
Score
Consistency
Attempt pass rate
Flaky tests
Tests Correct
Response Time (avg)
Output Tokens
Reasoning Tokens
Gemini 3.1 Flash Lite
7.3
5.8
83.3%
1
No answer: 1. Response Time: 23.26s (avg), 43.87s (max), 46.51s (total)
10.0
10.0
100.0%
0
No failed answers. Response Time: 31.93s (avg), 31.93s (max), 31.93s (total)
31.93s
101
7,704
Puzzle Solving
Score
Consistency
Attempt pass rate
Flaky tests
Tests Correct
Response Time (avg)
Output Tokens
Reasoning Tokens
Gemini 3.1 Flash Lite
5.7
6.8
44.4%
1
Did not follow instructions: 2. Response Time: 50.83s (avg), 144.85s (max), 152.49s (total)
10.0
10.0
100.0%
0
No failed answers. Response Time: 32.50s (avg), 49.12s (max), 65.01s (total)
32.50s
301
13,853
Tool Calling
Score
Consistency
Attempt pass rate
Flaky tests
Tests Correct
Response Time (avg)
Output Tokens
Reasoning Tokens
Gemini 3.1 Flash Lite
10.0
10.0
100.0%
0
No failed answers. Response Time: 6.44s (avg), 6.44s (max), 6.44s (total)
10.0
10.0
100.0%
0
No failed answers. Response Time: 7.54s (avg), 7.54s (max), 7.54s (total)
7.54s
309
909
Trivia
Score
Consistency
Attempt pass rate
Flaky tests
Tests Correct
Response Time (avg)
Output Tokens
Reasoning Tokens
Gemini 3.1 Flash Lite
3.0
10.0
0.0%
0
Wrong answer: 1. Response Time: 103.81s (avg), 103.81s (max), 103.81s (total)