Score
Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.
  Gemini 3 Flash Preview: 7.0
  GPT-5.4: 8.2
Rank
  Gemini 3 Flash Preview: #22
  GPT-5.4: #7
Tests Correct
A test is fully passed only if every run passed for that test.
  Gemini 3 Flash Preview: Wrong answer: 5
  GPT-5.4: Wrong answer: 2; Did not follow instructions: 1
Consistency
Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).
  Gemini 3 Flash Preview: 8.9
  GPT-5.4: 8.9
Cost per result
Shows the average cost per correct benchmark answer in cents (lower is better).
  Gemini 3 Flash Preview: 0.180 cents
  GPT-5.4: 6.533 cents
Total Cost
  Gemini 3 Flash Preview: $0.018
  GPT-5.4: $0.784
Attempt pass rate
Attempt pass rate = passed attempts / total attempts across runs.
  Gemini 3 Flash Preview: 73.3%
  GPT-5.4: 86.7%
Flaky tests
Flaky tests had mixed outcomes across runs (at least one pass and one fail).
  Gemini 3 Flash Preview: 2
  GPT-5.4: 2
Total runs
  Gemini 3 Flash Preview: 45 (15 x 3)
  GPT-5.4: 45 (15 x 3)
Output Tokens
  Gemini 3 Flash Preview: 1,307
  GPT-5.4: 1,611
Reasoning Tokens
  Gemini 3 Flash Preview: 0
  GPT-5.4: 46,321
Response Time (avg)
  Gemini 3 Flash Preview: 1.82s
  GPT-5.4: 21.06s
Response Time (max)
  Gemini 3 Flash Preview: 3.56s
  GPT-5.4: 100.41s
Response Time (total)
  Gemini 3 Flash Preview: 14.58s
  GPT-5.4: 315.95s
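The metric definitions above (fully passed tests, flaky tests, attempt pass rate, cost per result) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-run outcomes below are hypothetical data constructed to reproduce the first column's headline numbers, the names `summarize` and `cost_per_correct_cents` are made up for this example, and the counts of correct tests (10 and 12) are inferred as 15 tests minus the failures listed under Tests Correct. The page does not state how it computes these values internally; the division by fully passed tests is an assumption that happens to match the reported figures.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    total_attempts: int
    attempt_pass_rate: float  # passed attempts / total attempts, in percent
    fully_passed: int         # tests where every run passed
    flaky: int                # tests with at least one pass and one fail


def summarize(runs_by_test: dict[str, list[bool]]) -> Summary:
    """Aggregate per-run outcomes (True = pass) using the definitions above."""
    attempts = [ok for runs in runs_by_test.values() for ok in runs]
    if not attempts:
        raise ValueError("no runs recorded")
    fully_passed = sum(1 for runs in runs_by_test.values() if all(runs))
    flaky = sum(
        1 for runs in runs_by_test.values() if any(runs) and not all(runs)
    )
    return Summary(
        total_attempts=len(attempts),
        attempt_pass_rate=100.0 * sum(attempts) / len(attempts),
        fully_passed=fully_passed,
        flaky=flaky,
    )


def cost_per_correct_cents(total_cost_usd: float, correct: int) -> float:
    """Average cost per correct answer in cents (assumed: total cost / correct tests)."""
    if correct == 0:
        return float("inf")
    return total_cost_usd * 100.0 / correct


# Hypothetical outcomes: 15 tests x 3 runs each (not the real per-run data).
runs = {f"pass_{i}": [True, True, True] for i in range(10)}
runs["flaky_a"] = [True, True, False]
runs["flaky_b"] = [True, False, False]
runs.update({f"fail_{i}": [False, False, False] for i in range(3)})

summary = summarize(runs)
print(summary.total_attempts)                      # 45
print(round(summary.attempt_pass_rate, 1))         # 73.3
print(summary.fully_passed, summary.flaky)         # 10 2
print(round(cost_per_correct_cents(0.018, 10), 3)) # 0.18
print(round(cost_per_correct_cents(0.784, 12), 3)) # 6.533
```

Note that a flaky test counts as not fully passed, so a model can have a relatively high attempt pass rate while still losing the test under the "every run passed" rule.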
[Charts: Top Models by Score; Score vs Total Cost; Response Time (avg); Avg Score vs Response Time (avg)]
Category Breakdown
Anti-AI Tricks
Google: Gemini 3 Flash Preview
  Score: 7.0
  Consistency: 10.0
  Attempt pass rate: 66.7%
  Flaky tests: 0
  Tests Correct: Wrong answer: 1
  Response Time (avg): 1.59s
  Response Time (max): 1.59s
  Response Time (total): 1.59s
  Output Tokens: 208
  Reasoning Tokens: 0
OpenAI: GPT-5.4
  Score: 10.0
  Consistency: 10.0
  Attempt pass rate: 100.0%
  Flaky tests: 0
  Tests Correct: No failed answers.
  Response Time (avg): 5.02s
  Response Time (max): 6.42s
  Response Time (total): 15.06s
  Output Tokens: 216
  Reasoning Tokens: 1,466
Combined
Google: Gemini 3 Flash Preview
  Score: 10.0
  Consistency: 1.6
  Attempt pass rate: 66.7%
  Flaky tests: 1
  Tests Correct: Wrong answer: 1
  Response Time (avg): 3.56s
  Response Time (max): 3.56s
  Response Time (total): 3.56s
  Output Tokens: 350
  Reasoning Tokens: 0
OpenAI: GPT-5.4
  Score: 10.0
  Consistency: 10.0
  Attempt pass rate: 100.0%
  Flaky tests: 0
  Tests Correct: No failed answers.
  Response Time (avg): 20.57s
  Response Time (max): 20.57s
  Response Time (total): 20.57s
  Output Tokens: 301
  Reasoning Tokens: 3,543
Data parsing and extraction
Google: Gemini 3 Flash Preview
  Score: 9.9
  Consistency: 10.0
  Attempt pass rate: 100.0%
  Flaky tests: 0
  Tests Correct: No failed answers.
  Response Time (avg): 1.41s
  Response Time (max): 1.41s
  Response Time (total): 1.41s
  Output Tokens: 279
  Reasoning Tokens: 0
OpenAI: GPT-5.4
  Score: 9.9
  Consistency: 10.0
  Attempt pass rate: 100.0%
  Flaky tests: 0
  Tests Correct: No failed answers.
  Response Time (avg): 5.32s
  Response Time (max): 5.40s
  Response Time (total): 10.64s
  Output Tokens: 234
  Reasoning Tokens: 804
Domain specific
Google: Gemini 3 Flash Preview
  Score: 7.0
  Consistency: 10.0
  Attempt pass rate: 66.7%
  Flaky tests: 0
  Tests Correct: Wrong answer: 1
  Response Time (avg): 963ms
  Response Time (max): 963ms
  Response Time (total): 963ms
  Output Tokens: 18
  Reasoning Tokens: 0
OpenAI: GPT-5.4
  Score: 4.0
  Consistency: 7.2
  Attempt pass rate: 44.4%
  Flaky tests: 1
  Tests Correct: Wrong answer: 2
  Response Time (avg): 74.27s
  Response Time (max): 100.41s
  Response Time (total): 222.80s
  Output Tokens: 61
  Reasoning Tokens: 34,748
Instructions following
Google: Gemini 3 Flash Preview
  Score: 5.5
  Consistency: 5.8
  Attempt pass rate: 66.7%
  Flaky tests: 1
  Tests Correct: Wrong answer: 1
  Response Time (avg): 1.58s
  Response Time (max): 1.58s
  Response Time (total): 1.58s
  Output Tokens: 74
  Reasoning Tokens: 0
OpenAI: GPT-5.4
  Score: 10.0
  Consistency: 10.0
  Attempt pass rate: 100.0%
  Flaky tests: 0
  Tests Correct: No failed answers.
  Response Time (avg): 3.11s
  Response Time (max): 3.68s
  Response Time (total): 6.22s
  Output Tokens: 93
  Reasoning Tokens: 897
Puzzle Solving
Google: Gemini 3 Flash Preview
  Score: 7.0
  Consistency: 10.0
  Attempt pass rate: 66.7%
  Flaky tests: 0
  Tests Correct: Wrong answer: 1
  Response Time (avg): 1.06s
  Response Time (max): 1.06s
  Response Time (total): 2.12s
  Output Tokens: 144
  Reasoning Tokens: 0
OpenAI: GPT-5.4
  Score: 7.0
  Consistency: 7.2
  Attempt pass rate: 88.9%
  Flaky tests: 1
  Tests Correct: Did not follow instructions: 1
  Response Time (avg): 9.13s
  Response Time (max): 18.14s
  Response Time (total): 27.39s
  Output Tokens: 442
  Reasoning Tokens: 3,832
Tool Calling
Google: Gemini 3 Flash Preview
  Score: 10.0
  Consistency: 10.0
  Attempt pass rate: 100.0%
  Flaky tests: 0
  Tests Correct: No failed answers.
  Response Time (avg): 3.35s
  Response Time (max): 3.35s
  Response Time (total): 3.35s
  Output Tokens: 234
  Reasoning Tokens: 0
OpenAI: GPT-5.4
  Score: 10.0
  Consistency: 10.0
  Attempt pass rate: 100.0%
  Flaky tests: 0
  Tests Correct: No failed answers.
  Response Time (avg): 13.28s
  Response Time (max): 13.28s
  Response Time (total): 13.28s