| Metric | OpenAI: GPT-5.4 | OpenAI: GPT-5.3-Codex | OpenAI: GPT-5.2 |
| --- | --- | --- | --- |
| Score | 80 | 85 | 68 |
| Consistency | 85 | 90 | 74 |
| Cost per result (cents) | 7.127 | 4.820 | 3.396 |
| Total Cost | $0.784 | $0.531 | $0.306 |
| Tests Correct (failed answers) | Did not follow instructions: 2; Wrong answer: 2 | Did not follow instructions: 2; Wrong answer: 2 | Did not follow instructions: 3; No answer: 1; Timed out: 1; Wrong answer: 1 |
| Response Time (avg) | 21.06s | 17.37s | 16.71s |
| Response Time (max) | 100.41s | 100.93s | 77.80s |
| Response Time (total) | 315.95s | 260.52s | 133.69s |
| Attempt pass rate | 82.2% | 82.2% | 75.6% |
| Flaky tests | 3 | 2 | 5 |
| Output Tokens | 1,611 | 1,577 | 2,058 |
| Reasoning Tokens | 46,321 | 33,017 | 16,542 |

Score: summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.
Consistency: reflects run-to-run stability (10 = very consistent, even if consistently wrong).
Cost per result: the average cost per correct benchmark answer in cents (lower is better).
Tests Correct: a test is fully passed only if every run passed for that test.
Attempt pass rate: passed attempts / total attempts across runs.
Flaky tests: tests with mixed outcomes across runs (at least one pass and one fail).
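The run-based definitions above (fully passed tests, attempt pass rate, flaky tests, cost per correct answer) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input layout (a mapping from test name to a list of per-run pass/fail booleans), the function names, and the sample numbers are hypothetical and are not taken from the benchmark's actual implementation or data. It also assumes that a "correct answer" for cost purposes means a fully passed test, which the page does not state.

```python
from typing import Dict, List


def summarize_runs(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Summarize per-test run outcomes (True = the run passed).

    The input shape is a hypothetical layout chosen for this sketch.
    """
    total_attempts = sum(len(runs) for runs in results.values())
    passed_attempts = sum(sum(runs) for runs in results.values())
    # A test is fully passed only if every run passed for that test.
    fully_passed = sum(1 for runs in results.values() if runs and all(runs))
    # A flaky test has mixed outcomes: at least one pass and one fail.
    flaky = sum(1 for runs in results.values() if any(runs) and not all(runs))
    return {
        "total_attempts": total_attempts,
        "passed_attempts": passed_attempts,
        "attempt_pass_rate": passed_attempts / total_attempts if total_attempts else 0.0,
        "fully_passed": fully_passed,
        "flaky": flaky,
    }


def cost_per_correct_cents(total_cost_usd: float, correct: int) -> float:
    """Average cost per correct answer in cents (lower is better)."""
    if correct <= 0:
        raise ValueError("cost per correct answer is undefined with no correct answers")
    return total_cost_usd * 100 / correct


# Made-up example: three tests, three runs each.
example = {
    "test_a": [True, True, True],     # fully passed
    "test_b": [True, False, True],    # flaky
    "test_c": [False, False, False],  # consistently failing, not flaky
}
summary = summarize_runs(example)
print(f"attempt pass rate: {summary['attempt_pass_rate']:.1%}")  # 5 of 9 attempts
print(f"fully passed: {summary['fully_passed']}, flaky: {summary['flaky']}")
print(f"cost per correct: {cost_per_correct_cents(0.50, 10):.3f} cents")
```

Note that a test failing on every run lowers the attempt pass rate but does not count as flaky, which is why a model can be "very consistent, even if consistently wrong".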
[Charts: Top Models by Score; Score vs Total Cost]
Category Breakdown
Anti-AI Tricks

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failed answers) | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAI: GPT-5.4 | 100 | 100 | 100.0% | 0 | No failed answers | 5.02s | 6.42s | 15.06s | 216 | 1,466 |
| OpenAI: GPT-5.3-Codex | 100 | 100 | 100.0% | 0 | No failed answers | 4.69s | 6.68s | 14.06s | 216 | 1,421 |
| OpenAI: GPT-5.2 | 70 | 73 | 77.8% | 1 | Did not follow instructions: 1 | 14.34s | 14.34s | 14.34s | 549 | 2,002 |
Combined

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failed answers) | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAI: GPT-5.4 | 100 | 100 | 100.0% | 0 | No failed answers | 20.57s | 20.57s | 20.57s | 301 | 3,543 |
| OpenAI: GPT-5.3-Codex | 100 | 100 | 100.0% | 0 | No failed answers | 19.56s | 19.56s | 19.56s | 364 | 2,731 |
| OpenAI: GPT-5.2 | 100 | 100 | 100.0% | 0 | No failed answers | 14.06s | 14.06s | 14.06s | 291 | 1,757 |
Data parsing and extraction

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failed answers) | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAI: GPT-5.4 | 99 | 100 | 100.0% | 0 | No failed answers | 5.32s | 5.40s | 10.64s | 234 | 804 |
| OpenAI: GPT-5.3-Codex | 99 | 100 | 100.0% | 0 | No failed answers | 3.07s | 3.59s | 6.15s | 234 | 728 |
| OpenAI: GPT-5.2 | 99 | 100 | 100.0% | 0 | No failed answers | 3.15s | 3.15s | 3.15s | 234 | 420 |
Domain specific

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failed answers) | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAI: GPT-5.4 | 40 | 72 | 44.4% | 1 | Wrong answer: 2 | 74.27s | 100.41s | 222.80s | 61 | 34,748 |
| OpenAI: GPT-5.3-Codex | 40 | 72 | 55.6% | 1 | Wrong answer: 2 | 64.31s | 100.93s | 192.94s | 64 | 25,308 |
| OpenAI: GPT-5.2 | 40 | 72 | 55.6% | 1 | Timed out: 1; Wrong answer: 1 | 77.80s | 77.80s | 77.80s | 42 | 10,342 |
Instructions following

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failed answers) | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAI: GPT-5.4 | 85 | 68 | 66.7% | 1 | Did not follow instructions: 1 | 3.11s | 3.68s | 6.22s | 93 | 897 |
| OpenAI: GPT-5.3-Codex | 90 | 100 | 50.0% | 0 | Did not follow instructions: 1 | 3.04s | 3.44s | 6.07s | 93 | 693 |
| OpenAI: GPT-5.2 | 85 | 68 | 66.7% | 1 | Did not follow instructions: 1 | 3.12s | 3.12s | 3.12s | 94 | 614 |
Puzzle Solving

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failed answers) | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAI: GPT-5.4 | 70 | 72 | 88.9% | 1 | Did not follow instructions: 1 | 9.13s | 18.14s | 27.39s | 442 | 3,832 |
| OpenAI: GPT-5.3-Codex | 93 | 79 | 88.9% | 1 | Did not follow instructions: 1 | 5.12s | 8.73s | 15.37s | 352 | 1,644 |
| OpenAI: GPT-5.2 | 70 | 73 | 77.8% | 1 | Did not follow instructions: 1 | 5.47s | 6.45s | 10.94s | 609 | 938 |
Tool Calling

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failed answers) | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAI: GPT-5.4 | 100 | 100 | 100.0% | 0 | No failed answers | 13.28s | 13.28s | 13.28s | 264 | 1,031 |
| OpenAI: GPT-5.3-Codex | 100 | 100 | 100.0% | 0 | No failed answers | 6.37s | 6.37s | 6.37s | 254 | 492 |
OpenAI: GPT-5.2
Score: 100
Consistency: 16
Attempt pass rate: 66.7%
Flaky tests: 1
Tests Correct (failed answers): No answer: 1
Response Time (avg): 10.30s
Response Time (max): 10.30s
Response Time (total): 10.30s