| Metric | DeepSeek: DeepSeek V3.2 | OpenAI: GPT-5.4 |
| --- | --- | --- |
| Score | 5.2 | 8.2 |
| Rank | #38 | #7 |
| Tests Correct (failures by type) | Wrong answer: 6; Extra formatting: 2; Invalid tool call: 1 | Wrong answer: 2; Did not follow instructions: 1 |
| Consistency | 8.3 | 8.9 |
| Cost per result (cents) | 0.254 | 6.533 |
| Total Cost | $0.016 | $0.784 |
| Attempt pass rate | 51.1% | 86.7% |
| Flaky tests | 3 | 2 |
| Total runs | 45 (15 x 3) | 45 (15 x 3) |
| Output Tokens | 7,756 | 1,611 |
| Reasoning Tokens | 0 | 46,321 |
| Response Time (avg) | 13.53s | 21.06s |
| Response Time (max) | 115.89s | 100.41s |
| Response Time (total) | 202.92s | 315.95s |

Score: Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.
Tests Correct: A test is fully passed only if every run passed for that test.
Consistency: Reflects run-to-run stability (10 = very consistent, even if consistently wrong).
Cost per result: The average cost per correct benchmark answer in cents (lower is better).
Attempt pass rate: Passed attempts / total attempts across runs.
Flaky tests: Tests with mixed outcomes across runs (at least one pass and one fail).
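The metric definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the benchmark's actual scoring code: the names `summarize`, `SuiteStats`, and `cost_per_correct_cents` are hypothetical, each test is assumed to have at least one run, and "correct answer" in the cost metric is assumed to mean a fully passed test.

```python
from dataclasses import dataclass


@dataclass
class SuiteStats:
    total_attempts: int
    attempt_pass_rate: float  # fraction in [0, 1]
    fully_passed: int         # tests where every run passed
    flaky: int                # tests with at least one pass and one fail


def summarize(runs_by_test: dict[str, list[bool]]) -> SuiteStats:
    """Compute run-level metrics from per-test pass/fail outcomes (hypothetical helper)."""
    attempts = [ok for runs in runs_by_test.values() for ok in runs]
    fully_passed = sum(1 for runs in runs_by_test.values() if all(runs))
    flaky = sum(1 for runs in runs_by_test.values() if any(runs) and not all(runs))
    return SuiteStats(
        total_attempts=len(attempts),
        attempt_pass_rate=sum(attempts) / len(attempts),
        fully_passed=fully_passed,
        flaky=flaky,
    )


def cost_per_correct_cents(total_cost_usd: float, fully_passed: int) -> float:
    """Average cost per fully passed test, in cents (assumed reading of 'Cost per result')."""
    return total_cost_usd * 100 / fully_passed


# Made-up outcomes for three tests with three runs each.
example = {
    "t1": [True, True, True],     # fully passed
    "t2": [True, False, True],    # flaky
    "t3": [False, False, False],  # consistently failing, not flaky
}
stats = summarize(example)
print(stats.fully_passed, stats.flaky, stats.total_attempts)  # 1 1 9

# Figures from the page: $0.784 total cost, and 12 fully passed tests
# inferred as 15 tests minus the 3 listed failures.
print(round(cost_per_correct_cents(0.784, 12), 3))  # 6.533
```

Under these assumptions the second call reproduces the 6.533 cents shown for the $0.784 column. The $0.016 column only matches approximately (about 0.27 cents for 6 fully passed tests), which is plausible if the displayed total cost is rounded.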
[Charts: Top Models by Score; Score vs Total Cost; Response Time (avg); Avg Score vs Response Time (avg)]
Category Breakdown
Anti-AI Tricks

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failures by type) | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepSeek: DeepSeek V3.2 | 10.0 | 9.7 | 0.0% | 0 | Extra formatting: 2; Wrong answer: 1 | 8.79s | 12.26s | 26.38s | 1,411 | 0 |
| OpenAI: GPT-5.4 | 10.0 | 10.0 | 100.0% | 0 | No failed answers | 5.02s | 6.42s | 15.06s | 216 | 1,466 |
Combined

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failures by type) | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepSeek: DeepSeek V3.2 | 8.0 | 10.0 | 0.0% | 0 | Invalid tool call: 1 | 115.89s | 115.89s | 115.89s | 2,887 | 0 |
| OpenAI: GPT-5.4 | 10.0 | 10.0 | 100.0% | 0 | No failed answers | 20.57s | 20.57s | 20.57s | 301 | 3,543 |
Data parsing and extraction

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failures by type) | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepSeek: DeepSeek V3.2 | 5.4 | 5.8 | 66.7% | 1 | Wrong answer: 1 | 9.42s | 16.20s | 18.84s | 1,710 | 0 |
| OpenAI: GPT-5.4 | 9.9 | 10.0 | 100.0% | 0 | No failed answers | 5.32s | 5.40s | 10.64s | 234 | 804 |
Domain specific

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failures by type) | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepSeek: DeepSeek V3.2 | 10.0 | 7.2 | 22.2% | 1 | Wrong answer: 3 | 1.61s | 1.77s | 4.83s | 24 | 0 |
| OpenAI: GPT-5.4 | 4.0 | 7.2 | 44.4% | 1 | Wrong answer: 2 | 74.27s | 100.41s | 222.80s | 61 | 34,748 |
Instructions following

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failures by type) | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepSeek: DeepSeek V3.2 | 10.0 | 10.0 | 100.0% | 0 | No failed answers | 1.52s | 1.99s | 3.04s | 66 | 0 |
| OpenAI: GPT-5.4 | 10.0 | 10.0 | 100.0% | 0 | No failed answers | 3.11s | 3.68s | 6.22s | 93 | 897 |
Puzzle Solving

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failures by type) | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepSeek: DeepSeek V3.2 | 7.7 | 7.5 | 88.9% | 1 | Wrong answer: 1 | 7.37s | 10.78s | 22.10s | 1,136 | 0 |
| OpenAI: GPT-5.4 | 7.0 | 7.2 | 88.9% | 1 | Did not follow instructions: 1 | 9.13s | 18.14s | 27.39s | 442 | 3,832 |
Tool Calling

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failures by type) | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepSeek: DeepSeek V3.2 | 10.0 | 10.0 | 100.0% | 0 | No failed answers | 11.85s | 11.85s | 11.85s | 522 | 0 |
| OpenAI: GPT-5.4 | 10.0 | 10.0 | 100.0% | 0 | No failed answers | 13.28s | 13.28s | 13.28s | | |