8.0
8.1
This score summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.
Rank
#28
#27
Consistency
9.0
9.6
The consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).
Tests Correct
Timed out: 2; Wrong answer: 2; Did not follow instructions: 1; Response Time (avg): 25.03s; Response Time (max): 147.47s; Response Time (total): 425.48s
Did not follow instructions: 2; Wrong answer: 2; API error: 1; Response Time (avg): 14.63s; Response Time (max): 46.04s; Response Time (total): 248.72s
A test is fully passed only if every run passed for that test.
Attempt pass rate
75.9%
74.1%
Attempt pass rate = passed attempts / total attempts across runs.
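The ratio above can be sketched in a few lines. This is only an illustration: the function name and the passed-attempt counts (41 and 40) are assumptions, chosen because they would reproduce the displayed percentages if the total number of attempts equals the 54 total runs reported on this page. The page itself does not state the raw counts.

```python
def attempt_pass_rate(passed_attempts: int, total_attempts: int) -> float:
    """Return passed attempts / total attempts as a percentage."""
    if total_attempts <= 0:
        raise ValueError("total_attempts must be positive")
    return 100.0 * passed_attempts / total_attempts


# Hypothetical counts (not given on the page): 41 and 40 passes out of 54 attempts.
print(f"{attempt_pass_rate(41, 54):.1f}%")  # prints 75.9%
print(f"{attempt_pass_rate(40, 54):.1f}%")  # prints 74.1%
```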
Flaky tests
2
1
Flaky tests had mixed outcomes across runs (at least one pass and one fail).
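To make the distinction between fully passed, flaky, and failed tests concrete, here is a minimal sketch based on the two definitions on this page. The function name, the "failed" label for tests with no passing run, and the sample test names are hypothetical; the benchmark's real implementation is not shown here.

```python
from typing import Dict, List


def classify_tests(outcomes: Dict[str, List[bool]]) -> Dict[str, str]:
    """Label each test from its per-run pass/fail outcomes.

    fully passed: every run passed
    flaky:        at least one pass and at least one fail
    failed:       no run passed (label assumed for illustration)
    """
    labels: Dict[str, str] = {}
    for name, runs in outcomes.items():
        if not runs:
            raise ValueError(f"test {name!r} has no recorded runs")
        if all(runs):
            labels[name] = "fully passed"
        elif any(runs):
            labels[name] = "flaky"
        else:
            labels[name] = "failed"
    return labels


# Hypothetical outcomes for three tests, three runs each.
sample = {
    "test_a": [True, True, True],
    "test_b": [True, False, True],
    "test_c": [False, False, False],
}
print(classify_tests(sample))
```

Under these definitions a flaky test never counts as fully passed, which is why the attempt pass rate can be higher than the share of fully passed tests.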
Total Runs
54
54
Cost per result
0.214
0.000
Shows the average cost per correct benchmark answer in cents (lower is better).
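One plausible reading of this metric is total spend divided by the number of correct answers. The sketch below assumes that reading; the function name and the sample figures are hypothetical and are not taken from the benchmark.

```python
from typing import Optional


def cost_per_correct(total_cost_cents: float, correct_answers: int) -> Optional[float]:
    """Average cost in cents per correct answer, or None if nothing was correct."""
    if total_cost_cents < 0:
        raise ValueError("total_cost_cents must be non-negative")
    if correct_answers <= 0:
        return None  # the ratio is undefined without any correct answer
    return total_cost_cents / correct_answers


# Hypothetical figures: 5.0 cents spent in total, 20 correct answers.
print(cost_per_correct(5.0, 20))  # prints 0.25
```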
Score: 10.0
Consistency: 10.0
Attempt pass rate: 100.0%
Flaky tests: 0
No failed answers; Response Time (avg): 6.20s; Response Time (max): 9.64s; Response Time (total): 24.78s
Score: 10.0
Consistency: 10.0
Attempt pass rate: 100.0%
Flaky tests: 0
No failed answers; Response Time (avg): 6.59s; Response Time (max): 10.20s; Response Time (total): 26.37s
Score: 2.8
Consistency: 10.0
Attempt pass rate: 0.0%
Flaky tests: 0
Timed out: 1; Response Time (avg): 147.47s; Response Time (max): 147.47s; Response Time (total): 147.47s
Score: 10.0
Consistency: 10.0
Attempt pass rate: 100.0%
Flaky tests: 0
No failed answers; Response Time (avg): 31.37s; Response Time (max): 31.37s; Response Time (total): 31.37s
Score: 9.6
Consistency: 10.0
Attempt pass rate: 100.0%
Flaky tests: 0
No failed answers; Response Time (avg): 73.55s; Response Time (max): 73.55s; Response Time (total): 73.55s
Score: 10.0
Consistency: 10.0
Attempt pass rate: 100.0%
Flaky tests: 0
No failed answers; Response Time (avg): 46.04s; Response Time (max): 46.04s; Response Time (total): 46.04s
Score: 10.0
Consistency: 10.0
Attempt pass rate: 100.0%
Flaky tests: 0
No failed answers; Response Time (avg): 16.51s; Response Time (max): 20.57s; Response Time (total): 33.02s
Score: 6.5
Consistency: 10.0
Attempt pass rate: 50.0%
Flaky tests: 0
API error: 1; Response Time (avg): 5.25s; Response Time (max): 5.25s; Response Time (total): 5.25s
Score: 2.9
Consistency: 4.4
Attempt pass rate: 22.2%
Flaky tests: 2
Wrong answer: 2; Timed out: 1; Response Time (avg): 23.62s; Response Time (max): 27.00s; Response Time (total): 47.23s
Score: 5.3
Consistency: 10.0
Attempt pass rate: 33.3%
Flaky tests: 0
Wrong answer: 2; Response Time (avg): 22.30s; Response Time (max): 30.51s; Response Time (total): 66.90s
Score: 10.0
Consistency: 10.0
Attempt pass rate: 100.0%
Flaky tests: 0
No failed answers; Response Time (avg): 29.76s; Response Time (max): 29.76s; Response Time (total): 29.76s
Score: 10.0
Consistency: 10.0
Attempt pass rate: 100.0%
Flaky tests: 0
No failed answers; Response Time (avg): 16.84s; Response Time (max): 16.84s; Response Time (total): 16.84s
Score: 10.0
Consistency: 10.0
Attempt pass rate: 100.0%
Flaky tests: 0
No failed answers; Response Time (avg): 17.54s; Response Time (max): 21.25s; Response Time (total): 35.08s
Score: 10.0
Consistency: 10.0
Attempt pass rate: 100.0%
Flaky tests: 0
No failed answers; Response Time (avg): 6.16s; Response Time (max): 7.72s; Response Time (total): 12.31s
Score: 7.9
Consistency: 9.6
Attempt pass rate: 66.7%
Flaky tests: 0
Did not follow instructions: 1; Response Time (avg): 8.52s; Response Time (max): 12.73s; Response Time (total): 25.56s
Score: 5.3
Consistency: 7.4
Attempt pass rate: 44.4%
Flaky tests: 1
Did not follow instructions: 2; Response Time (avg): 9.55s; Response Time (max): 14.35s; Response Time (total): 28.64s
Score: 10.0
Consistency: 10.0
Attempt pass rate: 100.0%
Flaky tests: 0
No failed answers; Response Time (avg): 9.01s; Response Time (max): 9.01s; Response Time (total): 9.01s
Score: 10.0
Consistency: 10.0
Attempt pass rate: 100.0%
Flaky tests: 0
No failed answers; Response Time (avg): 15.02s; Response Time (max): 15.02s; Response Time (total): 15.02s