Quality score
Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.
7.4
8.5
Rank
#39
#6
Consistency
Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).
9.6
9.1
Tests Correct
A test is fully passed only if every run passed for that test.
Failed answers: Extra formatting: 3; Wrong answer: 3; Did not follow instructions: 1
Response Time: avg 4.98s, max 23.84s, total 54.83s
Failed answers: Timed out: 2; Wrong answer: 2
Response Time: avg 46.56s, max 120.91s, total 512.20s
Attempt pass rate
Attempt pass rate = passed attempts / total attempts across runs.
64.8%
83.3%
Flaky tests
Flaky tests had mixed outcomes across runs (at least one pass and one fail).
1
2
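The three definitions above (fully passed test, attempt pass rate, flaky test) can be sketched in a few lines of Python. This is only an illustration of the stated definitions, not the benchmark's actual scoring code; the `attempts` data, the test IDs, and the `summarize` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical run log: one (test_id, passed) entry per attempt across runs.
attempts = [
    ("t1", True), ("t1", True), ("t1", True),     # every run passed
    ("t2", True), ("t2", False), ("t2", True),    # mixed outcomes
    ("t3", False), ("t3", False), ("t3", False),  # every run failed
]

def summarize(attempts):
    """Compute the page's three run-level metrics from raw attempts."""
    by_test = defaultdict(list)
    for test_id, passed in attempts:
        by_test[test_id].append(passed)

    total = len(attempts)
    passed_attempts = sum(1 for _, passed in attempts if passed)
    return {
        # passed attempts / total attempts across runs
        "attempt_pass_rate": passed_attempts / total if total else 0.0,
        # a test is fully passed only if every run passed
        "fully_passed": sum(1 for runs in by_test.values() if all(runs)),
        # flaky: at least one pass and at least one fail
        "flaky": sum(1 for runs in by_test.values()
                     if any(runs) and not all(runs)),
    }

summary = summarize(attempts)
print(f"Attempt pass rate: {summary['attempt_pass_rate']:.1%}")
print(f"Fully passed tests: {summary['fully_passed']}")
print(f"Flaky tests: {summary['flaky']}")
```

Note that a test that fails on every run is neither fully passed nor flaky, which is why a model can score 10 on consistency while being consistently wrong.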
Total Runs
54
54
Cost per result
Shows the average cost per correct benchmark answer in cents (lower is better).
2.376
1.566
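As a rough sketch of how a cost-per-result figure like this could be derived: divide the total spend by the number of correct answers. The page does not say whether "correct answer" means a passed attempt or a fully passed test, so the function below simply takes a count; the function name and the sample figures are hypothetical.

```python
def cost_per_correct_cents(total_cost_cents, correct_answers):
    """Average cost per correct answer, in cents (lower is better).

    Assumption: cost is spread only over correct answers, so a model
    with no correct answers has an unbounded cost per result.
    """
    if correct_answers == 0:
        return float("inf")
    return total_cost_cents / correct_answers

# Hypothetical figures: 70 cents spent in total, 35 correct answers.
example = cost_per_correct_cents(70.0, 35)
print(f"{example:.3f} cents per correct answer")
```

Because failed attempts still cost money but add nothing to the denominator, a cheaper model with a low pass rate can end up with a higher cost per result than a pricier but more accurate one.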
Quality score: 4.8 | Consistency: 10.0 | Attempt pass rate: 25.0% | Flaky tests: 0
Failed answers: Extra formatting: 2; Wrong answer: 1
Response Time: avg 2.94s, max 4.83s, total 5.88s
Quality score: 8.2 | Consistency: 7.9 | Attempt pass rate: 83.3% | Flaky tests: 1
Failed answers: Wrong answer: 1
Response Time: avg 45.78s, max 81.20s, total 91.57s
Quality score: 10.0 | Consistency: 10.0 | Attempt pass rate: 100.0% | Flaky tests: 0
No failed answers.
Response Time: avg 3.67s, max 3.67s, total 3.67s
Quality score: 10.0 | Consistency: 10.0 | Attempt pass rate: 100.0% | Flaky tests: 0
No failed answers.
Response Time: avg 120.91s, max 120.91s, total 120.91s
Quality score: 9.5 | Consistency: 10.0 | Attempt pass rate: 100.0% | Flaky tests: 0
No failed answers.
Response Time: avg 23.84s, max 23.84s, total 23.84s
Quality score: 10.0 | Consistency: 10.0 | Attempt pass rate: 100.0% | Flaky tests: 0
No failed answers.
Response Time: avg 46.85s, max 46.85s, total 46.85s
Quality score: 10.0 | Consistency: 10.0 | Attempt pass rate: 100.0% | Flaky tests: 0
No failed answers.
Response Time: avg 3.43s, max 3.43s, total 3.43s
Quality score: 10.0 | Consistency: 10.0 | Attempt pass rate: 100.0% | Flaky tests: 0
No failed answers.
Response Time: avg 46.91s, max 46.91s, total 46.91s
Quality score: 7.7 | Consistency: 10.0 | Attempt pass rate: 66.7% | Flaky tests: 0
Failed answers: Wrong answer: 1
Response Time: avg 3.54s, max 3.54s, total 3.54s
Quality score: 5.3 | Consistency: 10.0 | Attempt pass rate: 33.3% | Flaky tests: 0
Failed answers: Timed out: 1; Wrong answer: 1
Response Time: avg 17.50s, max 17.50s, total 17.50s
Quality score: 6.1 | Consistency: 3.1 | Attempt pass rate: 66.7% | Flaky tests: 1
Failed answers: Did not follow instructions: 1
Response Time: avg 2.56s, max 2.56s, total 2.56s
Quality score: 4.7 | Consistency: 1.6 | Attempt pass rate: 66.7% | Flaky tests: 1
Failed answers: Timed out: 1
Response Time: avg 79.86s, max 79.86s, total 79.86s
Quality score: 6.5 | Consistency: 10.0 | Attempt pass rate: 50.0% | Flaky tests: 0
Failed answers: Wrong answer: 1
Response Time: avg 1.96s, max 1.96s, total 1.96s
Quality score: 10.0 | Consistency: 10.0 | Attempt pass rate: 100.0% | Flaky tests: 0
No failed answers.
Response Time: avg 31.93s, max 31.93s, total 31.93s
Quality score: 7.7 | Consistency: 10.0 | Attempt pass rate: 66.7% | Flaky tests: 0
Failed answers: Extra formatting: 1
Response Time: avg 2.92s, max 3.33s, total 5.84s
Quality score: 10.0 | Consistency: 10.0 | Attempt pass rate: 100.0% | Flaky tests: 0
No failed answers.
Response Time: avg 34.57s, max 49.12s, total 69.13s
Quality score: 10.0 | Consistency: 10.0 | Attempt pass rate: 100.0% | Flaky tests: 0
No failed answers.
Response Time: avg 4.11s, max 4.11s, total 4.11s
Quality score: 10.0 | Consistency: 10.0 | Attempt pass rate: 100.0% | Flaky tests: 0
No failed answers.
Response Time: avg 7.54s, max 7.54s, total 7.54s