8.2Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
8.2Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
Rank
#20
#17
Reliability
N/AFirst-attempt success rate: 100 means no retryable target API or rate-limit failures before successful calls; tracked failures lower the score.…
N/AFirst-attempt success rate: 100 means no retryable target API or rate-limit failures before successful calls; tracked failures lower the score.…
Consistency
9.6Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
9.5Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
Tests Correct
A test is fully passed only if every run passed for that test.Wrong answer: 3Did not follow instructions: 2Response Time (avg)71.21sResponse Time (max)351.99sResponse Time (total)1281.73sA test is fully passed only if every run passed for that test.…
A test is fully passed only if every run passed for that test.Wrong answer: 4Did not follow instructions: 1Response Time (avg)12.12sResponse Time (max)95.48sResponse Time (total)218.12sA test is fully passed only if every run passed for that test.…
Attempt pass rate
74.1%Attempt pass rate = passed attempts / total attempts across runs.…
75.9%Attempt pass rate = passed attempts / total attempts across runs.…
Flaky tests
1Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
1Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
Total Runs
24Total Runs…
54Total Runs…
Cost per result
1.224Shows the average cost per correct benchmark answer in cents (lower is better).…
2.454Shows the average cost per correct benchmark answer in cents (lower is better).…
8.3Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
10.0Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
75.0%Attempt pass rate = passed attempts / total attempts across runs.…
0Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
A test is fully passed only if every run passed for that test.Wrong answer: 1Response Time (avg)26.93sResponse Time (max)61.35sResponse Time (total)107.71sA test is fully passed only if every run passed for that test.…
8.4Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
10.0Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
75.0%Attempt pass rate = passed attempts / total attempts across runs.…
0Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
A test is fully passed only if every run passed for that test.Wrong answer: 1Response Time (avg)6.30sResponse Time (max)15.56sResponse Time (total)25.21sA test is fully passed only if every run passed for that test.…
10.0Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
10.0Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
100.0%Attempt pass rate = passed attempts / total attempts across runs.…
0Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
A test is fully passed only if every run passed for that test.No failed answers.Response Time (avg)93.00sResponse Time (max)93.00sResponse Time (total)93.00sA test is fully passed only if every run passed for that test.…
10.0Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
10.0Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
100.0%Attempt pass rate = passed attempts / total attempts across runs.…
0Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
A test is fully passed only if every run passed for that test.No failed answers.Response Time (avg)16.23sResponse Time (max)16.23sResponse Time (total)16.23sA test is fully passed only if every run passed for that test.…
10.0Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
10.0Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
100.0%Attempt pass rate = passed attempts / total attempts across runs.…
0Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
A test is fully passed only if every run passed for that test.No failed answers.Response Time (avg)71.08sResponse Time (max)71.08sResponse Time (total)71.08sA test is fully passed only if every run passed for that test.…
10.0Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
10.0Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
100.0%Attempt pass rate = passed attempts / total attempts across runs.…
0Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
A test is fully passed only if every run passed for that test.No failed answers.Response Time (avg)28.44sResponse Time (max)28.44sResponse Time (total)28.44sA test is fully passed only if every run passed for that test.…
10.0Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
10.0Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
100.0%Attempt pass rate = passed attempts / total attempts across runs.…
0Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
A test is fully passed only if every run passed for that test.No failed answers.Response Time (avg)63.00sResponse Time (max)102.80sResponse Time (total)126.00sA test is fully passed only if every run passed for that test.…
10.0Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
10.0Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
100.0%Attempt pass rate = passed attempts / total attempts across runs.…
0Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
A test is fully passed only if every run passed for that test.No failed answers.Response Time (avg)4.06sResponse Time (max)5.06sResponse Time (total)8.11sA test is fully passed only if every run passed for that test.…
5.3Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
10.0Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
33.3%Attempt pass rate = passed attempts / total attempts across runs.…
0Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
A test is fully passed only if every run passed for that test.Wrong answer: 2Response Time (avg)202.56sResponse Time (max)351.99sResponse Time (total)607.68sA test is fully passed only if every run passed for that test.…
5.9Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
7.2Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
55.6%Attempt pass rate = passed attempts / total attempts across runs.…
1Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
A test is fully passed only if every run passed for that test.Wrong answer: 2Response Time (avg)37.34sResponse Time (max)95.48sResponse Time (total)112.01sA test is fully passed only if every run passed for that test.…
5.0Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
10.0Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
0.0%Attempt pass rate = passed attempts / total attempts across runs.…
0Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
A test is fully passed only if every run passed for that test.Did not follow instructions: 1Response Time (avg)26.96sResponse Time (max)26.96sResponse Time (total)26.96sA test is fully passed only if every run passed for that test.…
4.8Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
10.0Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
0.0%Attempt pass rate = passed attempts / total attempts across runs.…
0Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
A test is fully passed only if every run passed for that test.Did not follow instructions: 1Response Time (avg)4.86sResponse Time (max)4.86sResponse Time (total)4.86sA test is fully passed only if every run passed for that test.…
10.0Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
10.0Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
100.0%Attempt pass rate = passed attempts / total attempts across runs.…
0Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
A test is fully passed only if every run passed for that test.No failed answers.Response Time (avg)14.60sResponse Time (max)20.03sResponse Time (total)29.20sA test is fully passed only if every run passed for that test.…
9.8Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
10.0Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
100.0%Attempt pass rate = passed attempts / total attempts across runs.…
0Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
A test is fully passed only if every run passed for that test.No failed answers.Response Time (avg)2.62sResponse Time (max)2.78sResponse Time (total)5.24sA test is fully passed only if every run passed for that test.…
7.6Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
7.4Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
77.8%Attempt pass rate = passed attempts / total attempts across runs.…
1Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
A test is fully passed only if every run passed for that test.Did not follow instructions: 1Response Time (avg)69.69sResponse Time (max)92.65sResponse Time (total)209.06sA test is fully passed only if every run passed for that test.…
7.7Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
10.0Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
66.7%Attempt pass rate = passed attempts / total attempts across runs.…
0Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
A test is fully passed only if every run passed for that test.Wrong answer: 1Response Time (avg)3.94sResponse Time (max)6.33sResponse Time (total)11.83sA test is fully passed only if every run passed for that test.…
10.0Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
10.0Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
100.0%Attempt pass rate = passed attempts / total attempts across runs.…
0Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
A test is fully passed only if every run passed for that test.No failed answers.Response Time (avg)11.05sResponse Time (max)11.05sResponse Time (total)11.05sA test is fully passed only if every run passed for that test.…
10.0Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
10.0Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
100.0%Attempt pass rate = passed attempts / total attempts across runs.…
0Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
A test is fully passed only if every run passed for that test.No failed answers.Response Time (avg)6.20sResponse Time (max)6.20sResponse Time (total)6.20sA test is fully passed only if every run passed for that test.…