8.2Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
5.8Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
Rank
#7
#32
Tests Correct
A test is fully passed only if every run passed for that test.Wrong answer: 2Did not follow instructions: 1Response Time (avg)21.06sResponse Time (max)100.41sResponse Time (total)315.95sA test is fully passed only if every run passed for that test.…
A test is fully passed only if every run passed for that test.Wrong answer: 7Response Time (avg)4.13sResponse Time (max)11.07sResponse Time (total)33.03sA test is fully passed only if every run passed for that test.…
Consistency
8.9Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
10.0Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
Cost per result
6.533Shows the average cost per correct benchmark answer in cents (lower is better).…
0.219Shows the average cost per correct benchmark answer in cents (lower is better).…
Total Cost
$0.784Total Cost…
$0.018Total Cost…
Attempt pass rate
86.7%Attempt pass rate = passed attempts / total attempts across runs.…
53.3%Attempt pass rate = passed attempts / total attempts across runs.…
Flaky tests
2Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
0Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
common.totalRuns
45 (15 x 3)common.totalRuns…
45 (15 x 3)common.totalRuns…
Output Tokens
1,611Output Tokens…
1,445Output Tokens…
Reasoning Tokens
46,321Reasoning Tokens…
0Reasoning Tokens…
Response Time (avg)
21.06sResponse Time (avg)…
4.13sResponse Time (avg)…
Response Time (max)
100.41sResponse Time (max)…
11.07sResponse Time (max)…
Response Time (total)
315.95sResponse Time (total)…
33.03sResponse Time (total)…
Top Models by Score
Score vs Total Cost
Response Time (avg)
Avg Score vs Response Time (avg)
Category Breakdown
Anti-AI Tricks
Score
Consistency
Attempt pass rate
Flaky tests
Tests Correct
Response Time (avg)
Output Tokens
Reasoning Tokens
OpenAI: GPT-5.4
10.0Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
10.0Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
100.0%Attempt pass rate = passed attempts / total attempts across runs.…
0Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
A test is fully passed only if every run passed for that test.No failed answers.Response Time (avg)5.02sResponse Time (max)6.42sResponse Time (total)15.06sA test is fully passed only if every run passed for that test.…
5.02sResponse Time (avg)…
216Output Tokens…
1,466Reasoning Tokens…
Z.ai: GLM 5
4.0Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
10.0Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
33.3%Attempt pass rate = passed attempts / total attempts across runs.…
0Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
A test is fully passed only if every run passed for that test.Wrong answer: 2Response Time (avg)3.39sResponse Time (max)3.39sResponse Time (total)3.39sA test is fully passed only if every run passed for that test.…
3.39sResponse Time (avg)…
272Output Tokens…
0Reasoning Tokens…
Combined
Score
Consistency
Attempt pass rate
Flaky tests
Tests Correct
Response Time (avg)
Output Tokens
Reasoning Tokens
OpenAI: GPT-5.4
10.0Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
10.0Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
100.0%Attempt pass rate = passed attempts / total attempts across runs.…
0Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
A test is fully passed only if every run passed for that test.No failed answers.Response Time (avg)20.57sResponse Time (max)20.57sResponse Time (total)20.57sA test is fully passed only if every run passed for that test.…
20.57sResponse Time (avg)…
301Output Tokens…
3,543Reasoning Tokens…
Z.ai: GLM 5
10.0Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
10.0Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
0.0%Attempt pass rate = passed attempts / total attempts across runs.…
0Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
A test is fully passed only if every run passed for that test.Wrong answer: 1Response Time (avg)4.98sResponse Time (max)4.98sResponse Time (total)4.98sA test is fully passed only if every run passed for that test.…
4.98sResponse Time (avg)…
406Output Tokens…
0Reasoning Tokens…
Data parsing and extraction
Score
Consistency
Attempt pass rate
Flaky tests
Tests Correct
Response Time (avg)
Output Tokens
Reasoning Tokens
OpenAI: GPT-5.4
9.9Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
10.0Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
100.0%Attempt pass rate = passed attempts / total attempts across runs.…
0Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
A test is fully passed only if every run passed for that test.No failed answers.Response Time (avg)5.32sResponse Time (max)5.40sResponse Time (total)10.64sA test is fully passed only if every run passed for that test.…
5.32sResponse Time (avg)…
234Output Tokens…
804Reasoning Tokens…
Z.ai: GLM 5
9.9Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
10.0Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
100.0%Attempt pass rate = passed attempts / total attempts across runs.…
0Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
A test is fully passed only if every run passed for that test.No failed answers.Response Time (avg)5.78sResponse Time (max)5.78sResponse Time (total)5.78sA test is fully passed only if every run passed for that test.…
5.78sResponse Time (avg)…
203Output Tokens…
0Reasoning Tokens…
Domain specific
Score
Consistency
Attempt pass rate
Flaky tests
Tests Correct
Response Time (avg)
Output Tokens
Reasoning Tokens
OpenAI: GPT-5.4
4.0Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
7.2Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
44.4%Attempt pass rate = passed attempts / total attempts across runs.…
1Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
A test is fully passed only if every run passed for that test.Wrong answer: 2Response Time (avg)74.27sResponse Time (max)100.41sResponse Time (total)222.80sA test is fully passed only if every run passed for that test.…
74.27sResponse Time (avg)…
61Output Tokens…
34,748Reasoning Tokens…
Z.ai: GLM 5
10.0Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
10.0Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
0.0%Attempt pass rate = passed attempts / total attempts across runs.…
0Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
A test is fully passed only if every run passed for that test.Wrong answer: 3Response Time (avg)2.24sResponse Time (max)2.24sResponse Time (total)2.24sA test is fully passed only if every run passed for that test.…
2.24sResponse Time (avg)…
19Output Tokens…
0Reasoning Tokens…
Instructions following
Score
Consistency
Attempt pass rate
Flaky tests
Tests Correct
Response Time (avg)
Output Tokens
Reasoning Tokens
OpenAI: GPT-5.4
10.0Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
10.0Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
100.0%Attempt pass rate = passed attempts / total attempts across runs.…
0Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
A test is fully passed only if every run passed for that test.No failed answers.Response Time (avg)3.11sResponse Time (max)3.68sResponse Time (total)6.22sA test is fully passed only if every run passed for that test.…
3.11sResponse Time (avg)…
93Output Tokens…
897Reasoning Tokens…
Z.ai: GLM 5
10.0Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
10.0Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
100.0%Attempt pass rate = passed attempts / total attempts across runs.…
0Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
A test is fully passed only if every run passed for that test.No failed answers.Response Time (avg)1.48sResponse Time (max)1.48sResponse Time (total)1.48sA test is fully passed only if every run passed for that test.…
1.48sResponse Time (avg)…
61Output Tokens…
0Reasoning Tokens…
Puzzle Solving
Score
Consistency
Attempt pass rate
Flaky tests
Tests Correct
Response Time (avg)
Output Tokens
Reasoning Tokens
OpenAI: GPT-5.4
7.0Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
7.2Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
88.9%Attempt pass rate = passed attempts / total attempts across runs.…
1Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
A test is fully passed only if every run passed for that test.Did not follow instructions: 1Response Time (avg)9.13sResponse Time (max)18.14sResponse Time (total)27.39sA test is fully passed only if every run passed for that test.…
9.13sResponse Time (avg)…
442Output Tokens…
3,832Reasoning Tokens…
Z.ai: GLM 5
7.0Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
10.0Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
66.7%Attempt pass rate = passed attempts / total attempts across runs.…
0Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
A test is fully passed only if every run passed for that test.Wrong answer: 1Response Time (avg)2.05sResponse Time (max)2.08sResponse Time (total)4.10sA test is fully passed only if every run passed for that test.…
2.05sResponse Time (avg)…
264Output Tokens…
0Reasoning Tokens…
Tool Calling
Score
Consistency
Attempt pass rate
Flaky tests
Tests Correct
Response Time (avg)
Output Tokens
Reasoning Tokens
OpenAI: GPT-5.4
10.0Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
10.0Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
100.0%Attempt pass rate = passed attempts / total attempts across runs.…
0Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
A test is fully passed only if every run passed for that test.No failed answers.Response Time (avg)13.28sResponse Time (max)13.28sResponse Time (total)13.28sA test is fully passed only if every run passed for that test.…
13.28sResponse Time (avg)…
264Output Tokens…
1,031Reasoning Tokens…
Z.ai: GLM 5
10.0Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.…
10.0Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).…
100.0%Attempt pass rate = passed attempts / total attempts across runs.…
0Flaky tests had mixed outcomes across runs (at least one pass and one fail).…
A test is fully passed only if every run passed for that test.No failed answers.Response Time (avg)11.07sResponse Time (max)11.07sResponse Time (total)11.07sA test is fully passed only if every run passed for that test.…