Score (summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance)
7.6
6.4
Rank
#12
#29
Tests Correct (a test is fully passed only if every run passed for that test)
Wrong answer: 4
Wrong answer: 3, Did not follow instructions: 1, No answer: 1, Timed out: 1
Consistency (score reflects run-to-run stability; 10 = very consistent, even if consistently wrong)
10.0
7.8
Cost per result (average cost per correct benchmark answer, in cents; lower is better)
0.170
2.082
Total Cost
$0.019
$0.188
Attempt pass rate (passed attempts / total attempts across runs)
73.3%
73.3%
Flaky tests (tests with mixed outcomes across runs: at least one pass and one fail)
0
4
Total runs
45 (15 x 3)
45 (15 x 3)
Output Tokens
1,542
34,638
Reasoning Tokens
6,888
68,234
Response Time (avg)
3.49s
69.84s
Response Time (max)
11.91s
137.29s
Response Time (total)
52.29s
558.72s
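The run-level metrics defined above (attempt pass rate, fully passed tests, flaky tests, cost per result) can be sketched in a few lines. This is only an illustration, not the benchmark's actual scoring code: the function name `summarize_runs`, the input shape (test id mapped to a list of per-run pass/fail booleans), and the choice to divide total cost by the number of fully passed tests are all assumptions made for the example.

```python
def summarize_runs(outcomes, total_cost_cents):
    """Compute run-level metrics from per-test run outcomes.

    outcomes: dict mapping a test id to a list of booleans, one per run
              (hypothetical input shape, chosen for this sketch).
    total_cost_cents: total spend for all runs, in cents.
    """
    total_attempts = sum(len(runs) for runs in outcomes.values())
    passed_attempts = sum(sum(runs) for runs in outcomes.values())

    # A test is fully passed only if every run passed for that test.
    fully_passed = sum(1 for runs in outcomes.values() if runs and all(runs))

    # A flaky test has mixed outcomes: at least one pass and one fail.
    flaky = sum(1 for runs in outcomes.values() if any(runs) and not all(runs))

    # Assumption: "cost per result" divides total cost by fully passed tests.
    cost_per_correct = total_cost_cents / fully_passed if fully_passed else None

    return {
        "attempt_pass_rate": passed_attempts / total_attempts if total_attempts else 0.0,
        "fully_passed": fully_passed,
        "flaky": flaky,
        "cost_per_correct_cents": cost_per_correct,
    }


# Hypothetical data: 3 tests x 3 runs each.
example = {
    "t1": [True, True, True],     # fully passed
    "t2": [True, False, True],    # flaky
    "t3": [False, False, False],  # consistently wrong, not flaky
}
summary = summarize_runs(example, total_cost_cents=1.5)
print(f"{summary['attempt_pass_rate']:.1%}", summary["fully_passed"], summary["flaky"])
```

Note that a test failing on every run counts toward neither the fully passed nor the flaky total, which is why a model can be very consistent while still being consistently wrong.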
[Charts: Top Models by Score; Score vs Total Cost; Response Time (avg); Avg Score vs Response Time (avg)]
Category Breakdown
Anti-AI Tricks
Score
Consistency
Attempt pass rate
Flaky tests
Tests Correct
Response Time (avg)
Output Tokens
Reasoning Tokens
Google: Gemini 3.1 Flash Lite Preview
7.0
10.0
66.7%
0
Wrong answer: 1 (Response Time: avg 2.18s, max 3.18s, total 6.53s)
2.18s
456
1,224
MoonshotAI: Kimi K2.5
7.0
7.2
88.9%
1
No answer: 1 (Response Time: avg 85.28s, max 85.28s, total 85.28s)
85.28s
335
6,255
Combined
Score
Consistency
Attempt pass rate
Flaky tests
Tests Correct
Response Time (avg)
Output Tokens
Reasoning Tokens
Google: Gemini 3.1 Flash Lite Preview
10.0
10.0
0.0%
0
Wrong answer: 1 (Response Time: avg 11.91s, max 11.91s, total 11.91s)
11.91s
225
762
MoonshotAI: Kimi K2.5
10.0
10.0
100.0%
0
No failed answers (Response Time: avg 71.37s, max 71.37s, total 71.37s)
71.37s
703
3,713
Data parsing and extraction
Score
Consistency
Attempt pass rate
Flaky tests
Tests Correct
Response Time (avg)
Output Tokens
Reasoning Tokens
Google: Gemini 3.1 Flash Lite Preview
9.9
10.0
100.0%
0
No failed answers (Response Time: avg 3.00s, max 3.74s, total 5.99s)
3.00s
291
696
MoonshotAI: Kimi K2.5
9.9
10.0
100.0%
0
No failed answers (Response Time: avg 49.78s, max 49.78s, total 49.78s)
49.78s
563
7,940
Domain specific
Score
Consistency
Attempt pass rate
Flaky tests
Tests Correct
Response Time (avg)
Output Tokens
Reasoning Tokens
Google: Gemini 3.1 Flash Lite Preview
4.0
10.0
33.3%
0
Wrong answer: 2 (Response Time: avg 2.36s, max 3.51s, total 7.07s)
2.36s
18
1,212
MoonshotAI: Kimi K2.5
10.0
4.4
33.3%
2
Wrong answer: 2, Timed out: 1 (Response Time: avg 137.29s, max 137.29s, total 137.29s)
137.29s
20,753
30,564
Instructions following
Score
Consistency
Attempt pass rate
Flaky tests
Tests Correct
Response Time (avg)
Output Tokens
Reasoning Tokens
Google: Gemini 3.1 Flash Lite Preview
10.0
10.0
100.0%
0
No failed answers (Response Time: avg 1.49s, max 1.66s, total 2.99s)
1.49s
72
753
MoonshotAI: Kimi K2.5
10.0
10.0
100.0%
0
No failed answers (Response Time: avg 92.47s, max 92.47s, total 92.47s)
92.47s
5,371
6,547
Puzzle Solving
Score
Consistency
Attempt pass rate
Flaky tests
Tests Correct
Response Time (avg)
Output Tokens
Reasoning Tokens
Google: Gemini 3.1 Flash Lite Preview
10.0
10.0
100.0%
0
No failed answers (Response Time: avg 2.76s, max 5.08s, total 8.27s)
2.76s
243
1,248
MoonshotAI: Kimi K2.5
4.0
7.3
44.4%
1
Did not follow instructions: 1, Wrong answer: 1 (Response Time: avg 45.40s, max 82.75s, total 90.79s)
45.40s
6,671
12,403
Tool Calling
Score
Consistency
Attempt pass rate
Flaky tests
Tests Correct
Response Time (avg)
Output Tokens
Reasoning Tokens
Google: Gemini 3.1 Flash Lite Preview
10.0
10.0
100.0%
0
No failed answers (Response Time: avg 9.54s, max 9.54s, total 9.54s)
9.54s
237
993
MoonshotAI: Kimi K2.5
10.0
10.0
100.0%
0
No failed answers (Response Time: avg 31.74s, max 31.74s, total 31.74s)