| Metric | OpenAI: GPT-5 Mini | Qwen: Qwen3.5-Flash |
| --- | --- | --- |
| Score | 6.1 | 4.9 |
| Rank | #32 | #41 |
| Tests Correct (failure breakdown) | Did not follow instructions: 3, Wrong answer: 3, Timed out: 1 | Wrong answer: 8, Did not follow instructions: 1 |
| Consistency | 8.9 | 9.5 |
| Cost per result (cents) | 1.401 | 0.088 |
| Total Cost | $0.113 | $0.006 |
| Attempt pass rate | 62.2% | 42.2% |
| Flaky tests | 2 | 1 |
| Total runs | 45 (15 x 3) | 45 (15 x 3) |
| Output Tokens | 5,477 | 3,674 |
| Reasoning Tokens | 46,912 | 0 |
| Response Time (avg) | 25.92s | 3.73s |
| Response Time (max) | 88.15s | 13.73s |
| Response Time (total) | 388.79s | 55.90s |

Score: summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.
Tests Correct: a test is fully passed only if every run passed for that test.
Consistency: reflects run-to-run stability (10 = very consistent, even if consistently wrong).
Cost per result: the average cost per correct benchmark answer in cents (lower is better).
Attempt pass rate: passed attempts / total attempts across runs.
Flaky tests: tests with mixed outcomes across runs (at least one pass and one fail).
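The metric definitions above (attempt pass rate, fully passed tests, flaky tests) can be sketched in a few lines of Python. This is only an illustration of those definitions, not the benchmark's actual scoring code: the `summarize` function, the `Summary` type, and the sample outcomes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    total_runs: int
    attempt_pass_rate: float  # passed attempts / total attempts across runs
    fully_passed: int         # tests where every run passed
    flaky: int                # tests with at least one pass and one fail


def summarize(results: dict[str, list[bool]]) -> Summary:
    """results maps a test name to its per-run outcomes (True = pass)."""
    total = sum(len(runs) for runs in results.values())
    passed = sum(sum(runs) for runs in results.values())
    fully_passed = sum(1 for runs in results.values() if runs and all(runs))
    flaky = sum(1 for runs in results.values() if any(runs) and not all(runs))
    rate = passed / total if total else 0.0
    return Summary(total, rate, fully_passed, flaky)


# Hypothetical outcomes: three tests, three runs each.
example = {
    "test_a": [True, True, True],     # fully passed
    "test_b": [True, False, True],    # flaky: mixed outcomes
    "test_c": [False, False, False],  # consistently failing, not flaky
}
summary = summarize(example)
print(summary)
```

Under these definitions, and with 45 total runs (15 x 3), an attempt pass rate of 62.2% is consistent with 28 passed attempts (28 / 45) and 42.2% with 19 (19 / 45).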
[Charts: Top Models by Score; Score vs Total Cost; Response Time (avg); Avg Score vs Response Time (avg)]
Category Breakdown
Anti-AI Tricks

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failure breakdown) | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAI: GPT-5 Mini | 7.0 | 9.6 | 66.7% | 0 | Did not follow instructions: 1 | 16.45s | 26.00s | 49.36s | 1,645 | 5,824 |
| Qwen: Qwen3.5-Flash | 2.3 | 7.8 | 11.1% | 1 | Wrong answer: 3 | 1.62s | 3.89s | 4.85s | 687 | 0 |
Combined

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failure breakdown) | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAI: GPT-5 Mini | 10.0 | 10.0 | 100.0% | 0 | No failed answers | 88.15s | 88.15s | 88.15s | 754 | 11,520 |
| Qwen: Qwen3.5-Flash | 10.0 | 10.0 | 0.0% | 0 | Wrong answer: 1 | 6.22s | 6.22s | 6.22s | 1,794 | 0 |
Data parsing and extraction

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failure breakdown) | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAI: GPT-5 Mini | 9.9 | 10.0 | 100.0% | 0 | No failed answers | 12.58s | 13.87s | 25.16s | 453 | 3,200 |
| Qwen: Qwen3.5-Flash | 9.9 | 10.0 | 100.0% | 0 | No failed answers | 1.57s | 1.83s | 3.14s | 243 | 0 |
Domain specific

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failure breakdown) | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAI: GPT-5 Mini | 10.0 | 7.2 | 22.2% | 1 | Wrong answer: 2, Timed out: 1 | 44.63s | 82.55s | 133.89s | 293 | 14,016 |
| Qwen: Qwen3.5-Flash | 7.0 | 10.0 | 66.7% | 0 | Wrong answer: 1 | 905ms | 1.10s | 2.71s | 15 | 0 |
Instructions following

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failure breakdown) | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAI: GPT-5 Mini | 7.5 | 6.6 | 83.3% | 1 | Did not follow instructions: 1 | 15.66s | 21.80s | 31.32s | 318 | 4,992 |
| Qwen: Qwen3.5-Flash | 5.0 | 10.0 | 50.0% | 0 | Wrong answer: 1 | 8.81s | 13.73s | 17.61s | 63 | 0 |
Puzzle Solving

| Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct (failure breakdown) | Response Time (avg) | Response Time (max) | Response Time (total) | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OpenAI: GPT-5 Mini | 4.3 | 9.8 | 33.3% | 0 | Did not follow instructions: 1, Wrong answer: 1 | 14.09s | 16.81s | 42.28s | 1,527 | 5,760 |
| Qwen: Qwen3.5-Flash | 1.3 | 10.0 | 0.0% | 0 | Wrong answer: 2, Did not follow instructions: 1 | 5.90s | 12.19s | 17.69s | 608 | 0 |
Tool Calling
OpenAI: GPT-5 Mini
Score: 10.0
Consistency: 10.0
Attempt pass rate: 100.0%
Flaky tests: 0
Tests Correct: no failed answers
Response Time (avg): 18.64s
Response Time (max): 18.64s
Response Time (total): 18.64s
Output Tokens: 487
Reasoning Tokens: 1,600
Qwen: Qwen3.5-Flash
Score: 10.0
Consistency: 10.0
Attempt pass rate: 100.0%
Flaky tests: 0
Tests Correct: no failed answers
Response Time (avg): 3.67s
Response Time (max): 3.67s
Response Time (total): 3.67s