Cobuddy vs GPT-4o-mini benchmark comparison: The average scores are effectively tied at 4.9 (Cobuddy) vs 5.0 (GPT-4o-mini). Cobuddy has the lower benchmark cost at $0.000 vs $0.006 and the higher attempt pass rate at 47.6% vs 23.8%. GPT-4o-mini is faster, with an average response time of 1.77s vs 39.90s.
Recommended model: GPT-4o-mini - It has the best score here (5.0), while responding about 22.5x faster than Cobuddy.
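The "about 22.5x faster" figure follows from the two average response times quoted in the summary. A minimal sketch of that arithmetic, assuming the speedup is simply the ratio of the averages (the variable names are illustrative, not from the source):

```python
# Average response times in seconds, copied from the summary above.
cobuddy_avg_s = 39.90
gpt4o_mini_avg_s = 1.77

# Speedup expressed as "how many times longer Cobuddy takes on average".
speedup = cobuddy_avg_s / gpt4o_mini_avg_s
print(f"GPT-4o-mini responds about {speedup:.1f}x faster than Cobuddy")
```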
A test is fully passed only if every run passed for that test.
GPT-4o-mini failures: Wrong answer: 15; Did not follow instructions: 1.
GPT-4o-mini response time: avg 1.77s, max 7.58s, total 24.80s.
Attempt pass rate: Cobuddy 47.6%, GPT-4o-mini 23.8%. Attempt pass rate = passed attempts / total attempts across runs.
Flaky tests: Cobuddy 6, GPT-4o-mini 0. Flaky tests had mixed outcomes across runs (at least one pass and one fail).
Total runs: Cobuddy 63, GPT-4o-mini 63.
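The definitions of attempt pass rate, fully passed tests, and flaky tests can be sketched in a few lines. This is an illustration under stated assumptions, not the benchmark's actual implementation: the `summarize` function, the input layout (test name mapped to one pass/fail flag per run), and the sample data are all hypothetical.

```python
from typing import Dict, List


def summarize(outcomes: Dict[str, List[bool]]) -> Dict[str, float]:
    """Summarize per-test run outcomes (True = the run passed)."""
    attempts = [ok for runs in outcomes.values() for ok in runs]
    return {
        "total_runs": len(attempts),
        # Attempt pass rate = passed attempts / total attempts across runs.
        "attempt_pass_rate": sum(attempts) / len(attempts) if attempts else 0.0,
        # A test is fully passed only if every run passed for that test.
        "fully_passed": sum(1 for runs in outcomes.values() if runs and all(runs)),
        # A flaky test has at least one pass and at least one fail.
        "flaky": sum(1 for runs in outcomes.values() if any(runs) and not all(runs)),
    }


# Hypothetical data: three tests with three runs each.
example = {
    "test_a": [True, True, True],     # fully passed
    "test_b": [True, False, True],    # flaky
    "test_c": [False, False, False],  # consistently failing, not flaky
}
stats = summarize(example)
print(stats)
```

For scale, 30 passed attempts out of 63 would round to 47.6% and 15 out of 63 to 23.8%; the source does not state the raw pass counts, so these are only consistent with the reported rates, not confirmed by them.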
Cost per result: Cobuddy 0.000 cents, GPT-4o-mini 0.119 cents. This is the average cost per correct benchmark answer in cents (lower is better).
Total cost (current price): Cobuddy $0.000, GPT-4o-mini $0.006.
Input price: Cobuddy $0.000 / 1M tokens, GPT-4o-mini $0.150 / 1M tokens.
Output price: Cobuddy $0.000 / 1M tokens, GPT-4o-mini $0.600 / 1M tokens.
Total input tokens: Cobuddy 37,449, GPT-4o-mini 31,518.
Output tokens: Cobuddy 1,677, GPT-4o-mini 1,982.
Reasoning tokens: Cobuddy 116,703, GPT-4o-mini 0.
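The total cost can be cross-checked against the token counts and per-million prices listed above. The sketch below assumes that only input and output tokens are billed at the listed rates (the source does not say how reasoning tokens are priced); the function names are hypothetical, and the correct-answer count in the last example is made up because the source does not give it.

```python
from typing import Optional


def token_cost_usd(input_tokens: int, output_tokens: int,
                   input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in dollars, given prices in dollars per 1M tokens."""
    return (input_tokens / 1_000_000 * input_price_per_m
            + output_tokens / 1_000_000 * output_price_per_m)


def cost_per_result_cents(total_cost_usd: float, correct_answers: int) -> Optional[float]:
    """Average cost per correct answer in cents; None if nothing was correct."""
    if correct_answers == 0:
        return None
    return total_cost_usd * 100 / correct_answers


# Token counts and prices copied from the listing above.
gpt_cost = token_cost_usd(31_518, 1_982, 0.150, 0.600)
cobuddy_cost = token_cost_usd(37_449, 1_677, 0.000, 0.000)
print(f"GPT-4o-mini: ${gpt_cost:.3f}, Cobuddy: ${cobuddy_cost:.3f}")

# Hypothetical inputs, only to show the unit conversion from dollars to cents.
hypothetical = cost_per_result_cents(0.01, 4)
```

Under these assumptions the GPT-4o-mini figure works out to roughly $0.0059, which rounds to the reported $0.006.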
Response time (avg): Cobuddy 39.90s, GPT-4o-mini 1.77s.
Response time (max): Cobuddy 309.02s, GPT-4o-mini 7.58s.
Response time (total): Cobuddy 797.98s, GPT-4o-mini 24.80s.
Generation showcase
Hamster playing table tennis
Prompt: Create a detailed SVG illustration of a hamster playing table tennis.
Per-category results (category names not preserved). Each pair lists Cobuddy first, then GPT-4o-mini.
Score summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.
Consistency score reflects run-to-run stability (10 = very consistent, even if consistently wrong).

Cobuddy: score 8.7, consistency 7.9, attempt pass rate 91.7%, flaky tests 1. Failures: wrong answer 1. Response time: avg 10.00s, max 11.53s, total 39.99s.
GPT-4o-mini: score 4.8, consistency 10.0, attempt pass rate 25.0%, flaky tests 0. Failures: wrong answer 3. Response time: avg 1.34s, max 1.83s, total 2.67s.

Cobuddy: score 3.7, consistency 6.7, attempt pass rate 22.2%, flaky tests 1. Failures: API error 1, did not follow instructions 1, wrong answer 1. Response time: avg 79.17s, max 104.76s, total 158.35s.
GPT-4o-mini: score 3.2, consistency 9.6, attempt pass rate 0.0%, flaky tests 0. Failures: wrong answer 3. Response time: avg 1.63s, max 2.55s, total 4.90s.

Cobuddy: score 3.0, consistency 10.0, attempt pass rate 0.0%, flaky tests 0. Failures: invalid tool call 1. Response time: avg 47.38s, max 47.38s, total 47.38s.
GPT-4o-mini: score 3.0, consistency 10.0, attempt pass rate 0.0%, flaky tests 0. Failures: wrong answer 1. Response time: avg 7.58s, max 7.58s, total 7.58s.

Cobuddy: score 6.3, consistency 5.8, attempt pass rate 66.7%, flaky tests 1. Failures: wrong answer 1. Response time: avg 17.36s, max 26.57s, total 34.71s.
GPT-4o-mini: score 10.0, consistency 10.0, attempt pass rate 100.0%, flaky tests 0. No failed answers. Response time: avg 1.27s, max 1.27s, total 1.27s.

Cobuddy: score 2.9, consistency 4.4, attempt pass rate 22.2%, flaky tests 2. Failures: wrong answer 3. Response time: avg 128.15s, max 309.02s, total 384.46s.
GPT-4o-mini: score 3.0, consistency 10.0, attempt pass rate 0.0%, flaky tests 0. Failures: wrong answer 3. Response time: avg 637ms, max 637ms, total 637ms.

Cobuddy: score 4.2, consistency 9.9, attempt pass rate 0.0%, flaky tests 0. Failures: did not follow instructions 1. Response time: avg 23.23s, max 23.23s, total 23.23s.
GPT-4o-mini: score 4.0, consistency 10.0, attempt pass rate 0.0%, flaky tests 0. Failures: wrong answer 1. Response time: avg 909ms, max 909ms, total 909ms.

Cobuddy: score 9.8, consistency 10.0, attempt pass rate 100.0%, flaky tests 0. No failed answers. Response time: avg 11.60s, max 14.49s, total 23.20s.
GPT-4o-mini: score 6.3, consistency 10.0, attempt pass rate 50.0%, flaky tests 0. Failures: wrong answer 1. Response time: avg 1.11s, max 1.11s, total 1.11s.

Cobuddy: score 3.6, consistency 7.2, attempt pass rate 22.2%, flaky tests 1. Failures: wrong answer 2, did not follow instructions 1. Response time: avg 12.83s, max 24.40s, total 38.49s.
GPT-4o-mini: score 3.5, consistency 10.0, attempt pass rate 0.0%, flaky tests 0. Failures: wrong answer 2, did not follow instructions 1. Response time: avg 1.21s, max 1.37s, total 2.42s.

Cobuddy: score 10.0, consistency 10.0, attempt pass rate 100.0%, flaky tests 0. No failed answers. Response time: avg 11.19s, max 11.19s, total 11.19s.
GPT-4o-mini: score 10.0, consistency 10.0, attempt pass rate 100.0%, flaky tests 0. No failed answers. Response time: avg 2.51s, max 2.51s, total 2.51s.

Cobuddy: score 3.0, consistency 10.0, attempt pass rate 0.0%, flaky tests 0. Failures: wrong answer 1. Response time: avg 36.98s, max 36.98s, total 36.98s.
GPT-4o-mini: score 3.0, consistency 10.0, attempt pass rate 0.0%, flaky tests 0. Failures: wrong answer 1. Response time: avg 794ms, max 794ms, total 794ms.