Track the top SOTA AI models with AI BENCHY's benchmark leaderboard, an easy way to see which models lead right now by score, reasoning quality, reliability, and value.
Sort by: Total Cost ↓
Last updated at: 2026-03-06
Models Evaluated: 55
Rank
Model
Company
Score: Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.
Cost per result: Shows the average cost per correct benchmark answer in cents (lower is better).
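The cost-per-result column can be read as total spend divided by the number of correct answers. A minimal sketch of that arithmetic, assuming costs are tracked in US dollars and converted to cents; the function name and the sample figures are hypothetical and not taken from the leaderboard:

```python
import math


def cost_per_correct_cents(total_cost_usd: float, correct_answers: int) -> float:
    """Average cost per correct answer, in cents (lower is better).

    Assumption: "cost per result" = total cost / number of correct answers.
    Returns infinity when nothing was answered correctly.
    """
    if correct_answers <= 0:
        return math.inf
    return total_cost_usd * 100 / correct_answers


# Hypothetical figures: $1.20 spent in total, 40 correct answers.
print(round(cost_per_correct_cents(1.20, 40), 2))
```

Dividing by correct answers rather than by all attempts means a cheap model that is often wrong can still end up with a high cost per result.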
A test is fully passed only if every run passed for that test.
Total Tests: 16 · Wrong Tests: 4 · Attempt pass rate: 77.1% · Flaky tests: 1 … Output Tokens: 1,283 · Reasoning Tokens: 1,533,310 · Response time: avg 68.83s · total 1101.32s · max 280.52s
Wrong answer: 3 · Did not follow instructions: 1
Anti-AI Tricks: 10.0 · No failed answers · Response time: avg 43.87s · max 121.88s · total 131.62s
Combined: 10.0 · No failed answers · Response time: avg 280.52s · max 280.52s · total 280.52s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 7.16s · max 8.54s · total 14.31s
Domain specific: 4.0 · Wrong answer: 2 · Response time: avg 127.58s · max 133.93s · total 382.74s
General Intelligence: 10.0 · No failed answers · Response time: avg 5.25s · max 5.25s · total 5.25s
Instructions following: 9.0 · Did not follow instructions: 1 · Response time: avg 70.07s · max 136.53s · total 140.14s
Puzzle Solving: 7.0 · Wrong answer: 1 · Response time: avg 46.33s · max 134.22s · total 139.00s
Tool Calling: 10.0 · No failed answers · Response time: avg 7.73s · max 7.73s · total 7.73s
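The summary counters can be reproduced from per-run results. The sketch below applies the stated rule (a test is fully passed only if every run passed) and adds two assumptions that the source does not confirm: a "flaky" test is one with both passing and failing runs, and each test is run three times. The run data is hypothetical, chosen so that it reproduces the 16 tests, 4 wrong, 77.1%, 1 flaky summary shown above.

```python
def summarize(runs_by_test: dict[str, list[bool]]) -> dict[str, float]:
    """Derive leaderboard-style counters from per-run pass/fail results."""
    total_attempts = sum(len(runs) for runs in runs_by_test.values())
    passed_attempts = sum(sum(runs) for runs in runs_by_test.values())
    # Source rule: a test is fully passed only if every run passed.
    wrong = sum(1 for runs in runs_by_test.values() if not all(runs))
    # Assumption: "flaky" means a mix of passing and failing runs.
    flaky = sum(1 for runs in runs_by_test.values() if any(runs) and not all(runs))
    rate = 100 * passed_attempts / total_attempts if total_attempts else 0.0
    return {
        "total_tests": len(runs_by_test),
        "wrong_tests": wrong,
        "flaky_tests": flaky,
        "attempt_pass_rate": round(rate, 1),
    }


# Hypothetical data: 12 stable passes, 3 consistent failures, 1 flaky test,
# with three runs per test (an assumption).
runs = {f"stable-{i}": [True, True, True] for i in range(12)}
runs.update({f"failing-{i}": [False, False, False] for i in range(3)})
runs["flaky-0"] = [True, False, False]

print(summarize(runs))
```

Under these assumptions a flaky test counts as wrong, which is why the attempt pass rate can sit well above the share of fully passed tests.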
Total Tests: 16 · Wrong Tests: 6 · Attempt pass rate: 66.7% · Flaky tests: 2 … Output Tokens: 26,254 · Reasoning Tokens: 17,363 · Response time: avg 22.86s · total 205.71s · max 83.40s
Extra formatting: 4 · Wrong answer: 2
Anti-AI Tricks: 4.0 · Extra formatting: 2 · Response time: avg 11.88s · max 11.88s · total 11.88s
Combined: 10.0 · No failed answers · Response time: avg 76.66s · max 76.66s · total 76.66s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 7.37s · max 7.37s · total 7.37s
Domain specific: 10.0 · Extra formatting: 2 · Wrong answer: 1 · Response time: avg 83.40s · max 83.40s · total 83.40s
General Intelligence: 10.0 · No failed answers · Response time: avg 5.04s · max 5.04s · total 5.04s
Instructions following: 10.0 · No failed answers · Response time: avg 2.43s · max 2.43s · total 2.43s
Puzzle Solving: 7.0 · Wrong answer: 1 · Response time: avg 4.60s · max 4.66s · total 9.20s
Tool Calling: 10.0 · No failed answers · Response time: avg 9.73s · max 9.73s · total 9.73s
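The overall response-time line for this model appears to be an aggregate of its eight category rows. A quick check, using the figures copied from those rows; the idea that each category's number of timed responses equals its total divided by its average is an inference, not something the source states:

```python
# (category, avg_s, max_s, total_s), copied from the category rows above.
rows = [
    ("Anti-AI Tricks", 11.88, 11.88, 11.88),
    ("Combined", 76.66, 76.66, 76.66),
    ("Data parsing and extraction", 7.37, 7.37, 7.37),
    ("Domain specific", 83.40, 83.40, 83.40),
    ("General Intelligence", 5.04, 5.04, 5.04),
    ("Instructions following", 2.43, 2.43, 2.43),
    ("Puzzle Solving", 4.60, 4.66, 9.20),
    ("Tool Calling", 9.73, 9.73, 9.73),
]

# Overall total is the sum of category totals; overall max is the largest category max.
total_s = sum(total for _, _, _, total in rows)
max_s = max(mx for _, _, mx, _ in rows)

# Assumption: timed responses per category = total / avg, rounded to an integer.
timed = sum(round(total / avg) for _, avg, _, total in rows)
avg_s = total_s / timed

print(f"avg {avg_s:.2f}s · total {total_s:.2f}s · max {max_s:.2f}s")
```

The printed values agree with the reported avg 22.86s, total 205.71s, and max 83.40s for this model.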
Total Tests: 16 · Wrong Tests: 4 · Attempt pass rate: 77.1% · Flaky tests: 1 … Output Tokens: 35,159 · Reasoning Tokens: 24,687 · Response time: avg 11.23s · total 89.84s · max 46.35s
Extra formatting: 2 · Timed out: 1 · Wrong answer: 1
Anti-AI Tricks: 7.0 · Extra formatting: 1 · Response time: avg 4.95s · max 4.95s · total 4.95s
Combined: 10.0 · No failed answers · Response time: avg 46.35s · max 46.35s · total 46.35s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 13.90s · max 13.90s · total 13.90s
Domain specific: 10.0 · Extra formatting: 1 · Timed out: 1 · Wrong answer: 1 · Response time: avg 0ms · max 0ms · total 0ms
General Intelligence: 10.0 · No failed answers · Response time: avg 4.94s · max 4.94s · total 4.94s
Instructions following: 10.0 · No failed answers · Response time: avg 2.61s · max 2.61s · total 2.61s
Puzzle Solving: 10.0 · No failed answers · Response time: avg 4.80s · max 5.22s · total 9.60s
Tool Calling: 10.0 · No failed answers · Response time: avg 7.48s · max 7.48s · total 7.48s
Total Tests: 16 · Wrong Tests: 4 · Attempt pass rate: 83.3% · Flaky tests: 3 … Output Tokens: 1,756 · Reasoning Tokens: 46,642 · Response time: avg 20.05s · total 320.87s · max 100.41s
Did not follow instructions: 2 · Wrong answer: 2
Anti-AI Tricks: 10.0 · No failed answers · Response time: avg 5.02s · max 6.42s · total 15.06s
Combined: 10.0 · No failed answers · Response time: avg 20.57s · max 20.57s · total 20.57s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 5.32s · max 5.40s · total 10.64s
Domain specific: 4.0 · Wrong answer: 2 · Response time: avg 74.27s · max 100.41s · total 222.80s
General Intelligence: 5.0 · Did not follow instructions: 1 · Response time: avg 4.92s · max 4.92s · total 4.92s
Instructions following: 10.0 · No failed answers · Response time: avg 3.11s · max 3.68s · total 6.22s
Puzzle Solving: 7.0 · Did not follow instructions: 1 · Response time: avg 9.13s · max 18.14s · total 27.39s
Tool Calling: 10.0 · No failed answers · Response time: avg 13.28s · max 13.28s · total 13.28s
Total Tests: 16 · Wrong Tests: 4 · Attempt pass rate: 83.3% · Flaky tests: 2 … Output Tokens: 1,764 · Reasoning Tokens: 33,348 · Response time: avg 16.59s · total 265.39s · max 100.93s
Did not follow instructions: 2 · Wrong answer: 2
Anti-AI Tricks: 10.0 · No failed answers · Response time: avg 4.69s · max 6.68s · total 14.06s
Combined: 10.0 · No failed answers · Response time: avg 19.56s · max 19.56s · total 19.56s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 3.07s · max 3.59s · total 6.15s
Domain specific: 4.0 · Wrong answer: 2 · Response time: avg 64.31s · max 100.93s · total 192.94s
General Intelligence: 4.0 · Did not follow instructions: 1 · Response time: avg 4.87s · max 4.87s · total 4.87s
Instructions following: 10.0 · No failed answers · Response time: avg 3.04s · max 3.44s · total 6.07s
Puzzle Solving: 9.3 · Did not follow instructions: 1 · Response time: avg 5.12s · max 8.73s · total 15.37s
Tool Calling: 10.0 · No failed answers · Response time: avg 6.37s · max 6.37s · total 6.37s
Total Tests: 16 · Wrong Tests: 1 · Attempt pass rate: 93.8% · Flaky tests: 0 … Output Tokens: 1,521 · Reasoning Tokens: 35,656 · Response time: avg 16.60s · total 149.36s · max 40.61s
Wrong answer: 1
Anti-AI Tricks: 10.0 · No failed answers · Response time: avg 9.52s · max 9.52s · total 9.52s
Combined: 9.0 · No failed answers · Response time: avg 40.61s · max 40.61s · total 40.61s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 7.72s · max 7.72s · total 7.72s
Domain specific: 7.0 · Wrong answer: 1 · Response time: avg 32.73s · max 32.73s · total 32.73s
General Intelligence: 10.0 · No failed answers · Response time: avg 11.77s · max 11.77s · total 11.77s
Instructions following: 10.0 · No failed answers · Response time: avg 9.56s · max 9.56s · total 9.56s
Puzzle Solving: 10.0 · No failed answers · Response time: avg 7.15s · max 8.49s · total 14.30s
Tool Calling: 10.0 · No failed answers · Response time: avg 23.15s · max 23.15s · total 23.15s
Total Tests: 16 · Wrong Tests: 4 · Attempt pass rate: 79.2% · Flaky tests: 2 … Output Tokens: 17,292 · Reasoning Tokens: 145,625 · Response time: avg 29.74s · total 475.83s · max 119.29s
Wrong answer: 3 · Timed out: 1
Anti-AI Tricks: 10.0 · No failed answers · Response time: avg 6.99s · max 11.62s · total 20.98s
Combined: 10.0 · No failed answers · Response time: avg 107.79s · max 107.79s · total 107.79s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 23.41s · max 29.79s · total 46.83s
Domain specific: 10.0 · Wrong answer: 3 · Response time: avg 63.40s · max 119.29s · total 190.20s
General Intelligence: 10.0 · Timed out: 1 · Response time: avg 34.11s · max 34.11s · total 34.11s
Instructions following: 10.0 · No failed answers · Response time: avg 9.88s · max 15.44s · total 19.76s
Puzzle Solving: 10.0 · No failed answers · Response time: avg 17.18s · max 31.99s · total 51.55s
Tool Calling: 10.0 · No failed answers · Response time: avg 4.60s · max 4.60s · total 4.60s
Total Tests: 16 · Wrong Tests: 4 · Attempt pass rate: 81.3% · Flaky tests: 2 … Output Tokens: 1,658 · Reasoning Tokens: 200,786 · Response time: avg 52.13s · total 834.16s · max 163.96s
Did not follow instructions: 2 · Timed out: 1 · Wrong answer: 1
Anti-AI Tricks: 10.0 · No failed answers · Response time: avg 9.69s · max 10.84s · total 29.06s
Combined: 10.0 · No failed answers · Response time: avg 163.96s · max 163.96s · total 163.96s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 30.26s · max 32.03s · total 60.52s
Domain specific: 4.0 · Timed out: 1 · Wrong answer: 1 · Response time: avg 79.53s · max 95.52s · total 238.59s
General Intelligence: 5.0 · Did not follow instructions: 1 · Response time: avg 101.41s · max 101.41s · total 101.41s
Instructions following: 10.0 · No failed answers · Response time: avg 19.66s · max 32.25s · total 39.32s
Puzzle Solving: 8.3 · Did not follow instructions: 1 · Response time: avg 64.61s · max 123.57s · total 193.84s
Tool Calling: 10.0 · No failed answers · Response time: avg 7.45s · max 7.45s · total 7.45s
Timed out: 4 · Wrong answer: 2 · API error: 1 · No answer: 1 · Response time: avg 43.93s · max 106.00s · total 702.85s …
Anti-AI Tricks: 10.0 · No failed answers · Response time: avg 21.75s · max 34.96s · total 65.26s
Combined: 10.0 · No answer: 1 · Response time: avg 75.34s · max 75.34s · total 75.34s
Data parsing and extraction: 5.5 · API error: 1 · Response time: avg 59.33s · max 97.12s · total 118.65s
Domain specific: 10.0 · Timed out: 2 · Wrong answer: 1 · Response time: avg 88.34s · max 106.00s · total 265.01s
General Intelligence: 10.0 · Timed out: 1 · Response time: avg 30.30s · max 30.30s · total 30.30s
Instructions following: 10.0 · No failed answers · Response time: avg 24.45s · max 43.36s · total 48.89s
Puzzle Solving: 4.0 · Timed out: 1 · Wrong answer: 1 · Response time: avg 31.58s · max 60.18s · total 94.75s
Tool Calling: 10.0 · No failed answers · Response time: avg 4.65s · max 4.65s · total 4.65s
Total Tests: 16 · Wrong Tests: 6 · Attempt pass rate: 70.8% · Flaky tests: 3 … Output Tokens: 19,272 · Reasoning Tokens: 0 · Response time: avg 5.96s · total 95.30s · max 18.33s
Wrong answer: 4 · Did not follow instructions: 2
Anti-AI Tricks: 7.3 · Did not follow instructions: 1 · Response time: avg 4.72s · max 7.35s · total 14.17s
Combined: 10.0 · No failed answers · Response time: avg 11.96s · max 11.96s · total 11.96s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 2.21s · max 2.52s · total 4.42s
Domain specific: 10.0 · Wrong answer: 3 · Response time: avg 13.01s · max 18.33s · total 39.04s
General Intelligence: 4.0 · Did not follow instructions: 1 · Response time: avg 1.99s · max 1.99s · total 1.99s
Instructions following: 9.0 · Wrong answer: 1 · Response time: avg 3.29s · max 4.18s · total 6.59s
Puzzle Solving: 10.0 · No failed answers · Response time: avg 2.93s · max 3.05s · total 8.78s
Tool Calling: 10.0 · No failed answers · Response time: avg 8.36s · max 8.36s · total 8.36s
Total Tests: 16 · Wrong Tests: 6 · Attempt pass rate: 75.0% · Flaky tests: 4 … Output Tokens: 2,220 · Reasoning Tokens: 16,811 · Response time: avg 15.33s · total 138.01s · max 77.80s
Did not follow instructions: 3 · No answer: 1 · Timed out: 1 · Wrong answer: 1
Anti-AI Tricks: 7.0 · Did not follow instructions: 1 · Response time: avg 14.34s · max 14.34s · total 14.34s
Combined: 10.0 · No failed answers · Response time: avg 14.06s · max 14.06s · total 14.06s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 3.15s · max 3.15s · total 3.15s
Domain specific: 4.0 · Timed out: 1 · Wrong answer: 1 · Response time: avg 77.80s · max 77.80s · total 77.80s
General Intelligence: 10.0 · Did not follow instructions: 1 · Response time: avg 4.32s · max 4.32s · total 4.32s
Instructions following: 9.5 · No failed answers · Response time: avg 3.12s · max 3.12s · total 3.12s
Puzzle Solving: 7.0 · Did not follow instructions: 1 · Response time: avg 5.47s · max 6.45s · total 10.94s
Tool Calling: 10.0 · No answer: 1 · Response time: avg 10.30s · max 10.30s · total 10.30s
Total Tests: 16 · Wrong Tests: 5 · Attempt pass rate: 72.9% · Flaky tests: 1 … Output Tokens: 1,370 · Reasoning Tokens: 110,522 · Response time: avg 12.35s · total 197.62s · max 95.48s
Wrong answer: 4 · Did not follow instructions: 1
Anti-AI Tricks: 7.3 · Wrong answer: 1 · Response time: avg 6.98s · max 15.56s · total 20.95s
Combined: 10.0 · No failed answers · Response time: avg 28.44s · max 28.44s · total 28.44s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 4.06s · max 5.06s · total 8.11s
Domain specific: 4.0 · Wrong answer: 2 · Response time: avg 37.34s · max 95.48s · total 112.01s
General Intelligence: 4.0 · Did not follow instructions: 1 · Response time: avg 4.86s · max 4.86s · total 4.86s
Instructions following: 9.5 · No failed answers · Response time: avg 2.62s · max 2.78s · total 5.24s
Puzzle Solving: 7.0 · Wrong answer: 1 · Response time: avg 3.94s · max 6.33s · total 11.83s
Tool Calling: 10.0 · No failed answers · Response time: avg 6.20s · max 6.20s · total 6.20s
Total Tests: 16 · Wrong Tests: 5 · Attempt pass rate: 75.0% · Flaky tests: 2 … Output Tokens: 15,845 · Reasoning Tokens: 0 · Response time: avg 7.03s · total 112.51s · max 38.52s
Wrong answer: 4 · Did not follow instructions: 1
Anti-AI Tricks: 10.0 · No failed answers · Response time: avg 3.97s · max 4.78s · total 11.90s
Combined: 10.0 · No failed answers · Response time: avg 9.12s · max 9.12s · total 9.12s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 3.05s · max 3.33s · total 6.10s
Domain specific: 4.0 · Wrong answer: 2 · Response time: avg 17.78s · max 38.52s · total 53.33s
General Intelligence: 4.0 · Did not follow instructions: 1 · Response time: avg 3.20s · max 3.20s · total 3.20s
Instructions following: 6.0 · Wrong answer: 1 · Response time: avg 5.46s · max 6.45s · total 10.92s
Puzzle Solving: 7.0 · Wrong answer: 1 · Response time: avg 4.42s · max 5.04s · total 13.27s
Tool Calling: 10.0 · No failed answers · Response time: avg 4.68s · max 4.68s · total 4.68s
Total Tests: 16 · Wrong Tests: 6 · Attempt pass rate: 66.7% · Flaky tests: 1 … Output Tokens: 6,895 · Reasoning Tokens: 0 · Response time: avg 5.57s · total 50.12s · max 23.84s
Extra formatting: 3 · Wrong answer: 2 · Did not follow instructions: 1
Anti-AI Tricks: 4.0 · Extra formatting: 2 · Response time: avg 4.83s · max 4.83s · total 4.83s
Combined: 9.0 · No failed answers · Response time: avg 23.84s · max 23.84s · total 23.84s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 3.43s · max 3.43s · total 3.43s
Domain specific: 7.0 · Wrong answer: 1 · Response time: avg 3.54s · max 3.54s · total 3.54s
General Intelligence: 5.0 · Did not follow instructions: 1 · Response time: avg 2.56s · max 2.56s · total 2.56s
Instructions following: 5.5 · Wrong answer: 1 · Response time: avg 1.96s · max 1.96s · total 1.96s
Puzzle Solving: 7.0 · Extra formatting: 1 · Response time: avg 2.92s · max 3.33s · total 5.84s
Tool Calling: 10.0 · No failed answers · Response time: avg 4.11s · max 4.11s · total 4.11s
Total Tests: 16 · Wrong Tests: 11 · Attempt pass rate: 60.4% · Flaky tests: 9 … Output Tokens: 107,044 · Reasoning Tokens: 206,190 · Response time: avg 43.03s · total 387.25s · max 237.27s
Wrong answer: 5 · Did not follow instructions: 3 · Timed out: 2 · Invalid tool call: 1
Anti-AI Tricks: 9.3 · Did not follow instructions: 1 · Response time: avg 32.42s · max 32.42s · total 32.42s
Combined: 10.0 · Invalid tool call: 1 · Response time: avg 60.39s · max 60.39s · total 60.39s
Data parsing and extraction: 10.0 · Wrong answer: 2 · Response time: avg 7.48s · max 7.48s · total 7.48s
Domain specific: 10.0 · Wrong answer: 2 · Timed out: 1 · Response time: avg 237.27s · max 237.27s · total 237.27s
General Intelligence: 3.0 · Did not follow instructions: 1 · Response time: avg 6.63s · max 6.63s · total 6.63s
Instructions following: 8.0 · Did not follow instructions: 1 · Response time: avg 4.64s · max 4.64s · total 4.64s
Puzzle Solving: 4.0 · Timed out: 1 · Wrong answer: 1 · Response time: avg 11.54s · max 17.37s · total 23.08s
Tool Calling: 10.0 · No failed answers · Response time: avg 15.35s · max 15.35s · total 15.35s
Wrong answer: 3 · Did not follow instructions: 2 · No answer: 1 · Timed out: 1 · Response time: avg 69.83s · max 137.29s · total 628.45s …
Anti-AI Tricks: 7.0 · No answer: 1 · Response time: avg 85.28s · max 85.28s · total 85.28s
Combined: 10.0 · No failed answers · Response time: avg 71.37s · max 71.37s · total 71.37s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 49.78s · max 49.78s · total 49.78s
Domain specific: 10.0 · Wrong answer: 2 · Timed out: 1 · Response time: avg 137.29s · max 137.29s · total 137.29s
General Intelligence: 6.0 · Did not follow instructions: 1 · Response time: avg 69.73s · max 69.73s · total 69.73s
Instructions following: 10.0 · No failed answers · Response time: avg 92.47s · max 92.47s · total 92.47s
Puzzle Solving: 4.0 · Did not follow instructions: 1 · Wrong answer: 1 · Response time: avg 45.40s · max 82.75s · total 90.79s
Tool Calling: 10.0 · No failed answers · Response time: avg 31.74s · max 31.74s · total 31.74s
Total Tests: 16 · Wrong Tests: 3 · Attempt pass rate: 81.3% · Flaky tests: 0 … Output Tokens: 1,502 · Reasoning Tokens: 9,706 · Response time: avg 7.15s · total 64.34s · max 11.96s
Wrong answer: 3
Anti-AI Tricks: 10.0 · No failed answers · Response time: avg 3.75s · max 3.75s · total 3.75s
Combined: 10.0 · Wrong answer: 1 · Response time: avg 10.37s · max 10.37s · total 10.37s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 10.84s · max 10.84s · total 10.84s
Domain specific: 4.0 · Wrong answer: 2 · Response time: avg 7.01s · max 7.01s · total 7.01s
General Intelligence: 10.0 · No failed answers · Response time: avg 9.34s · max 9.34s · total 9.34s
Instructions following: 9.5 · No failed answers · Response time: avg 3.26s · max 3.26s · total 3.26s
Puzzle Solving: 10.0 · No failed answers · Response time: avg 3.91s · max 4.23s · total 7.81s
Tool Calling: 10.0 · No failed answers · Response time: avg 11.96s · max 11.96s · total 11.96s
Total Tests: 16 · Wrong Tests: 3 · Attempt pass rate: 85.4% · Flaky tests: 1 … Output Tokens: 1,735 · Reasoning Tokens: 77,212 · Response time: avg 34.45s · total 310.09s · max 79.86s
Timed out: 2 · Wrong answer: 1
Anti-AI Tricks: 10.0 · No failed answers · Response time: avg 10.37s · max 10.37s · total 10.37s
Combined: 10.0 · No failed answers · Response time: avg 46.85s · max 46.85s · total 46.85s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 46.91s · max 46.91s · total 46.91s
Domain specific: 4.0 · Timed out: 1 · Wrong answer: 1 · Response time: avg 17.50s · max 17.50s · total 17.50s
General Intelligence: 10.0 · Timed out: 1 · Response time: avg 79.86s · max 79.86s · total 79.86s
Instructions following: 10.0 · No failed answers · Response time: avg 31.93s · max 31.93s · total 31.93s
Puzzle Solving: 10.0 · No failed answers · Response time: avg 34.57s · max 49.12s · total 69.13s
Tool Calling: 10.0 · No failed answers · Response time: avg 7.54s · max 7.54s · total 7.54s
No failed answers · Response time: avg 12.36s · max 50.16s · total 111.21s …
Total Tests: 16 · Wrong Tests: 0 · Attempt pass rate: 100.0% · Flaky tests: 0 … Output Tokens: 1,634 · Reasoning Tokens: 47,907 · Response time: avg 12.36s · total 111.21s · max 50.16s
Anti-AI Tricks: 10.0 · No failed answers · Response time: avg 5.61s · max 5.61s · total 5.61s
Combined: 10.0 · No failed answers · Response time: avg 50.16s · max 50.16s · total 50.16s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 4.72s · max 4.72s · total 4.72s
Domain specific: 10.0 · No failed answers · Response time: avg 21.12s · max 21.12s · total 21.12s
General Intelligence: 10.0 · No failed answers · Response time: avg 4.09s · max 4.09s · total 4.09s
Instructions following: 10.0 · No failed answers · Response time: avg 6.10s · max 6.10s · total 6.10s
Puzzle Solving: 10.0 · No failed answers · Response time: avg 4.43s · max 4.68s · total 8.85s
Tool Calling: 10.0 · No failed answers · Response time: avg 10.55s · max 10.55s · total 10.55s
Did not follow instructions: 4 · Wrong answer: 3 · Timed out: 1 · Response time: avg 25.14s · max 88.15s · total 402.29s …
Total Tests: 16 · Wrong Tests: 8 · Attempt pass rate: 58.3% · Flaky tests: 2 … Output Tokens: 5,826 · Reasoning Tokens: 48,768 · Response time: avg 25.14s · total 402.29s · max 88.15s
Did not follow instructions: 4 · Wrong answer: 3 · Timed out: 1
Anti-AI Tricks: 7.0 · Did not follow instructions: 1 · Response time: avg 16.45s · max 26.00s · total 49.36s
Combined: 10.0 · No failed answers · Response time: avg 88.15s · max 88.15s · total 88.15s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 12.58s · max 13.87s · total 25.16s
Domain specific: 10.0 · Wrong answer: 2 · Timed out: 1 · Response time: avg 44.63s · max 82.55s · total 133.89s
General Intelligence: 4.0 · Did not follow instructions: 1 · Response time: avg 13.50s · max 13.50s · total 13.50s
Instructions following: 7.5 · Did not follow instructions: 1 · Response time: avg 15.66s · max 21.80s · total 31.32s
Puzzle Solving: 4.3 · Did not follow instructions: 1 · Wrong answer: 1 · Response time: avg 14.09s · max 16.81s · total 42.28s
Tool Calling: 10.0 · No failed answers · Response time: avg 18.64s · max 18.64s · total 18.64s
Wrong answer: 2 · Did not follow instructions: 1 · No answer: 1 · Timed out: 1 · Response time: avg 16.16s · max 28.96s · total 129.26s …
Anti-AI Tricks: 10.0 · No failed answers · Response time: avg 22.26s · max 22.26s · total 22.26s
Combined: 10.0 · No failed answers · Response time: avg 28.96s · max 28.96s · total 28.96s
Data parsing and extraction: 5.0 · No answer: 1 · Response time: avg 8.90s · max 8.90s · total 8.90s
Domain specific: 10.0 · Wrong answer: 2 · Timed out: 1 · Response time: avg 0ms · max 0ms · total 0ms
General Intelligence: 5.0 · Did not follow instructions: 1 · Response time: avg 14.69s · max 14.69s · total 14.69s
Instructions following: 10.0 · No failed answers · Response time: avg 7.25s · max 7.25s · total 7.25s
Puzzle Solving: 10.0 · No failed answers · Response time: avg 15.64s · max 16.34s · total 31.27s
Tool Calling: 10.0 · No failed answers · Response time: avg 15.93s · max 15.93s · total 15.93s
Wrong answer: 9 · Did not follow instructions: 1 · Response time: avg 1.48s · max 2.89s · total 23.64s …
Total Tests: 16 · Wrong Tests: 10 · Attempt pass rate: 41.7% · Flaky tests: 2 … Output Tokens: 1,819 · Reasoning Tokens: 0 · Response time: avg 1.48s · total 23.64s · max 2.89s
Wrong answer: 9 · Did not follow instructions: 1
Anti-AI Tricks: 10.0 · Wrong answer: 3 · Response time: avg 1.41s · max 2.58s · total 4.23s
Combined: 10.0 · Wrong answer: 1 · Response time: avg 2.89s · max 2.89s · total 2.89s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 1.04s · max 1.06s · total 2.08s
Domain specific: 4.0 · Wrong answer: 2 · Response time: avg 1.07s · max 1.54s · total 3.22s
General Intelligence: 3.0 · Wrong answer: 1 · Response time: avg 1.78s · max 1.78s · total 1.78s
Instructions following: 5.5 · Wrong answer: 1 · Response time: avg 1.07s · max 1.17s · total 2.15s
Puzzle Solving: 4.0 · Did not follow instructions: 1 · Wrong answer: 1 · Response time: avg 1.52s · max 1.82s · total 4.56s
Tool Calling: 10.0 · No failed answers · Response time: avg 2.75s · max 2.75s · total 2.75s
Wrong answer: 3 · Response time: avg 6.11s · max 14.72s · total 97.74s …
Total Tests: 16 · Wrong Tests: 3 · Attempt pass rate: 83.3% · Flaky tests: 1 … Output Tokens: 1,586 · Reasoning Tokens: 19,950 · Response time: avg 6.11s · total 97.74s · max 14.72s
Wrong answer: 3
Anti-AI Tricks: 10.0 · No failed answers · Response time: avg 3.50s · max 4.31s · total 10.49s
Combined: 10.0 · Wrong answer: 1 · Response time: avg 3.27s · max 3.27s · total 3.27s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 9.40s · max 14.72s · total 18.80s
Domain specific: 4.0 · Wrong answer: 2 · Response time: avg 8.05s · max 14.40s · total 24.15s
General Intelligence: 10.0 · No failed answers · Response time: avg 3.68s · max 3.68s · total 3.68s
Instructions following: 9.5 · No failed answers · Response time: avg 7.02s · max 7.35s · total 14.03s
Puzzle Solving: 10.0 · No failed answers · Response time: avg 6.11s · max 10.27s · total 18.32s
Tool Calling: 10.0 · No failed answers · Response time: avg 4.99s · max 4.99s · total 4.99s
Timed out: 3 · API error: 1 · Did not follow instructions: 1 · Wrong answer: 1 · Response time: avg 70.81s · max 234.29s · total 1132.90s …
Anti-AI Tricks: 10.0 · No failed answers · Response time: avg 71.35s · max 168.31s · total 214.06s
Combined: 10.0 · No failed answers · Response time: avg 17.78s · max 17.78s · total 17.78s
Data parsing and extraction: 5.5 · API error: 1 · Response time: avg 56.99s · max 80.14s · total 113.98s
Domain specific: 4.0 · Timed out: 1 · Wrong answer: 1 · Response time: avg 146.50s · max 234.29s · total 439.49s
General Intelligence: 5.0 · Did not follow instructions: 1 · Response time: avg 40.05s · max 40.05s · total 40.05s
Instructions following: 10.0 · No failed answers · Response time: avg 63.49s · max 111.61s · total 126.98s
Puzzle Solving: 4.0 · Timed out: 2 · Response time: avg 56.74s · max 115.01s · total 170.23s
Tool Calling: 10.0 · No failed answers · Response time: avg 10.33s · max 10.33s · total 10.33s
Wrong answer: 5 · Did not follow instructions: 3 · Timed out: 1 · Response time: avg 47.94s · max 204.02s · total 431.47s …
Total Tests: 16 · Wrong Tests: 9 · Attempt pass rate: 60.4% · Flaky tests: 6 … Output Tokens: 4,386 · Reasoning Tokens: 142,080 · Response time: avg 47.94s · total 431.47s · max 204.02s
Wrong answer: 5 · Did not follow instructions: 3 · Timed out: 1
Anti-AI Tricks: 7.0 · Wrong answer: 1 · Response time: avg 37.73s · max 37.73s · total 37.73s
Combined: 10.0 · No failed answers · Response time: avg 65.96s · max 65.96s · total 65.96s
Data parsing and extraction: 10.0 · Wrong answer: 2 · Response time: avg 21.42s · max 21.42s · total 21.42s
Domain specific: 4.0 · Timed out: 1 · Wrong answer: 1 · Response time: avg 204.02s · max 204.02s · total 204.02s
General Intelligence: 3.0 · Did not follow instructions: 1 · Response time: avg 17.51s · max 17.51s · total 17.51s
Instructions following: 9.0 · Did not follow instructions: 1 · Response time: avg 11.90s · max 11.90s · total 11.90s
Puzzle Solving: 4.0 · Did not follow instructions: 1 · Wrong answer: 1 · Response time: avg 19.81s · max 21.31s · total 39.63s
Tool Calling: 10.0 · No failed answers · Response time: avg 33.30s · max 33.30s · total 33.30s
Did not follow instructions: 3 · Wrong answer: 2 · No answer: 1 · Timed out: 1 · Response time: avg 26.35s · max 121.79s · total 237.11s …
Total Tests: 16 · Wrong Tests: 7 · Attempt pass rate: 66.7% · Flaky tests: 4 … Output Tokens: 1,183 · Reasoning Tokens: 83,875 · Response time: avg 26.35s · total 237.11s · max 121.79s
Did not follow instructions: 3 · Wrong answer: 2 · No answer: 1 · Timed out: 1
Anti-AI Tricks: 10.0 · No failed answers · Response time: avg 5.65s · max 5.65s · total 5.65s
Combined: 10.0 · No failed answers · Response time: avg 37.64s · max 37.64s · total 37.64s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 6.63s · max 6.63s · total 6.63s
Domain specific: 4.0 · Timed out: 1 · Wrong answer: 1 · Response time: avg 121.79s · max 121.79s · total 121.79s
General Intelligence: 3.0 · Did not follow instructions: 1 · Response time: avg 16.25s · max 16.25s · total 16.25s
Instructions following: 5.5 · Did not follow instructions: 1 · Response time: avg 5.30s · max 5.30s · total 5.30s
Puzzle Solving: 4.0 · Did not follow instructions: 1 · Wrong answer: 1 · Response time: avg 8.08s · max 8.38s · total 16.17s
Tool Calling: 10.0 · No answer: 1 · Response time: avg 27.71s · max 27.71s · total 27.71s
Wrong answer: 4 · Did not follow instructions: 1 · Response time: avg 3.83s · max 14.93s · total 61.25s …
Total Tests: 16 · Wrong Tests: 5 · Attempt pass rate: 68.8% · Flaky tests: 0 … Output Tokens: 1,731 · Reasoning Tokens: 25,821 · Response time: avg 3.83s · total 61.25s · max 14.93s
Wrong answer: 4 · Did not follow instructions: 1
Anti-AI Tricks: 9.0 · Did not follow instructions: 1 · Response time: avg 2.53s · max 3.89s · total 7.58s
Combined: 10.0 · No failed answers · Response time: avg 14.93s · max 14.93s · total 14.93s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 2.29s · max 2.31s · total 4.59s
Domain specific: 10.0 · Wrong answer: 3 · Response time: avg 4.21s · max 5.86s · total 12.62s
General Intelligence: 10.0 · No failed answers · Response time: avg 3.16s · max 3.16s · total 3.16s
Instructions following: 10.0 · No failed answers · Response time: avg 1.91s · max 1.93s · total 3.82s
Puzzle Solving: 7.0 · Wrong answer: 1 · Response time: avg 3.58s · max 4.41s · total 10.75s
Tool Calling: 10.0 · No failed answers · Response time: avg 3.80s · max 3.80s · total 3.80s
Wrong answer: 5 · Did not follow instructions: 4 · Response time: avg 2.36s · max 14.63s · total 35.39s …
Total Tests: 16 · Wrong Tests: 9 · Attempt pass rate: 54.2% · Flaky tests: 3 … Output Tokens: 3,708 · Reasoning Tokens: 45,921 · Response time: avg 2.36s · total 35.39s · max 14.63s
Wrong answer: 5 · Did not follow instructions: 4
Anti-AI Tricks: 7.3 · Did not follow instructions: 1 · Response time: avg 1.30s · max 2.46s · total 3.89s
Combined: 10.0 · No failed answers · Response time: avg 3.28s · max 3.28s · total 3.28s
Data parsing and extraction: 5.5 · Wrong answer: 1 · Response time: avg 1.11s · max 1.47s · total 2.21s
Domain specific: 10.0 · Wrong answer: 3 · Response time: avg 6.48s · max 14.63s · total 19.43s
General Intelligence: 4.0 · Did not follow instructions: 1 · Response time: avg 821ms · max 821ms · total 821ms
Instructions following: 10.0 · No failed answers · Response time: avg 1.07s · max 1.07s · total 1.07s
Puzzle Solving: 1.7 · Did not follow instructions: 2 · Wrong answer: 1 · Response time: avg 934ms · max 1.18s · total 2.80s
Tool Calling: 10.0 · No failed answers · Response time: avg 1.89s · max 1.89s · total 1.89s
Wrong answer: 7 · Did not follow instructions: 2 · No answer: 2 · invalid tool call: 1 · Response time: avg 36.84s · max 174.55s · total 331.58s …
Total Tests: 16 · Wrong Tests: 12 · Attempt pass rate: 41.7% · Flaky tests: 7 … Output Tokens: 38,682 · Reasoning Tokens: 64,952 · Response time: avg 36.84s · total 331.58s · max 174.55s
Wrong answer: 7 · Did not follow instructions: 2 · No answer: 2 · invalid tool call: 1
Anti-AI Tricks: 4.0 · Did not follow instructions: 1 · Wrong answer: 1 · Response time: avg 27.09s · max 27.09s · total 27.09s
Combined: 10.0 · invalid tool call: 1 · Response time: avg 65.57s · max 65.57s · total 65.57s
Data parsing and extraction: 5.0 · No answer: 1 · Response time: avg 1.51s · max 1.51s · total 1.51s
Domain specific: 10.0 · Wrong answer: 2 · No answer: 1 · Response time: avg 174.55s · max 174.55s · total 174.55s
General Intelligence: 10.0 · Wrong answer: 1 · Response time: avg 18.14s · max 18.14s · total 18.14s
Instructions following: 5.0 · Wrong answer: 1 · Response time: avg 2.97s · max 2.97s · total 2.97s
Puzzle Solving: 10.0 · Wrong answer: 2 · Did not follow instructions: 1 · Response time: avg 12.90s · max 22.33s · total 25.80s
Tool Calling: 10.0 · No failed answers · Response time: avg 15.95s · max 15.95s · total 15.95s
Wrong answer: 3 · API error: 1 · Did not follow instructions: 1 · Response time: avg 25.33s · max 96.01s · total 253.33s …
Total Tests: 16 · Wrong Tests: 5 · Attempt pass rate: 72.9% · Flaky tests: 1 … Output Tokens: 11,613 · Reasoning Tokens: 106,714 · Response time: avg 25.33s · total 253.33s · max 96.01s
Wrong answer: 3 · API error: 1 · Did not follow instructions: 1
Anti-AI Tricks: 9.7 · No failed answers · Response time: avg 16.79s · max 20.83s · total 33.57s
Combined: 9.0 · No failed answers · Response time: avg 75.68s · max 75.68s · total 75.68s
Data parsing and extraction: 5.5 · API error: 1 · Response time: avg 0ms · max 0ms · total 0ms
Domain specific: 4.0 · Wrong answer: 2 · Response time: avg 96.01s · max 96.01s · total 96.01s
General Intelligence: 3.0 · Did not follow instructions: 1 · Response time: avg 4.20s · max 4.20s · total 4.20s
Instructions following: 10.0 · No failed answers · Response time: avg 4.28s · max 7.37s · total 8.55s
Puzzle Solving: 7.0 · Wrong answer: 1 · Response time: avg 3.77s · max 5.26s · total 7.55s
Tool Calling: 10.0 · No failed answers · Response time: avg 27.78s · max 27.78s · total 27.78s
Timed out: 4 · Did not follow instructions: 1 · Wrong answer: 1 · Response time: avg 65.09s · max 262.83s · total 846.14s …
Total Tests: 16 · Wrong Tests: 6 · Attempt pass rate: 68.8% · Flaky tests: 2 … Output Tokens: 1,965 · Reasoning Tokens: 58,456 · Response time: avg 65.09s · total 846.14s · max 262.83s
Timed out: 4 · Did not follow instructions: 1 · Wrong answer: 1
Anti-AI Tricks: 7.0 · Timed out: 1 · Response time: avg 98.99s · max 182.10s · total 296.96s
Combined: 10.0 · No failed answers · Response time: avg 262.83s · max 262.83s · total 262.83s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 24.27s · max 27.52s · total 48.54s
Domain specific: 10.0 · Timed out: 3 · Response time: avg 0ms · max 0ms · total 0ms
General Intelligence: 6.0 · Did not follow instructions: 1 · Response time: avg 36.65s · max 36.65s · total 36.65s
Instructions following: 10.0 · No failed answers · Response time: avg 17.47s · max 19.46s · total 34.93s
Puzzle Solving: 7.0 · Wrong answer: 1 · Response time: avg 25.85s · max 32.95s · total 77.55s
Tool Calling: 10.0 · No failed answers · Response time: avg 88.68s · max 88.68s · total 88.68s
Wrong answer: 3 · Did not follow instructions: 1 · Timed out: 1 · Response time: avg 39.48s · max 93.11s · total 631.71s …
Total Tests: 16 · Wrong Tests: 5 · Attempt pass rate: 79.2% · Flaky tests: 3 … Output Tokens: 7,392 · Reasoning Tokens: 39,089 · Response time: avg 39.48s · total 631.71s · max 93.11s
Wrong answer: 3 · Did not follow instructions: 1 · Timed out: 1
Anti-AI Tricks: 7.0 · Wrong answer: 1 · Response time: avg 33.39s · max 44.23s · total 100.18s
Combined: 10.0 · No failed answers · Response time: avg 93.11s · max 93.11s · total 93.11s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 36.09s · max 39.12s · total 72.18s
Domain specific: 4.0 · Timed out: 1 · Wrong answer: 1 · Response time: avg 39.32s · max 79.03s · total 117.95s
General Intelligence: 3.0 · Did not follow instructions: 1 · Response time: avg 31.30s · max 31.30s · total 31.30s
Instructions following: 10.0 · No failed answers · Response time: avg 35.78s · max 47.30s · total 71.56s
Puzzle Solving: 7.0 · Wrong answer: 1 · Response time: avg 36.87s · max 59.22s · total 110.62s
Tool Calling: 10.0 · No failed answers · Response time: avg 34.81s · max 34.81s · total 34.81s
Wrong answer: 10 · API error: 1 · Extra formatting: 1 · Did not follow instructions: 1 · Response time: avg 2.97s · max 19.68s · total 35.60s …
Anti-AI Tricks: 10.0 · Wrong answer: 3 · Response time: avg 1.36s · max 2.73s · total 4.07s
Combined: 10.0 · Wrong answer: 1 · Response time: avg 2.87s · max 2.87s · total 2.87s
Data parsing and extraction: 10.0 · API error: 1 · Extra formatting: 1 · Response time: avg 19.68s · max 19.68s · total 19.68s
Domain specific: 4.0 · Wrong answer: 2 · Response time: avg 564ms · max 564ms · total 564ms
General Intelligence: 4.0 · Did not follow instructions: 1 · Response time: avg 1.67s · max 1.67s · total 1.67s
Instructions following: 5.5 · Wrong answer: 1 · Response time: avg 857ms · max 955ms · total 1.71s
Puzzle Solving: 10.0 · Wrong answer: 3 · Response time: avg 1.38s · max 1.74s · total 2.75s
Tool Calling: 10.0 · No failed answers · Response time: avg 2.28s · max 2.28s · total 2.28s
Wrong answer: 9 · Did not follow instructions: 1 · Response time: avg 3.72s · max 46.00s · total 59.46s …
Total Tests: 16 · Wrong Tests: 10 · Attempt pass rate: 39.6% · Flaky tests: 1 … Output Tokens: 2,679 · Reasoning Tokens: 0 · Response time: avg 3.72s · total 59.46s · max 46.00s
Wrong answer: 9 · Did not follow instructions: 1
Anti-AI Tricks: 4.0 · Wrong answer: 2 · Response time: avg 927ms · max 1.38s · total 2.78s
Combined: 10.0 · Wrong answer: 1 · Response time: avg 46.00s · max 46.00s · total 46.00s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 1.01s · max 1.06s · total 2.02s
Domain specific: 4.0 · Wrong answer: 2 · Response time: avg 465ms · max 492ms · total 1.39s
General Intelligence: 5.0 · Did not follow instructions: 1 · Response time: avg 1.12s · max 1.12s · total 1.12s
Instructions following: 4.5 · Wrong answer: 2 · Response time: avg 585ms · max 715ms · total 1.17s
Puzzle Solving: 4.0 · Wrong answer: 2 · Response time: avg 982ms · max 1.36s · total 2.95s
Tool Calling: 10.0 · No failed answers · Response time: avg 2.04s · max 2.04s · total 2.04s
Wrong answer: 4 · Did not follow instructions: 1 · Response time: avg 3.36s · max 11.91s · total 53.84s …
Total Tests: 16 · Wrong Tests: 5 · Attempt pass rate: 68.8% · Flaky tests: 0 … Output Tokens: 1,611 · Reasoning Tokens: 7,272 · Response time: avg 3.36s · total 53.84s · max 11.91s
Wrong answer: 4 · Did not follow instructions: 1
Anti-AI Tricks: 7.0 · Wrong answer: 1 · Response time: avg 2.18s · max 3.18s · total 6.53s
Combined: 10.0 · Wrong answer: 1 · Response time: avg 11.91s · max 11.91s · total 11.91s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 3.00s · max 3.74s · total 5.99s
Domain specific: 4.0 · Wrong answer: 2 · Response time: avg 2.36s · max 3.51s · total 7.07s
General Intelligence: 3.0 · Did not follow instructions: 1 · Response time: avg 1.54s · max 1.54s · total 1.54s
Instructions following: 10.0 · No failed answers · Response time: avg 1.49s · max 1.66s · total 2.99s
Puzzle Solving: 10.0 · No failed answers · Response time: avg 2.76s · max 5.08s · total 8.27s
Tool Calling: 10.0 · No failed answers · Response time: avg 9.54s · max 9.54s · total 9.54s
Wrong answer: 5 · Response time: avg 1.75s · max 3.56s · total 15.71s …
Total Tests: 16 · Wrong Tests: 5 · Attempt pass rate: 75.0% · Flaky tests: 2 … Output Tokens: 1,411 · Reasoning Tokens: 0 · Response time: avg 1.75s · total 15.71s · max 3.56s
Anti-AI Tricks: 7.0 · Wrong answer: 1 · Response time: avg 1.59s · max 1.59s · total 1.59s
Combined: 10.0 · Wrong answer: 1 · Response time: avg 3.56s · max 3.56s · total 3.56s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 1.41s · max 1.41s · total 1.41s
Domain specific: 7.0 · Wrong answer: 1 · Response time: avg 963ms · max 963ms · total 963ms
General Intelligence: 10.0 · No failed answers · Response time: avg 1.13s · max 1.13s · total 1.13s
Instructions following: 5.5 · Wrong answer: 1 · Response time: avg 1.58s · max 1.58s · total 1.58s
Puzzle Solving: 7.0 · Wrong answer: 1 · Response time: avg 1.06s · max 1.06s · total 2.12s
Tool Calling: 10.0 · No failed answers · Response time: avg 3.35s · max 3.35s · total 3.35s
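The summary fields above (Total Tests, Wrong Tests, Attempt pass rate, Flaky tests) can be related to per-run results with a minimal sketch like the one below. It follows the stated rule that a test is fully passed only if every run passed. Everything else is an assumption made for illustration: the `summarize` function, the `Summary` type, three runs per test, "flaky" meaning some but not all runs passed, and "Wrong Tests" meaning tests that are not fully passed. The sample data is hypothetical and only chosen to be consistent with the figures reported just above (16 tests, 5 wrong, 2 flaky, 75.0%).

```python
from dataclasses import dataclass


@dataclass
class Summary:
    total_tests: int
    wrong_tests: int          # tests that are not fully passed (assumed meaning)
    flaky_tests: int          # tests with mixed pass/fail runs (assumed meaning)
    attempt_pass_rate: float  # percentage of individual runs that passed


def summarize(runs: dict[str, list[bool]]) -> Summary:
    """Aggregate per-run pass/fail flags, keyed by a hypothetical test id."""
    total = len(runs)
    # A test is fully passed only if every run passed for that test.
    fully_passed = sum(1 for r in runs.values() if r and all(r))
    # Assumption: a flaky test has at least one passing and one failing run.
    flaky = sum(1 for r in runs.values() if any(r) and not all(r))
    attempts = sum(len(r) for r in runs.values())
    passed_attempts = sum(sum(r) for r in runs.values())
    rate = 100.0 * passed_attempts / attempts if attempts else 0.0
    return Summary(total, total - fully_passed, flaky, round(rate, 1))


# Hypothetical data: 11 stable passes, 2 flaky tests, 3 stable failures,
# with an assumed 3 runs per test.
example_runs = {f"pass_{i}": [True, True, True] for i in range(11)}
example_runs["flaky_a"] = [True, True, False]
example_runs["flaky_b"] = [True, False, False]
example_runs.update({f"fail_{i}": [False, False, False] for i in range(3)})

print(summarize(example_runs))
```

Under these assumptions a flaky test counts as a wrong test while still contributing its passing runs to the attempt pass rate, which is one way the two numbers can differ.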
Wrong answer: 7 · Response time: avg 4.03s · max 11.07s · total 36.30s …
Total Tests: 16 · Wrong Tests: 7 · Attempt pass rate: 56.3% · Flaky tests: 0 … Output Tokens: 1,548 · Reasoning Tokens: 0 · Response time: avg 4.03s · total 36.30s · max 11.07s
Anti-AI Tricks: 4.0 · Wrong answer: 2 · Response time: avg 3.39s · max 3.39s · total 3.39s
Combined: 10.0 · Wrong answer: 1 · Response time: avg 4.98s · max 4.98s · total 4.98s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 5.78s · max 5.78s · total 5.78s
Domain specific: 10.0 · Wrong answer: 3 · Response time: avg 2.24s · max 2.24s · total 2.24s
General Intelligence: 10.0 · No failed answers · Response time: avg 3.27s · max 3.27s · total 3.27s
Instructions following: 10.0 · No failed answers · Response time: avg 1.48s · max 1.48s · total 1.48s
Puzzle Solving: 7.0 · Wrong answer: 1 · Response time: avg 2.05s · max 2.08s · total 4.10s
Tool Calling: 10.0 · No failed answers · Response time: avg 11.07s · max 11.07s · total 11.07s
Wrong answer: 7 · Response time: avg 2.65s · max 6.65s · total 26.52s …
Total Tests: 16 · Wrong Tests: 7 · Attempt pass rate: 58.3% · Flaky tests: 1 … Output Tokens: 2,015 · Reasoning Tokens: 0 · Response time: avg 2.65s · total 26.52s · max 6.65s
Anti-AI Tricks: 4.0 · Wrong answer: 2 · Response time: avg 2.74s · max 2.74s · total 2.74s
Combined: 10.0 · Wrong answer: 1 · Response time: avg 6.65s · max 6.65s · total 6.65s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 1.89s · max 1.89s · total 1.89s
Domain specific: 4.0 · Wrong answer: 2 · Response time: avg 1.17s · max 1.44s · total 2.33s
General Intelligence: 4.0 · Wrong answer: 1 · Response time: avg 2.26s · max 2.26s · total 2.26s
Instructions following: 10.0 · No failed answers · Response time: avg 1.67s · max 1.67s · total 1.67s
Puzzle Solving: 7.0 · Wrong answer: 1 · Response time: avg 2.82s · max 3.52s · total 5.65s
Tool Calling: 10.0 · No failed answers · Response time: avg 3.33s · max 3.33s · total 3.33s
Wrong answer: 6 · Extra formatting: 2 · invalid tool call: 1 · Response time: avg 12.86s · max 115.89s · total 205.78s …
Anti-AI Tricks: 10.0 · Extra formatting: 2 · Wrong answer: 1 · Response time: avg 8.79s · max 12.26s · total 26.38s
Combined: 8.0 · invalid tool call: 1 · Response time: avg 115.89s · max 115.89s · total 115.89s
Data parsing and extraction: 5.4 · Wrong answer: 1 · Response time: avg 9.42s · max 16.20s · total 18.84s
Domain specific: 10.0 · Wrong answer: 3 · Response time: avg 1.61s · max 1.77s · total 4.83s
General Intelligence: 10.0 · No failed answers · Response time: avg 2.86s · max 2.86s · total 2.86s
Instructions following: 10.0 · No failed answers · Response time: avg 1.52s · max 1.99s · total 3.04s
Puzzle Solving: 7.7 · Wrong answer: 1 · Response time: avg 7.37s · max 10.78s · total 22.10s
Tool Calling: 10.0 · No failed answers · Response time: avg 11.85s · max 11.85s · total 11.85s
Wrong answer: 9 · Did not follow instructions: 2 · Response time: avg 1.75s · max 9.39s · total 28.05s …
Total Tests: 16 · Wrong Tests: 11 · Attempt pass rate: 37.5% · Flaky tests: 2 … Output Tokens: 3,161 · Reasoning Tokens: 0 · Response time: avg 1.75s · total 28.05s · max 9.39s
Anti-AI Tricks: 4.0 · Wrong answer: 2 · Response time: avg 796ms · max 1.34s · total 2.39s
Combined: 10.0 · Wrong answer: 1 · Response time: avg 9.39s · max 9.39s · total 9.39s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 1.43s · max 1.45s · total 2.86s
Domain specific: 10.0 · Wrong answer: 3 · Response time: avg 540ms · max 649ms · total 1.62s
General Intelligence: 5.0 · Did not follow instructions: 1 · Response time: avg 2.51s · max 2.51s · total 2.51s
Instructions following: 4.5 · Wrong answer: 2 · Response time: avg 815ms · max 973ms · total 1.63s
Puzzle Solving: 6.3 · Did not follow instructions: 1 · Wrong answer: 1 · Response time: avg 1.37s · max 2.23s · total 4.12s
Tool Calling: 10.0 · No failed answers · Response time: avg 3.54s · max 3.54s · total 3.54s
Wrong answer: 11 · Response time: avg 11.91s · max 42.13s · total 107.16s …
Total Tests: 16 · Wrong Tests: 11 · Attempt pass rate: 39.6% · Flaky tests: 3 … Output Tokens: 2,000 · Reasoning Tokens: 0 · Response time: avg 11.91s · total 107.16s · max 42.13s
Anti-AI Tricks: 2.7 · Wrong answer: 3 · Response time: avg 11.38s · max 11.38s · total 11.38s
Combined: 10.0 · Wrong answer: 1 · Response time: avg 19.16s · max 19.16s · total 19.16s
Data parsing and extraction: 5.4 · Wrong answer: 1 · Response time: avg 42.13s · max 42.13s · total 42.13s
Domain specific: 4.0 · Wrong answer: 2 · Response time: avg 4.38s · max 4.38s · total 4.38s
General Intelligence: 10.0 · No failed answers · Response time: avg 4.00s · max 4.00s · total 4.00s
Instructions following: 5.5 · Wrong answer: 1 · Response time: avg 2.67s · max 2.67s · total 2.67s
Puzzle Solving: 10.0 · Wrong answer: 3 · Response time: avg 4.73s · max 7.81s · total 9.45s
Tool Calling: 10.0 · No failed answers · Response time: avg 13.99s · max 13.99s · total 13.99s
Wrong answer: 4 · Did not follow instructions: 2 · Response time: avg 1.33s · max 3.39s · total 21.27s …
Total Tests: 16 · Wrong Tests: 6 · Attempt pass rate: 66.7% · Flaky tests: 1 … Output Tokens: 4,715 · Reasoning Tokens: 0 · Response time: avg 1.33s · total 21.27s · max 3.39s
Anti-AI Tricks: 6.0 · Did not follow instructions: 1 · Wrong answer: 1 · Response time: avg 1.16s · max 1.47s · total 3.49s
Combined: 10.0 · Wrong answer: 1 · Response time: avg 3.20s · max 3.20s · total 3.20s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 1.22s · max 1.33s · total 2.44s
Domain specific: 4.0 · Wrong answer: 2 · Response time: avg 942ms · max 1.12s · total 2.83s
General Intelligence: 3.0 · Did not follow instructions: 1 · Response time: avg 741ms · max 741ms · total 741ms
Instructions following: 10.0 · No failed answers · Response time: avg 1.13s · max 1.14s · total 2.27s
Puzzle Solving: 10.0 · No failed answers · Response time: avg 972ms · max 1.13s · total 2.92s
Tool Calling: 10.0 · No failed answers · Response time: avg 3.39s · max 3.39s · total 3.39s
Wrong answer: 8 · Did not follow instructions: 2 · Response time: avg 4.10s · max 47.43s · total 65.62s …
Total Tests: 16 · Wrong Tests: 10 · Attempt pass rate: 50.0% · Flaky tests: 3 … Output Tokens: 3,756 · Reasoning Tokens: 0 · Response time: avg 4.10s · total 65.62s · max 47.43s
Anti-AI Tricks: 10.0 · Wrong answer: 3 · Response time: avg 1.76s · max 4.39s · total 5.27s
Combined: 10.0 · Wrong answer: 1 · Response time: avg 47.43s · max 47.43s · total 47.43s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 1.16s · max 1.42s · total 2.33s
Domain specific: 7.0 · Wrong answer: 1 · Response time: avg 485ms · max 549ms · total 1.45s
General Intelligence: 6.0 · Did not follow instructions: 1 · Response time: avg 1.19s · max 1.19s · total 1.19s
Instructions following: 5.0 · Wrong answer: 1 · Response time: avg 809ms · max 983ms · total 1.62s
Puzzle Solving: 1.7 · Wrong answer: 2 · Did not follow instructions: 1 · Response time: avg 1.34s · max 2.25s · total 4.03s
Tool Calling: 10.0 · No failed answers · Response time: avg 2.30s · max 2.30s · total 2.30s
Wrong answer: 9 · Did not follow instructions: 1 · Response time: avg 923ms · max 4.39s · total 14.78s …
Total Tests: 16 · Wrong Tests: 10 · Attempt pass rate: 43.8% · Flaky tests: 2 … Output Tokens: 1,270 · Reasoning Tokens: 0 · Response time: avg 923ms · total 14.78s · max 4.39s
Anti-AI Tricks: 10.0 · Wrong answer: 3 · Response time: avg 668ms · max 844ms · total 2.01s
Combined: 10.0 · Wrong answer: 1 · Response time: avg 4.39s · max 4.39s · total 4.39s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 652ms · max 660ms · total 1.30s
Domain specific: 4.0 · Wrong answer: 2 · Response time: avg 495ms · max 642ms · total 1.49s
General Intelligence: 5.0 · Wrong answer: 1 · Response time: avg 615ms · max 615ms · total 615ms
Instructions following: 9.0 · Wrong answer: 1 · Response time: avg 672ms · max 785ms · total 1.34s
Puzzle Solving: 4.7 · Did not follow instructions: 1 · Wrong answer: 1 · Response time: avg 576ms · max 700ms · total 1.73s
Tool Calling: 10.0 · No failed answers · Response time: avg 1.91s · max 1.91s · total 1.91s
Wrong answer: 5 · Did not follow instructions: 4 · Response time: avg 16.65s · max 50.92s · total 149.88s …
Total Tests: 16 · Wrong Tests: 9 · Attempt pass rate: 54.2% · Flaky tests: 5 … Output Tokens: 13,210 · Reasoning Tokens: 34,230 · Response time: avg 16.65s · total 149.88s · max 50.92s
Anti-AI Tricks: 7.0 · Did not follow instructions: 1 · Response time: avg 19.76s · max 19.76s · total 19.76s
Combined: 10.0 · No failed answers · Response time: avg 31.18s · max 31.18s · total 31.18s
Data parsing and extraction: 5.5 · Wrong answer: 1 · Response time: avg 1.98s · max 1.98s · total 1.98s
Domain specific: 10.0 · Wrong answer: 3 · Response time: avg 50.92s · max 50.92s · total 50.92s
General Intelligence: 3.0 · Did not follow instructions: 1 · Response time: avg 7.90s · max 7.90s · total 7.90s
Instructions following: 9.5 · No failed answers · Response time: avg 7.63s · max 7.63s · total 7.63s
Puzzle Solving: 1.7 · Did not follow instructions: 2 · Wrong answer: 1 · Response time: avg 11.80s · max 12.60s · total 23.61s
Tool Calling: 9.0 · No failed answers · Response time: avg 6.91s · max 6.91s · total 6.91s
Wrong answer: 11 · Did not follow instructions: 2 · Response time: avg 1.90s · max 5.51s · total 17.14s …
Total Tests: 16 · Wrong Tests: 13 · Attempt pass rate: 25.0% · Flaky tests: 2 … Output Tokens: 1,148 · Reasoning Tokens: 0 · Response time: avg 1.90s · total 17.14s · max 5.51s
Anti-AI Tricks: 1.3 · Wrong answer: 2 · Did not follow instructions: 1 · Response time: avg 1.73s · max 1.73s · total 1.73s
Combined: 10.0 · Wrong answer: 1 · Response time: avg 3.33s · max 3.33s · total 3.33s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 943ms · max 943ms · total 943ms
Domain specific: 4.0 · Wrong answer: 2 · Response time: avg 1.06s · max 1.06s · total 1.06s
General Intelligence: 3.0 · Did not follow instructions: 1 · Response time: avg 1.08s · max 1.08s · total 1.08s
Instructions following: 10.0 · Wrong answer: 2 · Response time: avg 923ms · max 923ms · total 923ms
Puzzle Solving: 1.3 · Wrong answer: 3 · Response time: avg 1.28s · max 1.36s · total 2.56s
Tool Calling: 10.0 · Wrong answer: 1 · Response time: avg 5.51s · max 5.51s · total 5.51s
Wrong answer: 10 · Extra formatting: 1 · Did not follow instructions: 1 · Response time: avg 11.68s · max 45.14s · total 116.76s …
Total Tests: 16 · Wrong Tests: 12 · Attempt pass rate: 25.0% · Flaky tests: 0 … Output Tokens: 3,026 · Reasoning Tokens: 0 · Response time: avg 11.68s · total 116.76s · max 45.14s
Anti-AI Tricks: 2.3 · Extra formatting: 1 · Did not follow instructions: 1 · Wrong answer: 1 · Response time: avg 4.39s · max 4.39s · total 4.39s
Combined: 10.0 · Wrong answer: 1 · Response time: avg 45.14s · max 45.14s · total 45.14s
Data parsing and extraction: 5.4 · Wrong answer: 1 · Response time: avg 1.32s · max 1.32s · total 1.32s
Domain specific: 4.0 · Wrong answer: 2 · Response time: avg 962ms · max 962ms · total 962ms
General Intelligence: 10.0 · No failed answers · Response time: avg 1.34s · max 1.34s · total 1.34s
Instructions following: 4.5 · Wrong answer: 2 · Response time: avg 7.71s · max 14.65s · total 15.42s
Puzzle Solving: 1.3 · Wrong answer: 3 · Response time: avg 22.86s · max 42.58s · total 45.73s
Tool Calling: 10.0 · No failed answers · Response time: avg 2.47s · max 2.47s · total 2.47s
Wrong answer: 8 · Did not follow instructions: 5 · Response time: avg 12.53s · max 81.80s · total 125.32s …
Total Tests: 16 · Wrong Tests: 13 · Attempt pass rate: 27.1% · Flaky tests: 2 … Output Tokens: 2,935 · Reasoning Tokens: 0 · Response time: avg 12.53s · total 125.32s · max 81.80s
Anti-AI Tricks: 1.3 · Wrong answer: 2 · Did not follow instructions: 1 · Response time: avg 15.28s · max 15.28s · total 15.28s
Combined: 10.0 · Wrong answer: 1 · Response time: avg 4.28s · max 4.28s · total 4.28s
Data parsing and extraction: 5.4 · Wrong answer: 1 · Response time: avg 81.80s · max 81.80s · total 81.80s
Domain specific: 4.0 · Wrong answer: 2 · Response time: avg 638ms · max 638ms · total 638ms
General Intelligence: 6.0 · Did not follow instructions: 1 · Response time: avg 1.39s · max 1.39s · total 1.39s
Instructions following: 4.5 · Did not follow instructions: 1 · Wrong answer: 1 · Response time: avg 7.34s · max 13.67s · total 14.68s
Puzzle Solving: 10.0 · Did not follow instructions: 2 · Wrong answer: 1 · Response time: avg 2.30s · max 3.80s · total 4.61s
Tool Calling: 10.0 · No failed answers · Response time: avg 2.64s · max 2.64s · total 2.64s
Wrong answer: 11 · Did not follow instructions: 1 · Response time: avg 596ms · max 1.27s · total 9.54s …
Total Tests: 16 · Wrong Tests: 12 · Attempt pass rate: 31.3% · Flaky tests: 2 … Output Tokens: 1,303 · Reasoning Tokens: 0 · Response time: avg 596ms · total 9.54s · max 1.27s
Anti-AI Tricks: 10.0 · Wrong answer: 3 · Response time: avg 466ms · max 716ms · total 1.40s
Combined: 10.0 · Wrong answer: 1 · Response time: avg 606ms · max 606ms · total 606ms
Data parsing and extraction: 5.5 · Wrong answer: 1 · Response time: avg 667ms · max 819ms · total 1.33s
Domain specific: 4.0 · Wrong answer: 2 · Response time: avg 534ms · max 733ms · total 1.60s
General Intelligence: 4.0 · Did not follow instructions: 1 · Response time: avg 628ms · max 628ms · total 628ms
Instructions following: 5.5 · Wrong answer: 1 · Response time: avg 551ms · max 622ms · total 1.10s
Puzzle Solving: 10.0 · Wrong answer: 3 · Response time: avg 533ms · max 637ms · total 1.60s
Tool Calling: 10.0 · No failed answers · Response time: avg 1.27s · max 1.27s · total 1.27s
Wrong answer: 8 · Did not follow instructions: 1 · Response time: avg 3.54s · max 13.73s · total 56.70s …
Total Tests: 16 · Wrong Tests: 9 · Attempt pass rate: 45.8% · Flaky tests: 1 … Output Tokens: 3,774 · Reasoning Tokens: 0 · Response time: avg 3.54s · total 56.70s · max 13.73s
Anti-AI Tricks: 2.3 · Wrong answer: 3 · Response time: avg 1.62s · max 3.89s · total 4.85s
Combined: 10.0 · Wrong answer: 1 · Response time: avg 6.22s · max 6.22s · total 6.22s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 1.57s · max 1.83s · total 3.14s
Domain specific: 7.0 · Wrong answer: 1 · Response time: avg 905ms · max 1.10s · total 2.71s
General Intelligence: 10.0 · No failed answers · Response time: avg 803ms · max 803ms · total 803ms
Instructions following: 5.0 · Wrong answer: 1 · Response time: avg 8.81s · max 13.73s · total 17.61s
Puzzle Solving: 1.3 · Wrong answer: 2 · Did not follow instructions: 1 · Response time: avg 5.90s · max 12.19s · total 17.69s
Tool Calling: 10.0 · No failed answers · Response time: avg 3.67s · max 3.67s · total 3.67s
Wrong answer: 11 · Did not follow instructions: 1 · Response time: avg 2.07s · max 7.58s · total 18.60s …
Total Tests: 16 · Wrong Tests: 12 · Attempt pass rate: 25.0% · Flaky tests: 0 … Output Tokens: 1,594 · Reasoning Tokens: 0 · Response time: avg 2.07s · total 18.60s · max 7.58s
Anti-AI Tricks: 4.0 · Wrong answer: 2 · Response time: avg 1.83s · max 1.83s · total 1.83s
Combined: 10.0 · Wrong answer: 1 · Response time: avg 7.58s · max 7.58s · total 7.58s
Data parsing and extraction: 9.9 · No failed answers · Response time: avg 1.27s · max 1.27s · total 1.27s
Domain specific: 10.0 · Wrong answer: 3 · Response time: avg 637ms · max 637ms · total 637ms
General Intelligence: 3.0 · Wrong answer: 1 · Response time: avg 909ms · max 909ms · total 909ms
Instructions following: 4.5 · Did not follow instructions: 1 · Wrong answer: 1 · Response time: avg 1.27s · max 1.27s · total 1.27s
Puzzle Solving: 2.3 · Wrong answer: 3 · Response time: avg 1.30s · max 1.54s · total 2.60s
Tool Calling: 10.0 · No failed answers · Response time: avg 2.51s · max 2.51s · total 2.51s
Wrong answer: 9 · Did not follow instructions: 2 · invalid tool call: 1 · Response time: avg 2.99s · max 7.05s · total 26.90s …
Total Tests: 16 · Wrong Tests: 12 · Attempt pass rate: 35.4% · Flaky tests: 3 … Output Tokens: 1,855 · Reasoning Tokens: 0 · Response time: avg 2.99s · total 26.90s · max 7.05s
Anti-AI Tricks: 10.0 · Wrong answer: 3 · Response time: avg 6.59s · max 6.59s · total 6.59s
Combined: 10.0 · invalid tool call: 1 · Response time: avg 3.22s · max 3.22s · total 3.22s
Data parsing and extraction: 5.4 · Wrong answer: 1 · Response time: avg 4.82s · max 4.82s · total 4.82s
Domain specific: 7.0 · Wrong answer: 1 · Response time: avg 744ms · max 744ms · total 744ms
General Intelligence: 3.0 · Wrong answer: 1 · Response time: avg 1.59s · max 1.59s · total 1.59s
Instructions following: 5.5 · Wrong answer: 1 · Response time: avg 888ms · max 888ms · total 888ms
Puzzle Solving: 3.7 · Did not follow instructions: 2 · Wrong answer: 1 · Response time: avg 1.00s · max 1.12s · total 2.00s
Tool Calling: 10.0 · Wrong answer: 1 · Response time: avg 7.05s · max 7.05s · total 7.05s
Wrong answer: 9 · API error: 4 · Did not follow instructions: 2 · Response time: avg 811ms · max 2.88s · total 11.35s …
Total Tests: 16 · Wrong Tests: 15 · Attempt pass rate: 14.6% · Flaky tests: 2 … Output Tokens: 1,185 · Reasoning Tokens: 0 · Response time: avg 811ms · total 11.35s · max 2.88s
Anti-AI Tricks: 10.0 · Wrong answer: 3 · Response time: avg 471ms · max 872ms · total 1.41s
Combined: 10.0 · API error: 1 · Response time: avg 0ms · max 0ms · total 0ms
Data parsing and extraction: 10.0 · Wrong answer: 2 · Response time: avg 714ms · max 987ms · total 1.43s
Domain specific: 4.0 · API error: 1 · Wrong answer: 1 · Response time: avg 287ms · max 334ms · total 860ms
General Intelligence: 3.0 · Did not follow instructions: 1 · Response time: avg 395ms · max 395ms · total 395ms
Instructions following: 4.5 · Wrong answer: 2 · Response time: avg 1.09s · max 1.90s · total 2.18s
Puzzle Solving: 3.3 · API error: 1 · Did not follow instructions: 1 · Wrong answer: 1 · Response time: avg 1.69s · max 2.88s · total 5.08s
Tool Calling: 10.0 · API error: 1 · Response time: avg 0ms · max 0ms · total 0ms
Did not follow instructions: 3; Wrong answer: 3; Response time: avg 29.10s · max 170.45s · total 290.96s…
Total Tests: 16; Wrong Tests: 6; Attempt pass rate: 68.8%; Flaky tests: 2… Output Tokens: 71,452; Reasoning Tokens: 155,147; Response time: avg 29.10s · total 290.96s · max 170.45s
Anti-AI Tricks: 10.0; No failed answers; Response time: avg 18.54s · max 32.30s · total 37.07s
Combined: 10.0; No failed answers; Response time: avg 29.57s · max 29.57s · total 29.57s
Data parsing and extraction: 10.0; No failed answers; Response time: avg 15.01s · max 15.01s · total 15.01s
Domain specific: 4.0; Wrong answer: 2; Response time: avg 170.45s · max 170.45s · total 170.45s
General Intelligence: 6.0; Did not follow instructions: 1; Response time: avg 6.54s · max 6.54s · total 6.54s
Instructions following: 9.0; Did not follow instructions: 1; Response time: avg 4.98s · max 4.98s · total 4.98s
Puzzle Solving: 4.0; Did not follow instructions: 1; Wrong answer: 1; Response time: avg 7.72s · max 10.60s · total 15.44s
Tool Calling: 10.0; No failed answers; Response time: avg 11.91s · max 11.91s · total 11.91s
Wrong answer: 9; Did not follow instructions: 2; Response time: avg 3.15s · max 8.91s · total 50.46s…
Total Tests: 16; Wrong Tests: 11; Attempt pass rate: 33.3%; Flaky tests: 1… Output Tokens: 1,837; Reasoning Tokens: 0; Response time: avg 3.15s · total 50.46s · max 8.91s
Anti-AI Tricks: 10.0; Wrong answer: 3; Response time: avg 3.59s · max 8.17s · total 10.78s
Combined: 10.0; Wrong answer: 1; Response time: avg 8.91s · max 8.91s · total 8.91s
Data parsing and extraction: 9.9; No failed answers; Response time: avg 3.26s · max 4.66s · total 6.52s
Domain specific: 4.0; Wrong answer: 2; Response time: avg 877ms · max 894ms · total 2.63s
General Intelligence: 3.0; Did not follow instructions: 1; Response time: avg 2.86s · max 2.86s · total 2.86s
Instructions following: 3.5; Did not follow instructions: 1; Wrong answer: 1; Response time: avg 1.09s · max 1.23s · total 2.19s
Puzzle Solving: 4.0; Wrong answer: 2; Response time: avg 3.30s · max 4.81s · total 9.91s
Tool Calling: 10.0; No failed answers; Response time: avg 6.67s · max 6.67s · total 6.67s
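The leaderboard counts a test as fully passed only if every run of that test passed, and reports an attempt pass rate alongside the number of wrong and flaky tests. The sketch below shows one way such totals could be derived from per-run results. It is an illustration, not the site's actual scoring code: the names `summarize` and `Summary`, the sample data, and the definition of a flaky test as one with mixed pass and fail runs are all assumptions. The published rates are consistent with three runs per test (for example, 7 passing attempts out of 48 is about 14.6%), but the page does not state the run count.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    total_tests: int
    wrong_tests: int          # tests with at least one failed run
    flaky_tests: int          # assumed: tests with both passing and failing runs
    attempt_pass_rate: float  # percentage of individual runs that passed


def summarize(runs_by_test: dict[str, list[bool]]) -> Summary:
    """Aggregate per-run pass/fail flags into leaderboard-style totals (hypothetical helper)."""
    all_runs = list(runs_by_test.values())
    total_attempts = sum(len(runs) for runs in all_runs)
    passed_attempts = sum(sum(runs) for runs in all_runs)

    # A test is fully passed only if every run passed for that test.
    fully_passed = sum(1 for runs in all_runs if runs and all(runs))

    # Assumption: a flaky test passed on some runs and failed on others.
    flaky = sum(1 for runs in all_runs if any(runs) and not all(runs))

    rate = 100.0 * passed_attempts / total_attempts if total_attempts else 0.0
    return Summary(
        total_tests=len(all_runs),
        wrong_tests=len(all_runs) - fully_passed,
        flaky_tests=flaky,
        attempt_pass_rate=round(rate, 1),
    )


# Made-up results: four tests, three runs each.
example = {
    "test_a": [True, True, True],     # fully passed
    "test_b": [True, False, True],    # flaky, counts as wrong
    "test_c": [False, False, False],  # consistently wrong
    "test_d": [True, True, False],    # flaky, counts as wrong
}
summary = summarize(example)
print(summary)
```

Under this reading, a model can have a fairly high attempt pass rate while still showing many wrong tests, because a single failed run is enough to mark the whole test as wrong.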