A test is fully passed only if every run passed for that test.
No failed answers · Response time: avg 12.11s · max 82.37s · total 217.93s…
Total Tests: 18 · Wrong Tests: 0 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 100.0% · Flaky tests: 0 … Output Tokens: 655 · Reasoning Tokens: 33,749 · Response time: avg 12.11s · total 217.93s · max 82.37s
Anti-AI Tricks: 10.0 · No failed answers · Response time: avg 3.26s · max 5.01s · total 13.04s
Coding: 10.0 · No failed answers · Response time: avg 82.37s · max 82.37s · total 82.37s
Combined: 10.0 · No failed answers · Response time: avg 23.58s · max 23.58s · total 23.58s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 7.62s · max 8.37s · total 15.24s
Domain specific: 10.0 · No failed answers · Response time: avg 14.81s · max 32.44s · total 44.43s
General Intelligence: 10.0 · No failed answers · Response time: avg 6.34s · max 6.34s · total 6.34s
Instructions following: 10.0 · No failed answers · Response time: avg 4.30s · max 5.19s · total 8.59s
Puzzle Solving: 10.0 · No failed answers · Response time: avg 4.86s · max 7.59s · total 14.57s
Tool Calling: 10.0 · No failed answers · Response time: avg 9.78s · max 9.78s · total 9.78s
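The rule that a test is fully passed only if every run passed, together with the attempt pass rate and flaky-test counts reported for each model, can be sketched in a few lines of Python. This is a minimal sketch under assumptions, not the benchmark's actual scoring code: the `TestResult` type, the three runs per test, the definition of "flaky" (some but not all runs passed), and the definition of "attempt pass rate" (passed runs divided by all runs) are hypothetical and are not stated in the source.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TestResult:
    """Hypothetical record: one boolean per run (attempt) of a test."""
    name: str
    runs: List[bool]


def fully_passed(test: TestResult) -> bool:
    # Fully passed only if every run passed; a test with no runs does not count.
    return bool(test.runs) and all(test.runs)


def is_flaky(test: TestResult) -> bool:
    # Assumed definition: at least one run passed and at least one run failed.
    return any(test.runs) and not all(test.runs)


def attempt_pass_rate(tests: List[TestResult]) -> float:
    # Assumed definition: passed runs over all runs, as a percentage.
    attempts = [run for test in tests for run in test.runs]
    return 100.0 * sum(attempts) / len(attempts) if attempts else 0.0


# Hypothetical data: 18 tests with 3 runs each, one test failing every run.
tests = [TestResult(f"test-{i}", [True, True, True]) for i in range(17)]
tests.append(TestResult("test-17", [False, False, False]))

wrong = sum(not fully_passed(t) for t in tests)
flaky = sum(is_flaky(t) for t in tests)
print(f"Wrong tests: {wrong}, flaky tests: {flaky}, "
      f"attempt pass rate: {attempt_pass_rate(tests):.1f}%")
```

Under these assumed definitions, a test that fails on every run counts as wrong but not flaky, while a test that passes on only some runs counts as both wrong and flaky.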
Wrong answer: 1 · Response time: avg 15.96s · max 40.61s · total 175.52s…
Total Tests: 18 · Wrong Tests: 1 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 94.4% · Flaky tests: 0 … Output Tokens: 1,932 · Reasoning Tokens: 40,542 · Response time: avg 15.96s · total 175.52s · max 40.61s
Anti-AI Tricks: 10.0 · No failed answers · Response time: avg 7.90s · max 9.52s · total 15.80s
Coding: 10.0 · No failed answers · Response time: avg 19.88s · max 19.88s · total 19.88s
Combined: 9.5 · No failed answers · Response time: avg 40.61s · max 40.61s · total 40.61s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 7.72s · max 7.72s · total 7.72s
Domain specific: 7.7 · Wrong answer: 1 · Response time: avg 32.73s · max 32.73s · total 32.73s
General Intelligence: 10.0 · No failed answers · Response time: avg 11.77s · max 11.77s · total 11.77s
Instructions following: 10.0 · No failed answers · Response time: avg 9.56s · max 9.56s · total 9.56s
Puzzle Solving: 10.0 · No failed answers · Response time: avg 7.15s · max 8.49s · total 14.30s
Tool Calling: 10.0 · No failed answers · Response time: avg 23.15s · max 23.15s · total 23.15s
Timed out: 1 · Wrong answer: 1 · Response time: avg 3.53s · max 21.45s · total 60.03s…
Total Tests: 18 · Wrong Tests: 2 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 88.9% · Flaky tests: 0 … Output Tokens: 5,375 · Reasoning Tokens: 1,341 · Response time: avg 3.53s · total 60.03s · max 21.45s
Anti-AI Tricks: 8.3 · Wrong answer: 1 · Response time: avg 1.85s · max 2.71s · total 7.38s
Coding: 10.0 · No failed answers · Response time: avg 6.41s · max 6.41s · total 6.41s
Combined: 10.0 · No failed answers · Response time: avg 21.45s · max 21.45s · total 21.45s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 2.37s · max 3.30s · total 4.74s
Domain specific: 7.7 · Timed out: 1 · Response time: avg 1.17s · max 1.40s · total 2.35s
General Intelligence: 10.0 · No failed answers · Response time: avg 2.87s · max 2.87s · total 2.87s
Instructions following: 10.0 · No failed answers · Response time: avg 1.57s · max 1.66s · total 3.14s
Puzzle Solving: 10.0 · No failed answers · Response time: avg 2.51s · max 2.89s · total 7.54s
Tool Calling: 10.0 · No failed answers · Response time: avg 4.17s · max 4.17s · total 4.17s
Wrong answer: 2 · Response time: avg 3.13s · max 18.27s · total 56.33s…
Total Tests: 18 · Wrong Tests: 2 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 88.9% · Flaky tests: 0 … Output Tokens: 6,326 · Reasoning Tokens: 0 · Response time: avg 3.13s · total 56.33s · max 18.27s
Anti-AI Tricks: 8.3 · Wrong answer: 1 · Response time: avg 2.12s · max 3.75s · total 8.50s
Coding: 10.0 · No failed answers · Response time: avg 2.84s · max 2.84s · total 2.84s
Combined: 9.5 · No failed answers · Response time: avg 18.27s · max 18.27s · total 18.27s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 2.15s · max 2.33s · total 4.29s
Domain specific: 7.7 · Wrong answer: 1 · Response time: avg 1.19s · max 1.40s · total 3.58s
General Intelligence: 10.0 · No failed answers · Response time: avg 3.47s · max 3.47s · total 3.47s
Instructions following: 10.0 · No failed answers · Response time: avg 1.46s · max 1.68s · total 2.91s
Puzzle Solving: 10.0 · No failed answers · Response time: avg 2.58s · max 4.07s · total 7.73s
Tool Calling: 10.0 · No failed answers · Response time: avg 4.74s · max 4.74s · total 4.74s
Wrong answer: 2 · Did not follow instructions: 1 · Response time: avg 32.75s · max 332.10s · total 589.59s…
Total Tests: 18 · Wrong Tests: 3 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 87.0% · Flaky tests: 2 … Output Tokens: 1,920 · Reasoning Tokens: 89,632 · Response time: avg 32.75s · total 589.59s · max 332.10s
Anti-AI Tricks: 10.0 · No failed answers · Response time: avg 4.66s · max 6.74s · total 18.65s
Coding: 10.0 · No failed answers · Response time: avg 9.09s · max 9.09s · total 9.09s
Combined: 10.0 · No failed answers · Response time: avg 19.29s · max 19.29s · total 19.29s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 4.18s · max 4.35s · total 8.36s
Domain specific: 5.3 · Wrong answer: 2 · Response time: avg 164.14s · max 332.10s · total 492.41s
General Intelligence: 10.0 · No failed answers · Response time: avg 4.16s · max 4.16s · total 4.16s
Instructions following: 10.0 · No failed answers · Response time: avg 3.36s · max 3.46s · total 6.73s
Puzzle Solving: 8.6 · Did not follow instructions: 1 · Response time: avg 6.78s · max 10.54s · total 20.33s
Tool Calling: 10.0 · No failed answers · Response time: avg 10.57s · max 10.57s · total 10.57s
Wrong answer: 3 · Response time: avg 6.01s · max 14.72s · total 108.12s…
Total Tests: 18 · Wrong Tests: 3 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 85.2% · Flaky tests: 1 … Output Tokens: 2,018 · Reasoning Tokens: 23,273 · Response time: avg 6.01s · total 108.12s · max 14.72s
Anti-AI Tricks: 10.0 · No failed answers · Response time: avg 3.48s · max 4.31s · total 13.94s
Coding: 10.0 · No failed answers · Response time: avg 6.94s · max 6.94s · total 6.94s
Combined: 3.0 · Wrong answer: 1 · Response time: avg 3.27s · max 3.27s · total 3.27s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 9.40s · max 14.72s · total 18.80s
Domain specific: 5.3 · Wrong answer: 2 · Response time: avg 8.05s · max 14.40s · total 24.15s
General Intelligence: 10.0 · No failed answers · Response time: avg 3.68s · max 3.68s · total 3.68s
Instructions following: 9.9 · No failed answers · Response time: avg 7.02s · max 7.35s · total 14.03s
Puzzle Solving: 10.0 · No failed answers · Response time: avg 6.11s · max 10.27s · total 18.32s
Tool Calling: 10.0 · No failed answers · Response time: avg 4.99s · max 4.99s · total 4.99s
Wrong answer: 3 · Did not follow instructions: 2 · Response time: avg 30.37s · max 168.71s · total 546.72s…
Total Tests: 18 · Wrong Tests: 5 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 83.3% · Flaky tests: 3 … Output Tokens: 3,257 · Reasoning Tokens: 52,042 · Response time: avg 30.37s · total 546.72s · max 168.71s
Anti-AI Tricks: 8.3 · Wrong answer: 1 · Response time: avg 17.99s · max 48.33s · total 71.98s
Coding: 10.0 · No failed answers · Response time: avg 74.49s · max 74.49s · total 74.49s
Combined: 10.0 · No failed answers · Response time: avg 37.67s · max 37.67s · total 37.67s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 9.07s · max 12.19s · total 18.14s
Domain specific: 5.9 · Wrong answer: 2 · Response time: avg 88.74s · max 168.71s · total 266.21s
Instructions following: 10.0 · No failed answers · Response time: avg 7.26s · max 9.02s · total 14.52s
Puzzle Solving: 9.0 · Did not follow instructions: 1 · Response time: avg 11.03s · max 13.85s · total 33.09s
Tool Calling: 10.0 · No failed answers · Response time: avg 12.38s · max 12.38s · total 12.38s
Wrong answer: 3 · Did not follow instructions: 2 · Response time: avg 15.38s · max 100.93s · total 276.91s…
Total Tests: 18 · Wrong Tests: 5 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 83.3% · Flaky tests: 3 … Output Tokens: 2,279 · Reasoning Tokens: 35,179 · Response time: avg 15.38s · total 276.91s · max 100.93s
Anti-AI Tricks: 8.7 · Wrong answer: 1 · Response time: avg 4.16s · max 6.68s · total 16.63s
Coding: 10.0 · No failed answers · Response time: avg 8.95s · max 8.95s · total 8.95s
Combined: 10.0 · No failed answers · Response time: avg 19.56s · max 19.56s · total 19.56s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 3.07s · max 3.59s · total 6.15s
Domain specific: 5.9 · Wrong answer: 2 · Response time: avg 64.31s · max 100.93s · total 192.94s
Instructions following: 10.0 · No failed answers · Response time: avg 3.04s · max 3.44s · total 6.07s
Puzzle Solving: 9.0 · Did not follow instructions: 1 · Response time: avg 5.12s · max 8.73s · total 15.37s
Tool Calling: 10.0 · No failed answers · Response time: avg 6.37s · max 6.37s · total 6.37s
Timed out: 2 · Wrong answer: 2 · Response time: avg 46.56s · max 120.91s · total 512.20s…
Total Tests: 18 · Wrong Tests: 4 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 83.3% · Flaky tests: 2 … Output Tokens: 2,121 · Reasoning Tokens: 111,889 · Response time: avg 46.56s · total 512.20s · max 120.91s
Anti-AI Tricks: 8.2 · Wrong answer: 1 · Response time: avg 45.78s · max 81.20s · total 91.57s
Coding: 10.0 · No failed answers · Response time: avg 120.91s · max 120.91s · total 120.91s
Combined: 10.0 · No failed answers · Response time: avg 46.85s · max 46.85s · total 46.85s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 46.91s · max 46.91s · total 46.91s
Domain specific: 5.3 · Timed out: 1 · Wrong answer: 1 · Response time: avg 17.50s · max 17.50s · total 17.50s
General Intelligence: 4.7 · Timed out: 1 · Response time: avg 79.86s · max 79.86s · total 79.86s
Instructions following: 10.0 · No failed answers · Response time: avg 31.93s · max 31.93s · total 31.93s
Puzzle Solving: 10.0 · No failed answers · Response time: avg 34.57s · max 49.12s · total 69.13s
Tool Calling: 10.0 · No failed answers · Response time: avg 7.54s · max 7.54s · total 7.54s
Total Tests: 18 · Wrong Tests: 5 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 81.5% · Flaky tests: 3 … Output Tokens: 238,920 · Reasoning Tokens: 0 · Response time: avg 55.19s · total 938.23s · max 149.94s
Anti-AI Tricks: 10.0 · No failed answers · Response time: avg 32.69s · max 85.41s · total 130.78s
Coding: 10.0 · No failed answers · Response time: avg 99.76s · max 99.76s · total 99.76s
Combined: 10.0 · No failed answers · Response time: avg 113.09s · max 113.09s · total 113.09s
Data parsing and extraction: 6.5 · API error: 1 · Response time: avg 12.11s · max 12.11s · total 12.11s
Domain specific: 5.3 · Wrong answer: 2 · Response time: avg 109.04s · max 149.94s · total 327.11s
General Intelligence: 10.0 · No failed answers · Response time: avg 24.31s · max 24.31s · total 24.31s
Puzzle Solving: 9.0 · Did not follow instructions: 1 · Response time: avg 28.07s · max 45.06s · total 84.21s
Tool Calling: 10.0 · No failed answers · Response time: avg 78.83s · max 78.83s · total 78.83s
Wrong answer: 3 · Did not follow instructions: 1 · Response time: avg 13.94s · max 43.55s · total 237.01s…
Total Tests: 17 · Wrong Tests: 4 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 76.5% · Flaky tests: 0 … Output Tokens: 1,756 · Reasoning Tokens: 77,213 · Response time: avg 13.94s · total 237.01s · max 43.55s
Anti-AI Tricks: 10.0 · No failed answers · Response time: avg 9.90s · max 19.37s · total 39.60s
Combined: 10.0 · No failed answers · Response time: avg 34.95s · max 34.95s · total 34.95s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 14.95s · max 15.40s · total 29.90s
Domain specific: 3.0 · Wrong answer: 3 · Response time: avg 22.08s · max 43.55s · total 66.23s
Instructions following: 10.0 · No failed answers · Response time: avg 7.54s · max 11.67s · total 15.07s
Puzzle Solving: 10.0 · No failed answers · Response time: avg 6.11s · max 7.52s · total 18.34s
Tool Calling: 10.0 · No failed answers · Response time: avg 5.87s · max 5.87s · total 5.87s
Total Tests: 18 · Wrong Tests: 5 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 81.5% · Flaky tests: 3 … Output Tokens: 2,500 · Reasoning Tokens: 242,500 · Response time: avg 53.03s · total 954.46s · max 163.96s
Anti-AI Tricks: 8.7 · Extra formatting: 1 · Response time: avg 19.75s · max 49.95s · total 79.01s
Coding: 10.0 · No failed answers · Response time: avg 70.35s · max 70.35s · total 70.35s
Combined: 10.0 · No failed answers · Response time: avg 163.96s · max 163.96s · total 163.96s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 30.26s · max 32.03s · total 60.52s
Domain specific: 5.3 · Timed out: 1 · Wrong answer: 1 · Response time: avg 79.53s · max 95.52s · total 238.59s
Instructions following: 10.0 · No failed answers · Response time: avg 19.66s · max 32.25s · total 39.32s
Puzzle Solving: 8.2 · Did not follow instructions: 1 · Response time: avg 64.61s · max 123.57s · total 193.84s
Tool Calling: 10.0 · No failed answers · Response time: avg 7.45s · max 7.45s · total 7.45s
Wrong answer: 3 · Did not follow instructions: 1 · Response time: avg 68.83s · max 280.52s · total 1101.32s…
Total Tests: 16 · Wrong Tests: 4 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 77.1% · Flaky tests: 1 … Output Tokens: 1,283 · Reasoning Tokens: 1,533,310 · Response time: avg 68.83s · total 1101.32s · max 280.52s
Anti-AI Tricks: 10.0 · No failed answers · Response time: avg 43.87s · max 121.88s · total 131.62s
Combined: 10.0 · No failed answers · Response time: avg 280.52s · max 280.52s · total 280.52s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 7.16s · max 8.54s · total 14.31s
Domain specific: 5.3 · Wrong answer: 2 · Response time: avg 127.58s · max 133.93s · total 382.74s
General Intelligence: 10.0 · No failed answers · Response time: avg 5.25s · max 5.25s · total 5.25s
Puzzle Solving: 7.7 · Wrong answer: 1 · Response time: avg 46.33s · max 134.22s · total 139.00s
Tool Calling: 10.0 · No failed answers · Response time: avg 7.73s · max 7.73s · total 7.73s
Wrong answer: 3 · API error: 1 · Response time: avg 9.06s · max 26.24s · total 90.58s…
Total Tests: 18 · Wrong Tests: 4 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 77.8% · Flaky tests: 0 … Output Tokens: 1,508 · Reasoning Tokens: 10,084 · Response time: avg 9.06s · total 90.58s · max 26.24s
Anti-AI Tricks: 10.0 · No failed answers · Response time: avg 14.99s · max 26.24s · total 29.99s
Coding: 3.0 · API error: 1 · Response time: avg 0ms · max 0ms · total 0ms
Combined: 3.0 · Wrong answer: 1 · Response time: avg 10.37s · max 10.37s · total 10.37s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 10.84s · max 10.84s · total 10.84s
Domain specific: 5.3 · Wrong answer: 2 · Response time: avg 7.01s · max 7.01s · total 7.01s
General Intelligence: 10.0 · No failed answers · Response time: avg 9.34s · max 9.34s · total 9.34s
Instructions following: 9.8 · No failed answers · Response time: avg 3.26s · max 3.26s · total 3.26s
Puzzle Solving: 10.0 · No failed answers · Response time: avg 3.91s · max 4.23s · total 7.81s
Tool Calling: 10.0 · No failed answers · Response time: avg 11.96s · max 11.96s · total 11.96s
Total Tests: 18 · Wrong Tests: 5 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 85.2% · Flaky tests: 4 … Output Tokens: 20,163 · Reasoning Tokens: 58,337 · Response time: avg 23.34s · total 233.40s · max 79.09s
Anti-AI Tricks: 10.0 · No failed answers · Response time: avg 23.66s · max 25.06s · total 47.32s
Coding: 10.0 · No failed answers · Response time: avg 79.09s · max 79.09s · total 79.09s
Combined: 10.0 · No failed answers · Response time: avg 28.96s · max 28.96s · total 28.96s
Data parsing and extraction: 7.1 · No answer: 1 · Response time: avg 8.90s · max 8.90s · total 8.90s
Domain specific: 3.5 · Wrong answer: 2 · Timed out: 1 · Response time: avg 0ms · max 0ms · total 0ms
Instructions following: 10.0 · No failed answers · Response time: avg 7.25s · max 7.25s · total 7.25s
Puzzle Solving: 10.0 · No failed answers · Response time: avg 15.64s · max 16.34s · total 31.27s
Tool Calling: 10.0 · No failed answers · Response time: avg 15.93s · max 15.93s · total 15.93s
Total Tests: 18 · Wrong Tests: 5 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 79.6% · Flaky tests: 2 … Output Tokens: 12,734 · Reasoning Tokens: 27,950 · Response time: avg 24.88s · total 398.13s · max 70.97s
Anti-AI Tricks: 10.0 · No failed answers · Response time: avg 12.89s · max 26.66s · total 51.55s
Coding: 4.7 · Timed out: 1 · Response time: avg 70.97s · max 70.97s · total 70.97s
Combined: 3.0 · API error: 1 · Response time: avg 0ms · max 0ms · total 0ms
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 21.11s · max 21.94s · total 42.21s
Domain specific: 7.7 · Wrong answer: 1 · Response time: avg 38.48s · max 68.92s · total 115.43s
General Intelligence: 10.0 · No failed answers · Response time: avg 9.57s · max 9.57s · total 9.57s
Instructions following: 10.0 · No failed answers · Response time: avg 12.76s · max 17.53s · total 25.52s
Puzzle Solving: 8.8 · Did not follow instructions: 1 · Response time: avg 27.63s · max 61.08s · total 82.89s
Tool Calling: 3.0 · API error: 1 · Response time: avg 0ms · max 0ms · total 0ms
Wrong answer: 4 · Did not follow instructions: 1 · Response time: avg 12.12s · max 95.48s · total 218.12s…
Total Tests: 18 · Wrong Tests: 5 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 75.9% · Flaky tests: 1 … Output Tokens: 1,898 · Reasoning Tokens: 122,273 · Response time: avg 12.12s · total 218.12s · max 95.48s
Anti-AI Tricks: 8.4 · Wrong answer: 1 · Response time: avg 6.30s · max 15.56s · total 25.21s
Coding: 10.0 · No failed answers · Response time: avg 16.23s · max 16.23s · total 16.23s
Combined: 10.0 · No failed answers · Response time: avg 28.44s · max 28.44s · total 28.44s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 4.06s · max 5.06s · total 8.11s
Domain specific: 5.9 · Wrong answer: 2 · Response time: avg 37.34s · max 95.48s · total 112.01s
Instructions following: 9.8 · No failed answers · Response time: avg 2.62s · max 2.78s · total 5.24s
Puzzle Solving: 7.7 · Wrong answer: 1 · Response time: avg 3.94s · max 6.33s · total 11.83s
Tool Calling: 10.0 · No failed answers · Response time: avg 6.20s · max 6.20s · total 6.20s
Wrong answer: 3 · Did not follow instructions: 2 · Response Time: avg 18.63s · max 100.41s · total 335.26s …
Total Tests: 18 · Wrong Tests: 5 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 79.6% · Flaky tests: 3 … Output Tokens: 2,169 · Reasoning Tokens: 48,732 · Response time: avg 18.63s · total 335.26s · max 100.41s
Anti-AI Tricks: 8.3 · Wrong answer: 1 · Response Time: avg 4.11s · max 6.42s · total 16.42s
Coding: 10.0 · No failed answers · Response Time: avg 13.03s · max 13.03s · total 13.03s
Combined: 10.0 · No failed answers · Response Time: avg 20.57s · max 20.57s · total 20.57s
Data parsing and extraction: 10.0 · No failed answers · Response Time: avg 5.32s · max 5.40s · total 10.64s
Domain specific: 5.3 · Wrong answer: 2 · Response Time: avg 74.27s · max 100.41s · total 222.80s
Instructions following: 10.0 · No failed answers · Response Time: avg 3.11s · max 3.68s · total 6.22s
Puzzle Solving: 8.2 · Did not follow instructions: 1 · Response Time: avg 9.13s · max 18.14s · total 27.39s
Tool Calling: 10.0 · No failed answers · Response Time: avg 13.28s · max 13.28s · total 13.28s
Wrong answer: 4 · Did not follow instructions: 1 · Response Time: avg 3.74s · max 14.93s · total 67.31s …
Total Tests: 18 · Wrong Tests: 5 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 72.2% · Flaky tests: 0 … Output Tokens: 2,168 · Reasoning Tokens: 29,030 · Response time: avg 3.74s · total 67.31s · max 14.93s
Anti-AI Tricks: 9.1 · Did not follow instructions: 1 · Response Time: avg 2.33s · max 3.89s · total 9.30s
Coding: 10.0 · No failed answers · Response Time: avg 4.34s · max 4.34s · total 4.34s
Combined: 10.0 · No failed answers · Response Time: avg 14.93s · max 14.93s · total 14.93s
Data parsing and extraction: 10.0 · No failed answers · Response Time: avg 2.29s · max 2.31s · total 4.59s
Domain specific: 3.0 · Wrong answer: 3 · Response Time: avg 4.21s · max 5.86s · total 12.62s
General Intelligence: 10.0 · No failed answers · Response Time: avg 3.16s · max 3.16s · total 3.16s
Instructions following: 10.0 · No failed answers · Response Time: avg 1.91s · max 1.93s · total 3.82s
Puzzle Solving: 7.7 · Wrong answer: 1 · Response Time: avg 3.58s · max 4.41s · total 10.75s
Tool Calling: 10.0 · No failed answers · Response Time: avg 3.80s · max 3.80s · total 3.80s
Wrong answer: 3 · Did not follow instructions: 2 · Response Time: avg 71.21s · max 351.99s · total 1281.73s …
Total Tests: 18 · Wrong Tests: 5 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 74.1% · Flaky tests: 1 … Output Tokens: 671 · Reasoning Tokens: 39,383 · Response time: avg 71.21s · total 1281.73s · max 351.99s
Anti-AI Tricks: 8.3 · Wrong answer: 1 · Response Time: avg 26.93s · max 61.35s · total 107.71s
Coding: 10.0 · No failed answers · Response Time: avg 93.00s · max 93.00s · total 93.00s
Combined: 10.0 · No failed answers · Response Time: avg 71.08s · max 71.08s · total 71.08s
Data parsing and extraction: 10.0 · No failed answers · Response Time: avg 63.00s · max 102.80s · total 126.00s
Domain specific: 5.3 · Wrong answer: 2 · Response Time: avg 202.56s · max 351.99s · total 607.68s
Instructions following: 10.0 · No failed answers · Response Time: avg 14.60s · max 20.03s · total 29.20s
Puzzle Solving: 7.6 · Did not follow instructions: 1 · Response Time: avg 69.69s · max 92.65s · total 209.06s
Tool Calling: 10.0 · No failed answers · Response Time: avg 11.05s · max 11.05s · total 11.05s
Total Tests: 18 · Wrong Tests: 6 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 77.8% · Flaky tests: 5 … Output Tokens: 12,197 · Reasoning Tokens: 38,933 · Response time: avg 17.67s · total 317.98s · max 194.23s
Anti-AI Tricks: 10.0 · No failed answers · Response Time: avg 4.82s · max 7.69s · total 19.26s
Coding: 10.0 · No failed answers · Response Time: avg 12.26s · max 12.26s · total 12.26s
Combined: 10.0 · No failed answers · Response Time: avg 13.88s · max 13.88s · total 13.88s
Data parsing and extraction: 10.0 · No failed answers · Response Time: avg 6.19s · max 6.42s · total 12.38s
Domain specific: 2.9 · Wrong answer: 2 · Timed out: 1 · Response Time: avg 71.07s · max 194.23s · total 213.22s
General Intelligence: 6.1 · Wrong answer: 1 · Response Time: avg 10.05s · max 10.05s · total 10.05s
Instructions following: 10.0 · No failed answers · Response Time: avg 5.38s · max 5.70s · total 10.77s
Puzzle Solving: 7.3 · Did not follow instructions: 2 · Response Time: avg 5.44s · max 7.26s · total 16.32s
Tool Calling: 10.0 · No failed answers · Response Time: avg 9.84s · max 9.84s · total 9.84s
Wrong answer: 3 · Timed out: 2 · Response Time: avg 31.38s · max 119.29s · total 564.84s …
Total Tests: 18 · Wrong Tests: 5 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 79.6% · Flaky tests: 3 … Output Tokens: 17,635 · Reasoning Tokens: 162,668 · Response time: avg 31.38s · total 564.84s · max 119.29s
Anti-AI Tricks: 10.0 · No failed answers · Response Time: avg 9.75s · max 18.03s · total 39.01s
Coding: 4.7 · Timed out: 1 · Response Time: avg 70.98s · max 70.98s · total 70.98s
Combined: 10.0 · No failed answers · Response Time: avg 107.79s · max 107.79s · total 107.79s
Data parsing and extraction: 10.0 · No failed answers · Response Time: avg 23.41s · max 29.79s · total 46.83s
Domain specific: 2.9 · Wrong answer: 3 · Response Time: avg 63.40s · max 119.29s · total 190.20s
General Intelligence: 3.4 · Timed out: 1 · Response Time: avg 34.11s · max 34.11s · total 34.11s
Instructions following: 10.0 · No failed answers · Response Time: avg 9.88s · max 15.44s · total 19.76s
Puzzle Solving: 10.0 · No failed answers · Response Time: avg 17.18s · max 31.99s · total 51.55s
Tool Calling: 10.0 · No failed answers · Response Time: avg 4.60s · max 4.60s · total 4.60s
Total Tests: 18 · Wrong Tests: 5 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 74.1% · Flaky tests: 1 … Output Tokens: 1,763 · Reasoning Tokens: 83,782 · Response time: avg 15.27s · total 259.55s · max 43.55s
Anti-AI Tricks: 10.0 · No failed answers · Response Time: avg 9.90s · max 19.37s · total 39.60s
Coding: 3.0 · API error: 1 · Response Time: avg 0ms · max 0ms · total 0ms
Combined: 10.0 · No failed answers · Response Time: avg 34.95s · max 34.95s · total 34.95s
Data parsing and extraction: 10.0 · No failed answers · Response Time: avg 14.95s · max 15.40s · total 29.90s
Domain specific: 2.9 · Wrong answer: 3 · Response Time: avg 29.59s · max 43.55s · total 88.77s
Instructions following: 10.0 · No failed answers · Response Time: avg 7.54s · max 11.67s · total 15.07s
Puzzle Solving: 10.0 · No failed answers · Response Time: avg 6.11s · max 7.52s · total 18.34s
Tool Calling: 10.0 · No failed answers · Response Time: avg 5.87s · max 5.87s · total 5.87s
Total Tests: 18 · Wrong Tests: 5 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 75.9% · Flaky tests: 1 … Output Tokens: 65,778 · Reasoning Tokens: 0 · Response time: avg 23.98s · total 407.72s · max 78.74s
Anti-AI Tricks: 10.0 · No failed answers · Response Time: avg 16.61s · max 38.50s · total 66.46s
Coding: 10.0 · No failed answers · Response Time: avg 27.94s · max 27.94s · total 27.94s
Combined: 10.0 · No failed answers · Response Time: avg 78.74s · max 78.74s · total 78.74s
Data parsing and extraction: 6.5 · API error: 1 · Response Time: avg 5.85s · max 5.85s · total 5.85s
Domain specific: 5.9 · Wrong answer: 2 · Response Time: avg 40.44s · max 46.32s · total 121.31s
General Intelligence: 10.0 · No failed answers · Response Time: avg 16.44s · max 16.44s · total 16.44s
Instructions following: 10.0 · No failed answers · Response Time: avg 15.98s · max 22.24s · total 31.97s
Puzzle Solving: 5.3 · Did not follow instructions: 2 · Response Time: avg 13.73s · max 25.82s · total 41.19s
Tool Calling: 10.0 · No failed answers · Response Time: avg 17.84s · max 17.84s · total 17.84s
Wrong answer: 5 · Response Time: avg 1.65s · max 3.56s · total 18.20s …
Total Tests: 18 · Wrong Tests: 5 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 77.8% · Flaky tests: 2 … Output Tokens: 1,840 · Reasoning Tokens: 0 · Response time: avg 1.65s · total 18.20s · max 3.56s
Anti-AI Tricks: 8.3 · Wrong answer: 1 · Response Time: avg 1.25s · max 1.59s · total 2.49s
Coding: 10.0 · No failed answers · Response Time: avg 1.59s · max 1.59s · total 1.59s
Combined: 4.7 · Wrong answer: 1 · Response Time: avg 3.56s · max 3.56s · total 3.56s
Data parsing and extraction: 10.0 · No failed answers · Response Time: avg 1.41s · max 1.41s · total 1.41s
Domain specific: 7.7 · Wrong answer: 1 · Response Time: avg 963ms · max 963ms · total 963ms
General Intelligence: 10.0 · No failed answers · Response Time: avg 1.13s · max 1.13s · total 1.13s
Instructions following: 6.4 · Wrong answer: 1 · Response Time: avg 1.58s · max 1.58s · total 1.58s
Puzzle Solving: 7.7 · Wrong answer: 1 · Response Time: avg 1.06s · max 1.06s · total 2.12s
Tool Calling: 10.0 · No failed answers · Response Time: avg 3.35s · max 3.35s · total 3.35s
Wrong answer: 4 · Did not follow instructions: 1 · Response Time: avg 3.22s · max 11.91s · total 58.00s …
Total Tests: 18 · Wrong Tests: 5 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 72.2% · Flaky tests: 0 … Output Tokens: 2,247 · Reasoning Tokens: 8,058 · Response time: avg 3.22s · total 58.00s · max 11.91s
Anti-AI Tricks: 8.3 · Wrong answer: 1 · Response Time: avg 2.12s · max 3.18s · total 8.50s
Coding: 10.0 · No failed answers · Response Time: avg 2.20s · max 2.20s · total 2.20s
Combined: 3.0 · Wrong answer: 1 · Response Time: avg 11.91s · max 11.91s · total 11.91s
Data parsing and extraction: 10.0 · No failed answers · Response Time: avg 3.00s · max 3.74s · total 5.99s
Domain specific: 5.3 · Wrong answer: 2 · Response Time: avg 2.36s · max 3.51s · total 7.07s
Instructions following: 10.0 · No failed answers · Response Time: avg 1.49s · max 1.66s · total 2.99s
Puzzle Solving: 10.0 · No failed answers · Response Time: avg 2.76s · max 5.08s · total 8.27s
Tool Calling: 10.0 · No failed answers · Response Time: avg 9.54s · max 9.54s · total 9.54s
Total Tests: 18 · Wrong Tests: 6 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 75.9% · Flaky tests: 3 … Output Tokens: 2,735 · Reasoning Tokens: 52,571 · Response time: avg 16.17s · total 291.09s · max 84.22s
Anti-AI Tricks: 10.0 · No failed answers · Response Time: avg 2.95s · max 5.12s · total 11.80s
Coding: 10.0 · No failed answers · Response Time: avg 32.58s · max 32.58s · total 32.58s
Combined: 10.0 · No failed answers · Response Time: avg 53.36s · max 53.36s · total 53.36s
Data parsing and extraction: 7.3 · Wrong answer: 1 · Response Time: avg 18.81s · max 20.29s · total 37.61s
Domain specific: 5.3 · Extra formatting: 2 · Response Time: avg 37.87s · max 84.22s · total 113.60s
Instructions following: 9.9 · No failed answers · Response Time: avg 2.77s · max 3.21s · total 5.54s
Tool Calling: 10.0 · No failed answers · Response Time: avg 16.87s · max 16.87s · total 16.87s
Total Tests: 18 · Wrong Tests: 6 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 77.8% · Flaky tests: 3 … Output Tokens: 2,360 · Reasoning Tokens: 38,320 · Response time: avg 12.27s · total 208.56s · max 64.71s
Anti-AI Tricks: 10.0 · No failed answers · Response Time: avg 3.06s · max 4.70s · total 12.23s
Coding: 10.0 · No failed answers · Response Time: avg 52.12s · max 52.12s · total 52.12s
Combined: 4.7 · Wrong answer: 1 · Response Time: avg 64.71s · max 64.71s · total 64.71s
Data parsing and extraction: 7.3 · Wrong answer: 1 · Response Time: avg 17.20s · max 17.44s · total 34.40s
Domain specific: 5.3 · Timed out: 1 · Wrong answer: 1 · Response Time: avg 6.00s · max 6.14s · total 12.01s
General Intelligence: 10.0 · No failed answers · Response Time: avg 4.06s · max 4.06s · total 4.06s
Instructions following: 9.9 · No failed answers · Response Time: avg 3.36s · max 4.35s · total 6.72s
Tool Calling: 10.0 · No failed answers · Response Time: avg 8.19s · max 8.19s · total 8.19s
Total Tests: 18 · Wrong Tests: 5 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 74.1% · Flaky tests: 1 … Output Tokens: 65,057 · Reasoning Tokens: 0 · Response time: avg 14.63s · total 248.72s · max 46.04s
Anti-AI Tricks: 10.0 · No failed answers · Response Time: avg 6.59s · max 10.20s · total 26.37s
Coding: 10.0 · No failed answers · Response Time: avg 31.37s · max 31.37s · total 31.37s
Combined: 10.0 · No failed answers · Response Time: avg 46.04s · max 46.04s · total 46.04s
Data parsing and extraction: 6.5 · API error: 1 · Response Time: avg 5.25s · max 5.25s · total 5.25s
Domain specific: 5.3 · Wrong answer: 2 · Response Time: avg 22.30s · max 30.51s · total 66.90s
General Intelligence: 10.0 · No failed answers · Response Time: avg 16.84s · max 16.84s · total 16.84s
Instructions following: 10.0 · No failed answers · Response Time: avg 6.16s · max 7.72s · total 12.31s
Puzzle Solving: 5.3 · Did not follow instructions: 2 · Response Time: avg 9.55s · max 14.35s · total 28.64s
Tool Calling: 10.0 · No failed answers · Response Time: avg 15.02s · max 15.02s · total 15.02s
Total Tests: 18 · Wrong Tests: 5 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 75.9% · Flaky tests: 2 … Output Tokens: 15,928 · Reasoning Tokens: 44,631 · Response time: avg 25.03s · total 425.48s · max 147.47s
Anti-AI Tricks: 10.0 · No failed answers · Response Time: avg 6.20s · max 9.64s · total 24.78s
Coding: 2.8 · Timed out: 1 · Response Time: avg 147.47s · max 147.47s · total 147.47s
Combined: 9.6 · No failed answers · Response Time: avg 73.55s · max 73.55s · total 73.55s
Data parsing and extraction: 10.0 · No failed answers · Response Time: avg 16.51s · max 20.57s · total 33.02s
Domain specific: 2.9 · Wrong answer: 2 · Timed out: 1 · Response Time: avg 23.62s · max 27.00s · total 47.23s
General Intelligence: 10.0 · No failed answers · Response Time: avg 29.76s · max 29.76s · total 29.76s
Instructions following: 10.0 · No failed answers · Response Time: avg 17.54s · max 21.25s · total 35.08s
Puzzle Solving: 7.9 · Did not follow instructions: 1 · Response Time: avg 8.52s · max 12.73s · total 25.56s
Tool Calling: 10.0 · No failed answers · Response Time: avg 9.01s · max 9.01s · total 9.01s
Did not follow instructions: 3 · Wrong answer: 3 · Response Time: avg 9.81s · max 31.36s · total 176.62s …
Total Tests: 18 · Wrong Tests: 6 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 74.1% · Flaky tests: 2 … Output Tokens: 1,568 · Reasoning Tokens: 91,909 · Response time: avg 9.81s · total 176.62s · max 31.36s
Anti-AI Tricks: 8.7 · Wrong answer: 1 · Response Time: avg 3.16s · max 3.44s · total 12.65s
Coding: 10.0 · No failed answers · Response Time: avg 31.36s · max 31.36s · total 31.36s
Combined: 10.0 · No failed answers · Response Time: avg 20.93s · max 20.93s · total 20.93s
Data parsing and extraction: 10.0 · No failed answers · Response Time: avg 4.01s · max 4.27s · total 8.02s
Domain specific: 5.3 · Wrong answer: 2 · Response Time: avg 21.33s · max 24.21s · total 64.00s
General Intelligence: 10.0 · No failed answers · Response Time: avg 5.78s · max 5.78s · total 5.78s
Puzzle Solving: 8.2 · Did not follow instructions: 1 · Response Time: avg 3.85s · max 4.53s · total 11.55s
Tool Calling: 3.0 · Did not follow instructions: 1 · Response Time: avg 12.39s · max 12.39s · total 12.39s
Extra formatting: 2 · Wrong answer: 2 · Timed out: 1 · Response Time: avg 12.66s · max 46.35s · total 126.62s …
Total Tests: 18 · Wrong Tests: 5 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 74.1% · Flaky tests: 1 … Output Tokens: 42,068 · Reasoning Tokens: 26,784 · Response time: avg 12.66s · total 126.62s · max 46.35s
Coding: 10.0 · No failed answers · Response Time: avg 35.76s · max 35.76s · total 35.76s
Combined: 10.0 · No failed answers · Response Time: avg 46.35s · max 46.35s · total 46.35s
Data parsing and extraction: 10.0 · No failed answers · Response Time: avg 13.90s · max 13.90s · total 13.90s
General Intelligence: 10.0 · No failed answers · Response Time: avg 4.94s · max 4.94s · total 4.94s
Instructions following: 10.0 · No failed answers · Response Time: avg 2.61s · max 2.61s · total 2.61s
Puzzle Solving: 10.0 · No failed answers · Response Time: avg 4.80s · max 5.22s · total 9.60s
Tool Calling: 10.0 · No failed answers · Response Time: avg 7.48s · max 7.48s · total 7.48s
Total Tests: 18 · Wrong Tests: 6 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 79.6% · Flaky tests: 4 … Output Tokens: 7,554 · Reasoning Tokens: 45,588 · Response time: avg 43.49s · total 782.73s · max 180.92s
Anti-AI Tricks: 8.4 · Wrong answer: 1 · Response Time: avg 30.72s · max 44.23s · total 122.88s
Coding: 4.7 · Timed out: 1 · Response Time: avg 180.92s · max 180.92s · total 180.92s
Combined: 10.0 · No failed answers · Response Time: avg 93.11s · max 93.11s · total 93.11s
Data parsing and extraction: 10.0 · No failed answers · Response Time: avg 36.09s · max 39.12s · total 72.18s
Domain specific: 5.3 · Timed out: 1 · Wrong answer: 1 · Response Time: avg 21.78s · max 30.66s · total 65.35s
Instructions following: 10.0 · No failed answers · Response Time: avg 35.78s · max 47.30s · total 71.56s
Puzzle Solving: 8.2 · Wrong answer: 1 · Response Time: avg 36.87s · max 59.22s · total 110.62s
Tool Calling: 10.0 · No failed answers · Response Time: avg 34.81s · max 34.81s · total 34.81s
Wrong answer: 5 · Did not follow instructions: 1 · Response Time: avg 6.84s · max 38.52s · total 123.17s …
Total Tests: 18 · Wrong Tests: 6 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 75.9% · Flaky tests: 3 … Output Tokens: 17,346 · Reasoning Tokens: 0 · Response time: avg 6.84s · total 123.17s · max 38.52s
Anti-AI Tricks: 8.7 · Wrong answer: 1 · Response Time: avg 3.40s · max 4.78s · total 13.59s
Coding: 10.0 · No failed answers · Response Time: avg 8.97s · max 8.97s · total 8.97s
Combined: 10.0 · No failed answers · Response Time: avg 9.12s · max 9.12s · total 9.12s
Data parsing and extraction: 10.0 · No failed answers · Response Time: avg 3.05s · max 3.33s · total 6.10s
Domain specific: 5.3 · Wrong answer: 2 · Response Time: avg 17.78s · max 38.52s · total 53.33s
Instructions following: 7.5 · Wrong answer: 1 · Response Time: avg 5.46s · max 6.45s · total 10.92s
Puzzle Solving: 7.7 · Wrong answer: 1 · Response Time: avg 4.42s · max 5.04s · total 13.27s
Tool Calling: 10.0 · No failed answers · Response Time: avg 4.68s · max 4.68s · total 4.68s
Wrong answer: 4 · Did not follow instructions: 2 · Response Time: avg 1.30s · max 3.39s · total 23.42s …
Total Tests: 18 · Wrong Tests: 6 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 70.4% · Flaky tests: 1 … Output Tokens: 5,361 · Reasoning Tokens: 0 · Response time: avg 1.30s · total 23.42s · max 3.39s
Coding: 10.0 · No failed answers · Response Time: avg 1.47s · max 1.47s · total 1.47s
Combined: 3.0 · Wrong answer: 1 · Response Time: avg 3.20s · max 3.20s · total 3.20s
Data parsing and extraction: 10.0 · No failed answers · Response Time: avg 1.22s · max 1.33s · total 2.44s
Domain specific: 5.3 · Wrong answer: 2 · Response Time: avg 942ms · max 1.12s · total 2.83s
Instructions following: 10.0 · No failed answers · Response Time: avg 1.13s · max 1.14s · total 2.27s
Puzzle Solving: 10.0 · No failed answers · Response Time: avg 972ms · max 1.13s · total 2.92s
Tool Calling: 10.0 · No failed answers · Response Time: avg 3.39s · max 3.39s · total 3.39s
Did not follow instructions: 3 · Wrong answer: 3 · Response time: avg 26.78s · max 170.45s · total 294.58s …
Total Tests: 17 · Wrong Tests: 6 · Reliability: N/A (telemetry unavailable or incomplete for this model) · Attempt pass rate: 70.6% · Flaky tests: 2 … Output Tokens: 71,904 · Reasoning Tokens: 155,607 · Response time: avg 26.78s · total 294.58s · max 170.45s
Anti-AI Tricks: 10.0 · No failed answers · Response time: avg 13.56s · max 32.30s · total 40.68s
Combined: 10.0 · No failed answers · Response time: avg 29.57s · max 29.57s · total 29.57s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 15.01s · max 15.01s · total 15.01s
Domain specific: 5.3 · Wrong answer: 2 · Response time: avg 170.45s · max 170.45s · total 170.45s
Tool Calling: 10.0 · No failed answers · Response time: avg 11.91s · max 11.91s · total 11.91s
Wrong answer: 4 · Did not follow instructions: 3 · Response time: avg 47.47s · max 255.28s · total 854.45s …
Total Tests: 18 · Wrong Tests: 7 · Reliability: N/A (telemetry unavailable or incomplete for this model) · Attempt pass rate: 79.6% · Flaky tests: 5 … Output Tokens: 1,757 · Reasoning Tokens: 55,907 · Response time: avg 47.47s · total 854.45s · max 255.28s
Anti-AI Tricks: 8.3 · Wrong answer: 1 · Response time: avg 28.51s · max 39.73s · total 114.05s
Coding: 10.0 · No failed answers · Response time: avg 62.48s · max 62.48s · total 62.48s
Combined: 10.0 · No failed answers · Response time: avg 76.57s · max 76.57s · total 76.57s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 28.03s · max 30.49s · total 56.07s
Domain specific: 4.1 · Wrong answer: 3 · Response time: avg 112.69s · max 255.28s · total 338.07s
Instructions following: 10.0 · No failed answers · Response time: avg 15.36s · max 19.53s · total 30.73s
Puzzle Solving: 6.4 · Did not follow instructions: 2 · Response time: avg 25.53s · max 32.37s · total 76.60s
Tool Calling: 10.0 · No failed answers · Response time: avg 74.73s · max 74.73s · total 74.73s
Total Tests: 18 · Wrong Tests: 7 · Reliability: N/A (telemetry unavailable or incomplete for this model) · Attempt pass rate: 77.8% · Flaky tests: 6 … Output Tokens: 2,351 · Reasoning Tokens: 58,941 · Response time: avg 14.96s · total 269.32s · max 67.08s
Coding: 10.0 · No failed answers · Response time: avg 13.78s · max 13.78s · total 13.78s
Combined: 6.9 · Invalid tool call: 1 · Response time: avg 15.06s · max 15.06s · total 15.06s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 9.60s · max 9.92s · total 19.19s
Domain specific: 5.3 · Wrong answer: 2 · Response time: avg 38.15s · max 67.08s · total 114.45s
General Intelligence: 10.0 · No failed answers · Response time: avg 11.09s · max 11.09s · total 11.09s
Instructions following: 9.9 · No failed answers · Response time: avg 3.74s · max 5.23s · total 7.47s
Puzzle Solving: 7.7 · Did not follow instructions: 1 · Response time: avg 10.91s · max 18.97s · total 32.74s
Tool Calling: 7.0 · Invalid tool call: 1 · Response time: avg 12.53s · max 12.53s · total 12.53s
Total Tests: 18 · Wrong Tests: 7 · Reliability: N/A (telemetry unavailable or incomplete for this model) · Attempt pass rate: 81.5% · Flaky tests: 6 … Output Tokens: 2,073 · Reasoning Tokens: 191,899 · Response time: avg 66.72s · total 1201.03s · max 234.29s
Anti-AI Tricks: 10.0 · No failed answers · Response time: avg 59.11s · max 168.31s · total 236.44s
Coding: 4.7 · Timed out: 1 · Response time: avg 45.75s · max 45.75s · total 45.75s
Combined: 10.0 · No failed answers · Response time: avg 17.78s · max 17.78s · total 17.78s
Data parsing and extraction: 7.3 · API error: 1 · Response time: avg 56.99s · max 80.14s · total 113.98s
Domain specific: 5.3 · Timed out: 1 · Wrong answer: 1 · Response time: avg 146.50s · max 234.29s · total 439.49s
Instructions following: 10.0 · No failed answers · Response time: avg 63.49s · max 111.61s · total 126.98s
Puzzle Solving: 6.4 · Timed out: 2 · Response time: avg 56.74s · max 115.01s · total 170.23s
Tool Calling: 10.0 · No failed answers · Response time: avg 10.33s · max 10.33s · total 10.33s
Wrong answer: 3 · Timed out: 2 · API error: 1 · Response time: avg 24.13s · max 118.52s · total 410.25s …
Total Tests: 18 · Wrong Tests: 6 · Reliability: N/A (telemetry unavailable or incomplete for this model) · Attempt pass rate: 75.9% · Flaky tests: 3 … Output Tokens: 8,005 · Reasoning Tokens: 49,090 · Response time: avg 24.13s · total 410.25s · max 118.52s
Anti-AI Tricks: 10.0 · No failed answers · Response time: avg 8.31s · max 14.20s · total 33.24s
Coding: 4.7 · Timed out: 1 · Response time: avg 118.52s · max 118.52s · total 118.52s
Combined: 9.5 · No failed answers · Response time: avg 43.11s · max 43.11s · total 43.11s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 9.33s · max 9.40s · total 18.66s
Domain specific: 5.3 · Timed out: 1 · Wrong answer: 1 · Response time: avg 29.77s · max 32.22s · total 89.30s
General Intelligence: 10.0 · No failed answers · Response time: avg 20.95s · max 20.95s · total 20.95s
Instructions following: 6.4 · Wrong answer: 1 · Response time: avg 7.47s · max 10.16s · total 14.94s
Puzzle Solving: 8.2 · Wrong answer: 1 · Response time: avg 23.85s · max 33.09s · total 71.54s
Tool Calling: 3.0 · API error: 1 · Response time: avg 0ms · max 0ms · total 0ms
Total Tests: 18 · Wrong Tests: 6 · Reliability: N/A (telemetry unavailable or incomplete for this model) · Attempt pass rate: 74.1% · Flaky tests: 3 … Output Tokens: 2,840 · Reasoning Tokens: 116,242 · Response time: avg 13.71s · total 246.73s · max 86.93s
Anti-AI Tricks: 10.0 · No failed answers · Response time: avg 1.98s · max 3.76s · total 7.92s
Coding: 10.0 · No failed answers · Response time: avg 31.48s · max 31.48s · total 31.48s
Combined: 10.0 · No failed answers · Response time: avg 16.86s · max 16.86s · total 16.86s
Domain specific: 5.3 · Extra formatting: 1 · Wrong answer: 1 · Response time: avg 34.53s · max 86.93s · total 103.59s
Instructions following: 9.9 · No failed answers · Response time: avg 1.80s · max 1.81s · total 3.60s
Puzzle Solving: 8.2 · No answer: 1 · Response time: avg 20.60s · max 57.93s · total 61.79s
Tool Calling: 10.0 · No failed answers · Response time: avg 7.29s · max 7.29s · total 7.29s
Total Tests: 18 · Wrong Tests: 7 · Reliability: N/A (telemetry unavailable or incomplete for this model) · Attempt pass rate: 74.1% · Flaky tests: 4 … Output Tokens: 80,759 · Reasoning Tokens: 179,814 · Response time: avg 45.20s · total 768.37s · max 215.85s
Coding: 10.0 · No failed answers · Response time: avg 106.96s · max 106.96s · total 106.96s
Combined: 10.0 · No failed answers · Response time: avg 40.96s · max 40.96s · total 40.96s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 20.38s · max 22.88s · total 40.76s
Domain specific: 5.3 · Timed out: 2 · Response time: avg 202.38s · max 215.85s · total 404.76s
General Intelligence: 10.0 · No failed answers · Response time: avg 17.83s · max 17.83s · total 17.83s
Instructions following: 10.0 · No failed answers · Response time: avg 12.53s · max 19.15s · total 25.06s
Tool Calling: 10.0 · No failed answers · Response time: avg 8.92s · max 8.92s · total 8.92s
Total Tests: 18 · Wrong Tests: 7 · Reliability: N/A (telemetry unavailable or incomplete for this model) · Attempt pass rate: 61.1% · Flaky tests: 0 … Output Tokens: 928 · Reasoning Tokens: 72,661 · Response time: avg 16.76s · total 301.61s · max 158.78s
Anti-AI Tricks: 10.0 · No failed answers · Response time: avg 2.11s · max 3.43s · total 8.43s
Coding: 4.0 · Wrong answer: 1 · Response time: avg 68.55s · max 68.55s · total 68.55s
Combined: 10.0 · No failed answers · Response time: avg 19.29s · max 19.29s · total 19.29s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 2.29s · max 2.62s · total 4.58s
General Intelligence: 10.0 · No failed answers · Response time: avg 2.86s · max 2.86s · total 2.86s
Tool Calling: 10.0 · No failed answers · Response time: avg 11.07s · max 11.07s · total 11.07s
Wrong answer: 5 · Did not follow instructions: 2 · Response time: avg 5.88s · max 18.33s · total 105.90s …
Total Tests: 18 · Wrong Tests: 7 · Reliability: N/A (telemetry unavailable or incomplete for this model) · Attempt pass rate: 68.5% · Flaky tests: 3 … Output Tokens: 20,784 · Reasoning Tokens: 0 · Response time: avg 5.88s · total 105.90s · max 18.33s
Coding: 10.0 · No failed answers · Response time: avg 9.32s · max 9.32s · total 9.32s
Combined: 10.0 · No failed answers · Response time: avg 11.96s · max 11.96s · total 11.96s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 2.21s · max 2.52s · total 4.42s
Domain specific: 3.5 · Wrong answer: 3 · Response time: avg 13.01s · max 18.33s · total 39.04s
Instructions following: 8.3 · Wrong answer: 1 · Response time: avg 3.29s · max 4.18s · total 6.59s
Puzzle Solving: 10.0 · No failed answers · Response time: avg 2.93s · max 3.05s · total 8.78s
Tool Calling: 10.0 · No failed answers · Response time: avg 8.36s · max 8.36s · total 8.36s
Extra formatting: 4 · Wrong answer: 2 · Response time: avg 21.08s · max 83.40s · total 231.84s …
Total Tests: 18 · Wrong Tests: 6 · Reliability: N/A (telemetry unavailable or incomplete for this model) · Attempt pass rate: 70.4% · Flaky tests: 2 … Output Tokens: 29,829 · Reasoning Tokens: 18,938 · Response time: avg 21.08s · total 231.84s · max 83.40s
Anti-AI Tricks: 6.4 · Extra formatting: 2 · Response time: avg 7.45s · max 11.88s · total 14.90s
Coding: 10.0 · No failed answers · Response time: avg 23.11s · max 23.11s · total 23.11s
Combined: 10.0 · No failed answers · Response time: avg 76.66s · max 76.66s · total 76.66s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 7.37s · max 7.37s · total 7.37s
General Intelligence: 10.0 · No failed answers · Response time: avg 5.04s · max 5.04s · total 5.04s
Instructions following: 10.0 · No failed answers · Response time: avg 2.43s · max 2.43s · total 2.43s
Puzzle Solving: 7.7 · Wrong answer: 1 · Response time: avg 4.60s · max 4.66s · total 9.20s
Tool Calling: 10.0 · No failed answers · Response time: avg 9.73s · max 9.73s · total 9.73s
Wrong answer: 4 · Did not follow instructions: 3 · Response time: avg 11.21s · max 94.06s · total 201.80s …
Total Tests: 18 · Wrong Tests: 7 · Reliability: N/A (telemetry unavailable or incomplete for this model) · Attempt pass rate: 68.5% · Flaky tests: 2 … Output Tokens: 2,946 · Reasoning Tokens: 58,132 · Response time: avg 11.21s · total 201.80s · max 94.06s
Anti-AI Tricks: 8.3 · Wrong answer: 1 · Response time: avg 4.52s · max 7.74s · total 18.10s
Coding: 10.0 · No failed answers · Response time: avg 13.41s · max 13.41s · total 13.41s
Combined: 9.8 · No failed answers · Response time: avg 24.13s · max 24.13s · total 24.13s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 2.54s · max 3.33s · total 5.08s
Domain specific: 5.9 · Wrong answer: 2 · Response time: avg 38.18s · max 94.06s · total 114.53s
Instructions following: 9.8 · No failed answers · Response time: avg 1.88s · max 2.61s · total 3.75s
Tool Calling: 10.0 · No failed answers · Response time: avg 7.71s · max 7.71s · total 7.71s
Total Tests: 18 · Wrong Tests: 7 · Reliability: N/A (telemetry unavailable or incomplete for this model) · Attempt pass rate: 66.7% · Flaky tests: 2 … Output Tokens: 2,419 · Reasoning Tokens: 79,238 · Response time: avg 69.70s · total 1045.47s · max 262.83s
Anti-AI Tricks: 6.6 · Timed out: 1 · Wrong answer: 1 · Response time: avg 74.75s · max 182.10s · total 298.98s
Coding: 10.0 · No failed answers · Response time: avg 197.31s · max 197.31s · total 197.31s
Combined: 10.0 · No failed answers · Response time: avg 262.83s · max 262.83s · total 262.83s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 24.27s · max 27.52s · total 48.54s
Domain specific: 3.0 · Timed out: 3 · Response time: avg 0ms · max 0ms · total 0ms
Instructions following: 10.0 · No failed answers · Response time: avg 17.47s · max 19.46s · total 34.93s
Puzzle Solving: 8.2 · Wrong answer: 1 · Response time: avg 25.85s · max 32.95s · total 77.55s
Tool Calling: 10.0 · No failed answers · Response time: avg 88.68s · max 88.68s · total 88.68s
Total Tests: 18 · Wrong Tests: 7 · Reliability: N/A (telemetry unavailable or incomplete for this model) · Attempt pass rate: 72.2% · Flaky tests: 4 … Output Tokens: 2,705 · Reasoning Tokens: 18,977 · Response time: avg 14.04s · total 154.41s · max 77.80s
Coding: 10.0 · No failed answers · Response time: avg 15.12s · max 15.12s · total 15.12s
Combined: 10.0 · No failed answers · Response time: avg 14.06s · max 14.06s · total 14.06s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 3.15s · max 3.15s · total 3.15s
Domain specific: 5.9 · Timed out: 1 · Wrong answer: 1 · Response time: avg 77.80s · max 77.80s · total 77.80s
Instructions following: 9.9 · No failed answers · Response time: avg 3.12s · max 3.12s · total 3.12s
Puzzle Solving: 7.7 · Did not follow instructions: 1 · Response time: avg 5.47s · max 6.45s · total 10.94s
Tool Calling: 4.7 · No answer: 1 · Response time: avg 10.30s · max 10.30s · total 10.30s
Total Tests: 18 · Wrong Tests: 7 · Reliability: N/A (telemetry unavailable or incomplete for this model) · Attempt pass rate: 70.4% · Flaky tests: 3 … Output Tokens: 12,387 · Reasoning Tokens: 115,182 · Response time: avg 23.36s · total 280.34s · max 96.01s
Anti-AI Tricks: 8.1 · Extra formatting: 1 · Response time: avg 15.85s · max 20.83s · total 47.55s
Coding: 4.7 · Timed out: 1 · Response time: avg 13.03s · max 13.03s · total 13.03s
Combined: 9.8 · No failed answers · Response time: avg 75.68s · max 75.68s · total 75.68s
Data parsing and extraction: 6.5 · API error: 1 · Response time: avg 0ms · max 0ms · total 0ms
Domain specific: 5.9 · Wrong answer: 2 · Response time: avg 96.01s · max 96.01s · total 96.01s
Instructions following: 10.0 · No failed answers · Response time: avg 4.28s · max 7.37s · total 8.55s
Puzzle Solving: 7.7 · Wrong answer: 1 · Response time: avg 3.77s · max 5.26s · total 7.55s
Tool Calling: 10.0 · No failed answers · Response time: avg 27.78s · max 27.78s · total 27.78s
Total Tests: 18 · Wrong Tests: 7 · Reliability: N/A (telemetry unavailable or incomplete for this model) · Attempt pass rate: 64.8% · Flaky tests: 1 … Output Tokens: 7,433 · Reasoning Tokens: 0 · Response time: avg 4.98s · total 54.83s · max 23.84s
Coding: 10.0 · No failed answers · Response time: avg 3.67s · max 3.67s · total 3.67s
Combined: 9.5 · No failed answers · Response time: avg 23.84s · max 23.84s · total 23.84s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 3.43s · max 3.43s · total 3.43s
Domain specific: 7.7 · Wrong answer: 1 · Response time: avg 3.54s · max 3.54s · total 3.54s
Instructions following: 6.5 · Wrong answer: 1 · Response time: avg 1.96s · max 1.96s · total 1.96s
Puzzle Solving: 7.7 · Extra formatting: 1 · Response time: avg 2.92s · max 3.33s · total 5.84s
Tool Calling: 10.0 · No failed answers · Response time: avg 4.11s · max 4.11s · total 4.11s