Total Tests: 18; Wrong Tests: 9; Reliability: N/A (reliability telemetry is unavailable or incomplete for this model); Attempt pass rate: 51.9%; Flaky tests: 1; …; Output Tokens: 1,611; Reasoning Tokens: 0; Response time: avg 23.34s · total 420.04s · max 109.46s
A test is fully passed only if every run passed for that test.
Anti-AI Tricks: 4.8; Wrong answer: 2; Extra formatting: 1; Response time: avg 36.12s · max 109.46s · total 144.50s
Coding: 10.0; No failed answers; Response time: avg 33.40s · max 33.40s · total 33.40s
Combined: 9.5; No failed answers; Response time: avg 34.55s · max 34.55s · total 34.55s
Data parsing and extraction: 10.0; No failed answers; Response time: avg 54.04s · max 105.46s · total 108.08s
Domain specific: 5.3; Wrong answer: 2; Response time: avg 3.08s · max 6.59s · total 9.24s
General Intelligence: 4.5; Wrong answer: 1; Response time: avg 6.06s · max 6.06s · total 6.06s
Instructions following: 6.5; Wrong answer: 1; Response time: avg 9.47s · max 13.43s · total 18.95s
Tool Calling: 10.0; No failed answers; Response time: avg 6.47s · max 6.47s · total 6.47s
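The summary figures above mix two different measures: a test counts as fully passed only if every run passed, while the attempt pass rate is reported per attempt. The following minimal sketch illustrates how the two can diverge. The `TestResult` class, the `summarize` function, the sample data, and the reading of "flaky" as "some runs passed and some failed" are all assumptions made for illustration; the page does not state how many runs each test had or how the site computes these numbers.

```python
from dataclasses import dataclass


@dataclass
class TestResult:
    """Hypothetical record of one test and the outcome of each of its runs."""
    name: str
    runs: list[bool]  # one entry per attempt; True means that attempt passed

    @property
    def fully_passed(self) -> bool:
        # Rule quoted on the page: every run must pass.
        return all(self.runs)

    @property
    def flaky(self) -> bool:
        # Assumption: "flaky" means mixed outcomes across runs.
        return any(self.runs) and not all(self.runs)


def summarize(results: list[TestResult]) -> dict:
    """Compute per-test and per-attempt figures from raw run outcomes."""
    attempts = [ok for r in results for ok in r.runs]
    return {
        "total_tests": len(results),
        "wrong_tests": sum(not r.fully_passed for r in results),
        "flaky_tests": sum(r.flaky for r in results),
        "attempt_pass_rate": round(100 * sum(attempts) / len(attempts), 1),
    }


# Made-up sample data: three tests with three runs each.
results = [
    TestResult("a", [True, True, True]),     # fully passed
    TestResult("b", [True, False, True]),    # flaky, so not fully passed
    TestResult("c", [False, False, False]),  # failed every run
]
summary = summarize(results)
print(summary)
```

In this sample, two of three tests count as wrong even though five of nine attempts passed, which is why a card can show a majority of wrong tests alongside an attempt pass rate above 50%.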
Total Tests: 18; Wrong Tests: 9; Reliability: N/A (reliability telemetry is unavailable or incomplete for this model); Attempt pass rate: 64.8%; Flaky tests: 6; …; Output Tokens: 2,010; Reasoning Tokens: 91,298; Response time: avg 23.88s · total 262.66s · max 121.79s
Anti-AI Tricks: 8.7; Wrong answer: 1; Response time: avg 3.81s · max 5.65s · total 7.62s
Coding: 2.3; Did not follow instructions: 1; Response time: avg 23.58s · max 23.58s · total 23.58s
Combined: 10.0; No failed answers; Response time: avg 37.64s · max 37.64s · total 37.64s
Data parsing and extraction: 10.0; No failed answers; Response time: avg 6.63s · max 6.63s · total 6.63s
Domain specific: 5.8; Timed out: 1; Wrong answer: 1; Response time: avg 121.79s · max 121.79s · total 121.79s
Tool Calling: 2.8; No answer: 1; Response time: avg 27.71s · max 27.71s · total 27.71s
Wrong answer: 9; Response time: avg 4.23s · max 11.07s · total 46.51s …
Total Tests: 18; Wrong Tests: 9; Reliability: N/A (reliability telemetry is unavailable or incomplete for this model); Attempt pass rate: 51.9%; Flaky tests: 1; …; Output Tokens: 1,959; Reasoning Tokens: 0; Response time: avg 4.23s · total 46.51s · max 11.07s
Anti-AI Tricks: 4.8; Wrong answer: 3; Response time: avg 2.37s · max 3.39s · total 4.75s
Coding: 5.6; Wrong answer: 1; Response time: avg 8.84s · max 8.84s · total 8.84s
Combined: 3.0; Wrong answer: 1; Response time: avg 4.98s · max 4.98s · total 4.98s
Data parsing and extraction: 10.0; No failed answers; Response time: avg 5.78s · max 5.78s · total 5.78s
Domain specific: 3.0; Wrong answer: 3; Response time: avg 2.24s · max 2.24s · total 2.24s
General Intelligence: 10.0; No failed answers; Response time: avg 3.27s · max 3.27s · total 3.27s
Instructions following: 10.0; No failed answers; Response time: avg 1.48s · max 1.48s · total 1.48s
Puzzle Solving: 7.7; Wrong answer: 1; Response time: avg 2.05s · max 2.08s · total 4.10s
Tool Calling: 10.0; No failed answers; Response time: avg 11.07s · max 11.07s · total 11.07s
Wrong answer: 6; Did not follow instructions: 4; Response time: avg 2.21s · max 14.63s · total 37.51s …
Total Tests: 18; Wrong Tests: 10; Reliability: N/A (reliability telemetry is unavailable or incomplete for this model); Attempt pass rate: 53.7%; Flaky tests: 3; …; Output Tokens: 3,972; Reasoning Tokens: 48,333; Response time: avg 2.21s · total 37.51s · max 14.63s
Coding: 10.0; No failed answers; Response time: avg 1.53s · max 1.53s · total 1.53s
Combined: 10.0; No failed answers; Response time: avg 3.28s · max 3.28s · total 3.28s
Data parsing and extraction: 7.3; Wrong answer: 1; Response time: avg 1.11s · max 1.47s · total 2.21s
Domain specific: 2.9; Wrong answer: 3; Response time: avg 6.48s · max 14.63s · total 19.43s
Instructions following: 10.0; No failed answers; Response time: avg 1.07s · max 1.07s · total 1.07s
Tool Calling: 10.0; No failed answers; Response time: avg 1.89s · max 1.89s · total 1.89s
Wrong answer: 8; Did not follow instructions: 2; Response time: avg 1.99s · max 6.81s · total 35.81s …
Total Tests: 18; Wrong Tests: 10; Reliability: N/A (reliability telemetry is unavailable or incomplete for this model); Attempt pass rate: 44.4%; Flaky tests: 0; …; Output Tokens: 868; Reasoning Tokens: 0; Response time: avg 1.99s · total 35.81s · max 6.81s
Anti-AI Tricks: 4.8; Wrong answer: 3; Response time: avg 1.10s · max 2.08s · total 4.39s
Coding: 6.6; Wrong answer: 1; Response time: avg 1.72s · max 1.72s · total 1.72s
Combined: 3.0; Wrong answer: 1; Response time: avg 2.47s · max 2.47s · total 2.47s
Data parsing and extraction: 10.0; No failed answers; Response time: avg 1.69s · max 2.46s · total 3.38s
Domain specific: 5.3; Wrong answer: 2; Response time: avg 1.14s · max 1.63s · total 3.41s
Instructions following: 6.5; Wrong answer: 1; Response time: avg 4.18s · max 6.81s · total 8.36s
Puzzle Solving: 8.0; Did not follow instructions: 1; Response time: avg 2.71s · max 5.96s · total 8.14s
Tool Calling: 10.0; No failed answers; Response time: avg 2.76s · max 2.76s · total 2.76s
Total Tests: 18; Wrong Tests: 11; Reliability: N/A (reliability telemetry is unavailable or incomplete for this model); Attempt pass rate: 57.4%; Flaky tests: 6; …; Output Tokens: 299,034; Reasoning Tokens: 309,670; Response time: avg 9.80s · total 156.75s · max 35.28s
Anti-AI Tricks: 6.9; Extra formatting: 1; Wrong answer: 1; Response time: avg 3.46s · max 4.38s · total 13.86s
Coding: 10.0; No failed answers; Response time: avg 27.11s · max 27.11s · total 27.11s
Combined: 3.0; API error: 1; Response time: avg 0ms · max 0ms · total 0ms
Data parsing and extraction: 10.0; No failed answers; Response time: avg 5.54s · max 7.51s · total 11.08s
Puzzle Solving: 7.2; Did not follow instructions: 2; Response time: avg 5.01s · max 5.49s · total 15.03s
Tool Calling: 3.0; API error: 1; Response time: avg 0ms · max 0ms · total 0ms
Total Tests: 18; Wrong Tests: 11; Reliability: N/A (reliability telemetry is unavailable or incomplete for this model); Attempt pass rate: 59.3%; Flaky tests: 8; …; Output Tokens: 4,980; Reasoning Tokens: 156,288; Response time: avg 44.13s · total 485.47s · max 204.02s
Anti-AI Tricks: 6.5; Wrong answer: 2; Response time: avg 25.50s · max 37.73s · total 51.00s
Coding: 6.7; Wrong answer: 1; Response time: avg 40.73s · max 40.73s · total 40.73s
Combined: 10.0; No failed answers; Response time: avg 65.96s · max 65.96s · total 65.96s
Data parsing and extraction: 3.7; Wrong answer: 2; Response time: avg 21.42s · max 21.42s · total 21.42s
Domain specific: 5.2; Timed out: 1; Wrong answer: 1; Response time: avg 204.02s · max 204.02s · total 204.02s
Tool Calling: 10.0; No failed answers; Response time: avg 33.30s · max 33.30s · total 33.30s
Wrong answer: 8; Did not follow instructions: 2; Response time: avg 3.10s · max 6.51s · total 55.87s …
Total Tests: 18; Wrong Tests: 10; Reliability: N/A (reliability telemetry is unavailable or incomplete for this model); Attempt pass rate: 44.4%; Flaky tests: 0; …; Output Tokens: 1,724; Reasoning Tokens: 0; Response time: avg 3.10s · total 55.87s · max 6.51s
Anti-AI Tricks: 4.8; Wrong answer: 3; Response time: avg 3.13s · max 5.90s · total 12.50s
Coding: 10.0; No failed answers; Response time: avg 5.30s · max 5.30s · total 5.30s
Combined: 3.0; Wrong answer: 1; Response time: avg 6.51s · max 6.51s · total 6.51s
Data parsing and extraction: 10.0; No failed answers; Response time: avg 3.81s · max 5.69s · total 7.62s
Domain specific: 5.3; Wrong answer: 2; Response time: avg 2.09s · max 2.39s · total 6.26s
Instructions following: 6.5; Wrong answer: 1; Response time: avg 1.97s · max 2.43s · total 3.93s
Tool Calling: 10.0; No failed answers; Response time: avg 4.86s · max 4.86s · total 4.86s
Wrong answer: 9; Did not follow instructions: 1; Response time: avg 3.25s · max 13.73s · total 58.44s …
Total Tests: 18; Wrong Tests: 10; Reliability: N/A (reliability telemetry is unavailable or incomplete for this model); Attempt pass rate: 46.3%; Flaky tests: 1; …; Output Tokens: 4,266; Reasoning Tokens: 0; Response time: avg 3.25s · total 58.44s · max 13.73s
Anti-AI Tricks: 3.5; Wrong answer: 4; Response time: avg 1.32s · max 3.89s · total 5.30s
Coding: 10.0; No failed answers; Response time: avg 1.29s · max 1.29s · total 1.29s
Combined: 3.0; Wrong answer: 1; Response time: avg 6.22s · max 6.22s · total 6.22s
Data parsing and extraction: 10.0; No failed answers; Response time: avg 1.57s · max 1.83s · total 3.14s
Domain specific: 7.7; Wrong answer: 1; Response time: avg 905ms · max 1.10s · total 2.71s
General Intelligence: 10.0; No failed answers; Response time: avg 803ms · max 803ms · total 803ms
Instructions following: 6.3; Wrong answer: 1; Response time: avg 8.81s · max 13.73s · total 17.61s
Tool Calling: 10.0; No failed answers; Response time: avg 3.67s · max 3.67s · total 3.67s
Total Tests: 18; Wrong Tests: 11; Reliability: N/A (reliability telemetry is unavailable or incomplete for this model); Attempt pass rate: 48.2%; Flaky tests: 3; …; Output Tokens: 1,783; Reasoning Tokens: 0; Response time: avg 6.59s · total 118.61s · max 57.10s
Anti-AI Tricks: 8.3; Wrong answer: 1; Response time: avg 1.28s · max 2.09s · total 5.13s
Coding: 4.7; Timed out: 1; Response time: avg 7.07s · max 7.07s · total 7.07s
Combined: 3.0; Wrong answer: 1; Response time: avg 30.53s · max 30.53s · total 30.53s
Data parsing and extraction: 10.0; No failed answers; Response time: avg 1.70s · max 2.21s · total 3.41s
Domain specific: 3.6; Wrong answer: 3; Response time: avg 2.49s · max 4.23s · total 7.48s
Tool Calling: 10.0; No failed answers; Response time: avg 57.10s · max 57.10s · total 57.10s
Wrong answer: 10; Response time: avg 2.53s · max 6.70s · total 45.46s …
Total Tests: 18; Wrong Tests: 10; Reliability: N/A (reliability telemetry is unavailable or incomplete for this model); Attempt pass rate: 55.6%; Flaky tests: 5; …; Output Tokens: 3,129; Reasoning Tokens: 0; Response time: avg 2.53s · total 45.46s · max 6.70s
Anti-AI Tricks: 3.0; Wrong answer: 4; Response time: avg 2.43s · max 6.70s · total 9.73s
Coding: 10.0; No failed answers; Response time: avg 4.61s · max 4.61s · total 4.61s
Combined: 3.0; Wrong answer: 1; Response time: avg 6.59s · max 6.59s · total 6.59s
Data parsing and extraction: 10.0; No failed answers; Response time: avg 1.82s · max 1.97s · total 3.63s
Domain specific: 3.6; Wrong answer: 3; Response time: avg 1.33s · max 1.53s · total 4.00s
General Intelligence: 10.0; No failed answers; Response time: avg 3.45s · max 3.45s · total 3.45s
Instructions following: 10.0; No failed answers; Response time: avg 1.06s · max 1.09s · total 2.12s
Puzzle Solving: 5.2; Wrong answer: 2; Response time: avg 2.46s · max 4.23s · total 7.37s
Tool Calling: 10.0; No failed answers; Response time: avg 3.94s · max 3.94s · total 3.94s
Wrong answer: 10; Did not follow instructions: 1; Response time: avg 903ms · max 4.39s · total 16.26s …
Total Tests: 18; Wrong Tests: 11; Reliability: N/A (reliability telemetry is unavailable or incomplete for this model); Attempt pass rate: 44.4%; Flaky tests: 2; …; Output Tokens: 1,726; Reasoning Tokens: 0; Response time: avg 903ms · total 16.26s · max 4.39s
Anti-AI Tricks: 3.0; Wrong answer: 4; Response time: avg 582ms · max 844ms · total 2.33s
Coding: 10.0; No failed answers; Response time: avg 1.16s · max 1.16s · total 1.16s
Combined: 3.0; Wrong answer: 1; Response time: avg 4.39s · max 4.39s · total 4.39s
Data parsing and extraction: 10.0; No failed answers; Response time: avg 652ms · max 660ms · total 1.30s
Domain specific: 5.9; Wrong answer: 2; Response time: avg 495ms · max 642ms · total 1.49s
General Intelligence: 5.0; Wrong answer: 1; Response time: avg 615ms · max 615ms · total 615ms
Instructions following: 8.0; Wrong answer: 1; Response time: avg 672ms · max 785ms · total 1.34s
Tool Calling: 10.0; No failed answers; Response time: avg 1.91s · max 1.91s · total 1.91s
Wrong answer: 9; Did not follow instructions: 2; Response time: avg 3.82s · max 47.43s · total 68.74s …
Total Tests: 18; Wrong Tests: 11; Reliability: N/A (reliability telemetry is unavailable or incomplete for this model); Attempt pass rate: 50.0%; Flaky tests: 3; …; Output Tokens: 4,300; Reasoning Tokens: 0; Response time: avg 3.82s · total 68.74s · max 47.43s
Anti-AI Tricks: 3.4; Wrong answer: 4; Response time: avg 1.43s · max 4.39s · total 5.71s
Coding: 10.0; No failed answers; Response time: avg 2.67s · max 2.67s · total 2.67s
Combined: 3.0; Wrong answer: 1; Response time: avg 47.43s · max 47.43s · total 47.43s
Data parsing and extraction: 10.0; No failed answers; Response time: avg 1.16s · max 1.42s · total 2.33s
Domain specific: 7.7; Wrong answer: 1; Response time: avg 485ms · max 549ms · total 1.45s
Instructions following: 6.3; Wrong answer: 1; Response time: avg 809ms · max 983ms · total 1.62s
Tool Calling: 10.0; No failed answers; Response time: avg 2.30s · max 2.30s · total 2.30s
Total Tests: 18; Wrong Tests: 11; Reliability: N/A (reliability telemetry is unavailable or incomplete for this model); Attempt pass rate: 46.3%; Flaky tests: 3; …; Output Tokens: 8,378; Reasoning Tokens: 0; Response time: avg 12.07s · total 217.28s · max 115.89s
Anti-AI Tricks: 3.2; Extra formatting: 2; Wrong answer: 2; Response time: avg 7.63s · max 12.26s · total 30.54s
Coding: 2.4; Wrong answer: 1; Response time: avg 7.63s · max 7.63s · total 7.63s
Combined: 6.5; Invalid tool call: 1; Response time: avg 115.89s · max 115.89s · total 115.89s
Data parsing and extraction: 6.3; Wrong answer: 1; Response time: avg 9.42s · max 16.20s · total 18.84s
Domain specific: 3.0; Wrong answer: 3; Response time: avg 1.52s · max 1.77s · total 4.55s
General Intelligence: 10.0; No failed answers; Response time: avg 2.86s · max 2.86s · total 2.86s
Instructions following: 10.0; No failed answers; Response time: avg 1.52s · max 1.99s · total 3.04s
Puzzle Solving: 8.5; Wrong answer: 1; Response time: avg 7.37s · max 10.78s · total 22.10s
Tool Calling: 10.0; No failed answers; Response time: avg 11.85s · max 11.85s · total 11.85s
Wrong answer: 9; Did not follow instructions: 2; Response time: avg 2.39s · max 6.58s · total 43.06s …
Total Tests: 18; Wrong Tests: 11; Reliability: N/A (reliability telemetry is unavailable or incomplete for this model); Attempt pass rate: 48.2%; Flaky tests: 3; …; Output Tokens: 2,320; Reasoning Tokens: 0; Response time: avg 2.39s · total 43.06s · max 6.58s
Anti-AI Tricks: 3.5; Wrong answer: 4; Response time: avg 1.80s · max 2.62s · total 7.19s
Coding: 10.0; No failed answers; Response time: avg 3.82s · max 3.82s · total 3.82s
Combined: 3.0; Wrong answer: 1; Response time: avg 6.58s · max 6.58s · total 6.58s
Data parsing and extraction: 10.0; No failed answers; Response time: avg 1.39s · max 1.42s · total 2.78s
Domain specific: 5.3; Wrong answer: 2; Response time: avg 1.78s · max 2.49s · total 5.34s
Instructions following: 6.5; Wrong answer: 1; Response time: avg 2.51s · max 2.95s · total 5.02s
Tool Calling: 10.0; No failed answers; Response time: avg 4.39s · max 4.39s · total 4.39s
Wrong answer: 10; Did not follow instructions: 1; Response time: avg 1.51s · max 2.95s · total 27.21s …
Total Tests: 18; Wrong Tests: 11; Reliability: N/A (reliability telemetry is unavailable or incomplete for this model); Attempt pass rate: 42.6%; Flaky tests: 2; …; Output Tokens: 2,317; Reasoning Tokens: 0; Response time: avg 1.51s · total 27.21s · max 2.95s
Anti-AI Tricks: 3.2; Wrong answer: 4; Response time: avg 1.21s · max 2.58s · total 4.85s
Coding: 10.0; No failed answers; Response time: avg 2.95s · max 2.95s · total 2.95s
Combined: 3.0; Wrong answer: 1; Response time: avg 2.89s · max 2.89s · total 2.89s
Data parsing and extraction: 10.0; No failed answers; Response time: avg 1.04s · max 1.06s · total 2.08s
Domain specific: 5.3; Wrong answer: 2; Response time: avg 1.07s · max 1.54s · total 3.22s
General Intelligence: 4.4; Wrong answer: 1; Response time: avg 1.78s · max 1.78s · total 1.78s
Instructions following: 6.5; Wrong answer: 1; Response time: avg 1.07s · max 1.17s · total 2.15s
Tool Calling: 10.0; No failed answers; Response time: avg 2.75s · max 2.75s · total 2.75s
Wrong answer: 10; Did not follow instructions: 2; Response time: avg 1.74s · max 9.39s · total 31.32s …
Total Tests: 18; Wrong Tests: 12; Reliability: N/A (reliability telemetry is unavailable or incomplete for this model); Attempt pass rate: 38.9%; Flaky tests: 2; …; Output Tokens: 3,545; Reasoning Tokens: 0; Response time: avg 1.74s · total 31.32s · max 9.39s
Anti-AI Tricks: 4.8; Wrong answer: 3; Response time: avg 788ms · max 1.34s · total 3.15s
Coding: 10.0; No failed answers; Response time: avg 2.51s · max 2.51s · total 2.51s
Combined: 2.8; Wrong answer: 1; Response time: avg 9.39s · max 9.39s · total 9.39s
Data parsing and extraction: 10.0; No failed answers; Response time: avg 1.43s · max 1.45s · total 2.86s
Domain specific: 3.0; Wrong answer: 3; Response time: avg 540ms · max 649ms · total 1.62s
Instructions following: 4.8; Wrong answer: 2; Response time: avg 815ms · max 973ms · total 1.63s
Tool Calling: 10.0; No failed answers; Response time: avg 3.54s · max 3.54s · total 3.54s
Wrong answer: 7; Did not follow instructions: 4; Response time: avg 16.08s · max 50.92s · total 176.88s …
Total Tests: 18; Wrong Tests: 11; Reliability: N/A (reliability telemetry is unavailable or incomplete for this model); Attempt pass rate: 51.9%; Flaky tests: 6; …; Output Tokens: 13,493; Reasoning Tokens: 36,879; Response time: avg 16.08s · total 176.88s · max 50.92s
Coding: 4.3; Wrong answer: 1; Response time: avg 26.33s · max 26.33s · total 26.33s
Combined: 10.0; No failed answers; Response time: avg 31.18s · max 31.18s · total 31.18s
Data parsing and extraction: 6.4; Wrong answer: 1; Response time: avg 1.98s · max 1.98s · total 1.98s
Domain specific: 2.9; Wrong answer: 3; Response time: avg 50.92s · max 50.92s · total 50.92s
Instructions following: 9.9; No failed answers; Response time: avg 7.63s · max 7.63s · total 7.63s
Tool Calling: 9.8; No failed answers; Response time: avg 6.91s · max 6.91s · total 6.91s
Wrong answer: 8; Did not follow instructions: 3; Response time: avg 2.05s · max 6.65s · total 36.93s …
Total Tests: 18; Wrong Tests: 11; Reliability: N/A (reliability telemetry is unavailable or incomplete for this model); Attempt pass rate: 42.6%; Flaky tests: 2; …; Output Tokens: 2,973; Reasoning Tokens: 0; Response time: avg 2.05s · total 36.93s · max 6.65s
Anti-AI Tricks: 4.6; Wrong answer: 3; Response time: avg 1.39s · max 2.96s · total 5.56s
Coding: 10.0; No failed answers; Response time: avg 6.65s · max 6.65s · total 6.65s
Combined: 3.0; Wrong answer: 1; Response time: avg 3.38s · max 3.38s · total 3.38s
Data parsing and extraction: 10.0; No failed answers; Response time: avg 1.32s · max 1.39s · total 2.64s
Domain specific: 5.3; Wrong answer: 2; Response time: avg 1.48s · max 1.85s · total 4.45s
Instructions following: 6.5; Wrong answer: 1; Response time: avg 1.64s · max 1.80s · total 3.28s
Tool Calling: 10.0; No failed answers; Response time: avg 4.46s · max 4.46s · total 4.46s
Wrong answer: 10; Did not follow instructions: 2; Response time: avg 1.51s · max 3.54s · total 27.21s …
Total Tests: 18; Wrong Tests: 12; Reliability: N/A (reliability telemetry is unavailable or incomplete for this model); Attempt pass rate: 46.3%; Flaky tests: 4; …; Output Tokens: 2,451; Reasoning Tokens: 0; Response time: avg 1.51s · total 27.21s · max 3.54s
Anti-AI Tricks: 2.9; Wrong answer: 4; Response time: avg 1.29s · max 2.83s · total 5.18s
Coding: 6.4; Wrong answer: 1; Response time: avg 2.39s · max 2.39s · total 2.39s
Combined: 3.0; Did not follow instructions: 1; Response time: avg 3.54s · max 3.54s · total 3.54s
Data parsing and extraction: 10.0; No failed answers; Response time: avg 1.32s · max 1.42s · total 2.64s
Domain specific: 5.3; Wrong answer: 2; Response time: avg 877ms · max 904ms · total 2.63s
General Intelligence: 4.5; Wrong answer: 1; Response time: avg 1.53s · max 1.53s · total 1.53s
Instructions following: 6.4; Wrong answer: 1; Response time: avg 1.03s · max 1.10s · total 2.06s
Tool Calling: 10.0; No failed answers; Response time: avg 3.30s · max 3.30s · total 3.30s
Overall: Wrong answer: 11; Did not follow instructions: 1; Response time: avg 3.69s · max 46.00s · total 66.50s…
Total Tests: 18; Wrong Tests: 12; Reliability: N/A (telemetry unavailable or incomplete for this model); Attempt pass rate: 38.9%; Flaky tests: 2…
Output Tokens: 3,341; Reasoning Tokens: 0; Response time: avg 3.69s · total 66.50s · max 46.00s
Anti-AI Tricks: 4.8; Wrong answer: 3; Response time: avg 1.59s · max 3.60s · total 6.38s
Coding: 4.3; Wrong answer: 1; Response time: avg 3.44s · max 3.44s · total 3.44s
Combined: 3.0; Wrong answer: 1; Response time: avg 46.00s · max 46.00s · total 46.00s
Data parsing and extraction: 10.0; No failed answers; Response time: avg 1.01s · max 1.06s · total 2.02s
Domain specific: 5.3; Wrong answer: 2; Response time: avg 465ms · max 492ms · total 1.39s
Instructions following: 4.5; Wrong answer: 2; Response time: avg 585ms · max 715ms · total 1.17s
Puzzle Solving: 5.4; Wrong answer: 2; Response time: avg 982ms · max 1.36s · total 2.95s
Tool Calling: 10.0; No failed answers; Response time: avg 2.04s · max 2.04s · total 2.04s
Total Tests: 18; Wrong Tests: 13; Reliability: N/A (telemetry unavailable or incomplete for this model); Attempt pass rate: 57.4%; Flaky tests: 10…
Output Tokens: 107,044; Reasoning Tokens: 206,422; Response time: avg 39.65s · total 396.47s · max 237.27s
Coding: 3.0; Timed out: 1; Response time: avg 0ms · max 0ms · total 0ms
Combined: 4.5; Invalid tool call: 1; Response time: avg 60.39s · max 60.39s · total 60.39s
Data parsing and extraction: 4.6; Wrong answer: 2; Response time: avg 7.48s · max 7.48s · total 7.48s
Domain specific: 2.9; Wrong answer: 2; Timed out: 1; Response time: avg 237.27s · max 237.27s · total 237.27s
Puzzle Solving: 5.3; Timed out: 1; Wrong answer: 1; Response time: avg 11.54s · max 17.37s · total 23.08s
Tool Calling: 10.0; No failed answers; Response time: avg 15.35s · max 15.35s · total 15.35s
Total Tests: 18; Wrong Tests: 12; Reliability: N/A (telemetry unavailable or incomplete for this model); Attempt pass rate: 46.3%; Flaky tests: 4…
Output Tokens: 2,278; Reasoning Tokens: 0; Response time: avg 4.58s · total 77.92s · max 15.17s
Anti-AI Tricks: 3.5; Wrong answer: 4; Response time: avg 3.81s · max 6.85s · total 15.23s
Coding: 3.0; API error: 1; Response time: avg 0ms · max 0ms · total 0ms
Combined: 3.0; Wrong answer: 1; Response time: avg 15.17s · max 15.17s · total 15.17s
Data parsing and extraction: 10.0; No failed answers; Response time: avg 8.49s · max 14.02s · total 16.98s
Domain specific: 5.3; Wrong answer: 2; Response time: avg 2.33s · max 2.94s · total 6.99s
Instructions following: 6.4; Wrong answer: 1; Response time: avg 2.82s · max 2.92s · total 5.65s
Tool Calling: 10.0; No failed answers; Response time: avg 6.02s · max 6.02s · total 6.02s
Total Tests: 18; Wrong Tests: 13; Reliability: N/A (telemetry unavailable or incomplete for this model); Attempt pass rate: 50.0%; Flaky tests: 7…
Output Tokens: 15,084; Reasoning Tokens: 39,408; Response time: avg 5.64s · total 101.52s · max 30.49s
Anti-AI Tricks: 5.6; Wrong answer: 3; Response time: avg 2.67s · max 5.03s · total 10.66s
Coding: 6.7; Wrong answer: 1; Response time: avg 30.49s · max 30.49s · total 30.49s
Combined: 3.0; Wrong answer: 1; Response time: avg 25.25s · max 25.25s · total 25.25s
Data parsing and extraction: 7.3; API error: 1; Response time: avg 1.23s · max 1.96s · total 2.46s
Domain specific: 5.3; API error: 1; Wrong answer: 1; Response time: avg 6.11s · max 13.72s · total 18.34s
Instructions following: 7.3; Wrong answer: 1; Response time: avg 1.38s · max 1.61s · total 2.75s
Tool Calling: 10.0; No failed answers; Response time: avg 3.50s · max 3.50s · total 3.50s
Total Tests: 18; Wrong Tests: 13; Reliability: N/A (telemetry unavailable or incomplete for this model); Attempt pass rate: 37.0%; Flaky tests: 3…
Output Tokens: 2,489; Reasoning Tokens: 0; Response time: avg 3.35s · total 36.90s · max 7.05s
Anti-AI Tricks: 5.2; Wrong answer: 3; Response time: avg 5.51s · max 6.59s · total 11.02s
Coding: 6.4; Wrong answer: 1; Response time: avg 5.57s · max 5.57s · total 5.57s
Combined: 3.0; Invalid tool call: 1; Response time: avg 3.22s · max 3.22s · total 3.22s
Data parsing and extraction: 7.3; Wrong answer: 1; Response time: avg 4.82s · max 4.82s · total 4.82s
Domain specific: 7.7; Wrong answer: 1; Response time: avg 744ms · max 744ms · total 744ms
General Intelligence: 4.0; Wrong answer: 1; Response time: avg 1.59s · max 1.59s · total 1.59s
Instructions following: 6.5; Wrong answer: 1; Response time: avg 888ms · max 888ms · total 888ms
Tool Calling: 2.8; Wrong answer: 1; Response time: avg 7.05s · max 7.05s · total 7.05s
Total Tests: 18; Wrong Tests: 13; Reliability: N/A (telemetry unavailable or incomplete for this model); Attempt pass rate: 37.0%; Flaky tests: 4…
Output Tokens: 3,720; Reasoning Tokens: 0; Response time: avg 4.33s · total 78.02s · max 32.57s
Anti-AI Tricks: 4.0; Wrong answer: 4; Response time: avg 2.11s · max 3.94s · total 8.46s
Coding: 5.1; Wrong answer: 1; Response time: avg 9.79s · max 9.79s · total 9.79s
Combined: 2.8; Invalid tool call: 1; Response time: avg 32.57s · max 32.57s · total 32.57s
Data parsing and extraction: 10.0; No failed answers; Response time: avg 1.08s · max 1.62s · total 2.15s
Domain specific: 2.9; Wrong answer: 3; Response time: avg 1.99s · max 3.99s · total 5.98s
General Intelligence: 5.0; Wrong answer: 1; Response time: avg 790ms · max 790ms · total 790ms
Tool Calling: 10.0; No failed answers; Response time: avg 10.68s · max 10.68s · total 10.68s
Overall: Wrong answer: 12; Response time: avg 13.37s · max 42.13s · total 147.05s…
Total Tests: 18; Wrong Tests: 12; Reliability: N/A (telemetry unavailable or incomplete for this model); Attempt pass rate: 40.7%; Flaky tests: 3…
Output Tokens: 2,659; Reasoning Tokens: 0; Response time: avg 13.37s · total 147.05s · max 42.13s
Anti-AI Tricks: 3.6; Wrong answer: 4; Response time: avg 6.24s · max 11.38s · total 12.48s
Coding: 10.0; No failed answers; Response time: avg 38.78s · max 38.78s · total 38.78s
Combined: 2.8; Wrong answer: 1; Response time: avg 19.16s · max 19.16s · total 19.16s
Data parsing and extraction: 7.3; Wrong answer: 1; Response time: avg 42.13s · max 42.13s · total 42.13s
Domain specific: 5.3; Wrong answer: 2; Response time: avg 4.38s · max 4.38s · total 4.38s
General Intelligence: 10.0; No failed answers; Response time: avg 4.00s · max 4.00s · total 4.00s
Instructions following: 6.5; Wrong answer: 1; Response time: avg 2.67s · max 2.67s · total 2.67s
Puzzle Solving: 3.1; Wrong answer: 3; Response time: avg 4.73s · max 7.81s · total 9.45s
Tool Calling: 10.0; No failed answers; Response time: avg 13.99s · max 13.99s · total 13.99s
Overall: Wrong answer: 10; Did not follow instructions: 2; Response time: avg 2.94s · max 8.21s · total 52.98s…
Total Tests: 18; Wrong Tests: 12; Reliability: N/A (telemetry unavailable or incomplete for this model); Attempt pass rate: 37.0%; Flaky tests: 2…
Output Tokens: 1,775; Reasoning Tokens: 0; Response time: avg 2.94s · total 52.98s · max 8.21s
Anti-AI Tricks: 3.0; Wrong answer: 4; Response time: avg 2.84s · max 4.15s · total 11.35s
Coding: 5.3; Wrong answer: 1; Response time: avg 3.93s · max 3.93s · total 3.93s
Combined: 3.0; Wrong answer: 1; Response time: avg 4.89s · max 4.89s · total 4.89s
Data parsing and extraction: 10.0; No failed answers; Response time: avg 2.47s · max 2.48s · total 4.95s
Domain specific: 5.3; Wrong answer: 2; Response time: avg 1.97s · max 2.65s · total 5.92s
Instructions following: 6.5; Wrong answer: 1; Response time: avg 2.13s · max 2.53s · total 4.27s
Tool Calling: 10.0; No failed answers; Response time: avg 8.21s · max 8.21s · total 8.21s
Total Tests: 18; Wrong Tests: 12; Reliability: N/A (telemetry unavailable or incomplete for this model); Attempt pass rate: 35.2%; Flaky tests: 1…
Output Tokens: 3,338; Reasoning Tokens: 0; Response time: avg 11.33s · total 203.88s · max 35.34s
Anti-AI Tricks: 6.5; Wrong answer: 2; Response time: avg 12.30s · max 16.60s · total 49.20s
Coding: 10.0; No failed answers; Response time: avg 11.21s · max 11.21s · total 11.21s
Combined: 3.0; Invalid tool call: 1; Response time: avg 35.34s · max 35.34s · total 35.34s
Data parsing and extraction: 6.5; Wrong answer: 1; Response time: avg 8.48s · max 12.71s · total 16.96s
Domain specific: 3.0; Wrong answer: 3; Response time: avg 4.94s · max 7.65s · total 14.81s
Instructions following: 9.8; No failed answers; Response time: avg 5.52s · max 8.19s · total 11.04s
Tool Calling: 3.0; Invalid tool call: 1; Response time: avg 18.80s · max 18.80s · total 18.80s
Overall: Wrong answer: 11; Did not follow instructions: 2; Response time: avg 5.07s · max 39.47s · total 91.23s…
Total Tests: 18; Wrong Tests: 13; Reliability: N/A (telemetry unavailable or incomplete for this model); Attempt pass rate: 29.6%; Flaky tests: 1…
Output Tokens: 1,985; Reasoning Tokens: 0; Response time: avg 5.07s · total 91.23s · max 39.47s
Anti-AI Tricks: 3.0; Wrong answer: 4; Response time: avg 3.02s · max 8.17s · total 12.07s
Coding: 6.3; Wrong answer: 1; Response time: avg 39.47s · max 39.47s · total 39.47s
Combined: 3.0; Wrong answer: 1; Response time: avg 8.91s · max 8.91s · total 8.91s
Data parsing and extraction: 10.0; No failed answers; Response time: avg 3.26s · max 4.66s · total 6.52s
Domain specific: 5.3; Wrong answer: 2; Response time: avg 877ms · max 894ms · total 2.63s
Puzzle Solving: 5.4; Wrong answer: 2; Response time: avg 3.30s · max 4.81s · total 9.91s
Tool Calling: 10.0; No failed answers; Response time: avg 6.67s · max 6.67s · total 6.67s