Total Tests: 18 · Wrong Tests: 13 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 33.3% · Flaky tests: 2 …
Output Tokens: 4,444 · Reasoning Tokens: 0 · Response time: avg 29.39s · total 529.10s · max 111.96s
A test is fully passed only if every run passed for that test.
Anti-AI Tricks: 3.0 · Wrong answer: 4 · Response time: avg 20.18s · max 26.54s · total 80.73s
Coding: 6.3 · Wrong answer: 1 · Response time: avg 24.04s · max 24.04s · total 24.04s
Combined: 4.5 · Invalid tool call: 1 · Response time: avg 111.96s · max 111.96s · total 111.96s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 23.79s · max 23.85s · total 47.57s
Domain specific: 5.3 · Wrong answer: 2 · Response time: avg 19.73s · max 27.66s · total 59.18s
General Intelligence: 4.2 · Wrong answer: 1 · Response time: avg 23.74s · max 23.74s · total 23.74s
Instructions following: 6.5 · Extra formatting: 1 · Response time: avg 17.54s · max 18.51s · total 35.08s
Tool Calling: 10.0 · No failed answers · Response time: avg 77.93s · max 77.93s · total 77.93s
Total Tests: 18 · Wrong Tests: 14 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 29.6% · Flaky tests: 2 …
Output Tokens: 1,591 · Reasoning Tokens: 0 · Response time: avg 1.19s · total 21.37s · max 6.48s
Anti-AI Tricks: 4.0 · Wrong answer: 4 · Response time: avg 597ms · max 866ms · total 2.39s
Coding: 5.5 · Wrong answer: 1 · Response time: avg 1.14s · max 1.14s · total 1.14s
Combined: 3.0 · Invalid tool call: 1 · Response time: avg 6.48s · max 6.48s · total 6.48s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 601ms · max 634ms · total 1.20s
Domain specific: 3.0 · Wrong answer: 3 · Response time: avg 611ms · max 616ms · total 1.83s
Tool Calling: 10.0 · No failed answers · Response time: avg 4.79s · max 4.79s · total 4.79s
Total Tests: 18 · Wrong Tests: 14 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 51.9% · Flaky tests: 10 …
Output Tokens: 4,984 · Reasoning Tokens: 62,787 · Response time: avg 31.08s · total 528.37s · max 117.04s
Coding: 10.0 · No failed answers · Response time: avg 91.27s · max 91.27s · total 91.27s
Combined: 4.7 · Invalid tool call: 1 · Response time: avg 41.03s · max 41.03s · total 41.03s
Data parsing and extraction: 6.3 · Wrong answer: 1 · Response time: avg 21.95s · max 24.88s · total 43.89s
Domain specific: 3.0 · Timed out: 2 · Wrong answer: 1 · Response time: avg 19.00s · max 21.63s · total 38.01s
Tool Calling: 4.7 · Did not follow instructions: 1 · Response time: avg 12.05s · max 12.05s · total 12.05s
Total Tests: 18 · Wrong Tests: 13 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 29.6% · Flaky tests: 1 …
Output Tokens: 2,596 · Reasoning Tokens: 0 · Response time: avg 1.27s · total 22.82s · max 3.70s
Anti-AI Tricks: 6.6 · Wrong answer: 2 · Response time: avg 1.19s · max 2.04s · total 4.75s
Coding: 5.1 · Wrong answer: 1 · Response time: avg 1.30s · max 1.30s · total 1.30s
Combined: 3.0 · Wrong answer: 1 · Response time: avg 3.70s · max 3.70s · total 3.70s
Data parsing and extraction: 6.5 · Wrong answer: 1 · Response time: avg 979ms · max 1.02s · total 1.96s
Domain specific: 3.0 · Wrong answer: 3 · Response time: avg 925ms · max 1.16s · total 2.77s
Instructions following: 9.8 · No failed answers · Response time: avg 987ms · max 1.13s · total 1.97s
Tool Calling: 3.0 · Invalid tool call: 1 · Response time: avg 2.83s · max 2.83s · total 2.83s
Total Tests: 18 · Wrong Tests: 13 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 29.6% · Flaky tests: 1 …
Output Tokens: 1,967 · Reasoning Tokens: 0 · Response time: avg 1.11s · total 20.02s · max 6.04s
Anti-AI Tricks: 4.8 · Wrong answer: 3 · Response time: avg 501ms · max 839ms · total 2.01s
Coding: 3.4 · Wrong answer: 1 · Response time: avg 1.22s · max 1.22s · total 1.22s
Combined: 3.0 · Invalid tool call: 1 · Response time: avg 6.04s · max 6.04s · total 6.04s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 522ms · max 537ms · total 1.04s
General Intelligence: 4.8 · Wrong answer: 1 · Response time: avg 659ms · max 659ms · total 659ms
Tool Calling: 10.0 · No failed answers · Response time: avg 4.63s · max 4.63s · total 4.63s
Total Tests: 18 · Wrong Tests: 13 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 31.5% · Flaky tests: 1 …
Output Tokens: 2,207 · Reasoning Tokens: 0 · Response time: avg 665ms · total 11.97s · max 1.72s
All tests: Wrong answer: 11 · Did not follow instructions: 2 · Response time: avg 665ms · max 1.72s · total 11.97s …
Anti-AI Tricks: 3.4 · Wrong answer: 4 · Response time: avg 395ms · max 769ms · total 1.58s
Coding: 4.5 · Wrong answer: 1 · Response time: avg 1.28s · max 1.28s · total 1.28s
Combined: 3.0 · Wrong answer: 1 · Response time: avg 1.72s · max 1.72s · total 1.72s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 822ms · max 1.08s · total 1.64s
Domain specific: 5.3 · Wrong answer: 2 · Response time: avg 367ms · max 388ms · total 1.10s
General Intelligence: 4.0 · Wrong answer: 1 · Response time: avg 729ms · max 729ms · total 729ms
Instructions following: 6.5 · Wrong answer: 1 · Response time: avg 380ms · max 380ms · total 759ms
Tool Calling: 10.0 · No failed answers · Response time: avg 1.40s · max 1.40s · total 1.40s
Total Tests: 18 · Wrong Tests: 14 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 38.9% · Flaky tests: 5 …
Output Tokens: 44,652 · Reasoning Tokens: 0 · Response time: avg 11.96s · total 179.34s · max 68.97s
Coding: 4.3 · Wrong answer: 1 · Response time: avg 9.57s · max 9.57s · total 9.57s
Combined: 3.0 · API error: 1 · Response time: avg 0ms · max 0ms · total 0ms
Data parsing and extraction: 6.5 · API error: 1 · Response time: avg 7.12s · max 7.12s · total 7.12s
Domain specific: 3.0 · Wrong answer: 3 · Response time: avg 34.98s · max 68.97s · total 104.94s
Tool Calling: 3.0 · API error: 1 · Response time: avg 0ms · max 0ms · total 0ms
Total Tests: 18 · Wrong Tests: 13 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 31.5% · Flaky tests: 1 …
Output Tokens: 2,573 · Reasoning Tokens: 0 · Response time: avg 1.23s · total 22.16s · max 3.81s
Coding: 6.4 · Wrong answer: 1 · Response time: avg 1.39s · max 1.39s · total 1.39s
Combined: 3.0 · Wrong answer: 1 · Response time: avg 3.81s · max 3.81s · total 3.81s
Data parsing and extraction: 6.5 · Wrong answer: 1 · Response time: avg 1.04s · max 1.05s · total 2.08s
Domain specific: 3.0 · Wrong answer: 3 · Response time: avg 927ms · max 1.17s · total 2.78s
Instructions following: 9.8 · No failed answers · Response time: avg 1.03s · max 1.17s · total 2.07s
Tool Calling: 3.0 · Invalid tool call: 1 · Response time: avg 2.79s · max 2.79s · total 2.79s
Total Tests: 18 · Wrong Tests: 13 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 35.2% · Flaky tests: 3 …
Output Tokens: 2,418 · Reasoning Tokens: 0 · Response time: avg 1.17s · total 21.01s · max 2.52s
All tests: Wrong answer: 10 · Did not follow instructions: 3 · Response time: avg 1.17s · max 2.52s · total 21.01s …
Anti-AI Tricks: 3.1 · Wrong answer: 4 · Response time: avg 929ms · max 1.55s · total 3.72s
Coding: 10.0 · No failed answers · Response time: avg 1.19s · max 1.19s · total 1.19s
Combined: 3.0 · Wrong answer: 1 · Response time: avg 2.52s · max 2.52s · total 2.52s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 1.30s · max 1.58s · total 2.61s
Domain specific: 3.5 · Wrong answer: 3 · Response time: avg 937ms · max 1.25s · total 2.81s
Instructions following: 6.3 · Wrong answer: 1 · Response time: avg 728ms · max 731ms · total 1.46s
Tool Calling: 3.0 · Did not follow instructions: 1 · Response time: avg 2.32s · max 2.32s · total 2.32s
Total Tests: 18 · Wrong Tests: 14 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 25.9% · Flaky tests: 1 …
Output Tokens: 3,617 · Reasoning Tokens: 0 · Response time: avg 10.18s · total 122.13s · max 45.14s
Coding: 7.3 · Wrong answer: 1 · Response time: avg 3.14s · max 3.14s · total 3.14s
Combined: 3.0 · Wrong answer: 1 · Response time: avg 45.14s · max 45.14s · total 45.14s
Data parsing and extraction: 6.5 · Wrong answer: 1 · Response time: avg 1.32s · max 1.32s · total 1.32s
Domain specific: 5.3 · Wrong answer: 2 · Response time: avg 962ms · max 962ms · total 962ms
General Intelligence: 10.0 · No failed answers · Response time: avg 1.34s · max 1.34s · total 1.34s
Instructions following: 4.8 · Wrong answer: 2 · Response time: avg 7.71s · max 14.65s · total 15.42s
Puzzle Solving: 3.2 · Wrong answer: 3 · Response time: avg 22.86s · max 42.58s · total 45.73s
Tool Calling: 10.0 · No failed answers · Response time: avg 2.47s · max 2.47s · total 2.47s
Total Tests: 18 · Wrong Tests: 13 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 27.8% · Flaky tests: 0 …
Output Tokens: 2,177 · Reasoning Tokens: 0 · Response time: avg 1.05s · total 18.94s · max 2.43s
Anti-AI Tricks: 4.8 · Wrong answer: 3 · Response time: avg 842ms · max 1.47s · total 3.37s
Coding: 10.0 · No failed answers · Response time: avg 1.95s · max 1.95s · total 1.95s
Combined: 3.0 · Wrong answer: 1 · Response time: avg 2.36s · max 2.36s · total 2.36s
Data parsing and extraction: 6.5 · Extra formatting: 1 · Response time: avg 1.01s · max 1.18s · total 2.03s
Domain specific: 3.0 · Wrong answer: 3 · Response time: avg 756ms · max 877ms · total 2.27s
General Intelligence: 4.6 · Wrong answer: 1 · Response time: avg 841ms · max 841ms · total 841ms
Instructions following: 6.5 · Wrong answer: 1 · Response time: avg 751ms · max 821ms · total 1.50s
Tool Calling: 10.0 · No failed answers · Response time: avg 2.43s · max 2.43s · total 2.43s
Total Tests: 18 · Wrong Tests: 14 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 35.2% · Flaky tests: 4 …
Output Tokens: 4,760 · Reasoning Tokens: 0 · Response time: avg 8.54s · total 153.69s · max 24.97s
All tests: Wrong answer: 10 · Did not follow instructions: 4 · Response time: avg 8.54s · max 24.97s · total 153.69s …
Anti-AI Tricks: 4.8 · Wrong answer: 3 · Response time: avg 7.43s · max 16.69s · total 29.72s
Coding: 3.3 · Wrong answer: 1 · Response time: avg 2.99s · max 2.99s · total 2.99s
Combined: 3.0 · Wrong answer: 1 · Response time: avg 19.98s · max 19.98s · total 19.98s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 7.92s · max 13.23s · total 15.84s
Domain specific: 3.6 · Wrong answer: 3 · Response time: avg 6.23s · max 14.38s · total 18.70s
Tool Calling: 4.7 · Did not follow instructions: 1 · Response time: avg 16.00s · max 16.00s · total 16.00s
Total Tests: 18 · Wrong Tests: 14 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 22.2% · Flaky tests: 0 …
Output Tokens: 1,947 · Reasoning Tokens: 0 · Response time: avg 2.00s · total 21.99s · max 7.58s
All tests: Wrong answer: 13 · Did not follow instructions: 1 · Response time: avg 2.00s · max 7.58s · total 21.99s …
Anti-AI Tricks: 4.8 · Wrong answer: 3 · Response time: avg 1.34s · max 1.83s · total 2.67s
Coding: 3.0 · Wrong answer: 1 · Response time: avg 2.55s · max 2.55s · total 2.55s
Combined: 3.0 · Wrong answer: 1 · Response time: avg 7.58s · max 7.58s · total 7.58s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 1.27s · max 1.27s · total 1.27s
Domain specific: 3.0 · Wrong answer: 3 · Response time: avg 637ms · max 637ms · total 637ms
General Intelligence: 4.0 · Wrong answer: 1 · Response time: avg 909ms · max 909ms · total 909ms
Puzzle Solving: 3.7 · Wrong answer: 3 · Response time: avg 1.30s · max 1.54s · total 2.60s
Tool Calling: 10.0 · No failed answers · Response time: avg 2.51s · max 2.51s · total 2.51s
Total Tests: 18 · Wrong Tests: 14 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 24.1% · Flaky tests: 1 …
Output Tokens: 3,951 · Reasoning Tokens: 0 · Response time: avg 1.47s · total 26.43s · max 5.91s
Anti-AI Tricks: 3.1 · Wrong answer: 4 · Response time: avg 1.71s · max 3.79s · total 6.84s
Coding: 5.2 · Wrong answer: 1 · Response time: avg 5.69s · max 5.69s · total 5.69s
Combined: 3.0 · Invalid tool call: 1 · Response time: avg 5.91s · max 5.91s · total 5.91s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 847ms · max 1.09s · total 1.69s
Domain specific: 3.0 · Wrong answer: 3 · Response time: avg 464ms · max 622ms · total 1.39s
Instructions following: 6.5 · Wrong answer: 1 · Response time: avg 514ms · max 582ms · total 1.03s
Tool Calling: 10.0 · No failed answers · Response time: avg 1.27s · max 1.27s · total 1.27s
Total Tests: 18 · Wrong Tests: 14 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 27.8% · Flaky tests: 2 …
Output Tokens: 1,625 · Reasoning Tokens: 0 · Response time: avg 613ms · total 11.04s · max 1.27s
All tests: Wrong answer: 13 · Did not follow instructions: 1 · Response time: avg 613ms · max 1.27s · total 11.04s …
Anti-AI Tricks: 3.0 · Wrong answer: 4 · Response time: avg 483ms · max 716ms · total 1.93s
Coding: 3.6 · Wrong answer: 1 · Response time: avg 969ms · max 969ms · total 969ms
Combined: 3.0 · Wrong answer: 1 · Response time: avg 606ms · max 606ms · total 606ms
Data parsing and extraction: 7.3 · Wrong answer: 1 · Response time: avg 667ms · max 819ms · total 1.33s
Domain specific: 5.3 · Wrong answer: 2 · Response time: avg 534ms · max 733ms · total 1.60s
Instructions following: 6.5 · Wrong answer: 1 · Response time: avg 551ms · max 622ms · total 1.10s
Puzzle Solving: 3.1 · Wrong answer: 3 · Response time: avg 533ms · max 637ms · total 1.60s
Tool Calling: 10.0 · No failed answers · Response time: avg 1.27s · max 1.27s · total 1.27s
Total Tests: 18 · Wrong Tests: 15 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 27.8% · Flaky tests: 3 …
Output Tokens: 3,241 · Reasoning Tokens: 0 · Response time: avg 10.75s · total 129.01s · max 81.80s
Coding: 4.7 · Timed out: 1 · Response time: avg 1.69s · max 1.69s · total 1.69s
Combined: 3.0 · Wrong answer: 1 · Response time: avg 4.28s · max 4.28s · total 4.28s
Data parsing and extraction: 6.5 · Wrong answer: 1 · Response time: avg 81.80s · max 81.80s · total 81.80s
Domain specific: 5.3 · Wrong answer: 2 · Response time: avg 638ms · max 638ms · total 638ms
Tool Calling: 10.0 · No failed answers · Response time: avg 2.64s · max 2.64s · total 2.64s
Total Tests: 18 · Wrong Tests: 14 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 27.8% · Flaky tests: 2 …
Output Tokens: 2,639 · Reasoning Tokens: 0 · Response time: avg 13.56s · total 230.55s · max 35.84s
Coding: 2.3 · Wrong answer: 1 · Response time: avg 4.56s · max 4.56s · total 4.56s
Combined: 3.0 · Wrong answer: 1 · Response time: avg 35.84s · max 35.84s · total 35.84s
Data parsing and extraction: 6.5 · API error: 1 · Response time: avg 2.85s · max 2.85s · total 2.85s
Domain specific: 3.6 · Wrong answer: 2 · API error: 1 · Response time: avg 17.61s · max 25.68s · total 52.82s
Instructions following: 6.3 · Extra formatting: 1 · Response time: avg 12.98s · max 23.51s · total 25.95s
Tool Calling: 10.0 · No failed answers · Response time: avg 33.76s · max 33.76s · total 33.76s
Total Tests: 18 · Wrong Tests: 14 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 38.9% · Flaky tests: 8 …
Output Tokens: 39,688 · Reasoning Tokens: 72,401 · Response time: avg 32.33s · total 355.65s · max 174.55s
Coding: 3.6 · Timed out: 1 · Response time: avg 21.26s · max 21.26s · total 21.26s
Combined: 2.8 · Invalid tool call: 1 · Response time: avg 65.57s · max 65.57s · total 65.57s
Data parsing and extraction: 6.3 · No answer: 1 · Response time: avg 1.51s · max 1.51s · total 1.51s
Domain specific: 3.5 · Wrong answer: 2 · No answer: 1 · Response time: avg 174.55s · max 174.55s · total 174.55s
General Intelligence: 3.6 · Wrong answer: 1 · Response time: avg 18.14s · max 18.14s · total 18.14s
Instructions following: 6.2 · Wrong answer: 1 · Response time: avg 2.97s · max 2.97s · total 2.97s
Tool Calling: 10.0 · No failed answers · Response time: avg 15.95s · max 15.95s · total 15.95s
Total Tests: 18 · Wrong Tests: 15 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 27.8% · Flaky tests: 5 …
Output Tokens: 68,522 · Reasoning Tokens: 0 · Response time: avg 2.79s · total 39.08s · max 19.68s
Anti-AI Tricks: 3.2 · Wrong answer: 4 · Response time: avg 1.19s · max 2.73s · total 4.76s
Coding: 6.3 · Wrong answer: 1 · Response time: avg 2.79s · max 2.79s · total 2.79s
Combined: 3.0 · Wrong answer: 1 · Response time: avg 2.87s · max 2.87s · total 2.87s
Domain specific: 5.3 · Wrong answer: 2 · Response time: avg 564ms · max 564ms · total 564ms
Instructions following: 6.5 · Wrong answer: 1 · Response time: avg 857ms · max 955ms · total 1.71s
Puzzle Solving: 3.6 · Wrong answer: 3 · Response time: avg 1.38s · max 1.74s · total 2.75s
Tool Calling: 10.0 · No failed answers · Response time: avg 2.28s · max 2.28s · total 2.28s
Total Tests: 18 · Wrong Tests: 15 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 24.1% · Flaky tests: 3 …
Output Tokens: 1,721 · Reasoning Tokens: 0 · Response time: avg 1.76s · total 19.35s · max 5.51s
All tests: Wrong answer: 13 · Did not follow instructions: 2 · Response time: avg 1.76s · max 5.51s · total 19.35s …
Coding: 5.3 · Wrong answer: 1 · Response time: avg 1.79s · max 1.79s · total 1.79s
Combined: 3.0 · Wrong answer: 1 · Response time: avg 3.33s · max 3.33s · total 3.33s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 943ms · max 943ms · total 943ms
Domain specific: 5.9 · Wrong answer: 2 · Response time: avg 1.06s · max 1.06s · total 1.06s
Instructions following: 3.0 · Wrong answer: 2 · Response time: avg 923ms · max 923ms · total 923ms
Puzzle Solving: 3.2 · Wrong answer: 3 · Response time: avg 1.28s · max 1.36s · total 2.56s
Tool Calling: 2.8 · Wrong answer: 1 · Response time: avg 5.51s · max 5.51s · total 5.51s
Total Tests: 18 · Wrong Tests: 15 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 16.7% · Flaky tests: 0 …
Output Tokens: 2,434 · Reasoning Tokens: 0 · Response time: avg 8.79s · total 158.19s · max 25.72s
Anti-AI Tricks: 3.4 · Wrong answer: 4 · Response time: avg 6.55s · max 9.41s · total 26.19s
Coding: 5.5 · Wrong answer: 1 · Response time: avg 10.57s · max 10.57s · total 10.57s
Combined: 3.0 · Wrong answer: 1 · Response time: avg 23.53s · max 23.53s · total 23.53s
Data parsing and extraction: 10.0 · No failed answers · Response time: avg 1.37s · max 1.37s · total 2.73s
Domain specific: 3.0 · Wrong answer: 3 · Response time: avg 1.04s · max 1.08s · total 3.11s
Instructions following: 6.4 · Wrong answer: 1 · Response time: avg 5.36s · max 9.81s · total 10.73s
Tool Calling: 3.0 · Invalid tool call: 1 · Response time: avg 25.72s · max 25.72s · total 25.72s
Total Tests: 18 · Wrong Tests: 16 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 31.5% · Flaky tests: 7 …
Output Tokens: 2,762 · Reasoning Tokens: 0 · Response time: avg 1.40s · total 25.14s · max 3.84s
All tests: Wrong answer: 13 · Did not follow instructions: 3 · Response time: avg 1.40s · max 3.84s · total 25.14s …
Anti-AI Tricks: 3.5 · Wrong answer: 4 · Response time: avg 1.18s · max 1.81s · total 4.70s
Coding: 7.1 · Wrong answer: 1 · Response time: avg 1.43s · max 1.43s · total 1.43s
Combined: 3.0 · Wrong answer: 1 · Response time: avg 3.84s · max 3.84s · total 3.84s
Data parsing and extraction: 6.5 · Wrong answer: 1 · Response time: avg 1.11s · max 1.25s · total 2.23s
Domain specific: 2.9 · Wrong answer: 3 · Response time: avg 926ms · max 959ms · total 2.78s
Tool Calling: 10.0 · No failed answers · Response time: avg 3.40s · max 3.40s · total 3.40s
Total Tests: 18 · Wrong Tests: 15 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 33.3% · Flaky tests: 6 …
Output Tokens: 24,291 · Reasoning Tokens: 172,597 · Response time: avg 73.64s · total 1104.60s · max 226.38s
Anti-AI Tricks: 5.1 · Timed out: 2 · Wrong answer: 1 · Response time: avg 34.44s · max 57.86s · total 103.31s
Coding: 2.6 · Did not follow instructions: 1 · Response time: avg 135.61s · max 135.61s · total 135.61s
Combined: 3.0 · Timed out: 1 · Response time: avg 0ms · max 0ms · total 0ms
Domain specific: 3.6 · Timed out: 3 · Response time: avg 137.75s · max 202.61s · total 413.24s
General Intelligence: 2.8 · Timed out: 1 · Response time: avg 226.38s · max 226.38s · total 226.38s
Instructions following: 6.4 · Timed out: 1 · Response time: avg 17.15s · max 28.54s · total 34.29s
Tool Calling: 10.0 · No failed answers · Response time: avg 4.31s · max 4.31s · total 4.31s
Total Tests: 16 · Wrong Tests: 15 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 14.6% · Flaky tests: 2 …
Output Tokens: 1,185 · Reasoning Tokens: 0 · Response time: avg 811ms · total 11.35s · max 2.88s
Anti-AI Tricks: 3.3 · Wrong answer: 3 · Response time: avg 471ms · max 872ms · total 1.41s
Combined: 3.0 · API error: 1 · Response time: avg 0ms · max 0ms · total 0ms
Data parsing and extraction: 3.0 · Wrong answer: 2 · Response time: avg 714ms · max 987ms · total 1.43s
Domain specific: 5.9 · API error: 1 · Wrong answer: 1 · Response time: avg 287ms · max 334ms · total 860ms
Instructions following: 4.8 · Wrong answer: 2 · Response time: avg 1.09s · max 1.90s · total 2.18s
Tool Calling: 3.0 · API error: 1 · Response time: avg 0ms · max 0ms · total 0ms
Total Tests: 1 · Wrong Tests: 1 · Reliability: N/A (reliability telemetry is unavailable or incomplete for this model) · Attempt pass rate: 0.0% · Flaky tests: 0 …
Output Tokens: 0 · Reasoning Tokens: 0 · Response time: avg 0ms · total 0ms · max 0ms
Coding: 3.0 · API error: 1 · Response time: avg 0ms · max 0ms · total 0ms
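The summary fields above (Wrong Tests, Attempt pass rate, Flaky tests) follow from the rule that a test is fully passed only if every run passed. The page does not say how they are computed, so the sketch below is a hypothetical reconstruction rather than the dashboard's actual code. It assumes that a flaky test is one with mixed outcomes across runs, and that the attempt pass rate is passed runs divided by all runs. The name `summarize` and the test ids are made up for illustration. The figures are consistent with three runs per test: with 18 tests, 13 wrong and 0 flaky, the 5 fully passed tests give 15/54 = 27.8%.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    total_tests: int
    wrong_tests: int          # tests that are not fully passed
    flaky_tests: int          # assumption: some runs passed, some failed
    attempt_pass_rate: float  # percent of individual runs that passed


def summarize(runs: dict[str, list[bool]]) -> Summary:
    """Summarize per-run outcomes; `runs` maps a hypothetical test id to its runs."""
    # A test is fully passed only if every run passed for that test.
    fully_passed = sum(1 for r in runs.values() if r and all(r))
    # Assumed definition of flaky: mixed outcomes across runs.
    flaky = sum(1 for r in runs.values() if any(r) and not all(r))
    attempts = sum(len(r) for r in runs.values())
    passed_attempts = sum(sum(r) for r in runs.values())
    rate = round(100 * passed_attempts / attempts, 1) if attempts else 0.0
    return Summary(len(runs), len(runs) - fully_passed, flaky, rate)


# Hypothetical data shaped like the first card: 18 tests, three runs each.
example = {f"pass_{i}": [True, True, True] for i in range(5)}
example["flaky_a"] = [True, False, False]
example["flaky_b"] = [True, True, False]
example.update({f"fail_{i}": [False, False, False] for i in range(11)})

print(summarize(example))
```

Under these assumptions, a flaky test counts as wrong but still contributes its passing runs to the attempt pass rate, which is why a card can show a higher pass rate than its count of fully passed tests alone would suggest.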