AI BENCHY
Benchmark Methodology
This page explains our benchmarking approach at a high level. We keep exact prompts and grading internals private to protect test integrity.
How It Works (High Level)
- Private tests: We do not publish exact test content, prompts, or full grading details.
- Repeated runs: Each model is run multiple times so results reflect stability, not one lucky attempt.
- Reasoning modes: When supported, models are evaluated across multiple reasoning configurations.
- OpenRouter execution: Benchmark requests are routed through OpenRouter (a minimal request sketch follows this list).
- Real-world reliability: Timeouts, downtime, and API errors are counted as failed attempts.
- Fast coverage with an evolving suite: Because our suite is compact, we can benchmark new models quickly and continuously add or remove tests.
- Generic intelligence signal: The score is not tied to one category; it is a broad answer to a practical question: if you ask the AI something, how likely is it to respond correctly? (A scoring sketch follows this list.)
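The bullets above can be made concrete with a short sketch. The following is a minimal Python illustration, not our production harness: the endpoint and message shape follow OpenRouter's OpenAI-compatible chat API, while the `reasoning` field, the timeout value, and the `OPENROUTER_API_KEY` variable name are illustrative assumptions.

```python
import os
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def run_attempt(model: str, prompt: str, reasoning_effort: str | None = None,
                timeout_s: float = 120.0) -> str | None:
    """Run one benchmark attempt through OpenRouter.

    Returns the reply text, or None when the attempt fails. Timeouts,
    provider downtime, and API errors all return None so they can be
    counted as failed attempts rather than retried away.
    """
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    if reasoning_effort is not None:
        # Illustrative reasoning toggle -- an assumption, not our
        # actual configuration parameter.
        body["reasoning"] = {"effort": reasoning_effort}
    try:
        resp = requests.post(
            OPENROUTER_URL,
            headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
            json=body,
            timeout=timeout_s,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]
    except (requests.RequestException, KeyError, IndexError, ValueError):
        # Any failure mode is treated the same way: a failed attempt.
        return None
```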
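Building on that sketch, repeated runs could roll up into a single score like this; `grade` is a hypothetical stand-in for our private grading step, and `runs_per_test` is an illustrative count, not the real one.

```python
def grade(test: str, reply: str) -> bool:
    """Stand-in for the private grading step; internals are not published."""
    raise NotImplementedError

def score_model(model: str, tests: list[str], runs_per_test: int = 5) -> float:
    """Fraction of attempts graded correct, aggregated over repeated runs."""
    passed = 0
    total = 0
    for test in tests:
        for _ in range(runs_per_test):
            reply = run_attempt(model, test)
            # A None reply (timeout, downtime, API error) is graded
            # as incorrect, so reliability is baked into the score.
            passed += int(reply is not None and grade(test, reply))
            total += 1
    return passed / total
```

Because failed attempts stay in the denominator, downtime and timeouts lower a model's score exactly as a wrong answer does.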
We publish this broad methodology for transparency while keeping sensitive benchmark details private.