AI BENCHY
Benchmark Methodology
This page explains our benchmarking approach at a high level. We keep exact prompts and grading internals private to protect test integrity.
How It Works (High Level)
- Private tests: We do not publish exact test content, prompts, or full grading details.
- Repeated runs: Each model is run multiple times so results reflect stability, not one lucky attempt.
- Reasoning modes: When supported, models are evaluated across multiple reasoning configurations.
- OpenRouter execution: Benchmark requests are routed through OpenRouter (a minimal request sketch follows this list).
- Real-world reliability: Timeouts, downtime, and API errors are counted as failed attempts.
- Fast coverage with an evolving suite: Because our suite is compact, we can benchmark new models quickly and continuously add or remove tests.
- Generic intelligence signal: The score is not tied to one category; it is a broad answer to a practical question: if you ask the AI something, how likely is it to respond correctly? (A scoring sketch follows this list.)
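The bullets above can be made concrete with a short sketch. The following is a minimal Python illustration, not our production harness: the endpoint and message shape follow OpenRouter's OpenAI-compatible chat API, while the `reasoning` field, the timeout value, and the `OPENROUTER_API_KEY` variable name are illustrative assumptions.

```python
import os
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def run_attempt(model: str, prompt: str, reasoning_effort: str | None = None,
                timeout_s: float = 120.0) -> str | None:
    """Run one benchmark attempt through OpenRouter.

    Returns the reply text, or None when the attempt fails. Timeouts,
    provider downtime, and API errors all return None so they can be
    counted as failed attempts rather than retried away.
    """
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    if reasoning_effort is not None:
        # Illustrative reasoning toggle -- an assumption, not our
        # actual configuration parameter.
        body["reasoning"] = {"effort": reasoning_effort}
    try:
        resp = requests.post(
            OPENROUTER_URL,
            headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
            json=body,
            timeout=timeout_s,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]
    except (requests.RequestException, KeyError, IndexError, ValueError):
        # Any failure mode is treated the same way: a failed attempt.
        return None
```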
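Building on that sketch, repeated runs could roll up into a single score like this; `grade` is a hypothetical stand-in for our private grading step, and `runs_per_test` is an illustrative count, not the real one.

```python
def grade(test: str, reply: str) -> bool:
    """Stand-in for the private grading step; internals are not published."""
    raise NotImplementedError

def score_model(model: str, tests: list[str], runs_per_test: int = 5) -> float:
    """Fraction of attempts graded correct, aggregated over repeated runs."""
    passed = 0
    total = 0
    for test in tests:
        for _ in range(runs_per_test):
            reply = run_attempt(model, test)
            # A None reply (timeout, downtime, API error) is graded
            # as incorrect, so reliability is baked into the score.
            passed += int(reply is not None and grade(test, reply))
            total += 1
    return passed / total
```

Because failed attempts stay in the denominator, downtime and timeouts lower a model's score exactly as a wrong answer does.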
We publish this broad methodology for transparency while keeping sensitive benchmark details private.