AI BENCHY
❤️ Made by XCS

Benchmark Methodology

This page explains our benchmarking approach at a high level. We keep exact prompts and grading internals private to protect test integrity.

How It Works (High Level)

  • Private tests: We do not publish exact test content, prompts, or full grading details.
  • Repeated runs: Each model is run multiple times so results reflect stability, not one lucky attempt.
  • Reasoning modes: When supported, models are evaluated across multiple reasoning configurations.
  • OpenRouter execution: Benchmark requests are routed through OpenRouter.
  • Real-world reliability: Timeouts, downtime, and API errors are counted as failed attempts.
  • Fast coverage with an evolving suite: Because our suite is relatively small, we can test new models quickly and continuously add or remove tests.
  • Generic intelligence signal: The score is not tied to one category. It is a broad indicator that answers a practical question: if you ask the AI something, how likely is it to respond correctly?
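
To make the repeated-runs and reliability points concrete, here is a minimal sketch of how such a score could be aggregated. The actual grading internals are private, so everything below is an assumption made for illustration: the `Outcome`, `Attempt`, `pass_rate`, and `pass_rate_by_mode` names, the reasoning-mode labels, and the choice of a plain fraction of correct attempts. The one rule taken from the list above is that timeouts and API errors count as failed attempts.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    """Possible results of one attempt (hypothetical labels)."""
    CORRECT = "correct"
    INCORRECT = "incorrect"
    TIMEOUT = "timeout"
    API_ERROR = "api_error"


@dataclass(frozen=True)
class Attempt:
    """One run of one test under one reasoning configuration."""
    test_id: str
    reasoning_mode: str
    outcome: Outcome


def pass_rate(attempts):
    """Fraction of attempts that were correct.

    Timeouts and API errors stay in the denominator, so they count
    as failed attempts rather than being dropped.
    """
    if not attempts:
        return 0.0
    passed = sum(1 for a in attempts if a.outcome is Outcome.CORRECT)
    return passed / len(attempts)


def pass_rate_by_mode(attempts):
    """Pass rate computed separately for each reasoning configuration."""
    grouped = {}
    for a in attempts:
        grouped.setdefault(a.reasoning_mode, []).append(a)
    return {mode: pass_rate(group) for mode, group in grouped.items()}


# Illustrative data only: two tests, two modes, two runs each.
attempts = [
    Attempt("t1", "none", Outcome.CORRECT),
    Attempt("t1", "none", Outcome.TIMEOUT),
    Attempt("t1", "high", Outcome.CORRECT),
    Attempt("t1", "high", Outcome.CORRECT),
    Attempt("t2", "none", Outcome.INCORRECT),
    Attempt("t2", "none", Outcome.CORRECT),
    Attempt("t2", "high", Outcome.API_ERROR),
    Attempt("t2", "high", Outcome.CORRECT),
]

print(pass_rate(attempts))          # 0.625
print(pass_rate_by_mode(attempts))  # {'none': 0.5, 'high': 0.75}
```

Because failed requests remain in the denominator, a model that answers well but is often unavailable scores lower than one that is equally accurate and consistently reachable, which matches the "real-world reliability" point in the list.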

We publish broad methodology for transparency while keeping sensitive benchmark details private.