AI BENCHY

Benchmark Methodology

This page explains our benchmarking approach at a high level. We keep exact prompts and grading internals private to protect test integrity.

The Tests

The questions are chosen mostly at random, across different tasks and domains. Statistically, a stronger model should, on average, outperform a weaker one on a random, non-cherry-picked task. I have a background in competitive programming, so designing tests and thinking about edge cases comes naturally.

The score is not a standardized "IQ" value. It has no unit; it is simply an arbitrary number reflecting how well a model performs on the whole test suite (correct answers plus consistency). I do not cherry-pick models or modify tests to accommodate any particular model. When I think of a new test, I add it, retest all models, and recalculate the scores.

The questions are usually based on simple ideas, like "I wonder whether the models do well when asked to do X, Y, or Z". For example: "Respond with the two equal natural numbers, a and b, that when added together have the total = 2. Respond in this exact format: a,b". Some AIs get the answer wrong, for example "2,2". Others fail the requirement that the numbers be equal, for example "0,2". Others ignore the output format, for example "The answer is a = 1 and b = 1". Others simply answer correctly with "1,1".

Some tests are more complex than this, but you get the gist. These questions do not favor any specific model, and they are generally very easy for humans. It is not my fault if Claude outputs something like "**1**, **1**", adding markdown highlighting, when most other models respect the required format correctly.
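
A grader for the example task above could be sketched roughly as follows. This is a hypothetical illustration, not the actual (private) grading code; the function name and regex are my own assumptions.

```python
import re

def grade_exact_format(response: str) -> bool:
    """Hypothetical grader for the example task: two equal natural
    numbers a and b with a + b = 2, in the exact format "a,b"."""
    # Reject anything that is not exactly "digits,digits",
    # e.g. "The answer is a = 1 and b = 1" or "**1**, **1**".
    m = re.fullmatch(r"(\d+),(\d+)", response.strip())
    if not m:
        return False
    a, b = int(m.group(1)), int(m.group(2))
    # Rejects "2,2" (wrong total) and "0,2" (not equal); accepts "1,1".
    return a == b and a + b == 2
```

Note how each failure mode described above (wrong answer, unequal numbers, ignored format) maps to a distinct rejection path.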

Cristian

How It Works (High Level)

  • Private tests: We do not publish exact test content, prompts, or full grading details.
  • Repeated runs: Each model is run multiple times so results reflect stability, not one lucky attempt.
  • Reasoning modes: When supported, models are evaluated across multiple reasoning configurations.
  • OpenRouter execution: Benchmark requests are routed through OpenRouter.
  • Real-world reliability: Timeouts, downtime, and API errors are counted as failed attempts.
  • Fast coverage with an evolving suite: Because our suite is smaller, we can test new models quickly and continuously add or remove tests.
  • Generic intelligence signal: The score is not tied to one category. It is a broad indicator of a practical question: if you ask the AI something, how likely is it to respond correctly?
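
The repeated-run and error-handling points above can be sketched as a scoring loop. This is a simplified assumption of how such a pipeline might look, not the actual implementation; `run_model`, `runs_per_prompt`, and the plain mean are all placeholders.

```python
import statistics

def score_model(run_model, prompts, runs_per_prompt=3):
    """Hypothetical scoring sketch: each prompt is run several times,
    and timeouts, downtime, or API errors count as failed attempts.
    `run_model` is assumed to return True on a correct answer and to
    raise an exception on any transport or provider failure."""
    per_prompt = []
    for prompt in prompts:
        passes = 0
        for _ in range(runs_per_prompt):
            try:
                if run_model(prompt):
                    passes += 1
            except Exception:
                pass  # timeout / downtime / API error -> failed attempt
        per_prompt.append(passes / runs_per_prompt)
    # The final score blends accuracy with stability across repeated runs.
    return statistics.mean(per_prompt) if per_prompt else 0.0
```

A model that answers correctly only sometimes, or whose API intermittently fails, scores lower than one that answers the same questions correctly every time.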

We publish broad methodology for transparency while keeping sensitive benchmark details private.