AI BENCHY

Changelog

A simple log of product and benchmark updates, grouped by date. We use it to note newly tested models, re-tests, benchmark changes, and shipped UX/product work.

2026-05-08

  • New Models Tested: Gemini 3.1 Flash Lite. Added benchmark coverage for Google Gemini 3.1 Flash Lite.
  • Bug Fix: Reasoning chips and compare labels now recognize the minimal reasoning variant instead of falling back to auto.
  • UX: Model pages now order sibling reasoning-variant chips from highest effort to lowest.
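
The two reasoning-variant changes above can be illustrated with a minimal sketch. The variant names, their ranking, and the helper functions `normalize_variant` and `order_chips` are assumptions made for illustration; they are not taken from the site's code.

```python
# Hypothetical effort ranking: higher number means more reasoning effort.
# The real variant names and their order may differ.
EFFORT_RANK = {"high": 4, "medium": 3, "low": 2, "minimal": 1, "auto": 0}


def normalize_variant(label: str) -> str:
    """Map a raw label to a known variant; fall back to 'auto' only if unknown."""
    key = label.strip().lower()
    return key if key in EFFORT_RANK else "auto"


def order_chips(labels: list[str]) -> list[str]:
    """Return variant names ordered from highest effort to lowest."""
    variants = [normalize_variant(label) for label in labels]
    return sorted(variants, key=lambda v: EFFORT_RANK[v], reverse=True)


print(order_chips(["minimal", "high", "low"]))
```

Listing `minimal` in the ranking table is what keeps it from being treated as an unknown label and collapsed into `auto`.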

2026-05-06

  • New Models Tested: CoBuddy. Added benchmark coverage for Baidu CoBuddy.

2026-05-01

  • New Models Tested: Grok 4.3, Granite 4.1 8B. Added benchmark coverage for xAI Grok 4.3 and IBM Granite 4.1 8B.

2026-04-30

  • New Models Tested: Owl Alpha. Added benchmark coverage for Owl Alpha.

2026-04-26

  • UX: Improved mobile compare dropdown placement, tightened model page layout, and split run history into per-model shards so pages load less historical data.
  • Bug Fix: Run history now groups near-duplicate same-suite retests and shows all public runs in a direct comparison table on model pages.
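
One way to group near-duplicate same-suite retests is to bucket runs of the same suite that start within a short time window of each other. The sketch below assumes this approach; the `Run` fields, the one-hour window, and `group_retests` are hypothetical and may not match how the site defines a near-duplicate.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Run:
    run_id: str
    suite: str
    started_at: datetime


def group_retests(runs, window=timedelta(hours=1)):
    """Group runs of the same suite whose start times fall within `window`
    of the previous run in that group (assumed definition of near-duplicate)."""
    groups = []
    latest_group = {}  # suite name -> most recent group for that suite
    for run in sorted(runs, key=lambda r: r.started_at):
        group = latest_group.get(run.suite)
        if group is not None and run.started_at - group[-1].started_at <= window:
            group.append(run)
        else:
            group = [run]
            groups.append(group)
            latest_group[run.suite] = group
    return groups


day = datetime(2026, 4, 26)
runs = [
    Run("a", "core", day.replace(hour=10)),
    Run("b", "core", day.replace(hour=10, minute=30)),
    Run("c", "core", day.replace(hour=13)),
    Run("d", "extra", day.replace(hour=10, minute=10)),
]
grouped_ids = [[r.run_id for r in g] for g in group_retests(runs)]
print(grouped_ids)
```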

2026-04-25

  • New Feature: Added Reliability score telemetry so target API and rate-limit failures are tracked separately from wrong answers.
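
As a rough illustration of tracking infrastructure failures separately from wrong answers, the sketch below computes two numbers from a list of per-test outcomes. The outcome labels, the formulas, and the `score` function are assumptions; the changelog does not state how the Reliability score is calculated.

```python
from collections import Counter

# Hypothetical outcome labels for failures that are not the model's answer.
INFRA_FAILURES = {"api_error", "rate_limited"}


def score(outcomes):
    """Return reliability (share of attempts that completed) and
    accuracy (share of completed attempts answered correctly)."""
    counts = Counter(outcomes)
    attempted = len(outcomes)
    failed = sum(counts[label] for label in INFRA_FAILURES)
    completed = attempted - failed
    reliability = completed / attempted if attempted else 0.0
    accuracy = counts["correct"] / completed if completed else 0.0
    return {"reliability": reliability, "accuracy": accuracy}


print(score(["correct", "wrong", "api_error", "rate_limited"]))
```

Under these assumptions, an API outage lowers reliability without counting against accuracy.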

2026-04-24

  • New Models Tested: DeepSeek V4 Flash, DeepSeek V4 Pro. Added benchmark coverage for DeepSeek V4 Flash and DeepSeek V4 Pro.
  • New Models Tested: GPT-5.5. Added benchmark coverage for OpenAI GPT-5.5.
  • Bug Fix: Changelog model links now resolve to canonical live model pages, and model pages now link across reasoning variants.

2026-04-23

  • New Models Tested: inclusionai/ling-2.6-1t:free. Added benchmark coverage for InclusionAI Ling 2.6 1T Free.
  • New Feature: Run history. Model pages now show historical public runs and a side-by-side run comparison table. (Example model page)
  • UX: The leaderboard now supports URL-backed pagination, filters, and direct compare actions from the ranking list.
  • Bug Fix: Homepage search, filter counts, and pagination state now stay consistent across the full dataset.
  • Re-test: GLM 5.1. Reran the full benchmark suite and cleaned up the public run-history snapshot for this model.
  • Bug Fix: Stopped unrelated models from receiving a fresh tested_at timestamp when they were not actually retested.
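
URL-backed pagination and filters, as mentioned in the leaderboard item above, generally mean the page number and active filters live in the query string so a link reproduces the same view. The sketch below shows that round trip; the parameter names and the functions `state_to_query` and `query_to_state` are hypothetical, not the site's actual URL scheme.

```python
from urllib.parse import parse_qs, urlencode


def state_to_query(page, filters):
    """Serialize the page number and filters into a stable query string."""
    params = {"page": page, **filters}
    return urlencode(sorted(params.items()))


def query_to_state(query):
    """Parse a query string back into (page, filters); page defaults to 1."""
    parsed = {key: values[0] for key, values in parse_qs(query).items()}
    page = int(parsed.pop("page", 1))
    return page, parsed


query = state_to_query(2, {"provider": "openai"})
print(query)
print(query_to_state(query))
```

Sorting the parameters keeps the generated URL identical for the same state, which makes links easier to share and compare.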

Changelog page created

We started this changelog after launch, so some older updates are missing.

2026-02-15

  • Initial release