Changelog

A simple log of product and benchmark updates, grouped by date. We use it to note newly tested models, re-tests, benchmark changes, and shipped UX/product work.

2026-06-17

New Models Tested: GLM 5.2, Kimi K2.7 Code, Claude Fable 5, Nemotron 3 Ultra, Qwen3.7 Plus, MiniMax M3, Step 3.7 Flash, Claude Opus 4.8 Added benchmark coverage for newly released models missing from the changelog: Z.ai GLM 5.2, MoonshotAI Kimi K2.7 Code, Anthropic Claude Fable 5, NVIDIA Nemotron 3 Ultra 550B A55B, Qwen 3.7 Plus, MiniMax M3, StepFun Step 3.7 Flash, and Anthropic Claude Opus 4.8.
New Feature: Updated scoring to use per-category bias adjustments, so category-level differences are normalized before they roll into leaderboard results.
Bug Fix: Adjusted missing-test handling so models are not scored as if unavailable tests were valid wrong answers.
UX: Leaderboard search now supports comma-separated model queries, so searches like "deepseek, glm" show matches for either model family.

2026-05-22

New Models Tested: Qwen3.7 Max Added benchmark coverage for Qwen 3.7 Max.
New Tests Added: Added a new Coding test category focused on bug-finding in C++ solutions.

2026-05-21

New Models Tested: Gemini 3.5 Flash, Grok Build 0.1 Added benchmark coverage for Google Gemini 3.5 Flash and xAI Grok Build 0.1.
Bug Fix: Removed the unsupported xAI Grok Build 0.1 no-reasoning variant after provider validation required reasoning.

2026-05-08

New Models Tested: Gemini 3.1 Flash Lite Added benchmark coverage for Google Gemini 3.1 Flash Lite.
Bug Fix: Reasoning chips and compare labels now recognize the minimal reasoning variant instead of falling back to auto.
UX: Model pages now order sibling reasoning-variant chips from highest effort to lowest.

2026-05-06

New Models Tested: Cobuddy Added benchmark coverage for Baidu CoBuddy.

2026-05-01

New Models Tested: Grok 4.3, Granite 4.1 8B Added benchmark coverage for xAI Grok 4.3 and IBM Granite 4.1 8B.

2026-04-30

New Models Tested: Owl Alpha Added benchmark coverage for Owl Alpha.

2026-04-26

UX: Improved mobile compare dropdown placement, tightened model page layout, and split run history into per-model shards so pages load less historical data.
Bug Fix: Run history now groups near-duplicate same-suite retests and shows all public runs in a direct comparison table on model pages.

2026-04-25

New Feature: Added Reliability score telemetry so target API and rate-limit failures are tracked separately from wrong answers.

2026-04-24

New Models Tested: DeepSeek V4 Flash, DeepSeek V4 Pro Added benchmark coverage for DeepSeek V4 Flash and DeepSeek V4 Pro.
New Models Tested: GPT-5.5 Added benchmark coverage for OpenAI GPT-5.5.
Bug Fix: Changelog model links now resolve to canonical live model pages, and model pages now link across reasoning variants.

2026-04-23

New Models Tested: inclusionai/ling-2.6-1t:free Added benchmark coverage for InclusionAI Ling 2.6 1T Free.
New Feature: Run history - Model pages now show historical public runs and a side-by-side run comparison table. (Example model page)
UX: The leaderboard now supports URL-backed pagination, filters, and direct compare actions from the ranking list.
Bug Fix: Homepage search, filter counts, and pagination state now stay consistent across the full dataset.
Re-test: GLM 5.1 Reran the full benchmark suite and cleaned up the public run-history snapshot for this model.
Bug Fix: Stopped unrelated models from receiving a fresh tested_at timestamp when they were not actually retested.

Changelog page created

We started this changelog after launch, so some older updates are missing.

2026-02-15

Initial release