Changelog

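A concise, date-grouped record of product and benchmark updates. We use it to track newly tested models, retests, benchmark changes, and shipped UX/product work.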

2026-06-17

New models tested: GLM 5.2, Kimi K2.7 Code, Claude Fable 5, Nemotron 3 Ultra, Qwen3.7 Plus, MiniMax M3, Step 3.7 Flash, Claude Opus 4.8. Added benchmark coverage for newly released models missing from the changelog: Z.ai GLM 5.2, MoonshotAI Kimi K2.7 Code, Anthropic Claude Fable 5, NVIDIA Nemotron 3 Ultra 550B A55B, Qwen 3.7 Plus, MiniMax M3, StepFun Step 3.7 Flash, and Anthropic Claude Opus 4.8.
New feature: Scoring now applies per-category bias adjustments, so category-level differences are normalized before they roll up into leaderboard results.
Bug fix: Adjusted missing-test handling so that tests a model could not be run on are no longer scored as wrong answers.
UX: Leaderboard search now supports comma-separated model queries, so searches like "deepseek, glm" show matches for either model family.
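
As a rough illustration of the comma-separated search behavior described above, here is a minimal sketch. It is not the leaderboard's actual implementation: the function name `matches_query`, the case-insensitive substring matching, and the choice to show everything for an empty query are all assumptions made for the example.

```python
def matches_query(model_name: str, query: str) -> bool:
    """Return True if model_name matches ANY comma-separated term in query.

    Hypothetical helper: matching is a case-insensitive substring test.
    """
    # Split on commas, trim whitespace, and drop empty terms (e.g. a trailing comma).
    terms = [t.strip().lower() for t in query.split(",")]
    terms = [t for t in terms if t]
    if not terms:
        return True  # assumption: an empty query hides nothing
    name = model_name.lower()
    return any(term in name for term in terms)


models = ["DeepSeek V4 Flash", "GLM 5.2", "GPT-5.5", "DeepSeek V4 Pro"]
visible = [m for m in models if matches_query(m, "deepseek, glm")]
print(visible)  # ['DeepSeek V4 Flash', 'GLM 5.2', 'DeepSeek V4 Pro']
```

Treating the terms as alternatives (OR) rather than requirements (AND) is what lets a query like "deepseek, glm" show both model families at once.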

New models tested: Gemini 3.5 Flash, Grok Build 0.1. Added benchmark coverage for Google Gemini 3.5 Flash and xAI Grok Build 0.1.
Bug fix: Removed the unsupported no-reasoning variant of xAI Grok Build 0.1 after provider validation required reasoning to be enabled.

New models tested: Gemini 3.1 Flash Lite. Added benchmark coverage for Google Gemini 3.1 Flash Lite.
Bug fix: Reasoning chips and compare labels now recognize the minimal reasoning variant instead of falling back to auto.
UX: Model pages now order sibling reasoning-variant chips from highest effort to lowest.
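
The highest-to-lowest chip ordering can be sketched as a sort over a rank table. This is only an illustration under stated assumptions: the effort labels and their ranks in `EFFORT_RANK`, and the rule that unrecognized labels sort last, are hypothetical and not taken from the site.

```python
# Hypothetical effort labels and ranks; the real set of variants may differ.
EFFORT_RANK = {"high": 4, "medium": 3, "low": 2, "minimal": 1, "none": 0}


def order_chips(variants: list[str]) -> list[str]:
    """Sort reasoning-variant labels from highest effort to lowest.

    Labels missing from EFFORT_RANK get rank -1, so they land at the end.
    """
    return sorted(
        variants,
        key=lambda v: EFFORT_RANK.get(v.lower(), -1),
        reverse=True,
    )


print(order_chips(["minimal", "high", "none", "medium", "low"]))
# ['high', 'medium', 'low', 'minimal', 'none']
```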

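New models tested: DeepSeek V4 Flash, DeepSeek V4 Pro. Added benchmark coverage for DeepSeek V4 Flash and DeepSeek V4 Pro.
New models tested: GPT-5.5. Added benchmark coverage for OpenAI GPT-5.5.
Bug fix: Model links in the changelog now resolve to the canonical live model pages, and model pages now link to one another across reasoning variants.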

Changelog page created

This changelog was started after launch, so some earlier updates are not listed.