#52
X AI · Release: 2026-03-12 · x-ai/grok-4.20-beta::none
Flaky tests
2
Flaky tests had mixed outcomes across runs (at least one pass and one fail).
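The definition above can be sketched in a few lines of Python. This is only an illustration, not the benchmark's actual implementation: the function name `find_flaky_tests`, the test names, and the per-run outcome data are all hypothetical.

```python
from typing import Dict, List


def find_flaky_tests(outcomes: Dict[str, List[bool]]) -> List[str]:
    """Return the names of tests with at least one pass and one fail.

    `outcomes` maps a test name to its per-run results
    (True = pass, False = fail). This data shape is an assumption.
    """
    return sorted(
        name
        for name, runs in outcomes.items()
        # any(runs): at least one pass; not all(runs): at least one fail.
        if any(runs) and not all(runs)
    )


# Hypothetical run results, invented for illustration only.
example_runs = {
    "test_a": [True, True, True],     # always passes: not flaky
    "test_b": [True, False, True],    # mixed outcomes: flaky
    "test_c": [False, False, False],  # always fails: not flaky
    "test_d": [False, True],          # mixed outcomes: flaky
}

flaky = find_flaky_tests(example_runs)
print(flaky)       # ['test_b', 'test_d']
print(len(flaky))  # 2
```

A test that fails on every run is counted as failing, not flaky. Only mixed outcomes qualify.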
Charts: Avg Score vs Total Cost; Response Time (avg); Avg Score vs Response Time (avg); Total Output Tokens; Avg Score vs Total Output Tokens
Quick Compare
Grok 4.20 Beta (none) vs Hunter Alpha (none)
Grok 4.20 Beta (none) vs MiniMax M2.5 (medium)
Grok 4.20 Beta (none) vs Trinity Large Preview (none), Free Available
Grok 4.20 Beta (none) vs Qwen3.5-35B-A3B (none)
Grok 4.20 Beta (none) vs Kimi K2.5 (none)
Grok 4.20 Beta (none) vs Gemini 3 Flash Preview (medium)
Grok 4.20 Beta (none) vs Gemini 3.1 Pro Preview (medium)
Grok 4.20 Beta (none) vs Step 3.5 Flash (medium), Free Available
Category Breakdown
| Category | Avg Score | Consistency | Tests Correct |
|---|---|---|---|
| Anti-AI Tricks | 3.3 | 7.9 | |
| Combined | 10.0 | 10.0 | |
| Data parsing and extraction | 9.9 | 10.0 | |
| Domain specific | 10.0 | 10.0 | |
| General Intelligence | 5.0 | 10.0 | |
| Instructions following | 4.5 | 10.0 | |
| Puzzle Solving | 4.0 | 7.2 | |
| Tool Calling | 10.0 | 10.0 | |