❤️ Made by XCS
AI BENCHY Compare

Last updated at: 2026-03-05

| Metric | OpenAI: GPT-5.4 medium | OpenAI: GPT-5.3-Codex medium | OpenAI: GPT-5.2 medium |
|---|---|---|---|
| Release | 2026-03-05 | 2026-02-05 | 2025-12-11 |
| Rank | #6 | #4 | #24 |
| Avg Score | 80 | 85 | 68 |
| Consistency | 85 | 90 | 74 |
| Cost per result | 7.127 | 4.820 | 3.396 |
| Total Cost | $0.784 | $0.531 | $0.306 |
| Tests Correct | | | |
| Attempt pass rate | 82.2% | 82.2% | 75.6% |
| Flaky tests | 3 | 2 | 5 |
| Output Tokens | 1,611 | 1,577 | 2,058 |
| Reasoning Tokens | 46,321 | 33,017 | 16,542 |

[Chart: Top Models by Score]

[Chart: Score vs Total Cost]

Category Breakdown

| Anti-AI Tricks | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|---|---|
| OpenAI: GPT-5.4 | 100 | 100 | 100.0% | 0 | | 216 | 1,466 |
| OpenAI: GPT-5.3-Codex | 100 | 100 | 100.0% | 0 | | 216 | 1,421 |
| OpenAI: GPT-5.2 | 70 | 73 | 77.8% | 1 | | 549 | 2,002 |

| Combined | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|---|---|
| OpenAI: GPT-5.4 | 100 | 100 | 100.0% | 0 | | 301 | 3,543 |
| OpenAI: GPT-5.3-Codex | 100 | 100 | 100.0% | 0 | | 364 | 2,731 |
| OpenAI: GPT-5.2 | 100 | 100 | 100.0% | 0 | | 291 | 1,757 |

| Data parsing and extraction | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|---|---|
| OpenAI: GPT-5.4 | 99 | 100 | 100.0% | 0 | | 234 | 804 |
| OpenAI: GPT-5.3-Codex | 99 | 100 | 100.0% | 0 | | 234 | 728 |
| OpenAI: GPT-5.2 | 99 | 100 | 100.0% | 0 | | 234 | 420 |

| Domain specific | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|---|---|
| OpenAI: GPT-5.4 | 40 | 72 | 44.4% | 1 | | 61 | 34,748 |
| OpenAI: GPT-5.3-Codex | 40 | 72 | 55.6% | 1 | | 64 | 25,308 |
| OpenAI: GPT-5.2 | 40 | 72 | 55.6% | 1 | | 42 | 10,342 |

| Instructions following | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|---|---|
| OpenAI: GPT-5.4 | 85 | 68 | 66.7% | 1 | | 93 | 897 |
| OpenAI: GPT-5.3-Codex | 90 | 100 | 50.0% | 0 | | 93 | 693 |
| OpenAI: GPT-5.2 | 85 | 68 | 66.7% | 1 | | 94 | 614 |

| Puzzle Solving | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|---|---|
| OpenAI: GPT-5.4 | 70 | 72 | 88.9% | 1 | | 442 | 3,832 |
| OpenAI: GPT-5.3-Codex | 93 | 79 | 88.9% | 1 | | 352 | 1,644 |
| OpenAI: GPT-5.2 | 70 | 73 | 77.8% | 1 | | 609 | 938 |

| Tool Calling | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Output Tokens | Reasoning Tokens |
|---|---|---|---|---|---|---|---|
| OpenAI: GPT-5.4 | 100 | 100 | 100.0% | 0 | | 264 | 1,031 |
| OpenAI: GPT-5.3-Codex | 100 | 100 | 100.0% | 0 | | 254 | 492 |
| OpenAI: GPT-5.2 | 100 | 16 | 66.7% | 1 | | 239 | 469 |
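
The per-category Flaky tests, Output Tokens, and Reasoning Tokens figures appear to add up to the overall totals reported for each model. The page does not state this relationship; it is an observation from the numbers. A minimal Python sketch of the check, with the values copied from the category rows and the variable and function names chosen purely for illustration:

```python
# Per-category (flaky tests, output tokens, reasoning tokens) for each model,
# copied from the category rows. Category order: Anti-AI Tricks, Combined,
# Data parsing and extraction, Domain specific, Instructions following,
# Puzzle Solving, Tool Calling.
categories = {
    "GPT-5.4": [(0, 216, 1466), (0, 301, 3543), (0, 234, 804), (1, 61, 34748),
                (1, 93, 897), (1, 442, 3832), (0, 264, 1031)],
    "GPT-5.3-Codex": [(0, 216, 1421), (0, 364, 2731), (0, 234, 728), (1, 64, 25308),
                      (0, 93, 693), (1, 352, 1644), (0, 254, 492)],
    "GPT-5.2": [(1, 549, 2002), (0, 291, 1757), (0, 234, 420), (1, 42, 10342),
                (1, 94, 614), (1, 609, 938), (1, 239, 469)],
}

# Overall (flaky tests, output tokens, reasoning tokens) from the summary rows.
totals = {
    "GPT-5.4": (3, 1611, 46321),
    "GPT-5.3-Codex": (2, 1577, 33017),
    "GPT-5.2": (5, 2058, 16542),
}


def column_sums(rows):
    """Sum each column across all category rows."""
    return tuple(sum(col) for col in zip(*rows))


for model, rows in categories.items():
    # Assumption being checked: overall totals are plain sums over categories.
    assert column_sums(rows) == totals[model], model
    print(model, column_sums(rows))
```

The same check does not hold for Avg Score, which is not the plain mean of the seven category scores, so that metric is presumably aggregated some other way.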