
AI BENCHY Compare

Summary

GPT-5.5 vs GPT-5.4 vs Gemini 3.1 Pro Preview vs Claude Opus 4.7 benchmark comparison

Gemini 3.1 Pro Preview leads on Score with 9.2. All four models tie on Reliability at 10.0. Claude Opus 4.7 has the lowest Total Cost at $0.679 and the fastest average response time at 4.73s.

Recommended model: Claude Opus 4.7. Its score stays close to the best score here (8.7 vs 9.2), while its total cost is about 2.9x lower than the average total cost of the other models in this comparison.
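The page does not state how the 2.9x figure is computed. The sketch below assumes it is the mean Total Cost of the other three models divided by Claude Opus 4.7's Total Cost, which happens to reproduce the quoted number from the table values. The function name `cost_ratio_vs_others` is hypothetical and used only for illustration.

```python
# Total Cost (USD) and Score per model, copied from the comparison table.
total_cost = {
    "GPT-5.5": 3.679,
    "GPT-5.4": 1.210,
    "Gemini 3.1 Pro Preview": 1.054,
    "Claude Opus 4.7": 0.679,
}
score = {
    "GPT-5.5": 9.0,
    "GPT-5.4": 8.5,
    "Gemini 3.1 Pro Preview": 9.2,
    "Claude Opus 4.7": 8.7,
}


def cost_ratio_vs_others(model: str, costs: dict[str, float]) -> float:
    """Mean cost of every other model divided by this model's cost.

    Assumption: this is how the "2.9x" claim is derived; the page does not say.
    """
    others = [cost for name, cost in costs.items() if name != model]
    return (sum(others) / len(others)) / costs[model]


ratio = cost_ratio_vs_others("Claude Opus 4.7", total_cost)
score_gap = max(score.values()) - score["Claude Opus 4.7"]
print(f"{ratio:.1f}x cheaper, {score_gap:.1f} points behind the best score")
```

Under that assumption the ratio rounds to 2.9 and the score gap is 0.5 points. Comparing against the single most expensive model instead would give a much larger ratio, so the averaging step matters when reading the recommendation.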

Last updated at: 2026-06-18

| Metric | GPT-5.5 (medium) | GPT-5.4 (medium) | Gemini 3.1 Pro Preview (medium) | Claude Opus 4.7 (medium) |
| --- | --- | --- | --- | --- |
| Release | 2026-04-24 | 2026-03-05 | 2026-02-19 | 2026-04-16 |
| Score | 9.0 | 8.5 | 9.2 | 8.7 |
| Rank | #9 | #17 | #7 | #13 |
| Reliability | 10.0 | 10.0 | 10.0 | 10.0 |
| Consistency | 8.9 | 8.6 | 10.0 | 9.6 |
| Tests Correct | | | | |
| Attempt pass rate | 87.3% | 76.2% | 90.5% | 82.5% |
| Flaky tests | 3 | 4 | 0 | 1 |
| Total Runs | 63 | 63 | 63 | 63 |
| Cost per result | 21.638 | 8.640 | 5.546 | 3.991 |
| Total Cost | $3.679 | $1.210 | $1.054 | $0.679 |
| Input Price (per 1M tokens) | $5.000 | $2.500 | $2.000 | $5.000 |
| Output Price (per 1M tokens) | $30.000 | $15.000 | $12.000 | $25.000 |
| Total Input Tokens | 34,212 | 34,108 | 41,617 | 65,406 |
| Output Tokens | 1,985 | 2,242 | 1,977 | 11,858 |
| Reasoning Tokens | 114,925 | 72,707 | 78,896 | 2,198 |
| Response Time (avg) | 37.98s | 22.35s | 20.14s | 4.73s |
| Response Time (max) | 332.10s | 100.41s | 88.68s | 23.18s |
| Response Time (total) | 797.60s | 469.29s | 281.92s | 94.51s |
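The Total Cost row can be approximately reconstructed from the token counts and per-1M-token prices in the same table. The sketch below assumes reasoning tokens are billed at the output price, which the page does not state; the function name `estimate_cost` and the dictionary keys are hypothetical. Estimates land within a fraction of a cent of the listed totals, so any residual difference (per-run rounding, for example) is not explained here.

```python
# Prices are USD per 1M tokens; token counts and listed totals come from the table.
models = {
    "GPT-5.5": dict(in_price=5.0, out_price=30.0,
                    inp=34_212, out=1_985, reasoning=114_925, listed=3.679),
    "GPT-5.4": dict(in_price=2.5, out_price=15.0,
                    inp=34_108, out=2_242, reasoning=72_707, listed=1.210),
    "Gemini 3.1 Pro Preview": dict(in_price=2.0, out_price=12.0,
                                   inp=41_617, out=1_977, reasoning=78_896, listed=1.054),
    "Claude Opus 4.7": dict(in_price=5.0, out_price=25.0,
                            inp=65_406, out=11_858, reasoning=2_198, listed=0.679),
}


def estimate_cost(m: dict) -> float:
    """Estimate total cost in USD from token counts and prices.

    Assumption: reasoning tokens are charged at the output-token price.
    """
    input_cost = m["inp"] * m["in_price"] / 1_000_000
    output_cost = (m["out"] + m["reasoning"]) * m["out_price"] / 1_000_000
    return input_cost + output_cost


estimates = {name: estimate_cost(m) for name, m in models.items()}
for name, m in models.items():
    print(f"{name}: estimated ${estimates[name]:.3f}, listed ${m['listed']:.3f}")
```

Seen this way, GPT-5.5's cost is dominated by its 114,925 reasoning tokens at the highest output price, while Claude Opus 4.7 stays cheapest despite using the most input and output tokens.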

Generation showcase

Hamster playing table tennis

Prompt: Create a detailed SVG illustration of a hamster playing table tennis.

| Rank | Model | Cost | Time | Tokens |
| --- | --- | --- | --- | --- |
| #9 | GPT-5.5 (medium) | $0.112 | 71.9s | 3,807 |
| #17 | GPT-5.4 (medium) | $0.214 | 199.6s | 14,349 |
| #7 | Gemini 3.1 Pro Preview (medium) | $0.115 | 87.2s | 9,629 |
| #13 | Claude Opus 4.7 (medium) | $0.059 | 26.8s | 2,475 |

Charts: Top Models by Score; Score vs Total Cost; Response Time (avg); Score vs Response Time (avg); Total Output Tokens; Score vs Total Output Tokens.

Category Breakdown

| Anti-AI Tricks | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Response Time (avg) | Input Tokens | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-5.5 | 10.0 | 10.0 | 100.0% | 0 | | 4.66s | 606 | 250 | 1,335 |
| GPT-5.4 | 8.3 | 10.0 | 75.0% | 0 | | 4.11s | 606 | 240 | 1,511 |
| Gemini 3.1 Pro Preview | 10.0 | 10.0 | 100.0% | 0 | | 7.90s | 498 | 112 | 3,218 |
| Claude Opus 4.7 | 8.3 | 10.0 | 75.0% | 0 | | 1.85s | 894 | 348 | 0 |

| Coding | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Response Time (avg) | Input Tokens | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-5.5 | 8.8 | 7.8 | 88.9% | 1 | | 59.77s | 7,305 | 362 | 24,959 |
| GPT-5.4 | 8.8 | 7.8 | 88.9% | 1 | | 44.36s | 7,305 | 433 | 24,216 |
| Gemini 3.1 Pro Preview | 7.9 | 9.9 | 66.7% | 0 | | 40.17s | 8,124 | 435 | 41,247 |
| Claude Opus 4.7 | 7.6 | 7.2 | 77.8% | 1 | | 12.96s | 10,635 | 7,629 | 1,114 |

| Combined | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Response Time (avg) | Input Tokens | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-5.5 | 10.0 | 10.0 | 100.0% | 0 | | 19.29s | 11,019 | 312 | 2,841 |
| GPT-5.4 | 10.0 | 10.0 | 100.0% | 0 | | 20.57s | 11,019 | 301 | 3,543 |
| Gemini 3.1 Pro Preview | 9.5 | 10.0 | 100.0% | 0 | | 40.61s | 17,240 | 432 | 9,281 |
| Claude Opus 4.7 | 10.0 | 10.0 | 100.0% | 0 | | 21.45s | 24,501 | 2,369 | 1,084 |

| Data parsing and extraction | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Response Time (avg) | Input Tokens | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-5.5 | 10.0 | 10.0 | 100.0% | 0 | | 4.18s | 7,140 | 234 | 593 |
| GPT-5.4 | 10.0 | 10.0 | 100.0% | 0 | | 5.32s | 7,140 | 234 | 804 |
| Gemini 3.1 Pro Preview | 10.0 | 10.0 | 100.0% | 0 | | 7.72s | 7,265 | 279 | 3,904 |
| Claude Opus 4.7 | 10.0 | 10.0 | 100.0% | 0 | | 2.37s | 10,533 | 324 | 0 |

| Domain specific | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Response Time (avg) | Input Tokens | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-5.5 | 5.3 | 7.2 | 44.4% | 1 | | 164.14s | 723 | 67 | 79,625 |
| GPT-5.4 | 5.3 | 7.2 | 44.4% | 1 | | 74.27s | 619 | 61 | 34,748 |
| Gemini 3.1 Pro Preview | 7.7 | 10.0 | 66.7% | 0 | | 32.73s | 635 | 18 | 12,424 |
| Claude Opus 4.7 | 7.7 | 10.0 | 66.7% | 0 | | 1.17s | 630 | 51 | 0 |

| General Intelligence | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Response Time (avg) | Input Tokens | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-5.5 | 10.0 | 10.0 | 100.0% | 0 | | 4.16s | 477 | 138 | 223 |
| GPT-5.4 | 4.7 | 3.1 | 33.3% | 1 | | 4.92s | 477 | 145 | 321 |
| Gemini 3.1 Pro Preview | 10.0 | 10.0 | 100.0% | 0 | | 11.77s | 490 | 108 | 1,179 |
| Claude Opus 4.7 | 10.0 | 10.0 | 100.0% | 0 | | 2.87s | 723 | 256 | 0 |

| Instructions following | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Response Time (avg) | Input Tokens | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-5.5 | 10.0 | 10.0 | 100.0% | 0 | | 3.36s | 660 | 93 | 538 |
| GPT-5.4 | 10.0 | 10.0 | 100.0% | 0 | | 3.11s | 660 | 93 | 897 |
| Gemini 3.1 Pro Preview | 10.0 | 10.0 | 100.0% | 0 | | 9.56s | 621 | 72 | 2,236 |
| Claude Opus 4.7 | 10.0 | 10.0 | 100.0% | 0 | | 1.57s | 939 | 114 | 0 |

| Puzzle Solving | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Response Time (avg) | Input Tokens | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-5.5 | 10.0 | 10.0 | 100.0% | 0 | | 6.76s | 642 | 241 | 2,225 |
| GPT-5.4 | 8.2 | 7.2 | 88.9% | 1 | | 9.14s | 642 | 441 | 3,815 |
| Gemini 3.1 Pro Preview | 10.0 | 10.0 | 100.0% | 0 | | 6.90s | 570 | 235 | 3,128 |
| Claude Opus 4.7 | 10.0 | 10.0 | 100.0% | 0 | | 2.43s | 939 | 370 | 0 |

| Tool Calling | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Response Time (avg) | Input Tokens | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-5.5 | 10.0 | 10.0 | 100.0% | 0 | | 10.57s | 5,445 | 258 | 832 |
| GPT-5.4 | 10.0 | 10.0 | 100.0% | 0 | | 13.28s | 5,445 | 264 | 1,031 |
| Gemini 3.1 Pro Preview | 10.0 | 10.0 | 100.0% | 0 | | 23.15s | 6,018 | 274 | 982 |
| Claude Opus 4.7 | 10.0 | 10.0 | 100.0% | 0 | | 4.17s | 15,339 | 373 | 0 |

| Trivia | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Response Time (avg) | Input Tokens | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-5.5 | 2.8 | 1.6 | 33.3% | 1 | | 37.86s | 195 | 30 | 1,754 |
| GPT-5.4 | 3.0 | 10.0 | 0.0% | 0 | | 13.95s | 195 | 30 | 1,821 |
| Gemini 3.1 Pro Preview | 10.0 | 10.0 | 100.0% | 0 | | 6.27s | 156 | 12 | 1,297 |
| Claude Opus 4.7 | 3.0 | 10.0 | 0.0% | 0 | | 2.25s | 273 | 24 | 0 |
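As a consistency check on the category breakdown, the per-category Reasoning Tokens can be summed and compared with the Reasoning Tokens totals in the main comparison table. The sketch below uses only numbers copied from this page; the variable names are illustrative. Values are listed in the page's category order: Anti-AI Tricks, Coding, Combined, Data parsing and extraction, Domain specific, General Intelligence, Instructions following, Puzzle Solving, Tool Calling, Trivia.

```python
# Reasoning tokens per category, in the order the categories appear on the page.
reasoning_by_category = {
    "GPT-5.5": [1_335, 24_959, 2_841, 593, 79_625, 223, 538, 2_225, 832, 1_754],
    "GPT-5.4": [1_511, 24_216, 3_543, 804, 34_748, 321, 897, 3_815, 1_031, 1_821],
    "Gemini 3.1 Pro Preview": [3_218, 41_247, 9_281, 3_904, 12_424, 1_179, 2_236, 3_128, 982, 1_297],
    "Claude Opus 4.7": [0, 1_114, 1_084, 0, 0, 0, 0, 0, 0, 0],
}

# Reasoning Tokens totals from the main comparison table.
listed_total = {
    "GPT-5.5": 114_925,
    "GPT-5.4": 72_707,
    "Gemini 3.1 Pro Preview": 78_896,
    "Claude Opus 4.7": 2_198,
}

category_sums = {name: sum(tokens) for name, tokens in reasoning_by_category.items()}
for name, total in category_sums.items():
    status = "matches" if total == listed_total[name] else "differs from"
    print(f"{name}: category sum {total:,} {status} listed total {listed_total[name]:,}")
```

The sums agree with the listed totals for all four models. They also show where the reasoning budget goes: Domain specific alone accounts for 79,625 of GPT-5.5's 114,925 reasoning tokens, and Claude Opus 4.7 reports reasoning tokens only in Coding and Combined.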
