AI BENCHY Compare

OpenAI: GPT-5.5 vs StepFun: Step 3.7 Flash

Summary

GPT-5.5 vs Step 3.7 Flash benchmark comparison: Step 3.7 Flash leads on average score, 7.0 vs 6.4, and on attempt pass rate, 63.5% vs 54.0%. GPT-5.5 has the lower total benchmark cost, $0.231 vs $1.148, and the faster average response time, 1.89s vs 64.46s.

Recommended model: GPT-5.5. Its score stays close to the best score here (6.4 vs 7.0), while it costs about one fifth as much as Step 3.7 Flash ($0.231 vs $1.148).
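The cost and speed gaps behind this recommendation can be checked directly from the totals quoted in the summary. The following is a minimal Python sketch that uses only those numbers; the variable names are illustrative and are not part of AI Benchy.

```python
# Totals quoted in the summary (GPT-5.5 first, Step 3.7 Flash second).
gpt_cost, step_cost = 0.231, 1.148    # total benchmark cost, USD
gpt_time, step_time = 1.89, 64.46     # average response time, seconds
gpt_score, step_score = 6.4, 7.0      # average score

cost_ratio = step_cost / gpt_cost     # how many times cheaper GPT-5.5 was
time_ratio = step_time / gpt_time     # how many times faster GPT-5.5 was
score_gap = step_score - gpt_score    # score GPT-5.5 gives up in exchange

print(f"cost ratio: {cost_ratio:.1f}x")
print(f"time ratio: {time_ratio:.1f}x")
print(f"score gap:  {score_gap:.1f}")
```

Whether a 0.6-point score gap is worth the savings depends on the workload; the sketch only makes the trade-off explicit.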

Last updated at: 2026-06-04

| Metric | GPT-5.5 (none, Release: 2026-04-24) | Step 3.7 Flash (high, Release: 2026-05-29) |
| --- | --- | --- |
| Score | 6.4 | 7.0 |
| Rank | #91 | #71 |
| Reliability | 10.0 | 10.0 |
| Consistency | 8.8 | 8.2 |
| Tests Correct | | |
| Attempt pass rate | 54.0% | 63.5% |
| Flaky tests | 3 | 4 |
| Total Runs | 63 | 63 |
| Cost per result | 2.302 | 10.434 |
| Total Cost | $0.231 | $1.148 |
| Input Price | $5.000 / 1M | $0.200 / 1M |
| Output Price | $30.000 / 1M | $1.150 / 1M |
| Total Input Tokens | 34,212 | 38,391 |
| Output Tokens | 1,971 | 991,355 |
| Reasoning Tokens | 0 | 0 |
| Response Time (avg) | 1.89s | 64.46s |
| Response Time (max) | 5.56s | 364.99s |
| Response Time (total) | 39.64s | 1353.57s |
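The Total Cost row can be roughly reproduced from the token counts and the per-million-token prices listed above. The sketch below assumes cost is simply tokens times price, summed over input and output; the page does not state its exact billing formula, and `token_cost` is a hypothetical helper written for this illustration.

```python
def token_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Estimate cost in USD from token counts and per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Token counts and prices copied from the comparison table.
gpt_estimate = token_cost(34_212, 1_971, 5.000, 30.000)
step_estimate = token_cost(38_391, 991_355, 0.200, 1.150)

print(f"GPT-5.5:        ${gpt_estimate:.3f}")
print(f"Step 3.7 Flash: ${step_estimate:.3f}")
```

Under this assumption both estimates land within about a tenth of a cent of the reported totals. Step 3.7 Flash's cost is dominated by its output tokens, which is why it ends up more expensive overall despite much lower per-token prices.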

Generation showcase

Hamster playing table tennis

Prompt: Create a detailed SVG illustration of a hamster playing table tennis.

#91 GPT-5.5 (none): Cost $0.090, Time 54.3s, Tokens 3,063

#71 Step 3.7 Flash (high): Cost $0.007, Time 63.6s, Tokens 6,030

[Charts: Top Models by Score; Score vs Total Cost; Response Time (avg); Score vs Response Time (avg); Total Output Tokens; Score vs Total Output Tokens]

Category Breakdown

| Category | Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Response Time (avg) | Input Tokens | Output Tokens | Reasoning Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Anti-AI Tricks | GPT-5.5 | 6.9 | 7.9 | 66.7% | 1 | | 1.31s | 606 | 213 | 0 |
| Anti-AI Tricks | Step 3.7 Flash | 10.0 | 10.0 | 100.0% | 0 | | 13.40s | 696 | 42,656 | 0 |
| Coding | GPT-5.5 | 5.5 | 10.0 | 33.3% | 0 | | 1.35s | 7,305 | 462 | 0 |
| Coding | Step 3.7 Flash | 4.0 | 6.0 | 22.2% | 1 | | 206.21s | 6,057 | 327,340 | 0 |
| Combined | GPT-5.5 | 3.0 | 10.0 | 0.0% | 0 | | 5.56s | 11,019 | 300 | 0 |
| Combined | Step 3.7 Flash | 10.0 | 10.0 | 100.0% | 0 | | 13.01s | 13,638 | 8,802 | 0 |
| Data parsing and extraction | GPT-5.5 | 10.0 | 10.0 | 100.0% | 0 | | 1.18s | 7,140 | 222 | 0 |
| Data parsing and extraction | Step 3.7 Flash | 10.0 | 10.0 | 100.0% | 0 | | 14.72s | 7,368 | 23,113 | 0 |
| Domain specific | GPT-5.5 | 2.9 | 7.2 | 11.1% | 1 | | 1.31s | 723 | 52 | 0 |
| Domain specific | Step 3.7 Flash | 4.1 | 4.4 | 44.5% | 2 | | 149.64s | 783 | 410,502 | 0 |
| General Intelligence | GPT-5.5 | 10.0 | 10.0 | 100.0% | 0 | | 3.41s | 477 | 124 | 0 |
| General Intelligence | Step 3.7 Flash | 5.5 | 10.0 | 0.0% | 0 | | 4.17s | 510 | 2,862 | 0 |
| Instructions following | GPT-5.5 | 6.2 | 5.8 | 66.7% | 1 | | 1.15s | 660 | 81 | 0 |
| Instructions following | Step 3.7 Flash | 9.8 | 10.0 | 100.0% | 0 | | 1.52s | 705 | 2,010 | 0 |
| Puzzle Solving | GPT-5.5 | 7.7 | 10.0 | 66.7% | 0 | | 1.29s | 642 | 252 | 0 |
| Puzzle Solving | Step 3.7 Flash | 5.3 | 7.2 | 44.4% | 1 | | 10.22s | 711 | 25,422 | 0 |
| Tool Calling | GPT-5.5 | 10.0 | 10.0 | 100.0% | 0 | | 3.90s | 5,445 | 247 | 0 |
| Tool Calling | Step 3.7 Flash | 10.0 | 10.0 | 100.0% | 0 | | 2.79s | 7,701 | 1,172 | 0 |
| Trivia | GPT-5.5 | 3.0 | 10.0 | 0.0% | 0 | | 5.01s | 195 | 18 | 0 |
| Trivia | Step 3.7 Flash | 3.0 | 10.0 | 0.0% | 0 | | 149.34s | 222 | 147,476 | 0 |
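As a consistency check, the per-category output token counts in the breakdown above can be summed and compared against the overall Output Tokens figures (1,971 and 991,355). The following is a small Python sketch using only numbers from this page; the dictionary layout is an illustrative choice, not an AI Benchy data format.

```python
# Output tokens per category, copied from the breakdown above, in this order:
# Anti-AI Tricks, Coding, Combined, Data parsing and extraction, Domain specific,
# General Intelligence, Instructions following, Puzzle Solving, Tool Calling, Trivia.
output_tokens = {
    "GPT-5.5": [213, 462, 300, 222, 52, 124, 81, 252, 247, 18],
    "Step 3.7 Flash": [42_656, 327_340, 8_802, 23_113, 410_502,
                       2_862, 2_010, 25_422, 1_172, 147_476],
}
# Overall Output Tokens values from the main comparison table.
reported_totals = {"GPT-5.5": 1_971, "Step 3.7 Flash": 991_355}

for model, counts in output_tokens.items():
    total = sum(counts)
    matches = total == reported_totals[model]
    print(f"{model}: summed={total:,} reported={reported_totals[model]:,} match={matches}")
```

The category sums agree with the reported totals for both models, so the breakdown accounts for all output tokens in the run.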
