AI BENCHY Compare

Inception: Mercury 2 vs OpenAI: gpt-oss-120b

Last updated at: 2026-04-16

Metric	Mercury 2 (none, Release: 2026-02-24)	gpt-oss-120b (none, Release: 2025-08-05, Free Available)
Score	4.8	5.2
Rank	#89	#82
Consistency	9.0	7.9
Tests Correct
Attempt pass rate	27.8%	38.9%
Flaky tests	2	5
Total Runs	54	54
Cost per result	0.165	0.221
Total Cost	$0.007	$0.009
Input Price	$0.250 / 1M	$0.039 / 1M
Output Price	$0.750 / 1M	$0.190 / 1M
Output Tokens	1,625	44,652
Reasoning Tokens	0	0
Response Time (avg)	613ms	11.96s
Response Time (max)	1.27s	68.97s
Response Time (total)	11.04s	179.34s
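
The headline figures above can be turned into relative ratios with a few lines of arithmetic. The following is a minimal sketch, not part of the benchmark itself: the helper names `parse_duration` and `ratio` and the `summary` dictionary layout are assumptions made for illustration, and only values copied from the table above are used.

```python
def parse_duration(text: str) -> float:
    """Convert a duration string such as '613ms' or '11.96s' to seconds."""
    text = text.strip()
    if text.endswith("ms"):
        return float(text[:-2]) / 1000.0
    if text.endswith("s"):
        return float(text[:-1])
    raise ValueError(f"unrecognized duration: {text!r}")


# Values copied from the summary table; the dictionary layout is hypothetical.
summary = {
    "Mercury 2": {
        "avg_time": parse_duration("613ms"),
        "output_tokens": 1625,
        "pass_rate": 27.8,
    },
    "gpt-oss-120b": {
        "avg_time": parse_duration("11.96s"),
        "output_tokens": 44652,
        "pass_rate": 38.9,
    },
}


def ratio(metric: str, numerator: str, denominator: str) -> float:
    """Return summary[numerator][metric] / summary[denominator][metric]."""
    return summary[numerator][metric] / summary[denominator][metric]


if __name__ == "__main__":
    for metric in ("avg_time", "output_tokens", "pass_rate"):
        value = ratio(metric, "gpt-oss-120b", "Mercury 2")
        print(f"{metric}: gpt-oss-120b / Mercury 2 = {value:.2f}x")
```

Read this way, gpt-oss-120b has the higher attempt pass rate in this run, while taking roughly an order of magnitude longer per response and emitting far more output tokens.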

Charts: Top Models by Score; Score vs Total Cost; Response Time (avg); Score vs Response Time (avg); Total Output Tokens; Score vs Total Output Tokens

Category Breakdown

Anti-AI Tricks	Score	Consistency	Attempt pass rate	Flaky tests	Tests Correct	Response Time (avg)	Output Tokens	Reasoning Tokens
Mercury 2	3.0	10.0	0.0%	0		483ms	286	0
gpt-oss-120b	6.6	8.0	58.3%	1		6.03s	4,867	0

Coding	Score	Consistency	Attempt pass rate	Flaky tests	Tests Correct	Response Time (avg)	Output Tokens	Reasoning Tokens
Mercury 2	3.6	8.9	0.0%	0		969ms	310	0
gpt-oss-120b	4.3	1.1	66.7%	1		9.57s	3,232	0

Combined	Score	Consistency	Attempt pass rate	Flaky tests	Tests Correct	Response Time (avg)	Output Tokens	Reasoning Tokens
Mercury 2	3.0	10.0	0.0%	0		606ms	131	0
gpt-oss-120b	3.0	10.0	0.0%	0		0ms	0	0

Data parsing and extraction	Score	Consistency	Attempt pass rate	Flaky tests	Tests Correct	Response Time (avg)	Output Tokens	Reasoning Tokens
Mercury 2	7.3	5.9	83.3%	1		667ms	180	0
gpt-oss-120b	6.5	10.0	50.0%	0		7.12s	598	0

Domain specific	Score	Consistency	Attempt pass rate	Flaky tests	Tests Correct	Response Time (avg)	Output Tokens	Reasoning Tokens
Mercury 2	5.3	7.2	44.4%	1		534ms	46	0
gpt-oss-120b	3.0	10.0	0.0%	0		34.98s	29,483	0

General Intelligence	Score	Consistency	Attempt pass rate	Flaky tests	Tests Correct	Response Time (avg)	Output Tokens	Reasoning Tokens
Mercury 2	4.8	10.0	0.0%	0		628ms	159	0
gpt-oss-120b	4.6	10.0	0.0%	0		2.83s	586	0

Instructions following	Score	Consistency	Attempt pass rate	Flaky tests	Tests Correct	Response Time (avg)	Output Tokens	Reasoning Tokens
Mercury 2	6.5	10.0	50.0%	0		551ms	82	0
gpt-oss-120b	8.4	6.9	83.3%	1		5.10s	1,982	0

Puzzle Solving	Score	Consistency	Attempt pass rate	Flaky tests	Tests Correct	Response Time (avg)	Output Tokens	Reasoning Tokens
Mercury 2	3.1	10.0	0.0%	0		533ms	234	0
gpt-oss-120b	4.5	4.8	44.5%	2		6.86s	3,904	0

Tool Calling	Score	Consistency	Attempt pass rate	Flaky tests	Tests Correct	Response Time (avg)	Output Tokens	Reasoning Tokens
Mercury 2	10.0	10.0	100.0%	0		1.27s	197	0
gpt-oss-120b	3.0	10.0	0.0%	0		0ms	0	0
