AI BENCHY Compare

Mistral: Mistral Small 4 vs OpenAI: gpt-oss-120b

Last updated at: 2026-05-01

Metric	Mistral Small 4 (none; Release: 2026-03-16)	gpt-oss-120b (none; Release: 2025-08-05; Free Available)
Score	5.2	5.4
Rank	#115	#106
Reliability	N/A	N/A
Consistency	9.5	8.2
Tests Correct
Attempt pass rate	31.5%	40.7%
Flaky tests	1	4
Total Runs	54	54
Cost per result	0.118	0.177
Total Cost	$0.006	$0.009
Input Price	$0.150 / 1M	$0.000 / 1M
Output Price	$0.600 / 1M	$0.000 / 1M
Output Tokens	2,207	44,652
Reasoning Tokens	0	0
Response Time (avg)	665ms	11.96s
Response Time (max)	1.72s	68.97s
Response Time (total)	11.97s	179.34s
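
The gap between the two models is easier to read as ratios than as raw figures. The following is a minimal sketch, not part of the benchmark itself: the `summary` dictionary and the `ratio` helper are hypothetical names introduced here, and the values are copied by hand from the table above with response times converted to seconds.

```python
# Values copied from the summary table above (times in seconds).
# The dictionary layout and key names are assumptions made for this sketch.
summary = {
    "Mistral Small 4": {
        "score": 5.2,
        "avg_s": 0.665,
        "total_s": 11.97,
        "output_tokens": 2207,
    },
    "gpt-oss-120b": {
        "score": 5.4,
        "avg_s": 11.96,
        "total_s": 179.34,
        "output_tokens": 44652,
    },
}


def ratio(metric, numerator="gpt-oss-120b", denominator="Mistral Small 4"):
    """Return how many times larger `metric` is for `numerator` than for `denominator`."""
    return summary[numerator][metric] / summary[denominator][metric]


if __name__ == "__main__":
    for metric in ("score", "avg_s", "total_s", "output_tokens"):
        print(f"{metric}: gpt-oss-120b / Mistral Small 4 = {ratio(metric):.2f}x")
```

On these figures, gpt-oss-120b takes roughly 18 times longer per response on average and emits roughly 20 times as many output tokens, for a score about 4% higher.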

Charts: Top Models by Score; Score vs Total Cost; Response Time (avg); Score vs Response Time (avg); Total Output Tokens; Score vs Total Output Tokens

Category Breakdown

Anti-AI Tricks	Score	Consistency	Attempt pass rate	Flaky tests	Tests Correct	Response Time (avg)	Output Tokens	Reasoning Tokens
Mistral Small 4	3.4	7.9	16.7%	1		395ms	182	0
gpt-oss-120b	6.6	8.0	58.3%	1		6.03s	4,867	0

Coding	Score	Consistency	Attempt pass rate	Flaky tests	Tests Correct	Response Time (avg)	Output Tokens	Reasoning Tokens
Mistral Small 4	4.5	9.0	0.0%	0		1.28s	583	0
gpt-oss-120b	4.3	1.1	66.7%	1		9.57s	3,232	0

Combined	Score	Consistency	Attempt pass rate	Flaky tests	Tests Correct	Response Time (avg)	Output Tokens	Reasoning Tokens
Mistral Small 4	3.0	10.0	0.0%	0		1.72s	496	0
gpt-oss-120b	3.0	10.0	0.0%	0		0ms	0	0

Data parsing and extraction	Score	Consistency	Attempt pass rate	Flaky tests	Tests Correct	Response Time (avg)	Output Tokens	Reasoning Tokens
Mistral Small 4	10.0	10.0	100.0%	0		822ms	261	0
gpt-oss-120b	6.5	10.0	50.0%	0		7.12s	598	0

Domain specific	Score	Consistency	Attempt pass rate	Flaky tests	Tests Correct	Response Time (avg)	Output Tokens	Reasoning Tokens
Mistral Small 4	5.3	10.0	33.3%	0		367ms	28	0
gpt-oss-120b	3.0	10.0	0.0%	0		34.98s	29,483	0

General Intelligence	Score	Consistency	Attempt pass rate	Flaky tests	Tests Correct	Response Time (avg)	Output Tokens	Reasoning Tokens
Mistral Small 4	4.0	10.0	0.0%	0		729ms	205	0
gpt-oss-120b	4.6	10.0	0.0%	0		2.83s	586	0

Instructions following	Score	Consistency	Attempt pass rate	Flaky tests	Tests Correct	Response Time (avg)	Output Tokens	Reasoning Tokens
Mistral Small 4	6.5	10.0	50.0%	0		380ms	69	0
gpt-oss-120b	9.8	10.0	100.0%	0		5.10s	1,982	0

Puzzle Solving	Score	Consistency	Attempt pass rate	Flaky tests	Tests Correct	Response Time (avg)	Output Tokens	Reasoning Tokens
Mistral Small 4	3.1	9.9	0.0%	0		589ms	170	0
gpt-oss-120b	4.5	4.8	44.5%	2		6.86s	3,904	0

Tool Calling	Score	Consistency	Attempt pass rate	Flaky tests	Tests Correct	Response Time (avg)	Output Tokens	Reasoning Tokens
Mistral Small 4	10.0	10.0	100.0%	0		1.40s	213	0
gpt-oss-120b	3.0	10.0	0.0%	0		0ms	0	0
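
The per-category scores above can be tallied to see which model leads where. This is a rough sketch under stated assumptions: `category_scores` and `tally` are hypothetical names, the scores are copied by hand from the Score column of each category table, and a simple head-to-head count is not how the site derives its overall Score.

```python
# (category, Mistral Small 4 score, gpt-oss-120b score), copied from the
# Score column of the category tables above.
category_scores = [
    ("Anti-AI Tricks", 3.4, 6.6),
    ("Coding", 4.5, 4.3),
    ("Combined", 3.0, 3.0),
    ("Data parsing and extraction", 10.0, 6.5),
    ("Domain specific", 5.3, 3.0),
    ("General Intelligence", 4.0, 4.6),
    ("Instructions following", 6.5, 9.8),
    ("Puzzle Solving", 3.1, 4.5),
    ("Tool Calling", 10.0, 3.0),
]


def tally(rows):
    """Group category names by which model has the higher score."""
    wins = {"Mistral Small 4": [], "gpt-oss-120b": [], "tie": []}
    for category, mistral, gpt_oss in rows:
        if mistral > gpt_oss:
            wins["Mistral Small 4"].append(category)
        elif gpt_oss > mistral:
            wins["gpt-oss-120b"].append(category)
        else:
            wins["tie"].append(category)
    return wins


wins = tally(category_scores)

if __name__ == "__main__":
    for leader, categories in wins.items():
        print(f"{leader}: {len(categories)} ({', '.join(categories)})")
```

Counted this way, each model leads in four categories and Combined is a tie.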
