AI BENCHY

AI BENCHY Compare

Inception: Mercury 2 vs Qwen: Qwen3.5-9B

Last updated: 2026-03-12

Metric | Mercury 2 (none, released 2026-02-24) | Qwen3.5-9B (medium, released 2026-03-02)
Rank | #61 | #66
Avg Score | 3.4 | 2.6
Consistency | 9.0 | 7.4
Cost per result | 0.153 | 0.779
Total Cost | $0.007 | $0.024
Tests Correct |  | 
Attempt pass rate | 31.3% | 35.4%
Flaky tests | 2 | 5
Total Runs | 48 | 48
Output Tokens | 1,303 | 17,930
Reasoning Tokens | 0 | 139,706
Response Time (avg) | 596ms | 71.44s
Response Time (max) | 1.27s | 226.38s
Response Time (total) | 9.54s | 928.77s
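The summary figures can be cross-checked with a little arithmetic. The sketch below copies values by hand from the summary table and derives the approximate number of passing attempts plus a few Qwen3.5-9B vs Mercury 2 ratios. It assumes that "Attempt pass rate" means passing attempts divided by total runs; the variable and function names are illustrative, not part of the benchmark site.

```python
# Values transcribed from the summary table (Mercury 2 vs Qwen3.5-9B).
# Assumption: "Attempt pass rate" = passing attempts / total runs.
summary = {
    "Mercury 2": {"pass_rate": 0.313, "runs": 48, "total_cost": 0.007,
                  "output_tokens": 1303, "reasoning_tokens": 0, "avg_time_s": 0.596},
    "Qwen3.5-9B": {"pass_rate": 0.354, "runs": 48, "total_cost": 0.024,
                   "output_tokens": 17930, "reasoning_tokens": 139706, "avg_time_s": 71.44},
}


def passing_attempts(model: dict) -> int:
    """Estimate passing attempts from the (rounded) displayed percentage."""
    return round(model["pass_rate"] * model["runs"])


def ratio(metric: str, a: str = "Qwen3.5-9B", b: str = "Mercury 2") -> float:
    """How many times larger model a's value is than model b's value."""
    return summary[a][metric] / summary[b][metric]


for name, model in summary.items():
    print(f"{name}: ~{passing_attempts(model)} of {model['runs']} attempts passed")

print(f"Avg response time: Qwen3.5-9B is ~{ratio('avg_time_s'):.0f}x slower")
print(f"Total cost: Qwen3.5-9B is ~{ratio('total_cost'):.1f}x more expensive")
print(f"Output tokens: Qwen3.5-9B emits ~{ratio('output_tokens'):.1f}x more")
```

Because the page shows rounded percentages and costs, these results are approximations; the cost ratio in particular depends on how $0.007 and $0.024 were rounded.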

[Charts not captured: Top Models by Score; Score vs Total Cost; Response Time (avg); Avg Score vs Response Time (avg); Total Output Tokens; Avg Score vs Total Output Tokens]

Category Breakdown

Anti-AI Tricks
Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Response Time (avg) | Output Tokens | Reasoning Tokens
Mercury 2 | 10.0 | 10.0 | 0.0% | 0 |  | 466ms | 274 | 0
Qwen3.5-9B | 4.0 | 7.2 | 55.6% | 1 |  | 31.54s | 2,410 | 10,913

Combined
Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Response Time (avg) | Output Tokens | Reasoning Tokens
Mercury 2 | 10.0 | 10.0 | 0.0% | 0 |  | 606ms | 131 | 0
Qwen3.5-9B | 10.0 | 10.0 | 0.0% | 0 |  | 0ms | 0 | 0

Data parsing and extraction
Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Response Time (avg) | Output Tokens | Reasoning Tokens
Mercury 2 | 5.5 | 5.9 | 83.3% | 1 |  | 667ms | 180 | 0
Qwen3.5-9B | 5.0 | 5.6 | 33.3% | 1 |  | 87.31s | 1,383 | 32,113

Domain specific
Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Response Time (avg) | Output Tokens | Reasoning Tokens
Mercury 2 | 4.0 | 7.2 | 44.4% | 1 |  | 534ms | 46 | 0
Qwen3.5-9B | 10.0 | 7.2 | 22.2% | 1 |  | 137.75s | 11,549 | 48,475

General Intelligence
Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Response Time (avg) | Output Tokens | Reasoning Tokens
Mercury 2 | 4.0 | 10.0 | 0.0% | 0 |  | 628ms | 159 | 0
Qwen3.5-9B | 10.0 | 1.6 | 33.3% | 1 |  | 226.38s | 0 | 30,695

Instructions following
Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Response Time (avg) | Output Tokens | Reasoning Tokens
Mercury 2 | 5.5 | 10.0 | 50.0% | 0 |  | 551ms | 82 | 0
Qwen3.5-9B | 5.5 | 5.8 | 66.7% | 1 |  | 17.15s | 599 | 4,517

Puzzle Solving
Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Response Time (avg) | Output Tokens | Reasoning Tokens
Mercury 2 | 10.0 | 10.0 | 0.0% | 0 |  | 533ms | 234 | 0
Qwen3.5-9B | 10.0 | 10.0 | 0.0% | 0 |  | 33.38s | 1,545 | 11,844

Tool Calling
Model | Score | Consistency | Attempt pass rate | Flaky tests | Tests Correct | Response Time (avg) | Output Tokens | Reasoning Tokens
Mercury 2 | 10.0 | 10.0 | 100.0% | 0 |  | 1.27s | 197 | 0
Qwen3.5-9B | 10.0 | 10.0 | 100.0% | 0 |  | 4.31s | 444 | 1,149
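The per-category token counts add up exactly to the totals in the summary table; for example, Qwen3.5-9B's eight reasoning-token figures sum to 139,706. A minimal Python sketch of that cross-check follows, with values transcribed by hand from the Category Breakdown; the names (`category_tokens`, `check_totals`, and so on) are illustrative and not part of the benchmark site.

```python
# Per-category token counts from the Category Breakdown, in table order.
CATEGORIES = [
    "Anti-AI Tricks", "Combined", "Data parsing and extraction", "Domain specific",
    "General Intelligence", "Instructions following", "Puzzle Solving", "Tool Calling",
]

category_tokens = {
    "Mercury 2": {
        "output": [274, 131, 180, 46, 159, 82, 234, 197],
        "reasoning": [0, 0, 0, 0, 0, 0, 0, 0],
    },
    "Qwen3.5-9B": {
        "output": [2410, 0, 1383, 11549, 0, 599, 1545, 444],
        "reasoning": [10913, 0, 32113, 48475, 30695, 4517, 11844, 1149],
    },
}

# Totals reported in the summary table.
summary_totals = {
    "Mercury 2": {"output": 1303, "reasoning": 0},
    "Qwen3.5-9B": {"output": 17930, "reasoning": 139706},
}


def check_totals() -> dict:
    """Map (model, kind) to (category_sum, summary_total, matches)."""
    results = {}
    for model, kinds in category_tokens.items():
        for kind, values in kinds.items():
            # One value per category is expected.
            assert len(values) == len(CATEGORIES)
            total = sum(values)
            expected = summary_totals[model][kind]
            results[(model, kind)] = (total, expected, total == expected)
    return results


for (model, kind), (total, expected, ok) in check_totals().items():
    status = "OK" if ok else "MISMATCH"
    print(f"{model} {kind} tokens: {total:,} (summary: {expected:,}) {status}")
```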
