Domain specific Model Ranking

AI BENCHY Category

See which AI models perform best on Domain specific, which ones stay reliable, and where the biggest gaps appear. Sort by: Tests Correct ↓.

Models Shown

Average Domain specific Score

4.8

Best Model

Failure Reasons

With failure reason Wrong answer314 With failure reason Timed out34 With failure reason Extra formatting12 With failure reason API error6 With failure reason No answer5 With failure reason Did not follow instructions1

Rank	Model	Company	Domain specific Score	Score	Tests Correct	Response Time (avg)
#77	Claude Sonnet 4.6 none	Anthropic	7.7	6.8	2/3	3.54s
#85	Gemma 4 31B none	Google	7.7	6.5	2/3	3.22s
#108	Qwen3.5-Flash none	Qwen	7.7	5.8	2/3	905ms
#117	Qwen3.5-35B-A3B none	Qwen	7.7	5.6	2/3	485ms
#118	Qwen3.6 27B none	Qwen	7.7	5.6	2/3	3.03s
#122	GLM 4.7 Flash none	Z.ai	7.7	5.5	2/3	744ms
#5	Qwen3.7 Max medium	Qwen	5.9	9.1	1/3	24.9s
#6	GPT-5.5 low	OpenAI	5.3	9.0	1/3	28.1s
#9	GPT-5.5 medium	OpenAI	5.3	8.8	1/3	164.1s
#10	Claude Opus 4.8 medium	Anthropic	5.3	8.7	1/3	14.2s
#12	Gemini 3.1 Flash Lite Preview high	Google	5.3	8.6	1/3	127.6s
#13	Grok 4.20 Beta medium	X AI	5.3	8.5	1/3	21.3s
#15	GPT-5.3-Codex medium	OpenAI	5.9	8.4	1/3	64.3s
#16	Gemini 3 Flash Preview low	Google	5.3	8.4	1/3	8.05s
#19	Seed-2.0-Lite medium	Bytedance Seed	5.9	8.2	1/3	88.7s

Domain specific Ranking