Failure leaderboard for Extra formatting

See which AI models run into Extra formatting failures most often, so you can spot reliability risks before choosing one.

Models shown

Total failures

Most affected model

Claude Opus 4.6: 5

Categories

Anti-AI tricks: 20
Programming: 18
Domain-specific: 17
Puzzle solving: 7
Data parsing and extraction: 6
Instruction following: 3
Combined: 1

41/41

Rank	Model	Company	Extra formatting count	Score	Total cost	Tests passed	Response time (avg)
#43	Claude Opus 4.6 medium	Anthropic	5	7.7	$3.059	13/22	34.3s
#62	Claude Sonnet 4.6 none	Anthropic	4	7.3	$0.661	12/22	8.12s
#108	Claude Sonnet 5 none	Anthropic	4	6.3	$0.548	8/22	6.04s
#154	KAT-Coder-Air V2.5 low	Kwaipilot	4	5.4	$0.041	7/22	10.1s
#40	Claude Sonnet 4.6 medium	Anthropic	3	7.8	$2.057	14/22	25.9s
#48	Grok Build 0.1 medium	X AI	3	7.6	$1.097	14/22	52.1s
#65	Claude Opus 4.8 none	Anthropic	3	7.3	$1.166	13/22	4.91s
#83	MiMo-V2.5-Pro medium	Xiaomi	3	6.9	$0.187	12/22	33.9s
#140	KAT-Coder-Air V2.5 high	Kwaipilot	3	5.6	$0.077	7/22	15.9s
#178	KAT-Coder-Air V2.5 none	Kwaipilot	3	4.8	$0.067	5/22	12.2s
#98	MiMo-V2.5 medium	Xiaomi	2	6.5	$0.082	12/22	32.2s
#133	North Mini Code medium	Cohere	2	5.9	$0.000	9/22	137.1s
#146	DeepSeek V4 Flash none	DeepSeek	2	5.6	$0.044	5/22	36.8s
#167	North Mini Code none	Cohere	2	5.1	$0.000	4/22	29.9s
#169	DeepSeek V3.2 none	DeepSeek	2	5.0	$0.054	6/22	18.3s

Extra formatting failures

Chart: Top models by Extra formatting count

Chart: Extra formatting count vs. Score

Chart: Top models by Response time (avg)