Failure ranking: Did Not Follow Instructions

See which AI models run into Did Not Follow Instructions failures most often, so you can spot reliability risks before choosing one. Sorted by: Score (ascending).

Models shown

Total failures

245

Most affected model

LFM2-24B-A2B 1

Categories

Puzzle solving: 90
General intelligence: 78
Anti-AI tricks: 33
Instruction following: 18
Programming: 16
Tool calling: 8
Combined: 1
Domain-specific: 1
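As a quick consistency check, the per-category counts can be summed and compared with the reported total of 245 failures, then turned into percentage shares. The sketch below is only illustrative: the dictionary and the `category_shares` helper are hypothetical names, not part of the leaderboard, and the English category labels are translations of the page's own.

```python
# Category counts copied from the breakdown on this page.
category_failures = {
    "Puzzle solving": 90,
    "General intelligence": 78,
    "Anti-AI tricks": 33,
    "Instruction following": 18,
    "Programming": 16,
    "Tool calling": 8,
    "Combined": 1,
    "Domain-specific": 1,
}


def category_shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each category's share of all failures, in percent (1 decimal)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


total_failures = sum(category_failures.values())
shares = category_shares(category_failures)

for name, pct in shares.items():
    print(f"{name}: {pct}%")
print("Total:", total_failures)
```

The counts add up to the 245 total failures reported on the page, and puzzle solving and general intelligence together account for more than two thirds of them.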

140/140

Rank	Model	Company	Did Not Follow Instructions count	Score	Total cost	Tests passed	Response time (avg)
#193	Elephant Alpha none	OpenRouter	3	4.3	$0.000	5/21	1.22s
#191	Grok 4.20 Beta none	X AI	1	4.4	$0.087	6/18	1.19s
#190	MiniMax M2.5 medium	Minimax	3	4.6	$0.340	5/22	68.3s
#189	Mercury 2 none	Inception	1	4.6	$0.030	4/22	829ms
#188	Cobuddy medium	Baidu	3	4.7	$0.000	7/21	39.9s
#187	Qwen3 Coder Next medium	Qwen	3	4.7	$0.032	4/22	9.61s
#186	Laguna M.1 medium	Poolside	1	4.7	$0.033	9/19	14.7s
#185	Grok 4.1 Fast medium	X AI	4	4.7	$0.069	9/19	23.8s
#184	Hunter Alpha medium	OpenRouter	2	4.7	$0.000	8/18	10.3s
#183	Trinity Large Preview none	Arcee AI	3	4.8	$0.008	4/21	2.98s
#181	Grok 4.20 Multi Agent Beta medium	X AI	2	4.8	$5.599	8/18	9.69s
#180	GPT-5.4 Nano none	OpenAI	2	4.8	$0.041	4/22	2.57s
#179	Ring-2.6-1T none	Inclusionai	2	4.8	$0.026	9/22	55.1s
#178	Ling-2.6-flash none	Inclusionai	2	4.9	$0.002	6/22	10.7s
#177	Nemotron 3 Super none	NVIDIA	2	4.9	$0.008	5/22	5.97s
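The "tests passed" column (for example `5/21`) is enough to recover the number of failed tests per model, and from there one can estimate how much of a model's failures come from not following instructions. The sketch below is a minimal illustration, assuming the tab-separated column order used in the table and assuming that the Did Not Follow Instructions count is a subset of the failed tests; `LeaderboardRow` and `parse_row` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class LeaderboardRow:
    model: str
    instruction_failures: int
    passed: int
    total: int

    @property
    def failed(self) -> int:
        # Failed tests = total tests minus tests passed.
        return self.total - self.passed

    @property
    def instruction_failure_share(self) -> float:
        # Assumption: instruction failures are counted among the failed tests.
        return self.instruction_failures / self.failed if self.failed else 0.0


def parse_row(line: str) -> LeaderboardRow:
    """Parse one tab-separated table row (column order assumed from the table)."""
    fields = line.split("\t")
    passed, total = fields[6].split("/")
    return LeaderboardRow(
        model=fields[1],
        instruction_failures=int(fields[3]),
        passed=int(passed),
        total=int(total),
    )


row = parse_row("#193\tElephant Alpha none\tOpenRouter\t3\t4.3\t$0.000\t5/21\t1.22s")
print(row.model, row.failed, f"{row.instruction_failure_share:.1%}")
```

For the first row this gives 16 failed tests, of which 3 are instruction-following failures. With only about 20 tests per model, such shares are coarse and best read as rough indicators.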

Did Not Follow Instructions failures

[Charts: Top models by Did Not Follow Instructions count; Did Not Follow Instructions count vs. Score; Top models by response time (avg)]