Failure ranking: Did not follow instructions

See which AI models hit the "Did not follow instructions" failure most often, so you can spot reliability risks before choosing one. Sorted by: Tests correct (descending).

Models shown

Total failures: 250

Most affected model: Gemini 3.5 Flash 1

Failures by category

Puzzle solving: 90
General intelligence: 78
Anti-AI tricks: 33
Instruction following: 23
Programming: 16
Tool calling: 8
Combined: 1
Domain-specific: 1
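As a quick sanity check, the per-category counts above can be summed and compared against the reported total of 250. The sketch below is illustrative only: the dictionary name `category_failures` and the English category labels are assumptions (translations of the page's labels), not part of the original page.

```python
# Per-category "Did not follow instructions" counts, copied from the
# breakdown above. English labels are translations (an assumption).
category_failures = {
    "Puzzle solving": 90,
    "General intelligence": 78,
    "Anti-AI tricks": 33,
    "Instruction following": 23,
    "Programming": 16,
    "Tool calling": 8,
    "Combined": 1,
    "Domain-specific": 1,
}

# The sum should match the "Total failures" figure reported on the page.
total = sum(category_failures.values())
print(f"Total failures: {total}")

# Share of all failures contributed by each category, largest first.
for name, count in sorted(category_failures.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {count} ({count / total:.1%})")
```

The categories do add up to 250, and Puzzle solving alone accounts for 90 of them (36%).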

145/145

Rank	Model	Company	Did-not-follow-instructions count	Score	Total cost	Tests correct	Response time (avg)
#142	GPT-5.4 Mini none	OpenAI	3	5.9	$0.095	6/22	1.53s
#148	Qwen3.5-122B-A10B none	Qwen	2	5.7	$0.247	6/22	12.9s
#160	MiMo-V2.5-Pro none	Xiaomi	4	5.5	$0.068	6/22	4.12s
#172	Inkling none	Thinkingmachines	1	5.2	$0.147	6/22	3.50s
#182	DeepSeek V3.2 none	DeepSeek	1	5.0	$0.054	6/22	18.3s
#185	GLM 4.7 Flash none	Z.ai	1	4.9	$0.016	6/22	9.15s
#187	Ling-2.6-flash none	Inclusionai	2	4.9	$0.002	6/22	10.7s
#215	Laguna Xs.2 none	Poolside	1	3.8	$0.004	5/19	806ms
#203	Elephant Alpha none	Openrouter	3	4.3	$0.000	5/21	1.22s
#156	DeepSeek V4 Flash none	DeepSeek	1	5.6	$0.044	5/22	36.8s
#168	Laguna XS 2.1 none	Poolside	1	5.3	$0.008	5/22	1.55s
#173	Mistral Small 4 none	Mistral	1	5.1	$0.022	5/22	1.20s
#174	Qwen3 Coder Next none	Qwen	1	5.1	$0.025	5/22	9.12s
#175	Mistral Small 4 medium	Mistral	2	5.1	$0.096	5/22	10.8s
#176	MiMo-V2.5 none	Xiaomi	1	5.1	$0.025	5/22	4.62s

Failures by Did not follow instructions

Charts: Top models by Did-not-follow-instructions count; Did-not-follow-instructions count vs. Score; Top models by Response time (avg)