Failure ranking: "Did not follow instructions"

See which AI models run into "Did not follow instructions" failures most often, so you can spot reliability risks before choosing one. Sorted by: Tests correct ↓.

Models shown: 140/140

Total failures: 245

Most affected model: Gemini 3.5 Flash (1)

Categories

Puzzle solving: 90
General intelligence: 78
Anti-AI tricks: 33
Instruction following: 18
Programming: 16
Tool calling: 8
Combined: 1
Domain-specific: 1
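The category counts above sum to the reported total of 245 failures. As a minimal sketch of that arithmetic, the snippet below recomputes the total and each category's share. The English category names are translations, and `category_shares` is a hypothetical helper, not something the leaderboard exposes.

```python
# Category counts copied from the breakdown above (names translated).
category_failures = {
    "Puzzle solving": 90,
    "General intelligence": 78,
    "Anti-AI tricks": 33,
    "Instruction following": 18,
    "Programming": 16,
    "Tool calling": 8,
    "Combined": 1,
    "Domain-specific": 1,
}


def category_shares(counts):
    """Return each category's fraction of all failures (hypothetical helper)."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


total = sum(category_failures.values())
shares = category_shares(category_failures)

print(f"Total failures: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

On these figures, puzzle solving and general intelligence together account for 168 of the 245 failures.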

Rank	Model	Company	"Did not follow instructions" count	Score	Total cost	Tests correct	Response time (avg.)
#9	Gemini 3.5 Flash medium	Google	1	9.1	$0.642	19/22	8.20s
#163	Gemini 3.1 Flash Lite Preview high	Google	1	5.3	$2.310	13/16	68.1s
#131	Grok 4.20 Beta medium	X AI	1	6.0	$0.750	14/18	9.75s
#12	Grok 4.5 high	X AI	1	8.9	$1.707	17/22	76.5s
#13	GPT-5.3-Codex medium	OpenAI	2	8.9	$0.920	16/22	17.0s
#23	Claude Sonnet 5 medium	Anthropic	1	8.3	$0.922	16/22	12.5s
#42	GLM 5 medium	Z.ai	1	7.7	$0.307	15/21	33.5s
#16	Muse Spark 1.1 medium	Meta	2	8.6	$1.357	15/22	25.0s
#18	GPT-5.4 medium	OpenAI	2	8.5	$1.533	15/22	23.1s
#25	Gemini 2.5 Flash medium	Google	1	8.2	$0.643	15/22	21.2s
#28	Inkling high	Thinkingmachines	1	8.0	$1.006	15/22	64.2s
#37	Qwen3.6 Plus medium	Qwen	1	7.8	$0.405	15/22	43.1s
#49	GLM 5 Turbo medium	Z.ai	1	7.6	$0.323	14/21	23.0s
#100	Hy3 preview medium	Tencent	1	6.5	$0.018	14/21	16.3s
#21	GPT-5.2 medium	OpenAI	3	8.4	$0.951	14/22	22.6s
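Each "Tests correct" value in the table is a correct/total pair, so the number of wrong tests and a per-test rate of "Did not follow instructions" failures can be derived from a row. The sketch below does this for three rows copied from the table. The `Row` class and the `failures_per_test` metric are illustrative assumptions and not part of the leaderboard itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One table row, reduced to the fields used here (hypothetical type)."""
    model: str
    instruction_failures: int
    tests_correct: str  # "correct/total", as printed in the table

    @property
    def counts(self):
        correct, total = (int(part) for part in self.tests_correct.split("/"))
        return correct, total

    @property
    def tests_wrong(self):
        correct, total = self.counts
        return total - correct

    @property
    def failures_per_test(self):
        # Assumed metric: instruction failures divided by tests run.
        _, total = self.counts
        return self.instruction_failures / total


# Three rows copied from the table above.
rows = [
    Row("Gemini 3.5 Flash medium", 1, "19/22"),
    Row("GPT-5.3-Codex medium", 2, "16/22"),
    Row("GPT-5.2 medium", 3, "14/22"),
]

for row in sorted(rows, key=lambda r: r.failures_per_test):
    print(f"{row.model}: {row.tests_wrong} wrong, "
          f"{row.failures_per_test:.3f} instruction failures per test")
```

Normalizing by tests run matters because not every model ran the same number of tests (16 to 22 in the rows shown), so raw counts are not directly comparable.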

"Did not follow instructions" failures

[Charts: top models by "Did not follow instructions" count; "Did not follow instructions" count vs. score; top models by response time (avg.)]