Failure ranking: Did not follow instructions

See which AI models most often fail with "Did not follow instructions", so you can spot reliability risks before choosing one. Sort by: Correct tests ↑.

Models shown

Total failures

246

Most affected model

Granite 4.1 8B (4 failures)

Categories

Puzzle solving: 90
General intelligence: 78
Anti-AI tricks: 33
Instruction following: 19
Programming: 16
Tool calling: 8
Combined: 1
Domain-specific: 1

141/141

Rank	Model	Company	Did not follow instructions (count)	Score	Total cost	Correct tests	Wrong tests	Avg. response time
#153	Mimo V2 PRO none	Xiaomi	2	5.6	$0.045	7/21	14	2.27s
#154	Owl Alpha none	OpenRouter	3	5.6	$0.000	7/21	14	9.88s
#194	Cobuddy medium	Baidu	3	4.7	$0.000	7/21	14	39.9s
#197	Grok 4.20 Beta none	X AI	1	4.4	$0.087	6/18	12	1.19s
#202	Hunter Alpha none	OpenRouter	2	4.2	$0.000	6/18	12	4.70s
#109	Qwen3.5-27B none	Qwen	2	6.5	$0.090	8/22	14	4.76s
#118	Claude Sonnet 5 none	Anthropic	1	6.3	$0.548	8/22	14	6.04s
#132	Qwen3.5 Plus 2026-04-20 none	Qwen	2	6.1	$0.122	8/22	14	13.6s
#135	Nemotron 3 Ultra none	NVIDIA	1	6.1	$0.095	8/22	14	3.87s
#138	GPT-5.6 Terra none	OpenAI	1	6.0	$0.349	8/22	14	1.65s
#146	Nemotron 3 Super medium	NVIDIA	3	5.7	$0.055	8/22	14	52.0s
#155	KAT-Coder-Air V2.5 medium	Kwaipilot	1	5.6	$0.048	8/22	14	8.42s
#162	Gemma 4 26B A4B none	Google	2	5.5	$0.015	8/22	14	7.64s
#208	Grok Build 0.1 none	X AI	2	4.0	$0.547	7/19	12	28.7s
#151	GLM 5V Turbo none	Z.ai	2	5.6	$0.052	8/21	13	2.99s
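The rows above are not sorted by the raw number of correct tests (7/21 comes before 6/18, and 7/19 comes after 8/22). They appear instead to be ordered by the fraction correct/total, ascending, with ties broken by overall rank; this is an inference from the visible rows, not documented site behavior. A minimal Python sketch of that assumed ordering follows, using a few rows from the table. The `Row` type and the `sort_by_correct_ratio` helper are hypothetical names for illustration; the wrong-test count is derived as total minus correct, which matches every row shown.

```python
from fractions import Fraction
from typing import NamedTuple


class Row(NamedTuple):
    rank: int      # overall leaderboard position (the "#" column)
    model: str
    correct: int   # correct tests
    total: int     # total tests

    @property
    def wrong(self) -> int:
        # Wrong tests = total tests minus correct tests
        return self.total - self.correct


def sort_by_correct_ratio(rows: list[Row]) -> list[Row]:
    # Assumption: "Correct tests ↑" sorts by correct/total ascending,
    # breaking ties by overall rank (ascending). Fraction keeps 7/21 and
    # 6/18 exactly equal instead of relying on float rounding.
    return sorted(rows, key=lambda r: (Fraction(r.correct, r.total), r.rank))


rows = [
    Row(151, "GLM 5V Turbo none", 8, 21),
    Row(208, "Grok Build 0.1 none", 7, 19),
    Row(109, "Qwen3.5-27B none", 8, 22),
    Row(202, "Hunter Alpha none", 6, 18),
    Row(153, "Mimo V2 PRO none", 7, 21),
]

for r in sort_by_correct_ratio(rows):
    print(f"#{r.rank}\t{r.model}\t{r.correct}/{r.total}\twrong={r.wrong}")
```

Under this assumption the five sample rows come out in the same relative order as in the table: #153 and #202 (both 1/3, tie broken by rank), then #109 (8/22), #208 (7/19) and #151 (8/21).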

Failures by "Did not follow instructions"


Best models by "Did not follow instructions" count

"Did not follow instructions" count vs. score

Best models by average response time