Invalid tool call Failure Ranking

See which AI models make invalid tool calls most often, so you can spot reliability risks before choosing one. Models are sorted by Score, highest first.

Models Shown: 83/83

Total Failures: 100

Most Affected Model: Gemini 3.5 Flash (1)

Categories: Combined 91, Tool Calling 9

Rank	Model	Company	Invalid tool call Count	Score	Total Cost	Tests Correct	Response Time (avg)
#55	GPT-5.6 Terra low	OpenAI	1	7.5	$0.519	13/22	5.31s
#56	GPT-5.4 Mini medium	OpenAI	1	7.5	$0.756	12/22	25.9s
#57	Qwen3.5 Plus 2026-02-15 medium	Qwen	1	7.5	$0.437	14/22	89.2s
#58	Qwen3.5-27B medium	Qwen	1	7.4	$1.627	13/22	111.9s
#64	Gemini 3.1 Flash Lite Preview medium	Google	1	7.3	$0.115	13/22	4.61s
#65	Gemini 3.1 Flash Lite medium	Google	1	7.3	$0.117	13/22	4.27s
#67	Step 3.7 Flash low	Stepfun	1	7.3	$0.454	12/22	20.7s
#68	Kimi K2.6 medium	Moonshot AI	1	7.2	$1.036	12/22	110.0s
#69	KAT-Coder-Pro V2.5 high	Kwaipilot	1	7.2	$0.482	11/22	20.8s
#72	Qwen3.5-122B-A10B medium	Qwen	1	7.1	$1.046	14/22	64.2s
#75	Grok 4.20 medium	X AI	1	7.1	$0.777	12/22	29.5s
#76	DeepSeek V3.2 medium	DeepSeek	1	7.0	$0.078	11/22	68.6s
#77	Kimi K2.5 medium	Moonshot AI	1	7.0	$0.600	10/22	99.0s
#78	Mercury 2 medium	Inception	1	7.0	$0.093	10/22	2.72s
#82	DeepSeek V4 Pro none	DeepSeek	1	6.9	$0.096	10/22	11.6s
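
To compare rows on a common footing, the Tests Correct and Total Cost columns can be turned into an accuracy and a cost per correct test. The following is a minimal Python sketch, not part of the leaderboard itself: the `Row` type and the function names `accuracy` and `cost_per_correct` are hypothetical, and only the first three table rows are copied in as sample data.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """One leaderboard row (illustrative type, not from the source page)."""
    model: str
    invalid_tool_calls: int
    total_cost_usd: float
    tests_correct: int
    tests_total: int


def accuracy(row: Row) -> float:
    """Fraction of tests answered correctly."""
    return row.tests_correct / row.tests_total


def cost_per_correct(row: Row) -> float:
    """Total cost divided by the number of correct tests."""
    return row.total_cost_usd / row.tests_correct


# Values copied from the first three rows of the table.
rows = [
    Row("GPT-5.6 Terra low", 1, 0.519, 13, 22),
    Row("GPT-5.4 Mini medium", 1, 0.756, 12, 22),
    Row("Qwen3.5 Plus 2026-02-15 medium", 1, 0.437, 14, 22),
]

for row in rows:
    print(f"{row.model}: accuracy={accuracy(row):.1%}, "
          f"cost per correct test=${cost_per_correct(row):.3f}")
```

Every row on this page has an Invalid tool call Count of 1, so these derived figures may help separate models that the failure count alone does not.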


Invalid tool call Failures

Charts: Top Models by Invalid tool call Count; Invalid tool call Count vs Score; Top Models by Response Time (avg)