Combined x Invalid tool call Ranking

See which AI models most often fail with an Invalid tool call on the Combined category, so you can spot weak points faster. Sorted by Tests Correct, descending.

Most Affected Model: Gemini 3.5 Flash (1)

Failure Reasons

Invalid tool call: 91
Wrong answer: 68
No answer: 29
API error: 26
Timed out: 5
Did not follow instructions: 1
Extra formatting: 1

Categories

Combined: 91
Tool Calling: 9

77/77

Rank	Model	Company	Invalid tool call Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#2	Gemini 3.5 Flash high	Google	1	8.2	$1.976	1/2	84.1s
#8	Qwen3.7 Max medium	Qwen	1	8.7	$1.116	1/2	287.8s
#11	Gemini 3.5 Flash low	Google	1	8.2	$0.433	1/2	30.0s
#16	Muse Spark 1.1 medium	Meta	1	8.3	$1.357	1/2	42.6s
#17	Claude Fable 5 medium	Anthropic	1	6.5	$3.478	1/2	27.5s
#23	Claude Sonnet 5 medium	Anthropic	1	7.3	$0.922	1/2	51.9s
#24	Muse Spark 1.1 low	Meta	1	6.6	$0.647	1/2	29.4s
#28	Inkling high	Thinkingmachines	1	7.3	$1.006	1/2	63.8s
#29	Step 3.7 Flash medium	Stepfun	1	7.3	$0.515	1/2	80.9s
#34	GPT-5.6 Terra high	OpenAI	1	8.7	$1.055	1/2	13.7s
#36	Qwen3.7 Plus medium	Qwen	1	8.2	$0.267	1/2	190.3s
#45	DeepSeek V4 Flash high	DeepSeek	1	6.4	$0.042	1/2	104.1s
#51	Nemotron 3 Ultra medium	NVIDIA	1	6.3	$0.774	1/2	218.2s
#55	GPT-5.6 Terra low	OpenAI	1	8.7	$0.519	1/2	9.68s
#56	GPT-5.4 Mini medium	OpenAI	1	6.9	$0.756	1/2	59.6s


[Charts: Top Models by Invalid tool call Count; Invalid tool call Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost]
