Combined x Invalid tool call Ranking

See which AI models most often fail with an Invalid tool call in the Combined category, so you can spot weak points faster.

Models Shown

Total Failures

Most Affected Model

Muse Spark 1.1 (2)

Failure Reasons

Invalid tool call 91, Wrong answer 68, No answer 29, API error 26, Timed out 5, Did not follow instructions 1, Extra formatting 1

Categories

Combined 91, Tool Calling 9

77/77

Rank	Model	Company	Invalid tool call Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#27	Muse Spark 1.1 high	Meta	2	5.9	$1.694	0/2	70.3s
#88	Gemini 3.5 Flash minimal	Google	2	3.0	$0.300	0/2	14.4s
#99	Qwen3.6 27B medium	Qwen	2	6.7	$0.779	0/2	584.1s
#123	Inkling low	Thinkingmachines	2	2.9	$0.187	0/2	22.7s
#124	Qwen3.6 Flash none	Qwen	2	3.8	$0.062	0/2	26.5s
#150	DeepSeek V4 Flash none	DeepSeek	2	4.6	$0.044	0/2	179.6s
#152	Qwen3.6 27B none	Qwen	2	3.2	$0.087	0/2	83.1s
#169	Qwen3.5-9B none	Qwen	2	3.0	$0.021	0/2	194.0s
#171	North Mini Code none	Cohere	2	3.2	$0.000	0/2	96.2s
#173	DeepSeek V3.2 none	DeepSeek	2	4.8	$0.054	0/2	113.5s
#176	GLM 4.7 Flash none	Z.ai	2	3.0	$0.016	0/2	50.2s
#178	Ling-2.6-flash none	Inclusionai	2	3.0	$0.002	0/2	35.7s
#194	GLM 4.7 Flash medium	Z.ai	2	2.9	$0.166	0/2	802.8s
#201	Granite 4.1 8B none	IBM Granite	2	3.0	$0.007	0/2	9.28s
#2	Gemini 3.5 Flash high	Google	1	8.2	$1.976	1/2	84.1s


Charts: Top Models by Invalid tool call Count; Invalid tool call Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.
