Combined x Invalid tool call Ranking

See which AI models most often fail with an Invalid tool call in the Combined category, so you can spot weak points faster. Rows are sorted by Tests Correct, ascending.

Models Shown

Total Failures

Most Affected Model

Muse Spark 1.1 (2)

Failure Reasons

Invalid tool call: 96; Wrong answer: 71; No answer: 33; API error: 26; Timed out: 5; Did not follow instructions: 1; Extra formatting: 1
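To put the failure counts in proportion, the short Python sketch below converts them into percentage shares. It assumes the seven listed reasons cover every failure and that each failure is counted once; the page does not confirm either point, and the function name `failure_shares` is made up for this illustration.

```python
# Failure counts copied from the "Failure Reasons" summary.
# Assumption: these seven reasons account for all failures, one reason each.
failure_counts = {
    "Invalid tool call": 96,
    "Wrong answer": 71,
    "No answer": 33,
    "API error": 26,
    "Timed out": 5,
    "Did not follow instructions": 1,
    "Extra formatting": 1,
}


def failure_shares(counts):
    """Return each reason's share of all failures, in percent, rounded to 0.1."""
    total = sum(counts.values())
    return {reason: round(100 * n / total, 1) for reason, n in counts.items()}


shares = failure_shares(failure_counts)
for reason, pct in shares.items():
    print(f"{reason}: {pct}%")
```

Under those assumptions, Invalid tool call is the single largest failure reason, ahead of Wrong answer.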

Categories

Combined: 96; Tool Calling: 9

80/80

Rank	Model	Company	Invalid tool call Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#133	Qwen3.5-35B-A3B none	Qwen	1	3.8	$0.106	0/2	128.3s
#138	GPT-5.6 Terra none	OpenAI	1	2.9	$0.349	0/2	7.02s
#143	North Mini Code medium	Cohere	1	2.9	$0.000	0/2	554.9s
#148	Qwen3.5-122B-A10B none	Qwen	1	5.2	$0.247	0/2	129.3s
#156	DeepSeek V4 Flash none	DeepSeek	2	4.6	$0.044	0/2	179.6s
#157	GLM 5.1 none	Z.ai	1	2.8	$0.164	0/2	46.9s
#158	Qwen3.6 27B none	Qwen	2	3.2	$0.087	0/2	83.1s
#162	Gemma 4 26B A4B none	Google	1	3.0	$0.015	0/2	37.2s
#164	Laguna S 2.1 medium	Poolside	2	3.2	$0.059	0/2	284.7s
#166	GPT-5.6 Luna none	OpenAI	1	3.2	$0.142	0/2	6.68s
#167	Laguna S 2.1 high	Poolside	2	2.9	$0.127	0/2	702.3s
#168	Laguna XS 2.1 none	Poolside	1	3.0	$0.008	0/2	10.4s
#172	Inkling none	Thinkingmachines	1	2.9	$0.147	0/2	25.7s
#177	Qwen3.5-9B none	Qwen	2	3.0	$0.021	0/2	194.0s
#179	North Mini Code none	Cohere	2	3.2	$0.000	0/2	96.2s
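For readers who want to work with the rows programmatically, here is a minimal Python sketch that parses tab-separated rows in the column order of the table above and re-sorts them. The `Row` class, its field names, and `parse_row` are hypothetical names chosen for this example; the three sample rows are copied from the table, and the tie-breaking rule is an arbitrary choice, not the site's own ordering.

```python
from dataclasses import dataclass


@dataclass
class Row:
    rank: int
    model: str
    company: str
    invalid_tool_calls: int
    category_score: float
    total_cost_usd: float
    tests_correct: int
    tests_total: int
    avg_response_s: float


def parse_row(line: str) -> Row:
    """Parse one tab-separated row in the table's column order."""
    rank, model, company, invalid, score, cost, tests, resp = line.split("\t")
    correct, total = tests.split("/")
    return Row(
        rank=int(rank.lstrip("#")),
        model=model,
        company=company,
        invalid_tool_calls=int(invalid),
        category_score=float(score),
        total_cost_usd=float(cost.lstrip("$")),
        tests_correct=int(correct),
        tests_total=int(total),
        avg_response_s=float(resp.rstrip("s")),
    )


# Three rows copied verbatim from the table.
RAW = [
    "#156\tDeepSeek V4 Flash none\tDeepSeek\t2\t4.6\t$0.044\t0/2\t179.6s",
    "#166\tGPT-5.6 Luna none\tOpenAI\t1\t3.2\t$0.142\t0/2\t6.68s",
    "#167\tLaguna S 2.1 high\tPoolside\t2\t2.9\t$0.127\t0/2\t702.3s",
]

rows = [parse_row(line) for line in RAW]
# Most invalid tool calls first; ties broken by slower average response time.
rows.sort(key=lambda r: (-r.invalid_tool_calls, -r.avg_response_s))

for r in rows:
    print(r.rank, r.model, r.invalid_tool_calls, r.avg_response_s)
```

Keeping cost and response time as plain floats makes it easy to swap in a different sort key, such as total cost.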


Charts (not reproduced here): Top Models by Invalid tool call Count; Invalid tool call Count vs Score; Top Models by Response Time (avg); Top Models by Estimated Wasted Cost.
