Combined x Invalid tool call Ranking

See which AI models are most likely to hit Invalid tool call on Combined, so you can spot weak points faster. Sorted by Response Time (avg), ascending.

Models Shown

Total Failures

Most Affected Model

Laguna M.1 1

Failure Reasons

Invalid tool call: 91, Wrong answer: 68, No answer: 29, API error: 26, Timed out: 5, Did not follow instructions: 1, Extra formatting: 1

Categories

Combined: 91, Tool Calling: 9

77/77

Rank	Model	Company	Invalid tool call Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#192	Laguna M.1 none	Poolside	1	1.5	$0.009	0/1	4.32s
#197	Grok 4.20 none	X AI	1	1.5	$0.057	0/1	6.04s
#191	Grok 4.20 Beta none	X AI	1	1.5	$0.087	0/1	6.48s
#159	GPT-5.6 Luna none	OpenAI	1	3.2	$0.142	0/2	6.68s
#132	GPT-5.6 Terra none	OpenAI	1	2.9	$0.349	0/2	7.02s
#78	Mercury 2 medium	Inception	1	6.7	$0.093	1/2	7.84s
#201	Granite 4.1 8B none	IBM Granite	2	3.0	$0.007	0/2	9.28s
#55	GPT-5.6 Terra low	OpenAI	1	8.7	$0.519	1/2	9.68s
#160	Laguna XS 2.1 none	Poolside	1	3.0	$0.008	0/2	10.4s
#117	GPT-5.6 Luna low	OpenAI	1	2.8	$0.249	0/2	13.7s
#34	GPT-5.6 Terra high	OpenAI	1	8.7	$1.055	1/2	13.7s
#88	Gemini 3.5 Flash minimal	Google	2	3.0	$0.300	0/2	14.4s
#93	GLM 5V Turbo medium	Z.ai	1	3.4	$0.457	0/1	15.1s
#64	Gemini 3.1 Flash Lite Preview medium	Google	1	7.2	$0.115	1/2	16.6s
#65	Gemini 3.1 Flash Lite medium	Google	1	7.2	$0.117	1/2	18.5s


Combined: Invalid tool call

Top Models by Invalid tool call Count

Invalid tool call Count vs Score

Top Models by Response Time (avg)

Top Models by Estimated Wasted Cost