Tool Calling x Wrong answer Ranking

See which AI models most often fail Tool Calling tests with a Wrong answer, so you can spot weak points faster.

Failure Reasons

API error (17), Invalid tool call (9), Did not follow instructions (8), Wrong answer (3), No answer (2)

Categories

Domain specific (412), Anti-AI Tricks (293), Coding (252), Puzzle Solving (201), Trivia (168), Combined (68), Instructions following (61), General Intelligence (59), Data parsing and extraction (41), Tool Calling (3)


Rank	Model	Company	Wrong answer Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#55	GPT-5.6 Terra low	OpenAI	1	4.7	$0.519	0/1	6.69s
#176	GLM 4.7 Flash none	Z.ai	1	2.8	$0.016	0/1	7.05s
#203	Grok 4.1 Fast none	X AI	1	2.8	$0.008	0/1	5.51s
