Tool Calling x Wrong answer Ranking

See which AI models most often give a wrong answer on Tool Calling tests, so you can spot weak points faster. Sorted by: Failure Count ↑.

Failure Reasons

API error: 17, Invalid tool call: 9, Did not follow instructions: 8, Wrong answer: 3, No answer: 2

Categories

Domain specific: 412, Anti-AI Tricks: 293, Coding: 252, Puzzle Solving: 201, Trivia: 168, Combined: 68, Instructions following: 61, General Intelligence: 59, Data parsing and extraction: 41, Tool Calling: 3

Showing 3 of 3 models

Rank	Model	Company	Wrong answer Count	Category Score	Total Cost	Tests Correct	Response Time (avg)
#55	GPT-5.6 Terra low	OpenAI	1	4.7	$0.519	0/1	6.69s
#176	GLM 4.7 Flash none	Z.ai	1	2.8	$0.016	0/1	7.05s
#203	Grok 4.1 Fast none	X AI	1	2.8	$0.008	0/1	5.51s
