MoonshotAI vs Chutes BFCL (tool) benchmark

Execution

git clone https://github.com/ShishirPatil/gorilla
cd gorilla/berkeley-function-call-leaderboard
python3 -m venv venv
./venv/bin/pip install -e .
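# Apply the diff for the provider under test (moonshot-bfcl.diff or chutes-bfcl.diff, listed under Raw files/data) before generating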
./venv/bin/bfcl generate --model kimi-k2-0905 --skip-server-setup --num-threads 8

Raw files/data

Description	URL
CSV summary, Moonshot	moonshot_results.csv
CSV summary, Chutes	chutes_results.csv
Repo diff for Moonshot run	moonshot-bfcl.diff
Repo diff for Chutes run	chutes-bfcl.diff
Raw results, Moonshot	moonshot-results.tar.gz
Raw results, Chutes	chutes-results.tar.gz

Overall Performance

Benchmark	MoonshotAI	Chutes
Overall Accuracy	43.88%	45.62%
Total Cost	$5.83	$4.70
Latency Mean	4.18s	3.32s
Latency Std Dev	3.52s	10.03s
Latency 95th Percentile	9.87s	7.83s
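
To put the cost and latency rows on a common footing, the following sketch computes the relative difference between providers for each row, plus the coefficient of variation of latency (standard deviation divided by mean). The figures are copied from the table above. The dictionary keys and helper names are illustrative only and are not part of BFCL.

```python
# Figures copied from the Overall Performance table: (MoonshotAI, Chutes).
overall = {
    "total_cost_usd": (5.83, 4.70),
    "latency_mean_s": (4.18, 3.32),
    "latency_std_s": (3.52, 10.03),
    "latency_p95_s": (9.87, 7.83),
}


def relative_difference(moonshot: float, chutes: float) -> float:
    """Chutes relative to MoonshotAI as a fraction (negative means lower)."""
    return (chutes - moonshot) / moonshot


def coefficient_of_variation(std: float, mean: float) -> float:
    """Spread of latency relative to its mean."""
    return std / mean


rel = {name: relative_difference(m, c) for name, (m, c) in overall.items()}
cv_moonshot = coefficient_of_variation(
    overall["latency_std_s"][0], overall["latency_mean_s"][0]
)
cv_chutes = coefficient_of_variation(
    overall["latency_std_s"][1], overall["latency_mean_s"][1]
)

for name, value in rel.items():
    print(f"{name}: {value:+.1%}")
print(f"latency CV: MoonshotAI {cv_moonshot:.2f}, Chutes {cv_chutes:.2f}")
```

On these figures Chutes comes out roughly 19% cheaper and about 21% lower on both mean and 95th-percentile latency, while its latency standard deviation is about three times its mean. A lower 95th percentile combined with a much larger standard deviation suggests that a small number of requests beyond the 95th percentile were very slow, though the summary statistics alone cannot confirm that.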

Non-Live AST Performance

Benchmark	MoonshotAI	Chutes
Non-Live AST Accuracy	87.50%	89.42%
Non-Live Simple AST	80.00%	79.67%
Non-Live Multiple AST	94.00%	92.50%
Non-Live Parallel AST	86.00%	93.50%
Non-Live Parallel Multiple AST	90.00%	92.00%
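
The Non-Live AST Accuracy row is consistent with an unweighted mean of the four subcategory rows. The sketch below recomputes it from the table as a sanity check; it only shows that the published numbers agree with that formula and does not claim to reproduce BFCL's own scoring code.

```python
from statistics import mean

# Subcategory scores copied from the Non-Live AST table, in table order:
# Simple, Multiple, Parallel, Parallel Multiple.
non_live = {
    "MoonshotAI": {"reported": 87.50, "parts": [80.00, 94.00, 86.00, 90.00]},
    "Chutes": {"reported": 89.42, "parts": [79.67, 92.50, 93.50, 92.00]},
}

# Allow half of the last printed digit, plus a little slack for float error.
TOLERANCE = 0.005 + 1e-9

recomputed = {provider: mean(row["parts"]) for provider, row in non_live.items()}

for provider, row in non_live.items():
    matches = abs(recomputed[provider] - row["reported"]) <= TOLERANCE
    print(
        f"{provider}: recomputed {recomputed[provider]:.4f}, "
        f"reported {row['reported']:.2f}, match={matches}"
    )
```

Because the mean is unweighted, the 7.5-point gap on Parallel AST moves the category score as much as any other subcategory would, regardless of how many test cases each one contains.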

Live Performance

Benchmark	MoonshotAI	Chutes
Live Accuracy	79.13%	78.61%
Live Simple AST	87.98%	87.98%
Live Multiple AST	77.02%	76.07%
Live Parallel AST	87.50%	100.00%
Live Parallel Multiple AST	70.83%	75.00%
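
Unlike the Non-Live table, the Live Accuracy row is not the plain average of its subcategories (that would be about 80.8% for MoonshotAI). It is consistent with a mean weighted by the number of test cases per subcategory. The sketch below assumes per-category counts of 258, 1053, 16, and 24; those counts are not stated in this document and are an assumption, chosen because they reproduce the reported figures.

```python
# Scores copied from the Live Performance table: (MoonshotAI, Chutes).
live_scores = {
    "simple": (87.98, 87.98),
    "multiple": (77.02, 76.07),
    "parallel": (87.50, 100.00),
    "parallel_multiple": (70.83, 75.00),
}

# ASSUMPTION: test-case counts per live subcategory. Not given in this document.
live_counts = {
    "simple": 258,
    "multiple": 1053,
    "parallel": 16,
    "parallel_multiple": 24,
}


def weighted_accuracy(provider_index: int) -> float:
    """Mean of subcategory scores weighted by the assumed test-case counts."""
    total_cases = sum(live_counts.values())
    weighted_sum = sum(
        live_scores[category][provider_index] * count
        for category, count in live_counts.items()
    )
    return weighted_sum / total_cases


moonshot_live = weighted_accuracy(0)
chutes_live = weighted_accuracy(1)
print(f"MoonshotAI: {moonshot_live:.2f}% (reported 79.13%)")
print(f"Chutes: {chutes_live:.2f}% (reported 78.61%)")
```

If the assumed counts are right, the Live Parallel AST gap (87.50% vs 100.00%) corresponds to only two test cases out of 16, and the Live Multiple AST row dominates the category score.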

Multi-Turn Performance

Benchmark	MoonshotAI	Chutes
Multi Turn Accuracy	43.38%	43.62%
Multi Turn Base	54.50%	54.50%
Multi Turn Miss Func	44.00%	46.50%
Multi Turn Miss Param	38.00%	37.00%
Multi Turn Long Context	37.00%	36.50%

Web Search & Memory

Benchmark	MoonshotAI	Chutes
Web Search Accuracy	2.00%	2.00%
Web Search Base	2.00%	1.00%
Web Search No Snippet	2.00%	3.00%
Memory Accuracy	32.69%	39.78%
Memory KV	18.71%	26.45%
Memory Vector	23.23%	29.68%
Memory Recursive Summarization	56.13%	63.23%

Detection & Sensitivity

Benchmark	MoonshotAI	Chutes
Relevance Detection	75.00%	81.25%
Irrelevance Detection	72.66%	73.75%
Format Sensitivity Max Delta	7.50	8.50
Format Sensitivity Std Dev	2.17	2.20

jondurbin/chutes-tool-calling.md
