@jondurbin
Last active October 15, 2025 20:23
Kimi function calling benchmarks

MoonshotAI vs Chutes BFCL (tool) benchmark

Execution

git clone https://github.com/ShishirPatil/gorilla
cd gorilla/berkeley-function-call-leaderboard
python3 -m venv venv
./venv/bin/pip install -e .
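# Apply the repo diff for the provider under test (moonshot-bfcl.diff or chutes-bfcl.diff, listed under "Raw files/data") before generating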
./venv/bin/bfcl generate --model kimi-k2-0905 --skip-server-setup --num-threads 8

Raw files/data

| Description | File |
| --- | --- |
| CSV summary, Moonshot | moonshot_results.csv |
| CSV summary, Chutes | chutes_results.csv |
| Repo diff for Moonshot run | moonshot-bfcl.diff |
| Repo diff for Chutes run | chutes-bfcl.diff |
| Raw results, Moonshot | moonshot-results.tar.gz |
| Raw results, Chutes | chutes-results.tar.gz |

Overall Performance

| Benchmark | MoonshotAI | Chutes |
| --- | --- | --- |
| Overall Accuracy | 43.88% | 45.62% |
| Total Cost | $5.83 | $4.70 |
| Latency Mean | 4.18s | 3.32s |
| Latency Std Dev | 3.52s | 10.03s |
| Latency 95th Percentile | 9.87s | 7.83s |
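
To read the overall table as gaps rather than raw values, here is a minimal sketch that computes the Chutes minus MoonshotAI difference for each row, both in the metric's own units and relative to MoonshotAI. The numbers are copied from the table above; the dictionary layout and function names (`point_gap`, `relative_gap`, `coefficient_of_variation`) are illustrative assumptions and are not part of BFCL or the published CSV files.

```python
# Figures copied from the Overall Performance table above.
# The structure and names below are illustrative only.
overall = {
    "accuracy_pct": {"moonshot": 43.88, "chutes": 45.62},
    "cost_usd": {"moonshot": 5.83, "chutes": 4.70},
    "latency_mean_s": {"moonshot": 4.18, "chutes": 3.32},
    "latency_std_s": {"moonshot": 3.52, "chutes": 10.03},
    "latency_p95_s": {"moonshot": 9.87, "chutes": 7.83},
}


def point_gap(metric):
    """Chutes minus MoonshotAI, in the metric's own units."""
    row = overall[metric]
    return row["chutes"] - row["moonshot"]


def relative_gap(metric):
    """Chutes relative to MoonshotAI, as a signed percentage."""
    row = overall[metric]
    return 100.0 * (row["chutes"] - row["moonshot"]) / row["moonshot"]


def coefficient_of_variation(provider):
    """Latency std dev divided by latency mean (unitless spread)."""
    return overall["latency_std_s"][provider] / overall["latency_mean_s"][provider]


for metric in overall:
    print(f"{metric:16s} gap={point_gap(metric):+7.2f}  relative={relative_gap(metric):+7.1f}%")

for provider in ("moonshot", "chutes"):
    print(f"{provider:9s} latency CoV = {coefficient_of_variation(provider):.2f}")
```

For accuracy, the gap in the metric's own units is a difference in percentage points, which is not the same as the relative percentage. The coefficient of variation puts the two latency rows side by side: Chutes has the lower mean latency but a larger standard deviation relative to that mean.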

Non-Live AST Performance

| Benchmark | MoonshotAI | Chutes |
| --- | --- | --- |
| Non-Live AST Accuracy | 87.50% | 89.42% |
| Non-Live Simple AST | 80.00% | 79.67% |
| Non-Live Multiple AST | 94.00% | 92.50% |
| Non-Live Parallel AST | 86.00% | 93.50% |
| Non-Live Parallel Multiple AST | 90.00% | 92.00% |
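
The Non-Live headline figure can be sanity-checked against its four sub-scores. The sketch below tests whether the reported accuracy is consistent with a plain unweighted mean of the sub-category rows. It is only a check on the published numbers, not a statement about how the leaderboard code aggregates, and the helper names (`unweighted_mean`, `matches_reported`) and the 0.01 tolerance are assumptions chosen to absorb two-decimal rounding.

```python
# Figures copied from the Non-Live AST table above.
non_live = {
    "MoonshotAI": {"reported": 87.50, "parts": [80.00, 94.00, 86.00, 90.00]},
    "Chutes": {"reported": 89.42, "parts": [79.67, 92.50, 93.50, 92.00]},
}


def unweighted_mean(values):
    """Simple arithmetic mean, each sub-category counted once."""
    return sum(values) / len(values)


def matches_reported(reported, parts, tol=0.01):
    """True if the unweighted mean agrees with the reported value within tol."""
    return abs(unweighted_mean(parts) - reported) <= tol


for provider, row in non_live.items():
    mean = unweighted_mean(row["parts"])
    ok = matches_reported(row["reported"], row["parts"])
    print(f"{provider}: mean of sub-scores = {mean:.4f}, reported = {row['reported']:.2f}, consistent = {ok}")
```

Both providers pass this check for the Non-Live rows. The same helper can be pointed at other sections, though a headline that is weighted by category size would not be expected to match.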

Live Performance

| Benchmark | MoonshotAI | Chutes |
| --- | --- | --- |
| Live Accuracy | 79.13% | 78.61% |
| Live Simple AST | 87.98% | 87.98% |
| Live Multiple AST | 77.02% | 76.07% |
| Live Parallel AST | 87.50% | 100.00% |
| Live Parallel Multiple AST | 70.83% | 75.00% |
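
The 12.5-point gap on Live Parallel AST is the largest in this table, but percentages alone do not show how many test cases it represents. As a rough aid, the sketch below searches for the smallest case count at which every reported percentage could arise as a whole number of passes. The category sizes are not stated in this write-up, so the result is only a lower bound on the size, and the function names (`consistent`, `smallest_case_count`) are hypothetical.

```python
import math


def consistent(pct, n, tol=0.005):
    """True if some k out of n cases gives pct after two-decimal rounding."""
    target = pct * n / 100.0
    return any(
        0 <= k <= n and abs(100.0 * k / n - pct) <= tol + 1e-9
        for k in (math.floor(target), math.ceil(target))
    )


def smallest_case_count(percentages, max_n=10_000):
    """Smallest n for which every percentage is achievable as k/n, else None."""
    for n in range(1, max_n + 1):
        if all(consistent(p, n) for p in percentages):
            return n
    return None


# Live Parallel AST scores copied from the table above.
live_parallel = [87.50, 100.00]
n = smallest_case_count(live_parallel)
print(f"smallest consistent case count: {n}")
print(f"one test case would then be worth {100.0 / n:.2f} points")
```

At that smallest consistent size, the whole gap between the two providers corresponds to a single test case, so differences in small categories like this one deserve more caution than differences in the larger ones.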

Multi-Turn Performance

| Benchmark | MoonshotAI | Chutes |
| --- | --- | --- |
| Multi Turn Accuracy | 43.38% | 43.62% |
| Multi Turn Base | 54.50% | 54.50% |
| Multi Turn Miss Func | 44.00% | 46.50% |
| Multi Turn Miss Param | 38.00% | 37.00% |
| Multi Turn Long Context | 37.00% | 36.50% |

Web Search & Memory

| Benchmark | MoonshotAI | Chutes |
| --- | --- | --- |
| Web Search Accuracy | 2.00% | 2.00% |
| Web Search Base | 2.00% | 1.00% |
| Web Search No Snippet | 2.00% | 3.00% |
| Memory Accuracy | 32.69% | 39.78% |
| Memory KV | 18.71% | 26.45% |
| Memory Vector | 23.23% | 29.68% |
| Memory Recursive Summarization | 56.13% | 63.23% |

Detection & Sensitivity

| Benchmark | MoonshotAI | Chutes |
| --- | --- | --- |
| Relevance Detection | 75.00% | 81.25% |
| Irrelevance Detection | 72.66% | 73.75% |
| Format Sensitivity Max Delta | 7.50 | 8.50 |
| Format Sensitivity Std Dev | 2.17 | 2.20 |

sirouk commented Oct 15, 2025

🔥 🚀
