git clone https://github.com/ShishirPatil/gorilla
cd gorilla/berkeley-function-call-leaderboard
python3 -m venv venv
./venv/bin/pip install -e .
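# Apply the repo diff for the provider under test (moonshot-bfcl.diff or chutes-bfcl.diff, listed below), e.g. with `git apply`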
./venv/bin/bfcl generate --model kimi-k2-0905 --skip-server-setup --num-threads 8

| Description | URL |
|---|---|
| CSV summary, Moonshot | moonshot_results.csv |
| CSV summary, Chutes | chutes_results.csv |
| Repo diff for Moonshot run | moonshot-bfcl.diff |
| Repo diff for Chutes run | chutes-bfcl.diff |
| Raw results, Moonshot | moonshot-results.tar.gz |
| Raw results, Chutes | chutes-results.tar.gz |

| Benchmark | MoonshotAI | Chutes |
|---|---|---|
| Overall Accuracy | 43.88% | 45.62% |
| Total Cost | $5.83 | $4.70 |
| Latency Mean | 4.18s | 3.32s |
| Latency Std Dev | 3.52s | 10.03s |
| Latency 95th Percentile | 9.87s | 7.83s |

| Benchmark | MoonshotAI | Chutes |
|---|---|---|
| Non-Live AST Accuracy | 87.50% | 89.42% |
| Non-Live Simple AST | 80.00% | 79.67% |
| Non-Live Multiple AST | 94.00% | 92.50% |
| Non-Live Parallel AST | 86.00% | 93.50% |
| Non-Live Parallel Multiple AST | 90.00% | 92.00% |

| Benchmark | MoonshotAI | Chutes |
|---|---|---|
| Live Accuracy | 79.13% | 78.61% |
| Live Simple AST | 87.98% | 87.98% |
| Live Multiple AST | 77.02% | 76.07% |
| Live Parallel AST | 87.50% | 100.00% |
| Live Parallel Multiple AST | 70.83% | 75.00% |

| Benchmark | MoonshotAI | Chutes |
|---|---|---|
| Multi Turn Accuracy | 43.38% | 43.62% |
| Multi Turn Base | 54.50% | 54.50% |
| Multi Turn Miss Func | 44.00% | 46.50% |
| Multi Turn Miss Param | 38.00% | 37.00% |
| Multi Turn Long Context | 37.00% | 36.50% |

| Benchmark | MoonshotAI | Chutes |
|---|---|---|
| Web Search Accuracy | 2.00% | 2.00% |
| Web Search Base | 2.00% | 1.00% |
| Web Search No Snippet | 2.00% | 3.00% |
| Memory Accuracy | 32.69% | 39.78% |
| Memory KV | 18.71% | 26.45% |
| Memory Vector | 23.23% | 29.68% |
| Memory Recursive Summarization | 56.13% | 63.23% |

| Benchmark | MoonshotAI | Chutes |
|---|---|---|
| Relevance Detection | 75.00% | 81.25% |
| Irrelevance Detection | 72.66% | 73.75% |
| Format Sensitivity Max Delta | 7.50 | 8.50 |
| Format Sensitivity Std Dev | 2.17 | 2.20 |