@jondurbin
Created October 16, 2025 22:34

Kimi-K2-Instruct vs GLM-4.6 (BFCL Tool Benchmark)

Raw data glm-4.6-results.tar.gz

Overall Performance

| Benchmark | Kimi-K2-Instruct | GLM-4.6-FP8 |
|---|---|---|
| Overall Accuracy | 45.62% | 60.13% |
| Latency Mean | 3.32 s | 6.66 s |
| Latency Std Dev | 10.03 s | 7.50 s |
| Latency 95th Percentile | 7.83 s | 16.14 s |
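
The latency rows are easier to compare once the spread is normalized by the mean. The sketch below computes the coefficient of variation (standard deviation divided by mean) from the three latency figures reported above. The dictionary layout and the helper name `coefficient_of_variation` are illustrative choices, not part of the benchmark tooling.

```python
# Latency figures copied from the table above, in seconds.
latency = {
    "Kimi-K2-Instruct": {"mean": 3.32, "std": 10.03, "p95": 7.83},
    "GLM-4.6-FP8": {"mean": 6.66, "std": 7.50, "p95": 16.14},
}


def coefficient_of_variation(stats):
    """Return std / mean, a unitless measure of relative spread."""
    return stats["std"] / stats["mean"]


# Hypothetical summary structure: one rounded ratio per model.
cv = {model: round(coefficient_of_variation(s), 2) for model, s in latency.items()}

for model, value in cv.items():
    print(f"{model}: std/mean = {value}")
```

A ratio well above 1, as for Kimi-K2-Instruct here, means the standard deviation exceeds the mean, which is consistent with a small number of very slow requests beyond the 95th percentile. The raw data would be needed to confirm that.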

Non-Live AST Performance

| Benchmark | Kimi-K2-Instruct | GLM-4.6-FP8 |
|---|---|---|
| Non-Live AST Accuracy | 89.42% | 87.48% |
| Non-Live Simple AST | 79.67% | 73.92% |
| Non-Live Multiple AST | 92.50% | 90.00% |
| Non-Live Parallel AST | 93.50% | 90.00% |
| Non-Live Parallel Multiple AST | 92.00% | 89.00% |
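
As a sanity check on how the category row relates to its sub-rows, the sketch below compares the reported Non-Live AST Accuracy with the unweighted mean of the four subcategory scores. That the aggregate is an unweighted mean is an assumption made here for illustration; the gist does not state the aggregation rule, and the helper name `unweighted_gap` is hypothetical.

```python
from statistics import mean

# Subcategory scores copied from the table above, in percent:
# Simple, Multiple, Parallel, Parallel Multiple.
non_live = {
    "Kimi-K2-Instruct": [79.67, 92.50, 93.50, 92.00],
    "GLM-4.6-FP8": [73.92, 90.00, 90.00, 89.00],
}

# Reported "Non-Live AST Accuracy" row, in percent.
reported = {"Kimi-K2-Instruct": 89.42, "GLM-4.6-FP8": 87.48}


def unweighted_gap(scores, reported_value):
    """Return (unweighted mean of scores) - reported aggregate, in points."""
    return round(mean(scores) - reported_value, 2)


# ASSUMPTION: the aggregate is an unweighted mean of the sub-rows.
gaps = {model: unweighted_gap(non_live[model], reported[model]) for model in non_live}

for model, gap in gaps.items():
    print(f"{model}: unweighted mean minus reported = {gap} points")
```

Under this assumption the Kimi-K2-Instruct aggregate matches its sub-rows to within rounding, while the GLM-4.6-FP8 aggregate sits about 1.75 points above the unweighted mean. That gap may reflect a different weighting or a transcription difference; the raw data archive is the place to check.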

Live Performance

| Benchmark | Kimi-K2-Instruct | GLM-4.6-FP8 |
|---|---|---|
| Live Accuracy | 78.61% | 80.74% |
| Live Simple AST | 87.98% | 90.00% |
| Live Multiple AST | 76.07% | 78.13% |
| Live Parallel AST | 100.00% | 92.50% |
| Live Parallel Multiple AST | 75.00% | 70.83% |

Multi-Turn Performance

| Benchmark | Kimi-K2-Instruct | GLM-4.6-FP8 |
|---|---|---|
| Multi-Turn Accuracy | 43.62% | 52.00% |
| Multi-Turn Base | 54.50% | 57.00% |
| Multi-Turn Miss Func | 46.50% | 44.00% |
| Multi-Turn Miss Param | 37.00% | 39.50% |
| Multi-Turn Long Context | 36.50% | 42.00% |

Web Search & Memory

| Benchmark | Kimi-K2-Instruct | GLM-4.6-FP8 |
|---|---|---|
| Web Search Accuracy | 2.00% | 5.00% |
| Web Search Base | 1.00% | 4.00% |
| Web Search No Snippet | 3.00% | 2.00% |
| Memory Accuracy | 39.78% | 56.13% |
| Memory KV | 26.45% | 52.26% |
| Memory Vector | 29.68% | 57.42% |
| Memory Recursive Summarization | 63.23% | 58.71% |

Detection & Sensitivity

| Benchmark | Kimi-K2-Instruct | GLM-4.6-FP8 |
|---|---|---|
| Relevance Detection | 81.25% | 75.00% |
| Irrelevance Detection | 73.75% | 83.88% |