jondurbin/GLM-4.6-tool-calling.md

Created October 16, 2025 22:34

Kimi-K2-Instruct vs GLM-4.6 (BFCL Tool Benchmark)

Raw data glm-4.6-results.tar.gz

Overall Performance

| Benchmark | Kimi-K2-Instruct | GLM-4.6-FP8 |
| --- | --- | --- |
| Overall Accuracy | 45.62% | 60.13% |
| Latency Mean | 3.32 s | 6.66 s |
| Latency Std Dev | 10.03 s | 7.50 s |
| Latency 95th Percentile | 7.83 s | 16.14 s |

Non-Live AST Performance

| Benchmark | Kimi-K2-Instruct | GLM-4.6-FP8 |
| --- | --- | --- |
| Non-Live AST Accuracy | 89.42% | 87.48% |
| Non-Live Simple AST | 79.67% | 73.92% |
| Non-Live Multiple AST | 92.50% | 90.00% |
| Non-Live Parallel AST | 93.50% | 90.00% |
| Non-Live Parallel Multiple AST | 92.00% | 89.00% |
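
The category row in this table looks like an aggregate of the four sub-category rows. A minimal sketch for checking that, assuming (this is not stated in the gist) that the aggregate is a plain unweighted mean of the sub-category accuracies; the dictionary and the helper name `unweighted_mean` are illustrative only:

```python
from statistics import mean

# Sub-category accuracies (percent) copied from the Non-Live AST table:
# Simple, Multiple, Parallel, Parallel Multiple.
non_live = {
    "Kimi-K2-Instruct": [79.67, 92.50, 93.50, 92.00],
    "GLM-4.6-FP8": [73.92, 90.00, 90.00, 89.00],
}


def unweighted_mean(scores):
    """Average the sub-category scores with equal weight (an assumption)."""
    return round(mean(scores), 2)


for model, scores in non_live.items():
    print(f"{model}: {unweighted_mean(scores):.2f}%")
# prints:
# Kimi-K2-Instruct: 89.42%
# GLM-4.6-FP8: 85.73%
```

Under this assumption the Kimi-K2-Instruct figure reproduces the reported 89.42%, while the GLM-4.6-FP8 figure comes out at 85.73% against the reported 87.48%, so the aggregation rule behind the reported numbers is best confirmed against the raw data archive.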

Live Performance

| Benchmark | Kimi-K2-Instruct | GLM-4.6-FP8 |
| --- | --- | --- |
| Live Accuracy | 78.61% | 80.74% |
| Live Simple AST | 87.98% | 90.00% |
| Live Multiple AST | 76.07% | 78.13% |
| Live Parallel AST | 100.00% | 92.50% |
| Live Parallel Multiple AST | 75.00% | 70.83% |

Multi-Turn Performance

| Benchmark | Kimi-K2-Instruct | GLM-4.6-FP8 |
| --- | --- | --- |
| Multi-Turn Accuracy | 43.62% | 52.00% |
| Multi-Turn Base | 54.50% | 57.00% |
| Multi-Turn Miss Func | 46.50% | 44.00% |
| Multi-Turn Miss Param | 37.00% | 39.50% |
| Multi-Turn Long Context | 36.50% | 42.00% |

Web Search & Memory

| Benchmark | Kimi-K2-Instruct | GLM-4.6-FP8 |
| --- | --- | --- |
| Web Search Accuracy | 2.00% | 5.00% |
| Web Search Base | 1.00% | 4.00% |
| Web Search No Snippet | 3.00% | 2.00% |
| Memory Accuracy | 39.78% | 56.13% |
| Memory KV | 26.45% | 52.26% |
| Memory Vector | 29.68% | 57.42% |
| Memory Recursive Summarization | 63.23% | 58.71% |

Detection & Sensitivity

| Benchmark | Kimi-K2-Instruct | GLM-4.6-FP8 |
| --- | --- | --- |
| Relevance Detection | 81.25% | 75.00% |
| Irrelevance Detection | 73.75% | 83.88% |
