Wei w32zhong

Understanding SGLang's Radix Cache, the LeetCode Way

Overview

What is Radix Cache?

When an LLM processes a prompt, it computes a Key and Value vector for every token — the KV cache. If many requests share the same system prompt, recomputing its KV cache from scratch each time is wasteful. Radix Cache stores these computed prefixes in a Radix Tree and reuses them across requests, which is one of the main reasons SGLang achieves high throughput.
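To make the prefix-reuse idea concrete, here is a minimal sketch of a radix tree keyed by token ids. It is an illustration under assumptions, not SGLang's or mini-sglang's actual code: the names `Node`, `PrefixCache`, `insert`, and `match_prefix` are hypothetical, and the tree only tracks token sequences. A real cache would also attach the KV tensor locations to each node and handle eviction.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Edge label: the run of token ids leading from the parent to this node.
    tokens: tuple = ()
    # Children keyed by the first token id of each child's edge label.
    children: dict = field(default_factory=dict)


def _common_len(a, b):
    """Length of the shared prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class PrefixCache:
    """Toy radix tree over token ids (hypothetical, for illustration only)."""

    def __init__(self):
        self.root = Node()

    def match_prefix(self, tokens):
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        tokens = tuple(tokens)
        while matched < len(tokens):
            child = node.children.get(tokens[matched])
            if child is None:
                break
            n = _common_len(child.tokens, tokens[matched:])
            matched += n
            if n < len(child.tokens):
                break  # diverged in the middle of an edge
            node = child
        return matched

    def insert(self, tokens):
        """Insert a token sequence, splitting an edge where it diverges."""
        node, i = self.root, 0
        tokens = tuple(tokens)
        while i < len(tokens):
            child = node.children.get(tokens[i])
            if child is None:
                node.children[tokens[i]] = Node(tokens[i:])
                return
            n = _common_len(child.tokens, tokens[i:])
            if n < len(child.tokens):
                # Split the edge: keep the shared part, push the rest down.
                shared = Node(child.tokens[:n], {child.tokens[n]: child})
                child.tokens = child.tokens[n:]
                node.children[tokens[i]] = shared
                child = shared
            node, i = child, i + n


cache = PrefixCache()
system = [1, 2, 3, 4]  # made-up token ids standing in for a shared system prompt
cache.insert(system + [10, 11])
# A second request with the same system prompt only needs new KV for its suffix.
print(cache.match_prefix(system + [20, 21]))
```

In this sketch, the number returned by `match_prefix` is the count of tokens whose KV entries could be reused, so only the remaining suffix would need to be computed.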

Why Read mini-sglang Instead of SGLang Directly?

Gateway

app/gateway/discovery.js

app/gateway/apisix.yml

app/gateway/entrypoint.sh

🤗Huggingface Model Downloader

Note

(2025-01-08) Add feature for 🏷️Tag(Revision) Selection, contributed by @Bamboo-D.
(2024-12-17) Add feature for ⚡Quick Startup and ⏭️Fast Resume, enabling skipping of downloaded files, while removing the git clone dependency to accelerate file list retrieval.

Considering the lack of multi-threaded download support in the official huggingface-cli and the inadequate error handling in hf_transfer, this command-line tool leverages curl and aria2c for fast and robust downloading of models and datasets.

Features

⏯️ Resume from breakpoint: You can re-run it or Ctrl+C anytime.
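The resume and skip behavior described above can be sketched as a small decision function. This is only an illustration of the general idea, not the tool's implementation: the tool delegates the transfer to curl and aria2c, which support continuing a partial file themselves (for example `curl -C -` and `aria2c -c`). The helper name `plan_download` and its return values are hypothetical.

```python
import os


def plan_download(local_path, remote_size):
    """Decide how to fetch a file given what is already on disk.

    Returns ("skip", None), ("resume", range_header) or ("start", None).
    remote_size may be None when the server does not report a size.
    """
    have = os.path.getsize(local_path) if os.path.exists(local_path) else 0
    if remote_size is not None and have == remote_size:
        # The file is already complete, so it can be skipped on re-run.
        return ("skip", None)
    if 0 < have and (remote_size is None or have < remote_size):
        # Ask the server for the missing tail with an HTTP Range header.
        return ("resume", f"bytes={have}-")
    # Nothing on disk yet, or the local file is larger than expected.
    return ("start", None)
```

Under these assumptions, interrupting with Ctrl+C leaves a partial file behind, and the next run requests only the bytes from the current file size onward.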

	global
	    daemon
	    maxconn 10000

	defaults
	    mode tcp
	    timeout connect 5s
	    timeout client 1h
	    timeout server 1h

	[package]
	name = "chinese_search"
	version = "0.1.0"
	edition = "2024"

	[dependencies]
	jieba-rs = "0.9.0"
	tantivy = "0.26.0"
	tantivy-jieba = "0.19.0"

	{
	  "architectures": [
	    "Qwen3ForCausalLMEagle"
	  ],
	  "attention_bias": false,
	  "attention_dropout": 0.0,
	  "bos_token_id": 151643,
	  "eos_token_id": 151645,
	  "head_dim": 128,
	  "hidden_act": "silu",

	## Setup
	# conda create -n modular python=3.11
	# uv pip install modular --extra-index-url https://download.pytorch.org/whl/cpu --index-url https://dl.modular.com/public/nightly/python/simple/ --index-strategy unsafe-best-match --prerelease allow
	# conda install -c conda-forge gcc=12.1.0

	model_path = 'Qwen/Qwen2.5-0.5B'

	import time
	from max.entrypoints.llm import LLM
	from max.pipelines import PipelineConfig

	window.scrollTo(0, 0)
	var bodyRect = document.body.getBoundingClientRect();

	var items = Array.prototype.slice.call(
	    document.querySelectorAll('*')
	).map(function(element) {
	    var rect = element.getBoundingClientRect();
	    return {
	        element: element,
	        include: (element.tagName === "BUTTON" || element.tagName === "A" || (element.onclick != null) || window.getComputedStyle(element).cursor == "pointer"),

	import os
	import torch
	import psutil
	import datasets
	import glob

	from transformers import (
	    AutoTokenizer, LlamaConfig, LlamaForCausalLM, Trainer, TrainingArguments,
	    DataCollatorForLanguageModeling
	)

	FROM nvcr.io/nvidia/pytorch:23.11-py3
	WORKDIR /workspace
	RUN pip install -r r1.txt
	ADD requirements.txt r2.txt
	# FlashAttention-2 compatibility copied from https://github.com/Dao-AILab/flash-attention/issues/836#issuecomment-1951433985
	RUN pip install flash-attn==2.5.1.post1
	RUN apt update && apt install -y tmux git-lfs
	RUN pip install nvitop
	ADD . myproject
	WORKDIR /workspace/myproject