We train LLMs with the code and report the training speed under different settings (see the table below). We use a machine with 8× A800 GPUs, 1 TB of CPU memory, and 2× Intel 8358 CPUs. On the software side, we use CUDA 12.1, PyTorch 2.2.0, and DeepSpeed 0.14.2.
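For orientation, the settings varied in the table map onto a DeepSpeed config roughly as sketched below. This is a minimal illustration, not the exact benchmark configuration: the concrete values (bucket sizes, precision settings, ZeRO++ partition size, etc.) are assumptions, while the key names are standard DeepSpeed options.

```python
# Sketch of the knobs varied in the table; values are illustrative only.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,   # "BS" numerator (per device per iteration)
    "train_batch_size": 64,                # "BS" denominator (global batch per gradient descent step)
    "bf16": {"enabled": True},             # assumed precision, not stated in the benchmark
    "zero_optimization": {
        "stage": 3,                                                  # "Zero Stage" column
        "offload_optimizer": {"device": "cpu", "pin_memory": True},  # "Optim. Off." column
        "offload_param": {"device": "cpu", "pin_memory": True},      # "Param. Off." column
        # ZeRO++ options ("Zero++" column); see the tutorial linked in footnote 4
        "zero_quantized_weights": True,
        "zero_hpz_partition_size": 8,
        "zero_quantized_gradients": True,
    },
}

# "Ckpt." is toggled on the model side via Hugging Face gradient checkpointing:
# model.gradient_checkpointing_enable()
```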
Table. Benchmark of LLaMA-7B training with the DeepSpeed-based training code. The sequence length is 4096.
Zero Stage | Ckpt.¹ | Optim. Off.² | Param. Off.³ | Zero++⁴ | BS⁵ | CPU Mem. (GB)⁶ | GPU Mem. (GB)⁷ | Throughput |
---|---|---|---|---|---|---|---|---|
2 | × | × | × | × | 1/64 | 320.1 | 19.4/44.8 | 5.33 |
2 | √ | × | × | × | 1/64 | 320.0 | 19.4/23.5 | 4.19 |
2 | √ | √ | × | × | 1/64 | 361.3 | 13.4/16.9 | 1.81 |
2 | √ | × | × | × | 4/64 | 320.4 | 27.2/38.6 | 4.69 |
3 | × | × | × | × | 2/64 | 319.5 | 14.8/75.7 | 4.95 |
3 | √ | × | × | × | 2/64 | 319.6 | 14.8/20.4 | 4.45 |
3 | √ | √ | × | × | 2/64 | 387.4 | 3.8/9.4 | 2.05 |
3 | √ | √ | √ | × | 4/64 | 398.9 | 2.2/7.9 | 2.06 |
3 | √ | √ | √ | √ | 4/64 | 411.1 | 2.2/7.9 | 1.85 |
3 | √ | × | × | × | 8/64 | 319.6 | 17.7/39.1 | 4.73 |
3 | √ | × | × | × | 8/128 | 319.9 | 21.4/63.9 | 4.32 |
Footnotes
1. *Ckpt.* indicates whether HF gradient checkpointing is enabled for the model.
2. *Optim. Off.* indicates whether `offload_optimizer` is enabled in the `zero_optimization` config.
3. *Param. Off.* indicates whether `offload_param` is enabled in the `zero_optimization` config.
4. *Zero++* refers to the techniques described at https://www.deepspeed.ai/tutorials/zeropp/.
5. *BS* denotes `batch size per device per iteration` / `batch size for gradient descent`.
6. *CPU Mem.* denotes `psutil.virtual_memory().used`.
7. *GPU Mem.* denotes `torch.cuda.memory_allocated()` / `torch.cuda.max_memory_allocated()`.
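Footnotes 6 and 7 translate directly into a small measurement helper. The sketch below (the `report_memory` name and the GB conversion are ours; the table values are consistent with GB) shows how the CPU Mem. and GPU Mem. columns can be sampled during training.

```python
import psutil
import torch

def report_memory() -> str:
    """Sample the quantities behind the CPU Mem. and GPU Mem. columns."""
    # Footnote 6: host memory currently in use, converted to GB
    cpu_gb = psutil.virtual_memory().used / 1024**3
    # Footnote 7: currently allocated / peak allocated CUDA tensor memory, in GB
    alloc_gb = torch.cuda.memory_allocated() / 1024**3
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    return f"CPU Mem. {cpu_gb:.1f} GB | GPU Mem. {alloc_gb:.1f}/{peak_gb:.1f} GB"
```

Note that `torch.cuda.memory_allocated()` only counts memory held by tensors, so the actual GPU footprint (including caching-allocator reserves and CUDA context) is somewhat higher than the reported numbers.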