We train LLMs with the code and report the training speed under different settings (see the table below). We use a machine with 8× A800 GPUs, 1 TB of CPU memory, and 2× Intel 8358 CPUs. On the software side, we use CUDA 12.1, PyTorch 2.2.0, and DeepSpeed 0.14.2.
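For orientation, the settings varied in the table map onto a DeepSpeed config roughly as sketched below. This is a minimal illustration, not the exact benchmark configuration: the concrete values (bucket sizes, precision settings, ZeRO++ partition size, etc.) are assumptions, while the key names are standard DeepSpeed options.

```python
# Sketch of the knobs varied in the table; values are illustrative only.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,   # "BS" numerator (per device per iteration)
    "train_batch_size": 64,                # "BS" denominator (global batch per gradient descent step)
    "bf16": {"enabled": True},             # assumed precision, not stated in the benchmark
    "zero_optimization": {
        "stage": 3,                                                  # "Zero Stage" column
        "offload_optimizer": {"device": "cpu", "pin_memory": True},  # "Optim. Off." column
        "offload_param": {"device": "cpu", "pin_memory": True},      # "Param. Off." column
        # ZeRO++ options ("Zero++" column); see the tutorial linked in footnote 4
        "zero_quantized_weights": True,
        "zero_hpz_partition_size": 8,
        "zero_quantized_gradients": True,
    },
}

# "Ckpt." is toggled on the model side via Hugging Face gradient checkpointing:
# model.gradient_checkpointing_enable()
```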
Table. Benchmark of LLaMA-7B training with the DeepSpeed-based training code. The sequence length is 4096.
Zero Stage | Ckpt.¹ | Optim. Off.² | Param. Off.³ | Zero++⁴ | BS⁵ | CPU Mem. (GB)⁶ | GPU Mem. (GB)⁷ | Throughput |
---|---|---|---|---|---|---|---|---|
2 | × | × | × | × | 1/64 | 320.1 | 19.4/44.8 | 5.33 |
2 | √ | × | × | × | 1/64 | 320.0 | 19.4/23.5 | 4.19 |
2 | √ | √ | × | × | 1/64 | 361.3 | 13.4/16.9 | 1.81 |
2 | √ | × | × | × | 4/64 | 320.4 | 27.2/38.6 | 4.69 |
3 | × | × | × | × | 2/64 | 319.5 | 14.8/75.7 | 4.95 |
3 | √ | × | × | × | 2/64 | 319.6 | 14.8/20.4 | 4.45 |
3 | √ | √ | × | × | 2/64 | 387.4 | 3.8/9.4 | 2.05 |
3 | √ | √ | √ | × | 4/64 | 398.9 | 2.2/7.9 | 2.06 |
3 | √ | √ | √ | √ | 4/64 | 411.1 | 2.2/7.9 | 1.85 |
3 | √ | × | × | × | 8/64 | 319.6 | 17.7/39.1 | 4.73 |
3 | √ | × | × | × | 8/128 | 319.9 | 21.4/63.9 | 4.32 |
Footnotes
1. *Ckpt.* indicates whether HF gradient checkpointing is enabled for the model.
2. *Optim. Off.* indicates whether `offload_optimizer` is enabled in the `zero_optimization` config.
3. *Param. Off.* indicates whether `offload_param` is enabled in the `zero_optimization` config.
4. *Zero++* refers to the techniques described at https://www.deepspeed.ai/tutorials/zeropp/.
5. *BS* denotes `batch size per device per iteration` / `batch size for gradient descent`.
6. *CPU Mem.* denotes `psutil.virtual_memory().used`.
7. *GPU Mem.* denotes `torch.cuda.memory_allocated()` / `torch.cuda.max_memory_allocated()`.
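Footnotes 6 and 7 translate directly into a small measurement helper. The sketch below (the `report_memory` name and the GB conversion are ours; the table values are consistent with GB) shows how the CPU Mem. and GPU Mem. columns can be sampled during training.

```python
import psutil
import torch

def report_memory() -> str:
    """Sample the quantities behind the CPU Mem. and GPU Mem. columns."""
    # Footnote 6: host memory currently in use, converted to GB
    cpu_gb = psutil.virtual_memory().used / 1024**3
    # Footnote 7: currently allocated / peak allocated CUDA tensor memory, in GB
    alloc_gb = torch.cuda.memory_allocated() / 1024**3
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    return f"CPU Mem. {cpu_gb:.1f} GB | GPU Mem. {alloc_gb:.1f}/{peak_gb:.1f} GB"
```

Note that `torch.cuda.memory_allocated()` only counts memory held by tensors, so the actual GPU footprint (including caching-allocator reserves and CUDA context) is somewhat higher than the reported numbers.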