We train LLMs with this code and report the training speed under different settings (see the table below). The machine has 8× A800 GPUs, 1 TB of CPU memory, and 2× Intel Xeon Platinum 8358 CPUs. On the software side, we use CUDA 12.1, PyTorch 2.2.0, and DeepSpeed 0.14.2.
Table. Benchmark of LLaMA-7B training using the DeepSpeed-based training code. The sequence length is 4096.
ZeRO Stage | Ckpt.[^1] | Optim. Off.[^2] | Param. Off.[^3] | ZeRO++[^4] | BS[^5] | CPU Mem. (GB)[^6] | GPU Mem. (GB)[^7] | Throughput |
---|---|---|---|---|---|---|---|---|
2 | × | × | × | × | 1/64 | 320.1 | 19.4/44.8 | 5.33 |
2 | ✓ | × | × | × | 1/64 | 320.0 | 19.4/23.5 | 4.19 |
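
For context, the toggles in the table correspond to standard DeepSpeed configuration keys. Below is a minimal sketch of how one row could be expressed, assuming a Hugging Face Transformers model; the model ID and batch-size values are illustrative, not the exact benchmark script.

```python
import deepspeed
import torch
from transformers import AutoModelForCausalLM

# Illustrative DeepSpeed config: the zero_optimization keys are standard
# DeepSpeed options; the batch-size values are assumptions, not necessarily
# the exact settings behind the "BS 1/64" column.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 64,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,                                # "ZeRO Stage" column
        # Uncomment to enable the corresponding table columns:
        # "offload_optimizer": {"device": "cpu"},  # "Optim. Off."
        # "offload_param": {"device": "cpu"},      # "Param. Off." (stage 3 only)
        # "zero_quantized_weights": True,          # ZeRO++ (qwZ)
        # "zero_quantized_gradients": True,        # ZeRO++ (qgZ)
        # "zero_hpz_partition_size": 8,            # ZeRO++ (hpZ)
    },
}

# Hypothetical model ID standing in for the LLaMA-7B checkpoint used above.
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b", torch_dtype=torch.bfloat16
)
# "Ckpt." column: activation checkpointing is enabled on the model side,
# not in the DeepSpeed config.
model.gradient_checkpointing_enable()

engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```

Note that `offload_param` only takes effect under ZeRO stage 3, which is why the stage-2 rows above combine optimizer offload settings with on-GPU parameters.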