- A simple note on how to start multi-node training with PyTorch on a SLURM scheduler.
- Especially useful when the scheduler is so busy that you cannot get multiple GPUs allocated on one node, or when you need more than 4 GPUs for a single job.
- Requirement: your training code must use PyTorch DistributedDataParallel (DDP); a minimal setup sketch follows this list.
- Warning: you might need to refactor your own code.
- Warning: your colleagues might secretly condemn you for using too many GPUs.
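A minimal sketch of the DDP initialization, assuming the job is launched with srun (so that SLURM_PROCID, SLURM_NTASKS, and SLURM_LOCALID are set for each task) and that MASTER_ADDR and MASTER_PORT were exported in the sbatch script; the toy Linear model is a placeholder for your own:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Map SLURM's per-task environment variables onto the values
# torch.distributed expects; srun sets these for every process.
rank = int(os.environ["SLURM_PROCID"])         # global rank across all nodes
world_size = int(os.environ["SLURM_NTASKS"])   # total number of processes
local_rank = int(os.environ["SLURM_LOCALID"])  # rank within this node

# The default env:// init method reads MASTER_ADDR and MASTER_PORT,
# which we assume point at the first node of the allocation.
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
torch.cuda.set_device(local_rank)

# Wrap the model in DDP; gradients are all-reduced across processes
# during backward, so the training loop itself stays single-GPU-style.
model = torch.nn.Linear(10, 10).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])
```

Every one of the (nodes × GPUs-per-node) processes runs this same script; only the rank values differ between them.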
Create file /etc/systemd/system/docker-compose@.service
systemd calls binaries using an absolute path; in my case they are prefixed by /usr/local/bin, but you should use the paths specific to your environment.
```ini
[Unit]
Description=%i service with docker compose
PartOf=docker.service
After=docker.service
```
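Only the [Unit] section appears above. A sketch of the remaining sections, assuming each instance's docker-compose.yml lives in a per-instance directory such as /etc/docker/compose/%i (the directory, flags, and binary path are assumptions to adapt):

```ini
[Service]
# Start the containers once and keep the unit "active" while they run;
# %i expands to the instance name after the @ in the unit name.
Type=oneshot
RemainAfterExit=true
WorkingDirectory=/etc/docker/compose/%i
ExecStart=/usr/local/bin/docker-compose up -d
ExecStop=/usr/local/bin/docker-compose down

[Install]
WantedBy=multi-user.target
```

After a `systemctl daemon-reload`, an instance named myapp can then be managed with `systemctl start docker-compose@myapp`.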
A personal diary of DataFrame munging over the years.
Convert a Series datatype to numeric (this will raise an error if the column contains non-numeric values)
(h/t @makmanalp)
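A minimal sketch, assuming the conversion referred to is pd.to_numeric, whose default errors="raise" behavior matches the note:

```python
import pandas as pd

s = pd.Series(["1", "2", "3.5"])
# errors="raise" is the default: any value that cannot be parsed as a
# number raises a ValueError instead of being silently coerced to NaN.
s = pd.to_numeric(s)
print(s.dtype)  # float64
```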