- Don't use CPUs to train. This will drastically slow down everyone else's work on the machine. Almost always run your experiments on GPUs, barring a few edge cases.
- Before you start a run on a GPU that someone else is already using, first do a trial run on an empty GPU to check how much GPU memory your experiment will occupy. If it exceeds the memory available on the shared GPU (use `gpustat --watch` to check usage), PLEASE DON'T RUN ON IT. This will crash the other person's runs.
- If you have processes occupying GPU memory without any utilization, kill them.
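If you want to script the free-memory check instead of eyeballing `gpustat`, a small helper like the one below can parse `nvidia-smi` query output (the query fields are standard `nvidia-smi` options; the helper itself is hypothetical, not part of any lab tooling):

```python
# Hypothetical helper: pick the GPU with the most free memory from the output of
#   nvidia-smi --query-gpu=index,memory.free --format=csv,noheader,nounits
# (in practice, capture that output with subprocess.run(...).stdout).

def pick_freest_gpu(csv_text):
    """Return (gpu_index, free_MiB) for the GPU with the most free memory."""
    best_idx, best_free = -1, -1
    for line in csv_text.strip().splitlines():
        idx, free = (int(x) for x in line.split(","))
        if free > best_free:
            best_idx, best_free = idx, free
    return best_idx, best_free

# Example with captured output:
sample = "0, 1024\n1, 20480\n2, 512"
assert pick_freest_gpu(sample) == (1, 20480)
```

You can then compare the reported free memory against the footprint you measured in your trial run before launching on a shared GPU.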
- But I need a GPU ASAP, what should I do?
- 1st way: Contact the person using that GPU via Google Chat/Slack; they will almost always free up some space for you if they are not running against a deadline. I generally use `gpustat --watch` to find the name/roll number and use the IITH mail ID / Slack to contact them.
- 2nd way (the best way): Halve your batch size and use gradient accumulation (https://gist.github.com/thomwolf/ac7a7da6b1888c2eeac8ac8b9b05d3d3).
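The reason the 2nd way works: for a mean-style loss, the full-batch gradient equals the average of the micro-batch gradients, so accumulating over k micro-batches of size B/k reproduces a batch-size-B update while holding only B/k samples in memory. A minimal sketch with a toy scalar model (not the PyTorch recipe from the gist, just the underlying arithmetic):

```python
# Toy 1-D least-squares model, purely to illustrate the accumulation identity.

def grad(xs, ys, w):
    # Gradient of 0.5 * mean((w*x - y)^2) with respect to scalar w.
    n = len(xs)
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / n

xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [1.0, 2.1, 2.9, 4.2, 5.1, 5.8, 7.2, 8.1]
w = 0.3

full = grad(xs, ys, w)                                     # batch size 8

# Two micro-batches of half the size: accumulate, then average.
acc = (grad(xs[:4], ys[:4], w) + grad(xs[4:], ys[4:], w)) / 2

assert abs(full - acc) < 1e-12  # same gradient, half the peak memory
```

In a real training loop you scale each micro-batch loss by 1/k, call `backward()` per micro-batch, and step the optimizer once every k micro-batches, as shown in the gist linked above.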
- Once in a while, run `ncdu` to find your unused large files and delete them. Nothing is worse than starting a run before sleep and waking up to realize it crashed because the disk was full.
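If `ncdu` is not installed on the machine, a rough fallback with standard tools (GNU coreutils `du`/`sort` assumed) is:

```shell
# List the ten largest entries in the current directory, biggest last.
du -sh ./* 2>/dev/null | sort -h | tail -n 10
```

This is a one-shot listing rather than an interactive browser, but it is usually enough to spot the checkpoint directory that ate the disk.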
- When using `ncdu`, you can press `d` to delete the selected file/directory directly from within `ncdu`.
- When `ncdu` shows that your conda directory is occupying the most space, run `conda clean --all`.
- If you are using the account of a PhD student or someone else, make a directory under your name first and do everything inside it, so that it's easy for them to identify your files.
For general training-related pointers, check https://gist.github.com/rahulvigneswaran/8b5e6ecd2cae9698e360dbf6d6fc7ed3