- If you have limited GPU memory, always lazy-load the data: instead of loading the entire dataset's images at once, load images batch by batch (a minimal `Dataset` sketch follows below).
    - Check https://discuss.pytorch.org/t/loading-huge-data-functionality/346/3 for an example.
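A minimal sketch of a lazily-loading `Dataset`, assuming a flat folder of images (the directory layout and file extensions here are hypothetical):

```python
import os
from PIL import Image
from torch.utils.data import Dataset

class LazyImageDataset(Dataset):
    """Keeps only file paths in memory; each image is read from disk on demand."""

    def __init__(self, root_dir, transform=None):
        self.paths = sorted(
            os.path.join(root_dir, f)
            for f in os.listdir(root_dir)
            if f.lower().endswith((".jpg", ".jpeg", ".png"))
        )
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # The image is loaded only when the DataLoader asks for this sample.
        img = Image.open(self.paths[idx]).convert("RGB")
        return self.transform(img) if self.transform else img

# Usage: DataLoader(LazyImageDataset("data/train", transform=...), batch_size=32)
# then pulls only one batch's worth of images from disk at a time.
```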
- Batch size plays an important role in training, so if your GPU memory can't fit the entire batch, do gradient accumulation: call `optimizer.step()` and `model.zero_grad()` only once every few steps (see the sketch below).
    - Check this gist: https://gist.github.com/thomwolf/ac7a7da6b1888c2eeac8ac8b9b05d3d3.
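A minimal gradient-accumulation sketch; the model, data, and hyperparameters below are dummy placeholders:

```python
import torch
from torch import nn

model = nn.Linear(10, 2)                                   # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
loader = [(torch.randn(8, 10), torch.randint(0, 2, (8,)))  # dummy batches of size 8
          for _ in range(16)]

accumulation_steps = 4  # effective batch size = 8 * 4 = 32

model.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    # Scale the loss so the accumulated gradient matches a full-batch update.
    loss = criterion(model(inputs), targets) / accumulation_steps
    loss.backward()                      # gradients accumulate in .grad across iterations
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                 # weight update once every accumulation_steps
        model.zero_grad()                # reset gradients for the next window
```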
- If your GPU is the bottleneck and you decide to go with `torch.nn.DataParallel`, there is a catch regarding the batch size: the batch you pass in is split across the GPUs, so each GPU only sees roughly `batch_size / num_gpus` samples.
    - Look at this thread for more: https://twitter.com/somethingmyname/status/1400042667543654402?s=20
- Whenever possible, use the following (a mixed-precision sketch follows this list):
    - DDP (`DistributedDataParallel`) instead of DP (`DataParallel`)
    - Mixed precision
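A minimal mixed-precision training sketch using `torch.cuda.amp`, assuming a CUDA device is available; the model, data, and hyperparameters are dummy placeholders:

```python
import torch
from torch import nn

device = "cuda"
model = nn.Linear(10, 2).to(device)            # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()           # scales the loss to avoid fp16 underflow

for _ in range(10):                            # dummy training loop
    inputs = torch.randn(8, 10, device=device)
    targets = torch.randint(0, 2, (8,), device=device)

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():            # forward pass runs in mixed precision
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()              # backward on the scaled loss
    scaler.step(optimizer)                     # unscales gradients, then optimizer.step()
    scaler.update()
```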
- Useful utilities/commands: `gpustat --watch`, `glances`
- If you are running multiple experiments but have a limited number of cores, use `taskset --cpu-list <starting_thread>-<ending_thread> python <your_code>.py`. This makes sure each run uses only the allotted threads from `<starting_thread>` to `<ending_thread>` and prevents constant reallocation of CPU threads as the runs fight over them. Note that this helps only if everyone on the server respects the core allotment.
- More `num_workers` doesn't lead to a faster dataloader. In fact, in most cases a higher `num_workers` will make the dataloader slower. As far as I know there is no rule of thumb, but there is a sweet spot that is mostly found through trial and error (a small timing sketch follows below).
    - Check this thread for more: https://discuss.pytorch.org/t/guidelines-for-assigning-num-workers-to-dataloader/813
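A quick-and-dirty way to find the sweet spot is to time one pass of pure data loading for a few values of `num_workers`; the in-memory dataset below is a stand-in for your own:

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(2_000, 3, 64, 64))    # stand-in for your real dataset

for num_workers in [0, 2, 4, 8, 16]:
    loader = DataLoader(dataset, batch_size=64, num_workers=num_workers, shuffle=True)
    start = time.time()
    for _ in loader:          # iterate once without training to isolate the loading cost
        pass
    print(f"num_workers={num_workers}: {time.time() - start:.2f}s")
```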
- Useful utilities/commands: `htop`, `glances`
- Make sure you are running (read, log, train) on SSD. HDD causes I/O bottlenecks which are hard to get over even if you sell your soul to satan.
    - Check with `lsblk -o NAME,MOUNTPOINT,MODEL,ROTA,SIZE`. `ROTA == 0` means the drive is an SSD.
- Instead of loading your data from SSD or HDD, you can move it directly into RAM. `/dev/shm/` is a RAM-backed directory: first check that you have enough free RAM, then copy the entire dataset there and make your dataloader read from `/dev/shm` directly (a minimal sketch follows the commands below).
    - Useful utilities/commands: `ncdu`, `df -h`
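A minimal sketch of staging a dataset in `/dev/shm` and loading from it; the source path and the use of `torchvision.datasets.ImageFolder` are assumptions, adapt them to your own layout:

```python
import shutil
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

src = "/data/my_dataset/train"        # hypothetical on-disk dataset (one folder per class)
dst = "/dev/shm/my_dataset/train"     # RAM-backed copy; check free space with `df -h /dev/shm` first

shutil.copytree(src, dst, dirs_exist_ok=True)   # one-time copy into RAM (Python 3.8+)

# The dataloader now reads from RAM instead of the disk.
train_set = datasets.ImageFolder(dst, transform=transforms.ToTensor())
train_loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)
```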
- If you have implemented a new type of loss, do an overfit test first. Instead of running experiments on the entire dataset, overfit a single batch: you should be able to reach the loss's lower bound and 100% train accuracy, else something is wrong with the implementation (a minimal sketch follows below).
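A minimal single-batch overfit sketch; the model and data are dummy placeholders, and `criterion` is where your custom loss would go:

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))  # placeholder model
criterion = nn.CrossEntropyLoss()          # swap in your custom loss here
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# A single fixed batch, reused for every step.
inputs = torch.randn(16, 10)
targets = torch.randint(0, 2, (16,))

for step in range(500):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()

    if step % 100 == 0:
        acc = (outputs.argmax(dim=1) == targets).float().mean().item()
        print(f"step {step}: loss={loss.item():.4f} acc={acc:.2f}")
# By the end, the loss should be near its lower bound and accuracy ~100%.
```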
- `nan`-related issues:
    - If you have some custom loss and don't know where the `nan` is coming from, use anomaly detection: https://pytorch.org/docs/master/autograd.html#anomaly-detection (a minimal sketch follows below).
    - Ways to tackle `nan` and related issues: https://youtu.be/XlYD8jn1ayE?list=PLoWh1paHYVRfygApBdss1HCt-TFZRXs0k&t=1616
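A minimal anomaly-detection sketch: the `0/0` below deliberately produces a `nan`, so `backward()` raises an error whose traceback points at the forward operation that caused it:

```python
import torch

torch.autograd.set_detect_anomaly(True)   # record forward traces so backward can name the culprit op

x = torch.zeros(4, requires_grad=True)
y = x / x                                 # 0/0 -> nan in the forward pass (on purpose)
loss = y.sum()
loss.backward()                           # raises a RuntimeError naming the offending backward function
```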
- If you want to know whether that idea that's keeping you awake at night works, but ImageNet takes too much time, then check these datasets:
    - Tiny ImageNet: 200-class version of ImageNet - https://www.kaggle.com/c/tiny-imagenet
    - mini-ImageNet: 100-class version of ImageNet - https://github.com/yaoyao-liu/mini-imagenet-tools
        - Note that the above version is for few-shot learning. You can convert it into normal classification using this -
    - Imagenette: easier 10-class version of ImageNet - https://github.com/fastai/imagenette
    - Noisy Imagenette: Imagenette with noisy labels - https://github.com/fastai/imagenette/tree/master/noisy_labels
    - Imagewoof: relatively hard counterpart of Imagenette - https://github.com/fastai/imagenette#imagewoof
    - Noisy Imagewoof: Imagewoof with noisy labels - https://github.com/fastai/imagenette/tree/master/noisy_labels
    - Imagewang: semi-supervised version that combines Imagenette and Imagewoof - https://github.com/fastai/imagenette#image%E7%BD%91
- Implement resume functionality ASAP. Trust me, this will prevent crying yourself to sleep at night (a minimal checkpointing sketch follows below).
- If you are using wandb, then you can even resume the logging. Feels like magic. Check this thread https://twitter.com/somethingmyname/status/1400237720413171713?s=20 for more.
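A minimal save/resume sketch with `torch.save`/`torch.load`; the checkpoint path, the model, and what you store are placeholders, adapt them to your setup:

```python
import os
import torch
from torch import nn

model = nn.Linear(10, 2)                  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
ckpt_path = "checkpoint.pt"

# Resume if a checkpoint exists, otherwise start from scratch.
start_epoch = 0
if os.path.exists(ckpt_path):
    ckpt = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    start_epoch = ckpt["epoch"] + 1

for epoch in range(start_epoch, 100):
    # ... one epoch of training goes here ...
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "epoch": epoch},
        ckpt_path,
    )
```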
- If you come from the happy land of CIFARs and MNISTs like me, don't stare at the runs. It will take days to weeks. Get a hobby or it's finally time to open that "Interesting Papers" folder.
- As a final note, make sure to avoid everything in this thread https://twitter.com/karpathy/status/1013244313327681536?s=20 by Karpathy.