- If you have limited GPU memory, always lazy load the data. Instead of loading the images of the entire dataset at the same time, load the images batch by batch.
  - Check https://discuss.pytorch.org/t/loading-huge-data-functionality/346/3 for an example.
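A minimal sketch of lazy loading, assuming every sample was saved beforehand as its own `(image, label)` tensor file with `torch.save`. The class name `LazyTensorDataset`, the helper `make_loader`, and the `root/*.pt` layout are hypothetical and not taken from the linked thread. Only the file paths stay in memory; a sample is read from disk when the `DataLoader` asks for it.

```python
import glob
import os

import torch
from torch.utils.data import DataLoader, Dataset


class LazyTensorDataset(Dataset):
    """Keeps only file paths in memory; reads one sample per __getitem__ call."""

    def __init__(self, root):
        # Assumed layout: root/*.pt, each file holding an (image, label) tuple.
        self.paths = sorted(glob.glob(os.path.join(root, "*.pt")))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        # The file is read here, not in __init__, so nothing is preloaded.
        image, label = torch.load(self.paths[index])
        return image, label


def make_loader(root, batch_size=32):
    # The DataLoader requests samples batch by batch, so the full dataset
    # is never materialized at once.
    return DataLoader(LazyTensorDataset(root), batch_size=batch_size, shuffle=True)
```

The same pattern should work for image files: replace `torch.load` with your image decoder of choice.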
- Batch size plays an important role in training, so if you have limited GPU memory and can't fit the entire batch in it, use gradient accumulation. With this, you call `optimizer.step()` only once every few steps.
  - Check this link: https://gist.github.com/thomwolf/ac7a7da6b1888c2eeac8ac8b9b05d3d3.
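Gradient accumulation can be sketched roughly like this. The function name `train_with_accumulation` and the default `accumulation_steps=4` are made up for illustration; see the gist above for the original reference.

```python
import torch
from torch import nn


def train_with_accumulation(model, optimizer, loss_fn, batches, accumulation_steps=4):
    """Steps the optimizer once every `accumulation_steps` micro-batches."""
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(batches, start=1):
        loss = loss_fn(model(inputs), targets)
        # Scale so the summed gradients match the mean over the large batch
        # (exact when all micro-batches have the same size).
        (loss / accumulation_steps).backward()
        if step % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
    # Note: trailing micro-batches that do not fill a full group
    # are accumulated but never stepped in this sketch.
```

With a mean-reduced loss and equally sized micro-batches, one accumulated step should match one step on the full batch. Layers that depend on batch statistics (for example batch norm) still see the small micro-batch.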
- If you have GPU bottleneck and decide to go with
, there is a catch regarding the batch size.
  - Look at this thread https://twitter.com/somethingmyname/status/1400042667543654402?s=20 for more.
- Whenever possible, use the following:
  - DDP (`DistributedDataParallel`) instead of DP (`DataParallel`)
  - Mixed precision
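For the mixed precision part, a single training step could look roughly like the sketch below. The function name `amp_train_step` is hypothetical. As a simplifying assumption, autocast and loss scaling are switched on only when the inputs live on a CUDA device, so the same code also runs (in full precision) on CPU.

```python
import torch
from torch import nn


def amp_train_step(model, optimizer, loss_fn, inputs, targets, scaler):
    """One optimizer step with autocast; reduced precision only on CUDA."""
    use_cuda = inputs.is_cuda
    optimizer.zero_grad()
    # autocast runs eligible ops in reduced precision; disabled on CPU here.
    with torch.autocast(device_type=inputs.device.type, enabled=use_cuda):
        loss = loss_fn(model(inputs), targets)
    # GradScaler scales the loss to avoid fp16 gradient underflow.
    # When it was constructed with enabled=False, these calls pass through.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```

Create the scaler once, outside the training loop, for example with `torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())`. Newer PyTorch releases may warn that this constructor is deprecated in favor of `torch.amp.GradScaler`.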
- Useful utilities/commands:
  - `gpustat --watch`
- If you are running multiple experiments but have a limited number of cores, use `taskset --cpu-list <starting_thread>-<ending thread number> <your_code>.py`. This makes sure each run uses only the allotted threads from `<starting_thread>` to `<ending thread number>` and prevents the constant reallocation of CPU threads as the runs fight for them. Note that this helps only if everyone on the server respects the core allotment.
- More `num_workers` doesn't always lead to a faster data loader. In fact, past a certain point, a higher `num_workers` often makes the data loader slower. As far as I know, there is no rule of thumb, but there is a sweet spot that is mostly found through trial and error.
  - Check this thread https://discuss.pytorch.org/t/guidelines-for-assigning-num-workers-to-dataloader/813 for more.
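If you would rather pin the cores from inside the script than through `taskset`, Python exposes the same mechanism on Linux. This is only a sketch: the helper name `pin_to_cpus` is hypothetical, and `os.sched_setaffinity` is not available on macOS or Windows.

```python
import os


def pin_to_cpus(first_cpu, last_cpu):
    """Restricts the current process to CPUs first_cpu..last_cpu (inclusive)."""
    requested = set(range(first_cpu, last_cpu + 1))
    # Only keep CPUs this process is currently allowed to use.
    allowed = requested & os.sched_getaffinity(0)
    if not allowed:
        raise ValueError(f"none of CPUs {first_cpu}-{last_cpu} are available")
    os.sched_setaffinity(0, allowed)  # pid 0 means "this process"
    return os.sched_getaffinity(0)
```

Call it at the very top of the training script, before any data loader workers are started, since child processes inherit the affinity of their parent.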
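Since the `num_workers` sweet spot is found by trial and error, a small timing sweep is one way to look for it. The sketch below assumes one pass over the dataset is representative; the function name `time_num_workers` and the candidate values are arbitrary.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def time_num_workers(dataset, candidates=(0, 1, 2, 4), batch_size=64):
    """Returns {num_workers: seconds for one full pass over the dataset}."""
    timings = {}
    for num_workers in candidates:
        loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
        start = time.perf_counter()
        for _ in loader:
            pass  # only the loading cost is measured
        timings[num_workers] = time.perf_counter() - start
    return timings
```

Run the sweep on your real dataset and transforms, inside an `if __name__ == "__main__":` guard, because worker processes may re-import the main module. The numbers also include worker startup time, so very short passes will favor `num_workers=0`.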
- Make sure you are running (read, log, train) on an SSD. HDDs cause I/O bottlenecks that are hard to get over even if you sell your soul to Satan.
  - Check the drive's `ROTA` flag: `ROTA == 0` means the drive is an SSD.
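On Linux, the rotational flag is also exposed in sysfs at `/sys/block/<device>/queue/rotational`, which is what the `ROTA` column of `lsblk` reports. A minimal sketch for reading it from Python follows; the function name `rotational_flags` is hypothetical, and virtual or loop devices will show up in the result too.

```python
from pathlib import Path


def rotational_flags(sys_block="/sys/block"):
    """Returns {device_name: True if rotational (HDD), False if not (SSD)}."""
    flags = {}
    for flag_file in sorted(Path(sys_block).glob("*/queue/rotational")):
        # Path looks like <sys_block>/<device>/queue/rotational
        device = flag_file.parent.parent.name
        flags[device] = flag_file.read_text().strip() == "1"
    return flags
```

Inside virtual machines and on some RAID controllers the flag may not reflect the real hardware, so treat the result as a hint.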
- Instead of loading your data from an SSD or HDD, you can move it directly to RAM. `/dev/shm` is the RAM directory. First check whether you have enough RAM, then move the entire dataset there and make your dataloader load from `/dev/shm` directly.
- Useful utilities/commands:
  - `df -h`
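The "check the size, then copy" routine for `/dev/shm` can be scripted. In this sketch the helper name `stage_in_ram` and the 10% safety margin are assumptions for illustration, not a recommendation from the original tips.

```python
import shutil
from pathlib import Path


def stage_in_ram(dataset_dir, ram_dir="/dev/shm", safety_margin=0.1):
    """Copies dataset_dir under ram_dir if it fits; returns the new path."""
    source = Path(dataset_dir)
    size = sum(f.stat().st_size for f in source.rglob("*") if f.is_file())
    free = shutil.disk_usage(ram_dir).free
    # Leave some headroom: /dev/shm competes with the training process for RAM.
    if size > free * (1 - safety_margin):
        raise RuntimeError(f"dataset needs {size} bytes, only {free} free in {ram_dir}")
    target = Path(ram_dir) / source.name
    shutil.copytree(source, target, dirs_exist_ok=True)
    return target
```

Point your dataset class at the returned path. Remember to delete the copy when the experiments are done, because files in `/dev/shm` keep occupying RAM until they are removed or the machine reboots.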
- If you have implemented a new type of loss, do an overfit test first. Instead of running the experiments on the entire dataset, overfit a single batch. You should reach the lower bound of the implemented loss and 100% train accuracy; otherwise something is wrong with the implementation.
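A rough sketch of such an overfit test for a classification loss. The function name `overfit_one_batch`, the Adam optimizer, and the default step count and learning rate are arbitrary choices for illustration.

```python
import torch
from torch import nn


def overfit_one_batch(model, loss_fn, inputs, targets, steps=500, lr=1e-2):
    """Trains on a single fixed batch; returns (final_loss, train_accuracy)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
    model.eval()
    with torch.no_grad():
        logits = model(inputs)
        final_loss = loss_fn(logits, targets).item()
        # Fraction of samples in the batch whose top class matches the label.
        accuracy = (logits.argmax(dim=1) == targets).float().mean().item()
    return final_loss, accuracy
```

Pass your own loss as `loss_fn`. If the accuracy stays below 1.0 or the loss plateaus well above its known minimum, suspect the loss implementation before anything else. Turn off augmentation and dropout for this test so the batch really is fixed.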
- NaN-related issues:
  - If you have a custom loss and don't know where the NaN is coming from, use https://pytorch.org/docs/master/autograd.html#anomaly-detection.
  - Ways to tackle NaN and related issues: https://youtu.be/XlYD8jn1ayE?list=PLoWh1paHYVRfygApBdss1HCt-TFZRXs0k&t=1616
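Here is a small sketch of anomaly detection in action, using a toy loss that is deliberately built to produce NaN in the backward pass. The wrapper name `backward_with_anomaly_check` is hypothetical; `torch.autograd.detect_anomaly` is the PyTorch context manager the link above describes.

```python
import torch


def backward_with_anomaly_check(loss_builder):
    """Runs forward and backward under anomaly detection.

    Returns None on success, or the error message naming the backward
    function that produced NaN.
    """
    try:
        with torch.autograd.detect_anomaly():
            loss_builder().backward()
    except RuntimeError as error:
        return str(error)
    return None


x = torch.tensor([0.0], requires_grad=True)
# The forward value is a harmless 0, but the backward pass of sqrt
# evaluates 0 / 0 at x = 0, which is NaN.
message = backward_with_anomaly_check(lambda: (torch.sqrt(x) * 0).sum())
print(message)
```

Anomaly detection slows training down noticeably, so enable it only while debugging.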
- If you want to know whether that idea that's keeping you awake at night works but ImageNet takes too much time, then check these datasets:
  - Tiny ImageNet : 200 class version of ImageNet - https://www.kaggle.com/c/tiny-imagenet
  - mini ImageNet : 100 class version of ImageNet - https://github.com/yaoyao-liu/mini-imagenet-tools
    - Note that the above version is for few-shot learning. You can convert it into normal classification using this -
  - Imagenette : Easier 10 class version of ImageNet - https://github.com/fastai/imagenette
    - Noisy Imagenette: https://github.com/fastai/imagenette/tree/master/noisy_labels
  - Imagewoof: Relatively hard version of Imagenette - https://github.com/fastai/imagenette#imagewoof
    - Noisy Imagewoof: https://github.com/fastai/imagenette/tree/master/noisy_labels
  - Imagewang: Semi-supervised version which combines both Imagenette and Imagewoof - https://github.com/fastai/imagenette#image%E7%BD%91
- Implement resume functionality ASAP. Trust me, this will prevent crying yourself to sleep at night.
  - If you are using wandb, then you can even resume the logging. Feels like magic. Check this thread https://twitter.com/somethingmyname/status/1400237720413171713?s=20 for more.
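A bare-bones sketch of resume functionality, assuming one checkpoint per epoch that holds only the model and optimizer state. The function names `save_checkpoint` and `load_checkpoint` and the dictionary keys are made up; a real setup would usually also store the scheduler, the AMP scaler, and RNG states.

```python
import os

import torch
from torch import nn


def save_checkpoint(path, model, optimizer, epoch):
    """Writes everything needed to continue training after `epoch`."""
    state = {
        "epoch": epoch,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }
    tmp_path = path + ".tmp"
    torch.save(state, tmp_path)
    # Rename in one step so a crash mid-save cannot corrupt the old checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path, model, optimizer):
    """Restores state in place; returns the epoch to start from (0 if none)."""
    if not os.path.exists(path):
        return 0
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1
```

In the training script, call `start_epoch = load_checkpoint(...)` once before the loop, iterate over `range(start_epoch, num_epochs)`, and call `save_checkpoint` at the end of every epoch.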
- If you come from the happy land of CIFARs and MNISTs like me, don't stare at the runs. They will take days to weeks. Get a hobby, or finally open that "Interesting Papers" folder.
- As a final note, make sure to avoid everything in this thread https://twitter.com/karpathy/status/1013244313327681536?s=20 by Karpathy.