- If you have limited GPU memory, always lazy-load the data: instead of loading the entire dataset's images at once, load images batch by batch (a minimal `Dataset` sketch follows below).
    - Check https://discuss.pytorch.org/t/loading-huge-data-functionality/346/3 for an example.
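A minimal sketch of a lazily-loading `Dataset`, assuming a flat folder of images (the directory layout and file extensions here are hypothetical):

```python
import os
from PIL import Image
from torch.utils.data import Dataset

class LazyImageDataset(Dataset):
    """Keeps only file paths in memory; each image is read from disk on demand."""

    def __init__(self, root_dir, transform=None):
        self.paths = sorted(
            os.path.join(root_dir, f)
            for f in os.listdir(root_dir)
            if f.lower().endswith((".jpg", ".jpeg", ".png"))
        )
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # The image is loaded only when the DataLoader asks for this sample.
        img = Image.open(self.paths[idx]).convert("RGB")
        return self.transform(img) if self.transform else img

# Usage: DataLoader(LazyImageDataset("data/train", transform=...), batch_size=32)
# then pulls only one batch's worth of images from disk at a time.
```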
- Batch size plays an important role in training, so if your GPU memory can't fit the entire batch, do gradient accumulation: call `optimizer.step()` and `model.zero_grad()` only once every few steps (see the sketch below).
    - Check this gist: https://gist.github.com/thomwolf/ac7a7da6b1888c2eeac8ac8b9b05d3d3.
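A minimal gradient-accumulation sketch; the model, data, and hyperparameters below are dummy placeholders:

```python
import torch
from torch import nn

model = nn.Linear(10, 2)                                   # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
loader = [(torch.randn(8, 10), torch.randint(0, 2, (8,)))  # dummy batches of size 8
          for _ in range(16)]

accumulation_steps = 4  # effective batch size = 8 * 4 = 32

model.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    # Scale the loss so the accumulated gradient matches a full-batch update.
    loss = criterion(model(inputs), targets) / accumulation_steps
    loss.backward()                      # gradients accumulate in .grad across iterations
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                 # weight update once every accumulation_steps
        model.zero_grad()                # reset gradients for the next window
```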
- If your GPU is the bottleneck and you decide to go with `torch.nn.DataParallel`, there is a catch regarding the batch size: the batch you pass in is split across the GPUs, so each GPU only sees roughly `batch_size / num_gpus` samples.
    - Look at this thread for more: https://twitter.com/somethingmyname/status/1400042667543654402?s=20
- Whenever possible, use the following (a mixed-precision sketch follows this list):
    - DDP (`DistributedDataParallel`) instead of DP (`DataParallel`)
    - Mixed precision
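A minimal mixed-precision training sketch using `torch.cuda.amp`, assuming a CUDA device is available; the model, data, and hyperparameters are dummy placeholders:

```python
import torch
from torch import nn

device = "cuda"
model = nn.Linear(10, 2).to(device)            # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()           # scales the loss to avoid fp16 underflow

for _ in range(10):                            # dummy training loop
    inputs = torch.randn(8, 10, device=device)
    targets = torch.randint(0, 2, (8,), device=device)

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():            # forward pass runs in mixed precision
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()              # backward on the scaled loss
    scaler.step(optimizer)                     # unscales gradients, then optimizer.step()
    scaler.update()
```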
- Useful utilities/commands: `gpustat --watch`, `glances`
- If you are running multiple experiments but have a limited number of cores, use `taskset --cpu-list <starting_thread>-<ending_thread> python <your_code>.py`. This makes sure each run uses only the allotted threads from `<starting_thread>` to `<ending_thread>` and prevents constant reallocation of CPU threads as the runs fight over them. Note that this helps only if everyone on the server respects the core allotment.
- More `num_workers` doesn't lead to a faster dataloader. In fact, in most cases a higher `num_workers` will make the dataloader slower. As far as I know there is no rule of thumb, but there is a sweet spot that is mostly found through trial and error (a small timing sketch follows below).
    - Check this thread for more: https://discuss.pytorch.org/t/guidelines-for-assigning-num-workers-to-dataloader/813
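A quick-and-dirty way to find the sweet spot is to time one pass of pure data loading for a few values of `num_workers`; the in-memory dataset below is a stand-in for your own:

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(2_000, 3, 64, 64))    # stand-in for your real dataset

for num_workers in [0, 2, 4, 8, 16]:
    loader = DataLoader(dataset, batch_size=64, num_workers=num_workers, shuffle=True)
    start = time.time()
    for _ in loader:          # iterate once without training to isolate the loading cost
        pass
    print(f"num_workers={num_workers}: {time.time() - start:.2f}s")
```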
- Useful utilities/commands: `htop`, `glances`
- Make sure you are running (read, log, train) on SSD. HDD causes I/O bottlenecks which are hard to get over even if you sell your soul to satan.
    - Check with `lsblk -o NAME,MOUNTPOINT,MODEL,ROTA,SIZE`. `ROTA == 0` means the drive is an SSD.
- Instead of loading your data from SSD or HDD, you can move it directly into RAM. `/dev/shm/` is a RAM-backed directory: first check that you have enough free RAM, then copy the entire dataset there and make your dataloader read from `/dev/shm` directly (a minimal sketch follows the commands below).
    - Useful utilities/commands: `ncdu`, `df -h`
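A minimal sketch of staging a dataset in `/dev/shm` and loading from it; the source path and the use of `torchvision.datasets.ImageFolder` are assumptions, adapt them to your own layout:

```python
import shutil
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

src = "/data/my_dataset/train"        # hypothetical on-disk dataset (one folder per class)
dst = "/dev/shm/my_dataset/train"     # RAM-backed copy; check free space with `df -h /dev/shm` first

shutil.copytree(src, dst, dirs_exist_ok=True)   # one-time copy into RAM (Python 3.8+)

# The dataloader now reads from RAM instead of the disk.
train_set = datasets.ImageFolder(dst, transform=transforms.ToTensor())
train_loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)
```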
- If you have implemented a new type of loss, do an overfit test first. Instead of running experiments on the entire dataset, overfit a single batch: you should be able to reach the loss's lower bound and 100% train accuracy, else something is wrong with the implementation (a minimal sketch follows below).
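A minimal single-batch overfit sketch; the model and data are dummy placeholders, and `criterion` is where your custom loss would go:

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))  # placeholder model
criterion = nn.CrossEntropyLoss()          # swap in your custom loss here
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# A single fixed batch, reused for every step.
inputs = torch.randn(16, 10)
targets = torch.randint(0, 2, (16,))

for step in range(500):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()

    if step % 100 == 0:
        acc = (outputs.argmax(dim=1) == targets).float().mean().item()
        print(f"step {step}: loss={loss.item():.4f} acc={acc:.2f}")
# By the end, the loss should be near its lower bound and accuracy ~100%.
```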
- `nan`-related issues:
    - If you have some custom loss and don't know where the `nan` is coming from, use anomaly detection: https://pytorch.org/docs/master/autograd.html#anomaly-detection (a minimal sketch follows below).
    - Ways to tackle `nan` and related issues: https://youtu.be/XlYD8jn1ayE?list=PLoWh1paHYVRfygApBdss1HCt-TFZRXs0k&t=1616
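A minimal anomaly-detection sketch: the `0/0` below deliberately produces a `nan`, so `backward()` raises an error whose traceback points at the forward operation that caused it:

```python
import torch

torch.autograd.set_detect_anomaly(True)   # record forward traces so backward can name the culprit op

x = torch.zeros(4, requires_grad=True)
y = x / x                                 # 0/0 -> nan in the forward pass (on purpose)
loss = y.sum()
loss.backward()                           # raises a RuntimeError naming the offending backward function
```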
- If you want to know whether that idea that's keeping you awake at night works, but ImageNet takes too much time, then check these datasets:
    - Tiny ImageNet: 200-class version of ImageNet - https://www.kaggle.com/c/tiny-imagenet
    - mini-ImageNet: 100-class version of ImageNet - https://github.com/yaoyao-liu/mini-imagenet-tools
        - Note that the above version is for few-shot learning. You can convert it into normal classification using this -
    - Imagenette: easier 10-class version of ImageNet - https://github.com/fastai/imagenette
    - Noisy Imagenette: Imagenette with noisy labels - https://github.com/fastai/imagenette/tree/master/noisy_labels
    - Imagewoof: relatively hard counterpart of Imagenette - https://github.com/fastai/imagenette#imagewoof
    - Noisy Imagewoof: Imagewoof with noisy labels - https://github.com/fastai/imagenette/tree/master/noisy_labels
    - Imagewang: semi-supervised version that combines Imagenette and Imagewoof - https://github.com/fastai/imagenette#image%E7%BD%91
- Implement resume functionality ASAP. Trust me, this will prevent crying yourself to sleep at night (a minimal checkpointing sketch follows below).
- If you are using wandb, then you can even resume the logging. Feels like magic. Check this thread https://twitter.com/somethingmyname/status/1400237720413171713?s=20 for more.
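A minimal save/resume sketch with `torch.save`/`torch.load`; the checkpoint path, the model, and what you store are placeholders, adapt them to your setup:

```python
import os
import torch
from torch import nn

model = nn.Linear(10, 2)                  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
ckpt_path = "checkpoint.pt"

# Resume if a checkpoint exists, otherwise start from scratch.
start_epoch = 0
if os.path.exists(ckpt_path):
    ckpt = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    start_epoch = ckpt["epoch"] + 1

for epoch in range(start_epoch, 100):
    # ... one epoch of training goes here ...
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "epoch": epoch},
        ckpt_path,
    )
```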
- If you come from the happy land of CIFARs and MNISTs like me, don't stare at the runs. It will take days to weeks. Get a hobby or it's finally time to open that "Interesting Papers" folder.
- As a final note, make sure to avoid everything in this thread https://twitter.com/karpathy/status/1013244313327681536?s=20 by Karpathy.