Tricks to Speed Up Data Loading with PyTorch

In most deep learning projects, the training script starts with lines to load the data, which can easily take several minutes. Only after the data is ready can I start testing my buggy code. It is frustratingly often that I wait ten minutes just to find I made a stupid typo, then have to restart and wait another ten minutes hoping no other typos slipped in.

To make my life easier, I have put a lot of effort into reducing I/O overhead. Here I list some useful tricks I found and hope they save you some time as well.

  1. Use Numpy memmap to load arrays and say goodbye to HDF5.

    I used to rely on HDF5 to read/write data, especially when loading only a sub-part of the data. Yet that was before I realized how fast and charming Numpy memmap files are. In short, a memmap file does not load the whole array when it is opened; it only "lazily" loads the parts that are actually needed by later operations.

    Sometimes I want to copy the full array into memory at once, as it makes later operations faster. Even then, going through a memmap file is still much faster than HDF5: just do array = numpy.array(memmap_file). It cuts the several minutes spent with HDF5 down to several seconds. Pretty impressive, isn't it! (A short sketch follows after this list.)

    A useful tool to check out is sharearray. It hides the verbose details of creating the memmap file for you.

    If you want to create a memmap array that is too large to fit in memory, use numpy.memmap().

  2. torch.from_numpy() to avoid an extra copy.

    torch.Tensor makes a copy of the numpy array passed in, while torch.from_numpy() uses the same storage as the numpy array (see the sketch after this list).

  3. torch.utils.data.DataLoader for multi-process loading.

    I think most people are aware of it. DataLoader takes an optional argument num_workers that sets how many worker processes are spawned to load data in the background (a small example follows after this list).

  4. A simple trick to overlap data-copy time and GPU time.

    Copying data to the GPU can be relatively slow, so you want to overlap I/O and GPU time to hide the latency. Unfortunately, PyTorch does not provide a handy tool for this. Here is a simple snippet that hacks around it with DataLoader, pin_memory and .cuda(non_blocking=True).

from torch.utils.data import DataLoader

# some code

loader = DataLoader(your_dataset, ..., pin_memory=True)
data_iter = iter(loader)

next_batch = next(data_iter)  # start loading the first batch
next_batch = [t.cuda(non_blocking=True) for t in next_batch]  # with pin_memory=True and non_blocking=True, this copies the batch to the GPU asynchronously

for i in range(len(loader)):
    batch = next_batch
    if i + 1 != len(loader):
        # start copying data of the next batch
        next_batch = next(data_iter)
        next_batch = [t.cuda(non_blocking=True) for t in next_batch]
    
    # training code
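
Below is a minimal sketch of the memmap workflow from trick 1. The file names, shapes, and dtypes are made up for illustration; the point is that opening with mmap_mode="r" is nearly free and only the slices you touch are read from disk.

import numpy as np

# Save an array once as a plain .npy file (path and shape are made up).
features = np.random.rand(10000, 512).astype(np.float32)
np.save("features.npy", features)

# Open it as a memmap: nothing is loaded into memory yet.
mm = np.load("features.npy", mmap_mode="r")

# Slicing only reads the rows you touch, e.g. one training batch.
batch = np.array(mm[1000:1064])

# If you do want the whole array in RAM, copy it in one go.
full = np.array(mm)

# For arrays too large to build in memory in the first place,
# write directly through numpy.memmap instead.
big = np.memmap("big.dat", dtype=np.float32, mode="w+", shape=(100_000, 64))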
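
A tiny sketch for trick 2, showing the copy vs. shared-storage difference:

import numpy as np
import torch

a = np.arange(5, dtype=np.float32)

t_copy = torch.tensor(a)      # copies the data into new storage
t_view = torch.from_numpy(a)  # shares the numpy array's storage, no copy

a[0] = 42.0
print(t_copy[0].item())  # 0.0  -- unaffected, it owns its own buffer
print(t_view[0].item())  # 42.0 -- sees the change, same underlying memory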
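
And a sketch for trick 3; the TensorDataset is just a stand-in for your own dataset:

import torch
from torch.utils.data import DataLoader, TensorDataset

# A stand-in dataset; replace with your own Dataset class.
dataset = TensorDataset(torch.randn(10000, 512), torch.randint(0, 10, (10000,)))

# num_workers > 0 spawns that many worker processes that prepare batches
# in the background while the main process runs the training step.
# (On Windows/macOS, wrap this in an `if __name__ == "__main__":` guard.)
loader = DataLoader(dataset, batch_size=64, shuffle=True,
                    num_workers=4, pin_memory=True)

for x, y in loader:
    pass  # training step goes here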

Subangkar commented Jul 22, 2020

There is a typo, nex_batch, in line 9: next_batch = [ _.cuda(async=True) for _ in nex_batch ] # with pin_memory=True and async=True, this will copy data to GPU non blockingly
Also, the async keyword argument of cuda is deprecated and has been changed to non_blocking=True.


bfeeny commented Jul 28, 2020

With regard to:

2. torch.from_numpy() to avoid an extra copy.

torch.Tensor makes a copy of the numpy array passed in, while torch.from_numpy() uses the same storage as the numpy array.

This wouldn't hold true if you were creating the tensor on the GPU, right? Typically a numpy array is instantiated on the CPU and then moved to a GPU torch tensor, so I would think it would not save you anything in that workflow. Obviously, if you are moving from CPU-memory Numpy to CPU-memory torch, it could. Thoughts?

ZijiaLewisLu (Author) commented

@Subangkar Thank you! I have updated the code.

ZijiaLewisLu (Author) commented

@bfeeny You are right, it is for CPU-to-CPU conversion. In my project, I often have to load batch data from disk in Numpy format at each iteration and then convert it to a PyTorch tensor, so I found it helpful.


bfeeny commented Aug 10, 2020

This should be if i + 1 != len(loader): not if i + 2 != len(loader):

Example as you have it:

loader = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] 
train_iter = iter(loader)  
next_batch = next(train_iter)

for i in range(len(loader)):
    batch = next_batch
    print(i, batch)
    if i + 2 != len(loader):
        next_batch = next(train_iter)

0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 8

Corrected:

loader = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] 
train_iter = iter(loader)  
next_batch = next(train_iter)

for i in range(len(loader)):
    batch = next_batch
    print(i, batch)
    if i + 1 != len(loader):
        next_batch = next(train_iter)

0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9


hxtruong6 commented Aug 14, 2020

I got the error return next(self._sampler_iter)  # may raise StopIteration at the line next_batch = data_iter.next().
Also, I cannot call data_iter.next(); I have to use data_iter.__next__() instead. How do I solve this?
Thanks


FeryET commented Dec 31, 2021

I think your opinion that HDF is much slower than numpy is misguided.

HDF needs to be carefully parameterized using rdcc_w0 and rdcc_nslots, but if you give them good values it is not only as fast as memmap arrays, it is also easier to maintain. You can keep both your training data and training labels in a single HDF file and version it via DVC, etc.
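
A hedged sketch of what that parameterization looks like with h5py; the file and dataset names and the cache sizes are illustrative and should be tuned to your chunk size and access pattern:

import h5py

with h5py.File("train.h5", "r",
               rdcc_nbytes=1024 ** 3,       # 1 GiB raw-data chunk cache
               rdcc_w0=1.0,                 # evict fully-read chunks first
               rdcc_nslots=1_000_003) as f: # large prime number of hash slots
    batch = f["features"][:64]              # reads now go through the chunk cache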

inikishev commented

In my testing, deeplake offers the fastest data loading from disk, and unlike numpy it can also do compression.

inikishev commented

deeplake is a bit annoying to work with, though, because you are locked into their dataloader and so on. Zarr's ZipStore is also quite fast and more flexible.
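
A minimal sketch of the ZipStore route, assuming the zarr v2 API (names, shapes, and chunking are made up):

import numpy as np
import zarr  # assuming zarr v2, where ZipStore is a top-level class

# Write a compressed, chunked array into a single zip file.
store = zarr.ZipStore("features.zip", mode="w")
root = zarr.group(store=store)
root.create_dataset("features",
                    data=np.random.rand(10000, 128).astype(np.float32),
                    chunks=(256, 128))
store.close()

# Read it back; only the chunks you actually index get decompressed.
store = zarr.ZipStore("features.zip", mode="r")
features = zarr.open_group(store, mode="r")["features"]
batch = features[:64]  # numpy array with the first 64 rows
store.close()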
