Radi Radev radiradev
@ZijiaLewisLu
ZijiaLewisLu / Tricks to Speed Up Data Loading with PyTorch.md
Last active October 2, 2025 11:26
Tricks to Speed Up Data Loading with PyTorch

In most deep learning projects, the training script starts by loading data, which can easily take several minutes. Only once the data is ready can I start testing my buggy code. All too often I wait ten minutes just to find that I made a stupid typo, then have to restart and wait another ten minutes, hoping there are no other typos.

To make my life easier, I have put a lot of effort into reducing I/O overhead. Here I list some useful tricks I found; I hope they save you some time too.

  1. Use NumPy memmap to load arrays and say goodbye to HDF5.

    I used to rely on HDF5 to read and write data, especially when loading only a subset of the data. That was before I realized how fast and charming NumPy's memmap is. In short, a memory-mapped file does not load the whole array when it is opened; it "lazily" loads only the parts that an operation actually touches.

Sometimes I may want to copy the full array to memory at once, as it makes later operations
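
A minimal sketch of the memmap idea, not taken from the gist itself: the file name, array shape, and dtype below are made up for illustration. It saves an array with `np.save`, reopens it with `np.load(..., mmap_mode="r")` so that nothing is read up front, pulls out a small slice, and then copies the full array into memory with `np.array`.

```python
import os
import tempfile

import numpy as np

# Hypothetical file and shape, chosen only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "features.npy")
np.save(path, np.arange(1_000_000, dtype=np.float32).reshape(1000, 1000))

# mmap_mode="r" maps the file read-only instead of reading it into RAM.
data = np.load(path, mmap_mode="r")

# Slicing reads only the part of the file that backs these rows;
# np.array(...) copies that slice into an ordinary in-memory ndarray.
batch = np.array(data[10:20])

# When later operations need the whole array in RAM, copy it once.
in_memory = np.array(data)
```

Whether the lazy slice or the full copy is faster overall depends on how much of the array the training loop ends up touching, so it may be worth timing both on real data.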

@TrentBrick
TrentBrick / PyTorch_bucket_by_sequence_length.py
Last active June 14, 2024 06:39
PyTorch BatchSampler for bucketing sequences by length
"""
PyTorch has pack_padded_sequence, but it doesn't work with dense layers. For sequence data with high variance in length,
the best way to minimize padding and masking within a batch is to feed in data that is already grouped by sequence length
(while still shuffling it somewhat). Here is my current solution in numpy.
I will need to convert every function over to torch to allow it to run on the GPU, and I am sure there are many other
ways to optimize it further. Hope this helps others and that maybe it can become a new PyTorch Batch Sampler someday.
General approach to how it works:
Decide what your bucket boundaries for the data are.
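
The bucketing idea described above can be sketched roughly as follows. This is not the gist author's implementation: the class name `BucketBatchSampler`, its parameters, and the example lengths are all assumptions made for illustration. It assigns each sequence to a bucket with `np.digitize`, shuffles within each bucket, cuts each bucket into batches, and then shuffles the order of the batches.

```python
import numpy as np


class BucketBatchSampler:
    """Hypothetical sampler: each batch holds indices from one length bucket."""

    def __init__(self, lengths, boundaries, batch_size, seed=0):
        self.batch_size = batch_size
        self.rng = np.random.default_rng(seed)
        # bucket_ids[i] = k means boundaries[k-1] <= lengths[i] < boundaries[k]
        self.bucket_ids = np.digitize(np.asarray(lengths), np.asarray(boundaries))

    def __iter__(self):
        batches = []
        for bucket in np.unique(self.bucket_ids):
            idx = np.flatnonzero(self.bucket_ids == bucket)
            self.rng.shuffle(idx)  # shuffle inside the bucket
            for start in range(0, len(idx), self.batch_size):
                batches.append(idx[start:start + self.batch_size].tolist())
        # Shuffle the batch order so buckets are interleaved across an epoch.
        order = self.rng.permutation(len(batches))
        return iter([batches[i] for i in order])

    def __len__(self):
        counts = np.bincount(self.bucket_ids)
        # Ceiling division per bucket; empty buckets contribute zero batches.
        return int(sum(-(-int(c) // self.batch_size) for c in counts))


# Made-up sequence lengths and bucket boundaries.
lengths = [3, 4, 7, 8, 9, 12, 15, 2]
sampler = BucketBatchSampler(lengths, boundaries=[5, 10], batch_size=2)
batches = list(sampler)
```

An object like this, which yields lists of indices and defines `__len__`, could be passed to `torch.utils.data.DataLoader` through its `batch_sampler` argument, with padding handled in a custom `collate_fn`.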
@francois-rozet
francois-rozet / flow_matching.py
Last active October 19, 2025 22:48
Flow Matching in 100 LOC
#!/usr/bin/env python
import math
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from sklearn.datasets import make_moons
from torch import Tensor
from tqdm import tqdm