@linhduongtuan
Forked from Birch-san/img-folder-chunking.md
Created February 3, 2024 18:30
Chunking a folder of pngs into .tar files

Uploading a folder of many files to HF, by chunking it into .tars

So you generated 50000 images for computing FID or whatever, and now you want to upload those samples to HF.
You try, but one of the file transfers fails, and you lose all your progress.
I mean it'd be nice if HF could just… fix this… like, put retries into huggingface-cli upload instead of just discarding tens of gigabytes of progress… but we live in the world in which we live.

So let's make it easier. Instead of 50k small files, let's upload 50 big files. Collate 'em into .tars.

I'm not sure this makes a valid WebDataset (WDS), but it's close; I think you would need to rename the files to 000000.img.png if you wanted that.

Starting point

Directory structure (as given by tree command):

samples
├── 000000.png
├── 000001.png
├──      …
└── 049999.png

Let's make a sibling directory, splits (mkdir splits, run from the parent of samples):

.
├── samples
│   ├── 000000.png
│   ├── 000001.png
│   ├──      …
│   └── 049999.png
└── splits

Create splits

Ensure you have cd'd into the splits directory.

We'll generate text files x00…x49 detailing the list of files we want in each chunk:

split -l 1000 --numeric-suffixes --suffix-length=2 <(find ../samples -type f -name '*.png' -printf '%P\n' | sort -V)

Now we have the following files:

.
├── samples
│   ├── 000000.png
│   ├── 000001.png
│   ├──      …
│   └── 049999.png
└── splits
    ├── x00
    ├── x01
    ├──  …
    └── x49

Split files such as x00 have content like this (a list of files):

000000.png
000001.png
…
000999.png
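
Before tarring anything, it's probably worth checking that the listings cover every png exactly once. Here's a minimal sketch of that check; check_splits is a name I made up for illustration, and it assumes you run it from inside the splits directory with the xNN files already generated:

```shell
# check_splits SAMPLES_DIR (hypothetical helper, not part of any tool):
# compare the number of files listed across all xNN splits
# with the number of pngs actually present in SAMPLES_DIR.
check_splits() {
  samples_dir=$1
  listed=$(cat x?? | wc -l)
  on_disk=$(find "$samples_dir" -type f -name '*.png' | wc -l)
  echo "listed=$listed on_disk=$on_disk"
  # Succeeds only when both counts agree.
  [ "$listed" -eq "$on_disk" ]
}
```

Usage would be something like check_splits ../samples; a non-zero exit status means the listings and the folder disagree.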

tar the splits

Still in the splits directory, let's make a tar directory (mkdir tar):

.
├── samples
│   ├── 000000.png
│   ├── 000001.png
│   ├──      …
│   └── 049999.png
└── splits
    ├── tar
    ├── x00
    ├── x01
    ├──  …
    └── x49

Now let's read the file listings in every x00…x49 split, and create .tar chunks of said file listings:

for i in {0..49}; do tar -C ../samples/ -cvf "$(printf 'tar/%02d000.tar' $i)" --files-from "$(printf 'x%02d' $i)"; done
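
The loop above hard-codes 50 chunks. If your sample count differs, a variant that loops over whichever xNN listings exist might be handier. This is only a sketch: tar_splits is a hypothetical name, and it assumes two-digit split suffixes and the same tar/NN000.tar naming as above:

```shell
# tar_splits SAMPLES_DIR (hypothetical helper):
# build tar/NN000.tar from every xNN listing in the current directory.
tar_splits() {
  samples_dir=$1
  mkdir -p tar
  for listing in x??; do
    suffix=${listing#x}   # strip the leading "x", e.g. x07 -> 07
    tar -C "$samples_dir" -cf "tar/${suffix}000.tar" --files-from "$listing" || return 1
  done
}
```

Run it from the splits directory as tar_splits ../samples. It drops the -v flag to keep the output quiet; add it back if you want to watch the files go by.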

This gives us a folder of .tars:

.
├── samples
│   ├── 000000.png
│   ├── 000001.png
│   ├──      …
│   └── 049999.png
└── splits
    ├── tar
    │   ├── 00000.tar
    │   ├── 01000.tar
    │   ├──     …
    │   └── 49000.tar
    ├── x00
    ├── x01
    ├──  …
    └── x49

Each such tar contains 1000 pngs:

tar -tvf tar/00000.tar
-rw-rw-r-- birch/birch 1174871 2024-02-02 23:51 000000.png
-rw-rw-r-- birch/birch 1415042 2024-02-02 23:51 000001.png
…
-rw-rw-r-- birch/birch 1488682 2024-02-02 23:57 000999.png
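
Eyeballing one tar is fine, but you can also total the entries across all of them and compare against the number of images you started with. A rough sketch, where count_tar_members is a made-up helper name:

```shell
# count_tar_members DIR (hypothetical helper):
# print the total number of entries across every DIR/*.tar.
count_tar_members() {
  total=0
  for t in "$1"/*.tar; do
    n=$(tar -tf "$t" | wc -l)
    total=$((total + n))
  done
  echo "$total"
}
```

From the splits directory, count_tar_members tar should print the total number of pngs (50000 in this walkthrough).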

Uploading to HF

cd into the tar directory, and upload all its files to a dataset on HF:

huggingface-cli upload --repo-type=dataset hfusername/my-cool-dataset . .
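
If a transfer still fails partway, one workaround is to upload one tar per invocation and retry the failures yourself. Treat the following as a sketch and not as anything HF provides: retry, upload_tars and RETRY_DELAY are names invented here, and the argument order (repo, local path, path in repo) simply mirrors the command above:

```shell
# retry MAX_ATTEMPTS CMD [ARGS...] (hypothetical helper):
# rerun CMD until it succeeds or MAX_ATTEMPTS is reached.
retry() {
  max_attempts=$1
  shift
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    # RETRY_DELAY is an assumed knob (seconds between attempts), default 5.
    sleep "${RETRY_DELAY:-5}"
  done
}

# upload_tars REPO (hypothetical helper):
# upload each .tar in the current directory separately, retrying up to 5 times.
upload_tars() {
  repo=$1
  for f in *.tar; do
    retry 5 huggingface-cli upload --repo-type=dataset "$repo" "$f" "$f" || return 1
  done
}
```

From inside the tar directory you'd call upload_tars hfusername/my-cool-dataset. Each file becomes its own upload, so a failure only costs you one chunk and not the whole batch.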