@Johnz86
Created October 25, 2024 09:40
Downloading Smaller Parquet Datasets from Hugging Face

When working with large datasets on Hugging Face, downloading the original JSON files can be slow and wasteful because of their size. Fortunately, some datasets also provide automatically converted Parquet versions on a separate ref (refs/convert/parquet), and these are often considerably smaller while remaining efficient to store and read.

In this example, we demonstrate how to download only the Parquet files instead of the entire dataset in JSON.


Example: ScaleQuest-Math

Dataset URL: https://huggingface.co/datasets/dyyyyyyyy/ScaleQuest-Math

  • Size of the original dataset files in JSON: 1.51 GB
  • Size of the auto-converted Parquet files: 642 MB

Steps to download only the Parquet files:

# Step 1: Clone the repository without checking out the main branch
git clone https://huggingface.co/datasets/dyyyyyyyy/ScaleQuest-Math --no-checkout

# Step 2: Change directory to the dataset folder
cd ScaleQuest-Math

# Step 3: Fetch the Parquet ref (refs/convert/parquet), where the smaller files are stored, into a remote-tracking branch
git fetch origin refs/convert/parquet:refs/remotes/origin/convert/parquet

# Step 4: Check out the Parquet branch
git checkout convert/parquet
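After the checkout, it can be worth confirming that the working tree really contains Parquet files. The following is a minimal sketch; `list_parquet_files` is a hypothetical helper name, not something Hugging Face or git provides, and it relies only on `git ls-files` with a glob pathspec.

```shell
# Hypothetical helper: print the Parquet files tracked in a checked-out clone.
# The optional first argument is the clone directory (defaults to the current one).
list_parquet_files() {
  git -C "${1:-.}" ls-files -- '*.parquet'
}
```

Run `list_parquet_files .` from inside the ScaleQuest-Math directory; an empty result suggests the checkout did not land on the Parquet ref.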

General Example

If you are working with other datasets, replace {username} and {dataset-name} with the appropriate values:

git clone https://huggingface.co/datasets/{username}/{dataset-name} --no-checkout
cd {dataset-name}
git fetch origin refs/convert/parquet:refs/remotes/origin/convert/parquet
git checkout convert/parquet
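The four commands above can be wrapped in a small shell function. This is a sketch under stated assumptions: `hf_parquet_clone` is a hypothetical name, the optional second argument (a base URL) is added here only for convenience, and the `git ls-remote` check is an extra step that stops early when a dataset has no refs/convert/parquet ref.

```shell
# Hypothetical helper (not part of the original steps): clone only the
# auto-converted Parquet ref of a Hugging Face dataset.
# Usage: hf_parquet_clone <username>/<dataset-name> [base-url]
hf_parquet_clone() {
  local repo_id="${1:-}"
  local base_url="${2:-https://huggingface.co/datasets}"
  local ref="refs/convert/parquet"

  if [ -z "$repo_id" ]; then
    echo "usage: hf_parquet_clone <username>/<dataset-name> [base-url]" >&2
    return 2
  fi

  local url="${base_url}/${repo_id}"
  local dir="${repo_id##*/}"   # clone into a folder named after the dataset

  # Stop early if the remote does not advertise the Parquet ref.
  if [ -z "$(git ls-remote "$url" "$ref" 2>/dev/null)" ]; then
    echo "no $ref found at $url" >&2
    return 1
  fi

  # Same steps as above; git -C avoids changing the caller's directory.
  git clone --no-checkout "$url" "$dir" &&
    git -C "$dir" fetch origin "${ref}:refs/remotes/origin/convert/parquet" &&
    git -C "$dir" checkout convert/parquet
}
```

For the example dataset, the call would be `hf_parquet_clone dyyyyyyyy/ScaleQuest-Math`, which leaves the Parquet files in a new ScaleQuest-Math folder.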

This process downloads only the smaller, auto-converted Parquet files and skips the large JSON files on the main branch. For ScaleQuest-Math that means 642 MB instead of 1.51 GB, which saves both bandwidth and storage space.