Downloading Smaller Parquet Datasets from Hugging Face

Large datasets on Hugging Face are often published as JSON files, which can be slow and bandwidth-heavy to download. Some datasets also provide an automatically converted Parquet version, which is usually much smaller because Parquet is a compressed, columnar format designed for efficient storage and access.

In this example, we demonstrate how to download only the Parquet files instead of the entire dataset in JSON.

Example: ScaleQuest-Math

Dataset URL: https://huggingface.co/datasets/dyyyyyyyy/ScaleQuest-Math

Size of the original dataset files in JSON: 1.51 GB
Size of the auto-converted Parquet files: 642 MB

Steps to download only the Parquet files:

# Step 1: Clone the repository without checking out the main branch
git clone https://huggingface.co/datasets/dyyyyyyyy/ScaleQuest-Math --no-checkout

# Step 2: Change directory to the dataset folder
cd ScaleQuest-Math

# Step 3: Fetch the refs/convert/parquet ref, where the smaller files are stored, into a remote-tracking branch
git fetch origin refs/convert/parquet:refs/remotes/origin/convert/parquet

# Step 4: Check out the Parquet branch
git checkout convert/parquet
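
As a quick sanity check after the checkout, it can help to list the Parquet files that actually landed in the working tree, along with their sizes. The sketch below uses a hypothetical helper, list_parquet, which is not part of Git or any Hugging Face tooling. It assumes a POSIX find and du. If Git LFS is not installed, the listed files may turn out to be tiny pointer files rather than real data, which the sizes would reveal.

```shell
# Hypothetical helper: print every .parquet file under a directory
# (default: current directory) with its size, skipping the .git folder.
list_parquet() {
  local dir
  dir="${1:-.}"
  # -prune keeps find out of .git; du -h prints a human-readable size per file.
  find "$dir" -path "$dir/.git" -prune -o -type f -name '*.parquet' -exec du -h {} +
}
```

Run from inside the cloned folder, `list_parquet .` prints one line per Parquet file. An empty result suggests the checkout did not switch to the Parquet ref.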

General Example

If you are working with other datasets, replace {username} and {dataset-name} with the appropriate values:

git clone https://huggingface.co/datasets/{username}/{dataset-name} --no-checkout
cd {dataset-name}
git fetch origin refs/convert/parquet:refs/remotes/origin/convert/parquet
git checkout convert/parquet
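
The same commands can be wrapped in a small shell function so the username and dataset name are passed as arguments. This is a minimal sketch, not an official tool: the function name clone_parquet and the DRY_RUN variable are hypothetical, introduced here for illustration. It uses git -C instead of cd so that the caller's working directory is left unchanged.

```shell
# Hypothetical helper: clone_parquet <username> <dataset-name>
# Set DRY_RUN=1 to print the git commands instead of running them.
clone_parquet() {
  if [ "$#" -ne 2 ]; then
    echo "usage: clone_parquet <username> <dataset-name>" >&2
    return 1
  fi
  local user name url run
  user="$1"
  name="$2"
  url="https://huggingface.co/datasets/${user}/${name}"
  run=""
  if [ "${DRY_RUN:-0}" = "1" ]; then run="echo"; fi

  # Same steps as above; each one runs only if the previous one succeeded.
  $run git clone "$url" --no-checkout &&
    $run git -C "$name" fetch origin refs/convert/parquet:refs/remotes/origin/convert/parquet &&
    $run git -C "$name" checkout convert/parquet
}
```

For example, `DRY_RUN=1 clone_parquet dyyyyyyyy ScaleQuest-Math` prints the three git commands without touching the network, and dropping DRY_RUN runs them.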

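This process lets you download the smaller, auto-converted Parquet files without downloading the large JSON files from the main branch. Because --no-checkout skips the initial working-tree checkout, large files that main stores through Git LFS are not fetched; only the files on the Parquet ref are downloaded when you check it out (this assumes Git LFS is installed). The Parquet format is more compact, saving you bandwidth and storage space.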

Johnz86/CLONE_DATASET.md
