Large datasets on Hugging Face can be cumbersome to download in JSON format because of the sheer size of the files. Fortunately, some datasets provide automatically converted Parquet versions, which are much smaller and optimized for efficient storage and access.
This example shows how to download only the Parquet files instead of the entire dataset in JSON.
Dataset URL: https://huggingface.co/datasets/dyyyyyyyy/ScaleQuest-Math
- Size of the original dataset files in JSON: 1.51 GB
- Size of the auto-converted Parquet files: 642 MB
# Step 1: Clone the repository without checking out the main branch
git clone https://huggingface.co/datasets/dyyyyyyyy/ScaleQuest-Math --no-checkout
# Step 2: Change directory to the dataset folder
cd ScaleQuest-Math
# Step 3: Fetch the Parquet ref (refs/convert/parquet), where the smaller files are stored, into a remote-tracking branch
git fetch origin refs/convert/parquet:refs/remotes/origin/convert/parquet
# Step 4: Check out the Parquet branch
git checkout convert/parquet
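After the checkout, it can be worth confirming that the Parquet files contain real data. The sketch below assumes the files are stored with Git LFS, as large files on the Hub typically are. If `git-lfs` is not installed, the checkout may leave small pointer stubs in place of the data. The helper name `is_lfs_pointer` is hypothetical; it relies only on the fixed first line that Git LFS pointer files begin with.

```shell
# Hypothetical helper (not part of git or git-lfs): succeeds when the given
# file is a Git LFS pointer stub rather than real data. Pointer files start
# with the fixed line "version https://git-lfs.github.com/spec/v1".
is_lfs_pointer() {
  head -c 42 "$1" | grep -qF 'version https://git-lfs.github.com/spec/v1'
}

# Example (run inside the cloned dataset folder): list Parquet files that
# are still pointer stubs.
#   find . -name '*.parquet' | while read -r f; do
#     is_lfs_pointer "$f" && echo "pointer only: $f"
#   done
```

If any files are reported, installing Git LFS and running `git lfs pull` in the repository should fetch their actual contents.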
If you are working with other datasets, replace {username} and {dataset-name} with the appropriate values:
git clone https://huggingface.co/datasets/{username}/{dataset-name} --no-checkout
cd {dataset-name}
git fetch origin refs/convert/parquet:refs/remotes/origin/convert/parquet
git checkout convert/parquet
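For repeated use, the same steps could be wrapped in a small shell function. This is only a sketch: the function names `hf_dataset_url` and `clone_parquet_branch` are hypothetical, and it assumes the dataset actually has an auto-converted `refs/convert/parquet` ref. It uses `git -C` instead of `cd`, so the calling shell's working directory is left unchanged.

```shell
# Hypothetical helper: builds the dataset repository URL from the
# username and dataset name.
hf_dataset_url() {
  printf 'https://huggingface.co/datasets/%s/%s\n' "$1" "$2"
}

# Hypothetical helper: clones without checking out the main branch, fetches
# the auto-converted Parquet ref, and checks it out.
# Arguments: $1 = username, $2 = dataset name.
clone_parquet_branch() {
  git clone "$(hf_dataset_url "$1" "$2")" --no-checkout &&
    git -C "$2" fetch origin refs/convert/parquet:refs/remotes/origin/convert/parquet &&
    git -C "$2" checkout convert/parquet
}

# Example:
#   clone_parquet_branch dyyyyyyyy ScaleQuest-Math
```

Chaining the commands with `&&` stops the sequence at the first failing step, for example when the dataset has no Parquet ref to fetch.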
This process lets you download the smaller, auto-converted Parquet files without downloading the large JSON files from the main branch. For ScaleQuest-Math, that means 642 MB instead of 1.51 GB, which saves both bandwidth and storage space.