@crypdick
Created August 12, 2025 20:49
Count the number of rows in a sharded Parquet dataset without loading the shards. Works by reading only the metadata in each file's footer. Works with S3 datasets.
import pyarrow.dataset as ds


def count_parquet_rows(dataset_path: str) -> int:
    """
    Count the rows in a (possibly sharded) Parquet dataset without reading the data into memory.

    https://stackoverflow.com/a/79118602/4212158
    """
    dataset = ds.dataset(dataset_path, format="parquet")
    # Each fragment is one Parquet file; its row groups expose num_rows from metadata.
    row_count = sum(
        row_group.num_rows
        for fragment in dataset.get_fragments()
        for row_group in fragment.row_groups
    )
    return row_count