Created
August 12, 2025 20:49
Count the number of rows in a sharded Parquet dataset without loading the shards. Works by reading only the metadata stored in each file's footer. Works with S3 datasets.
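To show why a metadata-only read is cheap, here is a minimal sketch of locating the metadata block in a Parquet file using only the standard library. It relies on the documented file layout (a `PAR1` magic at both ends, with a 4-byte little-endian metadata length just before the trailing magic). The helper name `footer_metadata_span` is hypothetical, and the sketch does not decode the Thrift-encoded metadata itself; a library such as pyarrow does that part.

```python
import struct

PARQUET_MAGIC = b"PAR1"


def footer_metadata_span(file_bytes: bytes) -> tuple[int, int]:
    """Return (offset, length) of the serialized file metadata.

    Hypothetical helper for illustration only: it locates the metadata
    block but does not parse it.
    """
    if (
        len(file_bytes) < 12
        or file_bytes[:4] != PARQUET_MAGIC
        or file_bytes[-4:] != PARQUET_MAGIC
    ):
        raise ValueError("missing PAR1 magic bytes")
    # The 4 bytes before the trailing magic hold the metadata length.
    (meta_len,) = struct.unpack("<I", file_bytes[-8:-4])
    offset = len(file_bytes) - 8 - meta_len
    if offset < 4:
        raise ValueError("footer length is larger than the file")
    return offset, meta_len


# Synthetic stand-in for a file: magic, 100 data bytes, 10 metadata bytes,
# the metadata length, and the trailing magic.
blob = PARQUET_MAGIC + b"x" * 100 + b"M" * 10 + struct.pack("<I", 10) + PARQUET_MAGIC
span = footer_metadata_span(blob)
```

Because the row-group row counts live inside that metadata block, a reader only needs the tail of each shard rather than the column data.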
import pyarrow.dataset as ds


def count_parquet_rows(dataset_path: str) -> int:
    """
    Count the rows in a (possibly sharded) Parquet dataset without reading the data into memory.
    https://stackoverflow.com/a/79118602/4212158
    """
    dataset = ds.dataset(dataset_path, format="parquet")
    # Each fragment is one Parquet file; its row groups expose num_rows from the file metadata.
    row_count = sum(
        row_group.num_rows
        for fragment in dataset.get_fragments()
        for row_group in fragment.row_groups
    )
    return row_count