@alexlib
Forked from lpillmann/read_parquet.py
Created April 3, 2021 21:54
Read partitioned Parquet files from Google Cloud Storage into a pandas DataFrame using PyArrow
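import gcsfs
import pyarrow.parquet as pq  # a bare "import pyarrow" does not reliably load the parquet submodule


def read_parquet(gs_directory_path, to_pandas=True):
    """
    Reads multiple (partitioned) parquet files from a GS directory
    e.g. 'gs://<bucket>/<directory>' (without ending /)

    Returns a pandas DataFrame, or the ParquetDataset itself if to_pandas is False.
    """
    # Uses whatever Google Cloud credentials gcsfs finds in the environment
    gs = gcsfs.GCSFileSystem()
    dataset = pq.ParquetDataset(gs_directory_path, filesystem=gs)
    if to_pandas:
        return dataset.read_pandas().to_pandas()
    return dataset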