
@lpillmann
Last active November 16, 2023 05:52
Read partitioned parquet files into pandas DataFrame from Google Cloud Storage using PyArrow
import gcsfs
import pyarrow.parquet


def read_parquet(gs_directory_path, to_pandas=True):
    """
    Read multiple (partitioned) parquet files from a GS directory,
    e.g. 'gs://<bucket>/<directory>' (without trailing /).
    """
    gs = gcsfs.GCSFileSystem()
    arrow_df = pyarrow.parquet.ParquetDataset(gs_directory_path, filesystem=gs)
    if to_pandas:
        return arrow_df.read_pandas().to_pandas()
    return arrow_df
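
A minimal usage sketch (the bucket and prefix below are placeholders; gcsfs is assumed to pick up credentials on its own, e.g. via GOOGLE_APPLICATION_CREDENTIALS):

# Hypothetical path: replace with your own bucket and prefix (no trailing /)
df = read_parquet("gs://my-bucket/my-dataset")
print(df.shape)

# Keep the Arrow dataset instead of converting to pandas
dataset = read_parquet("gs://my-bucket/my-dataset", to_pandas=False)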
@rjurney

rjurney commented Apr 19, 2021

This actually doesn't work for me either for a directory like this:

gs://bar-foo/derived/2021_01/person_company_examples.parquet/
gs://bar-foo/derived/2021_01/person_company_examples.parquet/_SUCCESS
gs://bar-foo/derived/2021_01/person_company_examples.parquet/part-00000-1fkd-2614-4922-aee4-815c44abf-c000.snappy.parquet

@felipejardimf

@lpillmann thanks!!
I can't use it with the GS filesystem either.

I have this structure:

gs://bucket/folder/DATA_PART=201801/1.parquet
gs://bucket/folder/DATA_PART=201801/2.parquet
gs://bucket/folder/DATA_PART=201802/1.parquet
gs://bucket/folder/DATA_PART=201801/2.parquet

It seems that none of the engines can reach the GS path.

Can anyone help with that?

@lpillmann
Author

lpillmann commented Jul 7, 2021

Hi @felipejardimf and @rjurney!

It has been a while since I've worked with GS, so I'm not currently able to reproduce the issue.

Just to be sure, please note that the gs_directory_path argument is the path of the "folder" in which the files are stored, without the trailing /.

For the structure

gs://bucket/folder/DATA_PART=201801/1.parquet
gs://bucket/folder/DATA_PART=201801/2.parquet
gs://bucket/folder/DATA_PART=201802/1.parquet
gs://bucket/folder/DATA_PART=201801/2.parquet

the argument gs_directory_path would be gs://bucket/folder/DATA_PART=201801.

For the structure

gs://bar-foo/derived/2021_01/person_company_examples.parquet/
gs://bar-foo/derived/2021_01/person_company_examples.parquet/_SUCCESS
gs://bar-foo/derived/2021_01/person_company_examples.parquet/part-00000-1fkd-2614-4922-aee4-815c44abf-c000.snappy.parquet

the argument gs_directory_path would be gs://bar-foo/derived/2021_01/person_company_examples.parquet.

Can you try with these arguments and see if it works? If it doesn't, please share the error message or behavior and I can try to debug with you.
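
For concreteness, the calls would look roughly like this (an untested sketch on my side, using the listings above):

# DATA_PART=... layout, single partition
df = read_parquet("gs://bucket/folder/DATA_PART=201801")

# Spark-style output directory
df = read_parquet("gs://bar-foo/derived/2021_01/person_company_examples.parquet")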

@felipejardimf

Hey @lpillmann!

Yeah, if I set this path I can reach the files: gs://bucket/folder/DATA_PART=201801

But how can I access paths like this? gs://bucket/folder/*

I ask because in other environments I can usually read from this kind of path.

Thank you for your help!!

@lpillmann
Author

Got it @felipejardimf.

I'd expect PyArrow to be able to read from that path if you pass gs://bucket/folder as gs_directory_path.

However, I'm not able to test it right now. You might want to take a look at pyarrow.parquet.ParquetDataset documentation and see if you need to tweak any of the parameters in order for that to work.
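
An untested sketch of what I'd try first, using the function from the gist:

# Pass the parent "folder" so PyArrow can discover the DATA_PART=...
# hive-style partitions itself (instead of globbing with *)
df = read_parquet("gs://bucket/folder")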

@uchiiii

uchiiii commented Aug 11, 2021

Hi everyone!
Unfortunately, I got errors like the ones below.

OSError: Passed non-file path:  gs://<bucket>/<folder>

or

ArrowInvalid: Parquet file size is 0 bytes

I found another way here to achieve the same, which could hopefully help someone.

Note that pandas does not support this.
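
One alternative sketch that may help (assuming the newer pyarrow.dataset API and hive-style partition folders; bucket/folder is a placeholder, and this is not necessarily the same as the approach linked above):

import gcsfs
import pyarrow.dataset as ds
from pyarrow.fs import FSSpecHandler, PyFileSystem

# Wrap the gcsfs filesystem so the pyarrow.dataset API can use it
fs = PyFileSystem(FSSpecHandler(gcsfs.GCSFileSystem()))

# "hive" partitioning turns DATA_PART=... directories into a column;
# note: no gs:// scheme when a filesystem object is passed explicitly
dataset = ds.dataset("bucket/folder", filesystem=fs, partitioning="hive")
df = dataset.to_table().to_pandas()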

@freedomtowin

cool, thank you

@samos123

It worked perfectly for me! Thanks a bunch!
