import gcsfs
import pyarrow.parquet  # the parquet submodule must be imported explicitly


def read_parquet(gs_directory_path, to_pandas=True):
    """
    Reads multiple (partitioned) parquet files from a GS directory
    e.g. 'gs://<bucket>/<directory>' (without ending /)
    """
    gs = gcsfs.GCSFileSystem()
    arrow_df = pyarrow.parquet.ParquetDataset(gs_directory_path, filesystem=gs)
    if to_pandas:
        return arrow_df.read_pandas().to_pandas()
    return arrow_df
Hi @felipejardimf and @rjurney!
It has been a while since I've worked with GS - I'm not currently able to reproduce it.
Just to be sure, please note that the gs_directory_path argument is the path of the "folder" in which the files are stored, without the ending /.
For the structure
gs://bucket/folder/DATA_PART=201801/1.parquet
gs://bucket/folder/DATA_PART=201801/2.parquet
gs://bucket/folder/DATA_PART=201802/1.parquet
gs://bucket/folder/DATA_PART=201801/2.parquet
the argument gs_directory_path would be gs://bucket/folder/DATA_PART=201801.
For the structure
gs://bar-foo/derived/2021_01/person_company_examples.parquet/
gs://bar-foo/derived/2021_01/person_company_examples.parquet/_SUCCESS
gs://bar-foo/derived/2021_01/person_company_examples.parquet/part-00000-1fkd-2614-4922-aee4-815c44abf-c000.snappy.parquet
the argument gs_directory_path would be gs://bar-foo/derived/2021_01/person_company_examples.parquet.
Can you try with these arguments and see if it works? If it doesn't, please share the error message or behavior and I can try to debug with you.
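The trailing-slash rule described above can be sketched as a small helper. This is only an illustration: `normalize_gs_directory_path` is a hypothetical name, not part of the gist or of any library, and it assumes the path always starts with `gs://`.

```python
def normalize_gs_directory_path(path):
    """Return a gs:// "folder" path without any trailing slash.

    Hypothetical helper for illustration; not part of the gist.
    """
    if not path.startswith("gs://"):
        raise ValueError(f"expected a gs:// path, got {path!r}")
    # Drop one or more trailing slashes so the path names the folder itself.
    return path.rstrip("/")
```

The result could then be passed as gs_directory_path, e.g. `read_parquet(normalize_gs_directory_path(raw_path))`, where `raw_path` is whatever path you start from.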
Hey @lpillmann !
Yeah, if I set this path I can reach the files: gs://bucket/folder/DATA_PART=201801
But how can I access paths like this? gs://bucket/folder/*
I ask because in other environments I can usually look for this path.
Thank you for your help!!
Got it @felipejardimf.
I'd expect PyArrow to be able to read from that path if you pass gs://bucket/folder as gs_directory_path.
However, I'm not able to test it right now. You might want to take a look at the pyarrow.parquet.ParquetDataset documentation and see if you need to tweak any of the parameters in order for that to work.
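For context, directory names such as DATA_PART=201801 follow the "key=value" (Hive-style) partition naming convention, which is why pointing a dataset reader at the parent folder can, in principle, cover every partition at once. The sketch below only illustrates how that convention encodes a column in the path; `partition_values` is a hypothetical helper and is not how PyArrow implements partition discovery.

```python
from pathlib import PurePosixPath


def partition_values(path):
    """Return {key: value} for each 'key=value' directory segment in a path.

    Hypothetical helper for illustration; it ignores the final segment,
    which is assumed to be the file name.
    """
    # Drop the scheme ("gs://") if present, then split into path segments.
    segments = PurePosixPath(path.split("://", 1)[-1]).parts
    return dict(seg.split("=", 1) for seg in segments[:-1] if "=" in seg)


print(partition_values("gs://bucket/folder/DATA_PART=201801/1.parquet"))
# prints {'DATA_PART': '201801'}
```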
Hi everyone!
Unfortunately, I got errors like the ones below.
OSError: Passed non-file path: gs://<bucket>/<folder>
or
ArrowInvalid: Parquet file size is 0 bytes
I found another way here to achieve the same, which could hopefully help someone.
Note that pandas does not support this
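One possible cause of the `ArrowInvalid: Parquet file size is 0 bytes` error reported above (an assumption; the thread does not confirm it) is a zero-byte object in the folder, such as a `_SUCCESS` marker or a folder placeholder, being treated as a data file. A minimal sketch of filtering a listing before reading is shown below. `keep_readable_parquet_entries` is a hypothetical helper, and it assumes the listing has the shape that fsspec-style filesystems return from `ls(path, detail=True)`: dicts with 'name', 'size' and 'type' keys.

```python
def keep_readable_parquet_entries(entries):
    """Return the names of non-empty '.parquet' files from a detailed listing.

    Hypothetical helper for illustration. `entries` is assumed to be a list
    of dicts with 'name', 'size' and 'type' keys.
    """
    kept = []
    for entry in entries:
        basename = entry["name"].rstrip("/").rsplit("/", 1)[-1]
        if entry.get("type") != "file":
            continue  # skip directories
        if not entry.get("size"):
            continue  # skip zero-byte objects (markers, folder placeholders)
        if basename.startswith(("_", ".")) or not basename.endswith(".parquet"):
            continue  # skip marker/hidden files and non-parquet objects
        kept.append(entry["name"])
    return kept
```

The resulting list of file paths could then be handed to the reader in place of the directory path, if the reader in use accepts a list of files.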
cool, thank you
It worked perfectly for me! Thanks a bunch!
@lpillmann thanks!!
I can't use it with the GS filesystem either.
I have this structure:
gs://bucket/folder/DATA_PART=201801/1.parquet
gs://bucket/folder/DATA_PART=201801/2.parquet
gs://bucket/folder/DATA_PART=201802/1.parquet
gs://bucket/folder/DATA_PART=201801/2.parquet
It seems that none of the engines can reach the GS path.
Can anyone help with that?