Last active: November 22, 2022 14:45
Read CSV files from a tar.gz archive in S3 into pandas DataFrames without untarring or downloading (using s3fs, tarfile, io, and pandas)
# -- read csv files from tar.gz in S3 with S3FS and tarfile (https://s3fs.readthedocs.io/en/latest/)
import io
import tarfile

import pandas as pd
import s3fs

bucket = 'mybucket'
key = 'mycompressed_csv_files.tar.gz'

fs = s3fs.S3FileSystem()
f = fs.open(f'{bucket}/{key}', 'rb')
# pass the S3 file object via fileobj=; tarfile.open's first positional
# argument is a filesystem path and raises TypeError for file objects
tar = tarfile.open(fileobj=f, mode='r:gz')
csv_files = [m.name for m in tar.getmembers() if m.name.endswith('.csv')]
csv_file = csv_files[0]  # here we read the first csv file only
csv_contents = tar.extractfile(csv_file).read()
df = pd.read_csv(io.BytesIO(csv_contents), encoding='utf8')
f.close()
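The snippet above reads only the first CSV member. A sketch of extending it to read every member into one concatenated DataFrame, exercised here against an in-memory tar.gz instead of S3 (the helper name `read_all_csvs` and the sample file names are mine, not part of s3fs or pandas):

```python
import io
import tarfile

import pandas as pd

def read_all_csvs(fileobj):
    """Read every .csv member of a tar.gz into one concatenated DataFrame."""
    frames = []
    with tarfile.open(fileobj=fileobj, mode='r:gz') as tar:
        for member in tar.getmembers():
            if member.name.endswith('.csv'):
                data = tar.extractfile(member).read()
                frames.append(pd.read_csv(io.BytesIO(data), encoding='utf8'))
    return pd.concat(frames, ignore_index=True)

# Build a small tar.gz in memory so the helper can be tried without S3;
# with S3 you would pass fs.open(f'{bucket}/{key}', 'rb') instead.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode='w:gz') as tar:
    for name, body in [('a.csv', 'x,y\n1,2\n'), ('b.csv', 'x,y\n3,4\n')]:
        info = tarfile.TarInfo(name=name)
        payload = body.encode('utf8')
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
buf.seek(0)

df = read_all_csvs(buf)
print(len(df))  # one row per member file: 2
```

The same `fileobj=` pattern keeps everything streaming: nothing is written to local disk at any point.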
Hey @rrpelgrim, it's been a while since I've used this, but it was working. The new awswrangler package by AWS might be a better option: https://github.com/awslabs/aws-data-wrangler
Thanks for the tip, taking a look now.
@iamaziz - any chance you could point me in the right direction within awswrangler? The wr.s3.read_csv doesn't read the .tgz compressed file... would really appreciate it 🙏
Thank you very much for sharing. Much appreciated!
Thanks for sharing this gist.
I'm getting a TypeError: expected str, bytes or os.PathLike object, not S3File. Does this work for you?
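That TypeError usually means the S3 file object was passed as tarfile.open's first positional argument, which must be a filesystem path; passing it via the fileobj= keyword avoids it. A minimal local reproduction, with an in-memory archive standing in for the S3File (no S3 required):

```python
import io
import tarfile

# Build a tiny tar.gz in memory to stand in for the file opened from S3.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode='w:gz') as archive:
    info = tarfile.TarInfo(name='data.csv')
    payload = b'a,b\n1,2\n'
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))

# Passing the file object positionally fails: tarfile.open's first
# parameter is `name`, a path on disk.
buf.seek(0)
try:
    tarfile.open(buf, 'r:gz')
    error = None
except TypeError as exc:
    error = str(exc)
print(error)

# Passing it as fileobj= works.
buf.seek(0)
with tarfile.open(fileobj=buf, mode='r:gz') as tar:
    names = tar.getnames()
print(names)  # ['data.csv']
```

The same fix applies to the gist: tarfile.open(fileobj=f, mode='r:gz') instead of tarfile.open(f, 'r:gz').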