A quick and easy CSV-to-Parquet converter from one S3 bucket to another. It can be attached to a Lambda function triggered whenever an s3:PutObject event fires (a sample handler sketch follows the code below).
from io import BytesIO
from os import environ

import boto3
import pandas as pd


def convert(bucket, key):
    """Read a CSV object from S3 and write it back out as Parquet
    to a sibling bucket named "<bucket>-parquet"."""
    s3_client = boto3.client('s3', region_name=environ['REGION'])
    s3_object = s3_client.get_object(Bucket=bucket, Key=key)
    df = pd.read_csv(s3_object['Body'])

    # Parquet requires string column names; assign the result back,
    # since astype() returns a new index rather than mutating in place.
    df.columns = df.columns.astype(str)

    # Name the output after the portion of the key before the first hyphen,
    # e.g. "sales-2022-04.csv" becomes "sales.parquet".
    target_bucket = f"{bucket}-parquet"
    target_key = f"{key.split('-')[0]}.parquet"

    # Serialize to an in-memory buffer rather than the local filesystem.
    parquet_out_buffer = BytesIO()
    df.to_parquet(parquet_out_buffer, index=False, engine='fastparquet')

    s3_res = boto3.resource('s3')
    s3_res.Object(target_bucket, target_key).put(Body=parquet_out_buffer.getvalue())

    return {
        'bucket': target_bucket,
        'key': target_key
    }
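If you wire this up to an S3 trigger as described above, a minimal handler sketch could look like the following. The event shape is the standard S3 notification payload; lambda_handler is an assumed entry-point name, not part of the original gist.

from urllib.parse import unquote_plus


def lambda_handler(event, context):
    # An S3 put notification can batch several records; convert each one.
    results = []
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # S3 URL-encodes object keys in event payloads (e.g. spaces become '+'),
        # so decode before passing the key to the S3 API.
        key = unquote_plus(record['s3']['object']['key'])
        results.append(convert(bucket, key))
    return results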
Also using fastparquet instead of pyarrow, as pyarrow pushes the deployment package over Lambda's 50 MB limit even with layers.
By the way, create a REGION environment variable in your Lambda function's configuration; convert() reads it when building the S3 client.
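If you'd rather not have the function raise a KeyError when that variable is missing, a small defensive variant of the client setup is sketched below; the us-east-1 fallback is purely illustrative and not part of the original gist.

# Fall back to a default region when REGION is unset.
region = environ.get('REGION', 'us-east-1')
s3_client = boto3.client('s3', region_name=region)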