Created November 4, 2020 22:31
Corrupt data landing in s3 via aws_s3.query_export_to_s3 extension
-- When using aws_s3.query_export_to_s3 on Aurora PostgreSQL 11, data is reliably and predictably corrupted at certain positions in the payloads that land in S3.
-- It appears that when data crosses a certain byte length, part of the stream is reset and resent mid-upload to S3.
-- We have been able to reproduce this reliably, and to reproduce a negative case as well: keeping record byte lengths at even powers of 2 appears to prevent the issue from occurring.
-- As the length of the data increases, so does the frequency of the corruption. For small data sets it may never surface.
-- Running this SQL, the corruption occurs once in 100,000 records at 128+9 bytes per record, twice at 256+9, four times at 512+9, and so on. Each record looks like {"A":""}\n, where \n is a newline control character, so the JSON wrapper and newline add 9 bytes on top of the value.
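The 9-byte overhead described above can be checked with a quick sketch (plain Python, not part of the repro; the key casing follows the text above):

```python
# Each exported record is the JSON wrapper plus a trailing newline:
# {"A":"<value>"}\n  -> 9 bytes of overhead around the value.
wrapper = '{"A":""}\n'
assert len(wrapper.encode("ascii")) == 9

# A 128-byte value therefore produces a 137-byte record (128+9),
# while a (128-9)-byte value produces an even 128-byte record.
record = '{"A":"' + "A" * 128 + '"}\n'
assert len(record.encode("ascii")) == 128 + 9
assert len(('{"A":"' + "A" * (128 - 9) + '"}\n').encode("ascii")) == 128
```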
-- these queries will result in broken json in s3
select * from aws_s3.query_export_to_s3(
    'select row_to_json(test) from (SELECT array_to_string(ARRAY(SELECT ''A'' FROM generate_series(1,128)), '''') A FROM generate_series(1,100000)) test',
    aws_commons.create_s3_uri(
        'example-bucket/debug',
        'test_data-128.jsonl',
        'us-east-1'),
    options := 'format text');

select * from aws_s3.query_export_to_s3(
    'select row_to_json(test) from (SELECT array_to_string(ARRAY(SELECT ''A'' FROM generate_series(1,256)), '''') A FROM generate_series(1,100000)) test',
    aws_commons.create_s3_uri(
        'example-bucket/debug',
        'test_data-256.jsonl',
        'us-east-1'),
    options := 'format text');

select * from aws_s3.query_export_to_s3(
    'select row_to_json(test) from (SELECT array_to_string(ARRAY(SELECT ''A'' FROM generate_series(1,512)), '''') A FROM generate_series(1,100000)) test',
    aws_commons.create_s3_uri(
        'example-bucket/debug',
        'test_data-512.jsonl',
        'us-east-1'),
    options := 'format text');

-- these queries will NOT result in broken json in s3
select * from aws_s3.query_export_to_s3(
    'select row_to_json(test) from (SELECT array_to_string(ARRAY(SELECT ''A'' FROM generate_series(1,128-9)), '''') A FROM generate_series(1,100000)) test',
    aws_commons.create_s3_uri(
        'example-bucket/debug',
        'test_data-128-9.jsonl',
        'us-east-1'),
    options := 'format text');

select * from aws_s3.query_export_to_s3(
    'select row_to_json(test) from (SELECT array_to_string(ARRAY(SELECT ''A'' FROM generate_series(1,256-9)), '''') A FROM generate_series(1,100000)) test',
    aws_commons.create_s3_uri(
        'example-bucket/debug',
        'test_data-256-9.jsonl',
        'us-east-1'),
    options := 'format text');

select * from aws_s3.query_export_to_s3(
    'select row_to_json(test) from (SELECT array_to_string(ARRAY(SELECT ''A'' FROM generate_series(1,512-9)), '''') A FROM generate_series(1,100000)) test',
    aws_commons.create_s3_uri(
        'example-bucket/debug',
        'test_data-512-9.jsonl',
        'us-east-1'),
    options := 'format text');
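To confirm corruption after downloading one of the exported .jsonl files, a minimal sketch like the following can flag bad lines. The helper name, the `"A"` key casing, and the expected value length are assumptions to match the repro above, not part of the extension's API:

```python
import json

def find_corrupt_lines(lines, expected_len):
    """Return 1-based line numbers whose JSON fails to parse or whose
    "A" value does not have the expected length (hypothetical helper)."""
    bad = []
    for n, line in enumerate(lines, start=1):
        try:
            obj = json.loads(line)
            ok = len(obj.get("A", "")) == expected_len
        except ValueError:
            ok = False
        if not ok:
            bad.append(n)
    return bad

# Synthetic demonstration: simulate a mid-stream restart corrupting line 2,
# where a partial record is followed by a resend of the whole record.
good = '{"A":"%s"}' % ("A" * 128)
corrupt = good[:50] + good
assert find_corrupt_lines([good, corrupt, good], 128) == [2]
```

Running it over a real export (e.g. `find_corrupt_lines(open("test_data-128.jsonl"), 128)`) should reproduce the counts described in the comments above.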
Discussion here: https://repost.aws/questions/QURCnz8badSv6TMXCNQVSYpw/corrupt-data-landing-in-s-3-via-aws-s-3-query-export-to-s-3-extension