@holypriest
Created December 10, 2020 02:00
How to move files to Glacier by leveraging Spark's distributed processing
import boto3
from pyspark.sql import Row


def move_file_to_glacier(list_of_rows):
    # Create the boto3 session inside the partition function: boto3 sessions and
    # clients cannot be serialized from the driver, and mapPartitions lets each
    # partition reuse a single session instead of creating one per record.
    sess = boto3.session.Session(region_name='us-east-1')
    s3res = sess.resource('s3')
    for row in list_of_rows:
        copy_source = {
            'Bucket': row[0],
            'Key': row[1]
        }
        # Copy the object into the destination bucket under the GLACIER storage class.
        s3res.meta.client.copy(
            CopySource=copy_source,
            Bucket='my-destination-bucket',
            Key=row[1],
            ExtraArgs={'StorageClass': 'GLACIER'}
        )
        yield Row(
            bucket=row[0],
            key=row[1],
            file_number=row[2],
            total_files=row[3]
        )


# `rows` is a list of (bucket, key, file_number, total_files) tuples and
# `sc` is the active SparkContext (e.g. from a PySpark shell or notebook).
files = sc.parallelize(rows).repartition(sc.defaultParallelism)
output = files.mapPartitions(move_file_to_glacier).toDF().cache()
print(f"Count: {output.count()} :: Total: {output.select('total_files').limit(1).collect()[0].total_files}")
output.unpersist()
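
The snippet assumes `rows` already exists on the driver. A minimal sketch of how that input could be built, assuming a hypothetical source bucket named `my-source-bucket` and the same (bucket, key, file_number, total_files) tuple layout used above:

import boto3

# Hypothetical source bucket; replace with the bucket holding the objects to archive.
source_bucket = 'my-source-bucket'

client = boto3.client('s3', region_name='us-east-1')
paginator = client.get_paginator('list_objects_v2')

# Collect every key in the bucket, then attach a running index and the total count
# so each Row emitted by move_file_to_glacier carries the same bookkeeping fields.
keys = [obj['Key']
        for page in paginator.paginate(Bucket=source_bucket)
        for obj in page.get('Contents', [])]

rows = [(source_bucket, key, i + 1, len(keys)) for i, key in enumerate(keys)]

Repartitioning to sc.defaultParallelism before mapPartitions spreads the copy calls across all available executor cores, so the S3-to-Glacier copies run concurrently instead of sequentially on the driver.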