Created March 13, 2023 02:08
Simple multi-thread R2/S3 backup
#!/usr/bin/env ruby

require 'aws-sdk-core'
require 'aws-sdk-s3'
require 'date'
require 'digest'
require 'fileutils'

date = Date.today.strftime('%Y-%m-%d')

client = Aws::S3::Client.new(
  access_key_id: 'XXXXXX',
  secret_access_key: 'XXXXXX',
  endpoint: 'https://XXXXXX.r2.cloudflarestorage.com/',
  region: 'auto',
)

client.list_buckets.buckets.each do |bucket|
  # Enqueue every object in the bucket; list_objects is pageable, so .each
  # yields one page of results at a time.
  queue = Queue.new
  client.list_objects(bucket: bucket.name).each do |response|
    response.contents.each do |object|
      queue << [bucket.name, object]
    rescue => e
      p e
      sleep 5
      retry
    end
  rescue => e
    p e
    sleep 5
    retry
  end

  # 8 download workers per bucket.
  threads = []
  8.times do
    threads << Thread.new do
      loop do
        begin
          # Non-blocking pop: raises ThreadError once the queue is drained,
          # so a worker can't hang waiting on a queue another thread emptied.
          bucket_name, object = queue.pop(true)
        rescue ThreadError
          break
        end

        begin
          path = "r2-backups/#{date}/#{bucket_name}"
          file_path = "#{path}/#{object.key}"
          FileUtils.mkdir_p(File.dirname(file_path)) # keys can contain "/" prefixes

          # Skip the download if the local copy already matches the remote ETag.
          if File.exist? file_path
            md5 = Digest::MD5.file(file_path).hexdigest
            etag = object.etag.gsub(/"/, '') # For some reason the etags are double-quoted
            next if etag == md5
          end

          puts "#{bucket_name}/#{object.key}"
          real_object = client.get_object(bucket: bucket_name, key: object.key)
          File.binwrite(file_path, real_object.body.read)
        rescue => e
          p e
          sleep 5
          retry
        end
      end
    end
  end
  threads.each(&:join)
end
This is a simple multi-threaded backup script for R2/S3. I'm using R2, so that's all I've tested against, but it uses the official AWS SDK gem so it should be compatible with S3 as well.
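For anyone reproducing this, the only external dependencies are the two AWS SDK gems the script requires; a minimal Gemfile would look something like this (version pins are omitted, any recent aws-sdk-s3 release should do):

# Gemfile (minimal sketch matching the requires in the script)
source 'https://rubygems.org'

gem 'aws-sdk-core'
gem 'aws-sdk-s3'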
It uses the current date as the base path and then a separate directory for each bucket.
It processes one bucket at a time, and uses 8 threads per bucket.
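If 8 workers per bucket is too many or too few, the count is easy to make configurable. A small sketch, where BACKUP_THREADS is a made-up environment variable name rather than something the script reads today:

# Sketch: make the per-bucket worker count configurable instead of hard-coding 8.
worker_count = Integer(ENV.fetch('BACKUP_THREADS', '8'))

threads = worker_count.times.map do
  Thread.new do
    # ...same worker body as in the script above...
  end
end
threads.each(&:join)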
It checks the MD5 of each local file against the ETag of the remote object and only downloads when they don't match. Before each run I recursively copy the most recent backup directory into place and then run this process over those local files, which reduces the total time.
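That pre-seeding step isn't part of the gist itself; a minimal sketch of it, assuming the same r2-backups/<date> layout the script writes, might look like:

#!/usr/bin/env ruby
# Sketch only: seed today's backup directory from the most recent previous one,
# so the MD5/ETag check in the backup script can skip unchanged objects.
require 'date'
require 'fileutils'

today_dir = "r2-backups/#{Date.today.strftime('%Y-%m-%d')}"
previous  = Dir.glob('r2-backups/*').sort.last # newest existing backup, since ISO dates sort chronologically

if previous && previous != today_dir && !Dir.exist?(today_dir)
  FileUtils.cp_r(previous, today_dir)
end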
Let me know if you have any questions or find any bugs; happy to iterate.