@aerostitch
Last active August 29, 2015 14:07
How to guess the chunk size of an AWS file uploaded using the multipart upload method.
#!/usr/bin/env ruby
##
# This gist tries to guess the chunk size, in megabytes, used for the multipart
# upload of an AWS S3 object, using a local copy of the file and the ETag of
# the remote object.
# It assumes the chunk size chosen for the upload is a whole number of
# megabytes (1 MB = 1024 * 1024 bytes here).
#
# To run correctly, it requires the `get_etag` function from
# https://gist.github.com/aerostitch/538eda0b2d1d8dd914bf
#
# You can get the etag of the remote file by using the `get_AWS_file_checksum` function
# from https://gist.github.com/aerostitch/22992e88315215f100b8
#
# == Parameters:
# local_file::
#   Path to the local copy of the remote AWS file you want to work with.
#
# remote_etag::
#   The ETag of the AWS object.
#
# == Returns:
# The chunk size in megabytes, or 0 if the remote ETag does not have the
# multipart format (i.e. the file was not uploaded using the multipart upload
# method).
#
# == Raises:
# A RuntimeError if the local file does not exist, or if no whole-megabyte
# chunk size reproduces the remote ETag.
#
# == Examples:
# We are using the `get_AWS_file_checksum` function from the gist specified
# earlier to get the remote ETag.
# The local copy of the file specified here is /tmp/my_file.tar.gz.
#   remote_file_etag = get_AWS_file_checksum(
#     'ABCDEFGHIJKLMNOPQRS',
#     'abcdefghijklmnopqrstuvwxyz/1234567890za',
#     's3://my_bucket/documents/myfile.tar.gz',
#   )
#   puts guess_multipart_chunk_MB_from_etag(
#     '/tmp/my_file.tar.gz',
#     remote_file_etag
#   )
#
#
# Author:: Joseph Herlant ([email protected])
# Copyright:: Copyright (c) 2014 Joseph Herlant
# License:: Distributed under the terms of the Apache 2 license
#
def guess_multipart_chunk_MB_from_etag(local_file, remote_etag)
  # If not multipart, returns 0
  chunk_size = 0
  # Strip the surrounding quotes without mutating the caller's string
  remote_etag = remote_etag.delete('"')
  # If you are running ruby < 1.9, use these 2 lines instead of the 1st if:
  #   chunks = remote_etag.gsub(/^[a-fA-F0-9]{32}\-/,'')
  #   if chunks != remote_etag
  # Multipart uploads are recognized by this format:
  # <32 hex digits>-<number of parts>
  if /^(?<remote_md5>\h{32})\-(?<chunks>\d+)$/ =~ remote_etag
    raise("Local file #{local_file} not found.") unless File.exist?(local_file)
    # Here we suppose that the chunk size will be in MB
    file_size = File.size(local_file).to_f / 1024 / 1024
    # The named capture is a String: convert it to the number of parts
    chunks = chunks.to_i
    # Here we calculate the min and max possible sizes of the chunk.
    # A chunk size c yields `chunks` parts only if
    # file_size / chunks <= c < file_size / (chunks - 1).
    min_sz = [(file_size / chunks).ceil, 1].max
    max_sz = chunks > 1 ? (file_size / (chunks - 1)).to_i : min_sz
    # Now checking the etags for the possible chunk sizes
    (min_sz..max_sz).each do |sz|
      local_etag = get_etag(local_file, sz * 1024 * 1024)
      if local_etag == "#{remote_md5}-#{chunks}"
        chunk_size = sz
        break
      end
    end
    raise('Chunk size not found. Local and remote files seem different.') if chunk_size == 0
  end
  chunk_size
end
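
The function above depends on a `get_etag` helper that lives in a separate gist and is not reproduced here. As a rough sketch only, and not the linked gist's actual code, the stand-in below shows what such a helper might look like. It assumes that `chunk_size` is given in bytes, that every part except the last has exactly that size, and that the multipart ETag is the MD5 of the concatenated binary MD5 digests of the parts, followed by `-` and the part count (the commonly observed behavior for S3 objects that are not encrypted with KMS or customer-provided keys).

```ruby
require 'digest'

# Hypothetical stand-in for the `get_etag` helper from the linked gist.
# Assumptions: chunk_size is in bytes, and all parts but the last have
# exactly that size.
def get_etag(file_path, chunk_size)
  raise ArgumentError, 'chunk_size must be positive' unless chunk_size > 0

  part_digests = []
  File.open(file_path, 'rb') do |f|
    # Read one part at a time; IO#read returns nil at end of file.
    # Note: each part is held in memory while it is being hashed.
    while (part = f.read(chunk_size))
      part_digests << Digest::MD5.digest(part)
    end
  end
  # MD5 of the concatenated binary part digests, then "-<number of parts>"
  "#{Digest::MD5.hexdigest(part_digests.join)}-#{part_digests.size}"
end
```

With this definition loaded before `guess_multipart_chunk_MB_from_etag`, each candidate chunk size costs one full read of the local file, so the search is only practical because the range of candidate sizes is narrow.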