Last active
August 29, 2015 14:07
How to guess the chunk size of an AWS file uploaded using the multipart upload method.
#!/usr/bin/env ruby
##
# This gist tries to guess the chunk size, in megabytes, of a multipart upload
# of an AWS file, using its local copy and the etag of the remote file.
# It assumes the chunk size chosen for the upload is a whole number of
# megabytes (1 MB = 1024 * 1024 bytes here).
#
# To run correctly, it requires the `get_etag` function from
# https://gist.github.com/aerostitch/538eda0b2d1d8dd914bf
#
# You can get the etag of the remote file by using the `get_AWS_file_checksum` function
# from https://gist.github.com/aerostitch/22992e88315215f100b8
#
# == Parameters:
# local_file::
#   The path of the local copy of the remote AWS file you want to work with.
#
# remote_etag::
#   The etag of the AWS object.
#
# == Returns:
# The chunk size in megabytes, or 0 if the remote etag belongs to a file that
# was not uploaded using the multipart upload method.
#
# == Raises:
# A RuntimeError if the local file does not exist, or if no candidate chunk
# size reproduces the remote etag.
#
# == Examples:
# We are using the `get_AWS_file_checksum` function from the gist specified
# earlier to get the remote etag.
# The local copy of the file specified here is /tmp/my_file.tar.gz.
#   remote_file_etag = get_AWS_file_checksum(
#     'ABCDEFGHIJKLMNOPQRS',
#     'abcdefghijklmnopqrstuvwxyz/1234567890za',
#     's3://my_bucket/documents/myfile.tar.gz',
#   )
#   puts guess_multipart_chunk_MB_from_etag(
#     '/tmp/my_file.tar.gz',
#     remote_file_etag
#   )
#
# Author:: Joseph Herlant ([email protected])
# Copyright:: Copyright (c) 2014 Joseph Herlant
# License:: Distributed under the terms of the Apache 2 license
#
def guess_multipart_chunk_MB_from_etag(local_file, remote_etag)
  # If not multipart, returns 0
  chunk_size = 0
  # Strip the surrounding quotes without mutating the caller's string
  remote_etag = remote_etag.tr('"', '')
  # If you are running ruby < 1.9, use these 2 lines instead of the 1st if:
  #   chunks = remote_etag.gsub(/^[a-fA-F0-9]{32}\-/,'')
  #   if chunks != remote_etag
  # Multipart uploads are recognized by this format: <32 hex digits>-<number of parts>
  if /^(?<remote_md5>\h{32})-(?<chunks>\d+)$/ =~ remote_etag
    raise("Local file #{local_file} not found.") unless File.exist?(local_file)
    # Here we suppose that the chunk size will be in MB
    file_size = File.size(local_file).to_f / 1024 / 1024
    # Use the number of parts captured from the etag
    chunks = chunks.to_i
    # Here we calculate the min and max possible sizes of the chunk.
    # With N parts of size s: (N - 1) * s < file_size <= N * s.
    min_sz = [(file_size / chunks).ceil, 1].max
    # With a single part, any size >= the file size matches, so only the
    # smallest candidate is tested.
    max_sz = chunks > 1 ? (file_size / (chunks - 1)).to_i : min_sz
    # Now checking the etags for the possible chunk sizes
    (min_sz..max_sz).each do |sz|
      local_etag = get_etag(local_file, sz * 1024 * 1024)
      if local_etag == "#{remote_md5}-#{chunks}"
        chunk_size = sz
        break
      end
    end
    raise('Chunk size not found. Local and remote files seem different.') if chunk_size == 0
  end
  chunk_size
end
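
The function above depends on a `get_etag` helper that lives in a separate gist and is not reproduced here. As a rough sketch of what such a helper presumably computes, the block below assumes the commonly described multipart etag scheme: the MD5 of the concatenated binary MD5 digests of each part, followed by a dash and the number of parts. The name `sketch_multipart_etag` is hypothetical and is not the author's function; objects uploaded with some server-side encryption modes may not follow this scheme.

```ruby
require 'digest'

# Hypothetical stand-in for the `get_etag` helper referenced above.
# Assumption: the etag is MD5(concatenated binary MD5 of each part) + "-<parts>".
def sketch_multipart_etag(path, chunk_size_bytes)
  # A zero-byte chunk size would make the read loop below spin forever.
  raise ArgumentError, 'chunk size must be positive' unless chunk_size_bytes.positive?

  part_digests = []
  File.open(path, 'rb') do |io|
    # IO#read(n) returns nil at end of file, which ends the loop.
    while (part = io.read(chunk_size_bytes))
      part_digests << Digest::MD5.digest(part)
    end
  end
  "#{Digest::MD5.hexdigest(part_digests.join)}-#{part_digests.size}"
end

# Usage, with an 8 MB chunk size on the example file from the header comment:
#   puts sketch_multipart_etag('/tmp/my_file.tar.gz', 8 * 1024 * 1024)
```

Reading the file one part at a time keeps memory use bounded by the chunk size, which matters because the guessing loop may call the helper once per candidate size.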