Created
April 23, 2020 22:54
Calculate S3 ETag
#!/bin/bash
set -euo pipefail

help() { cat << EOF
Calculates S3 etag
USAGE:
  s3etag [OPTIONS] input_file [chunk_size_in_MB]

Files uploaded to Amazon S3 that are smaller than 1GB have an etag that is the
MD5 checksum of the uploaded file; however, when a file is larger than 1GB,
it is broken up into N chunks (of 'chunk_size_in_MB' size) and a checksum is calculated for
each chunk. These chunk checksums are concatenated together and a final checksum
is calculated. This script takes a file, breaks it into N chunks according to the user-defined
chunk_size_in_MB argument, and calculates the checksum of checksums for these chunks.

Positional Arguments:
  [1] input_file         Calculate S3 etag of this file
  [2] chunk_size_in_MB   Chunk size in MB for S3 etag calculation [Default: 5]
OPTIONS:
  -h, --help             Displays usage and help information
NOTE:
  If a file's size is less than 50 GB, then chunk_size_in_MB should be set to 5
  If a file's size is greater than 50 GB, then chunk_size_in_MB should be set to 50
Examples:
  ./s3etag /path/to/file.fastq.gz 5   # Calculates etag, breaking up file into 5MB chunks
  ./s3etag -h                         # Display usage and help information
EOF
}
s3etag(){
    # Calculate S3 etag
    file="$1" && if [ ! -f "$file" ]; then echo "Error: File ${file} not found!" >&2; help; exit 1; fi
    partSizeInMb="$2"
    checksumFile="$3"
    # Use the exact byte count: 'du -m' reports rounded disk usage and can overcount parts
    fileSizeInBytes=$(wc -c < "$file")
    partSizeInBytes=$((partSizeInMb * 1024 * 1024))
    parts=$(( (fileSizeInBytes + partSizeInBytes - 1) / partSizeInBytes ))
    # Break up file and calculate individual checksums of chunks
    for (( part=0; part<parts; part++ )); do
        skip=$((partSizeInMb * part))
        # Append only the hex digest (md5sum also prints a trailing '  -')
        dd bs=1M count="$partSizeInMb" skip="$skip" if="$file" 2> /dev/null | md5sum | cut -d ' ' -f 1 >> "$checksumFile"
    done
    # Calculate checksum of checksums: MD5 of the concatenated binary digests, suffixed with '-<parts>'
    etag="$(xxd -r -p "$checksumFile" | md5sum | cut -d ' ' -f 1)-${parts}"
    echo -e "${file}\t${etag}"
}

# Main: check usage
if [ $# -eq 0 ]; then help; exit 1; fi

# Check options
case "$1" in
    -h | --help) help && exit 0;;
    -*) help && exit 1;;
esac

# Parse Args
file="$1"
partSizeInMb="${2:-5}" # Default chunk size: 5MB
checksumFile=$(mktemp -t s3md5.XXXXXXXXXXXXX) # tmp file for appending checksums
# Single quotes defer expansion until the trap fires, and the inner quotes protect odd paths
trap 'rm -f "$checksumFile"' EXIT

# Calculate S3 etag
s3etag "$file" "$partSizeInMb" "$checksumFile"
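
The chunk-size guidance in the NOTE of the help text follows from S3's limit of 10,000 parts per multipart upload: at 5 MB per part, a file can be at most about 50 GB. Below is a minimal sketch of that part-count arithmetic, assuming 1 MB here means 1 MiB (1,048,576 bytes, matching `dd bs=1M`). The names `part_count` and `MAX_PARTS` are hypothetical and are not part of the script.

```bash
#!/bin/bash
# Sketch only: part_count and MAX_PARTS are illustrative names, not part of s3etag.

MAX_PARTS=10000   # S3 allows at most 10,000 parts per multipart upload

# Number of parts for a file of <bytes> split into chunks of <part_mb> MiB.
# Uses ceiling division so a trailing partial chunk counts as one part.
part_count() {
    local bytes="$1" part_mb="$2"
    local part_bytes=$(( part_mb * 1024 * 1024 ))
    echo $(( (bytes + part_bytes - 1) / part_bytes ))
}

part_count 10485760 5      # exactly 10 MiB in 5 MiB chunks -> 2
part_count 10485761 5      # one byte more needs a third part -> 3

# Largest file that fits in MAX_PARTS chunks of 5 MiB: 52,428,800,000 bytes.
# One byte beyond that exceeds the limit, so a larger chunk size is needed.
if [ "$(part_count 52428800001 5)" -gt "$MAX_PARTS" ]; then
    echo "5 MiB chunks are too small for this file; use a larger chunk size"
fi
```

The etag printed by the script only matches the one stored in S3 if the chunk size passed to it equals the part size the upload client actually used.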