Calculate checksum corresponding to the entity-tag hash (ETag) of Amazon S3 objects
#!/bin/bash
#
# Calculate checksum corresponding to the entity-tag hash (ETag) of Amazon S3 objects
#
# Usage: compute_etag.sh <filename> <part_size_mb>
#
# filename: file to process
# part_size_mb: chunk size in MiB used for multipart uploads.
#               This is 8 MiB by default for the AWS CLI. See:
#               https://docs.aws.amazon.com/cli/latest/topic/s3-config.html#multipart_chunksize
#
# The ETag for an S3 object can be obtained from the command line using:
# aws s3api head-object --bucket <bucket-name> --key <key-name> --query ETag --output text
# Note that the ETag may or may not correspond to the MD5 digest; see here for details:
# https://docs.aws.amazon.com/AmazonS3/latest/API/API_Object.html
#
# Adapted from: https://gist.github.com/emersonf/7413337
#
# Changes
# 7/23/2022
# - Parallelized hash calculation
# - Removed need for temporary files
#
# Script requires: dd, md5sum, xxd, du, nproc
set -euo pipefail

NUM_PARALLEL=$(nproc)

# Minimum filesize in bytes to switch to multipart uploads
# https://docs.aws.amazon.com/cli/latest/topic/s3-config.html#multipart-threshold
MULTIPART_MINSIZE=$((8*1024*1024))

if [[ $# -ne 2 ]]; then
    echo "Usage: $0 file partSizeInMb"
    exit 1
fi

file="$1"
partSizeInMb=$2

if [[ ! -f "$file" ]]; then
    echo "Error: $file not found."
    exit 1
fi
# Calculate checksum for a specified file chunk
# inputs: file, partSizeInMb, chunk
# output: chunk number + md5sum of that chunk
hash_chunk(){
    local file="$1"
    local partSizeInMb="$2"
    local chunk="$3"
    local skip=$((partSizeInMb * chunk))
    # Emit "<chunk> <md5>" so the per-chunk digests can be sorted back into order later.
    # The command substitution runs md5sum with its stdin connected to the pipe from dd,
    # so md5sum hashes the chunk that dd extracts.
    dd bs=1M count="$partSizeInMb" skip="$skip" if="$file" 2> /dev/null | echo -e "$chunk $(md5sum)"
}
# Integer quotient a/b, rounded up
div_round_up(){
    echo $(( ($1 + $2 - 1) / $2 ))
}

partSizeInB=$((partSizeInMb * 1024 * 1024))
fileSizeInB=$(du -b "$file" | cut -f1)
parts=$(div_round_up "$fileSizeInB" "$partSizeInB")
if [[ $fileSizeInB -gt $MULTIPART_MINSIZE ]]; then
    export -f hash_chunk
    etag=$(seq 0 $((parts-1)) | \
        xargs -P "${NUM_PARALLEL}" -I{} bash -c 'hash_chunk "$@"' -- "$file" "$partSizeInMb" {} | \
        sort -n -k1,1 | tr -s ' ' | cut -f2,3 -d' ' | xxd -r -p | md5sum | cut -f1 -d' ')"-$parts"
else
    etag=$(md5sum "$file" | cut -f1 -d' ')
fi

echo -e "${file}\t${etag}"
I modified the script slightly (my version is here). S3 doesn't support more than 10,000 parts per multipart upload. I've seen some other libraries (like Uppy) switch to a variable part size when they hit 10,000 parts (see this issue for an example of the calculation). My modification takes the 10,000-part limit into account and switches to a variable part-size calculation when needed, roughly as sketched below. The part size is also printed at the end of the output so you can see which part size was actually used.
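
A minimal sketch of that adjustment, assuming whole-MiB part sizes; the function and variable names below are illustrative, and the exact rounding used in my version may differ:

```bash
MAX_PARTS=10000

# Return a part size (in MiB) that keeps the upload within S3's 10,000-part limit.
adjust_part_size_mb() {
    local fileSizeInB="$1" partSizeInMb="$2"
    local partSizeInB=$((partSizeInMb * 1024 * 1024))
    local parts=$(( (fileSizeInB + partSizeInB - 1) / partSizeInB ))
    if (( parts > MAX_PARTS )); then
        # Smallest part size that fits the file in 10,000 parts,
        # rounded up to a whole MiB.
        local minPartSizeInB=$(( (fileSizeInB + MAX_PARTS - 1) / MAX_PARTS ))
        partSizeInMb=$(( (minPartSizeInB + 1024*1024 - 1) / (1024*1024) ))
    fi
    echo "$partSizeInMb"
}

# Example: recompute the part size before hashing
# partSizeInMb=$(adjust_part_size_mb "$fileSizeInB" "$partSizeInMb")
```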
Verified working with correct output on macOS with the added steps:

```
brew install coreutils
PATH="/opt/homebrew/opt/coreutils/libexec/gnubin:$PATH"
```

This will set you up with the otherwise missing `md5sum` and `nproc` (equivalent to `sysctl -n hw.logicalcpu`), and provides a compatible `du` (the `du` that ships with macOS does not have the `-b` flag). The `PATH` modification is only needed because `du` ships with macOS and installing `coreutils` exposes the GNU versions as `g`-prefixed commands (e.g. `gdu`) when that directory is not on the `PATH`. You can also choose to forego the `PATH` modification and change the `du` invocation in the script to use `gdu` instead.
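
For example, skipping the `PATH` change, a sketch of the alternative (assuming Homebrew's `coreutils` with its default `g` prefix) is to swap in the `g`-prefixed tools wherever the script relies on GNU behavior:

```bash
# In the script, replace the GNU-specific calls with their g-prefixed equivalents:
NUM_PARALLEL=$(gnproc)
fileSizeInB=$(gdu -b "$file" | cut -f1)
etag=$(gmd5sum "$file" | cut -f1 -d' ')   # likewise use gmd5sum inside hash_chunk and the multipart pipeline
```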