#!/bin/bash
if [ $# -ne 2 ]; then
echo "Usage: $0 file partSizeInMb";
exit 0;
fi
file=$1
if [ ! -f "$file" ]; then
echo "Error: $file not found."
exit 1;
fi
partSizeInMb=$2
fileSizeInMb=$(du -m "$file" | cut -f 1)
# Number of parts, rounded up so a final partial part is counted.
parts=$((fileSizeInMb / partSizeInMb))
if [[ $((fileSizeInMb % partSizeInMb)) -gt 0 ]]; then
parts=$((parts + 1));
fi
checksumFile=$(mktemp -t s3md5)
# MD5 each part and append the hex digests to the temp file
# (OS X tools: dd takes bs=1m, and the digest command is md5).
for (( part=0; part<$parts; part++ ))
do
skip=$((partSizeInMb * part))
dd bs=1m count=$partSizeInMb skip=$skip if="$file" 2>/dev/null | md5 >> "$checksumFile"
done
# The multipart ETag: MD5 of the concatenated binary part digests,
# followed by "-" and the part count.
echo "$(xxd -r -p "$checksumFile" | md5)-$parts"
rm "$checksumFile"
@RichardBronosky
I finally understand.
https://stackoverflow.com/questions/12186993/what-is-the-algorithm-to-compute-the-amazon-s3-etag-for-a-file-larger-than-5gb#answer-19896823
Note: If you uploaded with aws-cli via aws s3 cp, then you most likely have an 8 MB chunk size; according to the docs, that is the default. So for files uploaded that way, we should use a part size of 8:
$ ./s3etag.sh something.zip 8
Thanks for this, it helped me validate a heap of files I had in S3.
Note that AWS S3 supports a maximum of 10,000 parts. I recently exceeded this on a project with a 54 GB file (5 MB part size). The AWS SDK adjusts the part size to fit within 10,000 parts, and if you happen to exceed the limit, the expression below gives the part size needed to calculate the ETag correctly. I also specified the part size in bytes for better accuracy.
partsize = (filesize / 10000) + 1
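A minimal sketch of that adjustment, assuming the byte sizes and ceiling-division part count used elsewhere in this thread (the variable names are mine, and I'm not claiming this is exactly what the SDK does):
#!/bin/bash
# Sketch: the part size (in bytes) a multipart upload falls back to
# when the requested size would exceed S3's 10,000-part limit.
fileSizeInB=$(stat -c %s "$1")              # GNU stat; on OS X use: stat -f %z
requestedPartSizeInB=$(($2 * 1024 * 1024))  # requested part size, given in MB
maxParts=10000
parts=$(( (fileSizeInB + requestedPartSizeInB - 1) / requestedPartSizeInB ))  # ceiling division
if [[ $parts -gt $maxParts ]]; then
partSizeInB=$((fileSizeInB / maxParts + 1)) # the expression above
else
partSizeInB=$requestedPartSizeInB
fi
echo "$partSizeInB"
For the 54 GB / 5 MB example above this yields roughly 5.4 million bytes per part, which keeps the part count at exactly 10,000.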
Thank you. This is helpful. Are there any alternatives to xxd on Linux?
Awesome script! It doesn't work for SSE-KMS files, though, so if you happen to uncover any intel on how AWS generates the MD5 for that scenario, please share. Again, awesome job here.
Thanks for the script, this is incredibly helpful and we're incorporating it into our workflows!
Linux users
Here is an equivalent script if you are not using OSX. I hope this helps!
#!/bin/bash
set -euo pipefail
if [ $# -ne 2 ]; then
echo "Usage: $0 file partSizeInMb";
exit 0;
fi
file=$1
if [ ! -f "$file" ]; then
echo "Error: $file not found."
exit 1;
fi
partSizeInMb=$2
fileSizeInMb=$(du -m "$file" | cut -f 1)
parts=$((fileSizeInMb / partSizeInMb))
if [[ $((fileSizeInMb % partSizeInMb)) -gt 0 ]]; then
parts=$((parts + 1));
fi
checksumFile=$(mktemp -t s3md5.XXXXXXXXXXXXX)
for (( part=0; part<$parts; part++ ))
do
skip=$((partSizeInMb * part))
dd bs=1M count=$partSizeInMb skip=$skip if="$file" 2> /dev/null | md5sum >> "$checksumFile"
done
etag=$(echo $(xxd -r -p $checksumFile | md5sum)-$parts | sed 's/ --/-/')
echo -e "${1}\t${etag}"
rm $checksumFile
Thank you @skchronicles
I think there's an error in the parts calculation, now fixed below:
https://gist.github.com/emersonf/7413337?permalink_comment_id=3244707#gistcomment-3244707
#!/bin/bash
set -euo pipefail
if [ $# -ne 2 ]; then
echo "Usage: $0 file partSizeInMb";
exit 0;
fi
file=$1
if [ ! -f "$file" ]; then
echo "Error: $file not found."
exit 1;
fi
partSizeInMb=$2
partSizeInB=$((partSizeInMb * 1024 * 1024)) ### I added this
fileSizeInB=$(du -b "$file" | cut -f 1) ### I edited this
parts=$((fileSizeInB / partSizeInB)) ### I edited this and the next line
if [[ $((fileSizeInB % partSizeInB)) -gt 0 ]]; then
parts=$((parts + 1));
fi
checksumFile=$(mktemp -t s3md5.XXXXXXXXXXXXX)
for (( part=0; part<$parts; part++ ))
do
skip=$((partSizeInMb * part))
dd bs=1M count=$partSizeInMb skip=$skip if="$file" 2> /dev/null | md5sum >> "$checksumFile"
done
etag=$(echo $(xxd -r -p $checksumFile | md5sum)-$parts | sed 's/ --/-/')
echo -e "${1}\t${etag}"
rm $checksumFile
Thanks, this is quite useful.
I modified the script to speed up the hash computation and avoid generating temporary files. Link to script
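The linked script isn't shown here, but a minimal sketch of the temp-file-free idea (my own variable names, assuming GNU coreutils as in the Linux variant above) could look like this:
#!/bin/bash
# Sketch: same ETag computation as the scripts above, but the part
# digests are streamed through one pipeline instead of a temp file.
file=$1
partSizeInMb=$2
partSizeInB=$((partSizeInMb * 1024 * 1024))
fileSizeInB=$(stat -c %s "$file")
parts=$(( (fileSizeInB + partSizeInB - 1) / partSizeInB ))  # ceiling division
etag=$(
for (( part=0; part<$parts; part++ ))
do
dd bs=1M count=$partSizeInMb skip=$((partSizeInMb * part)) if="$file" 2> /dev/null | md5sum | cut -d' ' -f1
done | xxd -r -p | md5sum | cut -d' ' -f1
)
echo -e "${file}\t${etag}-${parts}"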
How do you know what part size was used/to use?
(Size: 9476171423, ETag: 44dab9123b49dab2c2b3b10c360ceda1-1130)
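One way to narrow it down (a sketch, not an official method): the -1130 suffix is the part count, so any candidate part size must satisfy ceil(size / partSize) = 1130. Checking whole-MiB sizes:
#!/bin/bash
# Sketch: list whole-MiB part sizes consistent with an observed part count.
fileSizeInB=9476171423   # the size from the question above
parts=1130               # the "-1130" ETag suffix
for mb in $(seq 1 64)
do
partSizeInB=$((mb * 1024 * 1024))
candidate=$(( (fileSizeInB + partSizeInB - 1) / partSizeInB ))  # ceiling division
if [[ $candidate -eq $parts ]]; then
echo "${mb} MiB gives ${candidate} parts"
fi
done
For these numbers only 8 MiB matches, which lines up with the aws-cli default mentioned earlier in the thread.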