@fabiolimace
Last active January 15, 2022 20:33
Deduplicate Photos

About

The program deduplicate-photos.sh helps deduplicate family photos.

List of sub-parts:

  • deduplicate-photos-copy.sh
  • deduplicate-photos-exif.sh
  • deduplicate-photos-sha1.sh

Dependencies

This program uses the EXIF tool. Install it by running the following in a terminal:

sudo apt install exif
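
Before a long run it may be worth confirming that the tool is actually available. The following is a minimal sketch, not part of the gist; `require_tool` is a hypothetical helper name. Note that the scripts call /usr/bin/exif by absolute path, while this check looks the command up on the search path.

```shell
#!/bin/bash
# Hypothetical helper (not part of the gist):
# succeed only if the named command can be found.
require_tool() {
    command -v "$1" > /dev/null 2>&1
}

if require_tool exif; then
    echo "exif found"
else
    echo "exif not found; install it with: sudo apt install exif"
fi
```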

How to use

Do not run this script on the original directory of your photos! Work on a copy instead, as described in the steps below.

  1. Make a WORKING_DIRECTORY:
mkdir WORKING_DIRECTORY
  2. Make an OUTPUT_DIRECTORY inside the WORKING_DIRECTORY:
mkdir WORKING_DIRECTORY/OUTPUT_DIRECTORY
  3. Copy the program deduplicate-photos.sh and its sub-parts to the WORKING_DIRECTORY:
cp deduplicate-photos*.sh WORKING_DIRECTORY/
  4. Make a full copy of the photos directory inside the WORKING_DIRECTORY:
cp -a "PHOTOS_DIRECTORY" "WORKING_DIRECTORY/FULL_COPY_OF_PHOTOS_DIRECTORY"

Warning: This step is important because it keeps the program from running on the original photos directory.

  5. Change the current directory to WORKING_DIRECTORY:
cd WORKING_DIRECTORY
  6. Execute the program:
./deduplicate-photos.sh "FULL_COPY_OF_PHOTOS_DIRECTORY" "OUTPUT_DIRECTORY"
  7. Follow the processing output from a second terminal (the timestamp in the file name is the start time of the run, so yours will differ):
tail -f OUTPUT_DIRECTORY/deduplicate-photos-20220115_165820-output.log
  8. Follow the processing errors:
tail -f OUTPUT_DIRECTORY/deduplicate-photos-20220115_165820-errors.log
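
Because the log names contain the start time of the run, the names above are only examples. Here is a small sketch for finding the newest log without typing the timestamp. `latest_log` is a hypothetical helper, not part of the gist, and it assumes the YYYYMMDD_HHMMSS stamps sort correctly as plain text.

```shell
#!/bin/bash
# Hypothetical helper: print the newest log of a given kind.
#   $1 = output directory
#   $2 = kind of log, "output" or "errors"
latest_log() {
    # The timestamp in the name sorts chronologically as text,
    # so the last name after sorting is the most recent run.
    ls "$1"/deduplicate-photos-*-"$2".log 2> /dev/null | LC_ALL=C sort | tail -n 1
}

# Usage, from inside WORKING_DIRECTORY:
#   tail -f "$(latest_log OUTPUT_DIRECTORY output)"
#   tail -f "$(latest_log OUTPUT_DIRECTORY errors)"
```

If no log of that kind exists yet, the helper prints nothing.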

deduplicate-photos-copy.sh

#!/bin/bash
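FILE="${1}"    # a sorted list of duplicates (*.duplicates.sorted)
DSTDIR="${2}"  # destination directory
# The first path in the list is the photo that gets copied.
# The variable is not named PATH, because that would overwrite
# the shell's command search path.
FILEPATH=$(head -n 1 "${FILE}");
echo "/bin/cp --archive --parents --link --update \"${FILEPATH}\" \"${DSTDIR}\"";
/bin/cp --archive --parents --link --update "${FILEPATH}" "${DSTDIR}";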

deduplicate-photos-exif.sh

#!/bin/bash
#
# Install the EXIF tool:
#
# sudo apt install exif
#
LIST="${1}"
DSTDIR="${2}"
EXT="${3}"
FILE=$(head -n 1 "${LIST}");
NAME=$(basename "${FILE}");
# The original name used to be the suffix,
# but it is no longer used, because
# it rarely carries useful information.
# SUFFIX=" ${NAME}";
SUFFIX=".${EXT}";
DATE=$(/usr/bin/exif --tag=DateTimeOriginal --machine-readable --no-fixup "${FILE}" | /usr/bin/cut -d ' ' -f 1 | /usr/bin/tr ':' '-');
DATETIME="${DATE}-"$(/usr/bin/exif --tag=DateTimeOriginal --machine-readable --no-fixup "${FILE}" | /usr/bin/cut -d ' ' -f 2 | /usr/bin/tr -d ':');
if [[ -z "${DATE}" ]]
then
    # No EXIF date: use the modification date of the file instead
    DATE=$(/usr/bin/stat --printf=%y "${FILE}" | /usr/bin/cut -c1-10);
    DATETIME=$(/usr/bin/stat --printf=%y "${FILE}" | /usr/bin/cut -c1-19 | /usr/bin/tr -d ':' | /usr/bin/tr ' ' '-');
    DATE="${DATE}_"; # add an underscore to the directory name
fi;
if [[ ! -d "${DSTDIR}/${DATE}" ]]
then
    /bin/mkdir "${DSTDIR}/${DATE}"
fi;
TARGET="${DSTDIR}/${DATE}/${DATETIME}";
if [[ ! -f "${TARGET}${SUFFIX}" && ! -f "${TARGET}-0000${SUFFIX}" ]]
then
    echo "/bin/cp --archive --link --update \"${FILE}\" \"${TARGET}${SUFFIX}\"";
    /bin/cp --archive --link --update "${FILE}" "${TARGET}${SUFFIX}";
else
    # Another photo already has this timestamp. Rename the first one to -0000,
    # but only once, so that an existing -0000 file is never overwritten.
    if [[ -f "${TARGET}${SUFFIX}" ]]
    then
        echo "/bin/mv \"${TARGET}${SUFFIX}\" \"${TARGET}-0000${SUFFIX}\"";
        /bin/mv "${TARGET}${SUFFIX}" "${TARGET}-0000${SUFFIX}";
    fi;
    # Store the new photo under the first free counter.
    for COUNTER in $(seq -w 1 9999); do
        if [[ ! -f "${TARGET}-${COUNTER}${SUFFIX}" ]]
        then
            echo "/bin/cp --archive --link --update \"${FILE}\" \"${TARGET}-${COUNTER}${SUFFIX}\"";
            /bin/cp --archive --link --update "${FILE}" "${TARGET}-${COUNTER}${SUFFIX}";
            break;
        fi;
    done;
fi;
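
The naming scheme used by this script can be tried on a plain string, without a photo at hand. The following is a minimal sketch that mirrors the script's cut and tr pipeline. `exif_to_names` is a hypothetical helper, not part of the gist, and the input is assumed to be in the usual EXIF form YYYY:MM:DD HH:MM:SS.

```shell
#!/bin/bash
# Hypothetical helper: turn an EXIF DateTimeOriginal value into
# "<directory>/<file name without extension>", as the script does.
exif_to_names() {
    # Date part: first field, colons replaced by hyphens.
    date_part=$(printf '%s\n' "$1" | cut -d ' ' -f 1 | tr ':' '-')
    # Time part: second field, colons removed.
    time_part=$(printf '%s\n' "$1" | cut -d ' ' -f 2 | tr -d ':')
    printf '%s/%s-%s\n' "$date_part" "$date_part" "$time_part"
}

exif_to_names "2022:01:15 16:58:20"   # prints 2022-01-15/2022-01-15-165820
```

The script then appends the extension, so with the default "jpg" this photo would end up as 2022-01-15/2022-01-15-165820.jpg inside the destination directory.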

deduplicate-photos-sha1.sh

#!/bin/bash
#
# Generates the lists of duplicates, one file at a time.
#
# It computes the SHA1 of a file and appends its path to the list of duplicates for that SHA1.
#
FILE="${1}"
DSTDIR="${2}"
# The variable is not named PATH, because that would overwrite
# the shell's command search path.
FILEPATH=$(readlink -f "${FILE}");
SHA1=$(/usr/bin/sha1sum "${FILE}" | /usr/bin/cut -c1-40);
DUPLICATES="${DSTDIR}/${SHA1}.duplicates"
SORTED="${DSTDIR}/${SHA1}.duplicates.sorted"
echo "echo \"${FILEPATH}\" >> \"${DUPLICATES}\"";
echo "${FILEPATH}" >> "${DUPLICATES}";
echo "/usr/bin/sort --ignore-case \"${DUPLICATES}\" --output=\"${SORTED}\"";
/usr/bin/sort --ignore-case "${DUPLICATES}" --output="${SORTED}"
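
To see how the grouping works, the same idea can be run on a few throwaway files. This is only an illustrative sketch under assumed names (`demo`, `a.jpg`, `b.jpg` and `c.jpg` are made up and are not real photos); it shows that files with identical content land in the same list because the list is named after the hash.

```shell
#!/bin/bash
# Throwaway directory with two identical files and one different file.
demo=$(mktemp -d)
printf 'same bytes\n' > "$demo/a.jpg"
printf 'same bytes\n' > "$demo/b.jpg"
printf 'other bytes\n' > "$demo/c.jpg"
mkdir "$demo/lists"

# Append each path to the list named after the SHA1 of its content.
for f in "$demo"/*.jpg; do
    sha1=$(sha1sum "$f" | cut -c1-40)
    echo "$f" >> "$demo/lists/$sha1.duplicates"
done

# Two lists result: one holding a.jpg and b.jpg, one holding c.jpg alone.
wc -l "$demo"/lists/*.duplicates
```

Only the first path of each sorted list is copied later, which is what removes the duplicates.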

deduplicate-photos.sh

#!/bin/bash
#
# Parts of this program:
#
# - deduplicate-photos-sha1.sh
# - deduplicate-photos-copy.sh
# - deduplicate-photos-exif.sh
#
# Install the EXIF tool:
#
# sudo apt install exif
#
# @author: Fabio Lima
#
SRCDIR="${1}"
DSTDIR="${2}"
EXT="jpg"
DATE=$(date +%Y%m%d_%H%M%S)
OUTLOG="${DSTDIR}"/deduplicate-photos-${DATE}-output.log
ERRLOG="${DSTDIR}"/deduplicate-photos-${DATE}-errors.log
START=$(date +%Y-%m-%d_%H:%M:%S)
SHA1DIR="${DSTDIR}/deduplicate-photos-${DATE}-sha1";
test ! -d "${SHA1DIR}" && mkdir "${SHA1DIR}"
COPYDIR="${DSTDIR}/deduplicate-photos-${DATE}-copy";
test ! -d "${COPYDIR}" && /bin/mkdir "${COPYDIR}"
EXIFDIR="${DSTDIR}/deduplicate-photos-${DATE}-exif";
test ! -d "${EXIFDIR}" && /bin/mkdir "${EXIFDIR}"
echo "START: $START"
echo "START: $START" >> "${OUTLOG}"
echo >> "${OUTLOG}"
echo "Generating lists of duplicates"
echo "Generating lists of duplicates" >> "${OUTLOG}"
echo "find \"${SRCDIR}\" -type f -iname \"*.${EXT}\" -exec ./deduplicate-photos-sha1.sh {} \"${SHA1DIR}\"" \; >> "${OUTLOG}" 2>> "${ERRLOG}"
find "${SRCDIR}" -type f -iname "*.${EXT}" -exec ./deduplicate-photos-sha1.sh {} "${SHA1DIR}" \; >> "${OUTLOG}" 2>> "${ERRLOG}"
echo >> "${OUTLOG}"
echo "Copying the first photo of each list of duplicates"
echo "Copying the first photo of each list of duplicates" >> "${OUTLOG}"
echo "find \"${SHA1DIR}\" -type f -name \"*.duplicates.sorted\" -exec ./deduplicate-photos-copy.sh {} \"${COPYDIR}\"" \; >> "${OUTLOG}" 2>> "${ERRLOG}"
find "${SHA1DIR}" -type f -name "*.duplicates.sorted" -exec ./deduplicate-photos-copy.sh {} "${COPYDIR}" \; >> "${OUTLOG}" 2>> "${ERRLOG}"
echo >> "${OUTLOG}"
echo "Copying the first photo of each list of duplicates to directories with date"
echo "Copying the first photo of each list of duplicates to directories with date" >> "${OUTLOG}"
echo "find \"${SHA1DIR}\" -type f -name \"*.duplicates.sorted\" -exec ./deduplicate-photos-exif.sh {} \"${EXIFDIR}\" \"${EXT}\"" \; >> "${OUTLOG}" 2>> "${ERRLOG}"
find "${SHA1DIR}" -type f -name "*.duplicates.sorted" -exec ./deduplicate-photos-exif.sh {} "${EXIFDIR}" "${EXT}" \; >> "${OUTLOG}" 2>> "${ERRLOG}"
END=$(date +%Y-%m-%d_%H:%M:%S)
echo >> "${OUTLOG}";
echo "SUMMARY" >> "${OUTLOG}";
echo "Started: $START" >> "${OUTLOG}";
echo "Finished: $END" >> "${OUTLOG}";
echo -n "Photos found: " >> "${OUTLOG}"
find "${SRCDIR}" -type f -iname "*.${EXT}" | wc -l >> "${OUTLOG}" 2>> "${ERRLOG}"
echo -n "Photos in lists of duplicates: " >> "${OUTLOG}"
find "${SHA1DIR}" -name "*.duplicates" -exec cat {} \; | wc -l >> "${OUTLOG}" 2>> "${ERRLOG}"
echo -n "Photos in sorted lists of duplicates: " >> "${OUTLOG}"
find "${SHA1DIR}" -name "*.duplicates.sorted" -exec cat {} \; | wc -l >> "${OUTLOG}" 2>> "${ERRLOG}"
echo -n "Lists of duplicates generated: " >> "${OUTLOG}"
find "${SHA1DIR}" -name "*.duplicates" | wc -l >> "${OUTLOG}" 2>> "${ERRLOG}"
echo -n "Sorted lists of duplicates generated: " >> "${OUTLOG}"
find "${SHA1DIR}" -name "*.duplicates.sorted" | wc -l >> "${OUTLOG}" 2>> "${ERRLOG}"
echo -n "Photos copied WITHOUT DATE: " >> "${OUTLOG}"
find "${COPYDIR}" -type f -iname "*.${EXT}" | wc -l >> "${OUTLOG}" 2>> "${ERRLOG}"
echo -n "Photos copied WITH DATE: " >> "${OUTLOG}"
find "${EXIFDIR}" -type f -iname "*.${EXT}" | wc -l >> "${OUTLOG}" 2>> "${ERRLOG}"
echo -n "Data in source directory: " >> "${OUTLOG}"
du -sh "${SRCDIR}" | cut -f1 >> "${OUTLOG}" 2>> "${ERRLOG}"
echo -n "Data in source directory in kilobytes: " >> "${OUTLOG}"
du -s "${SRCDIR}" | cut -f1 >> "${OUTLOG}" 2>> "${ERRLOG}"
echo -n "Data copied to directories WITHOUT DATE: " >> "${OUTLOG}"
du -sh "${COPYDIR}" | cut -f1 >> "${OUTLOG}" 2>> "${ERRLOG}"
echo -n "Data copied to directories WITHOUT DATE in kilobytes: " >> "${OUTLOG}"
du -s "${COPYDIR}" | cut -f1 >> "${OUTLOG}" 2>> "${ERRLOG}"
echo -n "Data copied to directories WITH DATE: " >> "${OUTLOG}"
du -sh "${EXIFDIR}" | cut -f1 >> "${OUTLOG}" 2>> "${ERRLOG}"
echo -n "Data copied to directories WITH DATE in kilobytes: " >> "${OUTLOG}"
du -s "${EXIFDIR}" | cut -f1 >> "${OUTLOG}" 2>> "${ERRLOG}"
echo -n "Lines in error output file: " >> "${OUTLOG}"
cat "${ERRLOG}" | wc -l >> "${OUTLOG}" 2>> "${ERRLOG}"
# Show the SUMMARY section of the output log
tail -n 18 "${OUTLOG}"