@CodeArtha
Created October 21, 2022 08:53
Bash script that keeps only the smallest copy of near-duplicate audio or video files, i.e. differently compressed files whose names start with the same characters.
#!/usr/bin/env bash
# This script looks for near duplicates in the current directory and keeps only the one with the smallest file size.
# "Near duplicates" means that a file and its duplicate share the same beginning of their names up to a certain
# character position (e.g. a title), followed by an additional variable part at the end (e.g. a timestamp).
#
# The duplicates contain the "same" information for a human, but are actually encoded differently. Hence they can't
# be compared using a hash. For instance video files encoded differently, or a raw music file and its compressed mp3
# counterpart.
# The heavier file is moved to an archive directory so it can be reviewed before being deleted.
#
# This script is courtesy of u/Mount_Gamer on reddit where he helped me solve this problem
# src: https://www.reddit.com/r/bash/comments/y70r7j/comment/istxaj2/?utm_source=share&utm_medium=web2x&context=3
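archive_dir="./archive-dir/" # Directory that receives the heavier duplicates instead of deleting them immediately
num_common_chars=14 # Number of leading filename characters a file and its duplicate must have in common
mkdir -p "$archive_dir" # Without this, mv fails when the archive directory does not exist yet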
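for i in *
do
    # Skip directories (including the archive) and files that were already moved away
    [[ -f "$i" ]] || continue
    for z in *
    do
        [[ -f "$z" ]] || continue
        if [[ "${i:0:num_common_chars}" == "${z:0:num_common_chars}" ]] && [[ "$i" != "$z" ]]
        then
            # Compare exact byte counts: du -m rounds to whole MiB and hides smaller differences
            outerloop=$(wc -c < "$i")
            innerloop=$(wc -c < "$z")
            if (( innerloop < outerloop ))
            then
                mv -- "$i" "$archive_dir"
                break # "$i" is gone, so continue with the next file of the outer loop
            else
                mv -- "$z" "$archive_dir"
            fi
        fi
    done
done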
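
Before moving anything, it can be useful to preview which files would end up in the archive. The following is a minimal dry-run sketch, not part of the original script: the function name `list_heavier_duplicates` and its two arguments (prefix length, directory) are hypothetical, and it assumes bash 4 or newer for associative arrays and a prefix length of at least 1. It groups files by their common prefix in a single pass and prints every file that is not the smallest of its group.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print the files that WOULD be archived, without moving anything.
# list_heavier_duplicates is a hypothetical helper, not part of the original script.
# Usage: list_heavier_duplicates <num_common_chars> <directory>
list_heavier_duplicates() {
    local n=$1 dir=$2
    # prefix -> name and size of the smallest file seen so far (requires bash 4+)
    local -A best_name best_size
    local path name prefix size
    for path in "$dir"/*
    do
        [[ -f "$path" ]] || continue          # ignore directories and unmatched globs
        name=${path##*/}                      # strip the directory part
        prefix=${name:0:n}                    # same substring test as the script uses
        size=$(( $(wc -c < "$path") ))        # exact size in bytes
        if [[ -z "${best_name[$prefix]+set}" ]]
        then
            # First file with this prefix: it is the best candidate so far
            best_name[$prefix]=$name
            best_size[$prefix]=$size
        elif (( size < ${best_size[$prefix]} ))
        then
            # New file is smaller: the previous candidate is the heavier one
            echo "${best_name[$prefix]}"
            best_name[$prefix]=$name
            best_size[$prefix]=$size
        else
            # New file is not smaller: it is the heavier one
            echo "$name"
        fi
    done
}
```

Calling `list_heavier_duplicates 14 .` lists the candidates for the current directory. When two files have exactly the same size, this sketch keeps the one the glob returns first.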