#!/usr/bin/env zsh
#
# # About
#
# Since APFS supports de-duplication on block-level, it can be useful to
# manually de-duplicate your files if you've migrated/upgraded to APFS
# rather than doing a fresh install.
#
# I've written this simple script with the aim to:
# - Be simple, easy to read and understand (for users to check)
# - Use native cp -c for de-duplication (for robustness)
# - Use byte-wise file comparison instead of hashing (while rare, hash collisions are possible)
# - Use jdupes for speed
# - Preserve file metadata via GNU cp
#
# See also this stackexchange thread: https://apple.stackexchange.com/questions/316435/replace-existing-duplicate-files-on-apfs-with-clones
#
# # Known bugs
#
# - Does not preserve target directory timestamps
# - Does not preserve xattrs if larger than 128kb (https://apple.stackexchange.com/questions/226485/copy-extended-attributes-to-new-file-ffmpeg)
#
# # Background info
#
# https://developer.apple.com/documentation/foundation/file_system/about_apple_file_system
# https://developer.apple.com/support/downloads/Apple-File-System-Reference.pdf
# https://eclecticlight.co/2019/01/05/aliases-hard-links-symlinks-and-copies-in-mojaves-apfs/
# https://eclecticlight.co/2017/11/02/taking-stock-using-apfs-in-high-sierra-10-13-1/
#
# # Alternatives (https://apple.stackexchange.com/questions/316435/replace-existing-duplicate-files-on-apfs-with-clones)
#
# Python, uses hashes (collision risk): https://github.com/ranvel/clonefile-dedup
# Python, uses hashes (collision risk, does not preserve metadata?): https://bitbucket.org/dchevell/apfs-deduplicate/src/master/
# C, checks for duplication, does not de-duplicate: https://github.com/dyorgio/apfs-clone-checker
# Does not preserve metadata: https://github.com/deckarep/apfs-compactor
# Paid: http://diskdedupe.com/
# Paid: https://macpaw.com/gemini
### Init: identify files and programs
# File to hold duplicate file data
DUPEFILE=./jdupes-output
# File to temporarily store old file for metadata
TEMPFILE=./tmp-preserved-for-metadata
# Determine which method(s) to use for xattr/metadata preservation: gcp
# preserves most metadata, xattr preserves up to 128kb of extended
# attributes but is significantly slower. Both can be enabled.
ATTRUSEGCP=1
ATTRUSEXATTR=1
# Critical programs to use
PCP=/bin/cp # Should be Mac native cp supporting clonefile(2)!
PMV=/bin/mv
PGCP=/opt/local/bin/gcp # Should be GNU cp, installed via macports -- not to be confused with an alias for git cherry-pick
PJDUPES=/opt/local/bin/jdupes
test ! -x "${PCP}" && echo "Error: path to cp wrong" && exit 1
test ! -x "${PMV}" && echo "Error: path to mv wrong" && exit 1
test ! -x "${PGCP}" && echo "Error: path to gnu-cp wrong" && exit 1
test ! -x "${PJDUPES}" && echo "Error: path to jdupes wrong" && exit 1
### Optional: check how much data can be saved
${PJDUPES} --recurse --omitfirst ./ | tee ${DUPEFILE}
# Loop over lines; if a line is not empty, get its size, then sum in awk
cat ${DUPEFILE} | while read thisfile; do
	test ! -z "$thisfile" && du -k "$thisfile"
done | awk '{i+=$1} END {print i" kb"}'
### Find duplicates
# Find duplicates, use the NUL character as separator to allow for newlines
# in filenames (rare but possible).
${PJDUPES} --print-null --recurse ./ > ${DUPEFILE}
# Check the number of sets of duplicates by counting occurrences of two
# consecutive NUL characters.
# Source: https://stackoverflow.com/questions/371115/count-all-occurrences-of-a-string-in-lots-of-files-with-grep
NPAIRS=$(grep -oaE '\x00\x00' ${DUPEFILE} | wc -l)
echo "Found ${NPAIRS} sets of duplicates"
### Start de-duplication
# Loop over files separated by NUL characters, use the first file of each
# set as the source for all other files in that set, e.g.
#
# file1\x00
# file2\x00
# file3\x00\x00
#
# will cause file2 and file3 to be overwritten by (clones of) file1
#
# - If the file is empty, a new set begins and we unset SOURCEFILE.
#   Also true for the first set we encounter, as SOURCEFILE starts unset
# - If SOURCEFILE is unset, use the current file to set it
# - If the file is not empty AND SOURCEFILE is set, make a copy:
#   -- Move the target file to a temporary location
#   -- Clone the source file over the target file
#   -- Copy attributes from the preserved original to the target file
SOURCEFILE=""
cat ${DUPEFILE} | while read -d $'\0' FILE; do
	if [[ -z $FILE ]]; then
		SOURCEFILE=""
	elif [[ -z $SOURCEFILE ]]; then
		SOURCEFILE=${FILE}
	else
		# Preserve original file for metadata
		${PMV} "${FILE}" "${TEMPFILE}";
		# Test that move was successful
		test ! -e "${TEMPFILE}" && echo "Error: move failed on ${FILE}, aborting." && break
		# Use cp -c to use APFS clonefile(2)
		# Use cp -a to preserve metadata, recurse, and not follow symlinks
		${PCP} -ca "${SOURCEFILE}" "${FILE}";
		# Test that copy was successful (protect against e.g. empty $PCP string)
		test ! -e "${FILE}" && echo "Error: copy failed on ${FILE}, aborting." && break
		# Copy over attributes
		if [[ "${ATTRUSEGCP}" -eq 1 ]]; then
			# Using GNU cp -- xattrs not preserved
			# https://unix.stackexchange.com/a/93842
			# https://unix.stackexchange.com/questions/402862/cp-losing-files-metadata#402869
			# Poorer alternative: https://unix.stackexchange.com/questions/91080/maintain-or-restore-file-permissions-when-replacing-file
			${PGCP} --preserve=all --attributes-only "${TEMPFILE}" "${FILE}"
		fi
		if [[ "${ATTRUSEXATTR}" -eq 1 ]]; then
			# Using macOS native xattr, preserving xattrs up to 128kb of data
			# To properly preserve metadata, use the Apple Standard C Library copyfile(..., COPYFILE_METADATA)
			# via https://apple.stackexchange.com/questions/226485/copy-extended-attributes-to-new-file-ffmpeg
			# and @worldpoop and @kapitainsky at https://gist.github.com/tvwerkhoven/9a30c6adc7f95a0278e895b5563b900b
			set -e
			IFS=$'\n' attr_names=($(xattr "${TEMPFILE}"))
			for attr in $attr_names; do
				value=$(xattr -p -x "${attr}" "${TEMPFILE}" | tr -d " \n")
				xattr -w -x "${attr}" "${value}" "${FILE}"
			done
		fi
	fi
done
## Using fdupes - bash (not tested)
# Get matches
# https://unix.stackexchange.com/questions/34366/is-there-a-way-of-deleting-duplicates-more-refined-than-fdupes-rdn
# DUPEFILE=fdupes-20200101a
# fdupes --sameline --recurse ./ | tee ${DUPEFILE}
# cat ${DUPEFILE} | while read SOURCEFILE DESTFILES; do
# 	# Split line by spaces
# 	# Source: https://stackoverflow.com/a/30212526
# 	read -ra DESTFILESARR <<<${DESTFILES}
# 	for DEST in "${DESTFILESARR[@]}"; do
# 		mv "${DEST}" tmp
# 		echo cp -ca "${SOURCEFILE}" "${DEST}";
# 		echo gcp --preserve=all --attributes-only tmp "${DEST}"
# 	done
# done
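As a standalone illustration (not part of the gist itself), the NUL-separated set parsing that the main loop relies on can be exercised in isolation. `read -d ''` works in bash and zsh; the record names here are made up for the demo:

```shell
#!/usr/bin/env bash
# Standalone sketch of the NUL-separated parsing used by the main loop above:
# an empty record (two consecutive NULs) ends a duplicate set, and the first
# record of each set becomes the clone source for the rest of that set.
parse_sets() {
  local SOURCEFILE="" FILE
  while IFS= read -r -d '' FILE; do
    if [[ -z $FILE ]]; then
      SOURCEFILE=""              # empty record: a new set begins
    elif [[ -z $SOURCEFILE ]]; then
      SOURCEFILE=$FILE           # first record of a set: the clone source
    else
      printf '%s -> %s\n' "$SOURCEFILE" "$FILE"
    fi
  done
}

# Two sets, {a,b,c} and {d,e}, in jdupes --print-null layout
printf 'a\0b\0c\0\0d\0e\0\0' | parse_sets
# prints:
# a -> b
# a -> c
# d -> e
```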
However, jdupes 1.20.2 would cause data corruption if any files de-duplicated are compressed by the file system.
Ack, @adib, that's pretty serious. I assume that the use of cp in this script circumvents this jdupes defect?
this jdupes defect?
It's not a jdupes defect, it's a macOS defect.
Apologies. Mac's lovely defect. This does, then, render jdupe hazardous to use on a Mac for apfs de-duping?
@worldpoop No, this defect only affects files that are compressed using APFS transparent compression, and that's not something most people will encounter normally.
There are user-friendly utilities in the app store, and various how-tos on the interwebs, to facilitate transparent compression to help users save disk space -- meaning it's not that rare, especially among users already keen on CLI tools. And this means, owing to the macOS defect, we have this circumstance in the wild where jdupes can do irreparable damage for a jdupes user. If nothing else, a check or a warning would be really good! .02. Thank you for the tool -- be well! - b
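On that note: before pointing any dedupe tool at a tree, it may be worth screening for transparently compressed files first. A minimal sketch (an assumption of mine, not from the thread): on macOS, `ls -O` prints BSD file flags and transparently compressed files show a `compressed` flag; the snippet is guarded so it is a no-op elsewhere.

```shell
# Sketch: screen for APFS/HFS+ transparently compressed files before
# de-duplicating. `ls -O` (BSD file flags) is macOS-only, hence the guard.
if [ "$(uname)" = "Darwin" ]; then
  # Transparently compressed files carry the "compressed" flag
  ls -lRO . | grep -w compressed || echo "no compressed files found"
else
  echo "file-flag check skipped: APFS compression is macOS-only"
fi
```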
Using cp -c should avoid the problem from using jdupes to de-duplicate files on APFS. It's better if the tool can avoid requiring the user to install further 3rd-party tools, jdupes included. Maybe use sha to find duplicate files?
There was code in an issue report that was supposed to solve the problem within jdupes. It was never sent in as a pull request.
rpendleton's commits, retain flag; check flag?
Script is perfect (save for the time-stamp issue), am using it, thank you!! Notes:
1. Requires path adjustments for Monterey.
2. test -x failed verification, had to use test -e on my system. (Monterey again?)
3. Seems the jdupes option should be --print-null, not -printnull.
4. Permission denied with tee (?), had to run the script with sudo, otherwise an inadvertent "dry run".
dyorgio/apfs-clone-checker is pretty good. It doesn't hash, just looks for initial shared blocks, so it's very fast. dedupe.sh runs cp -c even on already deduped files -- a check with clone-checker would avert that (possibly with a few false positives, thanks to no hash, but no real harm).
Minutiae:
1. A test confirmed the third-party tool would have ravaged my broadly compressed APFS volume! -- glad to have hit upon the caution and then upon your script, which marries well with jdupes. Indeed, before pulling the trigger I'd been testing with newly created junk files, which of course are not compressed when created, so the danger wasn't revealed.
2. @adib good note on 3rd-party tools with this kind of operation -- sha would be slower?
3. Not a coder, forgive anything stupid I've said. Cheers - b
Hi all, thanks for your comments. I indeed opted for Mac's native cp -c because I hoped that would work better/safer, especially on something so potentially harmful as de-duplication.
- Which paths do you use? I get gcp and jdupes via macports, which explains the /opt/local/
- That's strange, test -x tests for executability, so if that fails you cannot execute the programs?
- Oops, fixed!
- Indeed, you need write permissions in the working directory, else you cannot log the dupefiles or de-duplicate.
On your minutiae:
- Good to hear :)
- I disagree, writing a script to check for duplication would largely duplicate jdupes.
- No worries, thanks for investigating/testing
Thanks, @tvwerkhoven
Is jdupes checking for identical files or for actual shared blocks? Testing things, I recall dedupe.sh executing cp -c on the first of any found duplicate, whether or not that dupe was already a clone. I may have been careless & will check again tonight. But if that's the case, it seems like a good idea not to re-clone any files.
Yup, those files are executable. I have no explanation. On path, macports, got it -- I brewed.
Thanks again!
Is jdupes checking for identical files or for actual shared blocks?
All operations are done by reading the file. There is no filesystem-specific magic going on except for the call to clonefile() at the end of the process. Otherwise, files are treated as discrete entities by jdupes.
One oddity I see is that the script will dedupe already deduped files. Even though a file is in a clone set sharing one location, dedupe.sh will run the copy operation (reapply attributes through the tempfile, plus change the target's timestamp again). Run it ten times over a season on a tree of five hundred duplicates, and it will "re-"cp -c every set every time. Is there a reliable way to filter out already deduped files? The following seems to work in my setup (with a few false negatives, I can live with that), and I'll use it, but of course it means adding another executable to the prerequisites. Call it peace of mind of not over-mucking with files that APFS already holds as cloned through whatever historical means....
SOURCEFILE=""
cat ${DUPEFILE} | while read -d $'\0' FILE; do
if [[ -z $FILE ]]; then
SOURCEFILE=""
elif [[ -z $SOURCEFILE ]]; then
SOURCEFILE=${FILE}
else
# github.com/dyorgio/apfs-clone-checker
CLONECHECKED="$(${CLNCHKR} -qf "${SOURCEFILE}" "${FILE}")"
if [[ ${CLONECHECKED} == "1" ]]; then
echo "Skipping ${SOURCEFILE} -> ${FILE} -- already deduped / clones."
else
echo "Deduping ${SOURCEFILE} -> ${FILE}"
# Preserve original file for metadata
${PMV} "${FILE}" "${TEMPFILE}";
Hi @worldpoop, that would be possible, but I'm not sure what problem that solves -- can you elaborate?
As I see it, 1) this would add another dependency, and 2) there's no harm in copying cloned files.
this script does not preserve extended attributes. All deduplicated files' xattrs will be replaced with the ones from the first file.
Seems that gcp on macOS is built without xattr support:
# gcp --preserve=xattr --attributes-only sourceFile destFile
gcp: cannot preserve extended attributes, cp is built without xattr support
as a result, for three identical files with different xattrs:
# ls
hello.txt hello1.txt hello2.txt
# shasum *
750c7735502f7c6072d8b4c9239697302d393480 hello.txt
750c7735502f7c6072d8b4c9239697302d393480 hello1.txt
750c7735502f7c6072d8b4c9239697302d393480 hello2.txt
# xattr -l *
hello.txt: test:
hello1.txt: test2:
# dedupe.sh
Scanning: 4 files, 1 items (in 1 specified)
./hello1.txt
./hello2.txt
3904 kb
Scanning: 4 files, 1 items (in 1 specified)
Found 1 sets of duplicates
# xattr -l *
hello.txt: test:
hello1.txt: test:
hello2.txt: test:
this script does not preserve extended attributes. All deduplicated files' xattrs will be replaced with the ones from the first file.
Seems that gcp on macOS is built without xattr support:
Oh wow, @kapitainsky, you're right. MacOS Monterey -- gcp --preserve=xattr --attributes-only doesn't do a thing for me either.
This guy has a tidy way to use native xattr to copy attributes. (Cautions that any single attr with more than 128K of data will fail. But that actually seems like a ton of data unlikely to be exceeded, right... unless images or such can and are encoded into x-attrs?) Not as quick as gcp. But if gcp doesn't work anyway... Removes one dependency!
#!/usr/bin/env zsh
if [[ $# -ne 2 ]]; then
	print >&2 "usage: copy_xattrs SOURCE_FILE TARGET_FILE"
	exit 1
fi
set -e
IFS=$'\n' attr_names=($(xattr "$1"))
for attr in $attr_names; do
	value=$(xattr -p -x "$attr" "$1" | tr -d " \n")
	xattr -w -x "$attr" "$value" "$2"
done
not sure what problem that solves, can you elaborate?
I dunno, maybe you're right. Just felt like good form not to do unnecessary file system operations, especially if one is processing a large number of files. Running your inspired script on my end, took a good bit of time to do a large number of files on first run. (I have it recursing.) And yes, clone_checking I'm sure adds to that time, though jdupes still seems to be the lengthiest portion, sensibly.
Of note, I changed it to copy rather than move the target to the tempfile, because if you hit an error/break, you gotta stop, find the tempfile or original source and restore the blitzed target manually. It's cp -c to a temp on the same volume -- that way, especially for large video files (I have prores files up to 100GB), it's an instant copy and no extra space is used anywhere by the tempfile -- then rm the tempfile at the end of each loop.
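The clone-as-backup variant described above could look roughly like this. A sketch, not @worldpoop's actual code: `backup_then_replace` is a made-up name, and since `cp -c` is macOS-only, plain `cp` stands in elsewhere so the sketch stays runnable.

```shell
# Sketch of the safer backup step: clone (not move) the target aside before
# overwriting, so an aborted run can restore it. On APFS, cp -c makes the
# backup an instant, space-free clone; plain cp is the non-macOS stand-in.
backup_then_replace() {
  FILE=$1 SOURCEFILE=$2
  TEMPFILE="./tmp-preserved.$$"   # simple unique-ish name for the backup
  if [ "$(uname)" = "Darwin" ]; then CLONE="cp -c"; else CLONE="cp"; fi
  $CLONE "$FILE" "$TEMPFILE" || return 1   # keep a backup of the target
  rm -f "$FILE"
  # If the clone-over fails, put the backup straight back
  $CLONE "$SOURCEFILE" "$FILE" || { mv "$TEMPFILE" "$FILE"; return 1; }
  # ...restore attributes from "$TEMPFILE" here (gcp/xattr, as in the script)...
  rm -f "$TEMPFILE"                        # drop the backup once done
}
```

Restoring the backup on failure is the point of the change: with mv, a failed cp -c leaves the target missing until you intervene by hand.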
One small lump of havoc I hit is that gcp cannot handle any file names with "\" in them. cp, mv, and jdupes too for that matter, are fine, but I could find no way to get gcp to not always read "\" as an escape no matter how I wrapped it. (I have a folder tree from an assistant editor full of audio stems all prefixed with \ ("1\6", "3\12", etc., meaning this file is first of six, that one third of twelve tracks, and so on.) So, well, now on two counts gcp is to the curb!
This guy has a tidy way to use native xattr to copy attributes. (Cautions that any single attr with more than 128K of data will fail. But that actually seems like a ton of data unlikely to be exceeded, right... unless images or such can and are encoded into x-attrs?) Not as quick as gcp. But if gcp doesn't work anyway... Removes one dependency!
As it is clear from the start that it does not always work, this is not really a solution....
I did a bit of poking around and it seems the only way to do this properly is to use the Apple Standard C Library copyfile(..., COPYFILE_METADATA)
You mean xattr -x doesn't always work on Mac? Good poking on standard library.
this script does not preserve extended attributes. All deduplicated files' xattrs will be replaced with the ones from the first file. Seems that gcp on macOS is built without xattr support:
Oh wow, @kapitainsky, you're right. MacOS Monterey -- gcp --preserve=xattr --attributes-only doesn't do a thing for me either.
This guy has a tidy way to use native xattr to copy attributes. (Cautions that any single attr with more than 128K of data will fail. But that actually seems like a ton of data unlikely to be exceeded, right... unless images or such can be and are encoded into x-attrs?) Not as quick as gcp. But if gcp doesn't work anyway... Removes one dependency!
Good catch, thanks. I added it, including the notice that metadata preservation is not 100% reliable. Also, xattr seems quite slow (on my system), so I added the option to use either gcp or xattr, in spite of the issues you mentioned @worldpoop ;) Using the Apple Standard C lib is a bit beyond the scope of what I intended here @kapitainsky, but perhaps this helps somebody else.
Also, I'd encourage people to contribute a pull request to jdupes so this all converges nicely in a more mature project :)
Yeap -- this is why I shared it here. Seems the solution is beyond simple scripting, but maybe somebody will move it further.
Looking for the best deduplicator for my needs, for now I stick with https://github.com/pkolaczk/fclones -- thanks to multithreading it is 10x faster than jdupes on an SSD disk (which is any Mac unless some ancient one), it does not have the in-place-compression data-corruption issue, and its problems with xattrs were fixed this weekend. Nothing is perfect though: it does not do byte-wise file comparison but uses hashes. Good that there are many tools around, to choose the right one for the job.
jdupes now supports APFS clonefile() natively using the same -B/--dedupe option as BTRFS/XFS dedupe on Linux, introduced in commit https://github.com/jbruchon/jdupes/commit/c56ebd06df0d78ef79ee7b4e9be4d54651145811 and available from jdupes v1.17.1 onwards.
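Given that, the simplest route today may be jdupes itself, assuming a build with dedupe support (flag names per the comment above; the earlier compression caveat applied to older combinations):

```shell
# Sketch: with jdupes >= 1.17.1 on APFS, clone-based de-duplication is built
# in via -B/--dedupe. Guarded so the snippet degrades to a message where
# jdupes is not installed.
if command -v jdupes >/dev/null 2>&1; then
  jdupes --recurse --dedupe .
else
  echo "jdupes not found; would run: jdupes --recurse --dedupe ."
fi
```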