#!/usr/bin/env python
# Reads `fdupes -r -1` output (one group of duplicates per line) from stdin and
# replaces every duplicate after the first file in a group with a relative symlink to it.
# usage: fdupes -r1 . | ./lndupes.py
# Limitation: each line is split on single spaces, so paths containing spaces are mishandled.
import os
import sys
from os.path import basename, dirname, join, relpath

for line in sys.stdin:
    files = line.strip().split(' ')
    first = files[0]  # the file that is kept; every other file will link to it
    print("First: %s" % first)
    for dup in files[1:]:
        # Path from the duplicate's directory to the kept file's directory
        rel = relpath(dirname(first), dirname(dup))
        target = join(rel, basename(first))
        print("Linking duplicate: %s to %s" % (dup, target))
        os.unlink(dup)
        os.symlink(target, dup)
Based on the improved version by @jeroenvermeulen, I made this Python 3 compatible by fixing the print()s.
https://gist.github.com/oh2fih/810928a48b4b71d0b11f4bb1872dd3d2
Hi Folks,
How can I control fdupes's print order without relying on date or name? For example, for reverse incremental backups, we'd like to keep the latest snapshot fully intact and have duplicates in older snapshots removed or replaced with symlinks.
But, e.g.:
fdupes $dir_to_keep -r1 $dir_to_cleanup
The result order is all over the place, causing files in $dir_to_keep to be replaced with symlinks.
Thanks!
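One way to stop depending on fdupes's output order is to post-process each group and choose the file to keep by directory rather than by position. The following is only a minimal sketch: `pick_keeper` and its `keep_dir` argument are hypothetical names, not part of the script above, and groups with no copy under `keep_dir` are simply left alone. It addresses ordering only, not the question of whether the resulting link target is the logically right file.

```python
import os


def pick_keeper(files, keep_dir):
    """Return (keeper, duplicates) for one group of identical files.

    The keeper is the first file located under keep_dir. Only files
    outside keep_dir are returned as duplicates, so nothing inside
    keep_dir is ever replaced. If no file of the group lives under
    keep_dir, (None, []) is returned and the group is left untouched.
    """
    keep_root = os.path.abspath(keep_dir)

    def in_keep_dir(path):
        full = os.path.abspath(path)
        # Compare with a trailing separator so "new" does not match "newer".
        return full.startswith(keep_root + os.sep)

    kept = [f for f in files if in_keep_dir(f)]
    others = [f for f in files if not in_keep_dir(f)]
    if not kept:
        return None, []
    return kept[0], others
```

The returned keeper would take the place of `first` in the script, and the returned duplicates the place of `files[1:]`.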
> For example, for reverse incremental backups, we'd like to keep the latest snapshot fully intact and have duplicates in older snapshots removed or replaced with symlinks.
fdupes isn’t the right tool for snapshot deduplication. It only groups identical files by content and outputs them in arbitrary order, with no awareness of snapshot structure or relationships. When identical files exist in multiple paths within the directories being snapshotted, fdupes only considers file content, not relative paths. Resolving snapshot ordering so that only older snapshots create symlinks to newer ones does not eliminate the fundamental problem. Older snapshots can still contain symlinks pointing to incorrect locations in newer ones.
For example, if you have a template README.md used across projects and someone copies it into a project without modifying it, you may end up with a situation where the template in an older snapshot points to an instance under a project:
snap-2025-10-01/
    TEMPLATE/
        README.md
    projectA/
        src/main.c
snap-2025-10-02/
    TEMPLATE/
        README.md    # unchanged
    projectA/
        README.md    # copied from TEMPLATE, to be edited later
        src/main.c
The older snapshot might then contain a symlink like:
snap-2025-10-01/TEMPLATE/README.md -> ../../snap-2025-10-02/projectA/README.md
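To see where that target comes from, here is a small sketch that repeats the script's relative-path computation for the two paths in the example. `link_target` is a hypothetical helper name, and `posixpath` is used so the result is the same as `os.path` gives on Linux or macOS.

```python
import posixpath


def link_target(first, dup):
    """Relative symlink target the script would write at `dup` to reach `first`."""
    rel = posixpath.relpath(posixpath.dirname(first), posixpath.dirname(dup))
    return posixpath.join(rel, posixpath.basename(first))


first = "snap-2025-10-02/projectA/README.md"  # file fdupes happened to list first
dup = "snap-2025-10-01/TEMPLATE/README.md"    # duplicate that gets replaced
print("%s -> %s" % (dup, link_target(first, dup)))
```

Nothing in this computation looks at where a file sits inside its snapshot, which is why the template ends up pointing into projectA.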
Even with careful scripting, this destroys the logical integrity of your snapshots. You lose the guarantee that each snapshot is a faithful, self-contained view of the filesystem. Use tools designed for this purpose, e.g. rsync --link-dest, cp -al, or content-aware backup systems like borg or restic.
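For illustration, a path-aware approach in the spirit of those tools can be sketched in Python: a file in the older snapshot is only ever linked to the file at the same relative path in the newer one, and only when the contents match. This is a rough sketch, not how rsync or the other tools are implemented. `hardlink_unchanged` is a hypothetical name, it assumes both snapshots are on the same filesystem (hard links cannot cross filesystems), that snapshots are never modified in place afterwards, and that Python 3 is used.

```python
import filecmp
import os


def hardlink_unchanged(old_snap, new_snap):
    """Hard-link files in old_snap to identical files at the same path in new_snap.

    Returns the relative paths that were replaced by hard links.
    """
    linked = []
    for root, _dirs, names in os.walk(old_snap):
        for name in names:
            old_path = os.path.join(root, name)
            rel = os.path.relpath(old_path, old_snap)
            new_path = os.path.join(new_snap, rel)
            # Only regular files on both sides; leave symlinks alone.
            if os.path.islink(old_path) or not os.path.isfile(old_path):
                continue
            if os.path.islink(new_path) or not os.path.isfile(new_path):
                continue
            if os.path.samefile(old_path, new_path):
                continue  # already the same inode
            # Byte-for-byte comparison, not just size and mtime.
            if filecmp.cmp(old_path, new_path, shallow=False):
                tmp = old_path + ".lnk-tmp"
                os.link(new_path, tmp)     # create the new link first
                os.replace(tmp, old_path)  # then swap it in atomically
                linked.append(rel)
    return linked
```

With the example above, `snap-2025-10-01/TEMPLATE/README.md` could only be linked to `snap-2025-10-02/TEMPLATE/README.md`, never to the copy under projectA, and each snapshot still looks like a complete directory tree.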
Hi @filipenf, thanks for this great script!
I improved your script because it split incorrectly when there were spaces in one of the filenames.
Updated version: https://gist.github.com/jeroenvermeulen/a1667c60dc159c8f0487acee2cca256f
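For readers who want the idea without following the link, here is a minimal sketch of one way to split such a line. It is not the code from the linked gist. It assumes that fdupes, when run with -1, escapes spaces and backslashes inside filenames with a backslash; check this against your fdupes version. `split_fdupes_line` is a hypothetical name.

```python
def split_fdupes_line(line):
    """Split one line of `fdupes -1` output into filenames.

    Assumes a backslash escapes the following character (a space or
    another backslash) and that unescaped spaces separate filenames.
    """
    names = []
    current = []
    escaped = False
    for ch in line.rstrip("\n"):
        if escaped:
            current.append(ch)  # take the escaped character literally
            escaped = False
        elif ch == "\\":
            escaped = True      # next character is part of the filename
        elif ch == " ":
            if current:         # unescaped space ends the current filename
                names.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        names.append("".join(current))
    return names
```

In the script above, `files = split_fdupes_line(line)` would replace the `line.strip().split(' ')` call.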