@filipenf
Created August 19, 2016 19:04
Reads fdupes (-r1) output and creates relative symbolic links for each duplicate
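#!/usr/bin/env python3
# Reads fdupes (-r -1) output and creates relative symbolic links for each duplicate.
# Usage: fdupes -r1 . | ./lndupes.py
# Note: each line is split on single spaces, so file names containing spaces are not handled.
import os
import sys
from os.path import basename, dirname, join, relpath

for line in sys.stdin:
    files = line.strip().split(' ')
    if len(files) < 2:
        continue  # skip blank lines and groups without a duplicate
    first = files[0]  # the first file of each group is kept as the link target
    print("First: %s" % first)
    for dup in files[1:]:
        # Path from the duplicate's directory to the directory of the kept file
        rel = relpath(dirname(first) or '.', dirname(dup) or '.')
        target = join(rel, basename(first))
        print("Linking duplicate: %s to %s" % (dup, target))
        os.unlink(dup)
        os.symlink(target, dup)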
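
The script above splits each line on single spaces, so a file name that contains a space would be cut into pieces. A minimal sketch of a more careful splitter follows. It assumes that fdupes escapes spaces and backslashes in -1 output with a backslash, which is worth checking against your fdupes version. split_sameline is a hypothetical helper name, not part of the gist.

```python
def split_sameline(line):
    """Split one line of `fdupes -1` output into file names.

    Assumption: a backslash escapes the following character (space or
    backslash), and unescaped spaces separate file names.
    """
    names, current, escaped = [], [], False
    for ch in line.rstrip('\n'):
        if escaped:
            # Previous character was a backslash: take this one literally
            current.append(ch)
            escaped = False
        elif ch == '\\':
            escaped = True
        elif ch == ' ':
            # Unescaped space ends the current file name
            if current:
                names.append(''.join(current))
                current = []
        else:
            current.append(ch)
    if current:
        names.append(''.join(current))
    return names
```

In the loop above, files = split_sameline(line) could then stand in for line.strip().split(' ').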

oh2fih commented Oct 11, 2025

For example, with reverse incremental backups, we want to keep the latest snapshot fully intact and have duplicates in older snapshots removed or replaced with symlinks.

fdupes isn’t the right tool for snapshot deduplication. It only groups identical files by content and outputs them in arbitrary order, with no awareness of snapshot structure or relationships. When identical files exist in multiple paths within the directories being snapshotted, fdupes only considers file content, not relative paths. Resolving snapshot ordering so that only older snapshots create symlinks to newer ones does not eliminate the fundamental problem. Older snapshots can still contain symlinks pointing to incorrect locations in newer ones.

For example, if you have a template README.md used across projects and someone copies it into a project without modifying it, you may end up with a situation where the template in an older snapshot points to an instance under a project:

snap-2025-10-01/
    TEMPLATE/
        README.md
    projectA/
        src/main.c
        
snap-2025-10-02/
    TEMPLATE/
        README.md          # unchanged
    projectA/
        README.md          # copied from TEMPLATE, to be edited later
        src/main.c

The older snapshot might then contain a symlink like:

snap-2025-10-01/TEMPLATE/README.md -> ../../snap-2025-10-02/projectA/README.md

Even with careful scripting, this destroys the logical integrity of your snapshots: you lose the guarantee that each snapshot is a faithful, self-contained view of the filesystem. Use tools designed for this purpose, e.g. rsync --link-dest, cp -al, or content-aware backup systems like borg or restic.
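
The mismatch in the README.md example can be made explicit by comparing each file's path relative to its snapshot root, which is exactly the information fdupes ignores. A minimal sketch, assuming snapshots are top-level directories as in the example; snapshot_relative and same_logical_file are hypothetical helper names:

```python
from pathlib import PurePosixPath

def snapshot_relative(path):
    """Split 'snap-.../rest/of/path' into (snapshot, path inside snapshot)."""
    parts = PurePosixPath(path).parts
    return parts[0], PurePosixPath(*parts[1:])

def same_logical_file(a, b):
    """True only if a and b are the same path in two different snapshots."""
    snap_a, rel_a = snapshot_relative(a)
    snap_b, rel_b = snapshot_relative(b)
    return snap_a != snap_b and rel_a == rel_b

old = "snap-2025-10-01/TEMPLATE/README.md"
print(same_logical_file(old, "snap-2025-10-02/TEMPLATE/README.md"))  # True
print(same_logical_file(old, "snap-2025-10-02/projectA/README.md"))  # False
```

Both candidates have identical content, so fdupes would group them together; only the first is the same logical file in a newer snapshot.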
