@cabo
Created October 9, 2010 10:01
Find and clean duplicate files (Ruby)
#!/opt/local/bin/ruby1.9
require 'digest/md5'
require 'shellwords'
# argument processing -- goes through Dir[], so can use '**/*' etc.
ARGV[0] ||= '.'
filenames = ARGV.map do |dirn|
  Dir[if File.directory?(dirn)
        "#{dirn}/*"
      else
        dirn
      end]
end.flatten(1)
# find sets of same-content files
equivsets = filenames.map do |fn|
  [fn, File.lstat(fn)]
end.reject do |fn, stat|
  # skip symlinks, directories and empty files
  stat.symlink? || stat.directory? || stat.size == 0
end.group_by { |fn, stat| stat.size }.values.reject do |list|
  # a size that occurs only once cannot have a duplicate
  list.size == 1
end.map do |sublist|
  # within each same-size group, group again by MD5 of the content
  sublist.map do |fn, stat|
    [fn, stat, Digest::MD5.hexdigest(File.read(fn))]
  end.group_by { |fn, stat, digest| digest }.values.reject do |list|
    list.size == 1
  end
end.flatten(1).map do |list|
  # keep only the filenames, shortest name first
  list.map(&:first).sort_by(&:size)
end
# now print shell commands that remove the duplicate files (alternative: create links!)
commands = equivsets.map do |list|
  keep = list.shift # the file we keep
  list.map do |candidate|
    %Q{
if cmp #{[keep, candidate].shelljoin}
then rm #{[candidate].shelljoin}
fi
}
  end.join
end.join
puts commands
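
The comment in the script mentions creating links as an alternative to removing duplicates. A minimal sketch of that variant follows. It assumes each set is an array of filenames with the file to keep first, as in `equivsets` above. The helper name `link_commands` is hypothetical and not part of the original script, and the use of `cmp -s` and `ln -f` is an assumption about the target shell environment. Hard links only work when both files live on the same filesystem.

```ruby
require 'shellwords'

# Hypothetical helper (not in the original script): for each set of
# identical files, emit shell commands that replace every duplicate
# with a hard link to the first (kept) file.
def link_commands(equivsets)
  equivsets.flat_map do |list|
    keep, *candidates = list          # first entry is the file we keep
    candidates.map do |candidate|
      # cmp -s re-checks the content silently; ln -f replaces the
      # duplicate with a hard link to the kept file
      "if cmp -s #{[keep, candidate].shelljoin}\n" \
      "then ln -f #{[keep, candidate].shelljoin}\n" \
      "fi\n"
    end
  end.join
end

puts link_commands([['a.txt', 'copy of a.txt']])
```

As with the original script, this only prints the commands; the output can be reviewed and then piped to `sh`.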