Python script to find duplicate files from a folder
# checkDuplicates.py
# Python 2.7.6
"""
Given a folder, walk through all files within the folder and subfolders
and get a list of all files that are duplicates.
The md5 checksum for each file will determine the duplicates.
"""

import os
import hashlib
from collections import defaultdict
import csv


src_folder = "../../"


def generate_md5(fname, chunk_size=1024):
    """
    Function which takes a file name and returns the md5 checksum of the file
    """
    md5_hash = hashlib.md5()
    with open(fname, "rb") as f:
        # Read the 1st block of the file
        chunk = f.read(chunk_size)
        # Keep reading the file until the end and update the hash
        while chunk:
            md5_hash.update(chunk)
            chunk = f.read(chunk_size)

    # Return the hex checksum
    return md5_hash.hexdigest()


if __name__ == "__main__":
    """
    Starting block of script
    """
    # The dict will have a list as values
    md5_dict = defaultdict(list)

    file_types_inscope = ["ppt", "pptx", "pdf", "txt", "html",
                          "mp4", "jpg", "png", "xls", "xlsx", "xml",
                          "vsd", "py", "json"]

    # Walk through all files and folders within the directory
    for path, dirs, files in os.walk(src_folder):
        print("Analyzing {}".format(path))
        for each_file in files:
            if each_file.split(".")[-1].lower() in file_types_inscope:
                # The path variable gets updated for each subfolder
                file_path = os.path.join(os.path.abspath(path), each_file)
                # If there are more files with the same checksum, append to the list
                md5_dict[generate_md5(file_path)].append(file_path)

    # Identify keys (checksums) having more than one value (file names)
    duplicate_files = (
        val for key, val in md5_dict.items() if len(val) > 1)

    # Write the list of duplicate files to a csv file
    with open("duplicates.csv", "w") as log:
        # lineterminator added for Windows as it inserts blank rows otherwise
        csv_writer = csv.writer(log, quoting=csv.QUOTE_MINIMAL, delimiter=",",
                                lineterminator="\n")
        header = ["File Names"]
        csv_writer.writerow(header)
        for file_name in duplicate_files:
            csv_writer.writerow(file_name)

    print("Done")
@ricky-andre can you give us a 'Cliff's Notes' difference between that code and this one?
The link to my repository's script finds duplicates using the approach described above:

- check the file's length
- given that two files have the same length, check the md5 of the first 16 Kbytes of data
- if they still look the same, calculate the md5 of the whole files (a long task, since the whole file needs to be read)

Save the calculated md5 hashes in a text file. Of course, other things could go wrong and be improved (e.g. the text file could be encrypted, checked for integrity, ...), but I've tested it with my HDD and for sure it's really efficient and fast. For someone's personal use, it's very good.
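For illustration, here is a minimal sketch of that staged approach (group by size, then hash the first 16 KB, then confirm with a full-file md5). It is not the repository script itself; the helper names, the 16 KB prefix size and the chunk size are assumptions.

```python
# find_duplicates_staged.py
# Illustration of the staged approach: group by size, then by a partial md5,
# then confirm candidates with a full md5. Not the repository script itself.
import hashlib
import os
from collections import defaultdict

PARTIAL_BYTES = 16 * 1024  # only the first 16 KB are hashed in the second stage


def md5_of(path, limit=None, chunk_size=64 * 1024):
    """md5 of a file; if limit is set, only the first `limit` bytes are hashed."""
    h = hashlib.md5()
    remaining = limit
    with open(path, "rb") as f:
        while True:
            size = chunk_size if remaining is None else min(chunk_size, remaining)
            chunk = f.read(size)
            if not chunk:
                break
            h.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
                if remaining <= 0:
                    break
    return h.hexdigest()


def find_duplicates(root):
    # Stage 1: group files by size; a unique size cannot have duplicates
    by_size = defaultdict(list)
    for path, _dirs, files in os.walk(root):
        for name in files:
            fp = os.path.join(path, name)
            by_size[os.path.getsize(fp)].append(fp)

    # Stage 2: within equal-size groups, group by md5 of the first 16 KB
    # Stage 3: confirm the remaining candidates with a full-file md5
    duplicates = defaultdict(list)
    for group in by_size.values():
        if len(group) < 2:
            continue
        by_partial = defaultdict(list)
        for fp in group:
            by_partial[md5_of(fp, limit=PARTIAL_BYTES)].append(fp)
        for candidates in by_partial.values():
            if len(candidates) < 2:
                continue
            for fp in candidates:
                duplicates[md5_of(fp)].append(fp)

    return [paths for paths in duplicates.values() if len(paths) > 1]


if __name__ == "__main__":
    for group in find_duplicates("."):
        print(group)
```

The point of the staging is that the expensive full-file hash only runs on files that already agree on both size and the first 16 KB.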
@ricky-andre thank you!