Created July 28, 2020 21:01
Read file paths, names, hashes into a data frame
from fastai2.vision.all import *  # to get L (and Path)
import pandas as pd

def readMD5file(md5path:Path) -> pd.DataFrame:
    """
    Generate MD5 output file by doing a search like:

    find /home/jupyter/data/foldersToAdd/ -iname '*jpg' -print0 | xargs -0 -n 100 md5sum >> /home/jupyter/data/foldersToAdd.md5.out

    Then read it with this function to make a DataFrame for checking
    name uniqueness, path uniqueness, etc.
    """
    with open(str(md5path),'r') as f:
        # Each md5sum line is "<hash>  <path>". Split on the first run of
        # whitespace only, so paths that contain spaces stay in one piece.
        # Blank or malformed lines do not yield a 2-tuple and are dropped.
        lines = L(f.read().split('\n')).map(lambda line: tuple(line.split(None, 1))).filter(lambda t: len(t) == 2)
    lines.sort()  # sorts by hash first, then by path
    dff = pd.DataFrame(list(lines), columns=['hash','path'])
    # fname is the last path component, i.e. the bare file name
    dff['fname'] = dff['path'].map(lambda p: Path(p).parts[-1])
    return dff
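
The docstring mentions checking name uniqueness and path uniqueness but does not show how. Below is a minimal sketch of such checks using plain pandas. It assumes a frame with the same `hash`, `path` and `fname` columns that `readMD5file` returns. The sample rows and the helper `duplicated_rows` are hypothetical and not part of the original gist.

```python
import pandas as pd

# Hypothetical sample rows, shaped like the output of readMD5file
dff = pd.DataFrame(
    [
        ("0a1b", "/data/a/cat.jpg", "cat.jpg"),
        ("0a1b", "/data/b/cat_copy.jpg", "cat_copy.jpg"),
        ("9f8e", "/data/b/cat.jpg", "cat.jpg"),
        ("7c6d", "/data/c/dog.jpg", "dog.jpg"),
    ],
    columns=["hash", "path", "fname"],
)

def duplicated_rows(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return every row whose value in `column` occurs more than once."""
    # keep=False marks all members of a duplicate group, not just the repeats
    return df[df.duplicated(column, keep=False)].sort_values(column)

same_content = duplicated_rows(dff, "hash")   # same bytes under different paths
same_name = duplicated_rows(dff, "fname")     # same file name in different folders
paths_unique = dff["path"].is_unique          # True if no path was listed twice

print(same_content)
print(same_name)
print("paths unique:", paths_unique)
```

With a real file, `dff` would instead come from `readMD5file(Path(...))`. Rows in `same_content` are likely copies of one image, while rows in `same_name` are files that would collide if everything were moved into a single folder.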