@jszym
Created June 4, 2020 17:38
Given a class=folder structure, compute splits with sklearn
# a library for discovering paths
from glob import glob
from sklearn.model_selection import train_test_split
# you may need to look up the documentation for glob
# "*" is a stand-in for any string
# this assumes that the subfolders are in the same folder as the script
# if the subfolders were in a folder "data", the argument to glob would be
# "./data/*/*.png"
paths = glob("./*/*.png")
# >>> paths[:3]
# ['.\\A\\a29ydW5pc2hpLnR0Zg==.png', '.\\A\\a2F6b28udHRm.png', '.\\A\\a2FpcmVlLnR0Zg==.png']
# The double backslashes are because I'm on a PC; they would be forward slashes on mac/linux
# we need to separately generate the labels.
# to do this, we need to get the labels from the path.
# I'll just split on the backslashes "\\"; use "/" for mac/linux
# with these paths, it's the second element that has the class
# If the subfolders are in another folder, you might need to use e.g. the third element
labels = [path.split("\\")[1] for path in paths]
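# now we can use sklearn's train_test_split function
# test_size=0.2 holds out 20% of the images for testing; random_state=42 makes the shuffle reproducible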
x_train, x_test, y_train, y_test = train_test_split(paths, labels, test_size=0.2, random_state=42)
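
Splitting on a hard-coded "\\" or "/" ties the label step to one operating system and to one folder depth. A minimal sketch of a more portable alternative, assuming the class is always the name of the folder that directly contains the image: the helper `label_from_path` and the example file names are hypothetical and not part of the original gist.

```python
import os
from pathlib import Path


def label_from_path(path):
    # The class is the name of the folder that directly contains the image,
    # so this works on any OS and however deeply the class folders are nested.
    return Path(path).parent.name


# Hypothetical file names, joined with this platform's separator to mimic glob output.
example_paths = [
    os.path.join(".", "A", "img1.png"),
    os.path.join(".", "data", "B", "img2.png"),
]
example_labels = [label_from_path(p) for p in example_paths]
print(example_labels)  # ['A', 'B']
```

Labels built this way can be passed to `train_test_split` exactly as above. If the classes are imbalanced, adding `stratify=labels` to that call should keep the class proportions roughly equal in the train and test sets.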