vacmar01 · November 11, 2018 15:02
diff --git a/gistfile1.txt b/gistfile1.txt
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# How to create a validation set for image classification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One problem I ran into while working through the fast.ai course came up when I tried to participate in the seedlings [classification competition](https://www.kaggle.com/c/plant-seedlings-classification) on [Kaggle](https://kaggle.com). Although the data comes in a format that is easy to work with in fast.ai, there is a major problem: we don't have a separate validation set. This isn't unusual in Kaggle competitions, but it is something I hadn't encountered before. Since we rely on the folder structure to organize our data, some reshuffling of the files and folders is necessary to build a validation set. \n",
    "\n",
    "I found a code snippet deeply buried in a notebook on GitHub that achieves exactly this, but I thought it would be a great opportunity to implement it myself and practice my Python skills. Data preprocessing and preparation is a big weakness of mine, so it just made sense to try and do it myself."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Quick Note: This code is meant to be run in a Jupyter Notebook. It uses some features that are only available there, like bash commands inside python code*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So first things first: we need some data to shuffle around. I created some dummy data just to have something to work with: 400 files divided into two folders that represent two different categories."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "import os\n",
    "PATH = 'data'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "cats = ['cats', 'dogs']\n",
    "\n",
    "for cat in cats:\n",
    "    os.makedirs(f'{PATH}/train/{cat}', exist_ok=True)\n",
    "    for i in range(0,200):\n",
    "        !touch '{PATH}/train/{cat}/{cat}-{i}.png'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We created a folder for each image category with 200 empty .png files in each folder. Now that we have a data structure that resembles the one of the cats vs. dogs example in lesson-1 of fast.ai, let's see how we can build a validation set."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below you see the function to create the validation set. It takes the path to the data and the fraction of each category's files that should go into the validation set (for example `0.2` for 20%). The basic idea is this:\n",
    "\n",
    "1. get a list of the categories\n",
    "2. make a `valid` folder and inside a folder for each category\n",
    "3. get a list of all the files that belong to a specific category\n",
    "4. shuffle the list of files\n",
    "5. pick a certain fraction of the files (`p`) (0.1 - 0.2 is standard)\n",
    "6. move the selected files to the validation folder\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "def make_val_set(PATH, p):\n",
    "    # make sure the path ends with a slash so the f-strings below build valid paths\n",
    "    PATH = PATH if PATH.endswith('/') else PATH+'/'\n",
    "    # every entry in train/ is treated as a category folder\n",
    "    cats = os.listdir(f\"{PATH}train\")\n",
    "    for cat in cats:\n",
    "        os.makedirs(f\"{PATH}valid/{cat}\", exist_ok=True)\n",
    "        list_of_files = os.listdir(f\"{PATH}train/{cat}\")\n",
    "        random.shuffle(list_of_files)\n",
    "        # number of files to move: fraction p of this category, rounded down\n",
    "        n_idxs = int(len(list_of_files)*p)\n",
    "        selected_files = list_of_files[:n_idxs]\n",
    "        for file in selected_files:\n",
    "            os.rename(f\"{PATH}train/{cat}/{file}\", f\"{PATH}valid/{cat}/{file}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This should do the trick. The code isn't that complicated. First I check that the given path ends with a '/' and add one if it doesn't, so that we don't run into problems when building the paths. Note that the `os.rename()` function behaves like the bash command `mv`: it \"renames\" the full name of the file, meaning the path plus the filename, and thus moves the file. `mv` does the same thing but is named after the 'move' part of its functionality, while `os.rename()` is named after the renaming part. That can be a little bit confusing. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "make_val_set(PATH, 0.2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After running the code above you should have a `valid` folder containing 20% of the pictures from each category, in the same folder structure as the train set. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Credit:__ The code is based on [this](https://github.com/Renga411/dl1.fastai/blob/master/Validation-set-creator.ipynb) notebook. I used it as a reference while implementing my own function."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
diff --git a/make_val_set.ipynb b/make_val_set.ipynb
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# How to create a validation set for image classification"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"One problem I ran into while working through the fast.ai course came up when I tried to participate in the seedlings [classification competition](https://www.kaggle.com/c/plant-seedlings-classification) on [Kaggle](https://kaggle.com). Although the data comes in a format that is easy to work with in fast.ai, there is a major problem: we don't have a separate validation set. This isn't unusual in Kaggle competitions, but it is something I hadn't encountered before. Since we rely on the folder structure to organize our data, some reshuffling of the files and folders is necessary to build a validation set. \n",
	"\n",
	"I found a code snippet deeply buried in a notebook on GitHub that achieves exactly this, but I thought it would be a great opportunity to implement it myself and practice my Python skills. Data preprocessing and preparation is a big weakness of mine, so it just made sense to try and do it myself."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Quick Note: This code is meant to be run in a Jupyter Notebook. It uses some features that are only available there, like bash commands inside python code"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"So first things first: we need some data to shuffle around. I created some dummy data just to have something to work with: 400 files divided into two folders that represent two different categories."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 18,
	"metadata": {},
	"outputs": [],
	"source": [
	"import random\n",
	"import os\n",
	"PATH = 'data'"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 19,
	"metadata": {},
	"outputs": [],
	"source": [
	"cats = ['cats', 'dogs']\n",
	"\n",
	"for cat in cats:\n",
	"    os.makedirs(f'{PATH}/train/{cat}', exist_ok=True)\n",
	"    for i in range(0,200):\n",
	"        !touch '{PATH}/train/{cat}/{cat}-{i}.png'"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"We created a folder for each image category with 200 empty .png files in each folder. Now that we have a data structure that resembles the one of the cats vs. dogs example in lesson-1 of fast.ai, let's see how we can build a validation set."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Below you see the function to create the validation set. It takes the path to the data and the fraction of each category's files that should go into the validation set (for example `0.2` for 20%). The basic idea is this:\n",
	"\n",
	"1. get a list of the categories\n",
	"2. make a `valid` folder and inside a folder for each category\n",
	"3. get a list of all the files that belong to a specific category\n",
	"4. shuffle the list of files\n",
	"5. pick a certain fraction of the files (`p`) (0.1 - 0.2 is standard)\n",
	"6. move the selected files to the validation folder\n"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 20,
	"metadata": {},
	"outputs": [],
	"source": [
	"def make_val_set(PATH, p):\n",
	"    # make sure the path ends with a slash so the f-strings below build valid paths\n",
	"    PATH = PATH if PATH.endswith('/') else PATH+'/'\n",
	"    # every entry in train/ is treated as a category folder\n",
	"    cats = os.listdir(f\"{PATH}train\")\n",
	"    for cat in cats:\n",
	"        os.makedirs(f\"{PATH}valid/{cat}\", exist_ok=True)\n",
	"        list_of_files = os.listdir(f\"{PATH}train/{cat}\")\n",
	"        random.shuffle(list_of_files)\n",
	"        # number of files to move: fraction p of this category, rounded down\n",
	"        n_idxs = int(len(list_of_files)*p)\n",
	"        selected_files = list_of_files[:n_idxs]\n",
	"        for file in selected_files:\n",
	"            os.rename(f\"{PATH}train/{cat}/{file}\", f\"{PATH}valid/{cat}/{file}\")"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"This should do the trick. The code isn't that complicated. First I check that the given path ends with a '/' and add one if it doesn't, so that we don't run into problems when building the paths. Note that the `os.rename()` function behaves like the bash command `mv`: it \"renames\" the full name of the file, meaning the path plus the filename, and thus moves the file. `mv` does the same thing but is named after the 'move' part of its functionality, while `os.rename()` is named after the renaming part. That can be a little bit confusing. "
	]
	},
	{
	"cell_type": "code",
	"execution_count": 22,
	"metadata": {},
	"outputs": [],
	"source": [
	"make_val_set(PATH, 0.2)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"After running the code above you should have a `valid` folder containing 20% of the pictures from each category, in the same folder structure as the train set. "
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"__Credit:__ The code is based on [this](https://github.com/Renga411/dl1.fastai/blob/master/Validation-set-creator.ipynb) notebook. I used it as a reference while implementing my own function."
	]
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.7.0"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 2
	}