Created on Cognitive Class Labs
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"<a href=\"https://cognitiveclass.ai\"><img src = \"https://ibm.box.com/shared/static/9gegpsmnsoo25ikkbl4qzlvlyjbgxs5x.png\" width = 400> </a>\n",
"\n",
"<h1 align=center><font size = 5>From Modeling to Evaluation</font></h1>"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"## Introduction\n",
"\n",
"In this lab, we will continue learning about the data science methodology, and focus on the **Modeling** and **Evaluation** stages.\n",
"\n",
"------------"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"## Table of Contents\n",
"\n",
"\n",
"<div class=\"alert alert-block alert-info\" style=\"margin-top: 20px\">\n",
"\n",
"1. [Recap](#0)<br>\n",
"2. [Data Modeling](#2)<br>\n",
"3. [Model Evaluation](#4)<br>\n",
"</div>\n",
"<hr>"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"# Recap <a id=\"0\"></a>\n",
"\n",
"In Lab **From Understanding to Preparation**, we explored the data and prepared it for modeling."
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"The data was compiled by a researcher named Yong-Yeol Ahn, who scraped tens of thousands of food recipes (cuisines and ingredients) from three different websites, namely:"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"<img src=\"https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DS0103EN/labs/images/lab4_fig1_allrecipes.png\" width=500>\n",
"\n",
"www.allrecipes.com\n",
"\n",
"<img src=\"https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DS0103EN/labs/images/lab4_fig2_epicurious.png\" width=500>\n",
"\n",
"www.epicurious.com\n",
"\n",
"<img src=\"https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DS0103EN/labs/images/lab4_fig3_menupan.png\" width=500>\n",
"\n",
"www.menupan.com"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"For more information on Yong-Yeol Ahn and his research, you can read his paper on [Flavor Network and the Principles of Food Pairing](http://yongyeol.com/papers/ahn-flavornet-2011.pdf)."
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"<strong> Important note:</strong> Please note that you are not expected to know how to program in Python. This lab is meant to illustrate the stages of modeling and evaluation of the data science methodology, so it is totally fine if you do not understand the individual lines of code. We have a full course on programming in Python, <a href=\"http://cocl.us/PY0101EN_DS0103EN_LAB4_PYTHON_Coursera\"><strong>Python for Data Science</strong></a>, which is also offered on Coursera. So make sure to complete the Python course if you are interested in learning how to program in Python."
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"### Using this notebook:\n",
"\n",
"To run any of the following code cells, type **Shift + Enter** to execute the code in a cell."
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Import the libraries that we will need to run this lab."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [],
"source": [
"import pandas as pd # import library to read data into dataframe\n",
"pd.set_option(\"display.max_columns\", None)\n",
"import numpy as np # import numpy library\n",
"import re # import library for regular expression\n",
"import random # library for random number generation"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"We already placed the data on an IBM server for your convenience, so let's download it from the server and read it into a dataframe called **recipes**."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Data read into dataframe!\n"
]
}
],
"source": [
"recipes = pd.read_csv(\"https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DS0103EN/labs/data/recipes.csv\") # reading the data takes about 30 seconds\n",
"\n",
"print(\"Data read into dataframe!\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"We will repeat the preprocessing steps that we implemented in Lab **From Understanding to Preparation** in order to prepare the data for modeling. For more details on preparing the data, please refer to Lab **From Understanding to Preparation**."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [],
"source": [
"# fix name of the column displaying the cuisine\n",
"column_names = recipes.columns.values\n",
"column_names[0] = \"cuisine\"\n",
"recipes.columns = column_names\n",
"\n",
"# convert cuisine names to lower case\n",
"recipes[\"cuisine\"] = recipes[\"cuisine\"].str.lower()\n",
"\n",
"# make the cuisine names consistent\n",
"recipes.loc[recipes[\"cuisine\"] == \"austria\", \"cuisine\"] = \"austrian\"\n",
"recipes.loc[recipes[\"cuisine\"] == \"belgium\", \"cuisine\"] = \"belgian\"\n",
"recipes.loc[recipes[\"cuisine\"] == \"china\", \"cuisine\"] = \"chinese\"\n",
"recipes.loc[recipes[\"cuisine\"] == \"canada\", \"cuisine\"] = \"canadian\"\n",
"recipes.loc[recipes[\"cuisine\"] == \"netherlands\", \"cuisine\"] = \"dutch\"\n",
"recipes.loc[recipes[\"cuisine\"] == \"france\", \"cuisine\"] = \"french\"\n",
"recipes.loc[recipes[\"cuisine\"] == \"germany\", \"cuisine\"] = \"german\"\n",
"recipes.loc[recipes[\"cuisine\"] == \"india\", \"cuisine\"] = \"indian\"\n",
"recipes.loc[recipes[\"cuisine\"] == \"indonesia\", \"cuisine\"] = \"indonesian\"\n",
"recipes.loc[recipes[\"cuisine\"] == \"iran\", \"cuisine\"] = \"iranian\"\n",
"recipes.loc[recipes[\"cuisine\"] == \"italy\", \"cuisine\"] = \"italian\"\n",
"recipes.loc[recipes[\"cuisine\"] == \"japan\", \"cuisine\"] = \"japanese\"\n",
"recipes.loc[recipes[\"cuisine\"] == \"israel\", \"cuisine\"] = \"jewish\"\n",
"recipes.loc[recipes[\"cuisine\"] == \"korea\", \"cuisine\"] = \"korean\"\n",
"recipes.loc[recipes[\"cuisine\"] == \"lebanon\", \"cuisine\"] = \"lebanese\"\n",
"recipes.loc[recipes[\"cuisine\"] == \"malaysia\", \"cuisine\"] = \"malaysian\"\n",
"recipes.loc[recipes[\"cuisine\"] == \"mexico\", \"cuisine\"] = \"mexican\"\n",
"recipes.loc[recipes[\"cuisine\"] == \"pakistan\", \"cuisine\"] = \"pakistani\"\n",
"recipes.loc[recipes[\"cuisine\"] == \"philippines\", \"cuisine\"] = \"philippine\"\n",
"recipes.loc[recipes[\"cuisine\"] == \"scandinavia\", \"cuisine\"] = \"scandinavian\"\n",
"recipes.loc[recipes[\"cuisine\"] == \"spain\", \"cuisine\"] = \"spanish_portuguese\"\n",
"recipes.loc[recipes[\"cuisine\"] == \"portugal\", \"cuisine\"] = \"spanish_portuguese\"\n",
"recipes.loc[recipes[\"cuisine\"] == \"switzerland\", \"cuisine\"] = \"swiss\"\n",
"recipes.loc[recipes[\"cuisine\"] == \"thailand\", \"cuisine\"] = \"thai\"\n",
"recipes.loc[recipes[\"cuisine\"] == \"turkey\", \"cuisine\"] = \"turkish\"\n",
"recipes.loc[recipes[\"cuisine\"] == \"vietnam\", \"cuisine\"] = \"vietnamese\"\n",
"recipes.loc[recipes[\"cuisine\"] == \"uk-and-ireland\", \"cuisine\"] = \"uk-and-irish\"\n",
"recipes.loc[recipes[\"cuisine\"] == \"irish\", \"cuisine\"] = \"uk-and-irish\"\n",
"\n",
"\n",
"# keep only cuisines with more than 50 recipes:\n",
"recipes_counts = recipes[\"cuisine\"].value_counts()\n",
"cuisines_indices = recipes_counts > 50\n",
"\n",
"cuisines_to_keep = list(np.array(recipes_counts.index.values)[np.array(cuisines_indices)])\n",
"recipes = recipes.loc[recipes[\"cuisine\"].isin(cuisines_to_keep)]\n",
"\n",
"# convert all Yes's to 1's and the No's to 0's\n",
"recipes = recipes.replace(to_replace=\"Yes\", value=1)\n",
"recipes = recipes.replace(to_replace=\"No\", value=0)"
]
},
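{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (assuming the preprocessing cell above ran without errors), we can confirm that every remaining cuisine has more than 50 recipes and that the ingredient columns now contain only 1's and 0's:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# every remaining cuisine should have more than 50 recipes\n",
"print(recipes[\"cuisine\"].value_counts().min() > 50)\n",
"\n",
"# the ingredient columns should contain only 0's and 1's\n",
"print(sorted(recipes.iloc[:, 1:].stack().unique()))"
]
},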
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"<hr>"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"# Data Modeling <a id=\"2\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"<img src=\"https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DS0103EN/labs/images/lab4_fig4_flowchart_data_modeling.png\" width=500>"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Download and install the additional libraries and dependencies needed to build decision trees."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [],
"source": [
"# import decision trees scikit-learn libraries\n",
"%matplotlib inline\n",
"from sklearn import tree\n",
"from sklearn.metrics import accuracy_score, confusion_matrix\n",
"\n",
"import matplotlib.pyplot as plt\n",
"\n",
"!conda install python-graphviz --yes\n",
"import graphviz\n",
"\n",
"from sklearn.tree import export_graphviz\n",
"\n",
"import itertools"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Solving environment: done\n",
"\n",
"## Package Plan ##\n",
"\n",
" environment location: /home/jupyterlab/conda\n",
"\n",
" added / updated specs: \n",
" - conda\n",
"\n",
"\n",
"The following packages will be downloaded:\n",
"\n",
" package | build\n",
" ---------------------------|-----------------\n",
" six-1.13.0 | py38_0 27 KB\n",
" pyopenssl-19.1.0 | py38_0 87 KB\n",
" pysocks-1.7.1 | py38_0 27 KB\n",
" python-3.8.0 | h0371630_2 39.6 MB\n",
" wheel-0.33.6 | py38_0 35 KB\n",
" ncurses-6.1 | he6710b0_1 958 KB\n",
" pip-19.3.1 | py38_0 1.9 MB\n",
" ruamel_yaml-0.15.87 | py38h7b6447c_0 269 KB\n",
" requests-2.22.0 | py38_1 90 KB\n",
" setuptools-42.0.2 | py38_0 654 KB\n",
" pycosat-0.6.3 | py38h7b6447c_0 113 KB\n",
" conda-package-handling-1.6.0| py38h7b6447c_0 879 KB\n",
" tqdm-4.40.2 | py_0 53 KB\n",
" zlib-1.2.11 | h7b6447c_3 120 KB\n",
" libedit-3.1.20181209 | hc058e9b_0 188 KB\n",
" sqlite-3.30.1 | h7b6447c_0 1.9 MB\n",
" urllib3-1.25.7 | py38_0 161 KB\n",
" certifi-2019.11.28 | py38_0 156 KB\n",
" cffi-1.13.2 | py38h2e261b9_0 233 KB\n",
" pycparser-2.19 | py_0 89 KB\n",
" asn1crypto-1.2.0 | py38_0 159 KB\n",
" conda-4.8.0 | py38_1 3.0 MB\n",
" idna-2.8 | py38_1000 103 KB\n",
" cryptography-2.8 | py38h1ba5d50_0 618 KB\n",
" ca-certificates-2019.11.27 | 0 132 KB\n",
" chardet-3.0.4 | py38_1003 170 KB\n",
" ------------------------------------------------------------\n",
" Total: 51.6 MB\n",
"\n",
"The following NEW packages will be INSTALLED:\n",
"\n",
" _libgcc_mutex: 0.1-main \n",
" conda-package-handling: 1.6.0-py38h7b6447c_0 \n",
"\n",
"The following packages will be UPDATED:\n",
"\n",
" asn1crypto: 0.24.0-py37_0 --> 1.2.0-py38_0 \n",
" ca-certificates: 2018.03.07-0 --> 2019.11.27-0 \n",
" certifi: 2018.8.24-py37_1 --> 2019.11.28-py38_0 \n",
" cffi: 1.11.5-py37he75722e_1 --> 1.13.2-py38h2e261b9_0 \n",
" chardet: 3.0.4-py37_1 --> 3.0.4-py38_1003 \n",
" conda: 4.5.11-py37_0 --> 4.8.0-py38_1 \n",
" cryptography: 2.3.1-py37hc365091_0 --> 2.8-py38h1ba5d50_0 \n",
" idna: 2.7-py37_0 --> 2.8-py38_1000 \n",
" libedit: 3.1.20170329-h6b74fdf_2 --> 3.1.20181209-hc058e9b_0\n",
" libgcc-ng: 8.2.0-hdf63c60_1 --> 9.1.0-hdf63c60_0 \n",
" ncurses: 6.1-hf484d3e_0 --> 6.1-he6710b0_1 \n",
" openssl: 1.0.2p-h14c3975_0 --> 1.1.1d-h7b6447c_3 \n",
" pip: 10.0.1-py37_0 --> 19.3.1-py38_0 \n",
" pycosat: 0.6.3-py37h14c3975_0 --> 0.6.3-py38h7b6447c_0 \n",
" pycparser: 2.18-py37_1 --> 2.19-py_0 \n",
" pyopenssl: 18.0.0-py37_0 --> 19.1.0-py38_0 \n",
" pysocks: 1.6.8-py37_0 --> 1.7.1-py38_0 \n",
" python: 3.7.0-hc3d631a_0 --> 3.8.0-h0371630_2 \n",
" requests: 2.19.1-py37_0 --> 2.22.0-py38_1 \n",
" ruamel_yaml: 0.15.46-py37h14c3975_0 --> 0.15.87-py38h7b6447c_0 \n",
" setuptools: 40.2.0-py37_0 --> 42.0.2-py38_0 \n",
" six: 1.11.0-py37_1 --> 1.13.0-py38_0 \n",
" sqlite: 3.24.0-h84994c4_0 --> 3.30.1-h7b6447c_0 \n",
" tqdm: 4.26.0-py37h28b3542_0 --> 4.40.2-py_0 \n",
" urllib3: 1.23-py37_0 --> 1.25.7-py38_0 \n",
" wheel: 0.31.1-py37_0 --> 0.33.6-py38_0 \n",
" zlib: 1.2.11-ha838bed_2 --> 1.2.11-h7b6447c_3 \n",
"\n",
"\n",
"Downloading and Extracting Packages\n",
"six-1.13.0 | 27 KB | ##################################### | 100% \n",
"pyopenssl-19.1.0 | 87 KB | ##################################### | 100% \n",
"pysocks-1.7.1 | 27 KB | ##################################### | 100% \n",
"python-3.8.0 | 39.6 MB | ##################################### | 100% \n",
"wheel-0.33.6 | 35 KB | ##################################### | 100% \n",
"ncurses-6.1 | 958 KB | ##################################### | 100% \n",
"pip-19.3.1 | 1.9 MB | ##################################### | 100% \n",
"ruamel_yaml-0.15.87 | 269 KB | ##################################### | 100% \n",
"requests-2.22.0 | 90 KB | ##################################### | 100% \n",
"setuptools-42.0.2 | 654 KB | ##################################### | 100% \n",
"pycosat-0.6.3 | 113 KB | ##################################### | 100% \n",
"conda-package-handli | 879 KB | ##################################### | 100% \n",
"tqdm-4.40.2 | 53 KB | ##################################### | 100% \n",
"zlib-1.2.11 | 120 KB | ##################################### | 100% \n",
"libedit-3.1.20181209 | 188 KB | ##################################### | 100% \n",
"sqlite-3.30.1 | 1.9 MB | ##################################### | 100% \n",
"urllib3-1.25.7 | 161 KB | ##################################### | 100% \n",
"certifi-2019.11.28 | 156 KB | ##################################### | 100% \n",
"cffi-1.13.2 | 233 KB | ##################################### | 100% \n",
"pycparser-2.19 | 89 KB | ##################################### | 100% \n",
"asn1crypto-1.2.0 | 159 KB | ##################################### | 100% \n",
"conda-4.8.0 | 3.0 MB | ##################################### | 100% \n",
"idna-2.8 | 103 KB | ##################################### | 100% \n",
"cryptography-2.8 | 618 KB | ##################################### | 100% \n",
"ca-certificates-2019 | 132 KB | ##################################### | 100% \n",
"chardet-3.0.4 | 170 KB | ##################################### | 100% \n",
"Preparing transaction: done\n",
"Verifying transaction: done\n",
"Executing transaction: done\n",
"\n",
"Note: you may need to restart the kernel to use updated packages.\n"
]
}
],
"source": [
"conda update -n base -c defaults conda"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Check the data again!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [],
"source": [
"recipes.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"## [bamboo_tree] Only Asian and Indian Cuisines\n",
"\n",
"Here, we create a decision tree for the recipes of just some of the Asian cuisines (Korean, Japanese, Chinese, Thai) and the Indian cuisine. We do this because a decision tree does not perform well when the data is heavily biased towards one cuisine, in this case the American cuisines. One option is to exclude the American cuisines from our analysis; another is to build decision trees for different subsets of the data. Let's go with the latter."
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Let's build our decision tree using the data pertaining to the Asian and Indian cuisines and name our decision tree *bamboo_tree*."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [],
"source": [
"# select subset of cuisines\n",
"asian_indian_recipes = recipes[recipes.cuisine.isin([\"korean\", \"japanese\", \"chinese\", \"thai\", \"indian\"])]\n",
"cuisines = asian_indian_recipes[\"cuisine\"]\n",
"ingredients = asian_indian_recipes.iloc[:,1:]\n",
"\n",
"bamboo_tree = tree.DecisionTreeClassifier(max_depth=3)\n",
"bamboo_tree.fit(ingredients, cuisines)\n",
"\n",
"print(\"Decision tree model saved to bamboo_tree!\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Let's plot the decision tree and see what it looks like."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [],
"source": [
"export_graphviz(bamboo_tree,\n",
" feature_names=list(ingredients.columns.values),\n",
" out_file=\"bamboo_tree.dot\",\n",
" class_names=np.unique(cuisines),\n",
" filled=True,\n",
" node_ids=True,\n",
" special_characters=True,\n",
" impurity=False,\n",
" label=\"all\",\n",
" leaves_parallel=False)\n",
"\n",
"with open(\"bamboo_tree.dot\") as bamboo_tree_image:\n",
" bamboo_tree_graph = bamboo_tree_image.read()\n",
"graphviz.Source(bamboo_tree_graph)"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"The decision tree learned:\n",
"* If a recipe contains *cumin* and *fish* and **no** *yoghurt*, then it is most likely a **Thai** recipe.\n",
"* If a recipe contains *cumin* but **no** *fish* and **no** *soy_sauce*, then it is most likely an **Indian** recipe."
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"You can analyze the remaining branches of the tree to come up with similar rules for determining the cuisine of different recipes. "
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Feel free to select another subset of cuisines and build a decision tree of their recipes. You can select some European cuisines and build a decision tree to explore the ingredients that differentiate them."
]
},
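{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, a tree for some European cuisines could be built the same way. The sketch below assumes the `recipes` dataframe and the imports from the earlier cells; the cuisine names chosen here are just one possible subset, so check `recipes[\"cuisine\"].value_counts()` for the names actually present in your data:\n",
"\n",
"```python\n",
"# sketch: decision tree for a subset of European cuisines\n",
"european_recipes = recipes[recipes.cuisine.isin([\"italian\", \"french\", \"spanish\", \"greek\"])]\n",
"european_cuisines = european_recipes[\"cuisine\"]\n",
"european_ingredients = european_recipes.iloc[:, 1:]\n",
"\n",
"european_tree = tree.DecisionTreeClassifier(max_depth=3)\n",
"european_tree.fit(european_ingredients, european_cuisines)\n",
"```"
]
},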
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"# Model Evaluation <a id=\"4\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"<img src=\"https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DS0103EN/labs/images/lab4_fig5_flowchart_evaluation.png\" width=500>"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"To evaluate our model of Asian and Indian cuisines, we will split our dataset into a training set and a test set. We will build the decision tree using the training set. Then, we will test the model on the test set and compare the cuisines that the model predicts to the actual cuisines. "
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Let's first create a new dataframe using only the data pertaining to the Asian and the Indian cuisines, and let's call the new dataframe **bamboo**."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [],
"source": [
"bamboo = recipes[recipes.cuisine.isin([\"korean\", \"japanese\", \"chinese\", \"thai\", \"indian\"])]"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Let's see how many recipes exist for each cuisine."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [],
"source": [
"bamboo[\"cuisine\"].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Let's remove 30 recipes from each cuisine to use as the test set, and let's name this test set **bamboo_test**."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [],
"source": [
"# set sample size\n",
"sample_n = 30"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Create a dataframe containing 30 recipes from each cuisine, selected randomly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [],
"source": [
"# take 30 recipes from each cuisine\n",
"np.random.seed(1234) # set the seed used by pandas' random sampling\n",
"bamboo_test = bamboo.groupby(\"cuisine\", group_keys=False).apply(lambda x: x.sample(sample_n))\n",
"\n",
"bamboo_test_ingredients = bamboo_test.iloc[:,1:] # ingredients\n",
"bamboo_test_cuisines = bamboo_test[\"cuisine\"] # corresponding cuisines or labels"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Check that there are 30 recipes for each cuisine."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [],
"source": [
"# check that we have 30 recipes from each cuisine\n",
"bamboo_test[\"cuisine\"].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Next, let's create the training set by removing the test set from the **bamboo** dataset, and let's call the training set **bamboo_train**."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [],
"source": [
"bamboo_test_index = bamboo.index.isin(bamboo_test.index)\n",
"bamboo_train = bamboo[~bamboo_test_index]\n",
"\n",
"bamboo_train_ingredients = bamboo_train.iloc[:,1:] # ingredients\n",
"bamboo_train_cuisines = bamboo_train[\"cuisine\"] # corresponding cuisines or labels"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Check that there are 30 _fewer_ recipes now for each cuisine."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"button": false,
"collapsed": true,
"deletable": true,
"jupyter": {
"outputs_hidden": true
},
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [],
"source": [
"bamboo_train[\"cuisine\"].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Let's build the decision tree using the training set, **bamboo_train**, and name the generated tree **bamboo_train_tree** for prediction."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"button": false,
"collapsed": true,
"deletable": true,
"jupyter": {
"outputs_hidden": true
},
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [],
"source": [
"bamboo_train_tree = tree.DecisionTreeClassifier(max_depth=15)\n",
"bamboo_train_tree.fit(bamboo_train_ingredients, bamboo_train_cuisines)\n",
"\n",
"print(\"Decision tree model saved to bamboo_train_tree!\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Let's plot the decision tree and explore it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"button": false,
"collapsed": true,
"deletable": true,
"jupyter": {
"outputs_hidden": true
},
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [],
"source": [
"export_graphviz(bamboo_train_tree,\n",
" feature_names=list(bamboo_train_ingredients.columns.values),\n",
" out_file=\"bamboo_train_tree.dot\",\n",
" class_names=np.unique(bamboo_train_cuisines),\n",
" filled=True,\n",
" node_ids=True,\n",
" special_characters=True,\n",
" impurity=False,\n",
" label=\"all\",\n",
" leaves_parallel=False)\n",
"\n",
"with open(\"bamboo_train_tree.dot\") as bamboo_train_tree_image:\n",
" bamboo_train_tree_graph = bamboo_train_tree_image.read()\n",
"graphviz.Source(bamboo_train_tree_graph)"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Now that we have defined our tree to be deeper, more decision nodes are generated."
]
},
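{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to confirm this (a sketch, assuming the training cell above ran successfully) is to inspect the fitted tree's size directly:\n",
"\n",
"```python\n",
"# number of nodes and actual depth of the fitted tree\n",
"print(\"nodes:\", bamboo_train_tree.tree_.node_count)\n",
"print(\"depth:\", bamboo_train_tree.tree_.max_depth)\n",
"```"
]
},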
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"#### Now let's test our model on the test data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"button": false,
"collapsed": true,
"deletable": true,
"jupyter": {
"outputs_hidden": true
},
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [],
"source": [
"bamboo_pred_cuisines = bamboo_train_tree.predict(bamboo_test_ingredients)"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"To quantify how well the decision tree determines the cuisine of each recipe, we will create a confusion matrix, which presents a nice summary of how many recipes from each cuisine were correctly classified. It also sheds some light on which cuisines are being confused with which other cuisines."
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"So let's go ahead and create the confusion matrix to see how well the decision tree classifies the recipes in **bamboo_test**."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"button": false,
"collapsed": true,
"deletable": true,
"jupyter": {
"outputs_hidden": true
},
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"outputs": [],
"source": [
"test_cuisines = np.unique(bamboo_test_cuisines)\n",
"bamboo_confusion_matrix = confusion_matrix(bamboo_test_cuisines, bamboo_pred_cuisines, labels=test_cuisines)\n",
"title = 'Bamboo Confusion Matrix'\n",
"cmap = plt.cm.Blues\n",
"\n",
"plt.figure(figsize=(8, 6))\n",
"bamboo_confusion_matrix = (\n",
" bamboo_confusion_matrix.astype('float') / bamboo_confusion_matrix.sum(axis=1)[:, np.newaxis]\n",
" ) * 100\n",
"\n",
"plt.imshow(bamboo_confusion_matrix, interpolation='nearest', cmap=cmap)\n",
"plt.title(title)\n",
"plt.colorbar()\n",
"tick_marks = np.arange(len(test_cuisines))\n",
"plt.xticks(tick_marks, test_cuisines)\n",
"plt.yticks(tick_marks, test_cuisines)\n",
"\n",
"fmt = '.2f'\n",
"thresh = bamboo_confusion_matrix.max() / 2.\n",
"for i, j in itertools.product(range(bamboo_confusion_matrix.shape[0]), range(bamboo_confusion_matrix.shape[1])):\n",
" plt.text(j, i, format(bamboo_confusion_matrix[i, j], fmt),\n",
" horizontalalignment=\"center\",\n",
" color=\"white\" if bamboo_confusion_matrix[i, j] > thresh else \"black\")\n",
"\n",
"plt.tight_layout()\n",
"plt.ylabel('True label')\n",
"plt.xlabel('Predicted label')\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"After running the above code, you should get a confusion matrix similar to the following:"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"<img src=\"https://ibm.box.com/shared/static/69f5m7txv2u6g47867qe0eypnfylrj4w.png\" width=500>"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"The rows represent the actual cuisines from the dataset and the columns represent the predicted ones. Each row should sum to 100%. According to this confusion matrix, we make the following observations:\n",
"\n",
"* Using the first row in the confusion matrix, 60% of the **Chinese** recipes in **bamboo_test** were correctly classified by our decision tree whereas 37% of the **Chinese** recipes were misclassified as **Korean** and 3% were misclassified as **Indian**.\n",
"\n",
"* Using the Indian row, 77% of the **Indian** recipes in **bamboo_test** were correctly classified by our decision tree and 3% of the **Indian** recipes were misclassified as **Chinese** and 13% were misclassified as **Korean** and 7% were misclassified as **Thai**."
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"**Please note** that there is some randomness in how a decision tree is built (for example, in how ties between equally good splits are broken), so you may not get exactly the same results every time you create the decision tree, even with the same training set. The performance should still be comparable, though! So don't worry if the numbers in your confusion matrix differ slightly from the ones shown above."
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Using the reference confusion matrix, what percentage of the **Japanese** recipes was correctly classified by our decision tree?"
]
},
{
"cell_type": "raw",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Your Answer:\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Double-click __here__ for the solution.\n",
"<!-- The correct answer is:\n",
"36.67%.\n",
"-->"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Also using the reference confusion matrix, what percentage of the **Korean** recipes was misclassified as **Japanese**?"
]
},
{
"cell_type": "raw",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Your Answer:\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Double-click __here__ for the solution.\n",
"<!-- The correct answer is:\n",
"3.33%.\n",
"-->"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"According to the reference confusion matrix, which cuisine has the lowest percentage of recipes correctly classified by the decision tree?"
]
},
{
"cell_type": "raw",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Your Answer:\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"Double-click __here__ for the solution.\n",
"<!-- The correct answer is:\n",
"Japanese cuisine, with 36.67% only.\n",
"-->"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"<br>\n",
"<hr>"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"### Thank you for completing this lab!\n",
"\n",
"This notebook was created by [Alex Aklson](https://www.linkedin.com/in/aklson/). We hope you found this lab session interesting. Feel free to contact us if you have any questions!"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"This notebook is part of a course on **Coursera** called *Data Science Methodology*. If you accessed this notebook outside the course, you can take this course online by clicking [here](https://cocl.us/DS0103EN_Coursera_LAB4)."
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"deletable": true,
"new_sheet": false,
"run_control": {
"read_only": false
}
},
"source": [
"<hr>\n",
"\n",
"Copyright &copy; 2019 [Cognitive Class](https://cognitiveclass.ai/?utm_source=bducopyrightlink&utm_medium=dswb&utm_campaign=bdu). This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license/)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python",
"language": "python",
"name": "conda-env-python-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
},
"widgets": {
"state": {},
"version": "1.1.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}