Created
December 12, 2018 02:42
-
-
Save STHITAPRAJNAS/7775f2298dfa3b26502d4728a9232fbb to your computer and use it in GitHub Desktop.
Created on Cognitive Class Labs
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<a href=\"https://cognitiveclass.ai\"><img src = \"https://ibm.box.com/shared/static/9gegpsmnsoo25ikkbl4qzlvlyjbgxs5x.png\" width = 400> </a>\n", | |
"\n", | |
"<h1 align=center><font size = 5>From Problem to Approach</font></h1>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Introduction\n", | |
"\n", | |
"The aim of these labs is to reinforce the concepts that we discuss in each module's videos. These labs will revolve around the use case of food recipes, and together, we will walk through the process that data scientists usually follow when trying to solve a problem. Let's get started!\n", | |
"\n", | |
"In this lab, we will start learning about the data science methodology, and focus on the **Business Understanding** and the **Analytic Approach** stages.\n", | |
"\n", | |
"------------" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Table of Contents\n", | |
"\n", | |
"<div class=\"alert alert-block alert-info\" style=\"margin-top: 20px\">\n", | |
"\n", | |
"1. [Business Understanding](#0)<br>\n", | |
"2. [Analytic Approach](#2) <br>\n", | |
"</div>\n", | |
"<hr>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Business Understanding <a id=\"0\"></a>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"This is the **Data Science Methodology**, a flowchart that begins with business understanding." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<img src=\"https://ibm.box.com/shared/static/eyl60n09iige3eo5tac3dweqko2s58oo.png\" width=500>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Why is the business understanding stage important?" | |
] | |
}, | |
{ | |
"cell_type": "raw", | |
"metadata": {}, | |
"source": [ | |
"Your Answer: \n", | |
"Because this stage decides what should be done in next steps.\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Double-click __here__ for the solution.\n", | |
"<!-- The correct answer is:\n", | |
"It helps clarify the goal of the entity asking the question.\n", | |
"-->" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Looking at this diagram, we immediately spot two outstanding features of the data science methodology." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<img src = \"https://ibm.box.com/shared/static/6u3evi4h52e80cq78alqgza8nhfy8vhl.png\" width = 500> " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### What are they?" | |
] | |
}, | |
{ | |
"cell_type": "raw", | |
"metadata": {}, | |
"source": [ | |
"Your Answer: \n", | |
"1.lot of back n forth\n", | |
"2.it never completes" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Double-click __here__ for the solution.\n", | |
"<!--The correct answer is: \n", | |
"\\\\ 1. The flowchart is highly iterative.\n", | |
"\\\\ 2. The flowchart never ends. \n", | |
"-->" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Now let's illustrate the data science methodology with a case study." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Say, we are interested in automating the process of figuring out the cuisine of a given dish or recipe. Let's apply the business understanding stage to this problem." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Q. Can we predict the cuisine of a given dish using the name of the dish only?" | |
] | |
}, | |
{ | |
"cell_type": "raw", | |
"metadata": {}, | |
"source": [ | |
"Your Answer:\n", | |
"\n", | |
"No" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Double-click __here__ for the solution.\n", | |
"<!--The correct answer is: \n", | |
"No.\n", | |
"-->" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Q. For example, the following dish names were taken from the menu of a local restaurant in Toronto, Ontario in Canada. \n", | |
"\n", | |
"#### 1. Beast\n", | |
"#### 2. 2 PM\n", | |
"#### 3. 4 Minute" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Are you able to tell the cuisine of these dishes?" | |
] | |
}, | |
{ | |
"cell_type": "raw", | |
"metadata": {}, | |
"source": [ | |
"Your Answer:\n", | |
"\n", | |
"no" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Double-click __here__ for the solution.\n", | |
"<!--The correct answer is: \n", | |
"The cuisine is <strong>Japanese</strong>. Here are links to the images of the dishes: \n", | |
"-->\n", | |
"\n", | |
"<!--\n", | |
"Beast: https://ibm.box.com/shared/static/5e7duvewfl5bk4317sna5skvdhrehro2.png\n", | |
"-->\n", | |
"\n", | |
"<!--\n", | |
"2PM: https://ibm.box.com/shared/static/d9xuzqm8cq76zxxcc0f9gdts4iksipyk.png\n", | |
"-->\n", | |
"\n", | |
"<!--\n", | |
"4 Minute: https://ibm.box.com/shared/static/f1fwvvwn4u8rx8tghep6zyj5pi6a8v8k.png\n", | |
"-->\n", | |
"\n", | |
"<!--\n", | |
"Photographs by Avlxyz: https://commons.wikimedia.org/wiki/Category:Photographs_by_Avlxyz\n", | |
"-->" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Q. What about by appearance only? Yes or No." | |
] | |
}, | |
{ | |
"cell_type": "raw", | |
"metadata": {}, | |
"source": [ | |
"Your Answer:\n", | |
"No\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Double-click __here__ for the solution.\n", | |
"<!--The correct answer is: \n", | |
"No, especially when it comes to countries in close geographical proximity such as Scandinavian countries, or Asian countries.\n", | |
"-->" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"At this point, we realize that automating the process of determining the cuisine of a given dish is not a straightforward problem as we need to come up with a way that is very robust to the many cuisines and their variations." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Q. What about determining the cuisine of a dish based on its ingredients?" | |
] | |
}, | |
{ | |
"cell_type": "raw", | |
"metadata": {}, | |
"source": [ | |
"Your Answer:\n", | |
"may be but not easy\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Double-click __here__ for the solution.\n", | |
"<!--The correct answer is: \n", | |
"Potentially yes, as there are specific ingredients unique to each cuisine.\n", | |
"-->" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"As you guessed, yes determining the cuisine of a given dish based on its ingredients seems like a viable solution as some ingredients are unique to cuisines. For example:" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"* When we talk about **American** cuisines, the first ingredient that comes to one's mind (or at least to my mind =D) is beef or turkey.\n", | |
"\n", | |
"* When we talk about **British** cuisines, the first ingredient that comes to one's mind is haddock or mint sauce.\n", | |
"\n", | |
"* When we talk about **Canadian** cuisines, the first ingredient that comes to one's mind is bacon or poutine.\n", | |
"\n", | |
"* When we talk about **French** cuisines, the first ingredient that comes to one's mind is bread or butter.\n", | |
"\n", | |
"* When we talk about **Italian** cuisines, the first ingredient that comes to one's mind is tomato or ricotta.\n", | |
"\n", | |
"* When we talk about **Japanese** cuisines, the first ingredient that comes to one's mind is seaweed or soy sauce.\n", | |
"\n", | |
"* When we talk about **Chinese** cuisines, the first ingredient that comes to one's mind is ginger or garlic.\n", | |
"\n", | |
"* When we talk about **indian** cuisines, the first ingredient that comes to one's mind is masala or chillis." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Accordingly, can you determine the cuisine of the dish associated with the following list of ingredients?" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<img src=\"https://ibm.box.com/shared/static/bs81fv2n5lidkv9hg827qn5s86gdk090.png\" width=600>" | |
] | |
}, | |
{ | |
"cell_type": "raw", | |
"metadata": {}, | |
"source": [ | |
"Your Answer:\n", | |
"japanese\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Double-click __here__ for the solution.\n", | |
"<!--The correct answer is: \n", | |
"Japanese since the recipe is most likely that of a sushi roll.\n", | |
"-->" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Analytic Approach <a id=\"2\"></a>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<img src=\"https://ibm.box.com/shared/static/i19z7bijbksl5kkmm81ngg39y0ruzirh.png\" width=500>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### So why are we interested in data science?" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Once the business problem has been clearly stated, the data scientist can define the analytic approach to solve the problem. This step entails expressing the problem in the context of statistical and machine-learning techniques, so that the entity or stakeholders with the problem can identify the most suitable techniques for the desired outcome. " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Why is the analytic approach stage important?" | |
] | |
}, | |
{ | |
"cell_type": "raw", | |
"metadata": {}, | |
"source": [ | |
"Your Answer:\n", | |
"\n", | |
"because we can find the pattern for the particular problem" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Double-click __here__ for the solution.\n", | |
"<!--The correct answer is: \n", | |
"Because it helps identify what type of patterns will be needed to address the question most effectively.\n", | |
"-->" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Let's explore a machine learning algorithm, decision trees, and see if it is the right technique to automate the process of identifying the cuisine of a given dish or recipe while simultaneously providing us with some insight on why a given recipe is believed to belong to a certain type of cuisine." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"This is a decision tree that a naive person might create manually. Starting at the top with all the recipes for all the cuisines in the world, if a recipe contains **rice**, then this decision tree would classify it as a **Japanese** cuisine. Otherwise, it would be classified as not a **Japanese** cuisine." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<img src=\"https://ibm.box.com/shared/static/1dzmelrcfsgba47rbbagbxiqisqgy63n.png\" width=500>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Is this a good decision tree? Yes or No, and why? " | |
] | |
}, | |
{ | |
"cell_type": "raw", | |
"metadata": {}, | |
"source": [ | |
"Your Answer:\n", | |
"No - because it has only one decision node\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Double-click __here__ for the solution.\n", | |
"<!--The correct answer is: \n", | |
"No, because a plethora of dishes from other cuisines contain rice. Therefore, using rice as the ingredient in the Decision node to split on is not a good choice.\n", | |
"-->" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### In order to build a very powerful decision tree for the recipe case study, let's take some time to learn more about decision trees." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"* Decision trees are built using recursive partitioning to classify the data.\n", | |
"* When partitioning the data, decision trees use the most predictive feature (ingredient in this case) to split the data.\n", | |
"* **Predictiveness** is based on decrease in entropy - gain in information, or *impurity*." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Suppose that our data is comprised of green triangles and red circles." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The following decision tree would be considered the optimal model for classifying the data into a node for green triangles and a node for red circles." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<img src=\"https://ibm.box.com/shared/static/obwksbsin10mlg2m8x8ehe9xeyjfizqx.png\" width=400>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Each of the classes in the leaf nodes are completely pure – that is, each leaf node only contains datapoints that belong to the same class." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"On the other hand, the following decision tree is an example of the worst-case scenario that the model could output. " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<img src=\"https://ibm.box.com/shared/static/qenfaznfhwvvvlkjtzjirqgb4j5xx48l.png\" width=500>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Each leaf node contains datapoints belonging to the two classes resulting in many datapoints ultimately being misclassified." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### A tree stops growing at a node when:\n", | |
"* Pure or nearly pure.\n", | |
"* No remaining variables on which to further subset the data.\n", | |
"* The tree has grown to a preselected size limit." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### Here are some characteristics of decision trees:" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<img src=\"https://ibm.box.com/shared/static/05mkemi191f6hbhw6f3ewrusckkgu861.png\" width=800>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Now let's put what we learned about decision trees to use. Let's try and build a much better version of the decision tree for our recipe problem." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<img src=\"https://ibm.box.com/shared/static/e1ok280uavy6k8u7loli59ftoz66kk1s.png\" width = 500>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"I hope you agree that the above decision tree is a much better version than the previous one. Although we are still using **Rice** as the ingredient in the first *decision node*, recipes get divided into **Asian Food** and **Non-Asian Food**. **Asian Food** is then further divided into **Japanese** and **Not Japanese** based on the **Wasabi** ingredient. This process of splitting *leaf nodes* continues until each *leaf node* is pure, i.e., containing recipes belonging to only one cuisine." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Accordingly, decision trees is a suitable technique or algorithm for our recipe case study." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Thank you for completing this lab!\n", | |
"\n", | |
"This notebook was created by [Alex Aklson](https://www.linkedin.com/in/aklson/). I hope you found this lab session interesting. Feel free to contact me if you have any questions!" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"This notebook is part of a course on **Coursera** called *Data Science Methodology*. If you accessed this notebook outside the course, you can take this course, online by clicking [here](http://cocl.us/DS0103EN_Coursera_LAB1)." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<hr>\n", | |
"Copyright © 2018 [Cognitive Class](https://cognitiveclass.ai/?utm_source=bducopyrightlink&utm_medium=dswb&utm_campaign=bdu). This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license/)." | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.6.6" | |
}, | |
"widgets": { | |
"state": {}, | |
"version": "1.1.2" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment