Swarchal · September 16, 2016 15:17
diff --git a/ML_intro.ipynb b/ML_intro.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Machine learning introduction\n",
    "\n",
    "Implementing machine learning is the easy part of the challenge. Getting a complicated data set into the right format or teasing out more accurate predictions are the difficult parts.\n",
    "\n",
    "## Classification\n",
    "\n",
    "For a classification task we have our input data, either numeric or classses, and we want to predict labels. I.e given the weight and color of a fruit, can we predict the name of a fruit?\n",
    "\n",
    "\n",
    "## Regression\n",
    "\n",
    "A regression task requires predicting a numerical value from some input data. I.e given the name of a fruit, colour and age, can we predict it's weight? Given someone's age and house location, can we predict their salary?\n",
    "\n",
    "\n",
    "## Features and labels.\n",
    "\n",
    "Data is split into features and labels. Features are what you use to build your model, labels are what you want to predict.\n",
    "\n",
    "e.g Features\n",
    "\n",
    "| weight | colour |\n",
    "|--------|--------|\n",
    "| 150    | red    |\n",
    "| 200    | green  |\n",
    "| 180    | green  |\n",
    "| 300    | orange |\n",
    "| 280    | yellow |\n",
    "\n",
    "Labels:\n",
    "\n",
    "| Labels |\n",
    "|-------|\n",
    "| apple |\n",
    "| apple |\n",
    "| apple |\n",
    "| orange |\n",
    "| banana |\n",
    "\n",
    "\n",
    "The aim is: given a new set of features, predict the labels.\n",
    "\n",
    "## Training and test data\n",
    "\n",
    "When constructing a model, a common problem is over-fitting. That means our model has not generalised well about the data, and instead has just memorised the data and labels, and will perform poorly at predicting labels for data it has never seen.\n",
    "\n",
    "For this reason we normally split any data we receive into a training and test set. We create the model using the training data, and then test on the test data to predict test labels, then our accuracy is scored using how well it predicted the test labels. Never train the model on your test data as you will then report over-optimistic accuracies. This is because overfitting is nearly un-avoidable, so your prediction accuracy is nearly always higher on your training data than your test data. This is why you normally see a training accuracy and a test accuracy - you should only be interested in the test accuracy."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Example 1: iris dataset\n",
    "\n",
    "Hopefully you're aware of the iris dataset, 4 columns describing petal/sepal length and width, another column listing the species."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from sklearn import datasets\n",
    "iris = datasets.load_iris()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sepal length (cm)</th>\n",
       "      <th>sepal width (cm)</th>\n",
       "      <th>petal length (cm)</th>\n",
       "      <th>petal width (cm)</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5.1</td>\n",
       "      <td>3.5</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.9</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.7</td>\n",
       "      <td>3.2</td>\n",
       "      <td>1.3</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.6</td>\n",
       "      <td>3.1</td>\n",
       "      <td>1.5</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5.0</td>\n",
       "      <td>3.6</td>\n",
       "      <td>1.4</td>\n",
       "      <td>0.2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)\n",
       "0                5.1               3.5                1.4               0.2\n",
       "1                4.9               3.0                1.4               0.2\n",
       "2                4.7               3.2                1.3               0.2\n",
       "3                4.6               3.1                1.5               0.2\n",
       "4                5.0               3.6                1.4               0.2"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.DataFrame(iris.data, columns=iris[\"feature_names\"]).head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The target species here are coded as integers: `0, 1, 2`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
       "       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
       "       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
       "       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
       "       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,\n",
       "       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,\n",
       "       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "iris.target"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So the goal is given the 4 columns of measurements, predict the species."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# this is how easy sci-kit learn is\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.cross_validation import train_test_split"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So we want to split into training and test datasets, we will use 80% of the data to build a model and test on the remaining 20%.\n",
    "\n",
    "*(In machine learning the data and labels are normally abbreviated to x and y respectively, so using x to predict y)*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n",
       "            max_depth=None, max_features='auto', max_leaf_nodes=None,\n",
       "            min_samples_leaf=1, min_samples_split=2,\n",
       "            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,\n",
       "            oob_score=False, random_state=None, verbose=0,\n",
       "            warm_start=False)"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2)\n",
    "\n",
    "model = RandomForestClassifier()\n",
    "\n",
    "# give the model the training data, and the corresponding labels\n",
    "model.fit(x_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now lets predict the species labels from the test data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([2, 0, 1, 0, 1, 0, 0, 2, 2, 2, 2, 1, 2, 2, 0, 1, 0, 1, 0, 1, 1, 2, 0,\n",
       "       0, 0, 1, 0, 2, 1, 1])"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.predict(x_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can compare this against the actual labels (`y_test`), though scikit-learn already has functions to do that for you."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.93333333333333335"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.score(x_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "93% accuracy! Though it's a very simple problem."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Further reading\n",
    "\n",
    "I was going to do a whole tutorial, though if you want to know more I recommend this book (and it's a free pdf)\n",
    "http://www-bcf.usc.edu/~gareth/ISL/"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
 }
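
The overfitting point from the notebook's *Training and test data* section can be seen without scikit-learn at all. Here is a minimal sketch, assuming made-up fruit weights (the numbers are synthetic, not real measurements) and a hypothetical 1-nearest-neighbour `predict` helper that does nothing but memorise the training rows. It is not the random forest used in the notebook, and the helper names (`make_fruit`, `split`, `predict`, `accuracy`) are illustrative only.

```python
import random


def make_fruit(rng):
    """Return one synthetic (weight, label) row; the two classes overlap on purpose."""
    label = rng.choice(["apple", "orange"])
    mean_weight = 170.0 if label == "apple" else 200.0
    return rng.gauss(mean_weight, 30.0), label


def split(rows, test_fraction, rng):
    """Shuffle a copy of the rows and hold out test_fraction of them as the test set."""
    rows = rows[:]
    rng.shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


def predict(train, weight):
    """Pure memorisation: return the label of the training row with the closest weight."""
    nearest = min(train, key=lambda row: abs(row[0] - weight))
    return nearest[1]


def accuracy(train, rows):
    """Fraction of rows whose predicted label matches the true label."""
    correct = sum(predict(train, weight) == label for weight, label in rows)
    return correct / len(rows)


rng = random.Random(0)  # fixed seed so the split is repeatable
data = [make_fruit(rng) for _ in range(200)]
train, test = split(data, 0.2, rng)  # 80% to build the model, 20% to test it

train_accuracy = accuracy(train, train)  # scored on rows the model has memorised
test_accuracy = accuracy(train, test)    # scored on rows the model has never seen
print("training accuracy:", train_accuracy)
print("test accuracy:    ", test_accuracy)
```

Because this model only memorises, every training row is matched to itself, so the training accuracy tells you nothing about new data. The held-out rows give the more honest (and lower) figure, which is why the test accuracy is the one to report.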