@philkuz
Created August 24, 2016 22:29
Data Transformation Notebook
{
"cells": [
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.mlab as mlab\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Extraction and Transformation\n",
"[source for text](http://www.kdnuggets.com/2016/06/doing-data-science-kaggle-walkthrough-data-transformation-feature-extraction.html)\n",
"\n",
"The main purpose of data transformation and feature extraction is to enhance the data in such a way that it increases the likelihood that the classification algorithm will be able to make meaningful predictions. Unlike the steps taken during cleaning, which are designed to address problems with the raw data (missing and erroneous values, formatting issues etc.), these steps change the values and/or structure of the data (data transformation) and add additional features (feature extraction).\n",
"\n",
"## Bucketing/Binning\n",
"\n",
"A common method for manipulating numeric data, binning or bucketing is when the numerical values in a particular column are converted from a continuous series into fixed ranges. For example, instead of using the age value of all our users, we could place them into buckets such as 15-20 years old, 21-25 years old and so on.\n",
"\n",
"Typically this technique is used to manage ‘noisy data’ and to reduce the number of possible values a feature can take.\n",
"\n",
"### Example: IQ Data\n",
"In this cell we simulate the distribution of IQs in the general population. We then use a histogram to place the continuous range of IQ values into bins. This reduces the number of possible values significantly, so our algorithm has far fewer distinct values to deal with."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Generate Data \n",
"mu, sigma = 100, 15\n",
"x = mu + sigma*np.random.randn(10000)\n",
"x"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, there are many distinct values in this dataset. Consider that you may have hundreds of variables like this one, and it becomes clear that they can become computationally expensive to manage.\n",
"\n",
"Instead, why don't we break it up into discrete variables and work from there?"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAZYAAAEbCAYAAAD51qKQAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xu4HWV59/HvLwEEEQhoMRoMUTlFQCMU5KUqeKiJCoZa\nwaBWYrH6iqlaSwU8FLlq5SBoqqlaNBrEQ6yoCLxIQGFrLQhBjsIGojFsEogKbA4GJYTc7x/zrLCy\nXXvvlT0z2Wsefp/r2ldmZs3Meu7MWuueuZ85KCIwMzOryoTxboCZmeXFicXMzCrlxGJmZpVyYjEz\ns0o5sZiZWaWcWMzMrFJOLGZmViknFjMzq5QTi5k1mqSXS9pa0lMkvWy822NOLDYCSYdL+qmkVZI+\nmqb9axr/iaTDVPiVpOeOsJ5pkrbffC2vhqSpkk6TdJ6kGW3Tj5T0P5JWSvrXIcvMSsscK+ljkv5p\n87d8eJJeJOmsIdNmS/qwpBMk/d1o08dTp/YD5wCPAHcCO27+VtlQ8i1dbCSSjgFeHxFHtU37DnBh\nRHwtjf9NGl83zDreAfw4IgY2R5urImk+8GVgGnBzRNzZ9lqn/5fXA++KiNlt0z4O7BAR455gJP0z\n8FfAAxHx92na9sAVEbF/Gr8KOAx4rNP0iLhvXBpP5/an6e8ELgHuiYjHx6t99gQfsVhpEfH9EZLK\nM4B/3MxNqsoOwEMRcVF7UulE0kTgM8C/D3npE8DbJL2gpjZ2LSLOAn4wZPLLgVvaxm8EXjHC9HEz\nTPsBHouIlU4qvWOL8W6ANZukVwLzgf8LPAy8BLgPeHtE/A0wi+IHep6kWyNikaSjgZ2AtcD6iFiY\n1vVOYEvgBcCvgd2BnwNnAccBbwFOj4irJZ0ErAT2BT4fEStSuebTwJuBnYHXpvF9gWcDv4uIRR1i\n+LP2pLheCHxM0uKI+PEo/xUHA8+JiGvaJ0bEOkk3AW8Ebm17z5uBYyLiupFWKuknEXFIGv4s8IWI\n6E/jzwP+AQhArbdMwwH8PCIuGKXduwAPtI0/QPH/PjjM9K5I2hY4DdgL+AtgOfDDiPhShW1vOVCS\ngKcDyzZhOauJE4t1Y09JH0rDAvZovRARl0u6juKz9C7gtIhYJWm79PrXJR0LLIiIAUn7A6+IiHcB\nSPpM6nC9DXhPROwv6UjgpcD70w/zOyl+bM4Afp3KNydQJIMXp+lHRcS5kv4BeEpELJa0M3BSRMyR\n9BTgKmBRe2DDtSfFdRNwbkT8tIv/oynAvcO8tpriB7zdx4DbR1qhpBcCv22b9Ergg62RiFgOnNRF\n20ayI/CntvG1wNPS8HDTR5R+5L8KfDAiVkq6DDg6Ih5tzVNR21u+HBHXp/e+ISXjBytat42BE4t1\n4/aIOKM1Iukvh7ze6qj7PnCdpJ9SHMV0ciRte+5p+GjgKxQdsFDsLe/WVl5bD9wSEbe1tWEm8G5g\na+AZbet7nCJJATxI+vGOiEclTdqE9vzPMO0fzt1Ap/UDbNfWJlJ7zu9inS8H/hdA0mTg3uFKjiU8\nTJGgW7ahSISPDjO9G3OAvohYmcYfBLZK66zDjW3Dg8ChdC6Z2WbixGJVuoOijPU64GxJr4yIDXvc\nkl5CkQi2bFtmy/R3K7B92tvdHbh0yLpbSQdJewILgSMoOpkPlzQhItanWdpr7aPV3Ydrz6a6EnhY\n0j4R8cu2tgrYD/jwGNZ5CEUpD+BlwP9K2jsibknrbi8ntduUctKvgfYdhacD11GUvjpN78ZLKY5Y\nkLQFsG1EPLxRA6tpO5LeSvF5e2ua9DRG3+ZWMycWq4ooOuk/mkpS2wCTKUo5DwPbA3sC5wHz2pab\nAXw3Ih6R9GPgncDaiPjCCO91GMUe8a8kHZTe+83At4ZpV6fhlu8C7x3anhHeu6NUsnsv8OF09tIr\nKfoVngd8sz3ZAEg6Arg0Ih7587Vt8DLgkyk5
/W1q176kTvUS5aT2/4efAKe3je8HnAj8YZjpSNoN\n+HUMf0rpjTxxYtB7KE5g2EjJUlh7+1cA/5XatS3F0evlY1yvVcSnG9uwJL0O+BeKH8fPR8Tpkk6k\n+LH4NfApir3L+cBNwF3AMopE8syI+HRazyyKTvxrIuKb6fTjpwITgccj4j/TfH0U5ZcHgaXAR4HD\n0/ovBv4tddJPAz4JLKYor7wb+DFFH8d84HyKH5vTKU4ceB9FMvogcHxEfHFInH/WHkmvAT5LcST1\n8Yi4qW3+N6Z1Pg/4YkR8su21mcAxwM3p9Tsi4lOSJkfE6rb5rgPmtq93SJv2org+YyHw+/T/cThw\nWUT8qtMyo5E0DziKor/nHODTEfGwpLdRnFItYHlEfCPNP9z0fuB9EXHZCO91NEW/zLLhYqyw/W+l\nOEFgGvCtiLi6ivezsas9saQflfkUezALI+L0Ia9vBXwN2J/ih+HNqZP3AODstllPadWlR1unNU/6\ncRiMiIslbU2xhz4tIoaevtsYkp4PHBERZ0k6MiK+swnLvhvYLiLOrK+FYyNpAnBIRFwx3m2x3lTr\ndSzpA7gAmAnsDRyd9sTaHQvcHxG7UySLVifxzcD+EfFiitNG/0vShC7Xac3zQuBagIj4E8URyA7j\n2qLyfgO8W9JpwHM2cdl9KS7660VvojgN3KyjuvtYDqQ4FL4TQNJiYDYbnyEzGzg5DZ9HkTRaPy4t\n21CcGdTtOq15PknxI3w3Renl2WlaY0XE+nSXgrcAB23isvNGn2vc/L+I+ON4N8J6V92JZQpF3b1l\nJUVi6DhPRDwu6QFJO0XE/ZIOpDgNdSrwd+mL2s06rWHSdQdnjDpjw0TER4CPjHc7qhQRa8a7Ddbb\n6r6lS6ezcDqdXjh0PAAi4pqI2Ac4gOJsm626XKeZmY2Tuo9YVlIcbbTsQnEhWbu7KOrPd6f7LW0f\nEYPtM0TE7ZLWAPt0uU4AJDnhmJmNQUR02onvSt1HLEuB3STtmo425gBDL3q6kOL0TCiugr4cNtxq\nfWIa3pXiNiIrulznBhGR7d/JJ5887m1wbI7P8eX3V1atRyxR9JnMo7iKunVqcL+kU4ClEXERxXn6\n50paRnHzwjlp8ZcCJ0paS9Fx/56IuB82nM++0TrrjKNXrVixYrybUJucYwPH13S5x1dW7VfeR8Ql\nFFdct087uW34UYqLnoYu93Xg692u08zMeoOfx9Jgc+fOHe8m1Cbn2MDxNV3u8ZWV9S1dJEXO8ZmZ\n1UES0cOd91ajvr6+8W5CbXKODRxf0+UeX1lOLGZmVimXwszMbCMuhZmZWU9xYmmwnOu8OccGjq/p\nco+vLCcWMzOrlPtYzMxsI+5jMTOznuLE0mA513lzjg0cX9PlHl9ZTixmZlYp97GYmdlG3MdiZmY9\nxYmlwXKu8+YcGzi+pss9vrKcWMzMrFLuYzEzs424j8XMzHqKE0uD5VznzTk2cHxNl3t8ZTmxmJlZ\npdzHYmZmGynbx7JFlY0xs+7NPGImA6sGxrz81ClTWXL+kgpbZFYNJ5YG6+vr49BDDx3vZtQi59ig\niG9g1QCT500e8zoGFow9KdXtybD9co6vLPexmJlZpdzHYjZOph8wvdQRy+oFq+lf2l9hi8wKvo7F\nzMx6ihNLg+V8Ln3OsYHja7rc4yvLicXMzCpVex+LpFnAfIoktjAiTh/y+lbA14D9gXuBN0fEgKRX\nA6cBWwJrgQ9FxBVpmSuAZwF/BAJ4TUTc2+G93cditSh7qjDAwMoBDjztwDEv7z4Wq0tPX8ciaQKw\nAHgVcDewVNIPIuK2ttmOBe6PiN0lvRk4A5gD/B44LCJWS9obWALs0rbc0RFxfZ3tNxtO2VOFAZYf\nv7yi1pj1lrpLYQcCyyLizoh4DFgMzB4yz2zgnDR8HkUSIiJujIjVafgW4CmStmxb7klfxsu5zptz\nbACD/YPj
3YRa5b79co+vrLp/nKcAd7WNr0zTOs4TEY8DD0jaqX0GSW8Crk/JqeUrkq6T9NHqm21m\nZmNVd2LpVKMb2ukxdB61z5PKYKcC72qb5y0R8SLgZcDLJL2tgrY2Ts5X/uYcG8CO03cc7ybUKvft\nl3t8ZdV9S5eVwNS28V0o+lra3QU8B7hb0kRg+4gYBJC0C/A94O8iYkVrgYi4J/27RtI3KUpuX+/U\ngLlz5zJt2jQAJk2axIwZMzZ8KFqHsx73+FjGW+WsVpLY1PH1a9cz2D845uXXPLRmo1uLjPf/h8eb\nO97X18eiRYsANvxellHrWWEpUdxO0W9yD3ANRad7f9s8xwH7RMRxkuYAR0TEHEmTgD7glIj4/pB1\nToqI+1KfyzeByyLi7A7vn/VZYe0/Krnp9djKXjU/2D9I/8J+Dj7z4DGvo5fPCuv17VdW7vH19Flh\nEfG4pHnApTxxunG/pFOApRFxEbAQOFfSMuA+ijPCAN4LPB/4mKR/JZ1WDDwCLJG0BTAR+BHwpTrj\nMDOz7vleYWZjUPaIBeDK46/M9ojFms33CjMzs57ixNJgrc63HOUcG/g6lqbLPb6ynFjMzKxSTiwN\nlvNZKTnHBr6Opelyj68sJxYzM6uUE0uD5VznzTk2cB9L0+UeX1lOLGZmViknlgbLuc6bc2zgPpam\nyz2+spxYzMysUk4sDZZznTfn2MB9LE2Xe3xlObGYmVmlnFgaLOc6b86xgftYmi73+MpyYjEzs0o5\nsTRYznXenGMD97E0Xe7xleXEYmZmlXJiabCc67w5xwbuY2m63OMry4nFzMwq5cTSYDnXeXOODdzH\n0nS5x1dWrc+8N+tVM4+YycCqgTEvP7BygMmUezSxWa6cWBos5zpv3bENrBoo9cz65ccvL/X+7mNp\nttzjK8uJxayhBgYGmH7A9DEvP3XKVJacv6TCFpkVnFgarK+vL9s9p5xjg2r6WNatX1fqqGtgwdhL\ngaPJffvlHl9Z7rw3M7NKObE0WM57TDnHBu5jabrc4yvLicXMzCrlxNJgOZ9Ln3Ns4OtYmi73+Mpy\nYjEzs0o5sTRYznXenGMD97E0Xe7xleXEYmZmlao9sUiaJek2SXdIOqHD61tJWixpmaSrJE1N018t\n6VpJN0paKukVbcvsJ+mmtM75dcfQq3Ku8+YcG7iPpelyj6+sWhOLpAnAAmAmsDdwtKS9hsx2LHB/\nROwOzAfOSNN/DxwWES8C5gLnti3zBeCdEbEHsIekmfVFYWZmm6LuI5YDgWURcWdEPAYsBmYPmWc2\ncE4aPg94FUBE3BgRq9PwLcBTJG0paTKwXURck5b5GnBEzXH0pJzrvDnHBu5jabrc4yur7sQyBbir\nbXxlmtZxnoh4HHhA0k7tM0h6E3B9Sk5T0npGWqeZmY2Tuu8Vpg7TYpR51D6PpL2BU4G/3oR1bjB3\n7lymTZsGwKRJk5gxY8aGvY1WnbSp4/Pnz88qnvbx9hp2Xe/X6udoHT1szvHB/kHWr13PYP/gmNdX\ndvk1D63Z6J5XTdt+uX8+N3c8ixYtAtjwe1mGIob9TS6/cukg4OMRMSuNnwhERJzeNs8P0zxXS5oI\n3BMRO6fXdgF+DBwTET9P0yYDV0TE9DQ+BzgkIt7T4f2jzvjGW/uPQm7qjm36AdNL3cDxyuOv5OAz\nDx7z8oP9g/Qv7C+1jrJtWL1gNf1L+8e8/Ehy/mxC/vFJIiI67cR3pe5S2FJgN0m7StoKmANcMGSe\nC4Fj0vCRwOUAkiYBFwEntpIKQOp3eUjSgZIEvB34Qb1h9KacP9g5xwbuY2m63OMrq9bEkvpM5gGX\nArcAiyOiX9Ipkg5Lsy0EniFpGfAB4MQ0/b3A84GPSbpe0nWSnpFeOy4tdwfFyQGX1BmHmZl1r/br\nWCLikojYMyJ2j4jT0rSTI+KiNPxoRByVXj8oIlak6f8eEdtFxH4R8eL077
3ptV9ExL5pmffXHUOv\naq/z5ibn2MDXsTRd7vGV5SvvzcysUk4sDZZznTfn2MB9LE2Xe3xlObGYmVmlnFgaLOc6b86xgftY\nmi73+MpyYjEzs0o5sTRYznXenGMD97E0Xe7xleXEYmZmlXJiabCc67w5xwbuY2m63OMrq6vEIum7\nkl6fnq9iZmY2rG4TxReAtwDLJJ3W4WFdNg5yrvPmHBu4j6Xpco+vrK4SS0T8KCLeCuwHrAAuk3Sl\npHdI2rLOBpqZWbN0XdqS9HSKRwS/E7ge+A+KRHNZLS2zUeVc5805NnAfS9PlHl9ZXT3oS9L3gL0o\nnjt/eETck176tqRr62qcmZk1T7dPkPxyRFzcPkHSU9Kdif+yhnZZF3Ku8+YcG7iPpelyj6+sbkth\nn+gw7aoqG2JmZnkYMbFImixpf2AbSS+WtF/6OxR46mZpoQ0r5zpvzrGB+1iaLvf4yhqtFDaTosN+\nF+DTbdMfBj5cU5vMzKzBRkwsEXEOcI6kv42I726mNlmXcq7zjhbbzCNmMrBqYMzrH1g5wGQmj3n5\nstzH0my5x1fWiIlF0tsi4uvANEkfHPp6RHy6w2JmtRtYNcDkeWNPDMuPX15ha8ys3Wid99umf58G\nbNfhz8ZRznXenGMD97E0Xe7xlTVaKey/0r+nbJ7mmJlZ041WCvvsSK9HxPuqbY5tipzrvDnHBu5j\nabrc4ytrtLPCfrFZWmFmm93AwADTD5g+5uWnTpnKkvOXVNgiy0U3Z4VZj+rr68t2zynn2KA3+ljW\nrV9X6gSIgQXDn5WX+/bLPb6yRiuFzY+ID0i6EIihr0fEG2prmZmZNdJopbBz079n1t0Q23Q57zHl\nHBu4j6Xpco+vrNFKYb9I//5E0lYUdzgO4PaIWLsZ2mdmZg3T7aOJXw/8GvgssAD4laTX1tkwG13O\n59LnHBv0Rh9LnXLffrnHV1a3dzc+C3hFRBwaEYcArwA+082CkmZJuk3SHZJO6PD6VpIWS1om6SpJ\nU9P0nSRdLunhoac9S7oirfN6SddJekaXcZiZWc26fR7LwxHxq7bx5RQ3ohyRpAkURzivAu4Glkr6\nQUTc1jbbscD9EbG7pDcDZwBzgD8BHwX2SX9DHR0R13fZ/izlXOfNOTZwH0vT5R5fWaOdFfbGNHit\npIuB/6boYzkSWNrF+g8ElkXEnWl9i4HZQHtimQ2cnIbPo0hERMQjwJWSdh9m3V0/VtnMzDaf0X6c\nD09/WwO/BQ4BDgV+D2zTxfqnAHe1ja9M0zrOExGPAw9I2qmLdX8llcE+2sW8Wcq5zptzbOA+lqbL\nPb6yRjsr7B0l169Oqx1lHnWYZ6i3RMQ9krYFvtd2F+Y/M3fuXKZNmwbApEmTmDFjxobD2NaHo6nj\nN9xwQ0+1Z3OPt36cW2Wlpo2vX7uewf7Bxi6/5qE1G10oON6fB4+Pfbyvr49FixYBbPi9LEMRo/2G\ng6StKfpC9qY4egEgIv5+lOUOAj4eEbPS+InFYnF62zw/TPNcLWkicE9E7Nz2+jHA/sPdl2yk1yVF\nN/FZ80w/YHqpq8avPP5KDj7z4HFbvhfaUHb51QtW07+0f8zLW++SRER0OjDoSrf9FOcCkymeKPkT\niidKjtp5T9EPs5ukXdN1MHOAC4bMcyFwTBo+Eri8w3o2BChpoqSnp+EtgcOAX3YZh5mZ1azbxLJb\nRHwMWJPuH/Z64CWjLZT6TOYBlwK3AIsjol/SKZIOS7MtBJ4haRnwAeDE1vKSfkNxqvMxkgYk7QU8\nBVgi6QbgOop+my91GUdWcq7z5hwbuI+l6XKPr6xuTzd+LP37gKR9gNXAziPMv0FEXALsOWTayW3D\njwJHDbPsc4dZ7V92895mZrb5dZtYzpa0I/AxilLW09KwjaOcz6XPOTbwdSxNl3t8ZXWVWCLiy2nw\nJ8Dz6muOmZk1Xbf3Cnu6pM+l60Z+IW
l+qwPdxk/Odd6cYwP3sTRd7vGV1W3n/WLgd8DfAm8C7gW+\nXVejzMysubrtY3lWRPxb2/gn0n29bBzlXOfNOTZwH0vT5R5fWd0esVwqaY6kCenvKMAPuzYzsz8z\nYmJJt6x/CPgH4JvA2vS3GHhX/c2zkeRc5805NnAfS9PlHl9Zo90rbLvN1RAzM8tDt30sSHoD8PI0\n2hcRF9XTJOtWznXenGMD97E0Xe7xldXt6canAe8Hbk1/70/TzMzMNtJt5/3rgL+OiK9ExFeAWWma\njaOc67w5xwbuY2m63OMra1OewjipbXiHqhtiZmZ56LaP5VTgeklXUNzC/uXASbW1yrqSc50359jA\nfSxNl3t8ZY2aWCQJ+BlwEHAARWI5ISJW19w2MzNroFFLYekRjBdHxD0RcUFE/MBJpTfkXOfNOTZw\nH0vT5R5fWd32sVwn6YBaW2JmZlnoto/lJcDbJK0A1lCUwyIiXlhXw2x0Odd5c44N3MfSdLnHV1a3\niWVmra0wM7NsjHavsK0lfQD4F4prV1ZFxJ2tv83SQhtWznXenGMD97E0Xe7xlTVaH8s5FM+Xvxl4\nLXBW7S0yM7NGG60U9oKI2BdA0kLgmvqbZN3Kuc6bc2zgPpamyz2+skZLLI+1BiJiXXFJi1l5M4+Y\nycCqgTEvP7BygMlMrrBFZlaV0RLLi9LzWKA4E2ybNN46K2z7WltnI+rr62vsntPAqgEmzxs+MQz2\nD464V7/8+OV1NGuzeTL0sTT1s9mN3OMra7TnsUzcXA0xM7M8bMpNKK3H5LzHlHsfRO7x5fzZhPzj\nK8uJxczMKuXE0mA5n0ufex9E7vHl/NmE/OMry4nFzMwqVXtikTRL0m2S7pB0QofXt5K0WNIySVdJ\nmpqm7yTpckkPS/rskGX2k3RTWuf8umPoVTnXeXPvg8g9vpw/m5B/fGXVmlgkTQAWUNxrbG/gaEl7\nDZntWOD+iNgdmA+ckab/Cfgo8M8dVv0F4J0RsQewhyTfy8zMrEd0exPKsToQWNa6r5ikxcBs4La2\neWYDJ6fh8ygSERHxCHClpN3bVyhpMrBdRLTuAvA14AhgSV1B9Kqcz6Uf7TqWpsuhj2VgYIDpB0zv\n+Nqah9aw7fbbjrj81ClTWXJ+M7+2OX/3qlB3YpkC3NU2vpIi2XScJyIel/SApJ0i4v4R1rlyyDqn\nVNReM+vSuvXrhr3ItZsdg4EFY7/zgvW2uhNLp3vAxCjzqMM8m7rODebOncu0adMAmDRpEjNmzNiw\np9E6s6Op461pvdKeTR1v7bW3foDax3ecvuOIrzd9fMfpO7J+7fqNfoA3dX29vHw322/NQ2sa+/k9\n9NBDe6o9Zcf7+vpYtGgRwIbfyzJUPHm4HpIOAj4eEbPS+IkUt4I5vW2eH6Z5rpY0EbgnInZue/0Y\nYP+IeF8anwxcERHT0/gc4JCIeE+H948647Oxm37A9BFv6TKaK4+/koPPPLixy/dCG8Z7+dULVtO/\ntH/My1t9JBERY745ZN1nhS0FdpO0q6StgDnABUPmuRA4Jg0fCVzeYT0bAoyI1cBDkg5UcVfMtwM/\nqLzlDZDzufQ59EGMxPE1W87fvSrUWgpLfSbzgEspktjCiOiXdAqwNCIuAhYC50paBtxHkXwAkPQb\nYDtgK0mzgddExG3AccAiYGvg4oi4pM44zMyse3X3sZB+9PccMu3ktuFHgaOGWfa5w0z/BbBvhc1s\npJzPSsn5jDBwfE2X83evCr7y3szMKuXE0mA513lzr9E7vmbL+btXBScWMzOrlBNLg+Vc5829Ru/4\nmi3n714VnFjMzKxSTiwNlnOdN/caveNrtpy/e1VwYjEzs0o5sTRYznXe3Gv0jq/Zcv7uVcGJxczM\nKuXE0mA513lzr9E7vmbL+btXBScWMzOrlBNLg+Vc5829Ru/4mi3n714VnFjMzKxSTiwNlnOdN/ca\nve
Nrtpy/e1VwYjEzs0o5sTRYznXe3Gv0jq/Zcv7uVcGJxczMKuXE0mA513lzr9E7vmbL+btXBScW\nMzOrlBNLg+Vc5829Ru/4mi3n714VnFjMzKxSTiwNlnOdN/caveNrtpy/e1VwYjEzs0o5sTRYznXe\n3Gv0jq/Zcv7uVWGL8W6ANdPMI2YysGpgzMsPrBxgMpMrbJGZ9Qonlgbr6+sbtz2ngVUDTJ439sSw\n/PjlI74+2D+Y9V5v7n0QuW+/8fzuNYFLYWZmViknlgbLeY8p571dcHxNl/N3rwq1JxZJsyTdJukO\nSSd0eH0rSYslLZN0laSpba+dlKb3S3pN2/QVkm6UdL2ka+qOwczMuldrYpE0AVgAzAT2Bo6WtNeQ\n2Y4F7o+I3YH5wBlp2RcARwHTgdcCn5ektMx64NCIeHFEHFhnDL0s53Ppnwx9EDnLPb6cv3tVqLvz\n/kBgWUTcCSBpMTAbuK1tntnAyWn4POBzafgNwOKIWAeskLQsre9qQLiMZ9ZoAwMDTD9geql1TJ0y\nlSXnL6moRVaVuhPLFOCutvGVFMmh4zwR8bikByXtlKZf1TbfqjQNIIAlkgI4OyK+VEfje13Odd7c\na/SOD9atX1fqzEKAgQVjP+W9jJy/e1WoO7Gow7Tocp6Rlj04IlZL+gvgMkn9EfGzEu00M7OK1J1Y\nVgJT28Z3Ae4eMs9dwHOAuyVNBHaIiEFJK9P0P1s2Ilanf38v6fsUR0EdE8vcuXOZNm0aAJMmTWLG\njBkb9jZaddKmjs+fP39c42nV0Vt7p1WOt9fo61j/eI8P9g+yfu36ja732NT19fLy3Wy/su8/2D/I\nmofWbHifzfn5b+9j6ZXfg7LxLFq0CGDD72UZihh6AFGdlChuB14F3ANcAxwdEf1t8xwH7BMRx0ma\nAxwREXNS5/03gJdQlMAuA3YHtgEmRMQfJG0LXAqcEhGXdnj/qDO+8TaeF2lNP2B6qTLGlcdfycFn\nHjzs66NdYDfa8mXfv+7lB/sH6V/Y3+gYRlq+mwsky74/wOoFq+lf2j/6jBXL/QJJSUREp6pRV2o9\nYkl9JvMofvwnAAsjol/SKcDSiLgIWAicmzrn7wPmpGVvlfTfwK3AY8BxERGSngl8P/WvbAF8o1NS\neTLI+YPtPohmyz2+nL97Vaj9li4RcQmw55BpJ7cNP0pxWnGnZU8FTh0y7TfAjOpbamZmVfApuw2W\n87n0uV8H4fiaLefvXhWcWMzMrFJOLA2Wc5039xq942u2nL97VXBiMTOzSjmxNFjOdd7ca/SOr9ly\n/u5VwYnwlcp/AAAJwUlEQVTFzMwq5cTSYDnXeXOv0Tu+Zsv5u1cFJxYzM6uUn3nfYGO9rcTMI2Yy\nsKrcXWEHVg4wmXJ3ph1J7s9Mz70PIvftl/stXcpyYnkSGlg1UPp25cuPX15Ra8wsNy6FNVjOe0w5\n7+2C42u6nL97VXBiMTOzSrkU1mA513lzr9G7j6UaZR9vPNZHG+f83auCE4uZNVbZxxuP16ONc+dS\nWIPlvMeU89EKOL6my/m7VwUnFjMzq5QTS4PlfL+iJ0MfRM5yjy/n714VnFjMzKxSTiwNlnOdN/ca\nveNrtpy/e1VwYjEzs0o5sTRYznXe3Gv0jq/Zcv7uVcGJxczMKuXE0mA513lzr9E7vmbL+btXBV95\nb2ZPWuN1S5jcObE0WM73K/K9wpqtKdtvrLeEacXnW8J05sTSQK0Hda15aA3bbr/tJi9f90O6zOzJ\nzYmlgco+qKsJD+lqwt5uGY6v2XKPryx33puZWaVqTyySZkm6TdIdkk7o8PpWkhZLWibpKklT2147\nKU3vl/Sabtf5ZJFznT7n2MDxNV3u8ZVVa2KRNAFYAMwE9gaOlrTXkNmOBe6PiN2B+cAZadkXAEcB\n04HXAp9XoZt1Pin8YeAP492E2uQcGzi+pss9vrLq7mM5EFgWEXcC
SFoMzAZua5tnNnByGj4P+Fwa\nfgOwOCLWASskLUvrUxfr7GmtzvexanW+r3tkXYWt6i05xwaOr+lyj6+suhPLFOCutvGVFMmh4zwR\n8bikByXtlKZf1TbfqjRNXayzpz0ZOt/Nngx8HUxndScWdZgWXc4z3PRO5buh6+woInjpIS9l7eNr\nu5m9o4kTJvLU7Z/KPb+7Z8zrqOp03z/d+6fS6+hVOccGjq/pWvGVfTTyzz70sywTkyK6+k0e28ql\ng4CPR8SsNH4iEBFxets8P0zzXC1pInBPROw8dF5Jl1CUzDTaOtvWXV9wZmYZi4hOO/ddqfuIZSmw\nm6RdgXuAOcDRQ+a5EDgGuBo4Erg8Tb8A+Iakz1CUwHYDrqE4YhltnUC5/xgzMxubWhNL6jOZB1xK\nkRAWRkS/pFOApRFxEbAQODd1zt9HkSiIiFsl/TdwK/AYcFwUh1cd11lnHGZm1r1aS2FmZvbkk9WV\n95ImSLpO0gVpfJqkn0u6XdK3JDX2FjaSdpD0nXSx6C2SXiJpR0mXpviWSNphvNs5VpL+SdIvJd0k\n6RvpwtnGbj9JCyX9VtJNbdOG3V6SPpsuBr5B0ozxaXX3honvjPT5vEHSdyVt3/Zax4ude1Gn2Npe\nO17S+nTmamta47ddmv6P6cLzmyWd1jZ9k7ddVokFeD9F6azldOCsiNgTeIDiYsym+g/g4oiYDryI\n4rqdE4EfpfguB04ax/aNmaRnA/8I7BcRL6Qo0R5Ns7ffVyku4m3XcXtJei3w/HSR8LuBL27Oho5R\np/guBfaOiBnAMp6Ir+PFzpuxrZuqU2xI2gV4NXBn27Qstp2kQ4HDgX0iYl/gzDR9OmPYdtkklrTR\nXwd8uW3yK4HvpuFzgL/Z3O2qgqTtgJdFxFcBImJdRDxIcWHoOWm2c4AjxqmJVZgIbJuOSrYB7gZe\nQUO3X0T8DBh634+h22t22/SvpeWuBnaQ9MzN0c6x6hRfRPwoItan0Z8Du6ThDRc7R8QKiqTTs9ee\nDbPtAD4D/MuQaVlsO+A9wGnpgnQi4t40fTZj2HbZJBae2OgBIOnpwGDbB30l8OxxaltZzwPulfTV\nVOo7W9JTgWdGxG8BImI18Bfj2soxioi7gbOAAYoLYR8ErgMeyGT7tew8ZHvtnKYPvZC4dTFwk/09\ncHEabnx8kg4H7oqIm4e81PjYkj2Al6fS8xWS9k/TxxRfFolF0uuB30bEDTxxYaXahluaeqbCFsB+\nwH9GxH7AGoqySlPj2YikSRR7RrtSJI9tKQ67h8oi3g66uZC4MSR9BHgsIr7VmtRhtsbEJ2kb4CM8\nceupjV7uMK0xsbXZApgUEQcBHwK+k6aPKb4sEgvwV8AbJC0HvkVRAptPcVjainEXivJKE62k2Fu6\nNo1/lyLR/LZ12C1pMvC7cWpfWa8GlkfE/RHxOPB94GBgUibbr2W47bUSeE7bfI2NVdIxFCXpt7RN\nbnp8zwemATdK+g1F+6+TtDPNj63lLuB7ABGxlOKyjqdTxDe1bb6u4ssisUTEhyNiakQ8j+I6mMsj\n4m3AFRQXXUJxEeYPxquNZaTyyV2S9kiTXgXcQnER6dw0rbHxUZTADpK0deoYbMXX9O039Ki5fXvN\n5Yl4LgDeDhvuVvFAq2TW4zaKT9Isir3dN0TEo23zXQDMSWf6PZcnLnbuZRtii4hfRsTkiHheRDyX\n4sf2xRHxOzLZdsD5FN870u/MVhFxH0V8b97kbRcRWf0BhwAXpOHnUlzRfwfwbWDL8W5fibheRHEn\ngxso9ix2AHYCfgTcDlxGcSg77m0dY3wnA/3ATRQd21s2efsB36TYs3uUInG+A9hxuO1F8SiIXwE3\nUpwdN+4xjCG+ZRRnTF2X/j7fNv9JKb5+4DXj3f5NjW3I68uBnTLbdlsA5wI3A9cCh5TZdr5A0szM\nKpVFKczMzHqHE4uZmVXKicXM
zCrlxGJmZpVyYjEzs0o5sZiZWaWcWMzGQNJH0m3+b0z3bzughvdo\n5N2qzXwdi9kmSldYn0VxEdm69GyOraK4sWRV7zEBeDAitqtqnWabi49YzDbds4B744lbjN8fEasl\n/UbSJyVdL+kaSS+WdEl6SNK7ASRtK+lHkq5NRztvSNN3TQ9ZOkfSzRSPf9gmHQ2dK+mpki5K675J\n0pHDts5snPmIxWwTSdoW+BnFc2N+DHw7In6ablB4akScLenTFDdDPRh4KnBLRDxT0kRgm4j4Q7rJ\n388jYndJuwK/Bv5PFDcBRNJDEbF9Gn4jMDMiWglqu4h4ePNGbtYdH7GYbaKIWENxd+l3Ab8HFqe7\n+gZwYZrtZuDqiHgkiocm/TE9qlfAqZJupLhv2LPTXXIB7mwllQ5uBl4t6VRJL3VSsV7WmGeIm/WS\nKA71fwr8NJWujkkvte7qu75tuDW+BfBW4BkUd8ddn45ytk7zrBnyNhvuPhsRy9LDl14HfELSjyLi\nE1XGZFYVH7GYbSJJe0jarW3SDGDFaIulf3cAfpeSyisoHm42dJ6Wtal0hqRnAX+MiG8Cn6I4YjLr\nST5iMdt0TwM+J2kHYB3FLcXfBRw2wjKtzsxvABemUti1FLciHzpPy9nAzZJ+QXFL809JWg+spXhG\nuVlPcue9mZlVyqUwMzOrlBOLmZlVyonFzMwq5cRiZmaVcmIxM7NKObGYmVmlnFjMzKxSTixmZlap\n/w+pONZ5pdO0sgAAAABJRU5ErkJggg==\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x7f8f616e9d10>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"num_bins = 20\n",
"# the histogram of the data; density=True rescales the bars so their total area is 1\n",
"n, bin_edges, patches = plt.hist(x, num_bins, density=True, facecolor='green', alpha=0.75)\n",
"\n",
"plt.xlabel('Smarts')\n",
"plt.ylabel('Probability')\n",
"plt.title(r'$\\mathrm{Histogram\\ of\\ IQ:}\\ \\mu=100,\\ \\sigma=15$')\n",
"plt.axis([40, 160, 0, 0.03])\n",
"plt.grid(True)\n",
"\n",
"plt.show()"
]
},
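{
"cell_type": "markdown",
"metadata": {},
"source": [
"The histogram above only draws the bins; it does not replace the values themselves. As a minimal sketch of the transformation step, the cell below uses `pd.cut` to swap each simulated IQ for a bucket label. The bucket edges, the label names, and the fixed random seed are assumptions chosen for illustration, not part of the original walkthrough."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Simulated IQ scores, mirroring the data above (seeded here so the run is repeatable)\n",
"rng = np.random.RandomState(0)\n",
"iq = 100 + 15 * rng.randn(10000)\n",
"\n",
"# Hypothetical bucket edges and labels, chosen only for illustration\n",
"edges = [-np.inf, 70, 85, 100, 115, 130, np.inf]\n",
"labels = ['<70', '70-85', '85-100', '100-115', '115-130', '130+']\n",
"\n",
"# pd.cut replaces each continuous value with the label of the interval it falls in\n",
"iq_bucket = pd.cut(iq, bins=edges, labels=labels)\n",
"\n",
"# Count how many simulated people land in each bucket, in bucket order\n",
"bucket_counts = pd.Series(iq_bucket).value_counts().sort_index()\n",
"bucket_counts"
]
},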
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Normalization\n",
"\n",
"This type of transformation simply means converting values into a 'normalized' range. This is a fairly general concept that helps most with algorithms that depend on the magnitude of the data passed in. It's important to note that this is **not** necessary for all algorithms. [This Quora answer](https://www.quora.com/In-data-mining-and-statistical-data-analysis-when-do-I-need-to-normalize-data-statistical-normalization-and-why-is-it-important-to-do-so) discusses in more detail what kinds of problems benefit from normalization."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example: Clustering\n",
"One of the more apparent use cases for normalization comes up with clustering.\n"
]
},
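{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the clustering point concrete, here is a minimal sketch with made-up numbers (the ages and incomes are hypothetical, not from any dataset). Distance-based clustering such as k-means compares points by Euclidean distance, so a feature measured in large units can drown out one measured in small units. The cell applies min-max normalization, one common choice among several, and compares distances before and after."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical users: column 0 is age in years, column 1 is income in dollars\n",
"data = np.array([[25.0, 40000.0],\n",
"                 [30.0, 42000.0],\n",
"                 [60.0, 41000.0]])\n",
"\n",
"def euclidean(a, b):\n",
"    # Straight-line distance between two points\n",
"    return np.sqrt(np.sum((a - b) ** 2))\n",
"\n",
"# On raw values income dominates, so the 60-year-old looks closer\n",
"# to user 0 than the 30-year-old does\n",
"raw_d01 = euclidean(data[0], data[1])\n",
"raw_d02 = euclidean(data[0], data[2])\n",
"\n",
"# Min-max normalization: rescale each column to the range [0, 1]\n",
"col_min = data.min(axis=0)\n",
"col_max = data.max(axis=0)\n",
"scaled = (data - col_min) / (col_max - col_min)\n",
"\n",
"# After scaling, both features contribute on the same footing\n",
"scaled_d01 = euclidean(scaled[0], scaled[1])\n",
"scaled_d02 = euclidean(scaled[0], scaled[2])\n",
"\n",
"print(raw_d01 > raw_d02, scaled_d01 < scaled_d02)"
]
},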
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Other mathematical transformations\n",
"There is a wide variety of other transformations available, and you should do your due diligence and look into other techniques. A good starting place is [the Wikipedia page](https://en.wikipedia.org/wiki/Data_transformation_(statistics)).\n",
"\n",
"Given our limited time, however, we will only cover the log transformation."
]
},
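{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the log transformation, assume a made-up right-skewed feature (the `amounts` values are hypothetical). Taking the logarithm compresses large values, so steps that multiply by a constant factor become equally spaced. `np.log1p`, which computes log(1 + x), is one way to keep the result defined when a value is 0."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Hypothetical right-skewed feature: each value is 10x the previous one\n",
"amounts = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])\n",
"\n",
"# log10 turns the multiplicative steps into equal additive steps\n",
"log_amounts = np.log10(amounts)\n",
"\n",
"# log1p computes log(1 + x), so an input of 0 maps to 0 instead of -inf\n",
"log1p_amounts = np.log1p(np.array([0.0, 9.0, 99.0]))\n",
"\n",
"log_amounts"
]
},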
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Categorical Data: One Hot Encoding\n",
"\n",
"When dealing with categorical data, you need to feed your classifier data that is numerical. One technique is to transform a single column with `x` categories into `x` columns. Whether or not a datapoint belongs to one of the `x` categories is represented by a 1 or a 0, respectively, in the corresponding new column for that category."
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Gender</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Male</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Female</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Not specified</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Not specified</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Female</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Gender\n",
"0 Male\n",
"1 Female\n",
"3 Not specified\n",
"4 Not specified\n",
"5 Female"
]
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame( {'Gender': {0: 'Male', 1: 'Female', 3: 'Not specified', 4: 'Not specified', 5: 'Female'}})\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we have 3 categories in our example: Male, Female, and Not specified. To better satisfy some angsty classifiers, we will use 3 columns instead of the single category column. The first thing we do is give each category a number."
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"{'Female': 0, 'Male': 1, 'Not specified': 2}"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"unique_values = np.unique(df['Gender'])\n",
"gender_i_map = {category : i for i, category in enumerate(unique_values)}\n",
"gender_i_map"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Gender</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Gender\n",
"0 1\n",
"1 0\n",
"3 2\n",
"4 2\n",
"5 0"
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = df.replace({'Gender' : gender_i_map})\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can use scikit-learn's `OneHotEncoder` class and get to work."
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.preprocessing import OneHotEncoder as OHE\n",
"\n",
"enc = OHE()\n",
"# enc.transform(['Male'])\n",
"# a pandas Series has no reshape method, so reshape the underlying NumPy array\n",
"enc.fit(df['Gender'].values.reshape(-1, 1))\n",
"# check number of values found\n",
"len(enc.categories_[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sweet, three values found. Now let's transform the table with our fitted OHE."
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Female</th>\n",
" <th>Male</th>\n",
" <th>Not specified</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Female Male Not specified\n",
"0 0 1 0\n",
"1 1 0 0\n",
"2 0 0 1\n",
"3 0 0 1\n",
"4 1 0 0"
]
},
"execution_count": 80,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# transform simply takes the columnar Gender data, then returns the OneHotEncoded matrix\n",
"one = enc.transform(df['Gender'].values.reshape(-1, 1)).toarray()\n",
"# cast the 0.0/1.0 floats to plain integers before building the table\n",
"df = pd.DataFrame(one.astype(int), columns=unique_values)\n",
"df"
]
},
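{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, pandas offers a shortcut that skips the intermediate integer mapping. The sketch below rebuilds the same toy data under a new, hypothetical name (`genders`) and calls `pd.get_dummies`. It is an alternative to the encoder used above, not a replacement for it: a fitted `OneHotEncoder` can be reused on new data, while `get_dummies` only sees the categories present in the frame it is given."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Same toy data as above, rebuilt under a hypothetical name\n",
"genders = pd.DataFrame({'Gender': ['Male', 'Female', 'Not specified', 'Not specified', 'Female']})\n",
"\n",
"# get_dummies creates one indicator column per distinct category, in sorted order;\n",
"# astype(int) makes the indicators plain 0/1 integers regardless of pandas version\n",
"one_hot = pd.get_dummies(genders['Gender']).astype(int)\n",
"one_hot"
]
},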
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 0
}