A Beginner's Guide to Machine Learning with Scikit-Learn: Preprocessing
{
"metadata": {
"name": ""
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Preprocessing in Scikit-Learn\n", | |
"\n", | |
"The first step to any data analysis workflow is getting your data into a usable format. The estimators in scikit-learn have very specific requirements for what they'll take in. This notebook will cover different ways of preprocessing your data into a more scikit-learn-friendly format.\n", | |
"\n", | |
"### Getting Data into Scikit-Learn\n", | |
"\n", | |
"One of the great things about scikit-learn is that it comes with several toy datasets. If all you want to is play around with certain algorithms, these built-in datasets are a good way to do that. Here's an example of the classic iris dataset." | |
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from sklearn import datasets\n",
"\n",
"iris = datasets.load_iris()\n",
"print iris.data[0:10]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[[ 5.1 3.5 1.4 0.2]\n",
" [ 4.9 3. 1.4 0.2]\n",
" [ 4.7 3.2 1.3 0.2]\n",
" [ 4.6 3.1 1.5 0.2]\n",
" [ 5. 3.6 1.4 0.2]\n",
" [ 5.4 3.9 1.7 0.4]\n",
" [ 4.6 3.4 1.4 0.3]\n",
" [ 5. 3.4 1.5 0.2]\n",
" [ 4.4 2.9 1.4 0.2]\n",
" [ 4.9 3.1 1.5 0.1]]\n"
]
}
],
"prompt_number": 4
},
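{
"cell_type": "markdown",
"metadata": {},
"source": [
"The other built-in loaders work the same way. For example, here's the handwritten digits dataset; each loader returns an object with data and target attributes."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Same pattern as load_iris(): .data holds the features, .target the labels\n",
"digits = datasets.load_digits()\n",
"print digits.data.shape\n",
"print digits.target[0:10]"
],
"language": "python",
"metadata": {},
"outputs": []
},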
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Scikit-learn also comes with the handwritten digits dataset, the diabetes dataset, the Boston house prices dataset, and sample images.\n",
"\n",
"A really important thing to know is that scikit-learn estimators **only take in continuous data in the form of NumPy arrays**. If your data is already continuous, this isn't a problem. NumPy's loadtxt() function can read in a CSV file and convert it to an array with ease. For example, here are both the first five data instances and their labels from the glass identification dataset."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"\n",
"glass_data = np.loadtxt('../data/glass_data.csv', delimiter=',')\n",
"glass_target = np.loadtxt('../data/glass_target.csv')\n",
"print glass_data[0:5], glass_target[0:5]"
],
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"[[ 1.52101000e+00 1.36400000e+01 4.49000000e+00 1.10000000e+00\n", | |
" 7.17800000e+01 6.00000000e-02 8.75000000e+00 0.00000000e+00\n", | |
" 0.00000000e+00]\n", | |
" [ 1.51761000e+00 1.38900000e+01 3.60000000e+00 1.36000000e+00\n", | |
" 7.27300000e+01 4.80000000e-01 7.83000000e+00 0.00000000e+00\n", | |
" 0.00000000e+00]\n", | |
" [ 1.51618000e+00 1.35300000e+01 3.55000000e+00 1.54000000e+00\n", | |
" 7.29900000e+01 3.90000000e-01 7.78000000e+00 0.00000000e+00\n", | |
" 0.00000000e+00]\n", | |
" [ 1.51766000e+00 1.32100000e+01 3.69000000e+00 1.29000000e+00\n", | |
" 7.26100000e+01 5.70000000e-01 8.22000000e+00 0.00000000e+00\n", | |
" 0.00000000e+00]\n", | |
" [ 1.51742000e+00 1.32700000e+01 3.62000000e+00 1.24000000e+00\n", | |
" 7.30800000e+01 5.50000000e-01 8.07000000e+00 0.00000000e+00\n", | |
" 0.00000000e+00]] [ 1. 1. 1. 1. 1.]\n" | |
] | |
} | |
], | |
"prompt_number": 8 | |
}, | |
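{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check after loading: the shape and dtype of the arrays tell you whether loadtxt() parsed the files the way you expected (this assumes the CSVs above loaded successfully)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# loadtxt() returns a float64 array by default\n",
"print glass_data.shape, glass_data.dtype\n",
"print glass_target.shape, glass_target.dtype"
],
"language": "python",
"metadata": {},
"outputs": []
},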
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And you're ready to go. With categorical data, however, it's a bit more complicated.\n",
"\n",
"In my presentation and in the sklearn_workflow notebook, I use the car evaluation dataset from the UCI machine learning repository. It's a nice dataset to work with because it's simple and well suited to classification, but all of its values are categorical. This means I have to transform those categorical values into continuous ones.\n",
"\n",
"One of the easiest ways I've found to import categorical data is to read it in from a CSV file and put it into a list of dictionaries, which scikit-learn can easily encode into 1s and 0s. The target variable simply gets read into a list and is encoded later."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import csv\n",
"\n",
"car_data = list(csv.DictReader(open('../data/cardata.csv', 'rU')))\n",
"car_target = list(csv.reader(open('../data/cartarget.csv', 'rU')))"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's what one of the dictionaries in the list looks like (the eleventh, car_data[10]):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"car_data[10]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 2,
"text": [
"{'buying': 'vhigh',\n",
" 'doors': '2',\n",
" 'lug_boot': 'small',\n",
" 'maint': 'vhigh',\n",
" 'persons': '4',\n",
" 'safety': 'med'}"
]
}
],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To vectorize our categorical values, create a DictVectorizer() object, then use fit_transform() and toarray() to one-hot encode the values into a NumPy array."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from sklearn.feature_extraction import DictVectorizer\n",
"\n",
"vec = DictVectorizer()\n",
"car_data = vec.fit_transform(car_data).toarray()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 3
},
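{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see which column of the array corresponds to which feature=value pair, the fitted DictVectorizer exposes a get_feature_names() method:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# One name per (feature, value) pair, in the same order as the array columns\n",
"print vec.get_feature_names()"
],
"language": "python",
"metadata": {},
"outputs": []
},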
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's a vectorized item, with its unencoded form (recovered via inverse_transform()) beneath it."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print 'Vectorized:', car_data[10]\n",
"print 'Unvectorized:', vec.inverse_transform(car_data[10])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Vectorized: [ 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0.\n",
" 0. 0. 1.]\n",
"Unvectorized: [{'persons=4': 1.0, 'buying=vhigh': 1.0, 'safety=med': 1.0, 'lug_boot=small': 1.0, 'doors=2': 1.0, 'maint=vhigh': 1.0}]\n"
]
}
],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the labels are also categorical, they need to be transformed as well. There's a LabelEncoder() object designed specifically for this task."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from sklearn import preprocessing\n",
"\n",
"le = preprocessing.LabelEncoder()\n",
"le.fit([\"unacc\", \"acc\", \"good\", \"vgood\"])\n",
"target = le.transform(car_target[0])"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 5
},
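{
"cell_type": "markdown",
"metadata": {},
"source": [
"The fitted encoder stores its label-to-integer mapping in the classes_ attribute; each label is encoded as its index in that (alphabetically sorted) array."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# 'acc' -> 0, 'good' -> 1, 'unacc' -> 2, 'vgood' -> 3\n",
"print le.classes_"
],
"language": "python",
"metadata": {},
"outputs": []
},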
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's one of the transformed labels and the original value it maps back to."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print 'Transformed:', target[10]\n",
"print 'Inverse transformed:', le.inverse_transform(target[10])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Transformed: 2\n",
"Inverse transformed: unacc\n"
]
}
],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Splitting Up the Dataset\n",
"\n",
"Another preprocessing step is to split the dataset into training and test sets, so you can evaluate your model on data it hasn't seen and guard against overfitting. The train_test_split() function is a really simple way to do that. By default, the test set gets 25% of the data."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from sklearn.cross_validation import train_test_split\n",
"\n",
"car_data_train, car_data_test, target_train, target_test = train_test_split(car_data, target)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The full dataset contains 1728 instances. After train_test_split():"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print 'Training set:', len(car_data_train)\n",
"print 'Test set:', len(car_data_test)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Training set: 1296\n",
"Test set: 432\n"
]
}
],
"prompt_number": 8
},
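{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want a different split, train_test_split() takes a test_size argument (a float between 0 and 1); passing random_state makes the split reproducible."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Hold out 20% of the data instead of the default 25%\n",
"car_data_train, car_data_test, target_train, target_test = train_test_split(\n",
"    car_data, target, test_size=0.2, random_state=42)\n",
"\n",
"print 'Training set:', len(car_data_train)\n",
"print 'Test set:', len(car_data_test)"
],
"language": "python",
"metadata": {},
"outputs": []
}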
],
"metadata": {}
}
] | |
} |