A Beginner's Guide to Machine Learning with Scikit-Learn: Preprocessing
{
"metadata": {
"name": ""
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Preprocessing in Scikit-Learn\n",
"\n",
"The first step to any data analysis workflow is getting your data into a usable format. The estimators in scikit-learn have very specific requirements for what they'll take in. This notebook will cover different ways of preprocessing your data into a more scikit-learn-friendly format.\n",
"\n",
"### Getting Data into Scikit-Learn\n",
"\n",
"One of the great things about scikit-learn is that it comes with several toy datasets. If all you want to is play around with certain algorithms, these built-in datasets are a good way to do that. Here's an example of the classic iris dataset."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from sklearn import datasets\n",
"\n",
"iris = datasets.load_iris()\n",
"print iris.data[0:10]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[[ 5.1 3.5 1.4 0.2]\n",
" [ 4.9 3. 1.4 0.2]\n",
" [ 4.7 3.2 1.3 0.2]\n",
" [ 4.6 3.1 1.5 0.2]\n",
" [ 5. 3.6 1.4 0.2]\n",
" [ 5.4 3.9 1.7 0.4]\n",
" [ 4.6 3.4 1.4 0.3]\n",
" [ 5. 3.4 1.5 0.2]\n",
" [ 4.4 2.9 1.4 0.2]\n",
" [ 4.9 3.1 1.5 0.1]]\n"
]
}
],
"prompt_number": 4
},
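{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each built-in dataset is an object that holds the features in .data and the class labels in .target. A quick look at the iris labels:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# The labels live alongside the features on the same object.\n",
"print(iris.target[0:10])\n",
"print(iris.target_names)"
],
"language": "python",
"metadata": {},
"outputs": []
},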
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Scikit-learn also comes with the handwritten digits dataset, the diabetes dataset, a house prices dataset, and sample images.\n",
"\n",
"A really important thing to know is that scikit-learn estimators **only take in continuous data in the form of NumPy arrays**. If your data is already continuous, this isn't a problem. There's a function in NumPy called loadtxt() that can read in a CSV file and convert it to an array with ease. For example, here's both the first five data instances and their labels from the glass identification dataset."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from sklearn import preprocessing\n",
"import numpy as np\n",
"\n",
"glass_data = np.loadtxt('../data/glass_data.csv', delimiter=',')\n",
"glass_target = np.loadtxt('../data/glass_target.csv')\n",
"print glass_data[0:5], glass_target[0:5]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[[ 1.52101000e+00 1.36400000e+01 4.49000000e+00 1.10000000e+00\n",
" 7.17800000e+01 6.00000000e-02 8.75000000e+00 0.00000000e+00\n",
" 0.00000000e+00]\n",
" [ 1.51761000e+00 1.38900000e+01 3.60000000e+00 1.36000000e+00\n",
" 7.27300000e+01 4.80000000e-01 7.83000000e+00 0.00000000e+00\n",
" 0.00000000e+00]\n",
" [ 1.51618000e+00 1.35300000e+01 3.55000000e+00 1.54000000e+00\n",
" 7.29900000e+01 3.90000000e-01 7.78000000e+00 0.00000000e+00\n",
" 0.00000000e+00]\n",
" [ 1.51766000e+00 1.32100000e+01 3.69000000e+00 1.29000000e+00\n",
" 7.26100000e+01 5.70000000e-01 8.22000000e+00 0.00000000e+00\n",
" 0.00000000e+00]\n",
" [ 1.51742000e+00 1.32700000e+01 3.62000000e+00 1.24000000e+00\n",
" 7.30800000e+01 5.50000000e-01 8.07000000e+00 0.00000000e+00\n",
" 0.00000000e+00]] [ 1. 1. 1. 1. 1.]\n"
]
}
],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And you're ready to go. However, with categorical data, it's a bit more complicated.\n",
"\n",
"In my presentation and in the sklearn_workflow notebook, I use the car evaluation dataset from the UCI machine learning repository. This is a great dataset to work with because it's simple and is great for classification, but all of the values in the dataset are categorical. This means that I have to transform these categorical values into continuous ones.\n",
"\n",
"One of the easiest ways I've found for importing categorical data is to read in a file from a csv and put it into a list of dictionaries, which can easily be encoded into 1s and 0s in scikit-learn. For the target variables, that simply gets read into a list and is then encoded."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import csv\n",
"\n",
"car_data = list(csv.DictReader(open('../data/cardata.csv', 'rU')))\n",
"car_target = list(csv.reader(open('../data/cartarget.csv', 'rU')))"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's what the first dictionary in the list looks like:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"car_data[10]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 2,
"text": [
"{'buying': 'vhigh',\n",
" 'doors': '2',\n",
" 'lug_boot': 'small',\n",
" 'maint': 'vhigh',\n",
" 'persons': '4',\n",
" 'safety': 'med'}"
]
}
],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first step in vectorizing our categorical values is to create a DictVectorizer() object and then use fit_transform() and toarray() to get the values into a NumPy array."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from sklearn.feature_extraction import DictVectorizer\n",
"\n",
"vec = DictVectorizer()\n",
"car_data = vec.fit_transform(car_data).toarray()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's a vectorized item and the unencoded item beneath."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print 'Vectorized:', car_data[10]\n",
"print 'Unvectorized:', vec.inverse_transform(car_data[10])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Vectorized: [ 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0.\n",
" 0. 0. 1.]\n",
"Unvectorized: [{'persons=4': 1.0, 'buying=vhigh': 1.0, 'safety=med': 1.0, 'lug_boot=small': 1.0, 'doors=2': 1.0, 'maint=vhigh': 1.0}]\n"
]
}
],
"prompt_number": 4
},
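{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see which column corresponds to which feature=value pair, the fitted vectorizer also has a get_feature_names() method. A quick sketch:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Column i of the vectorized array corresponds to feature name i.\n",
"print(vec.get_feature_names()[0:5])"
],
"language": "python",
"metadata": {},
"outputs": []
},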
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the labels are also categorical, those need to be transformed as well. There's a special LabelEncoder() object specifically for this task."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from sklearn import preprocessing\n",
"\n",
"le = preprocessing.LabelEncoder()\n",
"le.fit([\"unacc\", \"acc\", \"good\", \"vgood\"])\n",
"target = le.transform(car_target[0])"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's the transformed label and what it means."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print 'Transformed:', target[10] \n",
"print 'Inverse transformed:', le.inverse_transform(target[10])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Transformed: 2\n",
"Inverse transformed: unacc\n"
]
}
],
"prompt_number": 6
},
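{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that LabelEncoder() assigns integers to the labels in sorted order, which is why 'unacc' maps to 2. The fitted encoder stores the mapping in its classes_ attribute:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# classes_ holds the labels in sorted order; the index of each\n",
"# label is the integer it gets encoded to.\n",
"print(le.classes_)"
],
"language": "python",
"metadata": {},
"outputs": []
},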
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Splitting Up the Dataset\n",
"\n",
"Another preprocessing step is to split up the dataset, in order to avoid overfitting. The train_test_split() function is a really simple way to do that. By default, the size of the test set is 25%."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from sklearn.cross_validation import train_test_split\n",
"\n",
"car_data_train, car_data_test, target_train, target_test = train_test_split(car_data, target)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The length of the whole data set is 1728 instances. After train_test_split():"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print 'Training set:', len(car_data_train)\n",
"print 'Test set:', len(car_data_test)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Training set: 1296\n",
"Test set: 432\n"
]
}
],
"prompt_number": 8
},
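{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want a different split, or a reproducible one, train_test_split() also takes test_size and random_state arguments. A minimal sketch (the variable names here are just for illustration):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# An explicit 70/30 split, with a fixed seed so the split is the\n",
"# same every time the notebook is run.\n",
"d_train, d_test, t_train, t_test = train_test_split(car_data, target, test_size=0.3, random_state=42)\n",
"print('Training: %d, test: %d' % (len(d_train), len(d_test)))"
],
"language": "python",
"metadata": {},
"outputs": []
}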
],
"metadata": {}
}
]
}