Created
November 6, 2015 17:15
Data preparation with pandas and numpy
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Data preparation with pandas and numpy" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Introduction:\n", | |
"\n", | |
"Hi! This is my Practice Notebook #5, where I share my practice of Data Science with Python, starting from the very basics.\n", | |
"\n", | |
"You can find useful information about Python and Data Science [here](https://plus.google.com/b/112453425797644541537/112453425797644541537/posts).\n", | |
"\n", | |
"Here I'm going to use the dataset from the [Titanic Kaggle Competition](https://www.kaggle.com/c/titanic/data).\n", | |
"\n", | |
"In this notebook I want to describe some essential data preparation steps you need to take before you can actually start to analyse the data with Machine Learning algorithms." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Using Machine Learning algorithms in Python is very easy with the [scikit-learn package](http://scikit-learn.org/stable/index.html).\n", | |
"\n", | |
"Basically, just a few lines of code are required to build a model, like in the example below:\n", | |
"\n", | |
"##### Create the k-nearest neighbors classifier\n", | |
"knn = KNeighborsClassifier(n_neighbors = 3)\n", | |
"##### Fit the data\n", | |
"knn.fit(X_train,Y_train)\n", | |
"##### Run a prediction\n", | |
"Y_pred = knn.predict(X_test)\n", | |
"##### Check accuracy against the testing set\n", | |
"knn_result = metrics.accuracy_score(Y_test,Y_pred)\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## But in order to use scikit-learn you need to prepare your data first." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"\n", | |
"To use Machine Learning algorithms, your data should look like this:\n", | |
"\n", | |
"[x1, x2, ..., xn] [y]\n", | |
"\n", | |
"This means that each label y needs to have a set of features x1, x2, ..., xn.\n", | |
"\n", | |
"Let's look at an example." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<div>\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>PassengerId</th>\n", | |
" <th>Survived</th>\n", | |
" <th>Pclass</th>\n", | |
" <th>Name</th>\n", | |
" <th>Sex</th>\n", | |
" <th>Age</th>\n", | |
" <th>SibSp</th>\n", | |
" <th>Parch</th>\n", | |
" <th>Ticket</th>\n", | |
" <th>Fare</th>\n", | |
" <th>Cabin</th>\n", | |
" <th>Embarked</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>3</td>\n", | |
" <td>Braund, Mr. Owen Harris</td>\n", | |
" <td>male</td>\n", | |
" <td>22</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>A/5 21171</td>\n", | |
" <td>7.2500</td>\n", | |
" <td>NaN</td>\n", | |
" <td>S</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td>2</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n", | |
" <td>female</td>\n", | |
" <td>38</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>PC 17599</td>\n", | |
" <td>71.2833</td>\n", | |
" <td>C85</td>\n", | |
" <td>C</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td>3</td>\n", | |
" <td>1</td>\n", | |
" <td>3</td>\n", | |
" <td>Heikkinen, Miss. Laina</td>\n", | |
" <td>female</td>\n", | |
" <td>26</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>STON/O2. 3101282</td>\n", | |
" <td>7.9250</td>\n", | |
" <td>NaN</td>\n", | |
" <td>S</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td>4</td>\n", | |
" <td>1</td>\n", | |
" <td>1</td>\n", | |
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n", | |
" <td>female</td>\n", | |
" <td>35</td>\n", | |
" <td>1</td>\n", | |
" <td>0</td>\n", | |
" <td>113803</td>\n", | |
" <td>53.1000</td>\n", | |
" <td>C123</td>\n", | |
" <td>S</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td>5</td>\n", | |
" <td>0</td>\n", | |
" <td>3</td>\n", | |
" <td>Allen, Mr. William Henry</td>\n", | |
" <td>male</td>\n", | |
" <td>35</td>\n", | |
" <td>0</td>\n", | |
" <td>0</td>\n", | |
" <td>373450</td>\n", | |
" <td>8.0500</td>\n", | |
" <td>NaN</td>\n", | |
" <td>S</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"text/plain": [ | |
" PassengerId Survived Pclass \\\n", | |
"0 1 0 3 \n", | |
"1 2 1 1 \n", | |
"2 3 1 3 \n", | |
"3 4 1 1 \n", | |
"4 5 0 3 \n", | |
"\n", | |
" Name Sex Age SibSp \\\n", | |
"0 Braund, Mr. Owen Harris male 22 1 \n", | |
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 \n", | |
"2 Heikkinen, Miss. Laina female 26 0 \n", | |
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 \n", | |
"4 Allen, Mr. William Henry male 35 0 \n", | |
"\n", | |
" Parch Ticket Fare Cabin Embarked \n", | |
"0 0 A/5 21171 7.2500 NaN S \n", | |
"1 0 PC 17599 71.2833 C85 C \n", | |
"2 0 STON/O2. 3101282 7.9250 NaN S \n", | |
"3 0 113803 53.1000 C123 S \n", | |
"4 0 373450 8.0500 NaN S " | |
] | |
}, | |
"execution_count": 9, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"import pandas as pd\n", | |
"import numpy as np\n", | |
"\n", | |
"\n", | |
"# Let's look at an example where you need to read data from a CSV file.\n", | |
"# I will be using the data set from the famous Kaggle Titanic competition.\n", | |
"# Open the CSV file with pandas\n", | |
"\n", | |
"my_csv_file = pd.read_csv('Kaggle Titanic Dataset.csv')\n", | |
"\n", | |
"# Now let's look at what we have inside that CSV file\n", | |
"\n", | |
"my_csv_file.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"<class 'pandas.core.frame.DataFrame'>\n", | |
"Int64Index: 891 entries, 0 to 890\n", | |
"Data columns (total 12 columns):\n", | |
"PassengerId 891 non-null int64\n", | |
"Survived 891 non-null int64\n", | |
"Pclass 891 non-null int64\n", | |
"Name 891 non-null object\n", | |
"Sex 891 non-null object\n", | |
"Age 714 non-null float64\n", | |
"SibSp 891 non-null int64\n", | |
"Parch 891 non-null int64\n", | |
"Ticket 891 non-null object\n", | |
"Fare 891 non-null float64\n", | |
"Cabin 204 non-null object\n", | |
"Embarked 889 non-null object\n", | |
"dtypes: float64(2), int64(5), object(5)\n", | |
"memory usage: 90.5+ KB\n" | |
] | |
} | |
], | |
"source": [ | |
"# Another way to check your data is the info method\n", | |
"my_csv_file.info()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Here you can see 3 columns: \n", | |
"\n", | |
"- Name of the column in the CSV file\n", | |
"\n", | |
"- Number of cells that are not empty (non-null)\n", | |
"\n", | |
"- Type of the data in the column\n", | |
"\n", | |
"Most Machine Learning algorithms in scikit-learn can't deal with null values, so you need to take care of them." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# The Cabin column has too many missing values, so it's probably better to delete this column completely from our dataset.\n", | |
"# You can use the pandas drop method to do it\n", | |
"\n", | |
"my_csv_file = my_csv_file.drop(['Cabin'], axis = 1)\n", | |
"\n", | |
"# The Age and Embarked columns also have missing values.\n", | |
"# You can do a few things to fix that.\n", | |
"# One is the pandas dropna method, which deletes every row that has a NaN value in any of its columns\n", | |
"\n", | |
"my_csv_file = my_csv_file.dropna()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"<class 'pandas.core.frame.DataFrame'>\n", | |
"Int64Index: 712 entries, 0 to 890\n", | |
"Data columns (total 11 columns):\n", | |
"PassengerId 712 non-null int64\n", | |
"Survived 712 non-null int64\n", | |
"Pclass 712 non-null int64\n", | |
"Name 712 non-null object\n", | |
"Sex 712 non-null object\n", | |
"Age 712 non-null float64\n", | |
"SibSp 712 non-null int64\n", | |
"Parch 712 non-null int64\n", | |
"Ticket 712 non-null object\n", | |
"Fare 712 non-null float64\n", | |
"Embarked 712 non-null object\n", | |
"dtypes: float64(2), int64(5), object(4)\n", | |
"memory usage: 66.8+ KB\n" | |
] | |
} | |
], | |
"source": [ | |
"#Check our data again\n", | |
"\n", | |
"my_csv_file.info()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"#### So now every column has the same number of non-null values.\n", | |
"\n", | |
"Sometimes you may not want to delete rows with missing values.\n", | |
"\n", | |
"For example, in the Age column we could fill the missing values with the average age of all passengers (instead of dropping those rows with dropna).\n", | |
"\n", | |
"You can do it with code like this:\n", | |
"\n", | |
"#### meanAge = np.mean(my_csv_file.Age)\n", | |
"\n", | |
"#### my_csv_file.Age = my_csv_file.Age.fillna(meanAge)\n", | |
"\n", | |
"You can try different approaches and check what works better.\n", | |
"\n", | |
"# Now you can do Machine Learning*!\n", | |
"\n", | |
"*but only if you want to train your model on just one feature :)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"0.752808988764\n" | |
] | |
} | |
], | |
"source": [ | |
"from sklearn.neighbors import KNeighborsClassifier\n", | |
"\n", | |
"# Create an array of features (np.newaxis turns the 1-D Fare column into a 2-D array with one column)\n", | |
"X_train = np.array(my_csv_file['Fare'])[:, np.newaxis]\n", | |
"# Create an array of labels\n", | |
"Y_train = np.array(my_csv_file['Survived'])\n", | |
"\n", | |
"# Create the k-nearest neighbors classifier\n", | |
"knn = KNeighborsClassifier(n_neighbors = 3)\n", | |
"# Fit the data\n", | |
"knn.fit(X_train,Y_train)\n", | |
"# Compute the accuracy on the training data itself (an optimistic estimate)\n", | |
"model_score = knn.score(X_train,Y_train)\n", | |
"\n", | |
"print(model_score)\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"What we've got is a model that predicts whether a person survived or not depending on the fare they paid.\n", | |
"\n", | |
"#### But we can do better!\n", | |
"\n", | |
"What if we want to make a model with multiple features?\n", | |
"\n", | |
"We can do it in several ways.\n", | |
"\n", | |
"## Create a formula!" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"from patsy import dmatrices\n", | |
"\n", | |
"# Create a formula that describes the model for our machine learning algorithms.\n", | |
"# Here the ~ sign works like an = sign: the features on the right are used to predict Survived on the left.\n", | |
"# C() tells patsy that those variables are categorical\n", | |
"formula_ml = 'Survived ~ C(Pclass) + C(Sex) + Age + SibSp + Parch + C(Embarked)'\n", | |
"\n", | |
"# Assign the variables\n", | |
"Y_train, X_train = dmatrices(formula_ml, data=my_csv_file, return_type='dataframe')\n", | |
"\n", | |
"# Flatten the labels into a 1-D array, as scikit-learn expects\n", | |
"Y_train = np.asarray(Y_train).ravel()\n", | |
"# and then do the machine learning part" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Use NumPy!" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": { | |
"collapsed": true | |
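}, | |
"outputs": [], | |
"source": [ | |
"# Create the arrays from the underlying numpy array of the DataFrame\n", | |
"# (iterating over a DataFrame directly yields column names, not rows)\n", | |
"data = my_csv_file.values\n", | |
"Y_train = data[:, 1]\n", | |
"X_train = data[:, 2:10]\n", | |
"\n", | |
"# 1 and 2:10 are column numbers in the table; remember that counting starts from 0.\n", | |
"# Text columns such as Name, Sex and Ticket still need to be encoded as numbers before training.\n" | |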
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Make a custom function to do it!\n", | |
"##### Warning! Not Pythonic code!" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"def features_preparation(df):\n", | |
"    # Reset the index so that positions 0..n-1 are valid labels after dropna\n", | |
"    df = df.reset_index(drop=True)\n", | |
"    i=0\n", | |
"    x={}\n", | |
"    while i<len(df['Pclass']):\n", | |
"        x[i]=[df['Pclass'][i], df['Sex'][i], df['Age'][i], df['Fare'][i]]\n", | |
"        i+=1\n", | |
"    i=0\n", | |
"    X_train=[]\n", | |
"    while i<len(df['Pclass']):\n", | |
"        X_train.append(x[i])\n", | |
"        i+=1\n", | |
"    # Return the list of feature rows so the caller can use it\n", | |
"    return X_train\n", | |
"\n", | |
"X_train = features_preparation(my_csv_file)\n", | |
"\n", | |
"Y_train = np.array(my_csv_file.Survived)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"##### But this is pretty messy :) It's better to use the numpy or patsy methods." | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 2", | |
"language": "python", | |
"name": "python2" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 2 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython2", | |
"version": "2.7.10" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 0 | |
} |