These are notes from this great video series by Kevin Markham: http://www.dataschool.io/easier-data-analysis-with-pandas/
import pandas as pd
pd.read_table('data/chipotle.tsv')
pd.read_table('http://bit.ly/chiporders')
read_table assumes that the data file is tab-separated and that the first row is a header row. In this case, both assumptions hold.
pd.read_table('http://bit.ly/movieusers')
Here, the original data is pipe-separated (|)
pd.read_table('http://bit.ly/movieusers', sep='|')
Also, the first row is not a header
pd.read_table('http://bit.ly/movieusers', sep='|', header=None)
Define header.
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
pd.read_table('http://bit.ly/movieusers', sep='|', header=None, names=user_cols)
ufo = pd.read_table('http://bit.ly/uforeports', sep=',')
OR
ufo = pd.read_csv('http://bit.ly/uforeports')
type(ufo) //pandas.core.frame.DataFrame
//first 5 rows
ufo.head()
ufo['City']: will select out the 'City' series (column)
type(ufo['City']) //pandas.core.series.Series
//The following two are equivalent, but dot notation only works in certain conditions
ufo.City
ufo['City']
For column names with spaces, you have to use bracket notation:
ufo['Colors Reported']
Concatenating string series
ufo['Location'] = ufo.City + ufo.State
Puts the city right next to the state in a new Location column (plain string concatenation)
You HAVE to use bracket notation to create a new column
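A quick sketch of why (Location2 is a made-up column name here): dot notation on a nonexistent column just sets a Python attribute on the dataframe object.
ufo.Location2 = ufo.City + ufo.State //no column is created (newer pandas even warns about this)
'Location2' in ufo.columns //False
ufo['Location2'] = ufo.City + ufo.State //bracket notation actually creates the column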
ufo.describe(include=['object'])
- ipynb tip: anytime you have a method or function and want to remember the arguments, click anywhere inside the parentheses and hit shift + tab (three times for a screen split)
ufo.rename(columns={'Colors Reported':'Colors_Reported', 'Shape Reported':'Shape_Reported'}, inplace=True)
OR
ufo_cols = ['city', 'colors reported', 'shape reported', 'state', 'time']
ufo.columns = ufo_cols
(Overwrites the columns)
OR
ufo = pd.read_csv('http://bit.ly/uforeports', names=ufo_cols, header=0)
- Tip: string manipulation on all columns
ufo.columns = ufo.columns.str.replace(' ', '_')
ufo.drop('Colors Reported', axis=1, inplace=True) //axis is a direction (axis 0 is the row axis, axis 1 is the column axis)
OR
ufo.drop(['Colors Reported', 'City'], axis=1, inplace=True)
- tip: dropping a row
ufo.drop([0,1] , axis=0, inplace=True) => will remove rows at indices 0 and 1
movies['title'].sort_values(ascending=False)
//sort the whole dataframe by title (keeps each row intact)
movies.sort_values('title')
//sorting by two columns: first by content_rating, then by duration
movies.sort_values(['content_rating', 'duration'])
//Returns a series of booleans
is_long = movies.duration >= 200
//keeps only the rows where the boolean is True
movies[is_long]
Shorter:
movies[movies.duration >= 200]
Breaking it down: "movies.duration >= 200" returns a series of booleans whose length equals the number of rows. movies[booleans] then keeps only the rows where the boolean is True.
movies[(movies.duration >=200) | (movies.genre == 'Drama')]
movies[movies.genre.isin(['Crime', 'Drama', 'Action'])]
//drop a column
drinks.drop('continent', axis=1).head()
//drop a row (by index label)
drinks.drop(2, axis=0).head()
drinks.mean(axis=0) //Mean of each column
drinks.mean(axis=1) //Mean of each row
orders.item_name.str.upper()
//Returns a series of booleans
orders.item_name.str.contains('Chicken')
orders.choice_description.str.replace('[', '')
//Replacing left and right brackets using a regex character class
orders.choice_description.str.replace('[\[\]]', '', regex=True)
//Will show the data types of all columns in the drinks dataframe
drinks.dtypes
drinks['beer_servings'] = drinks.beer_servings.astype(float)
//Or set the dtype while reading from the file
drinks = pd.read_csv('path/to/csv', dtype={'beer_servings':float})
orders.item_price.str.replace('$', '').astype(float).mean()
- tip: convert a series of booleans to 0s and 1s
orders.item_name.str.contains('Chicken').astype(float).head()
drinks.groupby('continent').beer_servings.mean()
//equivalent, computed manually for a single continent:
drinks[drinks.continent=='Africa'].beer_servings.mean()
drinks.groupby('continent').beer_servings.max() (or mean(), or min())
drinks.groupby('continent').beer_servings.agg(['count', 'min', 'max', 'mean'])
//If using ipynb/jupyter
%matplotlib inline
//Generates a bar visualization
drinks.groupby('continent').mean().plot(kind='bar')
movies.genre.value_counts().head()
//All unique values
movies.genre.unique()
//Number of unique values
movies.genre.nunique()
//Cross-tabulation: a tally of how many movies belong to each combination of genre and content rating
pd.crosstab(movies.genre, movies.content_rating)
movies.duration.describe()
//Tally the number of movies for each duration value. Not that useful here with a numeric column
movies.duration.value_counts()
//if in jupyter/ipynb
%matplotlib inline
movies.duration.plot(kind='hist')
movies.genre.value_counts().plot(kind='bar')
//isnull() shows True for each missing value, False for each present one
ufo.isnull().tail()
//the exact opposite
ufo.notnull().tail()
//number of missing values in each of the columns
ufo.isnull().sum()
How this works:
//Pandas can sum booleans (True is converted to 1, False to 0)
pd.Series([True, False, True]).sum()
//sum() selects axis=0 by default, so this sums down the rows: one total per column
ufo.isnull().sum(axis=0)
equivalent to ufo.isnull().sum()
ufo[ufo.City.isnull()]
//Drop any row in which at least one column has a missing value
ufo.dropna(how='any').shape
//Drop any row in which ALL the columns have missing values
ufo.dropna(how='all').shape
//Drop rows where either of these two columns is missing
ufo.dropna(subset=['City', 'Shape Reported'], how='any').shape
//Drop rows where both of these columns are missing
ufo.dropna(subset=['City', 'Shape Reported'], how='all').shape
//With value_counts
ufo['Shape Reported'].value_counts(dropna=False)
ufo['Shape Reported'].fillna(value='VARIOUS', inplace=True)
The index is sometimes called the row label.
The index is NOT counted in the dataframe's dimensions (shape).
Why the index exists:
- identification
- selection
- alignment
drinks[drinks.continent=='South America']
The index was not modified by this filtering, so the index can keep track of rows across dataframe modifications.
drinks.loc[23, 'beer_servings'] //row with label 23, beer_servings value
//Make the "country" column the index. Cool!
drinks.set_index('country', inplace=True)
//print the index
drinks.index
//clear the index name (sometimes useful)
drinks.index.name = None
//reset the index: restore the name first, then move it back into a column
drinks.index.name = 'country'
drinks.reset_index(inplace=True)
- tip: using describe
drinks.describe().loc['25%', 'beer_servings']
//the series came from the dataframe, so it keeps the dataframe's index
drinks.continent.head() => will print the index too
drinks.set_index('country', inplace=True)
//Then you can do things like
drinks.continent.value_counts()['Africa']
//Sorting by index
drinks.continent.value_counts().sort_index()
people = pd.Series([50000, 850000], index=['Albania', 'Andorra'], name='population')
//Goal: get total beer servings by country (the initial dataframe contains per-capita data)
drinks.beer_servings * people //multiplication aligns on the shared index; countries missing from 'people' become NaN
- tip: concatenating
//axis=1 means concat COLUMNS
pd.concat([drinks, people], axis=1)
import pandas as pd
//row 0, all columns
ufo.loc[0, :]
//first three rows, all columns
ufo.loc[[0, 1, 2], :] <=> ufo.loc[0:2, :]
Note: loc is inclusive on both sides
//column selection
ufo.loc[:, ['City', 'State']]
//City through State
ufo.loc[:, 'City':'State']
//Using boolean filtering, the old way:
ufo[ufo.City=='Oakland']
//With loc, also selecting a column:
ufo.loc[ufo.City=='Oakland', 'State']
"i" stands for integer position
//All rows, columns 0 and 3
ufo.iloc[:, [0, 3]]
//All rows, first four columns (positions 0 through 3)
ufo.iloc[:, 0:4]
Important note: iloc is EXCLUSIVE of the second number and inclusive of the first (so 0:4 returns the columns at positions 0, 1, 2 and 3), UNLIKE loc
Kind of a hybrid between loc and iloc: ix figures out whether you're referring to a position or a label.
drinks.ix['Albania', 0]
drinks.ix[1, 'beer_servings']
drinks.ix['Albania':'Andorra', 0:2]
Logic behind ix: when using a position, it follows iloc's inclusive:exclusive pattern; when using a label, it follows loc's inclusive:inclusive pattern
ufo.ix[0:2, 0:2]
ix is kind of confusing because you need to keep track of its internal logic, so it's not always recommended: use loc and iloc instead. (ix was in fact deprecated in pandas 0.20 and removed in 1.0.)
//display info about the dataframe
drinks.info()
//Exhaustive memory usage information
drinks.info(memory_usage='deep')
//break it down by column
drinks.memory_usage(deep=True)
drinks.memory_usage(deep=True).sum()
//Casting a column as a category for memory efficiency
Object columns can take up a lot of space. How can I be more space-efficient?
drinks['continent'] = drinks.continent.astype('category')
//.cat method for category
//how pandas represents the category internally: small integer codes
drinks.continent.cat.codes
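A rough before/after sketch of the savings (exact numbers vary; assumes drinks was freshly loaded with continent still an object column):
drinks.memory_usage(deep=True) //'continent' as object: every string stored separately
drinks['continent'] = drinks.continent.astype('category')
drinks.memory_usage(deep=True) //'continent' as category: small integer codes plus one lookup table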
//ordered categories (newer pandas defines them via a CategoricalDtype; the old astype('category', categories=...) signature was removed)
from pandas.api.types import CategoricalDtype
df['quality'] = df.quality.astype(CategoricalDtype(['good', 'very good', 'excellent'], ordered=True))
//Sorting in the logical order defined for the category
df.sort_values('quality')
//Using boolean conditions
df.loc[df.quality > 'good', :]
//Titanic survival data set (!)
train = pd.read_csv('http://bit.ly/kaggletrain')
train.head()
feature_cols = ['Pclass', 'Parch']
X = train.loc[:, feature_cols]
y = train.Survived
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X, y)
test = pd.read_csv('http://bit.ly/kaggletest')
X_new = test.loc[:, feature_cols]
new_pred_class = logreg.predict(X_new)
//Kaggle asks for a .csv file with two columns: passenger id from the testing set, and the predicted classes
pd.DataFrame({'PassengerId':test.PassengerId, 'Survived': new_pred_class}).set_index('PassengerId').to_csv('/name/of/file.csv')
- tip: save dataframe to disk
train.to_pickle('train.pkl')
//recover dataframe from the pickle file
pd.read_pickle('train.pkl')
//Series map method
train['Sex_male'] = train.Sex.map({'female':0, 'male':1})
//This creates one column for every possible value of the column given as parameter
pd.get_dummies(train.Sex)
Generally, if a variable has K possible values, use K-1 dummies: the remaining column is redundant because its value is implied by the other K-1.
pd.get_dummies(train.Sex, prefix='Sex').iloc[:, 1:]
Will produce a single column called "Sex_male".
embarked_dummies = pd.get_dummies(train.Embarked, prefix='Embarked').iloc[:, 1:]
Note: 'Embarked' can take on 3 possible values, so get_dummies produces three columns; the .iloc[:, 1:] keeps the K-1 = two dummy columns
//attach to the dataframe, using axis=1 because we are concatenating columns
train = pd.concat([train, embarked_dummies], axis=1)
- tip: passing a dataframe to pd.get_dummies
pd.get_dummies(train, columns=['Sex', 'Embarked'])
//for K-1 dummies
pd.get_dummies(train, columns=['Sex', 'Embarked'], drop_first=True)
ufo['Time'] = pd.to_datetime(ufo.Time)
Note: no need to specify the format of the date; pandas just figures it out. That doesn't always work, but many fallback options exist.
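Two of those fallbacks (the format string below is an assumption about how the Time strings look, e.g. '12/31/2000 23:59'):
pd.to_datetime(ufo.Time, format='%m/%d/%Y %H:%M') //spell out the format explicitly
pd.to_datetime(ufo.Time, errors='coerce') //unparseable values become NaT instead of raising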
ufo.head()
ufo.Time.dt.hour
ufo.Time.dt.weekday_name //removed in newer pandas; use ufo.Time.dt.day_name() instead
ts = pd.to_datetime('1/1/1999')
ufo.loc[ufo.Time >= ts, :]
//Produces a Timedelta object
ufo.Time.max() - ufo.Time.min()
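//A Timedelta has handy attributes, e.g. the span in days
(ufo.Time.max() - ufo.Time.min()).days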
//if in jupyter/ipynb
%matplotlib inline
ufo['Year'] = ufo.Time.dt.year
ufo.Year.value_counts().sort_index().plot()
users.zip_code.duplicated()
users.zip_code.duplicated().sum()
//getting duplicated rows in the dataframe
users.duplicated()
users.loc[users.duplicated(), :]
//mark duplicates as True except the first instance
users.loc[users.duplicated(keep='first'), :]
//mark duplicates as True except the last instance
users.loc[users.duplicated(keep='last'), :]
//mark ALL duplicates as True
users.loc[users.duplicated(keep=False), :]
//Drop duplicates: let's say age+zip_code constitutes a unique id
//Now looking at duplicate age+zip_code combinations only
users.duplicated(subset=['age', 'zip_code']).sum()
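//To actually drop them (keep='first' is the default, so the first occurrence survives)
users.drop_duplicates(subset=['age', 'zip_code']).shape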
//tally of missing content rating values
movies.content_rating.isnull().sum()
movies[movies.content_rating.isnull()]
//converting a custom "missing value" label, e.g. NOT RATED => NaN
import numpy as np
movies[movies.content_rating=='NOT RATED'].content_rating = np.nan
Yields a warning: SettingWithCopyWarning
Replace command with:
movies.loc[movies.content_rating=='NOT RATED', 'content_rating'] = np.nan
Does not throw a warning
Why did it throw a warning for
movies[movies.content_rating=='NOT RATED'].content_rating = np.nan
get item: movies[movies.content_rating=='NOT RATED'], then set item: .content_rating = ...
Problem: pandas doesn't know whether the get item returned a view or a copy, so the warning effectively says "not sure what happened".
top_movies.loc[0, 'duration'] = 150
Also throws a SettingWithCopyWarning
Pandas is not sure whether top_movies is a reference to movies or a copy.
The problem came from the line of code that created top_movies; adding .copy() removes the ambiguity:
top_movies = movies.loc[movies.star_rating >= 9, :].copy()
What if I want to show all the rows?
pd.set_option('display.max_rows', None)
pd.reset_option('display.max_rows')
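//the same option family covers columns too; get_option checks the current setting
pd.set_option('display.max_columns', None)
pd.get_option('display.max_rows')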
//the index param specifies row labels if needed
df = pd.DataFrame({'id':[100, 101, 102], 'color':['red', 'blue', 'yellow']}, columns=['id', 'color'], index=['a', 'b', 'c'])
//other format: list of lists
pd.DataFrame([[100, 'red'], [101, 'blue'], [102, 'yellow']], columns=['id', 'color'])
import numpy as np
//create a 4x2 numpy array of random numbers b/w 0 and 1
arr = np.random.rand(4, 2)
//student ids run from 100 through 109 (np.arange excludes the stop value)
//Also, use the student column as index
pd.DataFrame({'student': np.arange(100, 110, 1), 'test': np.random.randint(60, 101, 10)}).set_index('student')
s = pd.Series(['round', 'square'], index=['c', 'b'], name='shape')
pd.concat([df, s], axis=1)
Note: the name of the series will be the column name for the concatenated series
train['Sex_num'] = train.Sex.map({'female':0, 'male':1})
train['Name_length'] = train.Name.apply(len)
Applies a function along either axis of the dataframe.
Apply this function along axis=0 (down each column); returns the max value for each column
drinks.loc[:, 'beer_servings':'wine_servings'].apply(max, axis=0)
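//along axis=1 it is applied to each row; a sketch (using numpy) to find which column holds each row's max:
import numpy as np
drinks.loc[:, 'beer_servings':'wine_servings'].apply(np.argmax, axis=1)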
Apply a function to EVERY element (cell) of a dataframe
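//a sketch: applymap is the element-wise counterpart of apply (in pandas >= 2.1 the same thing is spelled DataFrame.map)
drinks.loc[:, 'beer_servings':'wine_servings'].applymap(float)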