These are notes from this great video series by Kevin Markham: http://www.dataschool.io/easier-data-analysis-with-pandas/
import pandas as pd
pd.read_table('data/chipotle.tsv')
pd.read_table('http://bit.ly/chiporders')
read_table assumes that the data file is tab-separated and that the first row is a header row. In this case, both assumptions hold.
pd.read_table('http://bit.ly/movieusers')
Here, the original data is pipe-separated (|)
pd.read_table('http://bit.ly/movieusers', sep='|')
Also, the first row is not a header
pd.read_table('http://bit.ly/movieusers', sep='|', header=None)
Define header.
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
pd.read_table('http://bit.ly/movieusers', sep='|', header=None, names=user_cols)
ufo = pd.read_table('http://bit.ly/uforeports', sep=',')
OR
ufo = pd.read_csv('http://bit.ly/uforeports')
type(ufo) //pandas.core.frame.DataFrame
//first 5 rows
ufo.head()
ufo['City']: will select out the 'City' series (column)
type(ufo['City']) //pandas.core.series.Series
//The following two are equivalent, but dot notation only works in certain conditions
ufo.City
ufo['City']
For column names with spaces, you have to use bracket notation:
ufo['Colors Reported']
Concatenating string series
ufo['Location'] = ufo.City + ufo.State
Puts the city right next to the state in a new Location column (plain string concatenation)
You HAVE to use bracket notation to create a new column
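A quick sketch of why (Location2 is a made-up column name here): dot notation on a nonexistent column just sets a Python attribute on the dataframe object.
ufo.Location2 = ufo.City + ufo.State //no column is created (newer pandas even warns about this)
'Location2' in ufo.columns //False
ufo['Location2'] = ufo.City + ufo.State //bracket notation actually creates the column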
ufo.describe(include=['object'])
- ipynb tip: anytime you have a method or function and want to remember the arguments, click anywhere inside the parentheses and hit shift + tab (three times for a screen split)
ufo.rename(columns={'Colors Reported':'Colors_Reported', 'Shape Reported':'Shape_Reported'}, inplace=True)
OR
ufo_cols = ['city', 'colors reported', 'shape reported', 'state', 'time']
ufo.columns = ufo_cols
(Overwrites the columns)
OR
ufo = pd.read_csv('http://bit.ly/uforeports', names=ufo_cols, header=0)
- Tip: string manipulation on all columns
ufo.columns = ufo.columns.str.replace(' ', '_')
ufo.drop('Colors Reported', axis=1, inplace=True) //axis is a direction (axis 0 is the row axis, axis 1 is the column axis)
OR
ufo.drop(['Colors Reported', 'City'], axis=1, inplace=True)
- tip: dropping a row
ufo.drop([0,1] , axis=0, inplace=True) => will remove rows at indices 0 and 1
movies['title'].sort_values(ascending=False)
//sort the whole dataframe by title (keeps each row intact)
movies.sort_values('title')
//sorting by two columns: first by content_rating, then by duration
movies.sort_values(['content_rating', 'duration'])
//Returns a series of booleans
is_long = movies.duration >= 200
//keeps only the rows where the boolean is True
movies[is_long]
Shorter:
movies[movies.duration >= 200]
Breaking it down: "movies.duration >= 200" returns a series of booleans whose length equals the number of rows. movies[booleans] then keeps only the rows where the boolean is True.
movies[(movies.duration >=200) | (movies.genre == 'Drama')]
movies[movies.genre.isin(['Crime', 'Drama', 'Action'])]
//drop a column
drinks.drop('continent', axis=1).head()
//drop a row (by index label)
drinks.drop(2, axis=0).head()
drinks.mean(axis=0) //Mean of each column
drinks.mean(axis=1) //Mean of each row
orders.item_name.str.upper()
//Returns a series of booleans
orders.item_name.str.contains('Chicken')
orders.choice_description.str.replace('[', '')
//Replacing left and right brackets using a regex character class
orders.choice_description.str.replace('[\[\]]', '', regex=True)
//Will show the data types of all columns in the drinks dataframe
drinks.dtypes
drinks['beer_servings'] = drinks.beer_servings.astype(float)
//Or set the dtype while reading from the file
drinks = pd.read_csv('path/to/csv', dtype={'beer_servings':float})
orders.item_price.str.replace('$', '').astype(float).mean()
- tip: convert a series of booleans to 0s and 1s
orders.item_name.str.contains('Chicken').astype(float).head()
drinks.groupby('continent').beer_servings.mean()
//equivalent, computed manually for a single continent:
drinks[drinks.continent=='Africa'].beer_servings.mean()
drinks.groupby('continent').beer_servings.max() (or mean(), or min())
drinks.groupby('continent').beer_servings.agg(['count', 'min', 'max', 'mean'])
//If using ipynb/jupyter
%matplotlib inline
//Generates a bar visualization
drinks.groupby('continent').mean().plot(kind='bar')
movies.genre.value_counts().head()
//All unique values
movies.genre.unique()
//Number of unique values
movies.genre.nunique()
//Cross-tabulation: a tally of how many movies belong to each combination of genre and content rating
pd.crosstab(movies.genre, movies.content_rating)
movies.duration.describe()
//Tally the number of movies for each duration value. Not that useful here with a numeric column
movies.duration.value_counts()
//if in jupyter/ipynb
%matplotlib inline
movies.duration.plot(kind='hist')
movies.genre.value_counts().plot(kind='bar')
//isnull() shows True for each missing value, False for each present one
ufo.isnull().tail()
//the exact opposite
ufo.notnull().tail()
//number of missing values in each of the columns
ufo.isnull().sum()
How this works:
//Pandas can sum booleans (True is converted to 1, False to 0)
pd.Series([True, False, True]).sum()
//sum() selects axis=0 by default, so this sums down the rows: one total per column
ufo.isnull().sum(axis=0)
equivalent to ufo.isnull().sum()
ufo[ufo.City.isnull()]
//Drop any row in which at least one column has a missing value
ufo.dropna(how='any').shape
//Drop any row in which ALL the columns have missing values
ufo.dropna(how='all').shape
//Drop rows where either of these two columns is missing
ufo.dropna(subset=['City', 'Shape Reported'], how='any').shape
//Drop rows where both of these columns are missing
ufo.dropna(subset=['City', 'Shape Reported'], how='all').shape
//With value_counts
ufo['Shape Reported'].value_counts(dropna=False)
ufo['Shape Reported'].fillna(value='VARIOUS', inplace=True)
The index is sometimes called the row label.
The index is NOT counted in the dataframe's dimensions (shape).
Why the index exists:
- identification
- selection
- alignment
drinks[drinks.continent=='South America']
The index was not modified by this filtering, so the index can keep track of rows across dataframe modifications.
drinks.loc[23, 'beer_servings'] //row with label 23, beer_servings value
//Make the "country" column the index. Cool!
drinks.set_index('country', inplace=True)
//print the index
drinks.index
//clear the index name (sometimes useful)
drinks.index.name = None
//reset the index: restore the name first, then move it back into a column
drinks.index.name = 'country'
drinks.reset_index(inplace=True)
- tip: using describe
drinks.describe().loc['25%', 'beer_servings']
//the series came from the dataframe, so it keeps the dataframe's index
drinks.continent.head() => will print the index too
drinks.set_index('country', inplace=True)
//Then you can do things like
drinks.continent.value_counts()['Africa']
//Sorting by index
drinks.continent.value_counts().sort_index()
people = pd.Series([50000, 850000], index=['Albania', 'Andorra'], name='population')
//Goal: get total beer servings by country (the initial dataframe contains per-capita data)
drinks.beer_servings * people //multiplication aligns on the shared index; countries missing from 'people' become NaN
- tip: concatenating
//axis=1 means concat COLUMNS
pd.concat([drinks, people], axis=1)
import pandas as pd
//row 0, all columns
ufo.loc[0, :]
//first three rows, all columns
ufo.loc[[0, 1, 2], :] <=> ufo.loc[0:2, :]
Note: loc is inclusive on both sides
//column selection
ufo.loc[:, ['City', 'State']]
//City through State
ufo.loc[:, 'City':'State']
//Using boolean filtering, the old way:
ufo[ufo.City=='Oakland']
//With loc, also selecting a column:
ufo.loc[ufo.City=='Oakland', 'State']
"i" stands for integer position
//All rows, columns 0 and 3
ufo.iloc[:, [0, 3]]
//All rows, first four columns (positions 0 through 3)
ufo.iloc[:, 0:4]
Important note: iloc is EXCLUSIVE of the second number and inclusive of the first (so 0:4 returns the columns at positions 0, 1, 2 and 3), UNLIKE loc
Kind of a hybrid between loc and iloc: ix figures out whether you're referring to a position or a label.
drinks.ix['Albania', 0]
drinks.ix[1, 'beer_servings']
drinks.ix['Albania':'Andorra', 0:2]
Logic behind ix: when using a position, it follows iloc's inclusive:exclusive pattern; when using a label, it follows loc's inclusive:inclusive pattern
ufo.ix[0:2, 0:2]
ix is kind of confusing because you need to keep track of its internal logic, so it's not always recommended: use loc and iloc instead. (ix was in fact deprecated in pandas 0.20 and removed in 1.0.)
//display info about the dataframe
drinks.info()
//Exhaustive memory usage information
drinks.info(memory_usage='deep')
//break it down by column
drinks.memory_usage(deep=True)
drinks.memory_usage(deep=True).sum()
//Casting a column as a category for memory efficiency
Object columns can take up a lot of space. How can I be more space-efficient?
drinks['continent'] = drinks.continent.astype('category')
//.cat method for category
//how pandas represents the category internally: small integer codes
drinks.continent.cat.codes
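A rough before/after sketch of the savings (exact numbers vary; assumes drinks was freshly loaded with continent still an object column):
drinks.memory_usage(deep=True) //'continent' as object: every string stored separately
drinks['continent'] = drinks.continent.astype('category')
drinks.memory_usage(deep=True) //'continent' as category: small integer codes plus one lookup table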
//ordered categories (newer pandas defines them via a CategoricalDtype; the old astype('category', categories=...) signature was removed)
from pandas.api.types import CategoricalDtype
df['quality'] = df.quality.astype(CategoricalDtype(['good', 'very good', 'excellent'], ordered=True))
//Sorting in the logical order defined for the category
df.sort_values('quality')
//Using boolean conditions
df.loc[df.quality > 'good', :]
//Titanic survival data set (!)
train = pd.read_csv('http://bit.ly/kaggletrain')
train.head()
feature_cols = ['Pclass', 'Parch']
X = train.loc[:, feature_cols]
y = train.Survived
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X, y)
test = pd.read_csv('http://bit.ly/kaggletest')
X_new = test.loc[:, feature_cols]
new_pred_class = logreg.predict(X_new)
//Kaggle asks for a .csv file with two columns: passenger id from the testing set, and the predicted classes
pd.DataFrame({'PassengerId':test.PassengerId, 'Survived': new_pred_class}).set_index('PassengerId').to_csv('/name/of/file.csv')
- tip: save dataframe to disk
train.to_pickle('train.pkl')
//recover dataframe from the pickle file
pd.read_pickle('train.pkl')
//Series map method
train['Sex_male'] = train.Sex.map({'female':0, 'male':1})
//This creates one column for every possible value of the column given as parameter
pd.get_dummies(train.Sex)
Generally, if a variable has K possible values, use K-1 dummies: the remaining column is redundant because its value is implied by the other K-1.
pd.get_dummies(train.Sex, prefix='Sex').iloc[:, 1:]
Will produce a single column called "Sex_male".
embarked_dummies = pd.get_dummies(train.Embarked, prefix='Embarked').iloc[:, 1:]
Note: 'Embarked' can take on 3 possible values, so get_dummies produces three columns; the .iloc[:, 1:] keeps the K-1 = two dummy columns
//attach to the dataframe, using axis=1 because we are concatenating columns
train = pd.concat([train, embarked_dummies], axis=1)
- tip: passing a dataframe to pd.get_dummies
pd.get_dummies(train, columns=['Sex', 'Embarked'])
//for K-1 dummies
pd.get_dummies(train, columns=['Sex', 'Embarked'], drop_first=True)
ufo['Time'] = pd.to_datetime(ufo.Time)
Note: no need to specify the format of the date; pandas just figures it out. That doesn't always work, but many fallback options exist.
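Two of those fallbacks (the format string below is an assumption about how the Time strings look, e.g. '12/31/2000 23:59'):
pd.to_datetime(ufo.Time, format='%m/%d/%Y %H:%M') //spell out the format explicitly
pd.to_datetime(ufo.Time, errors='coerce') //unparseable values become NaT instead of raising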
ufo.head()
ufo.Time.dt.hour
ufo.Time.dt.weekday_name //removed in newer pandas; use ufo.Time.dt.day_name() instead
ts = pd.to_datetime('1/1/1999')
ufo.loc[ufo.Time >= ts, :]
//Produces a Timedelta object
ufo.Time.max() - ufo.Time.min()
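//A Timedelta has handy attributes, e.g. the span in days
(ufo.Time.max() - ufo.Time.min()).days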
//if in jupyter/ipynb
%matplotlib inline
ufo['Year'] = ufo.Time.dt.year
ufo.Year.value_counts().sort_index().plot()
users.zip_code.duplicated()
users.zip_code.duplicated().sum()
//getting duplicated rows in the dataframe
users.duplicated()
users.loc[users.duplicated(), :]
//mark duplicates as True except the first instance
users.loc[users.duplicated(keep='first'), :]
//mark duplicates as True except the last instance
users.loc[users.duplicated(keep='last'), :]
//mark ALL duplicates as True
users.loc[users.duplicated(keep=False), :]
//Drop duplicates: let's say age+zip_code constitutes a unique id
//Now looking at duplicate age+zip_code combinations only
users.duplicated(subset=['age', 'zip_code']).sum()
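//To actually drop them (keep='first' is the default, so the first occurrence survives)
users.drop_duplicates(subset=['age', 'zip_code']).shape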
//tally of missing content rating values
movies.content_rating.isnull().sum()
movies[movies.content_rating.isnull()]
//converting a custom "missing value" label, e.g. NOT RATED => NaN
import numpy as np
movies[movies.content_rating=='NOT RATED'].content_rating = np.nan
Yields a warning: SettingWithCopyWarning
Replace command with:
movies.loc[movies.content_rating=='NOT RATED', 'content_rating'] = np.nan
Does not throw a warning
Why did it throw a warning for
movies[movies.content_rating=='NOT RATED'].content_rating = np.nan
get item: movies[movies.content_rating=='NOT RATED'], then set item: .content_rating = ...
Problem: pandas doesn't know whether the get item returned a view or a copy, so the warning effectively says "not sure what happened".
top_movies.loc[0, 'duration'] = 150
Also throws a SettingWithCopyWarning
Pandas is not sure whether top_movies is a reference to movies or a copy.
The problem came from the line of code that created top_movies; adding .copy() removes the ambiguity:
top_movies = movies.loc[movies.star_rating >= 9, :].copy()
What if I want to show all the rows?
pd.set_option('display.max_rows', None)
pd.reset_option('display.max_rows')
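//the same option family covers columns too; get_option checks the current setting
pd.set_option('display.max_columns', None)
pd.get_option('display.max_rows')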
//the index param specifies row labels if needed
df = pd.DataFrame({'id':[100, 101, 102], 'color':['red', 'blue', 'yellow']}, columns=['id', 'color'], index=['a', 'b', 'c'])
//other format: list of lists
pd.DataFrame([[100, 'red'], [101, 'blue'], [102, 'yellow']], columns=['id', 'color'])
import numpy as np
//create a 4x2 numpy array of random numbers b/w 0 and 1
arr = np.random.rand(4, 2)
//student ids run from 100 through 109 (np.arange excludes the stop value)
//Also, use the student column as index
pd.DataFrame({'student': np.arange(100, 110, 1), 'test': np.random.randint(60, 101, 10)}).set_index('student')
s = pd.Series(['round', 'square'], index=['c', 'b'], name='shape')
pd.concat([df, s], axis=1)
Note: the name of the series will be the column name for the concatenated series
train['Sex_num'] = train.Sex.map({'female':0, 'male':1})
train['Name_length'] = train.Name.apply(len)
Applies a function along either axis of the dataframe.
Apply this function along axis=0 (down each column); returns the max value for each column
drinks.loc[:, 'beer_servings':'wine_servings'].apply(max, axis=0)
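//along axis=1 it is applied to each row; a sketch (using numpy) to find which column holds each row's max:
import numpy as np
drinks.loc[:, 'beer_servings':'wine_servings'].apply(np.argmax, axis=1)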
Apply a function to EVERY element (cell) of a dataframe
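//a sketch: applymap is the element-wise counterpart of apply (in pandas >= 2.1 the same thing is spelled DataFrame.map)
drinks.loc[:, 'beer_servings':'wine_servings'].applymap(float)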