The data you need is not in a single file. It may be spread across a number of text files, spreadsheets, or databases. Import the data you’re interested in as a collection of DataFrames and combine them to answer your central questions. This course is all about the act of combining — or merging — DataFrames. Organize, reshape, and aggregate multiple datasets to answer your specific questions.
Lead by Team Anaconda, Data Science Training
Different techniques to import multiple files into DataFrames. Share information between DataFrames using their indexes. How indexes work is essential to merging DataFrames.
Perform database-style operations to combine DataFrames. Appending and concatenating DataFrames while working with a variety of real-world datasets.
- Tools for pandas data import
pd.read_csv()
,pd.read_excel()
,pd.read_json()
,pd.read_html()
- Reindexing DataFrames
- indices: many index labels within Index data structures
- indexes: many pandas Index data structures
- Arithmetic with Series & DataFrames
from glob import glob
filenames = glob('sales*.csv')
dataframes = [pd.read_csv(f) for f in filenames]
w_mean = pd.read_csv('quarterly_mean_temp.csv', index_col='Month')
print(w_mean.index)
print(type(w_mean.index))
ordered = ['Jan', 'Apr', 'Jul', 'Oct']
w_mean2 = w_mean.reindex(ordered)
w_mean2.sort_index()
w_max.reindex(w_mean3.index).dropna()
weather = pd.read_csv('pittsburgh2013.csv', index_col='Date', parse_dates=True)
weather.loc['2013-07-01':'2013-07-07', 'PrecipitationIn'] * 2.54
// error
week1_range / week1_mean
// right way
week1_range.divide(week1_mean, axis='rows')
bronze = pd.read_csv('bronze_top5.csv', index_col=0)
silver = pd.read_csv('silver_top5.csv', index_col=0)
bronze + silver
bronze.add(silver)
bronze.add(silver, fill_value=0).add(gold, fill_value=0)
- Appending and concatenating Series
append()
stacks rows of s2 below s1- Method for Series and DataFrames
concar()
can stack row-wise or column-wise- pandas module function
- May need to reset the index after appending
result1 = pd.concat([s1, s2, s3])
result2 = s1.append(s2).append(s3)
new_east = northeast.append(south).reset_index(drop=True)
new_east = pd.concat([northeast, south], ignore_index=True)
pd.concat([population, unemployment], axis=1)
pd.concat([rain2013, rain2014], axis=0)
rain1314 = pd.concat([rain2013, rain2014], keys=[2013, 2014], axis=0)
print(rain1314.loc[2014])
rain1314 = pd.concat([rain2013, rain2014], axis='columns')
rain1314 = pd.concat([rain2013, rain2014], keys=[2013, 2014], axis='columns')
- Concatenating rows with
axis='rows'
- Keys and MultiIndexes
- Concatenating columns with
axis='columns'
- Using a multi-index on columns
A = np.arange(8).reshape(2,4) + 0.1
B = np.arange(6).reshape(2,3) + 0.2
C = np.arange(12).reshape(3,4) + 0.3
np.hstack([B, A])
np.concatenate([B, A], axis=1)
np.vstack([A, C])
np.concatenate([A, C], axis=0)
// errors
np.concatenate([A, B], axis=0) # incompatible columns
np.concatenate([A, C], axis=1) # incompatible rows
- Outer join
- Union of index sets (all labels, no repetition)
- Missing fields filled with NaN
- Inner join
- Intersection of index sets (only common labels)
pd.concat([population, unemployment], axis=1, join='inner')
pd.concat([population, unemployment], axis=1, join='outer')
Techniques for merging with left joins, right joins, inner joins, and outer joins. Ordered merging is useful to merge DataFrames with columns that have natural orderings, like date-time columns.
- Merding DataFrames
- Using suffixes
- Specifying columns to merge
- Joining DataFrames
- Inner join, Left and right join
- Outer join
- Which should you use?
- df1.append(df2): stacking vertically
- pd.concat([df1, df2]): stacking many horizontally or vertically, simple inner/outer joins on Indexes
- df1.join(df2): inner/outer/le!/right joins on Indexes
- pd.merge([df1, df2]): many joins on multiple columns
- Ordered merges
pd.merge(population, cities)
pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes=['_bronze', '_gold'])
pd.merge(counties, cities, left_on='CITY NAME', right_on='City')
pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes=['_bronze', '_gold'], how='inner')
pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes=['_bronze', '_gold'], how='left')
pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes=['_bronze', '_gold'], how='outer')
population.join(unemployment)
population.join(unemployment, how='right')
population.join(unemployment, how='inner')
population.join(unemployment, how='outer')
pd.merge(hardware, software, how='outer').sort_values('Date')
pd.merge_ordered(hardware, software)
pd.merge_ordered(hardware, software, on=['Date', 'Company'], suffixes=['_hardware', '_software'])
pd.merge_ordered(stocks, gdp, on='Date', fill_method='ffill')
An in-depth case study using Olympic medal data
- Reminder: loading & merging files
- Constructing a pivot table
- Reshaping and plo!ing
- Graphical summary