Merging DataFrames with pandas

The data you need is not in a single file. It may be spread across a number of text files, spreadsheets, or databases. Import the data you’re interested in as a collection of DataFrames and combine them to answer your central questions. This course is all about the act of combining — or merging — DataFrames. Organize, reshape, and aggregate multiple datasets to answer your specific questions.

Lead by Team Anaconda, Data Science Training

Preparing data

Different techniques to import multiple files into DataFrames. Share information between DataFrames using their indexes. How indexes work is essential to merging DataFrames.

Perform database-style operations to combine DataFrames. Appending and concatenating DataFrames while working with a variety of real-world datasets.

Tools for pandas data import
- pd.read_csv(), pd.read_excel(), pd.read_json(), pd.read_html()
Reindexing DataFrames
- indices: many index labels within Index data structures
- indexes: many pandas Index data structures
Arithmetic with Series & DataFrames

from glob import glob
filenames = glob('sales*.csv')
dataframes = [pd.read_csv(f) for f in filenames]

w_mean = pd.read_csv('quarterly_mean_temp.csv', index_col='Month')
print(w_mean.index)
print(type(w_mean.index))

ordered = ['Jan', 'Apr', 'Jul', 'Oct']
w_mean2 = w_mean.reindex(ordered)
w_mean2.sort_index()

w_max.reindex(w_mean3.index).dropna()

weather = pd.read_csv('pittsburgh2013.csv', index_col='Date', parse_dates=True)
weather.loc['2013-07-01':'2013-07-07', 'PrecipitationIn'] * 2.54

// error
week1_range / week1_mean
// right way
week1_range.divide(week1_mean, axis='rows')

bronze = pd.read_csv('bronze_top5.csv', index_col=0)
silver = pd.read_csv('silver_top5.csv', index_col=0)

bronze + silver
bronze.add(silver)
bronze.add(silver, fill_value=0).add(gold, fill_value=0)

Concatenating data

Appending and concatenating Series
- append() stacks rows of s2 below s1
- Method for Series and DataFrames
- concar() can stack row-wise or column-wise
- pandas module function
- May need to reset the index after appending

result1 = pd.concat([s1, s2, s3])
result2 = s1.append(s2).append(s3)

new_east = northeast.append(south).reset_index(drop=True)
new_east = pd.concat([northeast, south], ignore_index=True)

pd.concat([population, unemployment], axis=1)

pd.concat([rain2013, rain2014], axis=0)
rain1314 = pd.concat([rain2013, rain2014], keys=[2013, 2014], axis=0)
print(rain1314.loc[2014])

rain1314 = pd.concat([rain2013, rain2014], axis='columns')
rain1314 = pd.concat([rain2013, rain2014], keys=[2013, 2014], axis='columns')

Concatenating rows with axis='rows'
- Keys and MultiIndexes
Concatenating columns with axis='columns'
- Using a multi-index on columns

A = np.arange(8).reshape(2,4) + 0.1
B = np.arange(6).reshape(2,3) + 0.2
C = np.arange(12).reshape(3,4) + 0.3

np.hstack([B, A])
np.concatenate([B, A], axis=1)

np.vstack([A, C])
np.concatenate([A, C], axis=0)

// errors
np.concatenate([A, B], axis=0) # incompatible columns
np.concatenate([A, C], axis=1) # incompatible rows

Outer join
- Union of index sets (all labels, no repetition)
- Missing fields filled with NaN
Inner join
- Intersection of index sets (only common labels)

pd.concat([population, unemployment], axis=1, join='inner')
pd.concat([population, unemployment], axis=1, join='outer')

Merging data

Techniques for merging with left joins, right joins, inner joins, and outer joins. Ordered merging is useful to merge DataFrames with columns that have natural orderings, like date-time columns.

Merding DataFrames
- Using suffixes
- Specifying columns to merge
Joining DataFrames
- Inner join, Left and right join
- Outer join
Which should you use?
- df1.append(df2): stacking vertically
- pd.concat([df1, df2]): stacking many horizontally or vertically, simple inner/outer joins on Indexes
- df1.join(df2): inner/outer/le!/right joins on Indexes
- pd.merge([df1, df2]): many joins on multiple columns
Ordered merges

pd.merge(population, cities)
pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes=['_bronze', '_gold'])
pd.merge(counties, cities, left_on='CITY NAME', right_on='City')

pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes=['_bronze', '_gold'], how='inner')
pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes=['_bronze', '_gold'], how='left')
pd.merge(bronze, gold, on=['NOC', 'Country'], suffixes=['_bronze', '_gold'], how='outer')

population.join(unemployment)
population.join(unemployment, how='right')
population.join(unemployment, how='inner')
population.join(unemployment, how='outer')

pd.merge(hardware, software, how='outer').sort_values('Date')
pd.merge_ordered(hardware, software)
pd.merge_ordered(hardware, software, on=['Date', 'Company'], suffixes=['_hardware', '_software'])
pd.merge_ordered(stocks, gdp, on='Date', fill_method='ffill')

Case Study - Summer Olympics

An in-depth case study using Olympic medal data

Reminder: loading & merging files
Constructing a pivot table
Reshaping and plo!ing
Graphical summary

misho-kr/Merging DataFrames with pandas.md

Merging DataFrames with pandas

Preparing data

Concatenating data

Merging data

Case Study - Summer Olympics