pandas DataFrames are the most widely used in-memory representation of complex data collections in Python. Whether you work in finance, a scientific field, or data science, familiarity with pandas is essential. This course teaches you to work with real-world datasets containing both string and numeric data, often structured around time series, and covers powerful analysis, selection, and visualization techniques.
Led by Team Anaconda, Data Science Consultant at Lander Analytics
Use pandas to import and inspect a variety of datasets, ranging from population data obtained from the World Bank to monthly stock data obtained via Yahoo Finance. Build DataFrames from scratch and become familiar with the intrinsic data visualization capabilities of pandas.
- Pandas DataFrames
- Indexes and Columns
- Slicing, head() and tail()
- Broadcasting -- assigning a scalar value to a column slice broadcasts the value to each row (see the sketch after the DataFrame-building code below)
- Pandas Series
- Building DataFrames
- CSV file
- Omitting the header, setting column names, na_values (sketch after the read_csv code below)
- Parse dates
- Python dictionary
- Broadcasting with a dictionary
- CSV file
import pandas as pd
users = pd.read_csv('datasets/users.csv', index_col=0)
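A brief sketch of those read_csv options; the file name, column names, and missing-value sentinel are assumptions for illustration, and the date parsing here uses pd.to_datetime after the read rather than a parse_dates argument:
col_names = ['year', 'month', 'day', 'sunspots', 'definite']
sunspots = pd.read_csv('datasets/sunspots.csv',
                       header=None,                      # the file has no header row
                       names=col_names,                  # supply column names manually
                       na_values={'sunspots': ['-1']})   # treat -1 in this column as missing
sunspots['year_month_day'] = pd.to_datetime(sunspots[['year', 'month', 'day']])  # combine into dates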
# Building a DataFrame from lists: list_labels holds column names, list_cols holds equal-length value lists
zipped = list(zip(list_labels, list_cols))   # pair each label with its column of values
data = dict(zipped)                          # {label: values} dictionary
users = pd.DataFrame(data)                   # construct the DataFrame from the dict
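A minimal sketch of broadcasting, reusing the users DataFrame built above; the 'fees' column and the people example are illustrative names only:
users['fees'] = 0                                    # scalar broadcasts down the whole new column
users.loc[:, 'fees'] = 7.5                           # scalar assigned to a column slice reaches every row
data = {'height': [59.0, 65.2, 62.9], 'sex': 'M'}    # the scalar 'M' broadcasts to match the list length
people = pd.DataFrame(data)                          # every row gets sex == 'M'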
- Inspecting with df.info()
- Using dates as index
- Trimming redundant columns
sunspots.info()                               # inspect column dtypes and non-null counts
sunspots.index = sunspots['year_month_day']   # use the parsed dates as the index
sunspots.index.name = 'date'
cols = ['sunspots', 'definite']               # keep only the columns of interest
sunspots = sunspots[cols]                     # trim redundant columns
sunspots.iloc[10:20, :]                       # positional slice of rows 10-19
- Writing files with df.to_csv() and df.to_excel()
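A short sketch of writing the trimmed sunspots DataFrame back out; the output file names are assumptions, and to_excel() needs an Excel engine such as openpyxl installed:
sunspots.to_csv('sunspots_clean.csv')              # comma-separated, index written by default
sunspots.to_csv('sunspots_clean.tsv', sep='\t')    # tab-separated variant
sunspots.to_excel('sunspots_clean.xlsx')           # Excel output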
- Plotting arrays, Series, and DataFrames
import pandas as pd
import matplotlib.pyplot as plt
aapl = pd.read_csv('aapl.csv', index_col='date', parse_dates=True)
plt.plot(aapl['close'].values)   # NumPy array: no date labels on the x-axis
plt.plot(aapl['close'])          # Series: the index supplies the x-axis
aapl['close'].plot()             # pandas plot of one Series, with dates formatted
plt.plot(aapl)                   # all columns plotted as arrays against row number
aapl.plot()                      # pandas plot of every column, with a legend
plt.savefig('aapl.png')          # save before show() clears the figure
plt.show()
- Customizing the plots -- colors, style, labels, legends, ticks, scales
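A small sketch of those customizations, reusing the aapl DataFrame from above; the particular color, style, and scale are arbitrary choices:
aapl['close'].plot(color='b', style='.-', legend=True)   # color, line style, legend
plt.xlabel('date')
plt.ylabel('closing price (US dollars)')
plt.yscale('log')                                         # logarithmic y-axis
plt.xticks(rotation=60)                                   # slanted tick labels
plt.show()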
Explore data visually and quantitatively. Exploratory data analysis (EDA) is a crucial component of any data science project. pandas has powerful methods that help with statistical and visual EDA.
- Visual
- Plots: scatter, line, box, histogram (scatter and box examples follow the histogram code below)
- Histogram options: bins, range, normalized-to-one, cumulative
- CDF: Cumulative Distribution Function
iris.plot(y='sepal_length', kind='hist', bins=30,
          range=(4, 8), cumulative=True, density=True)   # density replaces the deprecated normed argument
plt.xlabel('sepal length (cm)')
plt.title('Cumulative distribution function (CDF)')
plt.show()
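The other plot kinds listed above follow the same pattern; a brief sketch, assuming the same iris DataFrame with the standard iris column names:
iris.plot(kind='scatter', x='sepal_length', y='sepal_width')   # scatter of two columns
plt.show()
iris.plot(y='petal_length', kind='box')                        # box plot of one column
plt.show()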
- Statistical:
- describe, count, mean, standard deviation
- ranges, inter-quartile range
- percentiles: 25, 50, 75
- unique
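A brief sketch of those statistical methods on the iris data (column names as used in the code below):
iris.describe()                                    # count, mean, std, min, quartiles, max per numeric column
iris['sepal_length'].count()                       # number of non-null entries
iris['sepal_length'].mean()
iris['sepal_length'].std()
iris['sepal_length'].quantile([0.25, 0.5, 0.75])   # 25th/50th/75th percentiles
iris['sepal_length'].max() - iris['sepal_length'].min()   # range
iris['species'].unique()                           # distinct species labels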
indices = iris['species'] == 'setosa'     # Boolean mask for one species
setosa = iris.loc[indices, :]             # extract new DataFrame
- Computing errors
import numpy as np
describe_all = iris.describe()
describe_setosa = setosa.describe()       # summary statistics of the filtered subset
error_setosa = 100 * np.abs(describe_setosa - describe_all)
error_setosa = error_setosa / describe_setosa
print(error_setosa)
Manipulate and visualize time series data using pandas. You will learn about upsampling, downsampling, and interpolation, and use method chaining to efficiently filter data and perform time series analyses. From stock prices to flight timings, time series data is found in a wide variety of domains.
- Using pandas to read datetime objects: specify parse_dates=True
- Partial datetime string selection
- Convert strings to datetime with pd.to_datetime()
- Reindex and fill missing values
sales.loc['2015-02-19 11:00:00', 'Company']
sales.loc['February 5, 2015']            # whole day
sales.loc['2015-Feb-5']                  # whole day, alternate format
sales.loc['2015-2']                      # whole month
sales.loc['2015']                        # whole year
sales.loc['2015-2-16':'2015-2-20']       # date range
evening_2_11 = pd.to_datetime(['2015-2-11 20:00', '2015-2-11 21:00',
                               '2015-2-11 22:00', '2015-2-11 23:00'])
sales.reindex(evening_2_11, method='ffill')   # reindex and forward-fill missing values
- Resampling time series data
- Statistical methods over different time intervals
- Method chaining: mean(), sum(), count(), etc.
- Downsampling to reduce datetime rows to slower frequency
- Upsampling to increase datetime rows to faster frequency
daily_mean = sales.resample('D').mean()          # downsample to daily means
sales.loc[:, 'Units'].resample('2W').sum()       # downsample: total units per two-week period
two_days.resample('4H').ffill()                  # upsample a two-day slice (two_days) to 4-hour bins, forward-filling
- String methods
- Datetime methods
- Set and convert timezone
- Interpolate missing data with interpolate() (sketch after the code below)
sales['Company'].str.upper()                           # vectorized string method
sales['Product'].str.contains('ware')                  # Boolean Series: substring match
sales['Date'].dt.hour                                  # extract the hour from each datetime
central = sales['Date'].dt.tz_localize('US/Central')   # attach a timezone
central.dt.tz_convert('US/Eastern')                    # convert to another timezone
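A minimal sketch of interpolate() on a Series with missing hourly values; the data here is made up for illustration:
import pandas as pd
import numpy as np
ts = pd.Series([10.0, np.nan, np.nan, 16.0],
               index=pd.date_range('2015-02-11 20:00', periods=4, freq='h'))
ts.interpolate(method='linear')   # fills the gaps linearly: 12.0 and 14.0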
- Time series visualization
sp500 = pd.read_csv('sp500.csv', parse_dates=True, index_col='Date')
sp500.loc['2012-4-1':'2012-4-7', 'Close'].plot(title='S&P 500')
plt.ylabel('Closing Price (US Dollars)')
plt.show()
sp500['Close'].plot(kind='area', title='S&P 500')
plt.ylabel('Closing Price (US Dollars)')
plt.show()
sp500.loc['2012', ['Close','Volume']].plot(subplots=True)
plt.show()
Working with real-world weather and climate data, you will use pandas to manipulate the data into a usable form for analysis and systematically explore it using the techniques you’ve learned.
- Climate normals of Austin, TX from 1981-2010
- Weather data of Austin, TX from 2011
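As a possible first step for the case study, a hedged sketch; the file name and column names below are guesses for illustration, not necessarily the course's actual files:
import pandas as pd
import matplotlib.pyplot as plt
df_climate = pd.read_csv('weather_data_austin_2010.csv',
                         parse_dates=['Date'], index_col='Date')   # assumed file and columns
daily_mean = df_climate['Temperature'].resample('D').mean()        # downsample readings to daily means
daily_mean.plot(title='Daily mean temperature, Austin TX')
plt.show()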