pandas DataFrames are the most widely used in-memory representation of complex data collections in Python. Whether you work in finance, a scientific field, or data science, familiarity with pandas is essential. This course teaches you to work with real-world datasets containing both string and numeric data, often structured around time series, and covers powerful analysis, selection, and visualization techniques.
Led by Team Anaconda, Data Science Consultant at Lander Analytics
Use pandas to import and inspect a variety of datasets, ranging from population data obtained from the World Bank to monthly stock data obtained via Yahoo Finance. Build DataFrames from scratch and become familiar with the intrinsic data visualization capabilities of pandas.
- Pandas DataFrames
- Indexes and Columns
- Slicing, head() and tail()
- Broadcasting -- assigning a scalar value to a column slice broadcasts the value to each row (see the sketch after the DataFrame-building code below)
- Pandas Series
- Building DataFrames
- CSV file
- Omitting the header, setting column names, na_values (sketch after the read_csv code below)
- Parse dates
- Python dictionary
- Broadcasting with a dictionary
- CSV file
import pandas as pd
users = pd.read_csv('datasets/users.csv', index_col=0)
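A brief sketch of those read_csv options; the file name, column names, and missing-value sentinel are assumptions for illustration, and the date parsing here uses pd.to_datetime after the read rather than a parse_dates argument:
col_names = ['year', 'month', 'day', 'sunspots', 'definite']
sunspots = pd.read_csv('datasets/sunspots.csv',
                       header=None,                      # the file has no header row
                       names=col_names,                  # supply column names manually
                       na_values={'sunspots': ['-1']})   # treat -1 in this column as missing
sunspots['year_month_day'] = pd.to_datetime(sunspots[['year', 'month', 'day']])  # combine into dates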
# Building a DataFrame from lists: list_labels holds column names, list_cols holds equal-length value lists
zipped = list(zip(list_labels, list_cols))   # pair each label with its column of values
data = dict(zipped)                          # {label: values} dictionary
users = pd.DataFrame(data)                   # construct the DataFrame from the dict
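A minimal sketch of broadcasting, reusing the users DataFrame built above; the 'fees' column and the people example are illustrative names only:
users['fees'] = 0                                    # scalar broadcasts down the whole new column
users.loc[:, 'fees'] = 7.5                           # scalar assigned to a column slice reaches every row
data = {'height': [59.0, 65.2, 62.9], 'sex': 'M'}    # the scalar 'M' broadcasts to match the list length
people = pd.DataFrame(data)                          # every row gets sex == 'M'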
- Inspecting with df.info()
- Using dates as index
- Trimming redundant columns
sunspots.info()                               # inspect column dtypes and non-null counts
sunspots.index = sunspots['year_month_day']   # use the parsed dates as the index
sunspots.index.name = 'date'
cols = ['sunspots', 'definite']               # keep only the columns of interest
sunspots = sunspots[cols]                     # trim redundant columns
sunspots.iloc[10:20, :]                       # positional slice of rows 10-19
- Writing files with df.to_csv() and df.to_excel()
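A short sketch of writing the trimmed sunspots DataFrame back out; the output file names are assumptions, and to_excel() needs an Excel engine such as openpyxl installed:
sunspots.to_csv('sunspots_clean.csv')              # comma-separated, index written by default
sunspots.to_csv('sunspots_clean.tsv', sep='\t')    # tab-separated variant
sunspots.to_excel('sunspots_clean.xlsx')           # Excel output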
- Plotting arrays, Series, and DataFrames
import pandas as pd
import matplotlib.pyplot as plt
aapl = pd.read_csv('aapl.csv', index_col='date', parse_dates=True)
plt.plot(aapl['close'].values)   # NumPy array: no date labels on the x-axis
plt.plot(aapl['close'])          # Series: the index supplies the x-axis
aapl['close'].plot()             # pandas plot of one Series, with dates formatted
plt.plot(aapl)                   # all columns plotted as arrays against row number
aapl.plot()                      # pandas plot of every column, with a legend
plt.savefig('aapl.png')          # save before show() clears the figure
plt.show()
- Customizing the plots -- colors, style, labels, legends, ticks, scales
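A small sketch of those customizations, reusing the aapl DataFrame from above; the particular color, style, and scale are arbitrary choices:
aapl['close'].plot(color='b', style='.-', legend=True)   # color, line style, legend
plt.xlabel('date')
plt.ylabel('closing price (US dollars)')
plt.yscale('log')                                         # logarithmic y-axis
plt.xticks(rotation=60)                                   # slanted tick labels
plt.show()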
Explore data visually and quantitatively. Exploratory data analysis (EDA) is a crucial component of any data science project. pandas has powerful methods that help with statistical and visual EDA.
- Visual
- Plots: scatter, line, box, histogram (scatter and box examples follow the histogram code below)
- Histogram options: bins, range, normalized-to-one, cumulative
- CDF: Cumulative Distribution Function
iris.plot(y='sepal_length', kind='hist', bins=30,
          range=(4, 8), cumulative=True, density=True)   # density replaces the deprecated normed argument
plt.xlabel('sepal length (cm)')
plt.title('Cumulative distribution function (CDF)')
plt.show()
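The other plot kinds listed above follow the same pattern; a brief sketch, assuming the same iris DataFrame with the standard iris column names:
iris.plot(kind='scatter', x='sepal_length', y='sepal_width')   # scatter of two columns
plt.show()
iris.plot(y='petal_length', kind='box')                        # box plot of one column
plt.show()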
- Statistical:
- describe, count, mean, standard deviation
- ranges, inter-quartile range
- percentiles: 25, 50, 75
- unique
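A brief sketch of those statistical methods on the iris data (column names as used in the code below):
iris.describe()                                    # count, mean, std, min, quartiles, max per numeric column
iris['sepal_length'].count()                       # number of non-null entries
iris['sepal_length'].mean()
iris['sepal_length'].std()
iris['sepal_length'].quantile([0.25, 0.5, 0.75])   # 25th/50th/75th percentiles
iris['sepal_length'].max() - iris['sepal_length'].min()   # range
iris['species'].unique()                           # distinct species labels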
indices = iris['species'] == 'setosa'     # Boolean mask for one species
setosa = iris.loc[indices, :]             # extract new DataFrame
- Computing errors
import numpy as np
describe_all = iris.describe()
describe_setosa = setosa.describe()       # summary statistics of the filtered subset
error_setosa = 100 * np.abs(describe_setosa - describe_all)
error_setosa = error_setosa / describe_setosa
print(error_setosa)
Manipulate and visualize time series data using pandas. You will learn about upsampling, downsampling, and interpolation, and use method chaining to efficiently filter data and perform time series analyses. From stock prices to flight timings, time series data is found in a wide variety of domains.
- Using pandas to read datetime objects: specify parse_dates=True
- Partial datetime string selection
- Convert strings to datetime with pd.to_datetime()
- Reindex and fill missing values
sales.loc['2015-02-19 11:00:00', 'Company']
sales.loc['February 5, 2015']            # whole day
sales.loc['2015-Feb-5']                  # whole day, alternate format
sales.loc['2015-2']                      # whole month
sales.loc['2015']                        # whole year
sales.loc['2015-2-16':'2015-2-20']       # date range
evening_2_11 = pd.to_datetime(['2015-2-11 20:00', '2015-2-11 21:00',
                               '2015-2-11 22:00', '2015-2-11 23:00'])
sales.reindex(evening_2_11, method='ffill')   # reindex and forward-fill missing values
- Resampling time series data
- Statistical methods over different time intervals
- Method chaining: mean(), sum(), count(), etc.
- Downsampling to reduce datetime rows to slower frequency
- Upsampling to increase datetime rows to faster frequency
daily_mean = sales.resample('D').mean()          # downsample to daily means
sales.loc[:, 'Units'].resample('2W').sum()       # downsample: total units per two-week period
two_days.resample('4H').ffill()                  # upsample a two-day slice (two_days) to 4-hour bins, forward-filling
- String methods
- Datetime methods
- Set and convert timezone
- Interpolate missing data with interpolate() (sketch after the code below)
sales['Company'].str.upper()                           # vectorized string method
sales['Product'].str.contains('ware')                  # Boolean Series: substring match
sales['Date'].dt.hour                                  # extract the hour from each datetime
central = sales['Date'].dt.tz_localize('US/Central')   # attach a timezone
central.dt.tz_convert('US/Eastern')                    # convert to another timezone
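A minimal sketch of interpolate() on a Series with missing hourly values; the data here is made up for illustration:
import pandas as pd
import numpy as np
ts = pd.Series([10.0, np.nan, np.nan, 16.0],
               index=pd.date_range('2015-02-11 20:00', periods=4, freq='h'))
ts.interpolate(method='linear')   # fills the gaps linearly: 12.0 and 14.0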
- Time series visualization
sp500 = pd.read_csv('sp500.csv', parse_dates=True, index_col='Date')
sp500.loc['2012-4-1':'2012-4-7', 'Close'].plot(title='S&P 500')
plt.ylabel('Closing Price (US Dollars)')
plt.show()
sp500['Close'].plot(kind='area', title='S&P 500')
plt.ylabel('Closing Price (US Dollars)')
plt.show()
sp500.loc['2012', ['Close','Volume']].plot(subplots=True)
plt.show()
Working with real-world weather and climate data, you will use pandas to manipulate the data into a usable form for analysis and systematically explore it using the techniques you’ve learned.
- Climate normals of Austin, TX from 1981-2010
- Weather data of Austin, TX from 2011
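As a possible first step for the case study, a hedged sketch; the file name and column names below are guesses for illustration, not necessarily the course's actual files:
import pandas as pd
import matplotlib.pyplot as plt
df_climate = pd.read_csv('weather_data_austin_2010.csv',
                         parse_dates=['Date'], index_col='Date')   # assumed file and columns
daily_mean = df_climate['Temperature'].resample('D').mean()        # downsample readings to daily means
daily_mean.plot(title='Daily mean temperature, Austin TX')
plt.show()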