@misho-kr
Last active September 17, 2020 09:39
Summary of the "pandas Foundations" course on DataCamp (https://gist.github.com/misho-kr/873ddcc2fc89f1c96414de9e0a58e0fe)

pandas DataFrames are the most widely used in-memory representation of complex data collections in Python. Whether you work in finance, a scientific field, or data science, familiarity with pandas is essential. This course teaches you to work with real-world datasets containing both string and numeric data, often structured around time series, and covers powerful analysis, selection, and visualization techniques.

Led by Team Anaconda, Data Science Consultant at Lander Analytics

Data ingestion & inspection

Use pandas to import and inspect a variety of datasets, ranging from population data obtained from the World Bank to monthly stock data obtained via Yahoo Finance. Build DataFrames from scratch and become familiar with the intrinsic data visualization capabilities of pandas.

  • Pandas DataFrames
    • Indexes and Columns
    • Slicing, head() and tail()
    • Broadcasting -- assigning scalar value to column slice broadcasts value to each row
  • Pandas Series
  • Building DataFrames
    • CSV file
      • Omitting header, setting column names, na_value
      • Parse dates
    • Python dictionary
      • Broadcasting with a dictionary
import pandas as pd
users = pd.read_csv('datasets/users.csv', index_col=0)

# Pair column labels with column data (list_labels and list_cols
# are parallel lists defined elsewhere), then build a DataFrame
zipped = list(zip(list_labels, list_cols))
data = dict(zipped)
users = pd.DataFrame(data)
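The broadcasting bullets above can be sketched with made-up data: a scalar paired with list-valued columns in a dictionary is repeated for every row, and so is a scalar assigned to a column.

```python
import pandas as pd

# Hypothetical data; the scalar 'US' is broadcast to every row
# alongside the two list-valued columns
data = {'city': ['Austin', 'Dallas', 'Houston'],
        'visitors': [139, 237, 456],
        'country': 'US'}
users = pd.DataFrame(data)

# Assigning a scalar to a column broadcasts it to each row as well
users['weekday'] = 'Sun'
print(users)
```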
  • Inspecting with df.info()
  • Using dates as index
  • Trimming redundant columns
# Use the parsed date column as the index, then keep only the needed columns
sunspots.index = sunspots['year_month_day']
sunspots.index.name = 'date'

cols = ['sunspots', 'definite']
sunspots = sunspots[cols]
sunspots.iloc[10:20, :]
  • Writing files with df.to_csv() and df.to_excel()
  • Plotting arrays, series and data frames
import pandas as pd
import matplotlib.pyplot as plt
aapl = pd.read_csv('aapl.csv', index_col='date', parse_dates=True)

plt.plot(aapl['close'].values)  # NumPy array: loses the date axis
plt.plot(aapl['close'])         # Series: dates appear on the x-axis
aapl['close'].plot()            # pandas plot of a single column
plt.plot(aapl)                  # all columns as plain arrays
aapl.plot()                     # pandas plot of every column

plt.savefig('aapl.png')
plt.show()
  • Customizing the plots -- colors, style, labels, legends, ticks, scales
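As a sketch of the customization options listed above, using synthetic prices rather than the course's aapl.csv:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic daily closing prices (stand-in for the course's aapl.csv)
dates = pd.date_range('2020-01-01', periods=100, freq='D')
close = pd.Series(range(1, 101), index=dates, name='close')

close.plot(color='b', style='--', legend=True)  # blue dashed line with legend
plt.yscale('log')                               # logarithmic y-axis
plt.xlabel('date')
plt.ylabel('closing price')
plt.title('Customized pandas plot')
plt.savefig('custom_plot.png')
plt.show()
```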

Exploratory data analysis

Explore data visually and quantitatively. Exploratory data analysis (EDA) is a crucial component of any data science project. pandas has powerful methods that help with statistical and visual EDA.

  • Visual
    • Plots: scatter, line, box, histogram
    • Histogram options: bins, range, normalized-to-one, cumulative
    • CDF: Cumulative Distribution Function
iris.plot(y='sepal_length', kind='hist', bins=30,
          range=(4,8), cumulative=True, density=True)  # density=True replaces the removed normed=True
plt.xlabel('sepal length (cm)')
plt.title('Cumulative distribution function (CDF)')
plt.show()
  • Statistical:
    • describe, count, average, standard deviation
    • ranges, inter-quartile range
    • percentiles: 25, 50, 75
    • unique
indices = iris['species'] == 'setosa'
setosa = iris.loc[indices,:] # extract new DataFrame
  • Computing errors
import numpy as np

describe_setosa = setosa.describe()  # summary stats for the setosa subset
describe_all = iris.describe()

# Percentage error of the setosa statistics relative to the full dataset
error_setosa = 100 * np.abs(describe_setosa - describe_all)
error_setosa = error_setosa / describe_setosa

print(error_setosa)
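The statistical bullets above can be tried on a small made-up table (the course uses the iris dataset, which isn't reproduced here):

```python
import pandas as pd

# Tiny stand-in for the iris data used in the course
df = pd.DataFrame({'sepal_length': [4.9, 5.0, 5.4, 6.1, 6.7, 7.2],
                   'species': ['setosa'] * 3 + ['virginica'] * 3})

print(df.describe())                                   # count, mean, std, quartiles
print(df['sepal_length'].quantile([0.25, 0.5, 0.75]))  # explicit percentiles
print(df['species'].unique())                          # distinct labels
print(df['sepal_length'].max() - df['sepal_length'].min())  # range
```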

Time series in pandas

Manipulate and visualize time series data using pandas, including upsampling, downsampling, and interpolation. Use method chaining to efficiently filter data and perform time series analyses. From stock prices to flight timings, time series data appears in a wide variety of domains.

  • To have pandas parse datetime columns on read, specify parse_dates=True
  • Partial datetime string selection
  • Convert strings to datetime with pd.to_datetime()
  • Reindex and fill missing values
sales.loc['2015-02-19 11:00:00', 'Company']
sales.loc['February 5, 2015']
sales.loc['2015-Feb-5']   # Whole day
sales.loc['2015-2']       # Whole month
sales.loc['2015']         # Whole year
sales.loc['2015-2-16':'2015-2-20']

evening_2_11 = pd.to_datetime(['2015-2-11 20:00', '2015-2-11 21:00',
                               '2015-2-11 22:00', '2015-2-11 23:00'])
sales.reindex(evening_2_11, method='ffill')
  • Resampling time series data
    • Statistical methods over different time intervals
    • Method chaining: mean(), sum(), count(), etc.
    • Downsampling to reduce datetime rows to slower frequency
    • Upsampling to increase datetime rows to faster frequency
daily_mean = sales.resample('D').mean()
sales.loc[:, 'Units'].resample('2W').sum()
two_days.resample('4H').ffill()
  • String methods
  • Datetime methods
  • Set and convert timezone
  • Interpolate missing data with interpolate()
sales['Company'].str.upper()           # vectorized string method
sales['Product'].str.contains('ware')  # boolean mask from substring match

sales['Date'].dt.hour                                  # extract the hour
central = sales['Date'].dt.tz_localize('US/Central')   # attach a timezone
central.dt.tz_convert('US/Eastern')                    # convert between zones
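The interpolate() bullet above has no example in the notes; here is a minimal sketch with a made-up hourly series:

```python
import pandas as pd
import numpy as np

# Hourly readings with a two-hour gap
ts = pd.Series([10.0, np.nan, np.nan, 16.0],
               index=pd.date_range('2015-02-11 20:00', periods=4, freq='h'))

filled = ts.interpolate(method='linear')  # fill NaNs linearly: 12.0, 14.0
print(filled)
```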
  • Time series visualization
sp500 = pd.read_csv('sp500.csv', parse_dates=True, index_col='Date')
sp500.loc['2012-4-1':'2012-4-7', 'Close'].plot(title='S&P 500')
plt.ylabel('Closing Price (US Dollars)')
plt.show()

sp500['Close'].plot(kind='area', title='S&P 500')
plt.ylabel('Closing Price (US Dollars)')
plt.show()

sp500.loc['2012', ['Close','Volume']].plot(subplots=True)
plt.show()

Case Study - Sunlight in Austin

Working with real-world weather and climate data, you will use pandas to manipulate the data into a usable form for analysis and systematically explore it using the techniques you’ve learned.

  1. Climate normals of Austin, TX from 1981-2010
  2. Weather data of Austin, TX from 2011
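A sketch of the kind of cleaning and resampling the case study calls for -- the column name, timestamps, and the 'M' missing-value marker are assumptions here, not the course's actual data:

```python
import pandas as pd

# Made-up raw 2011 readings; the real data comes from a weather file for Austin, TX
raw = pd.DataFrame({
    'date': ['2011-01-01 00:53', '2011-01-01 01:53', '2011-01-01 02:53'],
    'dry_bulb_faren': ['51', '51', 'M'],   # 'M' marks a missing reading
})

df = raw.set_index(pd.to_datetime(raw['date'])).drop(columns='date')
# Coerce the measurement column to numeric; 'M' becomes NaN
df['dry_bulb_faren'] = pd.to_numeric(df['dry_bulb_faren'], errors='coerce')

daily_mean = df['dry_bulb_faren'].resample('D').mean()  # downsample to daily
print(daily_mean)
```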