@misho-kr
Last active April 5, 2023 13:46
Summary of "Data Manipulation with pandas" course on Datacamp

pandas is the world's most popular Python library, used for everything from data manipulation to data analysis. Learn how to manipulate DataFrames, as you extract, filter, and transform real-world datasets for analysis. Using real-world data, including Walmart sales figures and global temperature time series, you’ll learn how to import, clean, calculate statistics, and create visualizations—using pandas!

Led by Maggie Matsui, Data Scientist at DataCamp

Transforming Data

Inspect DataFrames and perform fundamental manipulations, including sorting rows, subsetting, and adding new columns

  • Exploring DataFrames with .head(), .tail(), .info(), .describe() and .shape
  • Viewing components with .values, .columns and .index
  • There should be one -- and preferably only one -- obvious way to do it
  • Sorting, subsetting columns and rows, adding new columns
> dogs.sort_values("weight_kg")
> dogs.sort_values(["weight_kg", "height_cm"], ascending=[True, False])
> dogs[["breed", "height_cm"]]
> dogs[dogs["height_cm"] > 50]
> dogs["color"].isin(["Black", "Brown"])
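The bullets above mention adding new columns, but the snippets stop at subsetting. A minimal sketch of the whole chapter follows, assuming a small made-up `dogs` table; the rows and the `height_m` and `bmi` column names are illustrative and not taken from the course data.

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the snippets above.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer"],
    "color": ["Brown", "Black", "Brown", "Gray"],
    "height_cm": [56, 43, 46, 49],
    "weight_kg": [24, 24, 24, 17],
})

# Quick inspection: dimensions and column labels.
shape = dogs.shape
columns = list(dogs.columns)

# Sort by weight ascending, breaking ties by height descending.
by_size = dogs.sort_values(["weight_kg", "height_cm"], ascending=[True, False])

# isin() returns a boolean mask; index the DataFrame with it to keep matching rows.
dark_dogs = dogs[dogs["color"].isin(["Black", "Brown"])]

# Combine conditions with & and wrap each one in parentheses.
tall_brown = dogs[(dogs["height_cm"] > 50) & (dogs["color"] == "Brown")]

# Add new columns computed from existing ones.
dogs["height_m"] = dogs["height_cm"] / 100
dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2
```

Note that `isin()` on its own only produces the mask; the rows are selected when the mask is passed back into `dogs[...]`.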

Aggregating Data

Calculate summary statistics on DataFrame columns, and master grouped summary statistics and pivot tables

  • Summarizing with:
    • mean(), median(), mode(), min(), max(), sum(), var(), std(), quantile()
    • cumsum(), cummin(), cummax(), cumprod()
  • Counting
    • drop_duplicates(), value_counts()
  • Grouped summary statistics with groupby()
  • Pivot tables
    • They are just DataFrames with sorted indexes
    • Filling missing values
    • Summing
> dogs["date_of_birth"].min()
> dogs[["weight_kg", "height_cm"]].agg(np.min)
>
> vet_visits.drop_duplicates(subset="name")
> vet_visits.drop_duplicates(subset=["name", "breed"])["breed"].value_counts(sort=True, normalize=True)
> 
> dogs.groupby(["color", "breed"])["weight_kg"].agg([min, max, sum])
> dogs.pivot_table(values="weight_kg", index="color", aggfunc=[np.mean, np.median])
> 
> dogs.groupby(["color", "breed"])["weight_kg"].mean()
> dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0, margins=True)
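To make the relationship between `groupby()` and `pivot_table()` concrete, here is a small sketch on made-up data (the rows are assumptions, not the course dataset). It also shows that `fill_value` and `margins` are `pivot_table()` arguments rather than `groupby()` arguments.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
    "breed": ["Labrador", "Poodle", "Labrador", "Poodle", "Labrador"],
    "color": ["Brown", "Black", "Brown", "Gray", "Black"],
    "weight_kg": [24, 20, 26, 18, 30],
})

# Plain and cumulative summaries of a single column.
mean_weight = dogs["weight_kg"].mean()
running_total = dogs["weight_kg"].cumsum()

# Several statistics at once; string names avoid passing function objects.
weight_stats = dogs["weight_kg"].agg(["min", "max", "sum"])

# Counting: share of dogs per breed.
breed_share = dogs["breed"].value_counts(sort=True, normalize=True)

# Grouped mean: the result is a Series with a (color, breed) MultiIndex.
grouped = dogs.groupby(["color", "breed"])["weight_kg"].mean()

# The same means as a pivot table: colors as rows, breeds as columns.
# fill_value replaces empty combinations; margins adds an "All" row and column.
pivoted = dogs.pivot_table(
    values="weight_kg", index="color", columns="breed",
    fill_value=0, margins=True,
)
```

The grouped result and the pivot table hold the same means; the pivot table just spreads the second grouping key across columns.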

Slicing and indexing

Indexes are supercharged row and column names. Learn how they can be combined with slicing for powerful DataFrame subsetting.

  • Explicit Indexes: .columns and .index
    • Setting a column as index
    • Removing, dropping and sorting an index
  • Multi-level indexes a.k.a. hierarchical indexes
  • Indexes make subsetting simpler, but they come with trade-offs
    • Index values are just data
    • Indexes violate "tidy data" principles
    • You need to learn two syntaxes
  • Slicing and subsetting with .loc and .iloc
    • Sort the index before you slice
    • Slicing columns and slicing twice
    • Slicing by dates
    • Subsetting by row/column number
  • Working with pivot tables
    • They are just DataFrames with sorted indexes
    • Yet they are a special case, since every column contains the same data type
    • The axis argument
    • Calculating summary stats across columns
> dogs_ind = dogs.set_index("name")
> dogs_ind.reset_index()
> dogs_ind.reset_index(drop=True)
> dogs_ind3 = dogs.set_index(["breed", "color"])
> dogs_ind3.loc[["Labrador", "Chihuahua"]]
> dogs_ind3.loc[[("Labrador", "Brown"), ("Chihuahua", "Tan")]]
> dogs_ind3.sort_index(level=["color", "breed"], ascending=[True, False])
> 
> dogs_srt = dogs.set_index(["breed", "color"]).sort_index()
> dogs_srt.loc["Chow Chow":"Poodle"]
> dogs_srt.loc[("Labrador", "Brown"):("Schnauzer", "Grey")]
> dogs_srt.loc[:, "name":"height_cm"]
> dogs_srt.loc[("Labrador", "Brown"):("Schnauzer", "Grey"), "name":"height_cm"]
> dogs.set_index("date_of_birth").sort_index().loc["2014-08-25":"2016-09-16"]
> print(dogs.iloc[2:5, 1:4])
> 
> dogs_height_by_breed_vs_color = dog_pack.pivot_table("height_cm", index="breed", columns="color")
> dogs_height_by_breed_vs_color.loc["Chow Chow":"Poodle"]
> dogs_height_by_breed_vs_color.mean(axis="index")
> dogs_height_by_breed_vs_color.mean(axis="columns")
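The slicing rules above are easier to check on a table small enough to read. The following is a rough sketch on invented rows (names and dates are assumptions); it contrasts label slicing with `.loc`, which includes both endpoints, with positional slicing with `.iloc`, which excludes the end.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer", "Labrador"],
    "color": ["Brown", "Black", "Brown", "Gray", "Black"],
    "height_cm": [56, 43, 46, 49, 59],
    "date_of_birth": pd.to_datetime(
        ["2013-07-01", "2016-09-16", "2014-08-25", "2011-12-11", "2017-01-20"]
    ),
})

# Multi-level index, sorted so that label slicing is well defined.
dogs_srt = dogs.set_index(["breed", "color"]).sort_index()

# Slice on the outer level; both endpoints are included.
chow_to_poodle = dogs_srt.loc["Chow Chow":"Poodle"]

# Slice rows by tuples and columns by label in one call.
subset = dogs_srt.loc[("Labrador", "Black"):("Poodle", "Black"), "name":"height_cm"]

# Slicing by dates needs the dates in a sorted index.
by_date = dogs.set_index("date_of_birth").sort_index()
born_2014_to_2016 = by_date.loc["2014-08-25":"2016-09-16"]

# Positional slicing with .iloc excludes the end position.
block = dogs.iloc[2:5, 1:4]

# Pivot tables are DataFrames too: summarize down the rows or across the columns.
heights = dogs.pivot_table("height_cm", index="breed", columns="color")
mean_per_color = heights.mean(axis="index")
mean_per_breed = heights.mean(axis="columns")
```

Missing breed and color combinations appear as NaN in `heights` and are skipped by `mean()` by default.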

Creating and Visualizing DataFrames

Visualize the contents of your DataFrames, handle missing data values, and import data from and export data to CSV files

  • Plots
    • Histograms, Bar plots, Line plots, Scatter plots
    • Layering plots, legend, grid, transparency ...
  • Missing values
    • Detecting, counting, removing, replacing
  • Creating DataFrames
    • From a list of dictionaries
    • From a dictionary of lists
  • Reading and writing CSVs
> dog_pack["height_cm"].hist(bins=20, alpha=0.7)
> avg_weight_by_breed = dog_pack.groupby("breed")["weight_kg"].mean()
> avg_weight_by_breed.plot(kind="bar")
> sully.plot(x="date", y="weight_kg", kind="line", rot=45)
> dog_pack.plot(x="height_cm", y="weight_kg", kind="scatter")
> 
> dogs.isna().any()
> dogs.isna().sum()
> dogs.isna().sum().plot(kind="bar")
> dogs.dropna()
> dogs.fillna(0)
>
> new_dogs = pd.read_csv("new_dogs.csv")
> new_dogs.to_csv("new_dogs_with_bmi.csv")
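The two ways of creating a DataFrame listed above have no snippet of their own, so here is a minimal sketch of both, followed by missing-value handling and a CSV round trip. The rows are invented, and an in-memory buffer stands in for a file on disk so the example is self-contained.

```python
import io

import pandas as pd

# From a list of dictionaries: one dictionary per row.
rows = [
    {"name": "Ginger", "breed": "Dachshund", "weight_kg": 10.0},
    {"name": "Scout", "breed": "Dalmatian", "weight_kg": None},
]
from_rows = pd.DataFrame(rows)

# From a dictionary of lists: one list per column.
cols = {
    "name": ["Ginger", "Scout"],
    "breed": ["Dachshund", "Dalmatian"],
    "weight_kg": [10.0, None],
}
from_cols = pd.DataFrame(cols)

# Detect and count missing values per column.
missing_per_column = from_rows.isna().sum()

# Either drop incomplete rows or replace the gaps.
complete_rows = from_rows.dropna()
filled = from_rows.fillna(0)

# CSV round trip; index=False keeps the row index out of the output.
buffer = io.StringIO()
filled.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
```

Without `index=False`, `to_csv()` writes the row index as an extra unnamed column, which then shows up as `Unnamed: 0` when the file is read back.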