Data Manipulation with pandas

pandas is one of the most widely used Python libraries for data manipulation and analysis. Learn how to manipulate DataFrames as you extract, filter, and transform real-world datasets. Using real-world data, including Walmart sales figures and global temperature time series, you’ll learn how to import, clean, calculate statistics, and create visualizations with pandas.

Led by Maggie Matsui, Data Scientist at DataCamp

Transforming Data

Inspect DataFrames and perform fundamental manipulations, including sorting rows, subsetting, and adding new columns

Exploring DataFrames with .head(), .tail(), .info(), .describe() and .shape
Viewing components with .values, .columns and .index
There should be one -- and preferably only one -- obvious way to do it
Sorting, subsetting columns and rows, adding new columns

> dogs.sort_values("weight_kg")
> dogs.sort_values(["weight_kg", "height_cm"], ascending=[True, False])
> dogs[["breed", "height_cm"]]
> dogs[dogs["height_cm"] > 50]
> dogs[dogs["color"].isin(["Black", "Brown"])]
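
The outline mentions adding new columns, but none of the snippets above show it. The following is a minimal sketch that assumes a small hypothetical `dogs` table (the rows and the `height_m` and `bmi` columns are made up for illustration) and chains the steps together: derive columns, filter on two conditions, sort, and select.

```python
import pandas as pd

# Hypothetical sample data; column names mirror the snippets above.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer"],
    "color": ["Brown", "Black", "Brown", "Grey"],
    "height_cm": [56, 43, 46, 49],
    "weight_kg": [25, 23, 22, 17],
})

# Quick inspection: dimensions of the table before any changes.
n_rows, n_cols = dogs.shape

# Add new columns derived from existing ones.
dogs["height_m"] = dogs["height_cm"] / 100
dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2

# Combine conditions with & (and) or | (or); wrap each comparison in parentheses.
small_dark = dogs[(dogs["height_cm"] < 50) & dogs["color"].isin(["Black", "Brown"])]

# Chain the steps: sort the filtered rows, then keep only selected columns.
result = small_dark.sort_values("bmi", ascending=False)[["name", "bmi"]]
print(result)
```

Note that `isin()` on its own returns a boolean Series; it only subsets rows once it is passed back into `dogs[...]`.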

Aggregating Data

Calculate summary statistics on DataFrame columns, and master grouped summary statistics and pivot tables

Summarizing with:
- mean(), median(), mode(), min(), max(), sum(), var(), std(), quantile()
- cumsum(), cummin(), cummax(), cumprod()
Counting
- drop_duplicates(), value_counts()
Grouped summary statistics with groupby()
Pivot Tables
- They are just DataFrames with sorted indexes
- Filling missing values
- Summing

> dogs["date_of_birth"].min()
> dogs[["weight_kg", "height_cm"]].agg("min")
>
> vet_visits.drop_duplicates(subset="name")
> vet_visits.drop_duplicates(subset=["name", "breed"])["breed"].value_counts(sort=True, normalize=True)
> 
> dogs.groupby(["color", "breed"])["weight_kg"].agg(["min", "max", "sum"])
> dogs.pivot_table(values="weight_kg", index="color", aggfunc=["mean", "median"])
> 
> dogs.groupby(["color", "breed"])["weight_kg"].mean()
> dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0, margins=True)
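
To make the relationship between `groupby()` and `pivot_table()` concrete, here is a small sketch on hypothetical data (the rows are invented for illustration). It suggests that a two-key grouped mean and a pivot table hold the same numbers in different layouts, and it contrasts a summary statistic with a cumulative one.

```python
import pandas as pd

# Hypothetical sample data; column names mirror the snippets above.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
    "breed": ["Labrador", "Poodle", "Labrador", "Poodle", "Labrador"],
    "color": ["Brown", "Black", "Black", "Brown", "Brown"],
    "weight_kg": [25, 23, 29, 21, 27],
})

# Grouped mean: a Series with a two-level (color, breed) index.
grouped = dogs.groupby(["color", "breed"])["weight_kg"].mean()

# The same numbers as a table: colors down the rows, breeds across the columns.
# pivot_table() uses the mean as its default aggregation.
pivoted = dogs.pivot_table(values="weight_kg", index="color", columns="breed")

# Unstacking the inner index level of the grouped result gives the same layout.
unstacked = grouped.unstack("breed")

# Cumulative statistics return one value per row rather than a single summary.
running_total = dogs["weight_kg"].cumsum()

# Counting: drop repeated names first so each dog is counted once,
# then report each breed as a share of the total.
breed_share = dogs.drop_duplicates(subset="name")["breed"].value_counts(normalize=True)

print(pivoted)
```

`fill_value` and `margins` are arguments of `pivot_table()`, not of `groupby()`, which is why they only appear on the pivot table call.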

Slicing and Indexing

Indexes are supercharged row and column names. Learn how they can be combined with slicing for powerful DataFrame subsetting.

Explicit Indexes: .columns and .index
- Setting a column as index
- Removing, dropping and sorting an index
Multi-level indexes a.k.a. hierarchical indexes
Indexes make subsetting simpler
- Index values are just data
- Indexes violate "tidy data" principles
- You need to learn two syntaxes
Slicing and subsetting with .loc and .iloc
- Sort the index before you slice
- Slicing columns and slicing twice
- Slicing by dates
- Subsetting by row/column number
Working with pivot tables
- They are just DataFrames with sorted indexes
- Yet they are a special case, since every column contains the same data type
- The axis argument
- Calculating summary stats across columns

> dogs_ind = dogs.set_index("name")
> dogs_ind.reset_index()
> dogs_ind.reset_index(drop=True)
> dogs_ind3 = dogs.set_index(["breed", "color"])
> dogs_ind3.loc[["Labrador", "Chihuahua"]]
> dogs_ind3.loc[[("Labrador", "Brown"), ("Chihuahua", "Tan")]]
> dogs_ind3.sort_index(level=["color", "breed"], ascending=[True, False])
> 
> dogs_srt = dogs.set_index(["breed", "color"]).sort_index()
> dogs_srt.loc["Chow Chow":"Poodle"]
> dogs_srt.loc[("Labrador", "Brown"):("Schnauzer", "Grey")]
> dogs_srt.loc[:, "name":"height_cm"]
> dogs_srt.loc[("Labrador", "Brown"):("Schnauzer", "Grey"), "name":"height_cm"]
> dogs.set_index("date_of_birth").sort_index().loc["2014-08-25":"2016-09-16"]
> print(dogs.iloc[2:5, 1:4])
> 
> dogs_height_by_breed_vs_color = dog_pack.pivot_table("height_cm", index="breed", columns="color")
> dogs_height_by_breed_vs_color.loc["Chow Chow":"Poodle"]
> dogs_height_by_breed_vs_color.mean(axis="index")
> dogs_height_by_breed_vs_color.mean(axis="columns")
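
The "two syntaxes" point is easier to see side by side. The sketch below uses hypothetical rows (names, breeds, and dates are invented) to illustrate that `.loc` slices by label and includes the end label, while `.iloc` slices by position and excludes the end. It also shows slicing a sorted date index with partial date strings.

```python
import pandas as pd

# Hypothetical sample data; column names mirror the snippets above.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer", "Labrador"],
    "color": ["Brown", "Black", "Brown", "Grey", "Black"],
    "height_cm": [56, 43, 46, 49, 59],
    "date_of_birth": pd.to_datetime(
        ["2013-07-01", "2016-09-16", "2014-08-25", "2011-12-11", "2017-01-20"]
    ),
})

# Sort the index before slicing; label slices rely on the order of the index.
dogs_srt = dogs.set_index(["breed", "color"]).sort_index()

# .loc slices by label, and the end label is included.
by_label = dogs_srt.loc["Chow Chow":"Labrador"]

# .iloc slices by position, and the end position is excluded.
by_position = dogs_srt.iloc[0:3]

# Slicing on inner index levels takes tuples for both endpoints.
by_tuple = dogs_srt.loc[("Labrador", "Brown"):("Schnauzer", "Grey")]

# With a sorted date index, partial date strings such as a year work as endpoints.
by_date = dogs.set_index("date_of_birth").sort_index().loc["2014":"2016"]

print(by_label["name"].tolist())
print(by_date["name"].tolist())
```

In this data, the label slice and the positional slice happen to select the same three rows, which makes the inclusive versus exclusive end easy to compare.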

Creating and Visualizing DataFrames

Visualize the contents of your DataFrames, handle missing data values, and import data from and export data to CSV files

Plots
- Histograms, Bar plots, Line plots, Scatter plots
- Layering plots, legend, grid, transparency ...
Missing values
- Detecting, counting, removing, replacing
Creating DataFrames
- From a list of dictionaries
- From a dictionary of lists
Reading and writing CSVs

> dog_pack["height_cm"].hist(bins=20, alpha=0.7)
> avg_weight_by_breed = dog_pack.groupby("breed")["weight_kg"].mean()
> avg_weight_by_breed.plot(kind="bar")
> sully.plot(x="date", y="weight_kg", kind="line", rot=45)
> dog_pack.plot(x="height_cm", y="weight_kg", kind="scatter")
> 
> dogs.isna().any()
> dogs.isna().sum()
> dogs.isna().sum().plot(kind="bar")
> dogs.dropna()
> dogs.fillna(0)
>
> new_dogs = pd.read_csv("new_dogs.csv")
> new_dogs.to_csv("new_dogs_with_bmi.csv")
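
The outline lists two ways of creating DataFrames that the snippets above do not show. As a rough sketch with made-up rows, the code below builds the same table row by row (list of dictionaries) and column by column (dictionary of lists), then handles a missing value and round-trips the result through CSV. It writes to an in-memory buffer instead of a file, which is an assumption made only to keep the example self-contained.

```python
import io

import pandas as pd

# Row by row: each dictionary is one row (hypothetical data).
list_of_dicts = [
    {"name": "Ginger", "breed": "Dachshund", "height_cm": 22, "weight_kg": 10.0},
    {"name": "Scout", "breed": "Dalmatian", "height_cm": 59, "weight_kg": None},
]
by_rows = pd.DataFrame(list_of_dicts)

# Column by column: each key is a column name, each list holds that column's values.
dict_of_lists = {
    "name": ["Ginger", "Scout"],
    "breed": ["Dachshund", "Dalmatian"],
    "height_cm": [22, 59],
    "weight_kg": [10.0, None],
}
by_cols = pd.DataFrame(dict_of_lists)

# Detect and count missing values per column.
missing_per_column = by_cols.isna().sum()

# Remove rows with missing values, or replace them column by column.
dropped = by_cols.dropna()
filled = by_cols.fillna({"weight_kg": 0})

# Write to CSV and read it back; index=False leaves the row numbers out of the file.
buffer = io.StringIO()
filled.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

print(round_trip)
```

Without `index=False`, `to_csv()` also writes the index as an extra unnamed first column, which then shows up when the file is read back.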
