pandas is one of the most popular Python libraries, used for everything from data manipulation to data analysis. Learn how to manipulate DataFrames as you extract, filter, and transform datasets for analysis. Using real-world data, including Walmart sales figures and global temperature time series, you’ll learn how to import, clean, calculate statistics, and create visualizations using pandas!
Led by Maggie Matsui, Data Scientist at DataCamp
Inspect DataFrames and perform fundamental manipulations, including sorting rows, subsetting, and adding new columns
- Exploring DataFrames with .head(), .tail(), .info(), .describe() and .shape
- Viewing components with .values, .columns and .index
- There should be one -- and preferably only one -- obvious way to do it
- Sorting, subsetting columns and rows, adding new columns
> dogs.sort_values("weight_kg")
> dogs.sort_values(["weight_kg", "height_cm"], ascending=[True, False])
> dogs[["breed", "height_cm"]]
> dogs[dogs["height_cm"] > 50]
> dogs[dogs["color"].isin(["Black", "Brown"])]
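The sorting, subsetting, and new-column steps above can be sketched end to end on a small table. The sample rows and the `bmi` column name are assumptions made for illustration; only the column names `breed`, `color`, `height_cm`, and `weight_kg` come from the snippets above.

```python
import pandas as pd

# Hypothetical sample data; column names follow the snippets above.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer"],
    "color": ["Brown", "Black", "Brown", "Gray"],
    "height_cm": [56, 43, 46, 49],
    "weight_kg": [24, 24, 24, 17],
})

# Sort by weight ascending, breaking ties by height descending.
sorted_dogs = dogs.sort_values(["weight_kg", "height_cm"], ascending=[True, False])

# Combine two row conditions with &; the comparison needs its own parentheses.
tall_dark = dogs[(dogs["height_cm"] > 45) & dogs["color"].isin(["Black", "Brown"])]

# Add a new column computed from existing ones (weight in kg / height in m, squared).
dogs["bmi"] = dogs["weight_kg"] / (dogs["height_cm"] / 100) ** 2

print(sorted_dogs["name"].tolist())
print(tall_dark["name"].tolist())
```

Note that `.isin()` on its own returns a boolean Series; it only filters rows once it is passed back into `dogs[...]`.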
Calculate summary statistics on DataFrame columns, and master grouped summary statistics and pivot tables
- Summarizing with: mean(), mode(), min(), max(), median(), sum(), var(), std(), quantile(), cumsum(), cummin(), cummax(), cumprod()
- Counting with drop_duplicates(), value_counts()
- Grouped summary statistics with groupby()
- Pivot Tables
- They are just DataFrames with sorted indexes
- Filling missing values
- Summing
> dogs["date_of_birth"].min()
> dogs[["weight_kg", "height_cm"]].agg(np.min)
>
> vet_visits.drop_duplicates(subset="name")
> vet_visits.drop_duplicates(subset=["name", "breed"])["breed"].value_counts(sort=True, normalize=True)
>
> dogs.groupby(["color", "breed"])["weight_kg"].agg(["min", "max", "sum"])
> dogs.pivot_table(values="weight_kg", index="color", aggfunc=[np.mean, np.median])
>
> dogs.groupby(["color", "breed"])["weight_kg"].mean()
> dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0, margins=True)
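To make the relationship between grouped summaries and pivot tables concrete, here is a minimal sketch on made-up data. It assumes that `fill_value` and `margins` are arguments of `pivot_table()` (they are not accepted by `groupby()`); the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample data.
dogs = pd.DataFrame({
    "color": ["Brown", "Brown", "Black", "Black", "Gray"],
    "breed": ["Labrador", "Chow Chow", "Poodle", "Labrador", "Schnauzer"],
    "weight_kg": [24, 26, 20, 30, 17],
})

# Grouped summary: one mean per (color, breed) pair, returned as a Series
# with a two-level index.
grouped = dogs.groupby(["color", "breed"])["weight_kg"].mean()

# The same means laid out as a table: colors as rows, breeds as columns.
# fill_value=0 replaces combinations that never occur;
# margins=True adds an "All" row and column computed from the original data.
pivot = dogs.pivot_table(
    values="weight_kg",
    index="color",
    columns="breed",
    aggfunc="mean",
    fill_value=0,
    margins=True,
)

print(grouped)
print(pivot)
```

The pivot table holds the same numbers as the grouped Series, only reshaped, which is why missing combinations need a fill value in one form but simply do not appear in the other.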
Indexes are supercharged row and column names. Learn how they can be combined with slicing for powerful DataFrame subsetting.
- Explicit Indexes: .columns and .index
- Setting a column as index
- Removing, dropping and sorting an index
- Multi-level indexes a.k.a. hierarchical indexes
- Indexes make subsetting simpler
- Index values are just data
- Indexes violate "tidy data" principles
- You need to learn two syntaxes
- Slicing and subsetting with .loc and .iloc
- Sort the index before you slice
- Slicing columns and slicing twice
- Slicing by dates
- Subsetting by row/column number
- Working with pivot tables
- They are just DataFrames with sorted indexes
- Yet they are special cases since every column contains the same data type
- The axis argument
- Calculating summary stats across columns
> dogs_ind = dogs.set_index("name")
> dogs_ind.reset_index()
> dogs_ind.reset_index(drop=True)
> dogs_ind3 = dogs.set_index(["breed", "color"])
> dogs_ind3.loc[["Labrador", "Chihuahua"]]
> dogs_ind3.loc[[("Labrador", "Brown"), ("Chihuahua", "Tan")]]
> dogs_ind3.sort_index(level=["color", "breed"], ascending=[True, False])
>
> dogs_srt = dogs.set_index(["breed", "color"]).sort_index()
> dogs_srt.loc["Chow Chow":"Poodle"]
> dogs_srt.loc[("Labrador", "Brown"):("Schnauzer", "Grey")]
> dogs_srt.loc[:, "name":"height_cm"]
> dogs_srt.loc[("Labrador", "Brown"):("Schnauzer", "Grey"), "name":"height_cm"]
> dogs.loc["2014-08-25":"2016-09-16"]
> print(dogs.iloc[2:5, 1:4])
>
> dogs_height_by_breed_vs_color = dog_pack.pivot_table("height_cm", index="breed", columns="color")
> dogs_height_by_breed_vs_color.loc["Chow Chow":"Poodle"]
> dogs_height_by_breed_vs_color.mean(axis="index")
> dogs_height_by_breed_vs_color.mean(axis="columns")
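The slicing rules above (sort the index first, label slices with `.loc`, positions with `.iloc`, and the `axis` argument on pivot tables) can be tried on a small table. This is a sketch with hypothetical rows; the variable names `outer`, `inner`, and `by_position` are chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer", "Labrador"],
    "color": ["Brown", "Black", "Brown", "Gray", "Black"],
    "height_cm": [56, 43, 46, 49, 59],
})

# Label slicing needs a sorted index, so sort right after setting it.
dogs_srt = dogs.set_index(["breed", "color"]).sort_index()

# Slice on the outer level; with .loc both endpoints are included.
outer = dogs_srt.loc["Chow Chow":"Poodle"]

# Slice on both levels by passing tuples.
inner = dogs_srt.loc[("Labrador", "Brown"):("Schnauzer", "Gray")]

# Position-based slicing; with .iloc the end position is excluded.
by_position = dogs_srt.iloc[1:3, 0:1]

# A pivot table is a DataFrame with a sorted index, so the same tools apply.
height_by_breed_vs_color = dogs.pivot_table(
    values="height_cm", index="breed", columns="color"
)
mean_per_color = height_by_breed_vs_color.mean(axis="index")    # down the rows
mean_per_breed = height_by_breed_vs_color.mean(axis="columns")  # across the columns

print(outer["name"].tolist())
print(inner["name"].tolist())
print(mean_per_breed)
```

`axis="index"` collapses the rows and leaves one value per column, while `axis="columns"` collapses the columns and leaves one value per row.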
Visualize the contents of your DataFrames, handle missing data values, and import data from and export data to CSV files
- Plots
- Histograms, Bar plots, Line plots, Scatter plots
- Layering plots, legend, grid, transparency ...
- Missing values
- Detecting, counting, removing, replacing
- Creating DataFrames
- From a list of dictionaries
- From a dictionary of lists
- Reading and writing CSVs
> dog_pack["height_cm"].hist(bins=20, alpha=0.7)
> avg_weight_by_breed = dog_pack.groupby("breed")["weight_kg"].mean()
> avg_weight_by_breed.plot(kind="bar")
> sully.plot(x="date", y="weight_kg", kind="line", rot=45)
> dog_pack.plot(x="height_cm", y="weight_kg", kind="scatter")
>
> dogs.isna().any()
> dogs.isna().sum()
> dogs.isna().sum().plot(kind="bar")
> dogs.dropna()
> dogs.fillna(0)
>
> new_dogs = pd.read_csv("new_dogs.csv")
> new_dogs.to_csv("new_dogs_with_bmi.csv")
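The two ways of creating a DataFrame, the missing-value helpers, and a CSV round trip can be combined in one short sketch. The rows are hypothetical, and an in-memory `io.StringIO` buffer stands in for a file such as `new_dogs.csv` so that the example does not touch the disk.

```python
import io

import numpy as np
import pandas as pd

# From a list of dictionaries: the table is built row by row.
rows = [
    {"name": "Ginger", "breed": "Dachshund", "height_cm": 22, "weight_kg": 10.0},
    {"name": "Scout", "breed": "Dalmatian", "height_cm": 59, "weight_kg": np.nan},
]
from_rows = pd.DataFrame(rows)

# From a dictionary of lists: the table is built column by column.
columns = {
    "name": ["Ginger", "Scout"],
    "breed": ["Dachshund", "Dalmatian"],
    "height_cm": [22, 59],
    "weight_kg": [10.0, np.nan],
}
from_columns = pd.DataFrame(columns)

# Detect and count missing values, then remove or replace them.
missing_per_column = from_rows.isna().sum()
dropped = from_rows.dropna()   # drops every row that has a missing value
filled = from_rows.fillna(0)   # replaces missing values with 0

# Write to CSV and read it back; index=False keeps the row index out of the file.
buffer = io.StringIO()
from_rows.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

print(missing_per_column)
print(round_trip)
```

Without `index=False`, `to_csv()` also writes the row index, which comes back as an extra unnamed column when the file is read again.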