Created
June 16, 2020 19:44
Created on Skills Network Labs
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"<a href=\"https://cognitiveclass.ai\"><img src = \"https://ibm.box.com/shared/static/9gegpsmnsoo25ikkbl4qzlvlyjbgxs5x.png\" width = 400> </a>\n", | |
"\n", | |
"<h1 align=center><font size = 5>Area Plots, Histograms, and Bar Plots</font></h1>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"## Introduction\n", | |
"\n", | |
"In this lab, we will continue exploring the Matplotlib library and will learn how to create additional plots, namely area plots, histograms, and bar charts." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"## Table of Contents\n", | |
"\n", | |
"<div class=\"alert alert-block alert-info\" style=\"margin-top: 20px\">\n", | |
"\n", | |
"1. [Exploring Datasets with *pandas*](#0)<br>\n", | |
"2. [Downloading and Prepping Data](#2)<br>\n", | |
"3. [Visualizing Data using Matplotlib](#4) <br>\n", | |
"4. [Area Plots](#6) <br>\n", | |
"5. [Histograms](#8) <br>\n", | |
"6. [Bar Charts](#10) <br>\n", | |
"</div>\n", | |
"<hr>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"# Exploring Datasets with *pandas* and Matplotlib<a id=\"0\"></a>\n", | |
"\n", | |
"Toolkits: The course heavily relies on [*pandas*](http://pandas.pydata.org/) and [**Numpy**](http://www.numpy.org/) for data wrangling, analysis, and visualization. The primary plotting library that we are exploring in the course is [Matplotlib](http://matplotlib.org/).\n", | |
"\n", | |
"Dataset: Immigration to Canada from 1980 to 2013 - [International migration flows to and from selected countries - The 2015 revision](http://www.un.org/en/development/desa/population/migration/data/empirical2/migrationflows.shtml) from United Nation's website.\n", | |
"\n", | |
"The dataset contains annual data on the flows of international migrants as recorded by the countries of destination. The data presents both inflows and outflows according to the place of birth, citizenship or place of previous / next residence both for foreigners and nationals. For this lesson, we will focus on the Canadian Immigration data." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"# Downloading and Prepping Data <a id=\"2\"></a>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"Import Primary Modules. The first thing we'll do is import two key data analysis modules: *pandas* and **Numpy**." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"editable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"import numpy as np # useful for many scientific computing in Python\n", | |
"import pandas as pd # primary data structure library" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"Let's download and import our primary Canadian Immigration dataset using *pandas* `read_excel()` method. Normally, before we can do that, we would need to download a module which *pandas* requires to read in excel files. This module is **xlrd**. For your convenience, we have pre-installed this module, so you would not have to worry about that. Otherwise, you would need to run the following line of code to install the **xlrd** module:\n", | |
"```\n", | |
"!conda install -c anaconda xlrd --yes\n", | |
"```" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"Download the dataset and read it into a *pandas* dataframe." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"editable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"df_can = pd.read_excel('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DV0101EN/labs/Data_Files/Canada.xlsx',\n", | |
" sheet_name='Canada by Citizenship',\n", | |
" skiprows=range(20),\n", | |
" skipfooter=2\n", | |
" )\n", | |
"\n", | |
"print('Data downloaded and read into a dataframe!')" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"Let's take a look at the first five items in our dataset." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"editable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"df_can.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"Let's find out how many entries there are in our dataset." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"editable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
}, | |
"scrolled": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# print the dimensions of the dataframe\n", | |
"print(df_can.shape)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"Clean up data. We will make some modifications to the original dataset to make it easier to create our visualizations. Refer to `Introduction to Matplotlib and Line Plots` lab for the rational and detailed description of the changes." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"#### 1. Clean up the dataset to remove columns that are not informative to us for visualization (eg. Type, AREA, REG)." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"editable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"df_can.drop(['AREA', 'REG', 'DEV', 'Type', 'Coverage'], axis=1, inplace=True)\n", | |
"\n", | |
"# let's view the first five elements and see how the dataframe was changed\n", | |
"df_can.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"Notice how the columns Type, Coverage, AREA, REG, and DEV got removed from the dataframe." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"#### 2. Rename some of the columns so that they make sense." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"editable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"df_can.rename(columns={'OdName':'Country', 'AreaName':'Continent','RegName':'Region'}, inplace=True)\n", | |
"\n", | |
"# let's view the first five elements and see how the dataframe was changed\n", | |
"df_can.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"Notice how the column names now make much more sense, even to an outsider." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"#### 3. For consistency, ensure that all column labels of type string." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"editable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
}, | |
"scrolled": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# let's examine the types of the column labels\n", | |
"all(isinstance(column, str) for column in df_can.columns)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"Notice how the above line of code returned *False* when we tested if all the column labels are of type **string**. So let's change them all to **string** type." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"editable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"df_can.columns = list(map(str, df_can.columns))\n", | |
"\n", | |
"# let's check the column labels types now\n", | |
"all(isinstance(column, str) for column in df_can.columns)" | |
] | |
}, | |
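The check above returns *False* most likely because the year labels are read from the Excel sheet as integers rather than strings. The following pandas-free sketch reproduces the same check and conversion on a hypothetical list of labels; the label values and the helper name `all_strings` are assumptions made for illustration.

```python
# Hypothetical column labels: text columns plus year labels stored as integers.
labels = ['Country', 'Continent', 'Region', 1980, 1981, 2013]


def all_strings(columns):
    """Return True only if every label is a str."""
    return all(isinstance(column, str) for column in columns)


before = all_strings(labels)      # the integer year labels make this False
labels = list(map(str, labels))   # same conversion the lab applies to df_can.columns
after = all_strings(labels)       # every label is now a str

print(before, after)
print(labels)
```

Keeping every label a string means a year can always be selected as `'2013'`, without having to remember which labels were integers.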
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"#### 4. Set the country name as index - useful for quickly looking up countries using .loc method." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"editable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"df_can.set_index('Country', inplace=True)\n", | |
"\n", | |
"# let's view the first five elements and see how the dataframe was changed\n", | |
"df_can.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"Notice how the country names now serve as indices." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"#### 5. Add total column." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"editable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"df_can['Total'] = df_can.sum(axis=1)\n", | |
"\n", | |
"# let's view the first five elements and see how the dataframe was changed\n", | |
"df_can.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"Now the dataframe has an extra column that presents the total number of immigrants from each country in the dataset from 1980 - 2013. So if we print the dimension of the data, we get:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"editable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
}, | |
"scrolled": true | |
}, | |
"outputs": [], | |
"source": [ | |
"print ('data dimensions:', df_can.shape)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"So now our dataframe has 38 columns instead of 37 columns that we had before." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"editable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"# finally, let's create a list of years from 1980 - 2013\n", | |
"# this will come in handy when we start plotting the data\n", | |
"years = list(map(str, range(1980, 2014)))\n", | |
"\n", | |
"years" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"# Visualizing Data using Matplotlib<a id=\"4\"></a>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"Import `Matplotlib` and **Numpy**." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"editable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"# use the inline backend to generate the plots within the browser\n", | |
"%matplotlib inline \n", | |
"\n", | |
"import matplotlib as mpl\n", | |
"import matplotlib.pyplot as plt\n", | |
"\n", | |
"mpl.style.use('ggplot') # optional: for ggplot-like style\n", | |
"\n", | |
"# check for latest version of Matplotlib\n", | |
"print ('Matplotlib version: ', mpl.__version__) # >= 2.0.0" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"# Area Plots<a id=\"6\"></a>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"In the last module, we created a line plot that visualized the top 5 countries that contribued the most immigrants to Canada from 1980 to 2013. With a little modification to the code, we can visualize this plot as a cumulative plot, also knows as a **Stacked Line Plot** or **Area plot**." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"editable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"df_can.sort_values(['Total'], ascending=False, axis=0, inplace=True)\n", | |
"\n", | |
"# get the top 5 entries\n", | |
"df_top5 = df_can.head()\n", | |
"\n", | |
"# transpose the dataframe\n", | |
"df_top5 = df_top5[years].transpose() \n", | |
"\n", | |
"df_top5.head()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"Area plots are stacked by default. And to produce a stacked area plot, each column must be either all positive or all negative values (any NaN values will defaulted to 0). To produce an unstacked plot, pass `stacked=False`. " | |
] | |
}, | |
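To make the difference between stacked and unstacked concrete: in a stacked area plot, each column is drawn on top of the running total of the columns before it, whereas an unstacked plot draws every column from zero. The sketch below computes those upper boundaries in plain Python. The series values and the helper name `stacked_tops` are hypothetical and are not taken from the dataset.

```python
from itertools import accumulate

# Hypothetical yearly counts for three countries (not taken from the dataset).
series = {
    'A': [10, 20, 30],
    'B': [5, 5, 5],
    'C': [1, 2, 3],
}


def stacked_tops(columns):
    """Upper boundary of each layer: running total across columns, per year."""
    rows = zip(*columns.values())                       # one tuple per year
    running = [list(accumulate(row)) for row in rows]   # cumulative sum within each year
    return {name: [r[i] for r in running] for i, name in enumerate(columns)}


tops = stacked_tops(series)
print(tops)
```

The top of the last layer equals the combined total for each year, which is why a stacked plot is useful for reading the overall trend as well as each country's share.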
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"editable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"df_top5.index = df_top5.index.map(int) # let's change the index values of df_top5 to type integer for plotting\n", | |
"df_top5.plot(kind='area', \n", | |
" stacked=False,\n", | |
" figsize=(20, 10), # pass a tuple (x, y) size\n", | |
" )\n", | |
"\n", | |
"plt.title('Immigration Trend of Top 5 Countries')\n", | |
"plt.ylabel('Number of Immigrants')\n", | |
"plt.xlabel('Years')\n", | |
"\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"The unstacked plot has a default transparency (alpha value) at 0.5. We can modify this value by passing in the `alpha` parameter." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"editable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"df_top5.plot(kind='area', \n", | |
" alpha=0.25, # 0-1, default value a= 0.5\n", | |
" stacked=False,\n", | |
" figsize=(20, 10),\n", | |
" )\n", | |
"\n", | |
"plt.title('Immigration Trend of Top 5 Countries')\n", | |
"plt.ylabel('Number of Immigrants')\n", | |
"plt.xlabel('Years')\n", | |
"\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"### Two types of plotting\n", | |
"\n", | |
"As we discussed in the video lectures, there are two styles/options of ploting with `matplotlib`. Plotting using the Artist layer and plotting using the scripting layer.\n", | |
"\n", | |
"**Option 1: Scripting layer (procedural method) - using matplotlib.pyplot as 'plt' **\n", | |
"\n", | |
"You can use `plt` i.e. `matplotlib.pyplot` and add more elements by calling different methods procedurally; for example, `plt.title(...)` to add title or `plt.xlabel(...)` to add label to the x-axis.\n", | |
"```python\n", | |
" # Option 1: This is what we have been using so far\n", | |
" df_top5.plot(kind='area', alpha=0.35, figsize=(20, 10)) \n", | |
" plt.title('Immigration trend of top 5 countries')\n", | |
" plt.ylabel('Number of immigrants')\n", | |
" plt.xlabel('Years')\n", | |
"```" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"**Option 2: Artist layer (Object oriented method) - using an `Axes` instance from Matplotlib (preferred) **\n", | |
"\n", | |
"You can use an `Axes` instance of your current plot and store it in a variable (eg. `ax`). You can add more elements by calling methods with a little change in syntax (by adding \"*set_*\" to the previous methods). For example, use `ax.set_title()` instead of `plt.title()` to add title, or `ax.set_xlabel()` instead of `plt.xlabel()` to add label to the x-axis. \n", | |
"\n", | |
"This option sometimes is more transparent and flexible to use for advanced plots (in particular when having multiple plots, as you will see later). \n", | |
"\n", | |
"In this course, we will stick to the **scripting layer**, except for some advanced visualizations where we will need to use the **artist layer** to manipulate advanced aspects of the plots." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"editable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"# option 2: preferred option with more flexibility\n", | |
"ax = df_top5.plot(kind='area', alpha=0.35, figsize=(20, 10))\n", | |
"\n", | |
"ax.set_title('Immigration Trend of Top 5 Countries')\n", | |
"ax.set_ylabel('Number of Immigrants')\n", | |
"ax.set_xlabel('Years')" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"**Question**: Use the scripting layer to create a stacked area plot of the 5 countries that contributed the least to immigration to Canada **from** 1980 to 2013. Use a transparency value of 0.45." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"### type your answer here\n", | |
"\n", | |
"\n", | |
"\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"Double-click __here__ for the solution.\n", | |
"<!-- The correct answer is:\n", | |
"\\\\ # get the 5 countries with the least contribution\n", | |
"df_least5 = df_can.tail(5)\n", | |
"-->\n", | |
"\n", | |
"<!--\n", | |
"\\\\ # transpose the dataframe\n", | |
"df_least5 = df_least5[years].transpose() \n", | |
"df_least5.head()\n", | |
"-->\n", | |
"\n", | |
"<!--\n", | |
"df_least5.index = df_least5.index.map(int) # let's change the index values of df_least5 to type integer for plotting\n", | |
"df_least5.plot(kind='area', alpha=0.45, figsize=(20, 10)) \n", | |
"-->\n", | |
"\n", | |
"<!--\n", | |
"plt.title('Immigration Trend of 5 Countries with Least Contribution to Immigration')\n", | |
"plt.ylabel('Number of Immigrants')\n", | |
"plt.xlabel('Years')\n", | |
"-->\n", | |
"\n", | |
"<!--\n", | |
"plt.show()\n", | |
"-->" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"**Question**: Use the artist layer to create an unstacked area plot of the 5 countries that contributed the least to immigration to Canada **from** 1980 to 2013. Use a transparency value of 0.55." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"### type your answer here\n", | |
"\n", | |
"\n", | |
"\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"Double-click __here__ for the solution.\n", | |
"<!-- The correct answer is:\n", | |
"\\\\ # get the 5 countries with the least contribution\n", | |
"df_least5 = df_can.tail(5)\n", | |
"-->\n", | |
"\n", | |
"<!--\n", | |
"\\\\ # transpose the dataframe\n", | |
"df_least5 = df_least5[years].transpose() \n", | |
"df_least5.head()\n", | |
"-->\n", | |
"\n", | |
"<!--\n", | |
"df_least5.index = df_least5.index.map(int) # let's change the index values of df_least5 to type integer for plotting\n", | |
"-->\n", | |
"\n", | |
"<!--\n", | |
"ax = df_least5.plot(kind='area', alpha=0.55, stacked=False, figsize=(20, 10))\n", | |
"-->\n", | |
"\n", | |
"<!--\n", | |
"ax.set_title('Immigration Trend of 5 Countries with Least Contribution to Immigration')\n", | |
"ax.set_ylabel('Number of Immigrants')\n", | |
"ax.set_xlabel('Years')\n", | |
"-->" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"# Histograms<a id=\"8\"></a>\n", | |
"\n", | |
"A histogram is a way of representing the *frequency* distribution of numeric dataset. The way it works is it partitions the x-axis into *bins*, assigns each data point in our dataset to a bin, and then counts the number of data points that have been assigned to each bin. So the y-axis is the frequency or the number of data points in each bin. Note that we can change the bin size and usually one needs to tweak it so that the distribution is displayed nicely." | |
] | |
}, | |
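The binning procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration of equal-width binning, not the implementation used by **Numpy** or *pandas*; the function name `histogram`, the sample data, and the assumption that the values are not all identical are made up for the example.

```python
def histogram(values, bins=10):
    """Count values in `bins` equal-width intervals spanning min(values)..max(values).

    Assumes the values are not all identical (otherwise the bin width would be 0).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    edges = [lo + i * width for i in range(bins + 1)]
    counts = [0] * bins
    for v in values:
        # the last bin is closed on the right, so the maximum lands in it
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts, edges


data = [0, 1, 2, 2, 3, 5, 8, 9, 10]   # hypothetical sample
counts, edges = histogram(data, bins=5)

print(counts)
print(edges)
```

Each count corresponds to the interval between two consecutive edges, so there is always one more edge than there are bins.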
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"**Question:** What is the frequency distribution of the number (population) of new immigrants from the various countries to Canada in 2013?" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"Before we proceed with creating the histogram plot, let's first examine the data split into intervals. To do this, we will us **Numpy**'s `histrogram` method to get the bin ranges and frequency counts as follows:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"editable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"# let's quickly view the 2013 data\n", | |
"df_can['2013'].head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"editable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"# np.histogram returns 2 values\n", | |
"count, bin_edges = np.histogram(df_can['2013'])\n", | |
"\n", | |
"print(count) # frequency count\n", | |
"print(bin_edges) # bin ranges, default = 10 bins" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"By default, the `histrogram` method breaks up the dataset into 10 bins. The figure below summarizes the bin ranges and the frequency distribution of immigration in 2013. We can see that in 2013:\n", | |
"* 178 countries contributed between 0 to 3412.9 immigrants \n", | |
"* 11 countries contributed between 3412.9 to 6825.8 immigrants\n", | |
"* 1 country contributed between 6285.8 to 10238.7 immigrants, and so on..\n", | |
"\n", | |
"<img src=\"https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DV0101EN/labs/Images/Mod2Fig1-Histogram.JPG\" align=\"center\" width=800>" | |
] | |
}, | |
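The bin ranges listed above follow from simple arithmetic: ten equal-width bins between the smallest and largest value. The sketch below assumes the 2013 column ranges from 0 to 34129; that maximum is inferred from the bin width of 3412.9 quoted above and is not read from the dataset here.

```python
# Assumed range of the 2013 column: 0 to 34129, inferred from the bin width
# quoted above (3412.9 x 10 bins), not read from the dataset here.
lowest, highest, bins = 0, 34129, 10

# 10 bins need 11 edges, evenly spaced between the lowest and highest value
edges = [lowest + i * (highest - lowest) / bins for i in range(bins + 1)]

print(edges[:4])
```

The first four edges bound the three bins described in the list above.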
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"We can easily graph this distribution by passing `kind=hist` to `plot()`." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"editable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"df_can['2013'].plot(kind='hist', figsize=(8, 5))\n", | |
"\n", | |
"plt.title('Histogram of Immigration from 195 Countries in 2013') # add a title to the histogram\n", | |
"plt.ylabel('Number of Countries') # add y-label\n", | |
"plt.xlabel('Number of Immigrants') # add x-label\n", | |
"\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"In the above plot, the x-axis represents the population range of immigrants in intervals of 3412.9. The y-axis represents the number of countries that contributed to the aforementioned population. \n", | |
"\n", | |
"Notice that the x-axis labels do not match with the bin size. This can be fixed by passing in a `xticks` keyword that contains the list of the bin sizes, as follows:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"editable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"# 'bin_edges' is a list of bin intervals\n", | |
"count, bin_edges = np.histogram(df_can['2013'])\n", | |
"\n", | |
"df_can['2013'].plot(kind='hist', figsize=(8, 5), xticks=bin_edges)\n", | |
"\n", | |
"plt.title('Histogram of Immigration from 195 countries in 2013') # add a title to the histogram\n", | |
"plt.ylabel('Number of Countries') # add y-label\n", | |
"plt.xlabel('Number of Immigrants') # add x-label\n", | |
"\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"*Side Note:* We could use `df_can['2013'].plot.hist()`, instead. In fact, throughout this lesson, using `some_data.plot(kind='type_plot', ...)` is equivalent to `some_data.plot.type_plot(...)`. That is, passing the type of the plot as argument or method behaves the same. \n", | |
"\n", | |
"See the *pandas* documentation for more info http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.plot.html." | |
] | |
}, | |
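{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"The equivalence can be checked on a small example. The sketch below uses a made-up `sample` Series (not part of the lab data) and compares the bar heights drawn by the two call styles.\n", | |
"\n", | |
"```python\n", | |
"import matplotlib.pyplot as plt\n", | |
"import pandas as pd\n", | |
"\n", | |
"# hypothetical data, used only for this comparison\n", | |
"sample = pd.Series([1, 2, 2, 3, 3, 3, 8, 9])\n", | |
"\n", | |
"fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))\n", | |
"sample.plot(kind='hist', ax=ax1)  # plot type passed as an argument\n", | |
"sample.plot.hist(ax=ax2)          # plot type called as a method\n", | |
"\n", | |
"# each histogram bar is a rectangle patch on its axes\n", | |
"heights_kind = [p.get_height() for p in ax1.patches]\n", | |
"heights_method = [p.get_height() for p in ax2.patches]\n", | |
"print(heights_kind == heights_method)\n", | |
"```\n", | |
"\n", | |
"Which form to use is a matter of style; this lab mostly uses the `kind` argument." | |
] | |
}, | |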
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"We can also plot multiple histograms on the same plot. For example, let's try to answer the following questions using a histogram.\n", | |
"\n", | |
"**Question**: What is the immigration distribution for Denmark, Norway, and Sweden for years 1980 - 2013?" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"editable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"# let's quickly view the dataset \n", | |
"df_can.loc[['Denmark', 'Norway', 'Sweden'], years]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"editable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"# generate histogram\n", | |
"df_can.loc[['Denmark', 'Norway', 'Sweden'], years].plot.hist()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"That does not look right! \n", | |
"\n", | |
"Don't worry, you'll often come across situations like this when creating plots. The solution often lies in how the underlying dataset is structured.\n", | |
"\n", | |
"Instead of plotting the population frequency distribution of the population for the 3 countries, *pandas* instead plotted the population frequency distribution for the `years`.\n", | |
"\n", | |
"This can be easily fixed by first transposing the dataset, and then plotting as shown below.\n", | |
"\n" | |
] | |
}, | |
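{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"To see why transposing helps, recall that a *pandas* histogram draws one series per column. Here is a minimal sketch with a hypothetical `df_small` table laid out like `df_can` (countries as rows, years as columns):\n", | |
"\n", | |
"```python\n", | |
"import pandas as pd\n", | |
"\n", | |
"# hypothetical counts: countries as rows, years as columns\n", | |
"df_small = pd.DataFrame(\n", | |
"    {'1980': [10, 20, 30], '1981': [15, 25, 35]},\n", | |
"    index=['A', 'B', 'C'],\n", | |
")\n", | |
"\n", | |
"# columns before transposing: one histogram series per year\n", | |
"print(list(df_small.columns))\n", | |
"\n", | |
"# columns after transposing: one histogram series per country\n", | |
"print(list(df_small.transpose().columns))\n", | |
"```\n", | |
"\n", | |
"After the transpose, each country becomes a column, so each country gets its own histogram." | |
] | |
}, | |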
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"editable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"# transpose dataframe\n", | |
"df_t = df_can.loc[['Denmark', 'Norway', 'Sweden'], years].transpose()\n", | |
"df_t.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"editable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"# generate histogram\n", | |
"df_t.plot(kind='hist', figsize=(10, 6))\n", | |
"\n", | |
"plt.title('Histogram of Immigration from Denmark, Norway, and Sweden from 1980 - 2013')\n", | |
"plt.ylabel('Number of Years')\n", | |
"plt.xlabel('Number of Immigrants')\n", | |
"\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"Let's make a few modifications to improve the impact and aesthetics of the previous plot:\n", | |
"* increase the bin size to 15 by passing in `bins` parameter\n", | |
"* set transparency to 60% by passing in `alpha` paramemter\n", | |
"* label the x-axis by passing in `x-label` paramater\n", | |
"* change the colors of the plots by passing in `color` parameter" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"editable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"# let's get the x-tick values\n", | |
"count, bin_edges = np.histogram(df_t, 15)\n", | |
"\n", | |
"# un-stacked histogram\n", | |
"df_t.plot(kind ='hist', \n", | |
" figsize=(10, 6),\n", | |
" bins=15,\n", | |
" alpha=0.6,\n", | |
" xticks=bin_edges,\n", | |
" color=['coral', 'darkslateblue', 'mediumseagreen']\n", | |
" )\n", | |
"\n", | |
"plt.title('Histogram of Immigration from Denmark, Norway, and Sweden from 1980 - 2013')\n", | |
"plt.ylabel('Number of Years')\n", | |
"plt.xlabel('Number of Immigrants')\n", | |
"\n", | |
"plt.show()" | |
] | |
}, | |
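{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"A note on where the tick values come from: when `np.histogram` receives a whole table, it computes the histogram over the flattened values, so a single set of bin edges spans every column. The sketch below illustrates this with a hypothetical `df_demo` table standing in for `df_t`.\n", | |
"\n", | |
"```python\n", | |
"import numpy as np\n", | |
"import pandas as pd\n", | |
"\n", | |
"# hypothetical stand-in for df_t: years as rows, countries as columns\n", | |
"df_demo = pd.DataFrame({\n", | |
"    'A': [31, 60, 90],\n", | |
"    'B': [100, 150, 200],\n", | |
"    'C': [250, 280, 308],\n", | |
"})\n", | |
"\n", | |
"# all nine values are pooled, so the edges cover the overall min and max\n", | |
"count, bin_edges = np.histogram(df_demo, 15)\n", | |
"\n", | |
"print(len(bin_edges))               # number of edges for 15 bins\n", | |
"print(bin_edges[0], bin_edges[-1])  # overall minimum and maximum\n", | |
"```\n", | |
"\n", | |
"Because the edges are shared, passing them as `xticks` gives tick marks that line up with the bars of every country." | |
] | |
}, | |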
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"Tip:\n", | |
"For a full listing of colors available in Matplotlib, run the following code in your python shell:\n", | |
"```python\n", | |
"import matplotlib\n", | |
"for name, hex in matplotlib.colors.cnames.items():\n", | |
" print(name, hex)\n", | |
"```" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"If we do no want the plots to overlap each other, we can stack them using the `stacked` paramemter. Let's also adjust the min and max x-axis labels to remove the extra gap on the edges of the plot. We can pass a tuple (min,max) using the `xlim` paramater, as show below." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"editable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"count, bin_edges = np.histogram(df_t, 15)\n", | |
"xmin = bin_edges[0] - 10 # first bin value is 31.0, adding buffer of 10 for aesthetic purposes \n", | |
"xmax = bin_edges[-1] + 10 # last bin value is 308.0, adding buffer of 10 for aesthetic purposes\n", | |
"\n", | |
"# stacked Histogram\n", | |
"df_t.plot(kind='hist',\n", | |
" figsize=(10, 6), \n", | |
" bins=15,\n", | |
" xticks=bin_edges,\n", | |
" color=['coral', 'darkslateblue', 'mediumseagreen'],\n", | |
" stacked=True,\n", | |
" xlim=(xmin, xmax)\n", | |
" )\n", | |
"\n", | |
"plt.title('Histogram of Immigration from Denmark, Norway, and Sweden from 1980 - 2013')\n", | |
"plt.ylabel('Number of Years')\n", | |
"plt.xlabel('Number of Immigrants') \n", | |
"\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"**Question**: Use the scripting layer to display the immigration distribution for Greece, Albania, and Bulgaria for years 1980 - 2013? Use an overlapping plot with 15 bins and a transparency value of 0.35." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": true, | |
"deletable": true, | |
"jupyter": { | |
"outputs_hidden": true | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"### type your answer here\n", | |
"\n", | |
"\n", | |
"\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"Double-click __here__ for the solution.\n", | |
"<!-- The correct answer is:\n", | |
"\\\\ # create a dataframe of the countries of interest (cof)\n", | |
"df_cof = df_can.loc[['Greece', 'Albania', 'Bulgaria'], years]\n", | |
"-->\n", | |
"\n", | |
"<!--\n", | |
"\\\\ # transpose the dataframe\n", | |
"df_cof = df_cof.transpose() \n", | |
"-->\n", | |
"\n", | |
"<!--\n", | |
"\\\\ # let's get the x-tick values\n", | |
"count, bin_edges = np.histogram(df_cof, 15)\n", | |
"-->\n", | |
"\n", | |
"<!--\n", | |
"\\\\ # Un-stacked Histogram\n", | |
"df_cof.plot(kind ='hist',\n", | |
" figsize=(10, 6),\n", | |
" bins=15,\n", | |
" alpha=0.35,\n", | |
" xticks=bin_edges,\n", | |
" color=['coral', 'darkslateblue', 'mediumseagreen']\n", | |
" )\n", | |
"-->\n", | |
"\n", | |
"<!--\n", | |
"plt.title('Histogram of Immigration from Greece, Albania, and Bulgaria from 1980 - 2013')\n", | |
"plt.ylabel('Number of Years')\n", | |
"plt.xlabel('Number of Immigrants')\n", | |
"-->\n", | |
"\n", | |
"<!--\n", | |
"plt.show()\n", | |
"-->" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"# Bar Charts (Dataframe) <a id=\"10\"></a>\n", | |
"\n", | |
"A bar plot is a way of representing data where the *length* of the bars represents the magnitude/size of the feature/variable. Bar graphs usually represent numerical and categorical variables grouped in intervals. \n", | |
"\n", | |
"To create a bar plot, we can pass one of two arguments via `kind` parameter in `plot()`:\n", | |
"\n", | |
"* `kind=bar` creates a *vertical* bar plot\n", | |
"* `kind=barh` creates a *horizontal* bar plot" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"**Vertical bar plot**\n", | |
"\n", | |
"In vertical bar graphs, the x-axis is used for labelling, and the length of bars on the y-axis corresponds to the magnitude of the variable being measured. Vertical bar graphs are particuarly useful in analyzing time series data. One disadvantage is that they lack space for text labelling at the foot of each bar. \n", | |
"\n", | |
"**Let's start off by analyzing the effect of Iceland's Financial Crisis:**\n", | |
"\n", | |
"The 2008 - 2011 Icelandic Financial Crisis was a major economic and political event in Iceland. Relative to the size of its economy, Iceland's systemic banking collapse was the largest experienced by any country in economic history. The crisis led to a severe economic depression in 2008 - 2011 and significant political unrest.\n", | |
"\n", | |
"**Question:** Let's compare the number of Icelandic immigrants (country = 'Iceland') to Canada from year 1980 to 2013. " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"editable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"# step 1: get the data\n", | |
"df_iceland = df_can.loc['Iceland', years]\n", | |
"df_iceland.head()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"editable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"# step 2: plot data\n", | |
"df_iceland.plot(kind='bar', figsize=(10, 6))\n", | |
"\n", | |
"plt.xlabel('Year') # add to x-label to the plot\n", | |
"plt.ylabel('Number of immigrants') # add y-label to the plot\n", | |
"plt.title('Icelandic immigrants to Canada from 1980 to 2013') # add title to the plot\n", | |
"\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"The bar plot above shows the total number of immigrants broken down by each year. We can clearly see the impact of the financial crisis; the number of immigrants to Canada started increasing rapidly after 2008. \n", | |
"\n", | |
"Let's annotate this on the plot using the `annotate` method of the **scripting layer** or the **pyplot interface**. We will pass in the following parameters:\n", | |
"- `s`: str, the text of annotation.\n", | |
"- `xy`: Tuple specifying the (x,y) point to annotate (in this case, end point of arrow).\n", | |
"- `xytext`: Tuple specifying the (x,y) point to place the text (in this case, start point of arrow).\n", | |
"- `xycoords`: The coordinate system that xy is given in - 'data' uses the coordinate system of the object being annotated (default).\n", | |
"- `arrowprops`: Takes a dictionary of properties to draw the arrow:\n", | |
" - `arrowstyle`: Specifies the arrow style, `'->'` is standard arrow.\n", | |
" - `connectionstyle`: Specifies the connection type. `arc3` is a straight line.\n", | |
" - `color`: Specifes color of arror.\n", | |
" - `lw`: Specifies the line width.\n", | |
"\n", | |
"I encourage you to read the Matplotlib documentation for more details on annotations: \n", | |
"http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.annotate." | |
] | |
}, | |
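{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"One detail worth knowing before choosing `xy` values: a *pandas* bar plot places the bars at integer positions 0, 1, 2, and so on along the x-axis, and only labels them with the index values. The sketch below, which uses a hypothetical `years_demo` list mirroring the lab's `years`, shows how a year maps to a bar position.\n", | |
"\n", | |
"```python\n", | |
"import pandas as pd\n", | |
"\n", | |
"# hypothetical stand-in for the years list used in this lab\n", | |
"years_demo = [str(y) for y in range(1980, 2014)]\n", | |
"series_demo = pd.Series(range(len(years_demo)), index=years_demo)\n", | |
"\n", | |
"ax = series_demo.plot(kind='bar')\n", | |
"\n", | |
"# the tick positions are integers, not the years themselves\n", | |
"tick_positions = list(ax.get_xticks())\n", | |
"print(tick_positions[:5])\n", | |
"\n", | |
"# x position of the bars for 2008 and 2012\n", | |
"print(years_demo.index('2008'), years_demo.index('2012'))\n", | |
"```\n", | |
"\n", | |
"This is why an annotation aimed at a particular year is given the position of that year in the index as its x value, rather than the year itself." | |
] | |
}, | |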
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"editable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"df_iceland.plot(kind='bar', figsize=(10, 6), rot=90) # rotate the bars by 90 degrees\n", | |
"\n", | |
"plt.xlabel('Year')\n", | |
"plt.ylabel('Number of Immigrants')\n", | |
"plt.title('Icelandic Immigrants to Canada from 1980 to 2013')\n", | |
"\n", | |
"# Annotate arrow\n", | |
"plt.annotate('', # s: str. Will leave it blank for no text\n", | |
" xy=(32, 70), # place head of the arrow at point (year 2012 , pop 70)\n", | |
" xytext=(28, 20), # place base of the arrow at point (year 2008 , pop 20)\n", | |
" xycoords='data', # will use the coordinate system of the object being annotated \n", | |
" arrowprops=dict(arrowstyle='->', connectionstyle='arc3', color='blue', lw=2)\n", | |
" )\n", | |
"\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"Let's also annotate a text to go over the arrow. We will pass in the following additional parameters:\n", | |
"- `rotation`: rotation angle of text in degrees (counter clockwise)\n", | |
"- `va`: vertical alignment of text [‘center’ | ‘top’ | ‘bottom’ | ‘baseline’]\n", | |
"- `ha`: horizontal alignment of text [‘center’ | ‘right’ | ‘left’]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": false, | |
"deletable": true, | |
"editable": true, | |
"jupyter": { | |
"outputs_hidden": false | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"df_iceland.plot(kind='bar', figsize=(10, 6), rot=90) \n", | |
"\n", | |
"plt.xlabel('Year')\n", | |
"plt.ylabel('Number of Immigrants')\n", | |
"plt.title('Icelandic Immigrants to Canada from 1980 to 2013')\n", | |
"\n", | |
"# Annotate arrow\n", | |
"plt.annotate('', # s: str. will leave it blank for no text\n", | |
" xy=(32, 70), # place head of the arrow at point (year 2012 , pop 70)\n", | |
" xytext=(28, 20), # place base of the arrow at point (year 2008 , pop 20)\n", | |
" xycoords='data', # will use the coordinate system of the object being annotated \n", | |
" arrowprops=dict(arrowstyle='->', connectionstyle='arc3', color='blue', lw=2)\n", | |
" )\n", | |
"\n", | |
"# Annotate Text\n", | |
"plt.annotate('2008 - 2011 Financial Crisis', # text to display\n", | |
" xy=(28, 30), # start the text at at point (year 2008 , pop 30)\n", | |
" rotation=72.5, # based on trial and error to match the arrow\n", | |
" va='bottom', # want the text to be vertically 'bottom' aligned\n", | |
" ha='left', # want the text to be horizontally 'left' algned.\n", | |
" )\n", | |
"\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"**Horizontal Bar Plot**\n", | |
"\n", | |
"Sometimes it is more practical to represent the data horizontally, especially if you need more room for labelling the bars. In horizontal bar graphs, the y-axis is used for labelling, and the length of bars on the x-axis corresponds to the magnitude of the variable being measured. As you will see, there is more room on the y-axis to label categetorical variables.\n", | |
"\n", | |
"\n", | |
"**Question:** Using the scripting layter and the `df_can` dataset, create a *horizontal* bar plot showing the *total* number of immigrants to Canada from the top 15 countries, for the period 1980 - 2013. Label each country with the total immigrant count." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"Step 1: Get the data pertaining to the top 15 countries." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": true, | |
"deletable": true, | |
"jupyter": { | |
"outputs_hidden": true | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"### type your answer here\n", | |
"\n", | |
"\n", | |
"\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"Double-click __here__ for the solution.\n", | |
"<!-- The correct answer is:\n", | |
"\\\\ # sort dataframe on 'Total' column (descending)\n", | |
"df_can.sort_values(by='Total', ascending=True, inplace=True)\n", | |
"-->\n", | |
"\n", | |
"<!--\n", | |
"\\\\ # get top 15 countries\n", | |
"df_top15 = df_can['Total'].tail(15)\n", | |
"df_top15\n", | |
"-->" | |
] | |
}, | |
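{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"The ascending sort in the solution matters for the plot: a horizontal bar chart draws the first row at the bottom, so sorting ascending puts the largest value on top. The sketch below shows the ordering on a hypothetical `totals` Series. The `nlargest` call is an alternative offered for comparison, not the method used in this lab.\n", | |
"\n", | |
"```python\n", | |
"import pandas as pd\n", | |
"\n", | |
"# hypothetical totals for four made-up countries\n", | |
"totals = pd.Series({'A': 5, 'B': 50, 'C': 20, 'D': 35})\n", | |
"\n", | |
"# ascending sort followed by tail keeps the largest values, smallest of them first\n", | |
"top2_tail = totals.sort_values(ascending=True).tail(2)\n", | |
"\n", | |
"# nlargest selects the same entries but orders them largest first\n", | |
"top2_nlargest = totals.nlargest(2)\n", | |
"\n", | |
"print(list(top2_tail.index))\n", | |
"print(list(top2_nlargest.index))\n", | |
"```\n", | |
"\n", | |
"Both calls select the same countries; only the row order differs, and the row order determines the bar order in a `barh` plot." | |
] | |
}, | |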
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"Step 2: Plot data:\n", | |
" 1. Use `kind='barh'` to generate a bar chart with horizontal bars.\n", | |
" 2. Make sure to choose a good size for the plot and to label your axes and to give the plot a title.\n", | |
" 3. Loop through the countries and annotate the immigrant population using the anotate function of the scripting interface." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"button": false, | |
"collapsed": true, | |
"deletable": true, | |
"jupyter": { | |
"outputs_hidden": true | |
}, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"outputs": [], | |
"source": [ | |
"### type your answer here\n", | |
"\n", | |
"\n", | |
"\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"Double-click __here__ for the solution.\n", | |
"<!-- The correct answer is:\n", | |
"\\\\ # generate plot\n", | |
"df_top15.plot(kind='barh', figsize=(12, 12), color='steelblue')\n", | |
"plt.xlabel('Number of Immigrants')\n", | |
"plt.title('Top 15 Conuntries Contributing to the Immigration to Canada between 1980 - 2013')\n", | |
"-->\n", | |
"\n", | |
"<!--\n", | |
"\\\\ # annotate value labels to each country\n", | |
"for index, value in enumerate(df_top15): \n", | |
" label = format(int(value), ',') # format int with commas\n", | |
" \n", | |
" # place text at the end of bar (subtracting 47000 from x, and 0.1 from y to make it fit within the bar)\n", | |
" plt.annotate(label, xy=(value - 47000, index - 0.10), color='white')\n", | |
"-->\n", | |
"\n", | |
"<!--\n", | |
"plt.show()\n", | |
"-->" | |
] | |
}, | |
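{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"The value labels in the solution rely on Python's built-in `format` function with the `','` format specifier, which inserts thousands separators. A minimal sketch with made-up numbers (the `values` list is hypothetical):\n", | |
"\n", | |
"```python\n", | |
"# hypothetical totals; floats are converted to int before formatting\n", | |
"values = [1234567.0, 89012, 345]\n", | |
"\n", | |
"# the ',' format specifier groups digits in threes\n", | |
"labels = [format(int(v), ',') for v in values]\n", | |
"\n", | |
"print(labels)  # ['1,234,567', '89,012', '345']\n", | |
"```\n", | |
"\n", | |
"Converting to `int` first avoids a trailing decimal part in the label when the totals are stored as floats." | |
] | |
}, | |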
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"### Thank you for completing this lab!\n", | |
"\n", | |
"This notebook was originally created by [Jay Rajasekharan](https://www.linkedin.com/in/jayrajasekharan) with contributions from [Ehsan M. Kermani](https://www.linkedin.com/in/ehsanmkermani), and [Slobodan Markovic](https://www.linkedin.com/in/slobodan-markovic).\n", | |
"\n", | |
"This notebook was recently revamped by [Alex Aklson](https://www.linkedin.com/in/aklson/). I hope you found this lab session interesting. Feel free to contact me if you have any questions!" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"This notebook is part of a course on **Coursera** called *Data Visualization with Python*. If you accessed this notebook outside the course, you can take this course online by clicking [here](http://cocl.us/DV0101EN_Coursera_Week2_LAB1)." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"button": false, | |
"deletable": true, | |
"editable": true, | |
"new_sheet": false, | |
"run_control": { | |
"read_only": false | |
} | |
}, | |
"source": [ | |
"<hr>\n", | |
"\n", | |
"Copyright © 2019 [Cognitive Class](https://cognitiveclass.ai/?utm_source=bducopyrightlink&utm_medium=dswb&utm_campaign=bdu). This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license/)." | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python", | |
"language": "python", | |
"name": "conda-env-python-py" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.6.10" | |
}, | |
"widgets": { | |
"state": {}, | |
"version": "1.1.2" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 4 | |
} |