Created
January 23, 2020 16:51
-
-
Save Patrick-Wallin/35fd6382ffdfa95244653c2576419506 to your computer and use it in GitHub Desktop.
Created on Cognitive Class Labs
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<div class=\"alert alert-block alert-info\" style=\"margin-top: 20px\">\n", | |
" <a href=\"http://cocl.us/DA0101EN_NotbookLink_Top\">\n", | |
" <img src=\"https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/Images/TopAd.png\" width=\"750\" align=\"center\">\n", | |
" </a>\n", | |
"</div>\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<a href=\"https://www.bigdatauniversity.com\"><img src=\"https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/Images/CCLog.png\" width=300, align=\"center\"></a>\n", | |
"\n", | |
"<h1 align=center><font size = 5>Data Analysis with Python</font></h1>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h1>Introduction</h1>\n", | |
"<h3>Welcome!</h3>\n", | |
"\n", | |
"<p>\n", | |
"In this section, you will learn how to approach data acquisition in various ways, and obtain necessary insights from a dataset. By the end of this lab, you will successfully load the data into Jupyter Notebook, and gain some fundamental insights via Pandas Library.\n", | |
"</p>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h2>Table of Contents</h2>\n", | |
"\n", | |
"<div class=\"alert alert-block alert-info\" style=\"margin-top: 20px\">\n", | |
"<ol>\n", | |
" <li><a href=\"#data_acquisition\">Data Acquisition</a>\n", | |
" <li><a href=\"#basic_insight\">Basic Insight of Dataset</a></li>\n", | |
"</ol>\n", | |
"\n", | |
"Estimated Time Needed: <strong>10 min</strong>\n", | |
"</div>\n", | |
"<hr>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h1 id=\"data_acquisition\">Data Acquisition</h1>\n", | |
"<p>\n", | |
"There are various formats for a dataset, .csv, .json, .xlsx etc. The dataset can be stored in different places, on your local machine or sometimes online.<br>\n", | |
"In this section, you will learn how to load a dataset into our Jupyter Notebook.<br>\n", | |
"In our case, the Automobile Dataset is an online source, and it is in CSV (comma separated value) format. Let's use this dataset as an example to practice data reading.\n", | |
"<ul>\n", | |
" <li>data source: <a href=\"https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data\" target=\"_blank\">https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data</a></li>\n", | |
" <li>data type: csv</li>\n", | |
"</ul>\n", | |
"The Pandas Library is a useful tool that enables us to read various datasets into a data frame; our Jupyter notebook platforms have a built-in <b>Pandas Library</b> so that all we need to do is import Pandas without installing.\n", | |
"</p>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"# import pandas library\n", | |
"import pandas as pd" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h2>Read Data</h2>\n", | |
"<p>\n", | |
"We use <code>pandas.read_csv()</code> function to read the csv file. In the bracket, we put the file path along with a quotation mark, so that pandas will read the file into a data frame from that address. The file path can be either an URL or your local file address.<br>\n", | |
"Because the data does not include headers, we can add an argument <code>headers = None</code> inside the <code>read_csv()</code> method, so that pandas will not automatically set the first row as a header.<br>\n", | |
"You can also assign the dataset to any variable you create.\n", | |
"</p>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"This dataset was hosted on IBM Cloud object click <a href=\"https://cocl.us/cognitive_class_DA0101EN_objectstorage\">HERE</a> for free storage." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Import pandas library\n", | |
"import pandas as pd\n", | |
"\n", | |
"# Read the online file by the URL provides above, and assign it to variable \"df\"\n", | |
"other_path = \"https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/auto.csv\"\n", | |
"df = pd.read_csv(other_path, header=None)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"After reading the dataset, we can use the <code>dataframe.head(n)</code> method to check the top n rows of the dataframe; where n is an integer. Contrary to <code>dataframe.head(n)</code>, <code>dataframe.tail(n)</code> will show you the bottom n rows of the dataframe.\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# show the first 5 rows using dataframe.head() method\n", | |
"print(\"The first 5 rows of the dataframe\") \n", | |
"df.head(5)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<div class=\"alert alert-danger alertdanger\" style=\"margin-top: 20px\">\n", | |
"<h1> Question #1: </h1>\n", | |
"<b>check the bottom 10 rows of data frame \"df\".</b>\n", | |
"</div>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Write your code below and press Shift+Enter to execute \n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<div class=\"alert alert-danger alertdanger\" style=\"margin-top: 20px\">\n", | |
"<h1> Question #1 Answer: </h1>\n", | |
"<b>Run the code below for the solution!</b>\n", | |
"</div>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Double-click <b>here</b> for the solution.\n", | |
"\n", | |
"<!-- The answer is below:\n", | |
"\n", | |
"print(\"The last 10 rows of the dataframe\\n\")\n", | |
"df.tail(10)\n", | |
"\n", | |
"-->" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h3>Add Headers</h3>\n", | |
"<p>\n", | |
"Take a look at our dataset; pandas automatically set the header by an integer from 0.\n", | |
"</p>\n", | |
"<p>\n", | |
"To better describe our data we can introduce a header, this information is available at: <a href=\"https://archive.ics.uci.edu/ml/datasets/Automobile\" target=\"_blank\">https://archive.ics.uci.edu/ml/datasets/Automobile</a>\n", | |
"</p>\n", | |
"<p>\n", | |
"Thus, we have to add headers manually.\n", | |
"</p>\n", | |
"<p>\n", | |
"Firstly, we create a list \"headers\" that include all column names in order.\n", | |
"Then, we use <code>dataframe.columns = headers</code> to replace the headers by the list we created.\n", | |
"</p>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# create headers list\n", | |
"headers = [\"symboling\",\"normalized-losses\",\"make\",\"fuel-type\",\"aspiration\", \"num-of-doors\",\"body-style\",\n", | |
" \"drive-wheels\",\"engine-location\",\"wheel-base\", \"length\",\"width\",\"height\",\"curb-weight\",\"engine-type\",\n", | |
" \"num-of-cylinders\", \"engine-size\",\"fuel-system\",\"bore\",\"stroke\",\"compression-ratio\",\"horsepower\",\n", | |
" \"peak-rpm\",\"city-mpg\",\"highway-mpg\",\"price\"]\n", | |
"print(\"headers\\n\", headers)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" We replace headers and recheck our data frame" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"df.columns = headers\n", | |
"df.head(10)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"we can drop missing values along the column \"price\" as follows " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"df.dropna(subset=[\"price\"], axis=0)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Now, we have successfully read the raw dataset and add the correct headers into the data frame." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" <div class=\"alert alert-danger alertdanger\" style=\"margin-top: 20px\">\n", | |
"<h1> Question #2: </h1>\n", | |
"<b>Find the name of the columns of the dataframe</b>\n", | |
"</div>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# Write your code below and press Shift+Enter to execute \n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Double-click <b>here</b> for the solution.\n", | |
"\n", | |
"<!-- The answer is below:\n", | |
"\n", | |
"print(df.columns)\n", | |
"\n", | |
"-->" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h2>Save Dataset</h2>\n", | |
"<p>\n", | |
"Correspondingly, Pandas enables us to save the dataset to csv by using the <code>dataframe.to_csv()</code> method, you can add the file path and name along with quotation marks in the brackets.\n", | |
"</p>\n", | |
"<p>\n", | |
" For example, if you would save the dataframe <b>df</b> as <b>automobile.csv</b> to your local machine, you may use the syntax below:\n", | |
"</p>" | |
] | |
}, | |
{ | |
"cell_type": "raw", | |
"metadata": {}, | |
"source": [ | |
"df.to_csv(\"automobile.csv\", index=False)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" We can also read and save other file formats, we can use similar functions to **`pd.read_csv()`** and **`df.to_csv()`** for other data formats, the functions are listed in the following table:\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h2>Read/Save Other Data Formats</h2>\n", | |
"\n", | |
"\n", | |
"\n", | |
"| Data Formate | Read | Save |\n", | |
"| ------------- |:--------------:| ----------------:|\n", | |
"| csv | `pd.read_csv()` |`df.to_csv()` |\n", | |
"| json | `pd.read_json()` |`df.to_json()` |\n", | |
"| excel | `pd.read_excel()`|`df.to_excel()` |\n", | |
"| hdf | `pd.read_hdf()` |`df.to_hdf()` |\n", | |
"| sql | `pd.read_sql()` |`df.to_sql()` |\n", | |
"| ... | ... | ... |" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h1 id=\"basic_insight\">Basic Insight of Dataset</h1>\n", | |
"<p>\n", | |
"After reading data into Pandas dataframe, it is time for us to explore the dataset.<br>\n", | |
"There are several ways to obtain essential insights of the data to help us better understand our dataset.\n", | |
"</p>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h2>Data Types</h2>\n", | |
"<p>\n", | |
"Data has a variety of types.<br>\n", | |
"The main types stored in Pandas dataframes are <b>object</b>, <b>float</b>, <b>int</b>, <b>bool</b> and <b>datetime64</b>. In order to better learn about each attribute, it is always good for us to know the data type of each column. In Pandas:\n", | |
"</p>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"df.dtypes\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"returns a Series with the data type of each column." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false, | |
"scrolled": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# check the data type of data frame \"df\" by .dtypes\n", | |
"print(df.dtypes)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<p>\n", | |
"As a result, as shown above, it is clear to see that the data type of \"symboling\" and \"curb-weight\" are <code>int64</code>, \"normalized-losses\" is <code>object</code>, and \"wheel-base\" is <code>float64</code>, etc.\n", | |
"</p>\n", | |
"<p>\n", | |
"These data types can be changed; we will learn how to accomplish this in a later module.\n", | |
"</p>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h2>Describe</h2>\n", | |
"If we would like to get a statistical summary of each column, such as count, column mean value, column standard deviation, etc. We use the describe method:" | |
] | |
}, | |
{ | |
"cell_type": "raw", | |
"metadata": {}, | |
"source": [ | |
"dataframe.describe()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"This method will provide various summary statistics, excluding <code>NaN</code> (Not a Number) values." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"df.describe()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<p>\n", | |
"This shows the statistical summary of all numeric-typed (int, float) columns.<br>\n", | |
"For example, the attribute \"symboling\" has 205 counts, the mean value of this column is 0.83, the standard deviation is 1.25, the minimum value is -2, 25th percentile is 0, 50th percentile is 1, 75th percentile is 2, and the maximum value is 3.\n", | |
"<br>\n", | |
"However, what if we would also like to check all the columns including those that are of type object.\n", | |
"<br><br>\n", | |
"\n", | |
"You can add an argument <code>include = \"all\"</code> inside the bracket. Let's try it again.\n", | |
"</p>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# describe all the columns in \"df\" \n", | |
"df.describe(include = \"all\")" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<p>\n", | |
"Now, it provides the statistical summary of all the columns, including object-typed attributes.<br>\n", | |
"We can now see how many unique values, which is the top value and the frequency of top value in the object-typed columns.<br>\n", | |
"Some values in the table above show as \"NaN\", this is because those numbers are not available regarding a particular column type.<br>\n", | |
"</p>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<div class=\"alert alert-danger alertdanger\" style=\"margin-top: 20px\">\n", | |
"<h1> Question #3: </h1>\n", | |
"\n", | |
"<p>\n", | |
"You can select the columns of a data frame by indicating the name of each column, for example, you can select the three columns as follows:\n", | |
"</p>\n", | |
"<p>\n", | |
" <code>dataframe[[' column 1 ',column 2', 'column 3']]</code>\n", | |
"</p>\n", | |
"<p>\n", | |
"Where \"column\" is the name of the column, you can apply the method \".describe()\" to get the statistics of those columns as follows:\n", | |
"</p>\n", | |
"<p>\n", | |
" <code>dataframe[[' column 1 ',column 2', 'column 3'] ].describe()</code>\n", | |
"</p>\n", | |
"\n", | |
"Apply the method to \".describe()\" to the columns 'length' and 'compression-ratio'.\n", | |
"</div>" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Write your code below and press Shift+Enter to execute \n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Double-click <b>here</b> for the solution.\n", | |
"\n", | |
"<!-- The answer is below:\n", | |
"\n", | |
"df[['length', 'compression-ratio']].describe()\n", | |
"\n", | |
"-->\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h2>Info</h2>\n", | |
"Another method you can use to check your dataset is:" | |
] | |
}, | |
{ | |
"cell_type": "raw", | |
"metadata": {}, | |
"source": [ | |
"dataframe.info" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"It provide a concise summary of your DataFrame." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [], | |
"source": [ | |
"# look at the info of \"df\"\n", | |
"df.info" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<p>\n", | |
"Here we are able to see the information of our dataframe, with the top 30 rows and the bottom 30 rows.\n", | |
"<br><br>\n", | |
"And, it also shows us the whole data frame has 205 rows and 26 columns in total.\n", | |
"</p>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h1>Excellent! You have just completed the Introduction Notebook!</h1>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<div class=\"alert alert-block alert-info\" style=\"margin-top: 20px\">\n", | |
"\n", | |
" <p><a href=\"https://cocl.us/DA0101EN_NotbookLink_Top_bottom\"><img src=\"https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/Images/BottomAd.png\" width=\"750\" align=\"center\"></a></p>\n", | |
"</div>\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<h3>About the Authors:</h3>\n", | |
"\n", | |
"This notebook was written by <a href=\"https://www.linkedin.com/in/mahdi-noorian-58219234/\" target=\"_blank\">Mahdi Noorian PhD</a>, <a href=\"https://www.linkedin.com/in/joseph-s-50398b136/\" target=\"_blank\">Joseph Santarcangelo</a>, Bahare Talayian, Eric Xiao, Steven Dong, Parizad, Hima Vsudevan and <a href=\"https://www.linkedin.com/in/fiorellawever/\" target=\"_blank\">Fiorella Wenver</a> and <a href=\" https://www.linkedin.com/in/yi-leng-yao-84451275/ \" target=\"_blank\" >Yi Yao</a>.\n", | |
"\n", | |
"<p><a href=\"https://www.linkedin.com/in/joseph-s-50398b136/\" target=\"_blank\">Joseph Santarcangelo</a> is a Data Scientist at IBM, and holds a PhD in Electrical Engineering. His research focused on using Machine Learning, Signal Processing, and Computer Vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.</p>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<hr>\n", | |
"<p>Copyright © 2018 IBM Developer Skills Network. This notebook and its source code are released under the terms of the <a href=\"https://cognitiveclass.ai/mit-license/\">MIT License</a>.</p>" | |
] | |
} | |
], | |
"metadata": { | |
"anaconda-cloud": {}, | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.6.7" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment