Last active
October 27, 2022 09:53
-
-
Save Faiza-K/0191ccaff8bfdd90dad7fbcd7a22f064 to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "<center>\n <img src=\"https://gitlab.com/ibm/skills-network/courses/placeholder101/-/raw/master/labs/module%201/images/IDSNlogo.png\" width=\"300\" alt=\"cognitiveclass.ai logo\" />\n</center>\n" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "# Peer Review Assignment - Data Engineer - ETL\n" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "Estimated time needed: **20** minutes\n" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "## Objectives\n\nIn this final part you will:\n\n* Run the ETL process\n* Extract bank and market cap data from the JSON file `bank_market_cap.json`\n* Transform the market cap currency using the exchange rate data\n* Load the transformed data into a seperate CSV\n" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "For this lab, we are going to be using Python and several Python libraries. Some of these libraries might be installed in your lab environment or in SN Labs. Others may need to be installed by you. The cells below will install these libraries when executed.\n" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [], | |
"source": "#!pip install glob\n#!pip install pandas\n#!pip install requests\n#!pip install datetime" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "## Imports\n\nImport any additional libraries you may need here.\n" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [], | |
"source": "import glob\nimport pandas as pd\nfrom datetime import datetime" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "As the exchange rate fluctuates, we will download the same dataset to make marking simpler. This will be in the same format as the dataset you used in the last section\n" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": "--2021-12-06 16:32:26-- https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Lab%20-%20Extract%20Transform%20Load/data/bank_market_cap_1.json\nResolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 198.23.119.245\nConnecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|198.23.119.245|:443... connected.\nHTTP request sent, awaiting response... 200 OK\nLength: 2815 (2.7K) [application/json]\nSaving to: \u2018bank_market_cap_1.json.2\u2019\n\nbank_market_cap_1.j 100%[===================>] 2.75K --.-KB/s in 0s \n\n2021-12-06 16:32:26 (35.5 MB/s) - \u2018bank_market_cap_1.json.2\u2019 saved [2815/2815]\n\n--2021-12-06 16:32:27-- https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Lab%20-%20Extract%20Transform%20Load/data/bank_market_cap_2.json\nResolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 198.23.119.245\nConnecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|198.23.119.245|:443... connected.\nHTTP request sent, awaiting response... 200 OK\nLength: 1429 (1.4K) [application/json]\nSaving to: \u2018bank_market_cap_2.json.2\u2019\n\nbank_market_cap_2.j 100%[===================>] 1.40K --.-KB/s in 0s \n\n2021-12-06 16:32:27 (19.0 MB/s) - \u2018bank_market_cap_2.json.2\u2019 saved [1429/1429]\n\n--2021-12-06 16:32:28-- https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Final%20Assignment/exchange_rates.csv\nResolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 198.23.119.245\nConnecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|198.23.119.245|:443... connected.\nHTTP request sent, awaiting response... 200 OK\nLength: 590 [text/csv]\nSaving to: \u2018exchange_rates.csv.2\u2019\n\nexchange_rates.csv. 100%[===================>] 590 --.-KB/s in 0s \n\n2021-12-06 16:32:28 (7.51 MB/s) - \u2018exchange_rates.csv.2\u2019 saved [590/590]\n\n" | |
} | |
], | |
"source": "!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Lab%20-%20Extract%20Transform%20Load/data/bank_market_cap_1.json\n!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Lab%20-%20Extract%20Transform%20Load/data/bank_market_cap_2.json\n!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0221EN-SkillsNetwork/labs/module%206/Final%20Assignment/exchange_rates.csv" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "## Extract\n" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "### JSON Extract Function\n\nThis function will extract JSON files.\n" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [], | |
"source": "def extract_from_json(file_to_process):\n dataframe = pd.read_json(file_to_process)\n return dataframe" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "## Extract Function\n\nDefine the extract function that finds JSON file `bank_market_cap_1.json` and calls the function created above to extract data from them. Store the data in a `pandas` dataframe. Use the following list for the columns.\n" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [], | |
"source": "columns=['Name','Market Cap (US$ Billion)']" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [], | |
"source": "def extract():\n # Write your code here\n extracted_data = pd.DataFrame(columns=['Name','Market Cap (US$ Billion)'])\n \n for json in glob.glob(\"*.json\"):\n extracted_data = extracted_data.append(extract_from_json(\"bank_market_cap_1.json\"), ignore_index= True)\n return extracted_data" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "<b>Question 1</b> Load the file <code>exchange_rates.csv</code> as a dataframe and find the exchange rate for British pounds with the symbol <code>GBP</code>, store it in the variable <code>exchange_rate</code>, you will be asked for the number. Hint: set the parameter <code>index_col</code> to 0.\n" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": "Index(['AUD', 'BGN', 'BRL', 'CAD', 'CHF', 'CNY', 'CZK', 'DKK', 'EUR', 'GBP',\n 'HKD', 'HRK', 'HUF', 'IDR', 'ILS', 'INR', 'ISK', 'JPY', 'KRW', 'MXN',\n 'MYR', 'NOK', 'NZD', 'PHP', 'PLN', 'RON', 'RUB', 'SEK', 'SGD', 'THB',\n 'TRY', 'USD', 'ZAR'],\n dtype='object')" | |
}, | |
"execution_count": 7, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": "# Write your code here\ndf = pd.read_csv(\"exchange_rates.csv\", index_col = 0) \ndf\nexchange_rate= df.rename(columns= {\"Rates\": \"1USD\"})\ndf.index" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [], | |
"source": "exchange_rate= exchange_rate.loc[df.index.isin(['GBP'])]" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": "0.7323984208000001" | |
}, | |
"execution_count": 9, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": "exchange= exchange_rate.squeeze()\nexchange" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "## Transform\n\nUsing <code>exchange_rate</code> and the `exchange_rates.csv` file find the exchange rate of USD to GBP. Write a transform function that\n\n1. Changes the `Market Cap (US$ Billion)` column from USD to GBP\n2. Rounds the Market Cap (US$ Billion)\\` column to 3 decimal places\n3. Rename `Market Cap (US$ Billion)` to `Market Cap (GBP$ Billion)`\n" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": {}, | |
"outputs": [], | |
"source": "targetedfile = 'bank_market_cap_gbp.csv'" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": {}, | |
"outputs": [], | |
"source": "#def transform(\"Add Function Parameters\"):\n # Write your code here\n\ndef transform(data):\n data['Market Cap (US$ Billion)']= data['Market Cap (US$ Billion)'] * exchange\n data['Market Cap (US$ Billion)']= data['Market Cap (US$ Billion)'].round(decimals = 3)\n data.rename(columns={\"Market Cap (US$ Billion)\": 'Market Cap (GBP$ Billion)'}, inplace= True)\n return data \n " | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>1USD</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>GBP</th>\n <td>0.732398</td>\n </tr>\n </tbody>\n</table>\n</div>", | |
"text/plain": " 1USD\nGBP 0.732398" | |
}, | |
"execution_count": 12, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": "exchange_rate.head()" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "## Load\n\nCreate a function that takes a dataframe and load it to a csv named `bank_market_cap_gbp.csv`. Make sure to set `index` to `False`.\n" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": {}, | |
"outputs": [], | |
"source": "#def load(\"Add Function Parameters\"):\n # Write your code here\n\ndef load(targetedfile, data_to_load):\n data_to_load.to_csv(targetedfile, index= False)\n \n# load(targetedfile, transformed_data)" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": "'bank_market_cap_gbp.csv'" | |
}, | |
"execution_count": 14, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": "targetedfile" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "## Logging Function\n" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "Write the logging function <code>log</code> to log your data:\n" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": {}, | |
"outputs": [], | |
"source": "def log(message):\n # Write your code here\n \n timeStemp_format= \"%Y-%h-%d-%H-%M-%S\"\n now = datetime.now()\n \n timeStemp= now.strftime(timeStemp_format)\n with open(\"logfile.txt\", \"a\") as f:\n f.write(timeStemp + ',' + message + '/n')" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "## Running the ETL Process\n" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "Log the process accordingly using the following <code>\"ETL Job Started\"</code> and <code>\"Extract phase Started\"</code>\n" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": {}, | |
"outputs": [], | |
"source": "# Write your code here\nlog(\"ETL Job Started\")\n\nlog('Extract Phase started')" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "### Extract\n" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "<code>Question 2</code> Use the function <code>extract</code>, and print the first 5 rows, take a screen shot:\n" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 18, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Name</th>\n <th>Market Cap (US$ Billion)</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>JPMorgan Chase</td>\n <td>390.934</td>\n </tr>\n <tr>\n <th>1</th>\n <td>Industrial and Commercial Bank of China</td>\n <td>345.214</td>\n </tr>\n <tr>\n <th>2</th>\n <td>Bank of America</td>\n <td>325.331</td>\n </tr>\n <tr>\n <th>3</th>\n <td>Wells Fargo</td>\n <td>308.013</td>\n </tr>\n <tr>\n <th>4</th>\n <td>China Construction Bank</td>\n <td>257.399</td>\n </tr>\n </tbody>\n</table>\n</div>", | |
"text/plain": " Name Market Cap (US$ Billion)\n0 JPMorgan Chase 390.934\n1 Industrial and Commercial Bank of China 345.214\n2 Bank of America 325.331\n3 Wells Fargo 308.013\n4 China Construction Bank 257.399" | |
}, | |
"execution_count": 18, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": "# Call the function here\n\nextracted_data= extract()\n# Print the rows here\nextracted_data.head()" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "Log the data as <code>\"Extract phase Ended\"</code>\n" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 19, | |
"metadata": {}, | |
"outputs": [], | |
"source": "# Write your code here\nlog(\"Extract phase Ended\")" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "### Transform\n" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "Log the following <code>\"Transform phase Started\"</code>\n" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 20, | |
"metadata": {}, | |
"outputs": [], | |
"source": "# Write your code here\nlog(\"Transform phase Started\")" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "<code>Question 3</code> Use the function <code>transform</code> and print the first 5 rows of the output, take a screen shot:\n" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 21, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Name</th>\n <th>Market Cap (GBP$ Billion)</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>JPMorgan Chase</td>\n <td>286.319</td>\n </tr>\n <tr>\n <th>1</th>\n <td>Industrial and Commercial Bank of China</td>\n <td>252.834</td>\n </tr>\n <tr>\n <th>2</th>\n <td>Bank of America</td>\n <td>238.272</td>\n </tr>\n <tr>\n <th>3</th>\n <td>Wells Fargo</td>\n <td>225.588</td>\n </tr>\n <tr>\n <th>4</th>\n <td>China Construction Bank</td>\n <td>188.519</td>\n </tr>\n </tbody>\n</table>\n</div>", | |
"text/plain": " Name Market Cap (GBP$ Billion)\n0 JPMorgan Chase 286.319\n1 Industrial and Commercial Bank of China 252.834\n2 Bank of America 238.272\n3 Wells Fargo 225.588\n4 China Construction Bank 188.519" | |
}, | |
"execution_count": 21, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": "# Call the function here\ntransformed_data= transform(extracted_data)\n# Print the first 5 rows here\ntransformed_data.head()" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "Log your data <code>\"Transform phase Ended\"</code>\n" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 22, | |
"metadata": {}, | |
"outputs": [], | |
"source": "# Write your code here\nlog(\"Transform phase Ended\")" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "### Load\n" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "Log the following `\"Load phase Started\"`.\n" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 23, | |
"metadata": {}, | |
"outputs": [], | |
"source": "# Write your code here\nlog(\"Load phase Started\")" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "Call the load function\n" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 24, | |
"metadata": {}, | |
"outputs": [], | |
"source": "# Write your code here\nload(targetedfile, transformed_data)" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": "import sys\nprint(sys.getrecursionlimit())" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": "sys.setrecursionlimit(5000)" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "Log the following `\"Load phase Ended\"`.\n" | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 25, | |
"metadata": {}, | |
"outputs": [], | |
"source": "# Write your code here\nlog(\"Load phase Ended\")" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "## Authors\n" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "Ramesh Sannareddy, Joseph Santrcangelo and Azim Hirjani\n" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "### Other Contributors\n" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "Rav Ahuja\n" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "## Change Log\n" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n| ----------------- | ------- | ----------------- | ---------------------------------- |\n| 2020-11-25 | 0.1 | Ramesh Sannareddy | Created initial version of the lab |\n" | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": "Copyright \u00a9 2020 IBM Corporation. This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0221ENSkillsNetwork23455645-2021-01-01&cm_mmc=Email_Newsletter-\\_-Developer_Ed%2BTech-\\_-WW_WW-\\_-SkillsNetwork-Courses-IBM-DA0321EN-SkillsNetwork-21426264&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ).\n" | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3.8", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.8.12" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 4 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment