Created
April 1, 2021 11:55
{ | |
"cells": [ | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "<center>\n <img src=\"https://gitlab.com/ibm/skills-network/courses/placeholder101/-/raw/master/labs/module%201/images/IDSNlogo.png\" width=\"300\" alt=\"cognitiveclass.ai logo\" />\n</center>\n" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "# **Web Scraping Lab**\n" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Estimated time needed: **30** minutes\n" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "## Objectives\n" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "After completing this lab you will be able to:\n" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "- Download a webpage using requests module\n- Scrape all links from a web page\n- Scrape all image urls from a web page\n- Scrape data from html tables\n" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "## Scrape [www.ibm.com](http://www.ibm.com?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBM-DA0321EN-SkillsNetwork-21426264&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBM-DA0321EN-SkillsNetwork-21426264&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ)\n" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Import the required modules and functions\n" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"source": "from bs4 import BeautifulSoup # this module helps in web scrapping.\nimport requests # this module helps us to download a web page", | |
"execution_count": 1, | |
"outputs": [] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Download the contents of the web page\n" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"source": "url = \"http://www.ibm.com\"", | |
"execution_count": 2, | |
"outputs": [] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"source": "# get the contents of the webpage in text format and store in a variable called data\ndata = requests.get(url).text ", | |
"execution_count": 3, | |
"outputs": [] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Create a soup object using the class BeautifulSoup\n" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"source": "soup = BeautifulSoup(data,\"html5lib\") # create a soup object using the variable 'data'", | |
"execution_count": 4, | |
"outputs": [] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Scrape all links\n" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"source": "for link in soup.find_all('a'): # in html anchor/link is represented by the tag <a>\n print(link.get('href'))", | |
"execution_count": 5, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "#main-content\nhttp://www.ibm.com\nhttps://www.ibm.com/cloud/hybrid/value-calculator/?lnk=ushpv18l1\nhttps://www.ibm.com/products/digital-health-pass?lnk=ushpv18f1\nhttps://www.ibm.com/security/services/cloud-security-services?lnk=ushpv18f2\nhttps://www.ibm.com/thought-leadership/smart/talks/?lnk=ushpv18f3\nhttps://www.ibm.com/cloud/watson-studio/premium?lnk=ushpv18f4\nhttps://developer.ibm.com/solutions/application-modernization/?lnk=ushpv18d1\nhttps://www.ibm.com/products/offers-and-discounts?link=ushpv18t5&lnk2=trial_mktpl_MPDISC\nhttps://www.ibm.com/products/cloud-pak-for-data?lnk=ushpv18t1&lnk2=trial_CloudPakData&psrc=none&pexp=def\nhttps://www.ibm.com/products/hosted-security-intelligence?lnk=ushpv18t2&lnk2=trial_QRadarCloud&psrc=none&pexp=def\nhttps://www.ibm.com/cloud/blog/instana?lnk=ushpv18t3&lnk2=trial_Instana&psrc=none&pexp=def\nhttps://www.ibm.com/cloud/watson-discovery?lnk=ushpv18t4&lnk2=trial_WatDiscovery&psrc=none&pexp=def\nhttps://www.ibm.com/search?lnk=ushpv18srch&locale=en-us&q=\nhttps://www.ibm.com/products?lnk=ushpv18p1&lnk2=trial_mktpl&psrc=none&pexp=def\nNone\nNone\nhttps://developer.ibm.com/depmodels/cloud/?lnk=ushpv18ct16\nhttps://developer.ibm.com/technologies/artificial-intelligence?lnk=ushpv18ct19\nhttps://www.ibm.com/demos/?lnk=ushpv18ct12\nhttps://developer.ibm.com/?lnk=ushpv18ct9\nhttp://www.ibm.com/support/knowledgecenter/?lnk=ushpv18ct14\nhttps://www.redbooks.ibm.com/?lnk=ushpv18ct10\nhttps://www.ibm.com/support/home/?lnk=ushpv18ct11\nhttps://www.ibm.com/training/?lnk=ushpv18ct15\nhttps://www.ibm.com/cloud/hybrid?lnk=ushpv18ct20\nhttps://www.ibm.com/cloud/learn/public-cloud?lnk=ushpv18ct17\nhttps://www.ibm.com/cloud/redhat?lnk=ushpv18ct13\nhttps://www.ibm.com/artificial-intelligence?lnk=ushpv18ct3\nhttps://www.ibm.com/quantum-computing?lnk=ushpv18ct18\nhttps://www.ibm.com/cloud/learn/kubernetes?lnk=ushpv18ct8\nhttps://www.ibm.com/products/spss-statistics?lnk=ushpv18ct7\nhttps://www.ibm.com/blockchain?lnk=ushpv18ct1\nhttps://
www-03.ibm.com/employment/technicaltalent/developer/?lnk=ushpv18ct2\nhttps://www.ibm.com/search?lnk=ushpv18srch&locale=en-us&q=\nhttps://www.ibm.com/products?lnk=ushpv18p1&lnk2=trial_mktpl&psrc=none&pexp=def\nNone\nNone\nhttps://www.ibm.com/cloud/hybrid?lnk=ushpv18pt14&bv=true\nhttps://www.ibm.com/watson?lnk=ushpv18pt17&bv=true\nhttps://www.ibm.com/us-en/products/categories?technologyTopics[0][0]=cat.topic:Blockchain&isIBMOffering[0]=true&lnk=ushpv18pt4&bv=true\nhttps://www.ibm.com/us-en/products/category/technology/analytics?lnk=ushpv18pt1&bv=true\nhttps://www.ibm.com/financing?lnk=ushpv18pt3&bv=true\nhttps://www.ibm.com/cloud/public?lnk=ushpv18pt15&bv=true\nhttps://www.ibm.com/garage?lnk=ushpv18pt13&bv=true\nhttps://www.ibm.com/thought-leadership/institute-business-value/?lnk=ushpv18pt12&bv=true\nhttps://www.ibm.com/us-en/products/category/technology/security?lnk=ushpv18pt9&bv=true\nhttps://www.ibm.com/quantum-computing?lnk=ushpv18pt16&bv=true\nhttps://www.ibm.com/cloud/hybrid?lnk=ushpv18ct20\nhttps://www.ibm.com/cloud/public?lnk=ushpv18ct17\nhttps://www.ibm.com/cloud/redhat?lnk=ushpv18ct13\nhttps://www.ibm.com/artificial-intelligence?lnk=ushpv18ct3\nhttps://www.ibm.com/quantum-computing?lnk=ushpv18ct18\nhttps://www.ibm.com/cloud/learn/kubernetes?lnk=ushpv18ct8\nhttps://www.ibm.com/products/spss-statistics?lnk=ushpv18ct7\nhttps://www.ibm.com/blockchain?lnk=ushpv18ct1\nhttps://www-03.ibm.com/employment/technicaltalent/developer/?lnk=ushpv18ct2\nhttps://www.ibm.com/\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Scrape all images\n" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"source": "for link in soup.find_all('img'):# in html image is represented by the tag <img>\n print(link.get('src'))", | |
"execution_count": 6, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "\nhttps://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/9b/4f/20210315-hybrid-cloud-value-calculator-25782-720x360.jpg\n\nhttps://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/22/ac/20210329-f-digital-health-pass-25780.jpg\n\nhttps://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/61/fb/20210329-f-security-suite-webinar-25817.jpg\n\nhttps://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/46/87/20210329-smart-talks-rob-and-malcom-data-problem-444x320.jpg\n\nhttps://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/0d/a0/20210222-watson-studio-api-25701-444x320.jpg\nhttps://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/2f/51/grace-developer-diaries-thumbnail-800x450.jpg\nhttps://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/12/0d/pratik-developer-diaries-thumbnail-800x450.jpg\nhttps://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/9c/90/musicians-ai-cloud-developer-diaries-thumbnail-800x450.jpg\nhttps://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/17/60/miscoded-credentials-developer-diaries-thumbnail-800x450.jpg\nhttps://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/28/cf/serveless-developer-diaries-thumbnail-800x450.jpg\nhttps://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/f4/ea/openshift-developer-diaries-thumbnail-800x450.jpg\nhttps://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/28/0f/niklas-developer-diaries-thumbnail-800x450.jpg\n\n\n\n\n\n\n\n\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "## Scrape data from html tables\n" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"source": "#The below url contains a html table with data about colors and color codes.", | |
"execution_count": 7, | |
"outputs": [] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"source": "url = \"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html\"", | |
"execution_count": 9, | |
"outputs": [] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Before proceeding to scrape a web site, you need to examine the contents, and the way data is organized on the website. Open the above url in your browser and check how many rows and columns are there in the color table.\n" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"source": "# get the contents of the webpage in text format and store in a variable called data\ndata = requests.get(url).text", | |
"execution_count": 10, | |
"outputs": [] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"source": "soup = BeautifulSoup(data,\"html5lib\")", | |
"execution_count": 11, | |
"outputs": [] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"source": "#find a html table in the web page\ntable = soup.find('table') # in html table is represented by the tag <table>", | |
"execution_count": 12, | |
"outputs": [] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"source": "#Get all rows from the table\nfor row in table.find_all('tr'): # in html table row is represented by the tag <tr>\n # Get all columns in each row.\n cols = row.find_all('td') # in html a column is represented by the tag <td>\n color_name = cols[2].getText() # store the value in column 3 as color_name\n color_code = cols[3].getText() # store the value in column 4 as color_code\n print(\"{}--->{}\".format(color_name,color_code))", | |
"execution_count": 13, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "Color Name--->Hex Code#RRGGBB\nlightsalmon--->#FFA07A\nsalmon--->#FA8072\ndarksalmon--->#E9967A\nlightcoral--->#F08080\ncoral--->#FF7F50\ntomato--->#FF6347\norangered--->#FF4500\ngold--->#FFD700\norange--->#FFA500\ndarkorange--->#FF8C00\nlightyellow--->#FFFFE0\nlemonchiffon--->#FFFACD\npapayawhip--->#FFEFD5\nmoccasin--->#FFE4B5\npeachpuff--->#FFDAB9\npalegoldenrod--->#EEE8AA\nkhaki--->#F0E68C\ndarkkhaki--->#BDB76B\nyellow--->#FFFF00\nlawngreen--->#7CFC00\nchartreuse--->#7FFF00\nlimegreen--->#32CD32\nlime--->#00FF00\nforestgreen--->#228B22\ngreen--->#008000\npowderblue--->#B0E0E6\nlightblue--->#ADD8E6\nlightskyblue--->#87CEFA\nskyblue--->#87CEEB\ndeepskyblue--->#00BFFF\nlightsteelblue--->#B0C4DE\ndodgerblue--->#1E90FF\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "## Authors\n" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Ramesh Sannareddy\n" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "### Other Contributors\n" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Rav Ahuja\n" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "## Change Log\n" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n| ----------------- | ------- | ----------------- | ---------------------------------- |\n| 2020-10-17 | 0.1 | Ramesh Sannareddy | Created initial version of the lab |\n" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": " Copyright \u00a9 2020 IBM Corporation. This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license?cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBM-DA0321EN-SkillsNetwork-21426264&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBM-DA0321EN-SkillsNetwork-21426264&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBM-DA0321EN-SkillsNetwork-21426264&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ&cm_mmc=Email_Newsletter-_-Developer_Ed%2BTech-_-WW_WW-_-SkillsNetwork-Courses-IBM-DA0321EN-SkillsNetwork-21426264&cm_mmca1=000026UJ&cm_mmca2=10006555&cm_mmca3=M12345678&cvosrc=email.Newsletter.M12345678&cvo_campaign=000026UJ).\n" | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"name": "python3", | |
"display_name": "Python 3.7", | |
"language": "python" | |
}, | |
"language_info": { | |
"name": "python", | |
"version": "3.7.10", | |
"mimetype": "text/x-python", | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"pygments_lexer": "ipython3", | |
"nbconvert_exporter": "python", | |
"file_extension": ".py" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 4 | |
} |
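The link-scraping cell in the notebook prints every `href` value exactly as it appears in the page, so its output includes `None` (for `<a>` tags without an `href`), in-page anchors such as `#main-content`, and repeated URLs. Below is a minimal sketch of one way to tidy such a list using only the standard library. The helper name `clean_links` and the relative path `/training/` in the usage example are hypothetical and are not part of the original lab.

```python
from urllib.parse import urljoin


def clean_links(hrefs, base_url):
    """Return absolute, de-duplicated URLs from raw href values.

    `hrefs` is any iterable of values such as those returned by
    link.get('href'), so it may contain None.
    """
    seen = set()
    cleaned = []
    for href in hrefs:
        # Skip <a> tags that had no href attribute, and empty values
        if not href:
            continue
        href = href.strip()
        # Skip in-page anchors such as "#main-content"
        if not href or href.startswith("#"):
            continue
        # Resolve relative paths against the URL of the scraped page
        absolute = urljoin(base_url, href)
        # Keep only the first occurrence, preserving page order
        if absolute not in seen:
            seen.add(absolute)
            cleaned.append(absolute)
    return cleaned


# Usage with a few values shaped like the notebook's output
# ("/training/" is a made-up relative link for illustration)
sample = [
    "#main-content",
    None,
    "https://www.ibm.com/cloud/hybrid?lnk=ushpv18ct20",
    "/training/",
    "https://www.ibm.com/cloud/hybrid?lnk=ushpv18ct20",
]
for url in clean_links(sample, "https://www.ibm.com/"):
    print(url)
```

In the notebook, the same helper could be fed `(link.get('href') for link in soup.find_all('a'))` together with the `url` variable. Whether fragment-only links should be dropped or kept depends on what the scraped links are used for.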