@shohrehsharifib
Created March 7, 2021 23:33
Created on Skills Network Labs
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<center>\n",
" <img src=\"https://gitlab.com/ibm/skills-network/courses/placeholder101/-/raw/master/labs/module%201/images/IDSNlogo.png\" width=\"300\" alt=\"cognitiveclass.ai logo\" />\n",
"</center>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# **Web Scraping Lab**\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Estimated time needed: **30** minutes\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Objectives\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After completing this lab you will be able to:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2>Table of Contents</h2>\n",
"<div class=\"alert alert-block alert-info\" style=\"margin-top: 20px\">\n",
" <ul>\n",
" <li>\n",
" <a href=\"#BSO\">Beautiful Soup Object</a>\n",
" <ul>\n",
" <li>Tag</li>\n",
" <li>Children, Parents, and Siblings</li>\n",
" <li>HTML Attributes</li>\n",
" <li>Navigable String</li>\n",
" </ul>\n",
" </li>\n",
" </ul>\n",
" <ul>\n",
" <li>\n",
" <a href=\"#filter\">Filter</a>\n",
" <ul>\n",
" <li>find All</li>\n",
" <li>find </li>\n",
" <li>HTML Attributes</li>\n",
" <li>Navigable String</li>\n",
" </ul>\n",
" </li>\n",
" </ul>\n",
" <ul>\n",
" <li>\n",
" <a href=\"#DSCW\">Downloading And Scraping The Contents Of A Web</a>\n",
" </li>\n",
" </ul>\n",
" <p>\n",
" Estimated time needed: <strong>25 min</strong>\n",
" </p>\n",
" \n",
"</div>\n",
"\n",
"<hr>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For this lab, we are going to be using Python and several Python libraries. Some of these libraries might be installed in your lab environment or in SN Labs. Others may need to be installed by you. The cells below will install these libraries when executed.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting bs4\n",
" Downloading https://files.pythonhosted.org/packages/10/ed/7e8b97591f6f456174139ec089c769f89a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz\n",
"Collecting beautifulsoup4 (from bs4)\n",
" Downloading https://files.pythonhosted.org/packages/d1/41/e6495bd7d3781cee623ce23ea6ac73282a373088fcd0ddc809a047b18eae/beautifulsoup4-4.9.3-py3-none-any.whl (115kB)\n",
"Collecting soupsieve>1.2; python_version >= \"3.0\" (from beautifulsoup4->bs4)\n",
" Downloading https://files.pythonhosted.org/packages/41/e7/3617a4b988ed7744743fb0dbba5aa0a6e3f95a9557b43f8c4740d296b48a/soupsieve-2.2-py3-none-any.whl\n",
"Building wheels for collected packages: bs4\n",
" Building wheel for bs4 (setup.py) ... done\n",
" Stored in directory: /home/jupyterlab/.cache/pip/wheels/a0/b0/b2/4f80b9456b87abedbc0bf2d52235414c3467d8889be38dd472\n",
"Successfully built bs4\n",
"Installing collected packages: soupsieve, beautifulsoup4, bs4\n",
"Successfully installed beautifulsoup4-4.9.3 bs4-0.0.1 soupsieve-2.2\n"
]
}
],
"source": [
"!pip install bs4\n",
"#!pip install html5lib # uncomment if the 'html5lib' parser used below is not installed\n",
"#!pip install requests"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Import the required modules and functions\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup # this module helps with web scraping\n",
"import requests # this module helps us download a web page"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 id=\"BSO\">Beautiful Soup Objects</h2>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Beautiful Soup is a Python library for pulling data out of HTML and XML files; we will focus on HTML files. It represents the HTML as a set of objects with methods for parsing it. We can navigate the HTML as a tree and/or filter out what we are looking for.\n",
"\n",
"Consider the following HTML:\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<!DOCTYPE html>\n",
"<html>\n",
"<head>\n",
"<title>Page Title</title>\n",
"</head>\n",
"<body>\n",
"<h3><b id='boldest'>Lebron James</b></h3>\n",
"<p> Salary: $ 92,000,000 </p>\n",
"<h3> Stephen Curry</h3>\n",
"<p> Salary: $85,000, 000 </p>\n",
"<h3> Kevin Durant </h3>\n",
"<p> Salary: $73,200, 000</p>\n",
"</body>\n",
"</html>\n"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%%html\n",
"<!DOCTYPE html>\n",
"<html>\n",
"<head>\n",
"<title>Page Title</title>\n",
"</head>\n",
"<body>\n",
"<h3><b id='boldest'>Lebron James</b></h3>\n",
"<p> Salary: $ 92,000,000 </p>\n",
"<h3> Stephen Curry</h3>\n",
"<p> Salary: $85,000, 000 </p>\n",
"<h3> Kevin Durant </h3>\n",
"<p> Salary: $73,200, 000</p>\n",
"</body>\n",
"</html>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can store it as a string in the variable <code>html</code>:\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"html=\"<!DOCTYPE html><html><head><title>Page Title</title></head><body><h3><b id='boldest'>Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body></html>\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To parse a document, pass it into the <code>BeautifulSoup</code> constructor. This returns a <code>BeautifulSoup</code> object, which represents the document as a nested data structure:\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"soup = BeautifulSoup(html, 'html5lib')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, the document is converted to Unicode (a character standard of which ASCII is a subset), and HTML entities are converted to Unicode characters. Beautiful Soup transforms a complex HTML document into a tree of Python objects. The <code>BeautifulSoup</code> object can create other types of objects. In this lab, we will cover <code>BeautifulSoup</code> and <code>Tag</code> objects, which for the purposes of this lab behave identically, and <code>NavigableString</code> objects.\n"
]
},
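{
"cell_type": "markdown",
"metadata": {},
"source": [
"The second argument to <code>BeautifulSoup</code> names the parser. The short sketch below is an illustration rather than part of the original lab: it parses a small fragment with Python's built-in <code>html.parser</code>, which may be a convenient fallback if the <code>html5lib</code> package is not installed. Unlike <code>html5lib</code>, it does not wrap a fragment in <code>html</code>, <code>head</code>, and <code>body</code> elements. The names <code>snippet</code> and <code>soup_builtin</code> are assumptions chosen for this example.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup\n",
"\n",
"# Hypothetical fragment reusing the first player from the HTML above\n",
"snippet = \"<h3><b id='boldest'>Lebron James</b></h3>\"\n",
"\n",
"# 'html.parser' ships with Python, so no extra install is needed\n",
"soup_builtin = BeautifulSoup(snippet, 'html.parser')\n",
"\n",
"# Navigate to the <b> tag and print the text it contains\n",
"print(soup_builtin.b.string)"
]
},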
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use the method <code>prettify()</code> to display the HTML in the nested structure:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(soup.prettify())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tags\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's say we want the title of the page and the name of the top-paid player. We can use the <code>Tag</code> object, which corresponds to an HTML tag in the original document, for example the tag <code>title</code>:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tag_object=soup.title\n",
"print(\"tag object:\",tag_object)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that the type is <code>bs4.element.Tag</code>:\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tag object type: <class 'bs4.element.Tag'>\n"
]
}
],
"source": [
"print(\"tag object type:\",type(tag_object))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If there is more than one <code>Tag</code> with the same name, the first element with that name is returned. Here, this corresponds to the highest-paid player:\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<h3><b id=\"boldest\">Lebron James</b></h3>"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tag_object=soup.h3\n",
"tag_object"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The player's name is enclosed in the bold tag <code>b</code>. To reach it, it helps to use the tree representation: we can navigate down the tree to the child tag to get the name.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Children, Parents, and Siblings\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As stated above, the <code>Tag</code> object is a tree of objects. We can access the child of the tag, or navigate down the branch, as follows:\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<b id=\"boldest\">Lebron James</b>"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tag_child =tag_object.b\n",
"tag_child"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can access the parent with the <code>parent</code> attribute:\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<h3><b id=\"boldest\">Lebron James</b></h3>"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"parent_tag=tag_child.parent\n",
"parent_tag"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is identical to <code>tag_object</code>:\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<h3><b id=\"boldest\">Lebron James</b></h3>"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tag_object"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The parent of <code>tag_object</code> is the <code>body</code> element:\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<body><h3><b id=\"boldest\">Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body>"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tag_object.parent"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next sibling of <code>tag_object</code> is the paragraph (<code>p</code>) element:\n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<p> Salary: $ 92,000,000 </p>"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sibling_1=tag_object.next_sibling\n",
"sibling_1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`sibling_2` is the heading (`h3`) element, which is also a sibling of both `sibling_1` and `tag_object`:\n"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<h3> Stephen Curry</h3>"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sibling_2=sibling_1.next_sibling\n",
"sibling_2"
]
},
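{
"cell_type": "markdown",
"metadata": {},
"source": [
"The sibling navigation above works because the HTML string contains no whitespace between tags. As a hedged illustration (the fragment and the names <code>spaced_html</code>, <code>spaced_soup</code>, and <code>heading</code> are assumptions for this sketch), when tags are separated by a newline, <code>next_sibling</code> returns that whitespace as a <code>NavigableString</code>, while <code>find_next_sibling()</code> skips ahead to the next tag:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup\n",
"\n",
"# Hypothetical fragment with a newline between the heading and the paragraph\n",
"spaced_html = \"<body><h3>Stephen Curry</h3>\\n<p>Salary: $85,000,000</p></body>\"\n",
"spaced_soup = BeautifulSoup(spaced_html, 'html.parser')\n",
"heading = spaced_soup.h3\n",
"\n",
"# next_sibling returns the whitespace text node, not the <p> tag\n",
"print(repr(heading.next_sibling))\n",
"\n",
"# find_next_sibling() only considers tags, so it reaches the paragraph\n",
"print(heading.find_next_sibling('p'))"
]
},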
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3 id=\"first_question\">Exercise: <code>next_sibling</code></h3>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the object <code>sibling_2</code> and the attribute <code>next_sibling</code> to find the salary of Stephen Curry:\n"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<p> Salary: $85,000, 000 </p>"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sibling_2.next_sibling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details><summary>Click here for the solution</summary>\n",
"\n",
"```\n",
"sibling_2.next_sibling\n",
"\n",
"```\n",
"\n",
"</details>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### HTML Attributes\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A tag can have attributes. For example, the tag <code>&lt;b id=\"boldest\"&gt;</code> has an attribute <code>id</code> whose value is <code>boldest</code>. You can access a tag’s attributes by treating the tag like a dictionary:\n"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'boldest'"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tag_child['id']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can access that dictionary directly as <code>attrs</code>:\n"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'id': 'boldest'}"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tag_child.attrs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also work with multi-valued attributes; see the Beautiful Soup documentation <a href=\"https://www.crummy.com/software/BeautifulSoup/bs4/doc/\">[1]</a> for more.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also obtain the value of an attribute of the <code>Tag</code> using the <code>get()</code> method:\n"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'boldest'"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tag_child.get('id')"
]
},
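{
"cell_type": "markdown",
"metadata": {},
"source": [
"The difference between the two access styles shows up when an attribute is missing. The sketch below (an illustration using the assumed name <code>bold_tag</code>) shows that <code>get()</code> returns <code>None</code>, or a default you supply, whereas dictionary-style indexing raises a <code>KeyError</code>:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup\n",
"\n",
"# Hypothetical one-tag document: a <b> tag with an id but no class attribute\n",
"bold_tag = BeautifulSoup(\"<b id='boldest'>Lebron James</b>\", 'html.parser').b\n",
"\n",
"# get() returns None (or the supplied default) when the attribute is absent\n",
"print(bold_tag.get('class'))\n",
"print(bold_tag.get('class', 'no class attribute'))\n",
"\n",
"# Indexing a missing attribute raises KeyError\n",
"try:\n",
"    bold_tag['class']\n",
"except KeyError:\n",
"    print('KeyError: the tag has no class attribute')"
]
},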
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Navigable String\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A string corresponds to a bit of text or content within a tag. Beautiful Soup uses the <code>NavigableString</code> class to contain this text. In our HTML, we can obtain the name of the first player by extracting the string of the <code>Tag</code> object <code>tag_child</code> as follows:\n"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Lebron James'"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tag_string=tag_child.string\n",
"tag_string"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can verify that the type is <code>NavigableString</code>:\n"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"bs4.element.NavigableString"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(tag_string)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A <code>NavigableString</code> is just like a Python string (a Unicode string, to be more precise). The main difference is that it also supports some <code>BeautifulSoup</code> features. We can convert it to a string object in Python:\n"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Lebron James'"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"unicode_string = str(tag_string)\n",
"unicode_string"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 id=\"filter\">Filter</h2>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Filters allow you to find complex patterns; the simplest filter is a string. In this section we will pass a string to different filter methods, and Beautiful Soup will perform a match against that exact string. Consider the following HTML of rocket launches:\n"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table>\n",
" <tr>\n",
" <td id='flight' >Flight No</td>\n",
" <td>Launch site</td> \n",
" <td>Payload mass</td>\n",
" </tr>\n",
" <tr> \n",
" <td>1</td>\n",
" <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td>\n",
" <td>300 kg</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td>\n",
" <td>94 kg</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a> </td>\n",
" <td>80 kg</td>\n",
" </tr>\n",
"</table>\n"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%%html\n",
"<table>\n",
" <tr>\n",
" <td id='flight' >Flight No</td>\n",
" <td>Launch site</td> \n",
" <td>Payload mass</td>\n",
" </tr>\n",
" <tr> \n",
" <td>1</td>\n",
" <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td>\n",
" <td>300 kg</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td>\n",
" <td>94 kg</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a> </td>\n",
" <td>80 kg</td>\n",
" </tr>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can store it as a string in the variable <code>table</code>:\n"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"table=\"<table><tr><td id='flight'>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a></td><td>300 kg</td></tr><tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a> </td><td>80 kg</td></tr></table>\""
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"table_bs = BeautifulSoup(table, 'html5lib')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## find All\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The <code>find_all()</code> method looks through a tag’s descendants and retrieves all descendants that match your filters. \n",
"\n",
"<p>\n",
"The method signature is <code>find_all(name, attrs, recursive, string, limit, **kwargs)</code>.\n",
"</p>\n",
" \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Name\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When we set the <code>name</code> parameter to a tag name, the method will extract all the tags with that name, together with their children.\n"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[<tr><td id=\"flight\">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>,\n",
" <tr> <td>1</td><td><a href=\"https://en.wikipedia.org/wiki/Florida\">Florida</a><a></a></td><td>300 kg</td></tr>,\n",
" <tr><td>2</td><td><a href=\"https://en.wikipedia.org/wiki/Texas\">Texas</a></td><td>94 kg</td></tr>,\n",
" <tr><td>3</td><td><a href=\"https://en.wikipedia.org/wiki/Florida\">Florida</a><a> </a></td><td>80 kg</td></tr>]"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"table_rows=table_bs.find_all('tr')\n",
"table_rows"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result is a Python iterable that behaves like a list; each element is a <code>Tag</code> object:\n"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<tr><td id=\"flight\">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"first_row =table_rows[0]\n",
"first_row"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The type is <code>Tag</code>:\n"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'bs4.element.Tag'>\n"
]
}
],
"source": [
"print(type(first_row))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can obtain the first child <code>td</code> element:\n"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<td id=\"flight\">Flight No</td>"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"first_row.td"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we iterate through the list, each element corresponds to a row in the table:\n"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"row 0 is <tr><td id=\"flight\">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>\n",
"row 1 is <tr> <td>1</td><td><a href=\"https://en.wikipedia.org/wiki/Florida\">Florida</a><a></a></td><td>300 kg</td></tr>\n",
"row 2 is <tr><td>2</td><td><a href=\"https://en.wikipedia.org/wiki/Texas\">Texas</a></td><td>94 kg</td></tr>\n",
"row 3 is <tr><td>3</td><td><a href=\"https://en.wikipedia.org/wiki/Florida\">Florida</a><a> </a></td><td>80 kg</td></tr>\n"
]
}
],
"source": [
"for i,row in enumerate(table_rows):\n",
" print(\"row\",i,\"is\",row)\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As <code>row</code> is a <code>Tag</code> object, we can apply the method <code>find_all</code> to it and extract the table cells into the object <code>cells</code> using the tag <code>td</code>; these are all the descendants with the name <code>td</code>. The result is a list in which each element corresponds to a cell and is a <code>Tag</code> object, so we can iterate through this list as well. We can extract the content using the <code>string</code> attribute.\n"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"row 0\n",
"column 0 cell <td id=\"flight\">Flight No</td>\n",
"column 1 cell <td>Launch site</td>\n",
"column 2 cell <td>Payload mass</td>\n",
"row 1\n",
"column 0 cell <td>1</td>\n",
"column 1 cell <td><a href=\"https://en.wikipedia.org/wiki/Florida\">Florida</a><a></a></td>\n",
"column 2 cell <td>300 kg</td>\n",
"row 2\n",
"column 0 cell <td>2</td>\n",
"column 1 cell <td><a href=\"https://en.wikipedia.org/wiki/Texas\">Texas</a></td>\n",
"column 2 cell <td>94 kg</td>\n",
"row 3\n",
"column 0 cell <td>3</td>\n",
"column 1 cell <td><a href=\"https://en.wikipedia.org/wiki/Florida\">Florida</a><a> </a></td>\n",
"column 2 cell <td>80 kg</td>\n"
]
}
],
"source": [
"for i,row in enumerate(table_rows):\n",
"    print(\"row\",i)\n",
"    cells=row.find_all('td')\n",
"    for j,cell in enumerate(cells):\n",
"        print('column',j,\"cell\",cell)"
]
},
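{
"cell_type": "markdown",
"metadata": {},
"source": [
"Building on the nested loop above, a common next step is to collect the cell text into a list of rows. This is a minimal sketch, assuming every cell holds plain text or a single link; the names <code>launch_html</code>, <code>launch_soup</code>, and <code>records</code> are hypothetical, and <code>get_text(strip=True)</code> is used so that cells containing nested tags still yield their text:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup\n",
"\n",
"# A small, well-formed copy of the first two rows of the launch table\n",
"launch_html = \"<table><tr><td>Flight No</td><td>Launch site</td><td>Payload mass</td></tr><tr><td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td><td>300 kg</td></tr></table>\"\n",
"launch_soup = BeautifulSoup(launch_html, 'html.parser')\n",
"\n",
"records = []\n",
"for row in launch_soup.find_all('tr'):\n",
"    # One list of cell strings per table row\n",
"    records.append([cell.get_text(strip=True) for cell in row.find_all('td')])\n",
"\n",
"print(records)"
]
},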
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we use a list we can match against any item in that list.\n"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[<tr><td id=\"flight\">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>,\n",
" <td id=\"flight\">Flight No</td>,\n",
" <td>Launch site</td>,\n",
" <td>Payload mass</td>,\n",
" <tr> <td>1</td><td><a href=\"https://en.wikipedia.org/wiki/Florida\">Florida</a><a></a></td><td>300 kg</td></tr>,\n",
" <td>1</td>,\n",
" <td><a href=\"https://en.wikipedia.org/wiki/Florida\">Florida</a><a></a></td>,\n",
" <td>300 kg</td>,\n",
" <tr><td>2</td><td><a href=\"https://en.wikipedia.org/wiki/Texas\">Texas</a></td><td>94 kg</td></tr>,\n",
" <td>2</td>,\n",
" <td><a href=\"https://en.wikipedia.org/wiki/Texas\">Texas</a></td>,\n",
" <td>94 kg</td>,\n",
" <tr><td>3</td><td><a href=\"https://en.wikipedia.org/wiki/Florida\">Florida</a><a> </a></td><td>80 kg</td></tr>,\n",
" <td>3</td>,\n",
" <td><a href=\"https://en.wikipedia.org/wiki/Florida\">Florida</a><a> </a></td>,\n",
" <td>80 kg</td>]"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list_input=table_bs.find_all(name=[\"tr\", \"td\"])\n",
"list_input"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Attributes\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If an argument is not recognized, it is turned into a filter on the tag’s attributes. For example, if we pass the <code>id</code> argument, Beautiful Soup filters against each tag’s <code>id</code> attribute. The first <code>td</code> element has an <code>id</code> value of <code>flight</code>, so we can filter based on that <code>id</code> value.\n"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[<td id=\"flight\">Flight No</td>]"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"table_bs.find_all(id=\"flight\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can find all the elements that have links to the Florida Wikipedia page:\n"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[<a href=\"https://en.wikipedia.org/wiki/Florida\">Florida</a>,\n",
" <a href=\"https://en.wikipedia.org/wiki/Florida\">Florida</a>]"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list_input=table_bs.find_all(href=\"https://en.wikipedia.org/wiki/Florida\")\n",
"list_input"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we set the <code>href</code> argument to <code>True</code>, the code finds all tags that have an <code>href</code> attribute, regardless of its value:\n"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[<a href=\"https://en.wikipedia.org/wiki/Florida\">Florida</a>,\n",
" <a href=\"https://en.wikipedia.org/wiki/Texas\">Texas</a>,\n",
" <a href=\"https://en.wikipedia.org/wiki/Florida\">Florida</a>]"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"table_bs.find_all(href=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are other methods for dealing with attributes, such as CSS selectors; check out the following <a href='https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors'>link</a>.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3 id=\"exer_type\">Exercise: <code>find_all</code></h3>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the logic above, find all the elements without an <code>href</code> attribute:\n"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[<html><head></head><body><table><tbody><tr><td id=\"flight\">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href=\"https://en.wikipedia.org/wiki/Florida\">Florida</a><a></a></td><td>300 kg</td></tr><tr><td>2</td><td><a href=\"https://en.wikipedia.org/wiki/Texas\">Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href=\"https://en.wikipedia.org/wiki/Florida\">Florida</a><a> </a></td><td>80 kg</td></tr></tbody></table></body></html>,\n",
" <head></head>,\n",
" <body><table><tbody><tr><td id=\"flight\">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href=\"https://en.wikipedia.org/wiki/Florida\">Florida</a><a></a></td><td>300 kg</td></tr><tr><td>2</td><td><a href=\"https://en.wikipedia.org/wiki/Texas\">Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href=\"https://en.wikipedia.org/wiki/Florida\">Florida</a><a> </a></td><td>80 kg</td></tr></tbody></table></body>,\n",
" <table><tbody><tr><td id=\"flight\">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href=\"https://en.wikipedia.org/wiki/Florida\">Florida</a><a></a></td><td>300 kg</td></tr><tr><td>2</td><td><a href=\"https://en.wikipedia.org/wiki/Texas\">Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href=\"https://en.wikipedia.org/wiki/Florida\">Florida</a><a> </a></td><td>80 kg</td></tr></tbody></table>,\n",
" <tbody><tr><td id=\"flight\">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href=\"https://en.wikipedia.org/wiki/Florida\">Florida</a><a></a></td><td>300 kg</td></tr><tr><td>2</td><td><a href=\"https://en.wikipedia.org/wiki/Texas\">Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href=\"https://en.wikipedia.org/wiki/Florida\">Florida</a><a> </a></td><td>80 kg</td></tr></tbody>,\n",
" <tr><td id=\"flight\">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>,\n",
" <td id=\"flight\">Flight No</td>,\n",
" <td>Launch site</td>,\n",
" <td>Payload mass</td>,\n",
" <tr> <td>1</td><td><a href=\"https://en.wikipedia.org/wiki/Florida\">Florida</a><a></a></td><td>300 kg</td></tr>,\n",
" <td>1</td>,\n",
" <td><a href=\"https://en.wikipedia.org/wiki/Florida\">Florida</a><a></a></td>,\n",
" <a></a>,\n",
" <td>300 kg</td>,\n",
" <tr><td>2</td><td><a href=\"https://en.wikipedia.org/wiki/Texas\">Texas</a></td><td>94 kg</td></tr>,\n",
" <td>2</td>,\n",
" <td><a href=\"https://en.wikipedia.org/wiki/Texas\">Texas</a></td>,\n",
" <td>94 kg</td>,\n",
" <tr><td>3</td><td><a href=\"https://en.wikipedia.org/wiki/Florida\">Florida</a><a> </a></td><td>80 kg</td></tr>,\n",
" <td>3</td>,\n",
" <td><a href=\"https://en.wikipedia.org/wiki/Florida\">Florida</a><a> </a></td>,\n",
" <a> </a>,\n",
" <td>80 kg</td>]"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"table_bs.find_all(href=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details><summary>Click here for the solution</summary>\n",
"\n",
"```\n",
"table_bs.find_all(href=False)\n",
"\n",
"```\n",
"\n",
"</details>\n"
]
},
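{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that <code>href=False</code> matches every tag that lacks an <code>href</code> attribute, which is why the result above includes <code>html</code>, <code>body</code>, and the table elements. As a sketch (the fragment and the names <code>links_html</code> and <code>links_soup</code> are assumptions), combining it with a tag name narrows the search to anchors only:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup\n",
"\n",
"# Hypothetical fragment: one anchor with an href and one without\n",
"links_html = \"<p><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a><a>no link</a></p>\"\n",
"links_soup = BeautifulSoup(links_html, 'html.parser')\n",
"\n",
"# Restrict the search to <a> tags that have no href attribute\n",
"print(links_soup.find_all('a', href=False))"
]
},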
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the soup object <code>soup</code>, find the element whose <code>id</code> attribute is set to <code>\"boldest\"</code>:\n"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[<b id=\"boldest\">Lebron James</b>]"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"soup.find_all(id=\"boldest\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details><summary>Click here for the solution</summary>\n",
"\n",
"```\n",
"soup.find_all(id=\"boldest\")\n",
"\n",
"```\n",
"\n",
"</details>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### string\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With the <code>string</code> argument you can search for strings instead of tags. Here we find all the strings that are exactly <code>Florida</code>:\n"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Florida', 'Florida']"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"table_bs.find_all(string=\"Florida\")"
]
},
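{
"cell_type": "markdown",
"metadata": {},
"source": [
"A plain string filter matches only text that is exactly equal to it, so a cell containing <code>Florida </code> with a trailing space would be missed. The sketch below, an illustration with the assumed names <code>sites_html</code> and <code>sites_soup</code>, passes a compiled regular expression, which Beautiful Soup also accepts as a filter, to match any text containing the word:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"from bs4 import BeautifulSoup\n",
"\n",
"# Hypothetical row: the second cell has a trailing space after the word\n",
"sites_html = \"<table><tr><td>Florida</td><td>Florida </td><td>Texas</td></tr></table>\"\n",
"sites_soup = BeautifulSoup(sites_html, 'html.parser')\n",
"\n",
"# Exact string match: only the cell without the trailing space\n",
"print(sites_soup.find_all(string='Florida'))\n",
"\n",
"# Regular-expression match: any text node that contains the word\n",
"print(sites_soup.find_all(string=re.compile('Florida')))"
]
},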
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## find\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The <code>find_all()</code> method scans the entire document looking for results. If you are looking for only one element, you can use the <code>find()</code> method to find the first matching element in the document. Consider the following two tables:\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<h3>Rocket Launch </h3>\n",
"\n",
"<p>\n",
"<table class='rocket'>\n",
" <tr>\n",
" <td>Flight No</td>\n",
" <td>Launch site</td> \n",
" <td>Payload mass</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>Florida</td>\n",
" <td>300 kg</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>Texas</td>\n",
" <td>94 kg</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>Florida </td>\n",
" <td>80 kg</td>\n",
" </tr>\n",
"</table>\n",
"</p>\n",
"<p>\n",
"\n",
"<h3>Pizza Party </h3>\n",
" \n",
" \n",
"<table class='pizza'>\n",
" <tr>\n",
" <td>Pizza Place</td>\n",
" <td>Orders</td> \n",
" <td>Slices </td>\n",
" </tr>\n",
" <tr>\n",
" <td>Domino's Pizza</td>\n",
" <td>10</td>\n",
" <td>100</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Little Caesars</td>\n",
" <td>12</td>\n",
" <td >144 </td>\n",
" </tr>\n",
" <tr>\n",
" <td>Papa John's </td>\n",
" <td>15 </td>\n",
" <td>165</td>\n",
" </tr>\n"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%%html\n",
"<h3>Rocket Launch </h3>\n",
"\n",
"<p>\n",
"<table class='rocket'>\n",
" <tr>\n",
" <td>Flight No</td>\n",
" <td>Launch site</td> \n",
" <td>Payload mass</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>Florida</td>\n",
" <td>300 kg</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>Texas</td>\n",
" <td>94 kg</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>Florida </td>\n",
" <td>80 kg</td>\n",
" </tr>\n",
"</table>\n",
"</p>\n",
"<p>\n",
"\n",
"<h3>Pizza Party </h3>\n",
" \n",
" \n",
"<table class='pizza'>\n",
" <tr>\n",
" <td>Pizza Place</td>\n",
" <td>Orders</td> \n",
" <td>Slices </td>\n",
" </tr>\n",
" <tr>\n",
" <td>Domino's Pizza</td>\n",
" <td>10</td>\n",
" <td>100</td>\n",
" </tr>\n",
" <tr>\n",
" <td>Little Caesars</td>\n",
" <td>12</td>\n",
" <td >144 </td>\n",
" </tr>\n",
" <tr>\n",
" <td>Papa John's </td>\n",
" <td>15 </td>\n",
" <td>165</td>\n",
" </tr>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We store the HTML as a Python string and assign it to <code>two_tables</code>:\n"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"two_tables=\"<h3>Rocket Launch </h3><p><table class='rocket'><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table></p><p><h3>Pizza Party </h3><table class='pizza'><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td >144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr>\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We create a <code>BeautifulSoup</code> object named <code>two_tables_bs</code>:\n"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [],
"source": [
"two_tables_bs= BeautifulSoup(two_tables, 'html.parser')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can find the first table using the tag name <code>table</code>:\n"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<table class=\"rocket\"><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table>"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"two_tables_bs.find(\"table\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can filter on the class attribute to find the second table. Because <code>class</code> is a reserved keyword in Python, the argument is written with a trailing underscore, <code>class_</code>:\n"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<table class=\"pizza\"><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td>144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr></table>"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"two_tables_bs.find(\"table\",class_='pizza')"
]
},
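{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see how <code>find()</code> relates to <code>find_all()</code>, the sketch below uses a shortened, made-up stand-in for the two tables (the <code>mini_tables</code> string and the <code>sushi</code> class are assumptions for illustration). It also shows the <code>attrs</code> dictionary as an alternative to <code>class_</code>, and that <code>find()</code> returns <code>None</code> when nothing matches:\n",
"\n",
"```python\n",
"from bs4 import BeautifulSoup\n",
"\n",
"# Hypothetical, shortened stand-in for the two tables above\n",
"mini_tables = (\n",
"    \"<table class='rocket'><tr><td>Florida</td></tr></table>\"\n",
"    \"<table class='pizza'><tr><td>Little Caesars</td></tr></table>\"\n",
")\n",
"mini_bs = BeautifulSoup(mini_tables, 'html.parser')\n",
"\n",
"first = mini_bs.find('table')                           # first <table> in the document\n",
"pizza = mini_bs.find('table', class_='pizza')           # keyword form\n",
"same = mini_bs.find('table', attrs={'class': 'pizza'})  # dictionary form\n",
"missing = mini_bs.find('table', class_='sushi')         # no such table\n",
"all_tables = mini_bs.find_all('table')                  # list of every <table>\n",
"\n",
"print(first['class'], pizza == same, missing, len(all_tables))\n",
"```\n",
"\n",
"Checking the result of <code>find()</code> against <code>None</code> before using it avoids an <code>AttributeError</code> on pages where the element is absent.\n"
]
},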
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 id=\"DSCW\">Downloading And Scraping The Contents Of A Web Page</h2> \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We first define the URL of the web page to download:\n"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [],
"source": [
"url = \"http://www.ibm.com\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use <code>requests.get</code> to download the contents of the web page in text format and store it in a variable called <code>data</code>:\n"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [],
"source": [
"data = requests.get(url).text "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We create a <code>BeautifulSoup</code> object using the <code>BeautifulSoup</code> constructor:\n"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [],
"source": [
"soup = BeautifulSoup(data,\"html5lib\") # create a soup object using the variable 'data'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Scrape all links\n"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"https://www.ibm.com/ca/en\n",
"\n",
"https://www.ibm.com/ca-en/products?lnk=hpmps_bupr_caen&lnk2=link\n",
"https://www.ibm.com/industries?lnk=hpmps_buin_caen&lnk2=link\n",
"https://www.ibm.com/artificial-intelligence?lnk=hpmps_buai_caen&lnk2=link\n",
"https://www.ibm.com/automation?lnk=hpmps_buau_caen&lnk2=link\n",
"https://www.ibm.com/blockchain?lnk=hpmps_bubc_caen&lnk2=link\n",
"https://www.ibm.com/business-operations?lnk=hpmps_buop_caen&lnk2=link\n",
"https://www.ibm.com/cloud?lnk=hpmps_bucl_caen&lnk2=link\n",
"https://www.ibm.com/analytics?lnk=hpmps_buda_caen&lnk2=link\n",
"https://www.ibm.com/it-infrastructure?lnk=hpmps_buit_caen&lnk2=link\n",
"https://www.ibm.com/security?lnk=hpmps_buse_caen&lnk2=link\n",
"https://www.ibm.com/supply-chain?lnk=hpmps_busc_caen&lnk2=link\n",
"https://www.ibm.com/financing?lnk=hpmps_bufi_caen&lnk2=link\n",
"https://www.ibm.com/ca-en/products?lnk=hpmps_buall_caen&lnk2=link\n",
"\n",
"https://www.ibm.com/services/process?lnk=hpmsc_bups_caen&lnk2=link\n",
"https://www.ibm.com/services/ibmix/?lnk=hpmsc_budbs_caen&lnk2=link\n",
"https://www.ibm.com/services/cloud?lnk=hpmsc_buhs?lnk=hpmsc_buhs_caen\n",
"https://www.ibm.com/talent-management?lnk=hpmsc_buta_caen&lnk2=link\n",
"https://www.ibm.com/services/applications?lnk=hpmsc_buas_caen&lnk2=link\n",
"https://www.ibm.com/garage?lnk=hpmsc_buas_caen&lnk2=link\n",
"https://www.ibm.com/security/services?lnk=hpmsc_buse_caen&lnk2=link\n",
"https://www.ibm.com/services/technology-support?lnk=hpmsc_busv_caen&lnk2=link\n",
"https://www.ibm.com/financing/solutions/it-services-financing?lnk=hpmsc_bufi_caen&lnk2=link\n",
"https://www.ibm.com/services?lnk=hpmsc_buall_caen&lnk2=link\n",
"\n",
"https://www.ibm.com/support/ca/en/?lnk=hpmls_busu_caen&lnk2=link\n",
"https://www.ibm.com/support/knowledgecenter/?lnk=hpmls_budc_caen&lnk2=link\n",
"https://developer.ibm.com/?lnk=hpmls_bude_caen&lnk2=link\n",
"https://www.ibm.com/training/?lnk=hpmls_butr_caen&lnk2=link\n",
"https://www.ibm.com/blogs/?lnk=hpmls_bure_caen&lnk2=link\n",
"https://www.ibm.com/cloud/learn?lnk=hpmls_buwi_caen&lnk2=link\n",
"\n",
"https://www.ibm.com/partnerworld/public?lnk=hpmex_bupa_caen&lnk2=link\n",
"https://www.research.ibm.com/?lnk=hpmex_bure_caen&lnk2=link\n",
"https://www.ibm.com/about?lnk=hpmex_buab_caen&lnk2=link\n",
"https://www.ibm.com/impact/covid-19?lnk=hpmex_buco_caen&lnk2=link\n",
"https://www.ibm.com/sitemap/ca/en\n",
"/ca-en/node/1706826\n",
"https://www.ibm.com/ca-en/employment/?lnk=hpv18l2\n",
"https://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/64/d3/IBM_Canada_Diversity_Leadspace_Desktop.jpg\n",
"https://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/ae/41/IBM_Canada_Diversity_Leadspace_mobile.jpg\n",
"https://1.dam.s81c.com/m/7401c562cef06861/original/Culinary_think.jpg\n",
"https://www.ibm.com/blogs/jobs/2021/03/03/ibm-canada-recognized-as-a-diversity-leader-for-the-third-straight-year-en-fr/?utm_medium=OSocial&utm_source=Blog&utm_content=000040UV&utm_term=10014626&utm_id=SocialLeadDHCADiversityAwardBlog-HP-EN%3Flnk%3Dhpv18l1\n",
"/taxonomy/term/85416\n",
"/ca-en/node/2384286\n",
"https://www.ibm.com/ca-en/events/emerge-smarter?lnk=hpv18nf\n",
"/taxonomy/term/85416\n",
"/ca-en/node/1706851\n",
"/ca-en/node/1706831\n",
"/ca-en/node/1706841\n",
"/ca-en/node/1706836\n",
"/ca-en/node/1706846\n",
"/taxonomy/term/85416\n",
"/ca-en/node/1706821\n",
"/taxonomy/term/85416\n",
"https://www.ibm.com/employment/ca/en/?lnk=fab\n",
"/ca-en/node/1706816\n",
"https://www.ibm.com/ca-en/products/offers-and-discounts?link=caenhpv18t5&lnk2=trial_mktpl_MPDISC\n",
"/ca-en/node/2281352\n",
"/ca-en/node/1706801\n",
"/ca-en/node/2281350\n",
"/ca-en/node/1706796\n",
"/taxonomy/term/85416\n",
"/ca-en/node/1706866\n",
"/ca-en/node/1706861\n",
"/taxonomy/term/85416\n",
"/ca-en/products\n",
"https://www.ibm.com/blogs/jobs/2021/03/03/ibm-canada-recognized-as-a-diversity-leader-for-the-third-straight-year-en-fr/?utm_medium=OSocial&utm_source=Blog&utm_content=000040UV&utm_term=10014626&utm_id=SocialLeadDHCADiversityAwardBlog-HP-EN%3Flnk%3Dhpv18l1\n",
"https://www.ibm.com/ca-en/employment/?lnk=hpv18l2\n",
"https://www.ibm.com/blogs/jobs/2021/03/03/ibm-canada-recognized-as-a-diversity-leader-for-the-third-straight-year-en-fr/?utm_medium=OSocial&utm_source=Blog&utm_content=000040UV&utm_term=10014626&utm_id=SocialLeadDHCADiversityAwardBlog-HP-EN%3Flnk%3Dhpv18l1\n",
"https://www.ibm.com/ca-en/employment/?lnk=hpv18l2\n",
"https://www.ibm.com/ca-en/events/emerge-smarter?lnk=hpv18nf\n",
"https://www.ibm.com/blogs/ibm-canada/2021/02/the-digital-bank-of-the-future-platforms-technology-and-culture/?utm_medium=OSocial&utm_source=Blog&utm_content=000040UV&utm_term=10014626&utm_id=SQUADEpiphanyDigitalBankofFutureBLOG-IBMCAWeb%3Flnk%3Dhpv18f1\n",
"https://newsroom.ibm.com/2021-02-16-IBM-Commits-To-Net-Zero-Greenhouse-Gas-Emissions-By-2030?utm_source=ibmorg_landingpage&utm_medium=ibmorg&utm_campaign=netzero-2030&_ga=2.207757339.846999529.1613753549-810398125.1613456448%3Flnk%3Dhpv18f2\n",
"https://event.on24.com/eventRegistration/EventLobbyServlet?target=reg20.jsp&referrer=&eventid=3003041&sessionid=1&key=687B66FDDF6E77B0FF7D50408A7F8183&regTag=&V2=false&sourcepage=register&utm_medium=OSocial&utm_source=Blog&utm_content=000040UV&utm_term=10014626&utm_id=DHSocialIBMCloudStatEventMar9-Homepage%3Flnk%3Dhpv18f2&_ga=2.129964812.10980677.1614281408-472146828.1614281408%3Flnk%3Dhpv18f3\n",
"https://www.ibm.com/blogs/ibm-canada/2021/02/bridging-the-digital-divide-digital-transformation-across-the-non-profit-sector/?lnk=hpv18f4\n",
"https://www.ibm.com/ca-en/products/offers-and-discounts?link=caenhpv18t5&lnk2=trial_mktpl_MPDISC\n",
"https://www.ibm.com/garage?lnk=STW_US_HPT_T1_BLK&psrc=NONE&pexp=DEF&lnk2=trial_CloudGarage\n",
"https://www.ibm.com/ca-en/products/maximo?lnk=STW_CA_HPT_T2_BLK&psrc=NONE&pexp=DEF&lnk2=trial_Maximo\n",
"https://www.ibm.com/ca-en/products/blueworkslive?lnk=STW_CA_HPT_T3_BLK&psrc=NONE&pexp=DEF&lnk2=trial_BluLive\n",
"https://www.ibm.com/ca-en/cloud/watson-personality-insights?lnk=STW_CA_HPT_T4_BLK&psrc=NONE&pexp=DEF&lnk2=trial_WatPersonInsight\n",
"https://www.ibm.com/search?lang=en&amp;cc=ca&amp;q=\n",
"/ca-en/products\n",
"/ca-en/products\n",
"//www.ibm.com/ca-en/products/category/technology/analytics\n",
"//www.ibm.com/ca-en/products/category/technology/cloud-computing\n",
"//www.ibm.com/ca-en/products/category/technology/mobile-technology\n",
"//www.ibm.com/ca-en/products/category/technology/cognitive-computing-and-AI\n",
"//www.ibm.com/ca-en/products/category/technology/IT-infrastructure\n",
"//www.ibm.com/ca-en/products/category/technology/security\n",
"//www.ibm.com/ca-en/products/category/technology/blockchain\n",
"//www.ibm.com/ca-en/products/category/technology/IT-management\n",
"//www.ibm.com/ca-en/products/category/technology/software-development\n",
"//www.ibm.com/ca-en/products/category/technology/analytics\n",
"//www.ibm.com/ca-en/products/category/technology/cloud-computing\n",
"//www.ibm.com/ca-en/products/category/technology/mobile-technology\n",
"//www.ibm.com/ca-en/products/category/technology/cognitive-computing-and-AI\n",
"//www.ibm.com/ca-en/products/category/technology/IT-infrastructure\n",
"//www.ibm.com/ca-en/products/category/technology/security\n",
"//www.ibm.com/ca-en/products/category/technology/blockchain\n",
"//www.ibm.com/ca-en/products/category/technology/IT-management\n",
"//www.ibm.com/ca-en/products/category/technology/software-development\n",
"//www.ibm.com/ca-en/products/category/business/business-operations\n",
"//www.ibm.com/ca-en/products/category/business/content-management\n",
"//www.ibm.com/ca-en/products/category/business/human-resources\n",
"//www.ibm.com/ca-en/products/category/business/collaboration\n",
"//www.ibm.com/ca-en/products/category/business/customer-service-and-CRM\n",
"//www.ibm.com/ca-en/products/category/business/marketing\n",
"//www.ibm.com/ca-en/products/category/business/commerce\n",
"//www.ibm.com/ca-en/products/category/business/finance\n",
"//www.ibm.com/ca-en/products/category/business/supply-chain-management\n",
"https://www.ibm.com/ca-en/products/aspera?lnk=hpv18cs1\n",
"https://www.ibm.com/ca-en/products/aspera?lnk=hpv18cs1\n",
"https://www.ibm.com/ca-en/products/aspera?lnk=hpv18cs1\n",
"##\n",
"https://www.ibm.com/ca-en/products/aspera?lnk=hpv18cs1\n",
"https://www.ibm.com/ca-en/products/aspera?lnk=hpv18cs1\n",
"##\n",
"https://www.ibm.com/ca-en/products/aspera?lnk=hpv18cs1\n",
"https://www.ibm.com/ca-en/products/aspera?lnk=hpv18cs1\n",
"https://www.ibm.com/ca-en/products/aspera?lnk=hpv18cs1\n",
"https://www.ibm.com/employment/ca/en/?lnk=fab\n",
"https://www.ibm.com/employment/ca/en/?lnk=fab\n",
"https://www.ibm.com/ca-en/products?lnk=fdi-caen\n",
"https://www.ibm.com/services/en-us/?lnk=fdi\n",
"https://www.ibm.com/industries/en-us/?lnk=fdi\n",
"https://www.ibm.com/case-studies?lnk=fdi\n",
"https://www.ibm.com/partnerworld/wps/servlet/ContentHandler/partnerworld-home?lnk=mdev_pw_caen&lnk2=learn\n",
"https://www.ibm.com/support/home/?lnk=msu_usen\n",
"https://www.ibm.com/partnerworld/wps/bplocator/search.jsp?lnk=fcw-caen\n",
"https://www.ibm.com/employment/ca/en/?lnk=fab_caen\n",
"https://www.ibm.com/news/ca/en/?lnk=fab-caen\n",
"https://www.ibm.com/blogs/ibm-canada/?lnk=fab-caen\n",
"https://www.ibm.com/investor/?lnk=fab-caen\n",
"https://www.ibm.com/ibm/responsibility/?lnk=fab-caen\n",
"https://www.ibm.com/ca-en/about\n",
"https://www.ibm.com/contact/ca/en/?lnk=flg-cont-caen\n",
"https://www.ibm.com/privacy/ca/en/?lnk=flg-priv-caen\n",
"https://www.ibm.com/ca-en/legal?lnk=flg-tous-caen\n",
"https://www.ibm.com/accessibility/ca/en/?lnk=flg-acce-caen\n",
"#\n"
]
}
],
"source": [
"for link in soup.find_all('a',href=True): # in html anchor/link is represented by the tag <a>\n",
"\n",
" print(link.get('href'))\n"
]
},
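{
"cell_type": "markdown",
"metadata": {},
"source": [
"Several of the links printed above are relative (<code>/ca-en/products</code>) or protocol-relative (<code>//www.ibm.com/...</code>). One way to turn them into absolute URLs is <code>urljoin</code> from the Python standard library. This is a minimal sketch: the <code>base_url</code> value is an assumption, and the <code>hrefs</code> list holds a few values copied from the output above:\n",
"\n",
"```python\n",
"from urllib.parse import urljoin\n",
"\n",
"# Assumed base; in practice use the final URL of your own request\n",
"base_url = 'https://www.ibm.com/ca-en'\n",
"\n",
"hrefs = [\n",
"    'https://www.ibm.com/sitemap/ca/en',  # already absolute\n",
"    '/ca-en/products',                    # relative to the site root\n",
"    '//www.ibm.com/ca-en/products/category/technology/security',  # protocol-relative\n",
"]\n",
"\n",
"# urljoin leaves absolute URLs alone and resolves the others against the base\n",
"absolute = [urljoin(base_url, href) for href in hrefs]\n",
"for link in absolute:\n",
"    print(link)\n",
"```\n"
]
},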
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Scrape all image tags\n"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<img alt=\"CEO Study Graphic\" class=\"ibm-resize\" src=\"https://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/ae/41/IBM_Canada_Diversity_Leadspace_mobile.jpg\"/>\n",
"https://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/ae/41/IBM_Canada_Diversity_Leadspace_mobile.jpg\n",
"<img alt=\"Young woman paying with her phone at a restaurant\" class=\"ibm-resize ibm-flex\" height=\"170\" src=\"https://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/82/21/Digital_Bank_of_Future_Card.jpg\" width=\"300\"/>\n",
"https://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/82/21/Digital_Bank_of_Future_Card.jpg\n",
"<img alt=\"Aerial view of a healthy forrest\" class=\"ibm-resize ibm-flex\" height=\"170\" src=\"https://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/5c/4c/Net%20Zero%20Commitment%20CA%20Card.jpg\" width=\"300\"/>\n",
"https://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/5c/4c/Net%20Zero%20Commitment%20CA%20Card.jpg\n",
"<img alt=\"Connected network graphic\" class=\"ibm-resize ibm-flex\" height=\"170\" src=\"https://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/be/19/IBM%20Cloud%20Satellite%20Launch%20CA%20card.jpg\" width=\"300\"/>\n",
"https://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/be/19/IBM%20Cloud%20Satellite%20Launch%20CA%20card.jpg\n",
"<img alt=\"A group of volunteers working at a food drive\" class=\"ibm-resize ibm-flex\" height=\"170\" src=\"https://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/68/cb/Digital%20tranformation%20non%20profit%20blog%20card.jpg\" width=\"300\"/>\n",
"https://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/68/cb/Digital%20tranformation%20non%20profit%20blog%20card.jpg\n",
"<img alt=\"IBM Garage screenshot\" class=\"ibm-resize ibm-flex\" height=\"170\" src=\"https://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/3d/1e/3d1ee562-911a-4cdb-b767127ecf27ddce.png\" width=\"300\"/>\n",
"https://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/3d/1e/3d1ee562-911a-4cdb-b767127ecf27ddce.png\n",
"<img alt=\"IBM Maximo screenshot\" class=\"ibm-resize ibm-flex\" height=\"170\" src=\"https://1.dam.s81c.com/m/202276edc8f3b974/original/Maximo-20423-700x420.png\" width=\"300\"/>\n",
"https://1.dam.s81c.com/m/202276edc8f3b974/original/Maximo-20423-700x420.png\n",
"<img alt=\"IBM Blueworks Live screenshot\" class=\"ibm-resize ibm-flex\" height=\"170\" src=\"https://1.dam.s81c.com/m/268fdd27d9144f6b/original/23582-blueworks-live-700x420.png\" width=\"300\"/>\n",
"https://1.dam.s81c.com/m/268fdd27d9144f6b/original/23582-blueworks-live-700x420.png\n",
"<img alt=\"Watson Personality Insights screenshot\" class=\"ibm-resize ibm-flex\" height=\"170\" src=\"https://1.dam.s81c.com/m/3b5864e508c1dac2/original/Watson_Personality_Insights_700x420.png\" width=\"300\"/>\n",
"https://1.dam.s81c.com/m/3b5864e508c1dac2/original/Watson_Personality_Insights_700x420.png\n",
"<img alt=\"Glowing network connected over the world graphic\" class=\"ibm-resize\" src=\"https://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/ff/13/IBM%20Aspera%20Product%20CA%20Card.jpg\"/>\n",
"https://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/ff/13/IBM%20Aspera%20Product%20CA%20Card.jpg\n",
"<img alt=\"Glowing network connected over the world graphic\" class=\"ibm-resize\" src=\"https://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/ff/13/IBM%20Aspera%20Product%20CA%20Card.jpg\"/>\n",
"https://1.dam.s81c.com/public/content/dam/worldwide-content/homepage/ul/g/ff/13/IBM%20Aspera%20Product%20CA%20Card.jpg\n"
]
}
],
"source": [
"for link in soup.find_all('img'):  # in html image is represented by the tag <img>\n",
" print(link)\n",
" print(link.get('src'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Scrape data from HTML tables\n"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [],
"source": [
"#The below url contains an html table with data about colors and color codes.\n",
"url = \"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before scraping a website, you need to examine its contents and the way the data is organized. Open the above URL in your browser and check how many rows and columns there are in the color table.\n"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [],
"source": [
"# get the contents of the webpage in text format and store in a variable called data\n",
"data = requests.get(url).text"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [],
"source": [
"soup = BeautifulSoup(data,\"html5lib\")"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [],
"source": [
"#find a html table in the web page\n",
"table = soup.find('table') # in html table is represented by the tag <table>"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Color Name--->None\n",
"lightsalmon--->#FFA07A\n",
"salmon--->#FA8072\n",
"darksalmon--->#E9967A\n",
"lightcoral--->#F08080\n",
"coral--->#FF7F50\n",
"tomato--->#FF6347\n",
"orangered--->#FF4500\n",
"gold--->#FFD700\n",
"orange--->#FFA500\n",
"darkorange--->#FF8C00\n",
"lightyellow--->#FFFFE0\n",
"lemonchiffon--->#FFFACD\n",
"papayawhip--->#FFEFD5\n",
"moccasin--->#FFE4B5\n",
"peachpuff--->#FFDAB9\n",
"palegoldenrod--->#EEE8AA\n",
"khaki--->#F0E68C\n",
"darkkhaki--->#BDB76B\n",
"yellow--->#FFFF00\n",
"lawngreen--->#7CFC00\n",
"chartreuse--->#7FFF00\n",
"limegreen--->#32CD32\n",
"lime--->#00FF00\n",
"forestgreen--->#228B22\n",
"green--->#008000\n",
"powderblue--->#B0E0E6\n",
"lightblue--->#ADD8E6\n",
"lightskyblue--->#87CEFA\n",
"skyblue--->#87CEEB\n",
"deepskyblue--->#00BFFF\n",
"lightsteelblue--->#B0C4DE\n",
"dodgerblue--->#1E90FF\n"
]
}
],
"source": [
"#Get all rows from the table\n",
"for row in table.find_all('tr'): # in html table row is represented by the tag <tr>\n",
" # Get all columns in each row.\n",
" cols = row.find_all('td') # in html a column is represented by the tag <td>\n",
" color_name = cols[2].string # store the value in column 3 as color_name\n",
" color_code = cols[3].string # store the value in column 4 as color_code\n",
" print(\"{}--->{}\".format(color_name,color_code))"
]
},
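{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first line of the output above ends in <code>None</code>. The <code>.string</code> attribute returns <code>None</code> whenever a tag has more than one child, which is a likely cause for a header cell. A hedged alternative is to read cells with <code>get_text()</code> and to skip rows that have too few cells. The sketch below runs on a small made-up table (the HTML, the <code>color_table</code> name, and the two-column layout are assumptions, not the real page):\n",
"\n",
"```python\n",
"from bs4 import BeautifulSoup\n",
"\n",
"# Hypothetical table: the header cell holds text plus a <br/>, so .string is None\n",
"html = (\n",
"    '<table>'\n",
"    '<tr><td>Color Name</td><td>Hex Code<br/>#RRGGBB</td></tr>'\n",
"    '<tr><td>salmon</td><td>#FA8072</td></tr>'\n",
"    '<tr><td>gold</td></tr>'\n",
"    '</table>'\n",
")\n",
"color_table = BeautifulSoup(html, 'html.parser').find('table')\n",
"\n",
"header_cell = color_table.find_all('td')[1]\n",
"\n",
"rows = []\n",
"for row in color_table.find_all('tr'):\n",
"    cols = row.find_all('td')\n",
"    if len(cols) < 2:  # skip incomplete rows instead of raising IndexError\n",
"        continue\n",
"    # get_text() collects all text inside the cell, even with nested tags\n",
"    rows.append((cols[0].get_text(strip=True), cols[1].get_text(strip=True)))\n",
"\n",
"print(header_cell.string)\n",
"print(rows)\n",
"```\n"
]
},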
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Scrape data from HTML tables into a DataFrame using BeautifulSoup and Pandas\n"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [],
"source": [
"#The below url contains html tables with data about world population.\n",
"url = \"https://en.wikipedia.org/wiki/World_population\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before scraping a website, you need to examine its contents and the way the data is organized. Open the above URL in your browser and check the tables on the web page.\n"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [],
"source": [
"# get the contents of the webpage in text format and store in a variable called data\n",
"data = requests.get(url).text"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [],
"source": [
"soup = BeautifulSoup(data,\"html5lib\")"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [],
"source": [
"#find all html tables in the web page\n",
"tables = soup.find_all('table') # in html table is represented by the tag <table>"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"25"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# we can see how many tables were found by checking the length of the tables list\n",
"len(tables)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Assume that we are looking for the `10 most densely populated countries` table. We can look through the tables list and pick the right one based on the data in each table, or we can search for the table name if it appears in the table, although this option might not always work.\n"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5\n"
]
}
],
"source": [
"for index,table in enumerate(tables):\n",
" if (\"10 most densely populated countries\" in str(table)):\n",
" table_index = index\n",
"print(table_index)"
]
},
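{
"cell_type": "markdown",
"metadata": {},
"source": [
"Searching the whole table text works here, but the name could also appear in a cell of some other table, and the loop leaves <code>table_index</code> undefined when nothing matches. A possible refinement, sketched on two made-up tables, is to look only inside each <code>caption</code> tag and return <code>None</code> when there is no match. The <code>find_table_index</code> helper and the sample HTML are hypothetical names introduced for illustration:\n",
"\n",
"```python\n",
"from bs4 import BeautifulSoup\n",
"\n",
"# Hypothetical page with two captioned tables\n",
"html = (\n",
"    '<table><caption>World population milestones</caption></table>'\n",
"    '<table><caption>10 most densely populated countries</caption></table>'\n",
")\n",
"sample_tables = BeautifulSoup(html, 'html.parser').find_all('table')\n",
"\n",
"def find_table_index(tables, title):\n",
"    # Return the index of the first table whose caption contains title, else None\n",
"    for index, table in enumerate(tables):\n",
"        caption = table.find('caption')\n",
"        if caption is not None and title in caption.get_text():\n",
"            return index\n",
"    return None\n",
"\n",
"found = find_table_index(sample_tables, '10 most densely populated countries')\n",
"missing = find_table_index(sample_tables, 'no such table')\n",
"print(found, missing)\n",
"```\n"
]
},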
{
"cell_type": "markdown",
"metadata": {},
"source": [
"See if you can locate the table name, `10 most densely populated countries`, in the output below.\n"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<table class=\"wikitable sortable\" style=\"text-align:right\">\n",
" <caption>\n",
" 10 most densely populated countries\n",
" <small>\n",
" (with population above 5 million)\n",
" </small>\n",
" </caption>\n",
" <tbody>\n",
" <tr>\n",
" <th>\n",
" Rank\n",
" </th>\n",
" <th>\n",
" Country\n",
" </th>\n",
" <th>\n",
" Population\n",
" </th>\n",
" <th>\n",
" Area\n",
" <br/>\n",
" <small>\n",
" (km\n",
" <sup>\n",
" 2\n",
" </sup>\n",
" )\n",
" </small>\n",
" </th>\n",
" <th>\n",
" Density\n",
" <br/>\n",
" <small>\n",
" (pop/km\n",
" <sup>\n",
" 2\n",
" </sup>\n",
" )\n",
" </small>\n",
" </th>\n",
" </tr>\n",
" <tr>\n",
" <td>\n",
" 1\n",
" </td>\n",
" <td align=\"left\">\n",
" <span class=\"flagicon\">\n",
" <img alt=\"\" class=\"thumbborder\" data-file-height=\"2880\" data-file-width=\"4320\" decoding=\"async\" height=\"15\" src=\"//upload.wikimedia.org/wikipedia/commons/thumb/4/48/Flag_of_Singapore.svg/23px-Flag_of_Singapore.svg.png\" srcset=\"//upload.wikimedia.org/wikipedia/commons/thumb/4/48/Flag_of_Singapore.svg/35px-Flag_of_Singapore.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/4/48/Flag_of_Singapore.svg/45px-Flag_of_Singapore.svg.png 2x\" width=\"23\"/>\n",
" </span>\n",
" <a href=\"/wiki/Singapore\" title=\"Singapore\">\n",
" Singapore\n",
" </a>\n",
" </td>\n",
" <td>\n",
" 5,704,000\n",
" </td>\n",
" <td>\n",
" 710\n",
" </td>\n",
" <td>\n",
" 8,033\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td>\n",
" 2\n",
" </td>\n",
" <td align=\"left\">\n",
" <span class=\"flagicon\">\n",
" <img alt=\"\" class=\"thumbborder\" data-file-height=\"600\" data-file-width=\"1000\" decoding=\"async\" height=\"14\" src=\"//upload.wikimedia.org/wikipedia/commons/thumb/f/f9/Flag_of_Bangladesh.svg/23px-Flag_of_Bangladesh.svg.png\" srcset=\"//upload.wikimedia.org/wikipedia/commons/thumb/f/f9/Flag_of_Bangladesh.svg/35px-Flag_of_Bangladesh.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/f/f9/Flag_of_Bangladesh.svg/46px-Flag_of_Bangladesh.svg.png 2x\" width=\"23\"/>\n",
" </span>\n",
" <a href=\"/wiki/Bangladesh\" title=\"Bangladesh\">\n",
" Bangladesh\n",
" </a>\n",
" </td>\n",
" <td>\n",
" 170,250,000\n",
" </td>\n",
" <td>\n",
" 143,998\n",
" </td>\n",
" <td>\n",
" 1,182\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td>\n",
" 3\n",
" </td>\n",
" <td align=\"left\">\n",
" <span class=\"flagicon\">\n",
" <img alt=\"\" class=\"thumbborder\" data-file-height=\"600\" data-file-width=\"900\" decoding=\"async\" height=\"15\" src=\"//upload.wikimedia.org/wikipedia/commons/thumb/5/59/Flag_of_Lebanon.svg/23px-Flag_of_Lebanon.svg.png\" srcset=\"//upload.wikimedia.org/wikipedia/commons/thumb/5/59/Flag_of_Lebanon.svg/35px-Flag_of_Lebanon.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/5/59/Flag_of_Lebanon.svg/45px-Flag_of_Lebanon.svg.png 2x\" width=\"23\"/>\n",
" </span>\n",
" <a href=\"/wiki/Lebanon\" title=\"Lebanon\">\n",
" Lebanon\n",
" </a>\n",
" </td>\n",
" <td>\n",
" 6,856,000\n",
" </td>\n",
" <td>\n",
" 10,452\n",
" </td>\n",
" <td>\n",
" 656\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td>\n",
" 4\n",
" </td>\n",
" <td align=\"left\">\n",
" <span class=\"flagicon\">\n",
" <img alt=\"\" class=\"thumbborder\" data-file-height=\"600\" data-file-width=\"900\" decoding=\"async\" height=\"15\" src=\"//upload.wikimedia.org/wikipedia/commons/thumb/7/72/Flag_of_the_Republic_of_China.svg/23px-Flag_of_the_Republic_of_China.svg.png\" srcset=\"//upload.wikimedia.org/wikipedia/commons/thumb/7/72/Flag_of_the_Republic_of_China.svg/35px-Flag_of_the_Republic_of_China.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/7/72/Flag_of_the_Republic_of_China.svg/45px-Flag_of_the_Republic_of_China.svg.png 2x\" width=\"23\"/>\n",
" </span>\n",
" <a href=\"/wiki/Taiwan\" title=\"Taiwan\">\n",
" Taiwan\n",
" </a>\n",
" </td>\n",
" <td>\n",
" 23,604,000\n",
" </td>\n",
" <td>\n",
" 36,193\n",
" </td>\n",
" <td>\n",
" 652\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td>\n",
" 5\n",
" </td>\n",
" <td align=\"left\">\n",
" <span class=\"flagicon\">\n",
" <img alt=\"\" class=\"thumbborder\" data-file-height=\"600\" data-file-width=\"900\" decoding=\"async\" height=\"15\" src=\"//upload.wikimedia.org/wikipedia/commons/thumb/0/09/Flag_of_South_Korea.svg/23px-Flag_of_South_Korea.svg.png\" srcset=\"//upload.wikimedia.org/wikipedia/commons/thumb/0/09/Flag_of_South_Korea.svg/35px-Flag_of_South_Korea.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/0/09/Flag_of_South_Korea.svg/45px-Flag_of_South_Korea.svg.png 2x\" width=\"23\"/>\n",
" </span>\n",
" <a href=\"/wiki/South_Korea\" title=\"South Korea\">\n",
" South Korea\n",
" </a>\n",
" </td>\n",
" <td>\n",
" 51,781,000\n",
" </td>\n",
" <td>\n",
" 99,538\n",
" </td>\n",
" <td>\n",
" 520\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td>\n",
" 6\n",
" </td>\n",
" <td align=\"left\">\n",
" <span class=\"flagicon\">\n",
" <img alt=\"\" class=\"thumbborder\" data-file-height=\"720\" data-file-width=\"1080\" decoding=\"async\" height=\"15\" src=\"//upload.wikimedia.org/wikipedia/commons/thumb/1/17/Flag_of_Rwanda.svg/23px-Flag_of_Rwanda.svg.png\" srcset=\"//upload.wikimedia.org/wikipedia/commons/thumb/1/17/Flag_of_Rwanda.svg/35px-Flag_of_Rwanda.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/1/17/Flag_of_Rwanda.svg/45px-Flag_of_Rwanda.svg.png 2x\" width=\"23\"/>\n",
" </span>\n",
" <a href=\"/wiki/Rwanda\" title=\"Rwanda\">\n",
" Rwanda\n",
" </a>\n",
" </td>\n",
" <td>\n",
" 12,374,000\n",
" </td>\n",
" <td>\n",
" 26,338\n",
" </td>\n",
" <td>\n",
" 470\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td>\n",
" 7\n",
" </td>\n",
" <td align=\"left\">\n",
" <span class=\"flagicon\">\n",
" <img alt=\"\" class=\"thumbborder\" data-file-height=\"600\" data-file-width=\"1000\" decoding=\"async\" height=\"14\" src=\"//upload.wikimedia.org/wikipedia/commons/thumb/5/56/Flag_of_Haiti.svg/23px-Flag_of_Haiti.svg.png\" srcset=\"//upload.wikimedia.org/wikipedia/commons/thumb/5/56/Flag_of_Haiti.svg/35px-Flag_of_Haiti.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/5/56/Flag_of_Haiti.svg/46px-Flag_of_Haiti.svg.png 2x\" width=\"23\"/>\n",
" </span>\n",
" <a href=\"/wiki/Haiti\" title=\"Haiti\">\n",
" Haiti\n",
" </a>\n",
" </td>\n",
" <td>\n",
" 11,578,000\n",
" </td>\n",
" <td>\n",
" 27,065\n",
" </td>\n",
" <td>\n",
" 428\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td>\n",
" 8\n",
" </td>\n",
" <td align=\"left\">\n",
" <span class=\"flagicon\">\n",
" <img alt=\"\" class=\"thumbborder\" data-file-height=\"600\" data-file-width=\"900\" decoding=\"async\" height=\"15\" src=\"//upload.wikimedia.org/wikipedia/commons/thumb/2/20/Flag_of_the_Netherlands.svg/23px-Flag_of_the_Netherlands.svg.png\" srcset=\"//upload.wikimedia.org/wikipedia/commons/thumb/2/20/Flag_of_the_Netherlands.svg/35px-Flag_of_the_Netherlands.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/2/20/Flag_of_the_Netherlands.svg/45px-Flag_of_the_Netherlands.svg.png 2x\" width=\"23\"/>\n",
" </span>\n",
" <a href=\"/wiki/Netherlands\" title=\"Netherlands\">\n",
" Netherlands\n",
" </a>\n",
" </td>\n",
" <td>\n",
" 17,570,000\n",
" </td>\n",
" <td>\n",
" 41,526\n",
" </td>\n",
" <td>\n",
" 423\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td>\n",
" 9\n",
" </td>\n",
" <td align=\"left\">\n",
" <span class=\"flagicon\">\n",
" <img alt=\"\" class=\"thumbborder\" data-file-height=\"800\" data-file-width=\"1100\" decoding=\"async\" height=\"15\" src=\"//upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Flag_of_Israel.svg/21px-Flag_of_Israel.svg.png\" srcset=\"//upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Flag_of_Israel.svg/32px-Flag_of_Israel.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Flag_of_Israel.svg/41px-Flag_of_Israel.svg.png 2x\" width=\"21\"/>\n",
" </span>\n",
" <a href=\"/wiki/Israel\" title=\"Israel\">\n",
" Israel\n",
" </a>\n",
" </td>\n",
" <td>\n",
" 9,320,000\n",
" </td>\n",
" <td>\n",
" 22,072\n",
" </td>\n",
" <td>\n",
" 422\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td>\n",
" 10\n",
" </td>\n",
" <td align=\"left\">\n",
" <span class=\"flagicon\">\n",
" <img alt=\"\" class=\"thumbborder\" data-file-height=\"900\" data-file-width=\"1350\" decoding=\"async\" height=\"15\" src=\"//upload.wikimedia.org/wikipedia/en/thumb/4/41/Flag_of_India.svg/23px-Flag_of_India.svg.png\" srcset=\"//upload.wikimedia.org/wikipedia/en/thumb/4/41/Flag_of_India.svg/35px-Flag_of_India.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/4/41/Flag_of_India.svg/45px-Flag_of_India.svg.png 2x\" width=\"23\"/>\n",
" </span>\n",
" <a href=\"/wiki/India\" title=\"India\">\n",
" India\n",
" </a>\n",
" </td>\n",
" <td>\n",
" 1,373,980,000\n",
" </td>\n",
" <td>\n",
" 3,287,240\n",
" </td>\n",
" <td>\n",
" 418\n",
" </td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"\n"
]
}
],
"source": [
"print(tables[table_index].prettify())"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Rank</th>\n",
" <th>Country</th>\n",
" <th>Population</th>\n",
" <th>Area</th>\n",
" <th>Density</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>Singapore</td>\n",
" <td>5,704,000</td>\n",
" <td>710</td>\n",
" <td>8,033</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>Bangladesh</td>\n",
" <td>170,250,000</td>\n",
" <td>143,998</td>\n",
" <td>1,182</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>Lebanon</td>\n",
" <td>6,856,000</td>\n",
" <td>10,452</td>\n",
" <td>656</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>Taiwan</td>\n",
" <td>23,604,000</td>\n",
" <td>36,193</td>\n",
" <td>652</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>South Korea</td>\n",
" <td>51,781,000</td>\n",
" <td>99,538</td>\n",
" <td>520</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>6</td>\n",
" <td>Rwanda</td>\n",
" <td>12,374,000</td>\n",
" <td>26,338</td>\n",
" <td>470</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>7</td>\n",
" <td>Haiti</td>\n",
" <td>11,578,000</td>\n",
" <td>27,065</td>\n",
" <td>428</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>8</td>\n",
" <td>Netherlands</td>\n",
" <td>17,570,000</td>\n",
" <td>41,526</td>\n",
" <td>423</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>9</td>\n",
" <td>Israel</td>\n",
" <td>9,320,000</td>\n",
" <td>22,072</td>\n",
" <td>422</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>10</td>\n",
" <td>India</td>\n",
" <td>1,373,980,000</td>\n",
" <td>3,287,240</td>\n",
" <td>418</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Rank Country Population Area Density\n",
"0 1  Singapore 5,704,000 710 8,033\n",
"1 2  Bangladesh 170,250,000 143,998 1,182\n",
"2 3  Lebanon 6,856,000 10,452 656\n",
"3 4  Taiwan 23,604,000 36,193 652\n",
"4 5  South Korea 51,781,000 99,538 520\n",
"5 6  Rwanda 12,374,000 26,338 470\n",
"6 7  Haiti 11,578,000 27,065 428\n",
"7 8  Netherlands 17,570,000 41,526 423\n",
"8 9  Israel 9,320,000 22,072 422\n",
"9 10  India 1,373,980,000 3,287,240 418"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
   ],
   "source": [
    "# Collect one dict per data row and build the DataFrame once at the end;\n",
    "# DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0.\n",
    "rows = []\n",
    "\n",
    "for row in tables[table_index].tbody.find_all(\"tr\"):\n",
    "    col = row.find_all(\"td\")\n",
    "    if col:  # header rows hold only <th> cells, so this list is empty for them\n",
    "        rows.append({\n",
    "            \"Rank\": col[0].text,\n",
    "            \"Country\": col[1].text,\n",
    "            \"Population\": col[2].text.strip(),\n",
    "            \"Area\": col[3].text.strip(),\n",
    "            \"Density\": col[4].text.strip(),\n",
    "        })\n",
    "\n",
    "population_data = pd.DataFrame(rows, columns=[\"Rank\", \"Country\", \"Population\", \"Area\", \"Density\"])\n",
    "population_data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Scrape data from HTML tables into a DataFrame using BeautifulSoup and read_html\n"
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the same `url`, `data`, `soup`, and `tables` objects as in the last section, we can use the `read_html` function to create a DataFrame.\n",
    "\n",
    "Remember that the table we need is located in `tables[table_index]`.\n",
    "\n",
    "We pass the `pandas` function `read_html` the string version of the table, and set `flavor` to `bs4` so that BeautifulSoup is used as the parsing engine.\n"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[ Rank Country Population Area(km2) Density(pop/km2)\n",
" 0 1 Singapore 5704000 710 8033\n",
" 1 2 Bangladesh 170250000 143998 1182\n",
" 2 3 Lebanon 6856000 10452 656\n",
" 3 4 Taiwan 23604000 36193 652\n",
" 4 5 South Korea 51781000 99538 520\n",
" 5 6 Rwanda 12374000 26338 470\n",
" 6 7 Haiti 11578000 27065 428\n",
" 7 8 Netherlands 17570000 41526 423\n",
" 8 9 Israel 9320000 22072 422\n",
" 9 10 India 1373980000 3287240 418]"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
   ],
   "source": [
    "pd.read_html(str(tables[table_index]), flavor='bs4')"
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The function `read_html` always returns a list of DataFrames, so we must pick the one we want out of the list. Here the list holds a single DataFrame, so we take index `0`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "population_data_read_html = pd.read_html(str(tables[table_index]), flavor='bs4')[0]\n",
    "\n",
    "population_data_read_html"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Scrape data from HTML tables into a DataFrame using read_html\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also use the `read_html` function to directly get DataFrames from a `url`.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dataframe_list = pd.read_html(url, flavor='bs4')"
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see there are 25 DataFrames, just as when we used `find_all` on the `soup` object.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"len(dataframe_list)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally we can pick the DataFrame we need out of the list.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dataframe_list[5]"
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also use the `match` parameter to select the specific table we want. `match` takes a string or regular expression, and only tables containing text that matches it are returned.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pd.read_html(url, match=\"10 most densely populated countries\", flavor='bs4')[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Authors\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ramesh Sannareddy\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Other Contributors\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rav Ahuja\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Change Log\n"
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "| Date (YYYY-MM-DD) | Version | Changed By           | Change Description                 |\n",
    "| ----------------- | ------- | -------------------- | ---------------------------------- |\n",
    "| 2020-10-17        | 0.1     | Joseph Santarcangelo | Created initial version of the lab |\n"
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright © 2020 IBM Corporation. This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license).\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python",
"language": "python",
"name": "conda-env-python-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}