@nick3499
Last active December 3, 2020 21:02
BeautifulSoup HTML parser
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "beautifulsoup_html_parsing.ipynb",
"provenance": [],
"authorship_tag": "ABX9TyPlGIPxb4eNIrUwlhgQwDut",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/nick3499/90c021a5faab85b94a6df58e59a32b7c/beautifulsoup_html_parsing.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "05IvMHnCRX3a"
},
"source": [
"## HTML Data"
]
},
{
"cell_type": "code",
"metadata": {
"id": "qey1pYR4RMbv"
},
"source": [
"html_doc = '''\n",
"<html>\n",
" <head>\n",
" <title>Searching Tree</title>\n",
" </head>\n",
" <body>\n",
" <h1>Searching Parse Tree In BeautifulSoup</h1>\n",
" <p class=\"Main\">Learning\n",
" <a class=\"language\" href=\"https://docs.python.org/3/\" id=\"python\">Python</a>, \n",
" <a class=\"language\" href=\"https://docs.oracle.com/en/java/\" id=\"java\">Java</a> and \n",
" <a class=\"language\" href=\"https://golang.org/doc/\" id=\"golang\">Golang</a> are fun!\n",
" </p>\n",
" <p class=\"Secondary\">\n",
" <b>Please subscribe!</b>\n",
" </p>\n",
" <p class=\"Secondary\" id=\"finxter\">\n",
" <b>copyright - FINXTER</b>\n",
" </p>\n",
" </body>\n",
"</html>\n",
"'''"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "SIxzJDWpg2w6"
},
"source": [
"## Soup Method Docs"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-wcoqufdenv6"
},
"source": [
"The docstring (`__doc__`) of the `find()` method reads:\n",
"\n",
"'''Return only the first child of this Tag matching the given criteria.'''\n",
"\n",
"The docstring of the `find_all()` method reads:\n",
"\n",
"'''Extracts a list of Tag objects that match the given criteria. You can specify the name of the Tag and any attributes you want the Tag to have.\n",
"\n",
"The value of a key-value pair in the 'attrs' map can be a string, a list of strings, a regular expression object, or a callable that takes a string and returns whether or not the string matches for some custom definition of 'matches'. The same is true of the tag name.'''"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nEbh4y4fRdLg"
},
"source": [
"## Import BeautifulSoup"
]
},
{
"cell_type": "code",
"metadata": {
"id": "nDKvFlinRhgb"
},
"source": [
"from bs4 import BeautifulSoup"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "LnHFIBv1Rlrz"
},
"source": [
"Import the `BeautifulSoup` class from the `bs4` package."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9Cy6v8QeRr9o"
},
"source": [
"## Instantiate Soup's Parse Tree"
]
},
{
"cell_type": "code",
"metadata": {
"id": "0Tzz5sWTRu44"
},
"source": [
"soup = BeautifulSoup(html_doc, 'html.parser')"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "BF-LFSxuRxrN"
},
"source": [
"Pass the `html_doc` string to the `BeautifulSoup` constructor, along with the parser name `'html.parser'` (Python's built-in HTML parser), to build a parse tree named `soup`."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hswg6v5JTt09"
},
"source": [
"## Get H1"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "S_Kw1TBETzJH",
"outputId": "d0e62791-4711-436f-a1b9-27a7e8c4cc70"
},
"source": [
"soup.h1"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"<h1>Searching Parse Tree In BeautifulSoup</h1>"
]
},
"metadata": {
"tags": []
},
"execution_count": 4
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "gvhjSOIkT1q4"
},
"source": [
"Accessing `soup.h1` returns the first `h1` tag found in the parse tree."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cpdBC1UxUJW6"
},
"source": [
"## Iterate through All Anchor Tags"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "ZfOC7n6zUgbX",
"outputId": "4a5aa7fb-211a-484a-929f-81583b686be4"
},
"source": [
"for i in soup.find_all('a'):\n",
" print(i)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"<a class=\"language\" href=\"https://docs.python.org/3/\" id=\"python\">Python</a>\n",
"<a class=\"language\" href=\"https://docs.oracle.com/en/java/\" id=\"java\">Java</a>\n",
"<a class=\"language\" href=\"https://golang.org/doc/\" id=\"golang\">Golang</a>\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HoSKawM5UpKy"
},
"source": [
"The `find_all()` method of `soup` returns a list of all anchor tags, which can then be iterated over with a `for` loop."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "S-ichoO6VTTl"
},
"source": [
"## Find Anchor Tag with Attribute"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "lmyK_x_tVutu",
"outputId": "b4c142b2-00f8-47f8-c69d-84b9e5172815"
},
"source": [
"soup.find('a', id='golang')"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"<a class=\"language\" href=\"https://golang.org/doc/\" id=\"golang\">Golang</a>"
]
},
"metadata": {
"tags": []
},
"execution_count": 6
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "w0W45aCpVyeL"
},
"source": [
"The `find()` method of `soup` receives two arguments:\n",
"\n",
"- `'a'`: the tag name to search for (anchor)\n",
"- `id='golang'`: a keyword argument that filters on the tag's `id` attribute"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cIdGoDptWQtU"
},
"source": [
"## List All Anchor Tags in Language Class"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "eok7lRQwWhP8",
"outputId": "923ba6aa-a99e-4e0c-b964-8dc7299a2792"
},
"source": [
"for i in soup.select('a.language'):\n",
" print(i)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"<a class=\"language\" href=\"https://docs.python.org/3/\" id=\"python\">Python</a>\n",
"<a class=\"language\" href=\"https://docs.oracle.com/en/java/\" id=\"java\">Java</a>\n",
"<a class=\"language\" href=\"https://golang.org/doc/\" id=\"golang\">Golang</a>\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "NAZdJPbVWmh-"
},
"source": [
"The `select()` method of `soup` receives one argument:\n",
"\n",
"- `'a.language'`: a CSS selector that matches every anchor tag with the `language` class."
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "FHbaY0gLXr1j",
"outputId": "336337cb-187a-4b3d-a406-2d615d6b42e7"
},
"source": [
"for i in soup.find_all('a', {'class': 'language'}):\n",
" print(i)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"<a class=\"language\" href=\"https://docs.python.org/3/\" id=\"python\">Python</a>\n",
"<a class=\"language\" href=\"https://docs.oracle.com/en/java/\" id=\"java\">Java</a>\n",
"<a class=\"language\" href=\"https://golang.org/doc/\" id=\"golang\">Golang</a>\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "vjaKXudBXx-D"
},
"source": [
"Here `find_all()` receives a dictionary of attribute filters and returns all anchor tags whose `class` attribute includes `language`."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "pjyMAxX8YOkP"
},
"source": [
"## Get First Anchor Tag with Language Class"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "grIyvYhHYj5T",
"outputId": "e06ad087-5e4a-413e-eefc-0d8197849331"
},
"source": [
"for i in soup.find('a', class_='language'):\n",
" print(i)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"Python\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qPWQAucfYpbU"
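},
"source": [
"The `find()` method returns only the first anchor tag with the `language` class. Iterating over that `Tag` yields its children, which here is the single text node `Python`, so the loop prints the link label rather than the tag itself."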
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "N_54NtMpV7hl"
},
"source": [
"## Return Type of find()"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "2mpc8c8AWBEM",
"outputId": "e76cb9fd-a468-499f-c9d1-019c12fc5bdb"
},
"source": [
"type(soup.find('h1'))"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"bs4.element.Tag"
]
},
"metadata": {
"tags": []
},
"execution_count": 10
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "akCZrmveW5zu"
},
"source": [
"## Attrs Argument"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "2ShbUrcTW8uq",
"outputId": "21103000-b7d2-4e52-ddb1-f892dab3f7bf"
},
"source": [
"soup.find_all('a', attrs={'id': 'java', 'class': 'language'})"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[<a class=\"language\" href=\"https://docs.oracle.com/en/java/\" id=\"java\">Java</a>]"
]
},
"metadata": {
"tags": []
},
"execution_count": 11
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8SUC90bFYxz3"
},
"source": [
"The `attrs` argument of `find_all()` takes a dictionary of attribute filters. Here an anchor tag must match both the `id` and `class` values, which makes the search more specific."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kwyZlz65XpSA"
},
"source": [
"## String Argument"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "zDewBhnPXs_m",
"outputId": "e03a00fc-30cf-4027-9fbe-c631a6caca7e"
},
"source": [
"soup.find_all(string=[\"Python\", \"Java\", \"Golang\"])"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['Python', 'Java', 'Golang']"
]
},
"metadata": {
"tags": []
},
"execution_count": 12
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "mkFacwj1YWSE"
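},
"source": [
"The `string` argument matches text nodes rather than tags. Given a list, `find_all()` returns every string in the document that exactly equals one of the listed values. In this case, they are the three anchor tag **labels** for three programming languages."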
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fsij1h66X4O9"
},
"source": [
"## Limit Argument"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "UuqUVx8nX6Qz",
"outputId": "1a4c7903-3660-41a5-e2ce-3d5b44847ff2"
},
"source": [
"soup.find_all('a', limit=2)"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[<a class=\"language\" href=\"https://docs.python.org/3/\" id=\"python\">Python</a>,\n",
" <a class=\"language\" href=\"https://docs.oracle.com/en/java/\" id=\"java\">Java</a>]"
]
},
"metadata": {
"tags": []
},
"execution_count": 13
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZxI2ttepiImz"
},
"source": [
"The `limit` argument caps the number of elements that `find_all()` returns. The search stops once that many matches have been found."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "gJ4JQmnM2h7A"
},
"source": [
"## Text Property"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "rX_NSb842kIV",
"outputId": "05af1d77-72ae-415c-bc0e-16b1af3ba1fa"
},
"source": [
"for i in soup.find_all('a'):\n",
" print(i.text)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"Python\n",
"Java\n",
"Golang\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "E9-IyAkR28mk"
},
"source": [
"The `text` property returns the **text label** of each anchor tag."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "R-FKZo9Kl3Yp"
},
"source": [
"## Additional Methods"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "NQdQjW9Il_Bz",
"outputId": "9f6c7071-bd39-4ecd-d09e-8d260bf24dce"
},
"source": [
"current = soup.find('a', id='java')\n",
"current.find_parent()"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"<p class=\"Main\">Learning\n",
" <a class=\"language\" href=\"https://docs.python.org/3/\" id=\"python\">Python</a>, \n",
" <a class=\"language\" href=\"https://docs.oracle.com/en/java/\" id=\"java\">Java</a> and \n",
" <a class=\"language\" href=\"https://golang.org/doc/\" id=\"golang\">Golang</a> are fun!\n",
" </p>"
]
},
"metadata": {
"tags": []
},
"execution_count": 15
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "w-aNRoc1mL7u",
"outputId": "3540683f-b2be-4268-c70f-8e092763b7b4"
},
"source": [
"current.find_parents()"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[<p class=\"Main\">Learning\n",
" <a class=\"language\" href=\"https://docs.python.org/3/\" id=\"python\">Python</a>, \n",
" <a class=\"language\" href=\"https://docs.oracle.com/en/java/\" id=\"java\">Java</a> and \n",
" <a class=\"language\" href=\"https://golang.org/doc/\" id=\"golang\">Golang</a> are fun!\n",
" </p>, <body>\n",
" <h1>Searching Parse Tree In BeautifulSoup</h1>\n",
" <p class=\"Main\">Learning\n",
" <a class=\"language\" href=\"https://docs.python.org/3/\" id=\"python\">Python</a>, \n",
" <a class=\"language\" href=\"https://docs.oracle.com/en/java/\" id=\"java\">Java</a> and \n",
" <a class=\"language\" href=\"https://golang.org/doc/\" id=\"golang\">Golang</a> are fun!\n",
" </p>\n",
" <p class=\"Secondary\">\n",
" <b>Please subscribe!</b>\n",
" </p>\n",
" <p class=\"Secondary\" id=\"finxter\">\n",
" <b>copyright - FINXTER</b>\n",
" </p>\n",
" </body>, <html>\n",
" <head>\n",
" <title>Searching Tree</title>\n",
" </head>\n",
" <body>\n",
" <h1>Searching Parse Tree In BeautifulSoup</h1>\n",
" <p class=\"Main\">Learning\n",
" <a class=\"language\" href=\"https://docs.python.org/3/\" id=\"python\">Python</a>, \n",
" <a class=\"language\" href=\"https://docs.oracle.com/en/java/\" id=\"java\">Java</a> and \n",
" <a class=\"language\" href=\"https://golang.org/doc/\" id=\"golang\">Golang</a> are fun!\n",
" </p>\n",
" <p class=\"Secondary\">\n",
" <b>Please subscribe!</b>\n",
" </p>\n",
" <p class=\"Secondary\" id=\"finxter\">\n",
" <b>copyright - FINXTER</b>\n",
" </p>\n",
" </body>\n",
" </html>, \n",
" <html>\n",
" <head>\n",
" <title>Searching Tree</title>\n",
" </head>\n",
" <body>\n",
" <h1>Searching Parse Tree In BeautifulSoup</h1>\n",
" <p class=\"Main\">Learning\n",
" <a class=\"language\" href=\"https://docs.python.org/3/\" id=\"python\">Python</a>, \n",
" <a class=\"language\" href=\"https://docs.oracle.com/en/java/\" id=\"java\">Java</a> and \n",
" <a class=\"language\" href=\"https://golang.org/doc/\" id=\"golang\">Golang</a> are fun!\n",
" </p>\n",
" <p class=\"Secondary\">\n",
" <b>Please subscribe!</b>\n",
" </p>\n",
" <p class=\"Secondary\" id=\"finxter\">\n",
" <b>copyright - FINXTER</b>\n",
" </p>\n",
" </body>\n",
" </html>]"
]
},
"metadata": {
"tags": []
},
"execution_count": 16
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "3PVxbADvmRGm",
"outputId": "c933301b-e0bb-4957-e406-27f751862874"
},
"source": [
"current.find_previous_sibling()"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"<a class=\"language\" href=\"https://docs.python.org/3/\" id=\"python\">Python</a>"
]
},
"metadata": {
"tags": []
},
"execution_count": 17
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "T1AewC_OmVQU",
"outputId": "29ab09e8-831f-4c12-f4c5-4169c1391eaa"
},
"source": [
"current.find_previous_siblings()"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[<a class=\"language\" href=\"https://docs.python.org/3/\" id=\"python\">Python</a>]"
]
},
"metadata": {
"tags": []
},
"execution_count": 18
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "W5_ImHNKmZTr",
"outputId": "4ded964f-8800-440b-eafc-49a824a56db9"
},
"source": [
"current.find_next()"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"<a class=\"language\" href=\"https://golang.org/doc/\" id=\"golang\">Golang</a>"
]
},
"metadata": {
"tags": []
},
"execution_count": 19
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "ZJZIQ6MAmff1",
"outputId": "4fe0b02a-cdd2-41e0-b1c9-5d7ab073712b"
},
"source": [
"current.find_all_next()"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[<a class=\"language\" href=\"https://golang.org/doc/\" id=\"golang\">Golang</a>,\n",
" <p class=\"Secondary\">\n",
" <b>Please subscribe!</b>\n",
" </p>,\n",
" <b>Please subscribe!</b>,\n",
" <p class=\"Secondary\" id=\"finxter\">\n",
" <b>copyright - FINXTER</b>\n",
" </p>,\n",
" <b>copyright - FINXTER</b>]"
]
},
"metadata": {
"tags": []
},
"execution_count": 20
}
]
}
]
}