{
"cells": [
{
"cell_type": "markdown",
"id": "67a92e34",
"metadata": {},
"source": [
"# Übung Web Scraping"
]
},
{
"cell_type": "markdown",
"id": "12d9b427",
"metadata": {
"vscode": {
"languageId": "markdown"
}
},
"source": [
"## Instructions\n",
"\n",
"Hi! This is an interactive Jupyter Notebook. Below are the basics for learning how to navigate and use this tool.\n",
"\n",
"- **Running Cells:** \n",
" Click on a code cell and press `Shift + Enter` to execute it. The output will appear below the cell.\n",
"\n",
"- **Adding Cells:** \n",
" Use the `+` button in the toolbar or press `B` (below) or `A` (above) in command mode to add new cells.\n",
"\n",
"- **Editing Cells:** \n",
" Double-click a cell to edit its content. Press `Esc` to exit edit mode.\n",
"\n",
"- **Saving Your Work:** \n",
" Click the save icon or press `Ctrl + S` to save your notebook.\n",
"\n",
"- **Markdown Cells:** \n",
" Use markdown cells for formatted text, headings, and instructions. Change a cell to markdown by selecting it and pressing `M` in command mode.\n",
"\n",
"- **Restarting the Kernel:** \n",
" If your code stops working, restart the kernel from the menu (`Kernel > Restart`) and re-run the cells.\n",
"\n",
"- **Variable Scope:** \n",
" Variables and imports defined in one cell can be used in other cells, as long as the kernel has not been restarted.\n",
"\n",
"- **Order of Execution:** \n",
" The order in which you run cells matters. Make sure to run cells that define variables or import libraries before using them in other cells."
]
},
{
"cell_type": "markdown",
"id": "57e9cebe",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "markdown",
"id": "911ba59d",
"metadata": {},
"source": [
"### URLs\n",
"\n",
"**Important**: For the completion of the exercises you should use the following URLs:"
]
},
{
"cell_type": "code",
"execution_count": 355,
"id": "e49a05f5",
"metadata": {},
"outputs": [],
"source": [
"url_dw_article1 = \"https://web.archive.org/web/20250820004953/https://www.dw.com/de/treibt-frost-in-der-türkei-den-nutellapreis/a-73614362\"\n",
"url_dw_category = \"https://web.archive.org/web/20250818134329/https://www.dw.com/de/wissenschaft/s-12296\""
]
},
{
"cell_type": "markdown",
"id": "515f7863",
"metadata": {},
"source": [
"## Practical Evaluation"
]
},
{
"cell_type": "markdown",
"id": "7557a38f",
"metadata": {},
"source": [
"### Welcome question\n",
"\n",
"Which of the following is **NOT** a known property in a `headers` definiton? \n",
"\n",
"1. `'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0 Safari/537.36'`\n",
"2. `'Accept-Language': 'de-DE,de;q=0.9'`\n",
"3. `'X-Allow-All-Requests-From': 'crawlers, scrapers, robots'`\n",
"3. `'Accept-Encoding': 'gzip, deflate, br'`\n",
"4. `'Connection': 'keep-alive'`\n",
"\n",
"You may use the correct ones to create your own `headers` definition, if needed."
]
},
{
"cell_type": "markdown",
"id": "dcf940d3",
"metadata": {},
"source": [
"### Task 1\n",
"Fetch the article web page using the URL `url_dw_article1` and print the status code and content length."
]
},
{
"cell_type": "code",
"execution_count": 58,
"id": "c60fe26e",
"metadata": {},
"outputs": [],
"source": [
"# 1. import the `requests` library\n",
"# 2. set headers to mimic a browser (if necessary)\n",
"# 3. make a GET request to the URL\n",
"# 4. print the status code and content length"
]
},
{
"cell_type": "markdown",
"id": "db0f11c2",
"metadata": {},
"source": [
"Expected ouput:\n",
"\n",
"- `Status code: 200`\n",
"- `Length: 150188 bytes`"
]
},
{
"cell_type": "markdown",
"id": "b611b317",
"metadata": {},
"source": [
"### Question 1\n",
"What does a status code of `200` mean?"
]
},
{
"cell_type": "code",
"execution_count": 354,
"id": "19740682",
"metadata": {},
"outputs": [],
"source": [
"# Answer:"
]
},
{
"cell_type": "markdown",
"id": "3af406de",
"metadata": {},
"source": [
"### Task 2\n",
"Extract and print the title of a given web page."
]
},
{
"cell_type": "code",
"execution_count": 59,
"id": "bb44aa6e",
"metadata": {},
"outputs": [],
"source": [
"# 1. import the BeautifulSoup library\n",
"# 2. parse the HTML content of the page and save it to a variable named `soup1`\n",
"# 3. extract the title of the page and print it"
]
},
{
"cell_type": "markdown",
"id": "4dce9dec",
"metadata": {},
"source": [
"Expected output:\n",
"`Treibt Frost in der Türkei den Nutellapreis? – DW – 12.08.2025`"
]
},
{
"cell_type": "markdown",
"id": "2a796ea8",
"metadata": {},
"source": [
"### Task 2.b\n",
"\n",
"Print the source code of the `<header>` element in a _pretty_ format."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "27986bb5",
"metadata": {},
"outputs": [],
"source": [
"# your solution here"
]
},
{
"cell_type": "markdown",
"id": "7363d74e",
"metadata": {},
"source": [
"Expected output:\n",
"\n",
"```html\n",
"<header class=\"sgeegmk\">\n",
" <div class=\"kicker s1wafcin s1tby7rj lxmvniw icns9en rcjjkz7 w128axg5 b1fzgn0z\" data-tracking-name=\"content-detail-kicker\">\n",
" <span>\n",
" Handel\n",
" </span>\n",
" <span>\n",
" <a class=\"sngcpkw btl76l3 e1eo633p w128axg5 b1fzgn0z\" href=\"/de/europa/s-12322\" tabindex=\"0\">\n",
" Europa\n",
" </a>\n",
" </span>\n",
" </div>\n",
" <h1 class=\"daqvxdf h1du6kc5 l1ozsu87 p1s74fjj s16w0xvi sngcpkw b1fzgn0z\">\n",
" Treibt Frost in der Türkei den Nutellapreis?\n",
" </h1>\n",
" <div class=\"a9wq4ao author-details\">\n",
" <span>\n",
" <span class=\"no-link m1ho1h07 l1evdo4u blt0baw s16w0xvi sngcpkw w128axg5 b1fzgn0z\">\n",
" Nik Martin\n",
" </span>\n",
" </span>\n",
" </div>\n",
" <span class=\"time-area s1i8qhb9 lxmvniw icns9en rcjjkz7 w128axg5 b1fzgn0z\">\n",
" <span class=\"publication lxmvniw icns9en rcjjkz7 w128axg5 b1fzgn0z\">\n",
" <time aria-hidden=\"true\">\n",
" 12.08.2025\n",
" </time>\n",
" <span class=\"sr-only s28j2rd\">\n",
" 12. August 2025\n",
" </span>\n",
" </span>\n",
" </span>\n",
" <p class=\"teaser-text l1evdo4u blt0baw s16w0xvi sngcpkw w128axg5 b1fzgn0z\">\n",
" Frost im Frühjahr hat die türkische Haselnussernte hart getroffen. Europas Süßwarenhersteller leiden bereits unter hohen Kakaopreisen, nun könnten auch Produkte teurer werden, die Nüsse enthalten.\n",
" </p>\n",
"</header>\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "8bfde359",
"metadata": {},
"source": [
"### Question 2\n",
"What is the difference between `.find()` and `.select()` in BeautifulSoup?"
]
},
{
"cell_type": "code",
"execution_count": 169,
"id": "2d015839",
"metadata": {},
"outputs": [],
"source": [
"# Answer:"
]
},
{
"cell_type": "markdown",
"id": "e330b689",
"metadata": {},
"source": [
"### Question 2.b\n",
"What's the difference between the `.find()`/`.find_all()` and `.select_one()`/`.select()` methods in BeautifulSoup?"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "0dfc3c87",
"metadata": {},
"outputs": [],
"source": [
"# Answer:"
]
},
{
"cell_type": "markdown",
"id": "e01864bd",
"metadata": {},
"source": [
"### Task 3\n",
"Find all level two headings (`<h2>` tags) on the page using three different approaches.\n",
"\n",
"Tip: Use the equality operator `==` to check whether the result of all approaches is identical."
]
},
{
"cell_type": "code",
"execution_count": 47,
"id": "43fc45f1",
"metadata": {},
"outputs": [],
"source": [
"# replace the following with your solution\n",
"approach1 = None\n",
"approach2 = None\n",
"approach3 = None\n",
"# print(...)"
]
},
{
"cell_type": "markdown",
"id": "0b7f2423",
"metadata": {},
"source": [
"Expected output: `True`"
]
},
{
"cell_type": "markdown",
"id": "524fd0ea",
"metadata": {},
"source": [
"### Question 3\n",
"How can you extract the text inside a specific HTML tag?"
]
},
{
"cell_type": "code",
"execution_count": 167,
"id": "3fa4651c",
"metadata": {},
"outputs": [],
"source": [
"# Answer:"
]
},
{
"cell_type": "markdown",
"id": "750517ee",
"metadata": {},
"source": [
"### Question 4\n",
"\n",
"Now, how can you extract the URL from a link (`<a>` tag)?\n",
"\n",
"There're a couple of ways of achieving this, but try to answer with the method that doesn't cause an runtime error if the URL doesn't exist?"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "9dc34924",
"metadata": {},
"outputs": [],
"source": [
"# Answer:"
]
},
{
"cell_type": "markdown",
"id": "79e09458",
"metadata": {},
"source": [
"### Task 4\n",
"Extract all links (`<a>` tags) containing the `.internal-link` class from the web page and print their text and URLs.\n",
"\n",
"Tip: you can use a `for`-loop to iterate over the items."
]
},
{
"cell_type": "code",
"execution_count": 54,
"id": "a5995571",
"metadata": {},
"outputs": [],
"source": [
"# your solution here"
]
},
{
"cell_type": "markdown",
"id": "66a780ea",
"metadata": {},
"source": [
"Expected output:\n",
"\n",
"```\n",
"in der Türkei /de/türkei/t-17600264\n",
"von einem verheerenden Frost /de/dossier-klimawandel/a-73389631\n",
"Chile /de/chile/t-17909752\n",
"USA /de/vereinigte-staaten-von-amerika-usa/t-17286012\n",
"Georgien /de/georgien/t-18456307\n",
"die Angst vor Engpässen /de/klima-iwf-flucht-migration-extremwetter-klimawandel-wohlstand-gesundheit-hitze-dürre/a-40742370\n",
"Klimawandels /de/klimawandel/t-17477721\n",
"die Klimakrise /de/klimawandel-europa-kämpft-mit-dürre-und-wassermangel-hitze-wassermanagement-wasser-v2/a-72351705\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "1d76c977",
"metadata": {},
"source": [
"### Quesntion 5\n",
"\n",
"Which of the following options represents the correct CSS selector for matching one element or another?\n",
"\n",
"1. `element + element`\n",
"2. `element, element`\n",
"3. `element > element`\n",
"4. `element ~ element`"
]
},
{
"cell_type": "markdown",
"id": "20b48ef3",
"metadata": {},
"source": [
"### Task 5\n",
"\n",
"Extract all the article headlines from the news category page and print only each article's title.\n",
"\n",
"Important:\n",
"1. Pay attention to exclude items that aren't actual headlines (e.g. they don't contain an `<a>` element).\n",
"2. Duplicated entries should also be filtered out.\n"
]
},
{
"cell_type": "code",
"execution_count": 163,
"id": "2a2acf8a",
"metadata": {},
"outputs": [],
"source": [
"# your solution here"
]
},
{
"cell_type": "markdown",
"id": "2ae6fccc",
"metadata": {},
"source": [
"Expected output:\n",
"```\n",
"Mexikos Lieferfahrerinnen kämpfen für mehr Sicherheit\n",
"So schön wie auf Social Media? Die Hafenstadt Hamburg\n",
"Mit taktilen Westen Musik fühlen können\n",
"KI-Modell soll menschliches Verhalten vorhersagen \n",
"Pflege per App: Risiko für Patienten?\n",
"Schnellere Erde: Das steckt hinter dem \"kürzesten Tag\"\n",
"Medizinischer KI-Chatbot - speziell für arabische Frauen\n",
"50 Jahre europäische Raumfahrt: Happy Birthday, ESA!\n",
"Warum gepökeltes Fleisch nicht auf den Grill gehört\n",
"Wie Ecuador mit künstlicher Intelligenz Kolibris rettet\n",
"Weltraum: Neue deutsche Raumkapsel startet für die Forschung\n",
"Elternsein in Europa: weniger Zufriedenheit, mehr Lebenssinn\n",
"Gefälschte Medikamente: Ein weltweites Problem\n",
"Plastikmüll: Gelingt ein internationales Abkommen?\n",
"Wer gewinnt den Wettlauf im Weltall?\n",
"KI kann Sudokus lösen, aber nicht erklären\n",
"Türkei: War prähistorisches Çatalhöyük ein Matriarchat?\n",
"Vibrio-Bakterien in Europa: Sicher baden trotz Risiko\n",
"Wie entwickelt sich die sexuelle Orientierung?\n",
"Günstigere Prothesen dank 3D-Druck\n",
"Liger, Maultier und Co.: Skurrile Mischlinge der Tierwelt\n",
"Oft erst spät erkannt: Altersdepressionen\n",
"Weltstillwoche: Muttermilch kann vor Brustkrebs schützen\n",
"Boom bei deutschen Rüstungsstartups\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "ef610c6b",
"metadata": {},
"source": [
"### Question 6\n",
"..."
]
},
{
"cell_type": "code",
"execution_count": 171,
"id": "42fb3bca",
"metadata": {},
"outputs": [],
"source": [
"# Answer"
]
},
{
"cell_type": "markdown",
"id": "06e800f9",
"metadata": {},
"source": [
"### Task 6\n",
"Save extracted headlines and URLs to a CSV file."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "da1fb843",
"metadata": {},
"outputs": [],
"source": [
"# 1. import CSV library\n",
"# 2. build a list with the title and URL of each headline\n",
"# 3. open a CSV file named `headlines.csv` in write mode\n",
"# 4. write the header row with 'Title' and 'URL'\n",
"# 5. write the rows to the CSV file"
]
},
{
"cell_type": "markdown",
"id": "f7a54b66",
"metadata": {},
"source": [
"Expected output: File `headlines.csv` was saved to the filesystem.\n"
]
},
{
"cell_type": "markdown",
"id": "3f4546ff",
"metadata": {},
"source": [
"## Solutions"
]
},
{
"cell_type": "markdown",
"id": "d7a0580b",
"metadata": {},
"source": [
"### Welcome question\n",
"\n",
"The option #3 `'X-Allow-All-Requests-From': 'crawlers, scrapers, robots'` is not a known or valid headers property.\n",
"\n",
"### Task 1\n",
"\n",
"```python\n",
"import requests\n",
"response = requests.get(url_dw_article1)\n",
"print(\"Status code: \", response.status_code)\n",
"print(\"Length:\", len(response.content), 'bytes')\n",
"```\n",
"\n",
"### Question 1\n",
"\n",
"The HTTP status code `200` indicates that a request was successful and the requested data was correctly transferred from the server to the browser.\n",
"\n",
"### Task 2\n",
"\n",
"```python\n",
"from bs4 import BeautifulSoup\n",
"soup = BeautifulSoup(response.text, 'html.parser')\n",
"print(soup.title.text)\n",
"```\n",
"\n",
"### Task 2.b\n",
"\n",
"```python\n",
"print(soup.header.prettify())\n",
"```\n",
"\n",
"### Question 2\n",
"\n",
"The `.find()` method allows matching elements using tag name + attributes while the `.select()` method uses CSS selectors.\n",
"\n",
"### Question 2.b\n",
"\n",
"`.select()` and `.find_all()` match all elements while `.select_one()` and `.find()` mnatch only the first element.\n",
"\n",
"### Task 3\n",
"\n",
"```python\n",
"approach1 = soup.find_all('h2')\n",
"approach2 = soup.select('h2')\n",
"approach3 = soup('h2')\n",
"print(approach1 == approach2 == approach3)\n",
"```\n",
"\n",
"### Question 3\n",
"\n",
"By using either the method `.get_text()` or the properties `.text`, `.string` or `.string_stripped` on an element.\n",
"\n",
"### Question 4\n",
"\n",
"The `.get('href')` method can be utilized to get the URL of a link.\n",
"\n",
"### Task 4\n",
"\n",
"```python\n",
"for link in soup.find_all('a', class_='internal-link'):\n",
" print(link.text, link.get('href'))\n",
"```\n",
"\n",
"### Question 5\n",
"\n",
"Option #2 `element, element` is the correct CSS selector to match one element or the other.\n",
"\n",
"### Task 5\n",
"\n",
"```python\n",
"items = soup2.select('h3, h4')\n",
"headlines = []\n",
"\n",
"for item in items:\n",
" if item.contents[0].name == 'a':\n",
" headlines.append(item.contents[0])\n",
"\n",
"unique_headlines = set(headlines)\n",
"\n",
"for h in unique_headlines:\n",
" print(h.text)\n",
"```\n",
"\n",
"### Question 6\n",
"\n",
"\n",
"\n",
"### Task 6\n",
"\n",
"```python\n",
"import csv\n",
"\n",
"csv_rows = []\n",
"\n",
"for unique in unique_headlines:\n",
" title = unique.text\n",
" url = unique.get('href')\n",
" csv_rows.append([title, url])\n",
"\n",
"with open('headlines.csv', 'w', newline='', encoding='utf-8') as csv_file:\n",
" writer = csv.writer(csv_file)\n",
" writer.writerow(['Title', 'URL'])\n",
" writer.writerows(csv_rows)\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "67810fe5",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv (3.9.6)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}