{
"cells": [
{
"cell_type": "markdown",
"id": "67a92e34",
"metadata": {},
"source": [
"# Übung Web Scraping"
]
},
{
"cell_type": "markdown",
"id": "12d9b427",
"metadata": {
"vscode": {
"languageId": "markdown"
}
},
"source": [
"## Instructions\n",
"\n",
"Hi! This is an interactive Jupyter Notebook. Below are the basics for learning how to navigate and use this tool.\n",
"\n",
"- **Running Cells:** \n",
" Click on a code cell and press `Shift + Enter` to execute it. The output will appear below the cell.\n",
"\n",
"- **Adding Cells:** \n",
" Use the `+` button in the toolbar or press `B` (below) or `A` (above) in command mode to add new cells.\n",
"\n",
"- **Editing Cells:** \n",
" Double-click a cell to edit its content. Press `Esc` to exit edit mode.\n",
"\n",
"- **Saving Your Work:** \n",
" Click the save icon or press `Ctrl + S` to save your notebook.\n",
"\n",
"- **Markdown Cells:** \n",
" Use markdown cells for formatted text, headings, and instructions. Change a cell to markdown by selecting it and pressing `M` in command mode.\n",
"\n",
"- **Restarting the Kernel:** \n",
" If your code stops working, restart the kernel from the menu (`Kernel > Restart`) and re-run the cells.\n",
"\n",
"- **Variable Scope:** \n",
" Variables and imports defined in one cell can be used in other cells, as long as the kernel has not been restarted.\n",
"\n",
"- **Order of Execution:** \n",
" The order in which you run cells matters. Make sure to run cells that define variables or import libraries before using them in other cells."
]
},
{
"cell_type": "markdown",
"id": "57e9cebe",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "markdown",
"id": "911ba59d",
"metadata": {},
"source": [
"### URLs\n",
"\n",
"**Important**: For the completion of the exercises you should use the following URLs:"
]
},
{
"cell_type": "code",
"execution_count": 355,
"id": "e49a05f5",
"metadata": {},
"outputs": [],
"source": [
"url_dw_article1 = \"https://web.archive.org/web/20250820004953/https://www.dw.com/de/treibt-frost-in-der-türkei-den-nutellapreis/a-73614362\"\n",
"url_dw_category = \"https://web.archive.org/web/20250818134329/https://www.dw.com/de/wissenschaft/s-12296\""
]
},
{
"cell_type": "markdown",
"id": "515f7863",
"metadata": {},
"source": [
"## Practical Evaluation"
]
},
{
"cell_type": "markdown",
"id": "7557a38f",
"metadata": {},
"source": [
"### Welcome question\n",
"\n",
"Which of the following is **NOT** a known property in a `headers` definiton? \n",
"\n",
"1. `'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0 Safari/537.36'`\n",
"2. `'Accept-Language': 'de-DE,de;q=0.9'`\n",
"3. `'X-Allow-All-Requests-From': 'crawlers, scrapers, robots'`\n",
"3. `'Accept-Encoding': 'gzip, deflate, br'`\n",
"4. `'Connection': 'keep-alive'`\n",
"\n",
"You may use the correct ones to create your own `headers` definition, if needed."
]
},
{
"cell_type": "markdown",
"id": "dcf940d3",
"metadata": {},
"source": [
"### Task 1\n",
"Fetch the article web page using the URL `url_dw_article1` and print the status code and content length."
]
},
{
"cell_type": "code",
"execution_count": 58,
"id": "c60fe26e",
"metadata": {},
"outputs": [],
"source": [
"# 1. import the `requests` library\n",
"# 2. set headers to mimic a browser (if necessary)\n",
"# 3. make a GET request to the URL\n",
"# 4. print the status code and content length"
]
},
{
"cell_type": "markdown",
"id": "db0f11c2",
"metadata": {},
"source": [
"Expected ouput:\n",
"\n",
"- `Status code: 200`\n",
"- `Length: 150188 bytes`"
]
},
{
"cell_type": "markdown",
"id": "b611b317",
"metadata": {},
"source": [
"### Question 1\n",
"What does a status code of `200` mean?"
]
},
{
"cell_type": "code",
"execution_count": 354,
"id": "19740682",
"metadata": {},
"outputs": [],
"source": [
"# Answer:"
]
},
{
"cell_type": "markdown",
"id": "3af406de",
"metadata": {},
"source": [
"### Task 2\n",
"Extract and print the title of a given web page."
]
},
{
"cell_type": "code",
"execution_count": 59,
"id": "bb44aa6e",
"metadata": {},
"outputs": [],
"source": [
"# 1. import the BeautifulSoup library\n",
"# 2. parse the HTML content of the page and save it to a variable named `soup1`\n",
"# 3. extract the title of the page and print it"
]
},
{
"cell_type": "markdown",
"id": "4dce9dec",
"metadata": {},
"source": [
"Expected output:\n",
"`Treibt Frost in der Türkei den Nutellapreis? – DW – 12.08.2025`"
]
},
{
"cell_type": "markdown",
"id": "2a796ea8",
"metadata": {},
"source": [
"### Task 2.b\n",
"\n",
"Print the source code of the `<header>` element in a _pretty_ format."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "27986bb5",
"metadata": {},
"outputs": [],
"source": [
"# your solution here"
]
},
{
"cell_type": "markdown",
"id": "7363d74e",
"metadata": {},
"source": [
"Expected output:\n",
"\n",
"```html\n",
"<header class=\"sgeegmk\">\n",
" <div class=\"kicker s1wafcin s1tby7rj lxmvniw icns9en rcjjkz7 w128axg5 b1fzgn0z\" data-tracking-name=\"content-detail-kicker\">\n",
" <span>\n",
" Handel\n",
" </span>\n",
" <span>\n",
" <a class=\"sngcpkw btl76l3 e1eo633p w128axg5 b1fzgn0z\" href=\"/de/europa/s-12322\" tabindex=\"0\">\n",
" Europa\n",
" </a>\n",
" </span>\n",
" </div>\n",
" <h1 class=\"daqvxdf h1du6kc5 l1ozsu87 p1s74fjj s16w0xvi sngcpkw b1fzgn0z\">\n",
" Treibt Frost in der Türkei den Nutellapreis?\n",
" </h1>\n",
" <div class=\"a9wq4ao author-details\">\n",
" <span>\n",
" <span class=\"no-link m1ho1h07 l1evdo4u blt0baw s16w0xvi sngcpkw w128axg5 b1fzgn0z\">\n",
" Nik Martin\n",
" </span>\n",
" </span>\n",
" </div>\n",
" <span class=\"time-area s1i8qhb9 lxmvniw icns9en rcjjkz7 w128axg5 b1fzgn0z\">\n",
" <span class=\"publication lxmvniw icns9en rcjjkz7 w128axg5 b1fzgn0z\">\n",
" <time aria-hidden=\"true\">\n",
" 12.08.2025\n",
" </time>\n",
" <span class=\"sr-only s28j2rd\">\n",
" 12. August 2025\n",
" </span>\n",
" </span>\n",
" </span>\n",
" <p class=\"teaser-text l1evdo4u blt0baw s16w0xvi sngcpkw w128axg5 b1fzgn0z\">\n",
" Frost im Frühjahr hat die türkische Haselnussernte hart getroffen. Europas Süßwarenhersteller leiden bereits unter hohen Kakaopreisen, nun könnten auch Produkte teurer werden, die Nüsse enthalten.\n",
" </p>\n",
"</header>\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "8bfde359",
"metadata": {},
"source": [
"### Question 2\n",
"What is the difference between `.find()` and `.select()` in BeautifulSoup?"
]
},
{
"cell_type": "code",
"execution_count": 169,
"id": "2d015839",
"metadata": {},
"outputs": [],
"source": [
"# Answer:"
]
},
{
"cell_type": "markdown",
"id": "e330b689",
"metadata": {},
"source": [
"### Question 2.b\n",
"What's the difference between the `.find()`/`.find_all()` and `.select_one()`/`.select()` methods in BeautifulSoup?"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "0dfc3c87",
"metadata": {},
"outputs": [],
"source": [
"# Answer:"
]
},
{
"cell_type": "markdown",
"id": "e01864bd",
"metadata": {},
"source": [
"### Task 3\n",
"Find all level two headings (`<h2>` tags) on the page using three different approaches.\n",
"\n",
"Tip: Use the equality operator `==` to check whether the result of all approaches is identical."
]
},
{
"cell_type": "code",
"execution_count": 47,
"id": "43fc45f1",
"metadata": {},
"outputs": [],
"source": [
"# replace the following with your solution\n",
"approach1 = None\n",
"approach2 = None\n",
"approach3 = None\n",
"# print(...)"
]
},
{
"cell_type": "markdown",
"id": "0b7f2423",
"metadata": {},
"source": [
"Expected output: `True`"
]
},
{
"cell_type": "markdown",
"id": "524fd0ea",
"metadata": {},
"source": [
"### Question 3\n",
"How can you extract the text inside a specific HTML tag?"
]
},
{
"cell_type": "code",
"execution_count": 167,
"id": "3fa4651c",
"metadata": {},
"outputs": [],
"source": [
"# Answer:"
]
},
{
"cell_type": "markdown",
"id": "750517ee",
"metadata": {},
"source": [
"### Question 4\n",
"\n",
"Now, how can you extract the URL from a link (`<a>` tag)?\n",
"\n",
"There're a couple of ways of achieving this, but try to answer with the method that doesn't cause an runtime error if the URL doesn't exist?"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "9dc34924",
"metadata": {},
"outputs": [],
"source": [
"# Answer:"
]
},
{
"cell_type": "markdown",
"id": "79e09458",
"metadata": {},
"source": [
"### Task 4\n",
"Extract all links (`<a>` tags) containing the `.internal-link` class from the web page and print their text and URLs.\n",
"\n",
"Tip: you can use a `for`-loop to iterate over the items."
]
},
{
"cell_type": "code",
"execution_count": 54,
"id": "a5995571",
"metadata": {},
"outputs": [],
"source": [
"# your solution here"
]
},
{
"cell_type": "markdown",
"id": "66a780ea",
"metadata": {},
"source": [
"Expected output:\n",
"\n",
"```\n",
"in der Türkei /de/türkei/t-17600264\n",
"von einem verheerenden Frost /de/dossier-klimawandel/a-73389631\n",
"Chile /de/chile/t-17909752\n",
"USA /de/vereinigte-staaten-von-amerika-usa/t-17286012\n",
"Georgien /de/georgien/t-18456307\n",
"die Angst vor Engpässen /de/klima-iwf-flucht-migration-extremwetter-klimawandel-wohlstand-gesundheit-hitze-dürre/a-40742370\n",
"Klimawandels /de/klimawandel/t-17477721\n",
"die Klimakrise /de/klimawandel-europa-kämpft-mit-dürre-und-wassermangel-hitze-wassermanagement-wasser-v2/a-72351705\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "1d76c977",
"metadata": {},
"source": [
"### Quesntion 5\n",
"\n",
"Which of the following options represents the correct CSS selector for matching one element or another?\n",
"\n",
"1. `element + element`\n",
"2. `element, element`\n",
"3. `element > element`\n",
"4. `element ~ element`"
]
},
{
"cell_type": "markdown",
"id": "20b48ef3",
"metadata": {},
"source": [
"### Task 5\n",
"\n",
"Extract all the article headlines from the news category page and print only each article's title.\n",
"\n",
"Important:\n",
"1. Pay attention to exclude items that aren't actual headlines (e.g. they don't contain an `<a>` element).\n",
"2. Duplicated entries should also be filtered out.\n"
]
},
{
"cell_type": "code",
"execution_count": 163,
"id": "2a2acf8a",
"metadata": {},
"outputs": [],
"source": [
"# your solution here"
]
},
{
"cell_type": "markdown",
"id": "2ae6fccc",
"metadata": {},
"source": [
"Expected output:\n",
"```\n",
"Mexikos Lieferfahrerinnen kämpfen für mehr Sicherheit\n",
"So schön wie auf Social Media? Die Hafenstadt Hamburg\n",
"Mit taktilen Westen Musik fühlen können\n",
"KI-Modell soll menschliches Verhalten vorhersagen \n",
"Pflege per App: Risiko für Patienten?\n",
"Schnellere Erde: Das steckt hinter dem \"kürzesten Tag\"\n",
"Medizinischer KI-Chatbot - speziell für arabische Frauen\n",
"50 Jahre europäische Raumfahrt: Happy Birthday, ESA!\n",
"Warum gepökeltes Fleisch nicht auf den Grill gehört\n",
"Wie Ecuador mit künstlicher Intelligenz Kolibris rettet\n",
"Weltraum: Neue deutsche Raumkapsel startet für die Forschung\n",
"Elternsein in Europa: weniger Zufriedenheit, mehr Lebenssinn\n",
"Gefälschte Medikamente: Ein weltweites Problem\n",
"Plastikmüll: Gelingt ein internationales Abkommen?\n",
"Wer gewinnt den Wettlauf im Weltall?\n",
"KI kann Sudokus lösen, aber nicht erklären\n",
"Türkei: War prähistorisches Çatalhöyük ein Matriarchat?\n",
"Vibrio-Bakterien in Europa: Sicher baden trotz Risiko\n",
"Wie entwickelt sich die sexuelle Orientierung?\n",
"Günstigere Prothesen dank 3D-Druck\n",
"Liger, Maultier und Co.: Skurrile Mischlinge der Tierwelt\n",
"Oft erst spät erkannt: Altersdepressionen\n",
"Weltstillwoche: Muttermilch kann vor Brustkrebs schützen\n",
"Boom bei deutschen Rüstungsstartups\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "ef610c6b",
"metadata": {},
"source": [
"### Question 6\n",
"..."
]
},
{
"cell_type": "code",
"execution_count": 171,
"id": "42fb3bca",
"metadata": {},
"outputs": [],
"source": [
"# Answer"
]
},
{
"cell_type": "markdown",
"id": "06e800f9",
"metadata": {},
"source": [
"### Task 6\n",
"Save extracted headlines and URLs to a CSV file."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "da1fb843",
"metadata": {},
"outputs": [],
"source": [
"# 1. import CSV library\n",
"# 2. build a list with the title and URL of each headline\n",
"# 3. open a CSV file named `headlines.csv` in write mode\n",
"# 4. write the header row with 'Title' and 'URL'\n",
"# 5. write the rows to the CSV file"
]
},
{
"cell_type": "markdown",
"id": "f7a54b66",
"metadata": {},
"source": [
"Expected output: File `headlines.csv` was saved to the filesystem.\n"
]
},
{
"cell_type": "markdown",
"id": "3f4546ff",
"metadata": {},
"source": [
"## Solutions"
]
},
{
"cell_type": "markdown",
"id": "d7a0580b",
"metadata": {},
"source": [
"### Welcome question\n",
"\n",
"The option #3 `'X-Allow-All-Requests-From': 'crawlers, scrapers, robots'` is not a known or valid headers property.\n",
"\n",
"### Task 1\n",
"\n",
"```python\n",
"import requests\n",
"response = requests.get(url_dw_article1)\n",
"print(\"Status code: \", response.status_code)\n",
"print(\"Length:\", len(response.content), 'bytes')\n",
"```\n",
"\n",
"### Question 1\n",
"\n",
"The HTTP status code `200` indicates that a request was successful and the requested data was correctly transferred from the server to the browser.\n",
"\n",
"### Task 2\n",
"\n",
"```python\n",
"from bs4 import BeautifulSoup\n",
"soup = BeautifulSoup(response.text, 'html.parser')\n",
"print(soup.title.text)\n",
"```\n",
"\n",
"### Task 2.b\n",
"\n",
"```python\n",
"print(soup.header.prettify())\n",
"```\n",
"\n",
"### Question 2\n",
"\n",
"The `.find()` method allows matching elements using tag name + attributes while the `.select()` method uses CSS selectors.\n",
"\n",
"### Question 2.b\n",
"\n",
"`.select()` and `.find_all()` match all elements while `.select_one()` and `.find()` mnatch only the first element.\n",
"\n",
"### Task 3\n",
"\n",
"```python\n",
"approach1 = soup.find_all('h2')\n",
"approach2 = soup.select('h2')\n",
"approach3 = soup('h2')\n",
"print(approach1 == approach2 == approach3)\n",
"```\n",
"\n",
"### Question 3\n",
"\n",
"By using either the method `.get_text()` or the properties `.text`, `.string` or `.string_stripped` on an element.\n",
"\n",
"### Question 4\n",
"\n",
"The `.get('href')` method can be utilized to get the URL of a link.\n",
"\n",
"### Task 4\n",
"\n",
"```python\n",
"for link in soup.find_all('a', class_='internal-link'):\n",
" print(link.text, link.get('href'))\n",
"```\n",
"\n",
"### Question 5\n",
"\n",
"Option #2 `element, element` is the correct CSS selector to match one element or the other.\n",
"\n",
"### Task 5\n",
"\n",
"```python\n",
"items = soup2.select('h3, h4')\n",
"headlines = []\n",
"\n",
"for item in items:\n",
" if item.contents[0].name == 'a':\n",
" headlines.append(item.contents[0])\n",
"\n",
"unique_headlines = set(headlines)\n",
"\n",
"for h in unique_headlines:\n",
" print(h.text)\n",
"```\n",
"\n",
"### Question 6\n",
"\n",
"\n",
"\n",
"### Task 6\n",
"\n",
"```python\n",
"import csv\n",
"\n",
"csv_rows = []\n",
"\n",
"for unique in unique_headlines:\n",
" title = unique.text\n",
" url = unique.get('href')\n",
" csv_rows.append([title, url])\n",
"\n",
"with open('headlines.csv', 'w', newline='', encoding='utf-8') as csv_file:\n",
" writer = csv.writer(csv_file)\n",
" writer.writerow(['Title', 'URL'])\n",
" writer.writerows(csv_rows)\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "67810fe5",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv (3.9.6)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}