
@stephenhouser
Last active August 31, 2018 17:04
Scrape escrevalolaescreva.blogspot.com Articles (blogspot)

Example Python 3 script to scrape blog articles from escrevalolaescreva.blogspot.com. Written on request for Jay Sosa, Bowdoin College, by Stephen Houser.

You can run this notebook interactively with Binder

EscrevaLolaEscreva Jupyter Notebook
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Scrape `escrevalolaescreva.blogspot.com` Articles\n",
"Done on request for Jay Sosa, Bowdoin College by Stephen Houser\n",
"\n",
"This script scrapes a Blogspot blog by iterating back in its history.\n",
"\n",
"Usage:\n",
"\n",
"1. Set `url` to Blogspot URL of a specific article you want to start scraping from\n",
"2. Set `output_filename` to the name of a file to save your results to\n",
"3. Set `articles_to_scrape` to some maximim number of articles to try and collect\n",
"\n",
"Note: Your IP-number may be temporarily banned from the Blogger service if over-used.\n",
"Use on your own risk."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"# Set url to Blogger blog article\n",
"url = 'http://escrevalolaescreva.blogspot.com/2018/08/acreditei-que-minha-existencia-era.html'\n",
"\n",
"# Set to name of file to save resulting text to\n",
"output_filename = 'escrevalolaescreva.txt'\n",
"\n",
"# Set to maximum number of articles to collect\n",
"articles_to_scrape = 10\n",
"\n",
"# Where we will collect the scraped articles\n",
"scraped_articles = []"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is where the scraping actually happens. You should not need to modify this section, unless things break and you need to tweak the HTML class names."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"import re\n",
"from bs4 import BeautifulSoup\n",
"\n",
"for n_article in range(1, articles_to_scrape+1):\n",
" result = requests.get(url)\n",
" if result.status_code != 200:\n",
" break\n",
"\n",
" content = result.content\n",
" soup = BeautifulSoup(content, 'html.parser')\n",
"\n",
" post_title = str(soup.find(class_='post-title').text).strip()\n",
" post_date = str(soup.find(class_='date-header').text).strip()\n",
"\n",
" post_body = soup.find(class_='post-body')\n",
"\n",
" post_text = str(post_body.text).strip()\n",
" post_text = re.sub('\\n+', '\\n', post_text)\n",
"\n",
" post_image_links = []\n",
" for img in post_body.find_all('img', src=True):\n",
" post_image_links.append(img['src'])\n",
"\n",
" scraped_articles.append({\n",
" 'url': url,\n",
" 'title': post_title,\n",
" 'date': post_date,\n",
" 'text': post_text,\n",
" 'images': post_image_links\n",
" })\n",
"\n",
" older_link = soup.find('a', class_='blog-pager-older-link')\n",
" url = older_link['href']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If doing some further text processing, all the articles are available in the `scraped_article` array."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\"ACREDITEI QUE MINHA EXISTÊNCIA ERA LIMITADA À APROVAÇÃO DOS OUTROS\"\n",
"TODO MÊS É DE DESGOSTO PARA AS MULHERES\n",
"SÉRIE SOBRE A CAÇA AO UNABOMBER DEIXA DE LADO AS MULHERES\n",
"ELEIÇÕES 2018: A CHANCE DA ESQUERDA VOLTAR AO PALÁCIO DO PLANALTO E O MOMENTO DECISIVO PARA O PROGRESSISMO DO PAÍS\n",
"GUEST POST: BONITA NÃO É ELOGIO\n",
"QUEM VOCÊ CHAMOU DE IDIOTA ÚTIL?\n",
"BABIY, PRESA E CONDENADA POR SER NEGRA\n",
"DE SHARP OBJECTS A KILLING EVE: OS MELHORES ANTI-HERÓIS DA TV SÃO MULHERES\n",
"O QUE DÓI MAIS: DOR DE DENTE OU MACHISMO?\n",
"MARINA HUMILHA BOLSONARO NO DEBATE\n"
]
}
],
"source": [
"for article in scraped_articles:\n",
" print(article['title'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Save the articles to a text file"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"with open(output_filename, 'w') as output_f:\n",
" for article in scraped_articles:\n",
" output_f.write(\n",
" '--- --- ---\\n' +\n",
" article['url'] + '\\n' +\n",
" article['title'] + '\\n' +\n",
" article['date'] + '\\n\\n' +\n",
" article['text'] + '\\n\\n'\n",
" )\n",
"\n",
" for img in article['images']:\n",
" output_f.write('\\t' + img + '\\n')\n",
"\n",
" output_f.write('\\n')"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--- --- ---\r\n",
"http://escrevalolaescreva.blogspot.com/2018/08/acreditei-que-minha-existencia-era.html\r\n",
"\"ACREDITEI QUE MINHA EXISTÊNCIA ERA LIMITADA À APROVAÇÃO DOS OUTROS\"\r\n",
"quinta-feira, 30 de agosto de 2018\r\n",
"\r\n",
"Vinicius Simões, colaborador frequente e querido do blog, enviou este ótimo artigo de Kelly Marie Tran publicado no New York Times.\r\n",
"Pra quem não sabe, a atriz foi muito atacada nas redes sociais por homens que não aceitam uma mulher asiática no filme Star Wars. Recentemente, após tantos xingamentos racistas e machistas, ela decidiu deletar seu Instagram. Um grupo de homens brancos online se declarou responsável pela \"vitória\" de fazer com que Kelly deletasse sua conta (o grupo também pregou boicote ao filme Pantera Negra, que já arrecadou US$ 1.3 bi). \r\n",
"Esta é a primeira vez que Kelly fala sobre tudo isso. \r\n",
"A atriz Kelly Marie Tran com a camiseta do filme\r\n",
"Não foi pelas palavras deles, foi por eu ter começado a acreditar nelas.\r\n"
]
}
],
"source": [
"!head escrevalolaescreva.txt"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.6",
"language": "python",
"name": "python3.6"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
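The notebook's usage note warns that Blogger may temporarily ban your IP address if you make too many requests; pacing the fetch loop helps avoid that. A minimal sketch of a retry-with-backoff wrapper (the helper names are assumptions, not part of the notebook; in real use you would pass `requests.get` as `fetch`):

```python
import time

def backoff_delays(retries=3, base_delay=2.0):
    """Exponential delay schedule in seconds: 2, 4, 8, ... by default."""
    return [base_delay * (2 ** attempt) for attempt in range(retries)]

def polite_fetch(url, fetch, retries=3, base_delay=2.0):
    """Call fetch(url), sleeping and retrying when the response is not HTTP 200.

    fetch is injected (e.g. requests.get) so this sketch has no network
    dependency of its own. Returns the response, or None after all retries.
    """
    for delay in backoff_delays(retries, base_delay):
        response = fetch(url)
        if getattr(response, 'status_code', None) == 200:
            return response
        time.sleep(delay)
    return None
```

Dropping `polite_fetch(url, requests.get)` in place of the bare `requests.get(url)` call in the scraping loop would space out failed retries without changing the rest of the logic.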
requests==2.19.1
beautifulsoup4==4.6.3
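The notebook writes its results as plain text; for downstream processing it could just as easily dump the `scraped_articles` list as JSON, since each entry is already a plain dict. A sketch (the output filename and the sample entry are assumptions, not output of the notebook):

```python
import json

# One dict per article, the same shape as the notebook's scraped_articles entries
articles = [{
    'url': 'http://escrevalolaescreva.blogspot.com/2018/08/acreditei-que-minha-existencia-era.html',
    'title': '"ACREDITEI QUE MINHA EXISTÊNCIA ERA LIMITADA À APROVAÇÃO DOS OUTROS"',
    'date': 'quinta-feira, 30 de agosto de 2018',
    'text': '...',   # placeholder body text
    'images': [],
}]

# ensure_ascii=False keeps the Portuguese accented characters readable in the file
with open('escrevalolaescreva.json', 'w', encoding='utf-8') as f:
    json.dump(articles, f, ensure_ascii=False, indent=2)
```

The JSON file round-trips cleanly with `json.load`, which makes it easier to reuse than the `--- --- ---`-delimited text format.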