This was a five-minute talk on web scraping given by David Fischer at a San Diego Python meetup on October 23, 2014.
{
"metadata": {
"name": "",
"signature": "sha256:0c2f0bc866b7699f623c690931eba26fa64b8c22a6306dbf072009ed0b7deba1"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"The (Mostly) Python Web Scraper's Toolkit"
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"The website is the API"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Depending on the type of webscraping you're doing," | |
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import requests\n",
"import bs4\n",
"\n",
"# World Series Game 2\n",
"resp = requests.get('http://espn.go.com/mlb/playbyplay?id=341022107')\n",
"soup = bs4.BeautifulSoup(resp.content)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 5,
"metadata": {},
"source": [
"Once the \"soup\" is made, you can run queries against the DOM. If you're familiar with\n", | |
"JavaScript and CSS selectors, this should feel fairly familiar." | |
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print('Scoring Plays')\n",
"for span in soup.find_all('span', class_='greenfont'):\n",
"    print(span.get_text())"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 5,
"metadata": {},
"source": [
"In fact, it's possible to even use CSS selectors with Beautiful Soup." | |
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print('Summary by inning')\n",
"for thead in soup.select('table.mod-data thead'):\n",
"    header_rows = thead.select('tr.team-color-strip')\n",
"    if len(header_rows) > 0:\n",
"        print(header_rows[0].select('th')[0].get_text())\n",
"\n",
"    footer_rows = thead.select('.colhead')\n",
"    if len(footer_rows) > 0:\n",
"        print(u'  {}'.format(footer_rows[0].get_text()))"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 5,
"metadata": {},
"source": [
"BeautifulSoup is not the only library for HTML parsing available. LXML,\n", | |
"a library used by BeautifulSoup, can be used similarly." | |
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from lxml import html\n",
"\n",
"tree = html.fromstring(resp.content)\n",
"\n",
"\n",
"print('Pitching changes')\n",
"for text in tree.xpath('//span[@class=\"redfont\"]/b/text()'):\n",
"    print(text)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Crawl like Google"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Scrapy**, a Python module for building web spiders and crawlers, is a little tough\n", | |
"to show here as a Scrapy project is usually broken into a number of files.\n", | |
"\n", | |
"Scrapy has lots of features that are useful for a spider:\n", | |
"* Automatically follow links (or certain matching links)\n", | |
"* Export data to various formats (CSV, JSON)\n", | |
"* Restricting crawl depth and breadth" | |
]
},
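{
"cell_type": "markdown",
"metadata": {},
"source": [
"Several of these behaviors are controlled through a project's `settings.py`.\n",
"Here is a minimal sketch, assuming the project is named `myscraper` (the\n",
"values are illustrative, not from the talk):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"## settings.py (illustrative values)\n",
"BOT_NAME = 'myscraper'\n",
"\n",
"DEPTH_LIMIT = 2       # don't follow links more than two hops from the start URLs\n",
"DOWNLOAD_DELAY = 0.5  # be polite: pause between requests to the same site\n",
"\n",
"# Equivalent to passing '-o data.json' on the command line\n",
"FEED_FORMAT = 'json'\n",
"FEED_URI = 'data.json'"
],
"language": "python",
"metadata": {},
"outputs": []
},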
{
"cell_type": "heading",
"level": 5,
"metadata": {},
"source": [
"First, let's define the data we want to retrieve"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"## items.py\n",
"import scrapy\n",
"\n",
"\n",
"class GameItem(scrapy.Item):\n",
"    url = scrapy.Field()\n",
"    inning = scrapy.Field()\n",
"    description = scrapy.Field()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 5,
"metadata": {},
"source": [
"Next, let's define how to find this data"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"## spiders/espn.py\n",
"\n",
"from scrapy.contrib.spiders import CrawlSpider, Rule\n",
"from scrapy.contrib.linkextractors import LinkExtractor\n",
"\n",
"from myscraper.items import GameItem\n",
"\n",
"\n",
"class EspnSpider(CrawlSpider):\n",
"    name = 'espn'\n",
"    allowed_domains = [\"espn.com\", \"espn.go.com\"]\n",
"    \n",
"    # start on the World Series page\n",
"    start_urls = ['http://espn.go.com/mlb/playoffs/2014/matchup/_/teams/sfo-kan']\n",
" rules = [Rule(LinkExtractor(allow=['/mlb/playbyplay\\?id=\\d+']), 'parse_game')]\n", | |
"\n", | |
" def parse_game(self, response):\n", | |
" game = GameItem()\n", | |
" game['url'] = response.url\n", | |
" game['description'] = response.xpath('//div[@class=\"game-time-location\"/p/text()').extract()\n", | |
" return game" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "heading", | |
"level": 5, | |
"metadata": {}, | |
"source": [ | |
"Finally, we run our spider" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" # On the terminal\n", | |
" % scrapy crawl espn -o data.json\n", | |
" % more data.json\n", | |
" [{\"url\": \"http://espn.go.com/mlb/playbyplay?id=341021107\", \"description\": [\"8:07 PM ET, October 21, 2014\", \"Kauffman Stadium, Kansas City, Missouri\\u00a0\"]},\n", | |
" {\"url\": \"http://espn.go.com/mlb/playbyplay?id=341022107\", \"description\": [\"8:07 PM ET, October 22, 2014\", \"Kauffman Stadium, Kansas City, Missouri\\u00a0\"]}]" | |
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Not Python but very useful"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Wget** is a command-line program for downloading files over HTTP & FTP.\n", | |
"\n", | |
"Most importantly, it can download files \"recursively\" which means\n", | |
"that it can follow links. There's also \"mirror\" mode which is useful\n", | |
"for mirroring an entire website.\n", | |
"\n", | |
" % wget --wait=0.1 --random-wait --no-host-directories \\\n", | |
" --cut-dirs=2 --mirror --relative -e robots=off \\\n", | |
" --no-parent --no-verbose --append-output=log.txt \\\n", | |
" http://gdx.mlb.com/components/game/mlb/year_2014/month_10/day_21/gid_2014_10_21_sfnmlb_kcamlb_1\n", | |
" \n", | |
"For a \"complete\" project, see my [MLB downloader](https://github.com/davidfischer/gdx-downloader)" | |
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Not covered here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Automating a full web browser (JavaScript, AJAX)\n", | |
"* Working around tougher countermeasures (Captchas, IP restrictions)" | |
]
},
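{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a pointer on the first item: driving a real browser is usually done with a\n",
"tool like Selenium. A minimal sketch, assuming `selenium` and Firefox are\n",
"installed (this example was not part of the original talk):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from selenium import webdriver\n",
"\n",
"# Launch a real browser so JavaScript and AJAX run as they would for a user\n",
"driver = webdriver.Firefox()\n",
"driver.get('http://espn.go.com/mlb/playbyplay?id=341022107')\n",
"\n",
"# The rendered DOM is queryable once the page (and its scripts) have loaded\n",
"for elem in driver.find_elements_by_css_selector('span.greenfont'):\n",
"    print(elem.text)\n",
"\n",
"driver.quit()"
],
"language": "python",
"metadata": {},
"outputs": []
}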
],
"metadata": {}
}
]
}