This was a five-minute talk on web scraping given by David Fischer at a San Diego Python meetup on October 23, 2014.
{
"metadata": {
"name": "",
"signature": "sha256:0c2f0bc866b7699f623c690931eba26fa64b8c22a6306dbf072009ed0b7deba1"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"The (Mostly) Python Web Scraper's Toolkit"
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"The website is the API"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Depending on the type of webscraping you're doing,"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import requests\n",
"import bs4\n",
"\n",
"# World Series Game 2\n",
"resp = requests.get('http://espn.go.com/mlb/playbyplay?id=341022107')\n",
"soup = bs4.BeautifulSoup(resp.content)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 5,
"metadata": {},
"source": [
"Once the \"soup\" is made, you can run queries against the DOM. If you're familiar with\n",
"JavaScript and CSS selectors, this should feel fairly familiar."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print('Scoring Plays')\n",
"for span in soup.find_all('span', class_='greenfont'):\n",
" print(span.get_text())"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 5,
"metadata": {},
"source": [
"In fact, it's possible to even use CSS selectors with Beautiful Soup."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print('Summary by inning')\n",
"for thead in soup.select('table.mod-data thead'):\n",
" header_rows = thead.select('tr.team-color-strip')\n",
" if len(header_rows) > 0:\n",
" print(header_rows[0].select('th')[0].get_text())\n",
"\n",
" footer_rows = thead.select('.colhead')\n",
" if len(footer_rows) > 0:\n",
" print(u' {}'.format(footer_rows[0].get_text()))"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 5,
"metadata": {},
"source": [
"BeautifulSoup is not the only library for HTML parsing available. LXML,\n",
"a library used by BeautifulSoup, can be used similarly."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from lxml import html\n",
"\n",
"tree = html.fromstring(resp.content)\n",
"\n",
"\n",
"print('Pitching changes')\n",
"for text in tree.xpath('//span[@class=\"redfont\"]/b/text()'):\n",
" print(text)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Crawl like Google"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Scrapy**, a Python module for building web spiders and crawlers, is a little tough\n",
"to show here as a Scrapy project is usually broken into a number of files.\n",
"\n",
"Scrapy has lots of features that are useful for a spider:\n",
"* Automatically follow links (or certain matching links)\n",
"* Export data to various formats (CSV, JSON)\n",
"* Restricting crawl depth and breadth"
]
},
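{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sketch of that last bullet: crawl depth is restricted through project\n",
"settings. `DEPTH_LIMIT` and `DOWNLOAD_DELAY` are standard Scrapy settings; the\n",
"values below are only illustrative."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"## settings.py (illustrative values)\n",
"BOT_NAME = 'myscraper'\n",
"\n",
"# stop following links more than two hops from the start URLs\n",
"DEPTH_LIMIT = 2\n",
"\n",
"# pause between requests so the crawl stays polite\n",
"DOWNLOAD_DELAY = 0.5"
],
"language": "python",
"metadata": {},
"outputs": []
},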
{
"cell_type": "heading",
"level": 5,
"metadata": {},
"source": [
"First, let's define the data we want to retrieve"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"## items.py\n",
"import scrapy\n",
"\n",
"\n",
"class GameItem(scrapy.Item):\n",
" url = scrapy.Field()\n",
" inning = scrapy.Field()\n",
" description = scrapy.Field()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 5,
"metadata": {},
"source": [
"Next, let's define how to find this data"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"## spiders/espn.py\n",
"\n",
"from scrapy.contrib.spiders import CrawlSpider, Rule\n",
"from scrapy.contrib.linkextractors import LinkExtractor\n",
"\n",
"from myscraper.items import GameItem\n",
"\n",
"\n",
"class EspnSpider(CrawlSpider):\n",
" name = 'espn'\n",
" allowed_domains = [\"espn.com\", \"espn.go.com\"]\n",
" \n",
" # start on the World Series page\n",
" start_urls = ['http://espn.go.com/mlb/playoffs/2014/matchup/_/teams/sfo-kan']\n",
" rules = [Rule(LinkExtractor(allow=['/mlb/playbyplay\\?id=\\d+']), 'parse_game')]\n",
"\n",
" def parse_game(self, response):\n",
" game = GameItem()\n",
" game['url'] = response.url\n",
" game['description'] = response.xpath('//div[@class=\"game-time-location\"/p/text()').extract()\n",
" return game"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 5,
"metadata": {},
"source": [
"Finally, we run our spider"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" # On the terminal\n",
" % scrapy crawl espn -o data.json\n",
" % more data.json\n",
" [{\"url\": \"http://espn.go.com/mlb/playbyplay?id=341021107\", \"description\": [\"8:07 PM ET, October 21, 2014\", \"Kauffman Stadium, Kansas City, Missouri\\u00a0\"]},\n",
" {\"url\": \"http://espn.go.com/mlb/playbyplay?id=341022107\", \"description\": [\"8:07 PM ET, October 22, 2014\", \"Kauffman Stadium, Kansas City, Missouri\\u00a0\"]}]"
]
},
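{
"cell_type": "markdown",
"metadata": {},
"source": [
"Swapping the output extension exports the same items as CSV instead of JSON\n",
"(older Scrapy releases may also want an explicit `-t csv`):\n",
"\n",
"    % scrapy crawl espn -o data.csv"
]
},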
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Not Python but very useful"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Wget** is a command-line program for downloading files over HTTP & FTP.\n",
"\n",
"Most importantly, it can download files \"recursively\" which means\n",
"that it can follow links. There's also \"mirror\" mode which is useful\n",
"for mirroring an entire website.\n",
"\n",
" % wget --wait=0.1 --random-wait --no-host-directories \\\n",
" --cut-dirs=2 --mirror --relative -e robots=off \\\n",
" --no-parent --no-verbose --append-output=log.txt \\\n",
" http://gdx.mlb.com/components/game/mlb/year_2014/month_10/day_21/gid_2014_10_21_sfnmlb_kcamlb_1\n",
" \n",
"For a \"complete\" project, see my [MLB downloader](https://github.com/davidfischer/gdx-downloader)"
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Not covered here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Automating a full web browser (JavaScript, AJAX)\n",
"* Working around tougher countermeasures (Captchas, IP restrictions)"
]
}
],
"metadata": {}
}
]
}