This was a five-minute talk on web scraping given by David Fischer at a San Diego Python meetup on October 23, 2014.
{
"metadata": {
"name": "",
"signature": "sha256:0c2f0bc866b7699f623c690931eba26fa64b8c22a6306dbf072009ed0b7deba1"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"The (Mostly) Python Web Scraper's Toolkit"
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"The website is the API"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Depending on the type of webscraping you're doing," | |
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import requests\n",
"import bs4\n",
"\n",
"# World Series Game 2\n",
"resp = requests.get('http://espn.go.com/mlb/playbyplay?id=341022107')\n",
"soup = bs4.BeautifulSoup(resp.content)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 5,
"metadata": {},
"source": [
"Once the \"soup\" is made, you can run queries against the DOM. If you're familiar with\n", | |
"JavaScript and CSS selectors, this should feel fairly familiar." | |
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print('Scoring Plays')\n",
"for span in soup.find_all('span', class_='greenfont'):\n",
"    print(span.get_text())"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 5,
"metadata": {},
"source": [
"In fact, it's possible to even use CSS selectors with Beautiful Soup." | |
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print('Summary by inning')\n",
"for thead in soup.select('table.mod-data thead'):\n",
"    header_rows = thead.select('tr.team-color-strip')\n",
"    if len(header_rows) > 0:\n",
"        print(header_rows[0].select('th')[0].get_text())\n",
"\n",
"    footer_rows = thead.select('.colhead')\n",
"    if len(footer_rows) > 0:\n",
"        print(u'  {}'.format(footer_rows[0].get_text()))"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 5,
"metadata": {},
"source": [
"BeautifulSoup is not the only library for HTML parsing available. LXML,\n", | |
"a library used by BeautifulSoup, can be used similarly." | |
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from lxml import html\n",
"\n",
"tree = html.fromstring(resp.content)\n",
"\n",
"\n",
"print('Pitching changes')\n",
"for text in tree.xpath('//span[@class=\"redfont\"]/b/text()'):\n",
"    print(text)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Crawl like Google"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Scrapy**, a Python module for building web spiders and crawlers, is a little tough\n", | |
"to show here as a Scrapy project is usually broken into a number of files.\n", | |
"\n", | |
"Scrapy has lots of features that are useful for a spider:\n", | |
"* Automatically follow links (or certain matching links)\n", | |
"* Export data to various formats (CSV, JSON)\n", | |
"* Restricting crawl depth and breadth" | |
]
},
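{
"cell_type": "markdown",
"metadata": {},
"source": [
"Several of these behaviors are controlled through a project's `settings.py`.\n",
"Here is a minimal sketch, assuming the project is named `myscraper` (the\n",
"values are illustrative, not from the talk):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"## settings.py (illustrative values)\n",
"BOT_NAME = 'myscraper'\n",
"\n",
"DEPTH_LIMIT = 2       # don't follow links more than two hops from the start URLs\n",
"DOWNLOAD_DELAY = 0.5  # be polite: pause between requests to the same site\n",
"\n",
"# Equivalent to passing '-o data.json' on the command line\n",
"FEED_FORMAT = 'json'\n",
"FEED_URI = 'data.json'"
],
"language": "python",
"metadata": {},
"outputs": []
},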
{
"cell_type": "heading",
"level": 5,
"metadata": {},
"source": [
"First, let's define the data we want to retrieve"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"## items.py\n",
"import scrapy\n",
"\n",
"\n",
"class GameItem(scrapy.Item):\n",
"    url = scrapy.Field()\n",
"    inning = scrapy.Field()\n",
"    description = scrapy.Field()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "heading",
"level": 5,
"metadata": {},
"source": [
"Next, let's define how to find this data"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"## spiders/espn.py\n",
"\n",
"from scrapy.contrib.spiders import CrawlSpider, Rule\n",
"from scrapy.contrib.linkextractors import LinkExtractor\n",
"\n",
"from myscraper.items import GameItem\n",
"\n",
"\n",
"class EspnSpider(CrawlSpider):\n",
"    name = 'espn'\n",
"    allowed_domains = [\"espn.com\", \"espn.go.com\"]\n",
"    \n",
"    # start on the World Series page\n",
"    start_urls = ['http://espn.go.com/mlb/playoffs/2014/matchup/_/teams/sfo-kan']\n",
" rules = [Rule(LinkExtractor(allow=['/mlb/playbyplay\\?id=\\d+']), 'parse_game')]\n", | |
"\n", | |
" def parse_game(self, response):\n", | |
" game = GameItem()\n", | |
" game['url'] = response.url\n", | |
" game['description'] = response.xpath('//div[@class=\"game-time-location\"/p/text()').extract()\n", | |
" return game" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "heading", | |
"level": 5, | |
"metadata": {}, | |
"source": [ | |
"Finally, we run our spider" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
" # On the terminal\n", | |
" % scrapy crawl espn -o data.json\n", | |
" % more data.json\n", | |
" [{\"url\": \"http://espn.go.com/mlb/playbyplay?id=341021107\", \"description\": [\"8:07 PM ET, October 21, 2014\", \"Kauffman Stadium, Kansas City, Missouri\\u00a0\"]},\n", | |
" {\"url\": \"http://espn.go.com/mlb/playbyplay?id=341022107\", \"description\": [\"8:07 PM ET, October 22, 2014\", \"Kauffman Stadium, Kansas City, Missouri\\u00a0\"]}]" | |
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Not Python but very useful"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Wget** is a command-line program for downloading files over HTTP & FTP.\n", | |
"\n", | |
"Most importantly, it can download files \"recursively\" which means\n", | |
"that it can follow links. There's also \"mirror\" mode which is useful\n", | |
"for mirroring an entire website.\n", | |
"\n", | |
" % wget --wait=0.1 --random-wait --no-host-directories \\\n", | |
" --cut-dirs=2 --mirror --relative -e robots=off \\\n", | |
" --no-parent --no-verbose --append-output=log.txt \\\n", | |
" http://gdx.mlb.com/components/game/mlb/year_2014/month_10/day_21/gid_2014_10_21_sfnmlb_kcamlb_1\n", | |
" \n", | |
"For a \"complete\" project, see my [MLB downloader](https://github.com/davidfischer/gdx-downloader)" | |
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Not covered here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Automating a full web browser (JavaScript, AJAX)\n", | |
"* Working around tougher countermeasures (Captchas, IP restrictions)" | |
]
},
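{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a pointer on the first item: driving a real browser is usually done with a\n",
"tool like Selenium. A minimal sketch, assuming `selenium` and Firefox are\n",
"installed (this example was not part of the original talk):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from selenium import webdriver\n",
"\n",
"# Launch a real browser so JavaScript and AJAX run as they would for a user\n",
"driver = webdriver.Firefox()\n",
"driver.get('http://espn.go.com/mlb/playbyplay?id=341022107')\n",
"\n",
"# The rendered DOM is queryable once the page (and its scripts) have loaded\n",
"for elem in driver.find_elements_by_css_selector('span.greenfont'):\n",
"    print(elem.text)\n",
"\n",
"driver.quit()"
],
"language": "python",
"metadata": {},
"outputs": []
}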
],
"metadata": {}
}
]
}