{
"metadata": {
"name": "WebScraper2"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Web Scraping Session 2 ###\n",
"by Juan Pablo Alperin (@juancommander)\n",
"\n",
"Now, say you need to login to a site to access the pages that you need. Or that you need to move between pages. Then you need to simulate a browser.\n",
"\n",
"For this, one option is [Mechanize](https://pypi.python.org/pypi/mechanize/): _Stateful programmatic web browsing_."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import mechanize\n",
"\n",
"url='http://www.amazon.com/Hundred-Solitude-Gabriel-Garcia-Marquez/dp/006112009X/'\n",
"\n",
"# Browser\n",
"br = mechanize.Browser()\n",
"\n",
"# User-Agent (this is cheating, ok?)\n",
"br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]\n",
"\n",
"\n",
"# fetch the page\n",
"response = br.open(url)\n",
"print \"Book Page:\"\n",
"print br.title()\n",
"print response.geturl() # the URL the browser is currently at\n",
"# print response.info() # HTML headers \n",
"print \n",
"\n",
"response = br.follow_link(text_regex=r\"See all \\d+ customer reviews\")\n",
"print \"Review Page:\"\n",
"print br.title()\n",
"print response.geturl() # the URL the browser is currently at"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Book Page:\n",
"One Hundred Years of Solitude: Gabriel Garcia Marquez: 9780061120091: Amazon.com: Books"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"http://www.amazon.com/Hundred-Solitude-Gabriel-Garcia-Marquez/dp/006112009X/\n",
"\n",
"Review Page:"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"Amazon.com: Customer Reviews: One Hundred Years of Solitude\n",
"http://www.amazon.com/Hundred-Solitude-Gabriel-Garcia-Marquez/product-reviews/006112009X/ref=cm_cr_dp_see_all_summary/178-7878868-7659035?ie=UTF8&showViewpoints=1&sortBy=byRankDescending\n"
]
}
],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise 1:**\n",
"\n",
"Write a loop that follows the \"next\" link 10 times. \n",
"\n",
"_Hint:_ Spend some time reading through the examples on the [Mechanize Homepage](http://wwwsearch.sourceforge.net/mechanize/). The documentation is not as thorough as for BeautifulSoup, but there are enough examples on follow links to do this assignment. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for i in range(0,10):\n",
" ## your code here\n",
" br.geturl()\n",
" \n",
"# answer: \n",
"br.follow_link(url_regex=r'.*btm_link_next_.*')\n"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 2,
"text": [
"<response_seek_wrapper at 0x10378ed40 whose wrapped object = <closeable_response at 0x1037b6fc8 whose fp = <socket._fileobject object at 0x10377d950>>>"
]
}
],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some options that can come in handy:\n",
"\n",
" # Browser options\n",
" br.set_proxies({\"http\": \"joe:[email protected]:3128\"}) # if you need to go through a proxy\n",
" br.set_handle_redirect(True)\n",
" br.set_handle_referer(True)\n",
" br.set_handle_robots(False) # some pages do not want to be scraped (http://www.robotstxt.org/robotstxt.html)\n",
"\n",
" # Follows refresh 0 but not hangs on refresh > 0\n",
" br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)\n",
"\n",
" # Want debugging messages?\n",
" br.set_debug_http(False)\n",
" br.set_debug_redirects(False)\n",
" br.set_debug_responses(False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you need to fill out and submit Web Forms, mechanize [is particularly useful](http://wwwsearch.sourceforge.net/mechanize/forms.html):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"br.select_form(name=\"site-search\") # select the form\n",
"form = br.form # put it in a variable to make it easier to work with\n",
"\n",
"form['url'] = ['search-alias=stripbooks']\n",
"form['field-keywords'] = 'Gabriel Garcia Marquez'\n",
"br.submit()\n",
"\n",
"print br.title()\n",
"print br.geturl()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Amazon.com: Gabriel Garcia Marquez: Books\n",
"http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Dstripbooks&field-keywords=Gabriel+Garcia+Marquez\n"
]
}
],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Which brings us to another way of submitting (some) forms: directly through a URL. \n",
"\n",
"Lets [break down](http://www.mattcutts.com/blog/seo-glossary-url-definitions/): [http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Dstripbooks&field-keywords=Gabriel+Garcia+Marquez](http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Dstripbooks&field-keywords=Gabriel+Garcia+Marquez)\n",
"\n",
"You use them everyday, now it is time to understand the parts: \n",
"\n",
"* `http://` is the \"protocol\". [http](http://en.wikipedia.org/wiki/Http) is the standard one, you will also see [https](http://en.wikipedia.org/wiki/Https) which is the \"secure\" one\n",
"* `www.amazon.com` is the [domain name](http://en.wikipedia.org/wiki/Domain_name). It can have more parts, but it is everything before the first `/`\n",
"* `s/ref=nb_sb_noss` is the path. It is not typical to have =, but acceptable. This specifies the page within the domain name. It can/be/many/parts/. It is everything between the first / and the first ?\n",
"* `url=search-alias%3Dstripbooks&field-keywords=Gabriel+Garcia+Marquez` are the query parameters. Basically the results of form submissions. \n",
"\n",
"Lets look at the parameters in more detail: \n",
"\n",
"* parameters are separated by `&`, so we have: \n",
" * url=search-alias%3Dstripbooks\n",
" * field-keywords=Gabriel+Garcia+Marquez\n",
"* parameter names are preceded by an `=`, so we have: \n",
" * url\n",
" * field-keywords\n",
"* parameter values are \"url escaped\" (sometimes known as [percent encoding](http://en.wikipedia.org/wiki/Url_escape))\n",
" * search-alias%3Dstripbooks => search-alias=stripbooks (this allows you to have = and other special symbols)\n",
" * Gabriel+Garcia+Marquez (sometimes Gabriel%20Garcia%20Marquez) => Gabriel Garcia Marquez\n",
"\n",
"Fortunately, you can get python to figure it all out for you: "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import urllib, urllib2\n",
"# make a dictionary of your parameters\n",
"params = {'url': 'search-alias=stripbooks', 'field-keywords': 'Gabriel Garcia Marquez'}\n",
"print urllib.urlencode(params)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"url=search-alias%3Dstripbooks&field-keywords=Gabriel+Garcia+Marquez\n"
]
}
],
"prompt_number": 4
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# put it together with the rest of the URL\n",
"url = 'http://www.amazon.com/s/ref=nb_sb_noss' + '?' + urllib.urlencode(params)\n",
"print url"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Dstripbooks&field-keywords=Gabriel+Garcia+Marquez\n"
]
}
],
"prompt_number": 5
},
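{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also go in the other direction and take a URL apart with python's built-in [urlparse](http://docs.python.org/2/library/urlparse.html) module. A quick sketch, reusing the `url` variable from the cell above:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import urlparse\n",
"\n",
"# split the URL into its parts: protocol, domain name, path, query parameters\n",
"parts = urlparse.urlparse(url)\n",
"print parts.scheme   # the protocol (http)\n",
"print parts.netloc   # the domain name (www.amazon.com)\n",
"print parts.path     # the path (/s/ref=nb_sb_noss)\n",
"\n",
"# decode the query parameters back into a dictionary of lists (percent encoding is undone for us)\n",
"print urlparse.parse_qs(parts.query)"
],
"language": "python",
"metadata": {},
"outputs": []
},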
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### working with APIs ###\n",
"\n",
"Working with (most) API's is not very different. <b>A</b>pplication <b>P</b>rogramming <b>I</b>nterfaces are basically just websites that return structured data instead of an HTML page. \n",
"\n",
"Lets have a look at the [World Bank API](http://data.worldbank.org/node/11). Like most of the APIs you'll ever use, it has a [RESTful API](http://stackoverflow.com/questions/671118/what-exactly-is-restful-programming) and parameter (argument) based endpoints.\n",
"\n",
"Again, the only difference with using an API from scraping is that the data comes back in a machine-readable format, not just HTML. The two most common formats are XML and JSON. JSON is actually very similar to the syntax for python dictionaries and there is a [json module](http://docs.python.org/2/library/json.html) to help us convert JSON into python.\n",
"\n",
"Lets look at the JSON module:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import json\n",
"# the output from the following URL: http://api.worldbank.org/countries/BRA?format=json\n",
"# this is just a string\n",
"api_response = '[{\"page\":1,\"pages\":1,\"per_page\":\"50\",\"total\":1},[{\"id\":\"BRA\",\"iso2Code\":\"BR\",\"name\":\"Brazil\",\"region\":{\"id\":\"LCN\",\"value\":\"Latin America & Caribbean (all income levels)\"},\"adminregion\":{\"id\":\"LAC\",\"value\":\"Latin America & Caribbean (developing only)\"},\"incomeLevel\":{\"id\":\"UMC\",\"value\":\"Upper middle income\"},\"lendingType\":{\"id\":\"IBD\",\"value\":\"IBRD\"},\"capitalCity\":\"Brasilia\",\"longitude\":\"-47.9292\",\"latitude\":\"-15.7801\"}]]'\n",
"\n",
"# loads (load string) will convert it to a Python object we can use\n",
"json_response = json.loads(api_response)\n",
"\n",
"# we have a list, with a list some lists or dicts inside\n",
"print json_response"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[{u'per_page': u'50', u'total': 1, u'page': 1, u'pages': 1}, [{u'longitude': u'-47.9292', u'name': u'Brazil', u'region': {u'id': u'LCN', u'value': u'Latin America & Caribbean (all income levels)'}, u'adminregion': {u'id': u'LAC', u'value': u'Latin America & Caribbean (developing only)'}, u'iso2Code': u'BR', u'capitalCity': u'Brasilia', u'latitude': u'-15.7801', u'incomeLevel': {u'id': u'UMC', u'value': u'Upper middle income'}, u'id': u'BRA', u'lendingType': {u'id': u'IBD', u'value': u'IBRD'}}]]\n"
]
}
],
"prompt_number": 6
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# we can grab the data we want from our list\n",
"brazil_data = json_response[1][0]\n",
"print brazil_data['capitalCity']"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Brasilia\n"
]
}
],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Excercise 2**\n",
"\n",
"Use the World Bank API to find the [GNI per Capita](http://data.worldbank.org/indicator/NY.GNP.PCAP.CD/) of Argentina, Brazil, Mexico and Canada\n",
"\n",
"_hint:_ remember urllib2.urlopen from the the beginning of the earlier session?"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import json\n",
"\n",
"for country in ['arg', 'bra', 'mex', 'can']:\n",
" url = 'http://api.worldbank.org/countries/' + country + '/indicators/NY.GNP.PCAP.CD?date=2008&format=json'\n",
" content = urllib2.urlopen(url).read()\n",
" response = json.loads(content)\n",
" print response[1][0]['value']"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"None\n",
"7480"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"9370"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"43460"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n"
]
}
],
"prompt_number": 9
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Time to take a drive"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The last option for web scraping is to write some code to drive a browser. The easiest option to do that is to use [Selenium](https://addons.mozilla.org/en-us/firefox/addon/selenium-ide-button/). Selenium was actually intended to help do automated testing of Websites, but it works well for our purposes. \n",
"\n",
"While everything we've seen so far could be put on a server to run and only requires python, Selenium will require your actual browser. It will use the browser itself to fetch and render the pages. \n",
"\n",
"This is a great option in two scenarios: \n",
"\n",
"1. If the page you want to parse requires a lot of javascript or something else that does not let urllib or mechanize to work with the page. Some sites do this on purpose to prevent scraping. \n",
"2. If you have to do a lot of navigating around a page to get to the information you want to scrape\n",
"\n",
"Because of the installation requirements, I will only demonstrate this option. It requires you to install selenium in your python installation, and a selenium plugin in Firefox. "
]
}
],
"metadata": {}
}
]
}