Last active
January 12, 2017 11:31
Scraping Dynamic Web Pages
| { | |
| "cells": [ | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "# Scraping Dynamic Web Pages\n", | |
| "\n", | |
| "# Table of Contents\n", | |
| "\n", | |
| "- [Introduction](#Introduction)\n", | |
| "- [Installation](#Installation)\n", | |
| "\n", | |
| " - [`requests` package](#requests-package)\n", | |
| " - [`mechanize` package](#mechanize-package)\n", | |
| " - [`PhantomJS` and `selenium`](#PhantomJS-and-selenium)\n", | |
| " \n", | |
| " - [Notes](#Notes)\n", | |
| " - [Mac](#Mac)\n", | |
| " - [Windows](#Windows)\n", | |
| "\n", | |
| "- [Example code](#Example-code)\n", | |
| "\n", | |
| " - [Notes](#Notes)\n", | |
| " - [Code](#Code)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "# Introduction\n", | |
| "\n", | |
| "- Back to [Table of Contents](#Table-of-Contents)\n", | |
| "\n", | |
| "As the web becomes more dynamic, more and more pages respond to a request for a URL with a small packet of HTML that directs the browser to assemble the page or application from an additional set of HTTP requests that the browser itself initiates, rather than returning an entire page of HTML that the browser simply renders.\n", | |
| "\n", | |
| "Basic programmatic HTTP clients like `requests` and `mechanize` are built to deal with HTML, not with dynamic pages that are essentially JavaScript applications delivered in a thin HTML wrapper. They can retrieve HTML well, and even keep track of sessions, cookies, and redirects, but they cannot execute JavaScript, so they will often fail to retrieve most of the content of a JavaScript-based Dynamic HTML (DHTML) web page.\n", | |
| "\n", | |
| "When you need to pull data from a web page rendered in part by JavaScript, you need to use Python to control a more substantial browser engine that can parse and execute JavaScript code in addition to HTML. Below are instructions for using Selenium ( [http://docs.seleniumhq.org/](http://docs.seleniumhq.org/) ) to control a window-less browser named PhantomJS ( [http://phantomjs.org/](http://phantomjs.org/) ). PhantomJS is based on WebKit, the rendering engine used by Safari in both OS X and iOS, which was also the basis of Blink, the rendering engine Chromium uses in Chrome and Opera.\n", | |
| "\n", | |
| "First, you'll install:\n", | |
| "\n", | |
| "- basic HTTP clients requests and mechanize, for comparison with WebKit.\n", | |
| "- node.js - a browser-independent JavaScript runtime whose package manager (npm) is one way to install PhantomJS.\n", | |
| "- PhantomJS - the headless (no actual window, just runs in memory) web client we'll use to render Javascript-based web pages.\n", | |
| "- Selenium - a software package designed to let a programmer script interactions with web sites in actual browsers (including Firefox, Chrome, IE, and Safari).\n", | |
| "\n", | |
| "Then, you'll see and run an example program that can be configured to use any of the three request packages to load the same page, so you can compare the results from each." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "# Installation\n", | |
| "\n", | |
| "- Back to [Table of Contents](#Table-of-Contents)\n", | |
| "\n", | |
| "In addition to the packages below, make sure that you've installed BeautifulSoup and html5lib ( `conda install beautifulsoup4 html5lib` ), which the example code uses to parse the HTML once the selected HTTP client retrieves it." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## `requests` package\n", | |
| "\n", | |
| "- Back to [Table of Contents](#Table-of-Contents)\n", | |
| "\n", | |
| "First, we'll install the requests package:\n", | |
| "\n", | |
| " conda install requests\n", | |
| " # OR\n", | |
| " pip install requests" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## `mechanize` package\n", | |
| "\n", | |
| "- Back to [Table of Contents](#Table-of-Contents)\n", | |
| "\n", | |
| "Next, install the `mechanize` package:\n", | |
| "\n", | |
| " pip install mechanize" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## `PhantomJS` and `selenium`\n", | |
| "\n", | |
| "- Back to [Table of Contents](#Table-of-Contents)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "### Notes\n", | |
| "\n", | |
| "- Back to [Table of Contents](#Table-of-Contents)\n", | |
| "\n", | |
| "`PhantomJS` and `selenium` notes:\n", | |
| "\n", | |
| "- You can install PhantomJS one of two ways: using the node package manager (npm) or by downloading directly from the PhantomJS web site ( [http://phantomjs.org/](http://phantomjs.org/) ). Each has advantages and disadvantages:\n", | |
| "\n", | |
| " - If you install using npm, you will get an older, more stable version of PhantomJS, but it is easy to upgrade it as updates are released, and it is also easy to remove it.\n", | |
| " - If you install from the PhantomJS site, you get the latest and the greatest, but you have to keep track of where it is installed and remember to update.\n", | |
| " - In general, it is probably a good idea to first install with npm. Then, if the PhantomJS version that npm installs bundles a WebKit too old to render the pages you need, install from the PhantomJS downloads site.\n", | |
| "\n", | |
| "- When using Selenium to control a PhantomJS browser instance, you'll need to specify the path to the PhantomJS executable. Selenium expects that path to use \"/\" as the directory separator. So, for both Mac and Windows below, paths are written with \"/\" between directories, even though the native separator in Windows is \"\\\"." | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "### Mac\n", | |
| "\n", | |
| "- Back to [Table of Contents](#Table-of-Contents)\n", | |
| "\n", | |
| "Mac installation:\n", | |
| "\n", | |
| "- Install node.js using nvm\n", | |
| "\n", | |
| " - based on: [http://flnkr.com/2014/11/install-nvm-and-node-js-on-os-x/](http://flnkr.com/2014/11/install-nvm-and-node-js-on-os-x/)\n", | |
| " - first, download and install NVM\n", | |
| "\n", | |
| " - https://github.com/creationix/nvm/\n", | |
| " - you will download and run a shell script. Get the one for the latest version of NVM and run it with either curl or wget. Example for v0.24.0:\n", | |
| "\n", | |
| " curl https://raw.githubusercontent.com/creationix/nvm/v0.24.0/install.sh | bash\n", | |
| " wget -qO- https://raw.githubusercontent.com/creationix/nvm/v0.24.0/install.sh | bash\n", | |
| "\n", | |
| " - The script installs NVM in ~/.nvm and adds a command to your .bash_profile that makes the nvm command available in the shell. If you want that command elsewhere (in .bashrc, for example), move it. (I set .bashrc to run .bash_profile, so I put commands like this in .bash_profile, where they always run, login or non-login.)\n", | |
| " - if you are logged in to a shell, log out, then log back in or close the shell then open a new one.\n", | |
| " - use the nvm command to install the current stable version of node.js:\n", | |
| "\n", | |
| " nvm install stable\n", | |
| " \n", | |
| "- Install PhantomJS\n", | |
| "\n", | |
| " - using npm\n", | |
| "\n", | |
| " npm install -g phantomjs\n", | |
| " \n", | |
| " - This will install PhantomJS in the Node Version Manager folder for the currently selected version of node, inside `lib/node_modules`.\n", | |
| " - Inside `lib/node_modules`, the binary will be at path `phantomjs/lib/phantom/bin/phantomjs` (totally different from Windows).\n", | |
| " \n", | |
| " - Example executable path for node v0.12.1: `/Users/jonathanmorgan/.nvm/versions/node/v0.12.1/lib/node_modules/phantomjs/lib/phantom/bin/phantomjs`\n", | |
| "\n", | |
| " - from downloaded distribution file\n", | |
| "\n", | |
| " - download the latest PhantomJS from [http://phantomjs.org/download.html](http://phantomjs.org/download.html)\n", | |
| " - unpack the folder wherever you want.\n", | |
| " - path to executable will be `<install_directory>/phantomjs-X.X.X-macosx/bin/phantomjs` where \"X.X.X\" is the version number of downloaded phantomjs.\n", | |
| "\n", | |
| "- Install Selenium\n", | |
| "\n", | |
| " pip install selenium" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "### Windows\n", | |
| "\n", | |
| "- Back to [Table of Contents](#Table-of-Contents)\n", | |
| "\n", | |
| "Windows installation:\n", | |
| "\n", | |
| "- Install node.js\n", | |
| " \n", | |
| "\t- using installer:\n", | |
| "\n", | |
| "\t\t- download the latest installer from https://nodejs.org/download/\n", | |
| "\t\t- run it.\n", | |
| " - reboot the machine to get `npm` and `node` into `PATH`.\n", | |
| "\n", | |
| "\t- To update to the latest version, download the latest installer and run it. It will overwrite older versions.\n", | |
| "\n", | |
| "\n", | |
| "- Install PhantomJS\n", | |
| "\n", | |
| " - using npm\n", | |
| "\n", | |
| " npm install phantomjs\n", | |
| " \n", | |
| " - the examples below use \"/\" as the directory delimiter because selenium expects it; paths that use \"\\\" won't work, even on Windows.\n", | |
| " - should install to either `~/node_modules/phantomjs` or `~/AppData/Roaming/npm/node_modules/phantomjs`.\n", | |
| " - Inside the `phantomjs` folder, the executable will be at path `lib/phantom/phantomjs.exe`\n", | |
| "\n", | |
| " - Example for `~/node_modules`: `~/node_modules/phantomjs/lib/phantom/phantomjs.exe`\n", | |
| " - Example for `~/AppData/...`: `~/AppData/Roaming/npm/node_modules/phantomjs/lib/phantom/phantomjs.exe`\n", | |
| "\n", | |
| " - from download\n", | |
| " \n", | |
| " - download the latest PhantomJS from [http://phantomjs.org/download.html](http://phantomjs.org/download.html)\n", | |
| " - unzip the folder wherever you want.\n", | |
| " - path to executable will be `<install_directory>/phantomjs-X.X.X-windows/bin/phantomjs.exe` where \"X.X.X\" is the version number of downloaded phantomjs.\n", | |
| "\n", | |
| "\n", | |
| "- Install Selenium\n", | |
| "\n", | |
| " pip install selenium" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "# Example code\n", | |
| "\n", | |
| "- Back to [Table of Contents](#Table-of-Contents)\n", | |
| "\n", | |
| "## Notes\n", | |
| "\n", | |
| "- Back to [Table of Contents](#Table-of-Contents)\n", | |
| "\n", | |
| "Example code notes:\n", | |
| "\n", | |
| "- Make sure all of the installation steps above for node.js, PhantomJS and selenium are complete before running the PhantomJS example.\n", | |
| "- The PhantomJS client might be slow. If nothing happens immediately when you run it, give it a few minutes to do its thing (Javascript on the server isn't exactly fast yet, it seems, without some tuning).\n", | |
| "- The code below is based in part on: http://stackoverflow.com/questions/13287490/is-there-a-way-to-use-phantomjs-in-python" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "## Code\n", | |
| "\n", | |
| "- Back to [Table of Contents](#Table-of-Contents)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": { | |
| "collapsed": false | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "import mechanize\n", | |
| "import requests\n", | |
| "\n", | |
| "# try selenium and PhantomJS\n", | |
| "from selenium import webdriver\n", | |
| "\n", | |
| "from bs4 import BeautifulSoup\n", | |
| "\n", | |
| "# declare variables\n", | |
| "test_url = \"\"\n", | |
| "http_client = \"\"\n", | |
| "HTTP_REQUESTS = \"requests\"\n", | |
| "HTTP_MECHANIZE = \"mechanize\"\n", | |
| "HTTP_PHANTOM_JS = \"phantomJS\"\n", | |
| "test_html = \"\"\n", | |
| "\n", | |
| "# declare variables - requests and mechanize\n", | |
| "response = None\n", | |
| "\n", | |
| "# declare variables - PhantomJS\n", | |
| "phantomjs_bin_path = \"\"\n", | |
| "driver = None\n", | |
| "\n", | |
| "# declare variables - BeautifulSoup\n", | |
| "test_bs = None\n", | |
| "mlive_comment_div_bs = None\n", | |
| "\n", | |
| "# if PhantomJS, set the path to the phantomjs executable.\n", | |
| "\n", | |
| "# example from Mac and NVM/NPM:\n", | |
| "#phantomjs_bin_path = \"/Users/jonathanmorgan/.nvm/versions/node/v0.12.1/lib/node_modules/phantomjs/lib/phantom/bin/phantomjs\"\n", | |
| "\n", | |
| "# example from Windows (must be forward slashes between directories, even though it is Windows):\n", | |
| "phantomjs_bin_path = \"C:/Users/jonathanmorgan/AppData/Roaming/npm/node_modules/phantomjs/lib/phantom/phantomjs.exe\"\n", | |
| "#phantomjs_bin_path = \"C:/Users/jonathanmorgan/Downloads/phantomjs-2.0.0-windows/bin/phantomjs.exe\"\n", | |
| "\n", | |
| "# set URL you want to test\n", | |
| "\n", | |
| "# MLive page with dynamically loaded comments\n", | |
| "test_url = \"http://www.mlive.com/lansing-news/index.ssf/2014/06/michigan_senate_approves_histo.html\"\n", | |
| "\n", | |
| "# Sina Weibo page with dynamically loaded comments\n", | |
| "#test_url = \"http://www.weibo.com/1574684061/C0Oki37Va?type=comment\"\n", | |
| "\n", | |
| "# tell it what client you want to try.\n", | |
| "http_client = HTTP_PHANTOM_JS\n", | |
| "\n", | |
| "# retrieve HTML using the client selected in http_client\n", | |
| "if ( http_client == HTTP_REQUESTS ):\n", | |
| "\n", | |
| " # retrieve with requests\n", | |
| " response = requests.get( test_url )\n", | |
| " test_html = response.text\n", | |
| " \n", | |
| "elif( http_client == HTTP_MECHANIZE ):\n", | |
| "\n", | |
| " # retrieve with mechanize\n", | |
| " response = mechanize.urlopen( test_url )\n", | |
| " test_html = response.read()\n", | |
| "\n", | |
| "elif( http_client == HTTP_PHANTOM_JS ):\n", | |
| " \n", | |
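| " # retrieve with PhantomJS via selenium\n", | |
| " # get PhantomJS Selenium Driver - on a server, this will be your only choice.\n", | |
| " # note: webdriver.PhantomJS was removed in Selenium 4, so this call requires an older selenium release.\n", | |
| " driver = webdriver.PhantomJS( phantomjs_bin_path ) # pass the path to the executable explicitly\n", | |
| " #driver = webdriver.PhantomJS() # or, if phantomjs is on your PATH\n", | |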
| "\n", | |
| " # if you are on a machine with Firefox installed, an option:\n", | |
| " #driver = webdriver.Firefox()\n", | |
| "\n", | |
| " # if you are on a machine with Chrome installed, an option:\n", | |
| " #driver = webdriver.Chrome()\n", | |
| "\n", | |
| " # set up virtual window\n", | |
| " driver.set_window_size(1024, 768) # optional\n", | |
| "\n", | |
| " # grab page.\n", | |
| " driver.get( test_url )\n", | |
| "\n", | |
| " # save screenshot.\n", | |
| " driver.save_screenshot('screen.png') # save a screenshot to disk\n", | |
| "\n", | |
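| " # get HTML source\n", | |
| " test_html = driver.page_source\n", | |
| "\n", | |
| " # shut down the browser process now that we have the HTML\n", | |
| " driver.quit()\n", | |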
| "\n", | |
| "#-- END check to see what request loader we are using. --#\n", | |
| " \n", | |
| "# place HTML in BeautifulSoup\n", | |
| "test_bs = BeautifulSoup( test_html, \"html5lib\" )\n", | |
| "\n", | |
| "# get div that contains comments by ID\n", | |
| "mlive_comment_div_bs = test_bs.find( \"div\", id=\"rtb-comments\")\n", | |
| "print( \"- <div id=\\\"rtb-comments\\\">: \" + str( mlive_comment_div_bs ) )\n", | |
| "#print( test_bs )" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": { | |
| "collapsed": true | |
| }, | |
| "outputs": [], | |
| "source": [] | |
| } | |
| ], | |
| "metadata": { | |
| "kernelspec": { | |
| "display_name": "Python 2", | |
| "language": "python", | |
| "name": "python2" | |
| }, | |
| "language_info": { | |
| "codemirror_mode": { | |
| "name": "ipython", | |
| "version": 2 | |
| }, | |
| "file_extension": ".py", | |
| "mimetype": "text/x-python", | |
| "name": "python", | |
| "nbconvert_exporter": "python", | |
| "pygments_lexer": "ipython2", | |
| "version": "2.7.10" | |
| } | |
| }, | |
| "nbformat": 4, | |
| "nbformat_minor": 0 | |
| } |
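The installation notes describe two conventions that are easy to get wrong when setting `phantomjs_bin_path`: the path must use "/" between directories even on Windows, and the executable sits at a different location inside the npm `phantomjs` folder on Mac and on Windows. The sketch below shows one way to build such a path. The helper names `to_forward_slashes` and `phantomjs_path_from_npm` and the user name `me` are hypothetical, introduced only for illustration; they are not part of the notebook, selenium, or PhantomJS, and the relative paths are the ones listed in the installation notes, which may differ for other PhantomJS versions.

```python
def to_forward_slashes(path):
    """Return path with every backslash replaced by a forward slash."""
    return path.replace("\\", "/")


def phantomjs_path_from_npm(node_modules_dir, is_windows):
    """Build the PhantomJS executable path for an npm install.

    node_modules_dir is the node_modules folder that contains the
    phantomjs package. The relative locations below are the ones given
    in the installation notes (an assumption for other versions).
    """
    # normalize separators and drop any trailing slash
    base = to_forward_slashes(node_modules_dir).rstrip("/")

    if is_windows:
        relative = "phantomjs/lib/phantom/phantomjs.exe"
    else:
        relative = "phantomjs/lib/phantom/bin/phantomjs"

    return base + "/" + relative


if __name__ == "__main__":
    # hypothetical Windows install location, written with native backslashes
    windows_dir = r"C:\Users\me\AppData\Roaming\npm\node_modules"
    print(phantomjs_path_from_npm(windows_dir, True))
```

The resulting string could then be assigned to `phantomjs_bin_path` in the example code. Plain string replacement is used rather than a path library so the sketch also runs on the Python 2 kernel this notebook declares.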