Matthew Phillips phillipsm

Engineer at Moon Creative Lab in Palo Alto, California. 🎈

phillipsm / gist:1629473

Created January 17, 2012 22:39

	...

	/**
	* A container object to house our incoming HTTP request
	*
	* @author Matt Phillips <[email protected]>
	* @license http://www.gnu.org/licenses/lgpl.html GNU Lesser General Public License
	*/

	class http_request {

phillipsm / gist:8601065

Created January 24, 2014 16:43

wget command

	# Construct wget command
	command = 'wget '
	command = command + '--quiet ' # turn off wget's output
	command = command + '--tries=' + str(settings.NUMBER_RETRIES) + ' ' # number of retries (assuming no 404 or the like)
	command = command + '--wait=' + str(settings.WAIT_BETWEEN_TRIES) + ' ' # number of seconds between requests (lighten the load on a page that has a lot of assets)
	command = command + '--quota=' + settings.ARCHIVE_QUOTA + ' ' # only store this amount
	command = command + '--random-wait ' # vary the wait randomly between 0.5 and 1.5 times the --wait value
	command = command + '--limit-rate=' + settings.ARCHIVE_LIMIT_RATE + ' ' # we'll be performing multiple archives at once. let's not download too much in one stream
	command = command + '--adjust-extension ' # if a page is served up at .asp, adjust to .html. (this is the new --html-extension flag)
	command = command + '--span-hosts ' # sometimes things like images are hosted at a CDN. let's span-hosts to get those
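
The same flags can also be collected as an argument list rather than one concatenated string, which avoids shell-quoting problems when the list is handed to `subprocess.run`. This is a minimal sketch, not the gist's own code: the function name `build_wget_command`, its default values (stand-ins for the `settings.*` constants above) and the trailing URL argument are all assumptions made for illustration.

```python
import shlex


def build_wget_command(url, retries=3, wait_seconds=2,
                       quota="100m", limit_rate="200k"):
    """Return a wget invocation as a list of arguments.

    The defaults are hypothetical placeholders for settings.NUMBER_RETRIES,
    settings.WAIT_BETWEEN_TRIES, settings.ARCHIVE_QUOTA and
    settings.ARCHIVE_LIMIT_RATE from the snippet above.
    """
    return [
        "wget",
        "--quiet",                              # turn off wget's output
        "--tries={}".format(retries),           # number of retries
        "--wait={}".format(wait_seconds),       # seconds between requests
        "--quota={}".format(quota),             # only store this amount
        "--random-wait",                        # randomize the wait
        "--limit-rate={}".format(limit_rate),   # cap the download rate
        "--adjust-extension",                   # e.g. save .asp pages as .html
        "--span-hosts",                         # follow assets hosted elsewhere
        url,                                    # assumed: target URL goes last
    ]


args = build_wget_command("http://example.com/")
# Show the equivalent shell command, with each argument safely quoted
print(" ".join(shlex.quote(arg) for arg in args))
```

Passing `args` directly to `subprocess.run(args)` would run the command without going through a shell at all.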

phillipsm / gist:ad7981ed6bf8571e0c5b

Created July 3, 2014 19:43

	function check_status() {

		// Check our status service to see if we have archiving jobs pending
		var request = $.ajax({
			url: status_url + newLinky.linky_id,
			type: "GET",
			dataType: "json",
			cache: false
		});
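
For comparison, the idea of asking a status service whether jobs are still pending might be sketched in Python as a small polling helper. Everything here is an assumption for illustration: the helper name `wait_for_archive`, the `"pending"` key in the response, and the retry count are hypothetical and do not come from the gist, which only shows the start of a single AJAX request.

```python
import time


def wait_for_archive(fetch_status, attempts=5, delay_seconds=1.0):
    """Call fetch_status() until it reports no pending jobs.

    fetch_status is any callable returning a dict; the "pending" key is a
    hypothetical field name. Returns the final status dict, or None if the
    job was still pending after the last attempt.
    """
    for attempt in range(attempts):
        status = fetch_status()
        if not status.get("pending"):
            return status
        # Sleep between attempts, but not after the final one
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return None


# Stand-in for a real HTTP call: two canned responses (placeholder data)
responses = iter([
    {"pending": True},
    {"pending": False, "linky_id": "abc"},
])
result = wait_for_archive(lambda: next(responses), delay_seconds=0)
print(result)
```

Injecting `fetch_status` keeps the loop independent of any particular HTTP library, so it can be exercised without a network connection.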

phillipsm / gist:0ed98b2585f0ada5a769

Last active February 7, 2025 19:55

Example of parsing a table using BeautifulSoup and requests in Python

	import requests
	from bs4 import BeautifulSoup

	# We've now imported the two packages that will do the heavy lifting
	# for us, requests and BeautifulSoup

	# Let's put the URL of the page we want to scrape in a variable
	# so that our code down below can be a little cleaner
	url_to_scrape = 'http://apps2.polkcountyiowa.gov/inmatesontheweb/'

phillipsm / gist:c832c825c994735b31fe

Last active August 29, 2015 14:21

All material for dgmde15

All material used for dgmde15

still dumping material in here

phillipsm / gist:404780e419c49a5b62a8

Last active April 22, 2024 11:55

Inmate scraping script

	import requests
	from bs4 import BeautifulSoup
	import time

	# We've now imported the two packages that will do the heavy lifting
	# for us, requests and BeautifulSoup

	# This is the URL that lists the current inmates
	# Should this URL go away, an archive is available at
	# http://perma.cc/2HZR-N38X

phillipsm / gist:2bdb5f622cbabe107c5b

Created June 24, 2015 20:14

Import our packages

	import requests
	from bs4 import BeautifulSoup

phillipsm / gist:7199f931a2de6787c0b6

Created June 24, 2015 20:16

Build list of inmates

	url_to_scrape = 'http://apps2.polkcountyiowa.gov/inmatesontheweb/'

	r = requests.get(url_to_scrape)

	soup = BeautifulSoup(r.text)

	inmates_links = []

	for table_row in soup.select(".inmatesList tr"):
		table_cells = table_row.findAll('td')

phillipsm / gist:1f272a7caec08e44df2f

Last active August 29, 2015 14:23

	inmates = []

	for inmate_link in inmates_links[:10]:
		r = requests.get(inmate_link)
		soup = BeautifulSoup(r.text)

		inmate_details = {}

		inmate_profile_rows = soup.select("#inmateProfile tr")
		inmate_details['age'] = inmate_profile_rows[0].findAll('td')[0].text.strip()

phillipsm / gist:29d4cb4addb5c5a21ae7

Created June 24, 2015 20:22

Sum and print aggregations

	inmate_cities = {}

	for inmate in inmates:
		if inmate['city'] in inmate_cities:
			inmate_cities[inmate['city']] += 1
		else:
			inmate_cities[inmate['city']] = 1

	print(inmate_cities)
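
The tally above can also be written with `collections.Counter` from the standard library, which handles the "first time seen" case on its own. This is a sketch under stated assumptions: the `inmates` list below is invented sample data standing in for the scraped records, and only the `'city'` key is taken from the gist.

```python
from collections import Counter

# Invented sample records; the real list comes from the scraping step
inmates = [
    {"city": "Des Moines"},
    {"city": "Ankeny"},
    {"city": "Des Moines"},
]

# Count how many inmates list each city
inmate_cities = Counter(inmate["city"] for inmate in inmates)

# most_common() yields (city, count) pairs, largest count first
for city, count in inmate_cities.most_common():
    print(city, count)
```

A `Counter` is a `dict` subclass, so code that reads `inmate_cities[city]` keeps working unchanged.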