$ ipython
In [1]: import pandas as pd
In [2]: mpg = pd.read_csv('mpg.csv')
In [3]: %load_ext rpy2.ipython
# Move the mpg dataframe to R: the "-i mpg" flag passes it as an input, so the dataframe moves from pandas to R.
In [4]: %R -i mpg
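# A hedged continuation: once mpg is in R it can be plotted from a %%R cell
# (ggplot2 and the displ/hwy columns of the classic mpg dataset are assumptions here):
In [5]: %%R
   ...: library(ggplot2)
   ...: print(ggplot(mpg, aes(x=displ, y=hwy)) + geom_point())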
# The scrapy framework has a helper function to create a skeleton scraper project, e.g.
scrapy genspider somesiteSpider somesite.com
# Sometimes patterns in the URLs make data acquisition a breeze. A text file containing a list of URLs is created:
outfile = open('urls.txt', 'w')
for id in range(1, 17723):
    url = "https://somesite.com/getDetail.do?someId=" + str(id)
    outfile.write(url + '\n')
outfile.close()
# The csv dataset was downloaded from http://data.denvergov.org/dataset/city-and-county-of-denver-crime
import pandas as pd
from elasticsearch import Elasticsearch
import json
data = pd.read_csv('/Users/awoolford/Downloads/crime.csv')
es = Elasticsearch()
# ignore 400 caused by IndexAlreadyExistsException when creating an index
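es.indices.create(index='crime', ignore=400)  # the 'crime' index name is an assumption; the gist is truncated here
# A hedged sketch of the loading step: pandas' to_json turns NaN into null, which Elasticsearch accepts
records = json.loads(data.to_json(orient='records'))
for i, record in enumerate(records):
    es.index(index='crime', doc_type='record', id=i, body=record)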
# Interesting features from the HTML can then be loaded into a Solr index.
import json
from bs4 import BeautifulSoup
import solr
s = solr.SolrConnection('http://localhost:8983/solr')
for recordNum, line in enumerate(open('/Users/awoolford/Documents/scrapeDPS/dpsk12_org/dpsk12_org.json', 'r').readlines()):
    try:
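        # A hedged continuation, assuming each JSON line holds 'url' and 'html' keys
        # and the Solr schema defines id, url, and title fields (the gist is truncated here):
        record = json.loads(line)
        soup = BeautifulSoup(record['html'])
        s.add(id=recordNum, url=record['url'], title=soup.title.string)
    except Exception:
        continue  # skip records with malformed JSON or no <title>
s.commit()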
# A JSON file containing the url and html is loaded into a Neo4j graph database.
import json
from bs4 import BeautifulSoup
from py2neo import neo4j
graph_db = neo4j.GraphDatabaseService()
urls = graph_db.get_or_create_index(neo4j.Node, "Urls")
connectedTo = graph_db.get_or_create_index(neo4j.Relationship, "ConnectedTo")
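# A hedged continuation using py2neo's legacy index API (the JSON path is reused from
# the Solr snippet above; the 'url'/'html' keys and the link extraction are assumptions):
for line in open('/Users/awoolford/Documents/scrapeDPS/dpsk12_org/dpsk12_org.json'):
    record = json.loads(line)
    page = urls.get_or_create("url", record['url'], {"url": record['url']})
    # add a node and a CONNECTED_TO relationship for every outbound link on the page
    for anchor in BeautifulSoup(record['html']).find_all('a', href=True):
        target = urls.get_or_create("url", anchor['href'], {"url": anchor['href']})
        connectedTo.get_or_create("link", record['url'] + '->' + anchor['href'],
                                  (page, "CONNECTED_TO", target))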
#!/usr/bin/python
"""
From time to time it's necessary to consolidate data from multiple spreadsheets. If the spreadsheets lack validation
rules, users may enter values that break business rules, add columns, change the column order, etc. It can become a
messy problem.
One possible way to handle this is to load all the workbooks/worksheets into a NoSQL database (e.g. MongoDB) and
explore the dataset in its entirety in order to identify and address anomalies.
"""
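# A hedged sketch of the loading step (openpyxl, the glob pattern, and the
# database/collection names are assumptions; the original gist is truncated here):
import glob
from openpyxl import load_workbook
from pymongo import MongoClient

collection = MongoClient().spreadsheets.rows

for path in glob.glob('*.xlsx'):
    for sheet in load_workbook(path, read_only=True):
        rows = sheet.iter_rows(values_only=True)
        header = [str(cell) for cell in next(rows)]
        for row in rows:
            # tag each document with its origin so anomalies can be traced back
            doc = dict(zip(header, row))
            doc['_source'] = {'file': path, 'sheet': sheet.title}
            collection.insert_one(doc)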
#!/usr/bin/python
from bs4 import BeautifulSoup
import urllib2
import re
from pygeocoder import Geocoder
import time
from pymongo import MongoClient
# Get the HTML for the school list and create a BeautifulSoup object.
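# A hedged continuation (the URL, the markup selectors, and the collection name are
# all assumptions; the original gist is truncated here):
html = urllib2.urlopen('http://example.org/school-list').read()  # hypothetical URL
soup = BeautifulSoup(html)
schools = MongoClient().denver.schools
for item in soup.find_all('li', class_='school'):
    name = item.find('h3').get_text().strip()
    address = re.sub(r'\s+', ' ', item.find('p').get_text()).strip()
    # geocode the address and store the school with its coordinates
    location = Geocoder.geocode(address)
    schools.insert_one({'name': name, 'address': address,
                        'coordinates': location.coordinates})
    time.sleep(1)  # stay under the geocoding API's rate limit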