Hugh Brown hughdbrown

hughdbrown / so_read_url_write_file.py
Created October 7, 2015 16:02
Read from URL, write to file
"""
Code to write data read from a URL to a file
Based on an answer on SO:
http://stackoverflow.com/questions/22676/how-do-i-download-a-file-over-http-using-python/22721
"""
import urllib2
mp3file = urllib2.urlopen("http://www.example.com/songs/mp3.mp3")
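For readers on Python 3, where `urllib2` no longer exists, the same idea can be sketched with `urllib.request`. This is a minimal sketch rather than the gist's own code: the helper name `download_to_file` and its `destination` argument are hypothetical, and the response is streamed to disk with `shutil.copyfileobj` instead of being read into memory in one call.

```python
import shutil
from urllib.request import urlopen


def download_to_file(url, destination):
    """Stream the body of `url` into the file at `destination`.

    `download_to_file` is a hypothetical helper name, not part of the gist.
    """
    # urlopen returns a file-like response; copying it in chunks means a
    # large download is never held fully in memory.
    with urlopen(url) as response, open(destination, "wb") as output:
        shutil.copyfileobj(response, output)
    return destination
```

Calling `download_to_file("http://www.example.com/songs/mp3.mp3", "mp3.mp3")` would save the example URL from the snippet above to a local file; the output filename is arbitrary.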
hughdbrown / ds_simple_model.py
Last active October 2, 2015 19:23
Simple data science code for a model
from __future__ import print_function
import numpy as np
from nltk.corpus import stopwords
# from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn import metrics
from sklearn.model_selection import train_test_split  # , cross_val_score
hughdbrown / ds_a_b_test.py
Last active October 2, 2015 15:39
Data science: a-b-test
import numpy
import scipy.stats as scs


def a_b_test(new_views, new_clicks, old_views, old_clicks, size=10000):
    new_site = scs.beta(a=new_clicks + 1, b=new_views + 1).rvs(size=size)
    old_site = scs.beta(a=old_clicks + 1, b=old_views + 1).rvs(size=size)
    return (new_site > old_site).mean()
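To make the meaning of the return value concrete, here is a self-contained sketch of the same posterior-sampling idea. It rests on assumptions that are not in the gist: it uses the standard library's `random.betavariate` in place of `scipy.stats.beta` so that it runs without third-party packages; it assumes `views` counts every impression, so the second Beta parameter is `views - clicks + 1` (the posterior under a uniform prior) rather than the gist's `views + 1`; and the name `prob_new_beats_old` and the `seed` parameter are hypothetical.

```python
import random


def prob_new_beats_old(new_views, new_clicks, old_views, old_clicks,
                       size=10000, seed=0):
    """Estimate P(new click-through rate > old click-through rate).

    Hypothetical helper: draws paired samples from each site's Beta
    posterior and returns the fraction of draws the new site wins.
    """
    rng = random.Random(seed)  # seeded so repeated calls give the same estimate
    wins = 0
    for _ in range(size):
        # Assumption: views includes the impressions that led to clicks.
        new_rate = rng.betavariate(new_clicks + 1, new_views - new_clicks + 1)
        old_rate = rng.betavariate(old_clicks + 1, old_views - old_clicks + 1)
        if new_rate > old_rate:
            wins += 1
    return wins / size
```

A result near 0.5 means the data cannot tell the two sites apart; a result near 1.0 means the new site very likely has the higher click-through rate.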
hughdbrown / aws-copy-s3-to-s3.md
Last active September 4, 2015 17:10
Copy s3 to s3

Here is how I copied data from one S3 bucket to another:

aws s3 sync s3://bitly-challenges/hdb_sanitized s3://hughdbrown/data-capstone

Adapted from Stack Overflow.

hughdbrown / data-resume-clustering.md
Last active April 6, 2018 20:29
Resume clustering

Resume clustering

Description

I have a resume, but does it say what I want it to say? Specifically, do machine learning algorithms cluster my resume with resumes for the job title I am aiming for?

Data source

  • LinkedIn data/resumes for various job titles: developer, devops, data scientist, full stack, etc.

Method

  1. Create a database of resumes: developer, devops, data scientist, full stack, etc.
  2. Train a K-means model
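
The two steps above can be sketched on toy data. This is only an illustration under stated assumptions: the function names (`vectorize`, `squared_distance`, `kmeans`) are hypothetical, resumes are represented as normalized bag-of-words vectors, and the centroids are seeded with a simple farthest-first rule because the gist does not say how the model is initialized. A real project would more likely rely on a library such as scikit-learn than on this pure-Python version.

```python
import math
from collections import Counter


def vectorize(documents):
    """Turn each document into an L2-normalized bag-of-words vector."""
    vocabulary = sorted({word for doc in documents for word in doc.lower().split()})
    vectors = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        vector = [float(counts[word]) for word in vocabulary]
        norm = math.sqrt(sum(x * x for x in vector)) or 1.0
        vectors.append([x / norm for x in vector])
    return vectors


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, k, iterations=20):
    """Cluster vectors into k groups and return one label per vector."""
    # Farthest-first seeding (an assumption): start from the first vector,
    # then repeatedly add the vector farthest from the chosen centroids.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))

    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

With labels in hand, the question in the description becomes: which cluster does my own resume land in, and which job titles dominate that cluster?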
hughdbrown / data-homeaway.md
Created August 31, 2015 23:04
Homeaway data

Homeaway data

Description

Homeaway has data on vacation rentals. The data is not nearly as worked over as Airbnb data, so there may be something interesting in it to discover.

Data source

  • Homeaway API access
    The main problem with the project is that the Homeaway API is pretty opaque. I can't figure out how to get a data dump. Also, the API requires registration and advance permission.
hughdbrown / data-job-recommender.md
Created August 31, 2015 22:59
Job recommender that bootstraps from list of job postings

Job recommender

Description

Job sites often show candidates listings that are far off topic. The job title is often not applicable to the candidate and, less often, the location does not match the candidate's location.

Question

Can we build a better system for users by applying a recommender system to existing public listings?

Data source

  • glassdoor.com API
  • indeed.com web scraping
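
One simple way to start on the question is content-based ranking: score each posting by how similar its text is to the candidate's profile and return the best matches. The following is a minimal sketch under assumptions that are not in the original: bag-of-words cosine similarity stands in for a full recommender system, postings are passed as an in-memory dict instead of being fetched from glassdoor.com or indeed.com, and the names `cosine_similarity` and `recommend` are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two word-count mappings."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(profile, postings, top_n=3):
    """Return up to top_n posting titles ranked by similarity to the profile.

    `postings` maps a title to the posting text. Postings that share no
    words with the profile are dropped instead of being ranked last.
    """
    profile_counts = Counter(profile.lower().split())
    scored = [
        (cosine_similarity(profile_counts, Counter(text.lower().split())), title)
        for title, text in postings.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [title for score, title in scored[:top_n] if score > 0]
```

Dropping zero-similarity postings targets the complaint in the description directly: a listing with nothing in common with the candidate is never shown, however short the result list is.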
hughdbrown / data-UN-voting-blocs.md
Created August 31, 2015 17:31
UN global warming voting blocs

UN global warming voting blocs

Description

I was listening to NPR today and heard that within the UN there are about a dozen different blocs that vote together on global warming issues:

  • Switzerland alone
  • Developed countries
  • European group
  • "77 countries plus China" ... which is actually 134 countries
  • Various island nations most affected
hughdbrown / data-wikipedia.md
Created August 31, 2015 17:26
Data project in wikipedia

Wikipedia data

Description

I like Wikipedia. There must be some sort of project I could do with this data.

Data source

  • Wikipedia: there are accessible dumps of Wikipedia data.
hughdbrown / how-to-put-your-project-online.md
Last active August 31, 2015 16:06
Getting your project online

How to put your project online

This is a short description of the infrastructure you need to set up to make your project reachable on the web.

GitHub hosting

AWS hosting

Heroku hosting