
@voidfiles
Last active January 2, 2016 16:29
Replicates the results from http://www.theatlantic.com/technology/archive/2014/01/how-netflix-reverse-engineered-hollywood/282679/ with a big decrease in the time taken to do the crawl.

Hello

This is a project that attempts to replicate the results from The Atlantic article How Netflix Reverse Engineered Hollywood by Alexis C. Madrigal. Instead of using "sketchy" software, it uses an open-source stack.

It turns a 20-hour job into a 1-hour job.

QuickStart

$ git clone https://gist.github.com/8330873.git
$ cd 8330873
$ virtualenv --no-site-packages venv
$ source venv/bin/activate
$ pip install -r requirements.txt
$ python scrape.py netflix@example.com password

Notes

The threads are dicey. If you lower the number of threads, things break less often, but the crawl takes slightly longer. Also, I am sure there is a better concurrency model for what I am doing; the point here is ease of use.
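If thready's pool proves too flaky, the standard library's thread pool is one alternative model. This is an illustrative sketch only, not the gist's code: the `fetch` stub stands in for the real `scrape_title(r, db, url)` call, and the URL count is truncated for demonstration.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for the real scrape_title(r, db, url) worker.
    return url

# Same URL pattern the scraper walks, truncated for illustration.
urls = ['http://movies.netflix.com/WiAltGenre?agid=%d' % i for i in range(5)]

# pool.map preserves input order and re-raises worker exceptions
# in the caller, instead of losing them inside a thread.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch, urls))
```

Unlike a fire-and-forget thread pool, a failed request here surfaces as an exception you can catch and retry, which is one way the "dicey" breakage could be contained.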

Mako==0.9.1
MarkupSafe==0.18
PyYAML==3.10
SQLAlchemy==0.9.1
Unidecode==0.04.14
alembic==0.6.2
beautifulsoup4==4.3.2
dataset==0.4.0
ipython==1.1.0
python-slugify==0.0.6
requests==2.1.0
thready==0.1.2
wsgiref==0.1.2
from collections import OrderedDict
from functools import partial
import logging
import sys

from bs4 import BeautifulSoup
import dataset
import requests
from thready import threaded
import sqlalchemy

logger = logging.getLogger()


def netflix_url_generator():
    for i in xrange(0, 90000):
        yield 'http://movies.netflix.com/WiAltGenre?agid=%d' % i


def scrape_title(r, db, url):
    resp = r.get(url)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.content)
    try:
        micro_genre = soup.select('h1 .crumb a')[0].text
    except IndexError:
        # This can happen when they force you through the profile screen
        return

    data = {
        'source_url': url,
        'micro_genre': micro_genre,
    }
    db['micro_genres'].upsert(data, ['source_url'])
    logger.info('Found: %s', micro_genre)


def get_database():
    return dataset.connect('sqlite:///missed_connections.db')


def setup_env(email, password):
    r = requests.Session()
    r.headers.update({
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36',
    })
    resp = r.get('https://signup.netflix.com/Login')
    soup = BeautifulSoup(resp.content)
    auth_data = OrderedDict([
        ('authURL', soup.select('input[name=authURL]')[0]['value']),
        ('email', email),
        ('password', password),
    ])
    resp = r.post('https://signup.netflix.com/Login', data=auth_data, headers={
        'Accept-Language': 'en-US,en;q=0.8',
        'Origin': 'https://signup.netflix.com',
        'Pragma': 'no-cache',
        'Referer': 'https://signup.netflix.com/Login?nextpage=http%3A%2F%2Fmovies.netflix.com%2FWiHome%3Flocale%3Den-US%26ref%3Dec'
    })

    db = get_database()
    # Make sure the table exists before the threads start writing to it
    table_name = 'micro_genres'
    try:
        table = db.load_table(table_name)
    except sqlalchemy.exc.NoSuchTableError:
        table = db.get_table(table_name)
        table.create_column('source_url', sqlalchemy.String)
        table.create_column('micro_genre', sqlalchemy.String)
        db.commit()

    return r, db


def main(email, password):
    # Without setting the logger's own level, the root logger's default
    # (WARNING) would swallow the INFO messages below.
    logger.setLevel(logging.INFO)
    ch = logging.StreamHandler(sys.stdout)
    ch.setLevel(logging.INFO)
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    ch.setFormatter(formatter)
    logger.addHandler(ch)

    r, db = setup_env(email, password)
    handler = partial(scrape_title, r, db)
    logger.info('About to start scraping micro genres')
    threaded(netflix_url_generator(), handler, num_threads=200)


def print_data():
    db = get_database()
    for x in db['micro_genres']:
        print x
    print len(db['micro_genres'])


if __name__ == '__main__':
    username = sys.argv[1]
    password = sys.argv[2]
    main(username, password)
    print_data()
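The `db['micro_genres'].upsert(data, ['source_url'])` call is what makes the crawl restartable: re-scraping a URL updates the existing row keyed on `source_url` instead of inserting a duplicate. A rough stdlib sketch of that keying behavior, using sqlite3 directly (the table name and columns come from the script above, but the SQL is an assumption for illustration, not what the dataset library actually generates):

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE micro_genres (source_url TEXT PRIMARY KEY, micro_genre TEXT)')

def upsert(source_url, micro_genre):
    # Insert a new row, or update the existing row with the same source_url.
    con.execute(
        'INSERT INTO micro_genres VALUES (?, ?) '
        'ON CONFLICT(source_url) DO UPDATE SET micro_genre = excluded.micro_genre',
        (source_url, micro_genre))

upsert('http://movies.netflix.com/WiAltGenre?agid=1', 'Cerebral Movies')
upsert('http://movies.netflix.com/WiAltGenre?agid=1', 'Cerebral Dramas')  # same key: update, not duplicate

rows = list(con.execute('SELECT source_url, micro_genre FROM micro_genres'))
```

After both calls the table still holds a single row, carrying the most recent genre string, which is why interrupting and re-running the crawl is safe.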