@gabrielgrant
Forked from anonymous/munge.py
Last active August 29, 2015 14:19
IMDB Ratings Data Munging
""" Loads IMDB's Ratings data into Pandas
Assumes you've already downloaded the raw data by running:
wget -O - ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/ratings.list.gz | gunzip > ratings.list
See: http://www.imdb.com/interfaces
"""
import re
import pandas as pd
# First, get a clean version of just the ratings data
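# encoding='latin-1' is an assumption about the dump; adjust it if your copy differs
with open('ratings.list', encoding='latin-1') as f:
    ratings = f.read()
# Keep only the table between the report header and the dashed rule that follows it
_, ratings = ratings.split('MOVIE RATINGS REPORT\n\n', 1)
ratings, _ = ratings.split('\n\n------------------------------------------------------------------------------', 1)
with open('ratings.clean.list', 'w', encoding='latin-1') as f:
    f.write(ratings)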
# Now parse the cleaned table: the first line holds the column titles, the rest is data
titles, rating_data = ratings.split('\n', 1)
titles = titles.split()
rating_data_lines = rating_data.splitlines()
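# Split each line on whitespace with re.split() rather than str.split():
# str.split() drops leading whitespace, whereas re.split() yields an empty first
# field for it, which keeps the fields aligned with the column titles.
# maxsplit stops splitting at the last column so that titles containing spaces stay whole.
rating_data_split = [re.split(r"\s+", line, maxsplit=len(titles)-1) for line in rating_data_lines]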
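ratings = pd.DataFrame(rating_data_split, columns=titles)
# convert_objects() is no longer available in current pandas releases;
# instead, convert each column that parses cleanly as numeric and leave the rest as strings
def _to_numeric_if_possible(column):
    try:
        return pd.to_numeric(column)
    except (ValueError, TypeError):
        return column
ratings = ratings.apply(_to_numeric_if_possible)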
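The choice of re.split() over str.split() in the script above is easiest to see on a single row. The following is a minimal sketch, not part of the original gist: the sample line and the column names are made up for illustration and only assumed to resemble the real file's layout.

```python
import re

# Hypothetical row: leading spaces stand in for an empty first column
sample = "      0000000125  1500000   9.2  The Shawshank Redemption (1994)"
# Assumed column titles, for illustration only
columns = ["New", "Distribution", "Votes", "Rank", "Title"]

# str.split() discards the leading whitespace, so every field shifts one
# column to the left and the film title gets cut after its first word
naive = sample.split(None, len(columns) - 1)

# re.split() emits an empty string for the leading whitespace, so the
# remaining fields line up with the column titles
aligned = re.split(r"\s+", sample, maxsplit=len(columns) - 1)

print(naive)
print(aligned)
```

With str.split() the first word of the title lands in the second-to-last column, while re.split() keeps the whole title in the last field.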