@saeedesmaili
Last active February 10, 2018 19:18
Basic analysis of MovieLens dataset
import pandas as pd
# The fourth field in users.dat is the occupation code.
users_columns = ['user_id', 'gender', 'age', 'occupation', 'zip']
df_users = pd.read_table('users.dat', sep='::', header=None, names=users_columns, engine='python')
ratings_columns = ['user_id', 'movie_id', 'rating', 'timestamp']
df_ratings = pd.read_table('ratings.dat', sep='::', header=None, names=ratings_columns, engine='python')
movies_columns = ['movie_id', 'title', 'genres']
df_movies = pd.read_table('movies.dat', sep='::', header=None, names=movies_columns, engine='python')
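# Why engine='python' is passed above: pandas treats a separator longer than one
# character (here '::') as a regular expression, which the default C parser
# does not support. A minimal sketch of how one record splits on that separator.
# The sample line and the names sample_line, sample_fields and sample_record are
# illustrative assumptions, not part of the original script.
sample_line = '1::F::1::10::48067'
sample_fields = sample_line.split('::')
# Pair each field with a column name; values stay strings until converted.
sample_record = dict(zip(['user_id', 'gender', 'age', 'occupation', 'zip'], sample_fields))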
df_users.head()
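# Join ratings to users on user_id, then to movies on movie_id (keys made explicit).
df_merged = pd.merge(pd.merge(df_ratings, df_users, on='user_id'), df_movies, on='movie_id')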
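# A hedged sketch of the same two-step join with explicit keys and a sanity check.
# merge_with_checks is a hypothetical helper, not part of the original script.
# validate='many_to_one' makes pandas raise pandas.errors.MergeError if a key is
# duplicated in the right-hand table (users or movies), instead of silently
# multiplying rating rows.
import pandas as pd

def merge_with_checks(ratings, users, movies):
    """Join ratings to users on user_id, then to movies on movie_id."""
    merged = pd.merge(ratings, users, on='user_id', how='inner', validate='many_to_one')
    return pd.merge(merged, movies, on='movie_id', how='inner', validate='many_to_one')

# Possible use with the frames loaded above (assumes they exist):
# df_merged = merge_with_checks(df_ratings, df_users, df_movies)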
df_users.age.min()
# 1
df_users[df_users.age == 1].user_id.count()
# 222
df_users[df_users.age == 1].user_id.count() / df_users.user_id.count()
# 0.036754966887417216
df_users.age.unique()
# array([ 1, 56, 25, 45, 50, 35, 18])
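# Note on the values above: assuming this is the MovieLens 1M release, `age` holds
# a bracket code rather than an age in years, so age == 1 means "Under 18".
# A minimal lookup sketch with labels as given in the dataset README;
# AGE_GROUP_LABELS and age_group_label are hypothetical names introduced here.
AGE_GROUP_LABELS = {
    1: 'Under 18',
    18: '18-24',
    25: '25-34',
    35: '35-44',
    45: '45-49',
    50: '50-55',
    56: '56+',
}

def age_group_label(age_code):
    """Return the README label for an age code, or 'Unknown' for unexpected codes."""
    return AGE_GROUP_LABELS.get(age_code, 'Unknown')

# Possible use with the frame loaded above (assumes df_users exists):
# df_users['age_group'] = df_users['age'].map(AGE_GROUP_LABELS)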
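# Mean rating per title, with one column per gender ('F' and 'M').
df_mean_ratings = df_merged.pivot_table('rating', index='title', columns='gender', aggfunc='mean')
# Keep only titles with at least 200 ratings so a few votes cannot dominate a mean.
ratings_by_title = df_merged.groupby('title').size()
active_titles = ratings_by_title.index[ratings_by_title >= 200]
df_mean_ratings = df_mean_ratings.loc[active_titles]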
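# Titles ranked by mean female rating and by mean male rating.
top_female_ratings = df_mean_ratings.sort_values(by='F', ascending=False)
top_male_ratings = df_mean_ratings.sort_values(by='M', ascending=False)
# Positive diff: men rated the title higher on average; negative diff: women did.
df_mean_ratings['diff'] = df_mean_ratings['M'] - df_mean_ratings['F']
sorted_by_diff = df_mean_ratings.sort_values(by='diff')
# Reverse the ascending sort to show the titles men preferred most.
sorted_by_diff[::-1].head()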
# Spread of ratings per title (sample standard deviation), active titles only.
rating_std_by_title = df_merged.groupby('title')['rating'].std()
rating_std_by_title = rating_std_by_title.loc[active_titles]
rating_std_by_title.sort_values(ascending=False).head(10)
# title
# Plan 9 from Outer Space (1958) 1.455998
# Texas Chainsaw Massacre, The (1974) 1.332448
# Dumb & Dumber (1994) 1.321333
# Blair Witch Project, The (1999) 1.316368
# Natural Born Killers (1994) 1.307198
# Idle Hands (1999) 1.298439
# Transformers: The Movie, The (1986) 1.292917
# Very Bad Things (1998) 1.280074
# Tank Girl (1995) 1.277695
# Hellraiser: Bloodline (1996) 1.271939
# Name: rating, dtype: float64
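# How to read the ranking above: a high standard deviation means viewers disagree
# about a title. A minimal pure-Python sketch with made-up ratings (not MovieLens
# data); the variable names are hypothetical. pandas' Series.std() defaults to the
# sample standard deviation (ddof=1), which is also what statistics.stdev computes.
from statistics import stdev

polarized = [1, 5, 1, 5, 1, 5]  # hypothetical: viewers split between 1 and 5 stars
consensus = [3, 3, 4, 3, 3, 4]  # hypothetical: viewers mostly agree
polarized_std = stdev(polarized)
consensus_std = stdev(consensus)
# Both lists have a similar mean, but the polarized list has the larger spread.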