@leifwickland
Last active September 12, 2016 20:36

Contrarian Tastes

Using Hadoop MapReduce, Apache Spark, or another distributed computing technology, analyze the Netflix Prize Dataset. (Click the "Download" link in the upper right corner, not the "uci.edu" URL near the bottom of the page.) The README in that file describes the format of the data.

We're looking for movies that are well-loved by users who dislike movies most users like.

Of the movies which have been rated by at least R users, find the M movies which have been rated the highest across all users. (If there's a tie for the Mth spot, prefer most recent publication, then alphabetical order of title.) These are the "top movies."
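As a rough, single-machine illustration of this step (not a distributed solution), the sketch below assumes that "rated the highest" means highest mean rating, that each user rates a movie at most once, and that every movie has an integer release year. The function name `top_movies` and the tuple layouts are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def top_movies(ratings, movies, m=5, r=50):
    """Return the ids of the m best-rated movies that have at least r ratings.

    ratings: iterable of (user_id, movie_id, rating, date) tuples (assumed layout).
    movies:  dict of movie_id -> (year, title) (assumed layout).
    """
    totals = defaultdict(lambda: [0, 0])  # movie_id -> [rating sum, rating count]
    for _user_id, movie_id, rating, _date in ratings:
        entry = totals[movie_id]
        entry[0] += rating
        entry[1] += 1

    # Keep only movies with enough ratings, paired with their mean rating.
    eligible = [
        (movie_id, total / count)
        for movie_id, (total, count) in totals.items()
        if count >= r
    ]

    # Highest mean first; ties broken by most recent year, then by title A-Z.
    eligible.sort(key=lambda item: (-item[1], -movies[item[0]][0], movies[item[0]][1]))
    return [movie_id for movie_id, _mean in eligible[:m]]
```

In a MapReduce or Spark job the same shape would likely appear as a per-movie aggregation of (sum, count) followed by a filter and a bounded top-M selection, so that only M candidates ever need to be collected in one place.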

Of users who have rated all top M movies, find the U users which have given the lowest average rating of the M movies. (If there's a tie for the Uth spot, prefer users with the lower ID.) These are the "contrarian users."
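One possible in-memory reading of this step is sketched below. It assumes each user rates a given movie at most once and that user IDs are comparable (for example, integers); `contrarian_users` and its argument layout are hypothetical names used only for illustration.

```python
from collections import defaultdict


def contrarian_users(ratings, top_movie_ids, u=25):
    """Return up to u user ids with the lowest mean rating over the top movies.

    Only users who rated every movie in top_movie_ids are considered.
    ratings: iterable of (user_id, movie_id, rating, date) tuples (assumed layout).
    """
    top = set(top_movie_ids)
    per_user = defaultdict(dict)  # user_id -> {movie_id: rating}, top movies only
    for user_id, movie_id, rating, _date in ratings:
        if movie_id in top:
            per_user[user_id][movie_id] = rating

    # A user qualifies only after rating all of the top movies.
    candidates = [
        (sum(rated.values()) / len(top), user_id)
        for user_id, rated in per_user.items()
        if len(rated) == len(top)
    ]

    # Tuples sort by mean rating first, then by user id, which matches the tie rule.
    candidates.sort()
    return [user_id for _mean, user_id in candidates[:u]]
```

Because the top-movie list holds only M ids, a distributed version could plausibly broadcast it to every worker and filter the ratings locally before grouping by user.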

For the U contrarian users, find each user's highest rated movie. (If there's a tie for a user's top spot, prefer most recent publication, then alphabetical order of title.)
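The tie-breaking rule here can be expressed as a single sort key. The sketch below assumes the favorite is chosen among all movies a user has rated (not only the top M), and that release years are integers; `favorite_movies` and the tuple layouts are hypothetical.

```python
def favorite_movies(ratings, movies, users):
    """Return {user_id: (movie_id, rating_date)} for each user's highest rated movie.

    ratings: iterable of (user_id, movie_id, rating, date) tuples (assumed layout).
    movies:  dict of movie_id -> (year, title) (assumed layout).
    users:   iterable of user ids to report on.
    """
    wanted = set(users)
    best = {}  # user_id -> (sort key, movie_id, date)
    for user_id, movie_id, rating, date in ratings:
        if user_id not in wanted:
            continue
        year, title = movies[movie_id]
        # Smaller key wins: higher rating, then more recent year, then title A-Z.
        key = (-rating, -year, title)
        if user_id not in best or key < best[user_id][0]:
            best[user_id] = (key, movie_id, date)
    return {user_id: (movie_id, date) for user_id, (_key, movie_id, date) in best.items()}
```

Keeping only the current best entry per user means the pass needs memory proportional to U rather than to the number of ratings.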

Prepare a CSV report with the following columns:

  • User ID of contrarian user
  • Title of highest rated movie by contrarian user
  • Year of release of that movie
  • Date of rating of that movie
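A minimal sketch of writing the four columns above with Python's standard `csv` module follows. The header names and the `write_report` helper are assumptions made for this example; the assignment does not prescribe them. Using a CSV writer instead of joining strings by hand matters because movie titles may contain commas.

```python
import csv
import io


def write_report(rows, out):
    """Write report rows as CSV to the file-like object `out`.

    rows: iterable of (user_id, title, year, rating_date) tuples (assumed layout).
    """
    writer = csv.writer(out, lineterminator="\n")
    # Header names are illustrative, not mandated by the assignment.
    writer.writerow(["user_id", "title", "year_of_release", "date_of_rating"])
    for user_id, title, year, rating_date in rows:
        # csv.writer quotes any field containing a comma or quote character.
        writer.writerow([user_id, title, year, rating_date])


buffer = io.StringIO()
write_report([(7, "Example, The", 2001, "2005-01-02")], buffer)
report_text = buffer.getvalue()
```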

Note: You will be graded on:

  • Correctness
  • Completeness of solution
  • Unit tests
  • Documentation
  • Clarity of code

Note: M, U, and R should be configurable. The recommended default values for the parameters are M = 5, U = 25, and R = 50.
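One simple way to make the three parameters configurable is through command-line options, sketched here with `argparse`. The option names `--m`, `--u`, and `--r` are hypothetical; only the default values come from the note above.

```python
import argparse


def parse_args(argv=None):
    """Parse M, U, and R from the command line, falling back to the recommended defaults."""
    parser = argparse.ArgumentParser(description="Contrarian Tastes report parameters")
    parser.add_argument("--m", type=int, default=5, help="number of top movies (M)")
    parser.add_argument("--u", type=int, default=25, help="number of contrarian users (U)")
    parser.add_argument("--r", type=int, default=50, help="minimum number of ratings per movie (R)")
    return parser.parse_args(argv)


defaults = parse_args([])
custom = parse_args(["--m", "10", "--r", "100"])
```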

Note: The dataset that you see is a subset of the production data. In production, the number of movies is in the hundreds of thousands and the number of users is in the tens of millions. Design accordingly.

Note: If you are unable to complete the full requirements in time, please document what work remains.
