Part 1: Predicting Movie Ratings
One of the most common uses of data is to predict what users want. This allows Google to show you relevant ads, Amazon to recommend relevant products, and Netflix to recommend movies that you might like. In this assignment, you'll explore how to recommend movies to a user. We'll start with some basic methods, and then use machine learning to make more sophisticated predictions.
We'll use Spark for this assignment. In part 1 of the assignment, you'll run Spark on your local machine, using a smaller dataset. The purpose of this part of the assignment is to get everything working before adding the complexities of running on many machines. The interface for running local Spark jobs is exactly the same as the interface for running jobs on a cluster, so you'll be using the same functions we used in lab, and all of the code you write locally can be executed on a cluster. In part 2, which will be released after the midterm, you'll run Spark on a cluster that we have running for you (like in the lab). You'll use the cluster to run your code on a larger dataset, and to predict which movies to recommend to yourself!
As mentioned during the lab, think carefully before calling collect() on any datasets. When you're using a small, local dataset, calling collect() and then using Python to analyze the data locally will work fine, but this will not work when you're using a large dataset that doesn't fit in memory on one machine. Solutions that call collect() and do local analysis that could have been done with Spark will not receive full credit.
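For instance, counting a dataset can be done entirely in Spark. A minimal sketch (ratings here stands for an RDD like the one you'll load in Exercise 0):

    # `ratings` stands for an RDD like the one loaded in Exercise 0.
    num_ratings = len(ratings.collect())  # bad: ships every record to the driver
    num_ratings = ratings.count()         # good: Spark counts in parallel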
We have created a FAQ at the bottom of this page (which is an expanded version of the FAQ from the lab) to help with common problems you may run into. If you run into a problem, please check the FAQ before posting on Piazza!
Exercise 0: Setup
a) As mentioned above, for this part of the assignment, you'll run Spark locally rather than on a cluster, for easier debugging. Begin by downloading Spark from this link. Unzip and untar the file so you have a spark-0.9.1-bin-cdh4 folder; this folder contains all of the code needed to run Spark. We need to do a little bit of setup to tell IPython how to find Spark (we set this up for you on the cluster machines, but you need to do it yourself when running in your own VM). We also need to create a SparkContext (which we also did for you in the lab, where the SparkContext was saved as sc). The SparkContext is like a master for just your application: it requests resources from the cluster master, and it breaks the jobs you submit down into stages of tasks. For example, when you call map() on a resilient distributed dataset (RDD; Spark's name for datasets stored in memory), the SparkContext decides how many map tasks to run and launches them on the executors allocated by the cluster master.
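As a quick illustration of this (a minimal sketch, assuming a SparkContext named sc has already been created, as in the setup code at the end of this exercise):

    numbers = sc.parallelize(range(10))     # distribute a small list as an RDD
    doubled = numbers.map(lambda x: x * 2)  # map() is lazy; no tasks run yet
    print doubled.take(5)                   # launches tasks; prints [0, 2, 4, 6, 8]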
Even though you're running Spark locally, Spark still starts the application web UI where you can see your application and what tasks it's running. In a browser in the VM, go to http://localhost:4040 to see the UI for your application. There's no Master UI running here (the UI we saw at port 8080 during the lab) because Spark doesn't use a master when you run in local mode.
b) Next, download the data files that you'll need for the assignment from https://github.com/amplab/datascience-sp14/raw/master/hw3/part1files.tar.gz. You'll do all of your analysis on the ratings.dat and movies.dat datasets located in the part1files folder that you just downloaded. These are smaller versions of the datasets we used in lab 8. As in the lab, each entry in the ratings dataset is formatted as UserID::MovieID::Rating::Timestamp and each entry in the movies dataset is formatted as MovieID::Title::Genres. Read these two datasets into memory. You can count the number of entries in each dataset to ensure that you've loaded them correctly; the ratings dataset should have 100K entries and the movies dataset should have 1682 entries.
Note that when you create a new dataset using sc.textFile, you can give an absolute path to the dataset on your filesystem, e.g. '/Users/kay/part1files/ratings.dat'.
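Loading and checking the two datasets might look like the following sketch. It assumes the SparkContext sc created in the setup code below, and that part1files was extracted under /Users/kay; adjust the paths for your machine:

    # Assumes a SparkContext `sc` (see the setup code below) and that
    # part1files was extracted under /Users/kay; adjust the paths as needed.
    ratings = sc.textFile('/Users/kay/part1files/ratings.dat')
    movies = sc.textFile('/Users/kay/part1files/movies.dat')

    print ratings.count()  # should print 100000
    print movies.count()   # should print 1682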
Fill in the path to the Spark folder you just downloaded in the code below, and then execute it to create the SparkContext you'll use to run jobs.
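A minimal sketch of that setup code: the spark_home value is a placeholder you must fill in, the application name is arbitrary, and the py4j zip name below is what ships with Spark 0.9.x (adjust it if your distribution differs):

    import os
    import sys

    # Fill in the location of the spark-0.9.1-bin-cdh4 folder you unpacked.
    spark_home = '/path/to/spark-0.9.1-bin-cdh4'
    os.environ['SPARK_HOME'] = spark_home

    # Make the pyspark module (and the py4j library it depends on) importable.
    sys.path.insert(0, os.path.join(spark_home, 'python'))
    sys.path.insert(0, os.path.join(spark_home, 'python', 'lib', 'py4j-0.8.1-src.zip'))

    # 'local' runs Spark on this machine; the second argument names your application.
    from pyspark import SparkContext
    sc = SparkContext('local', 'Movie Ratings Homework')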