twolodzko · July 1, 2023 23:33 · patilvijay23 · Jun 16, 2020 · twolodzko · Jun 16, 2020
diff --git a/ALS Matrix Factorization in Spark.ipynb b/ALS Matrix Factorization in Spark.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using ALS Matrix Factorization for Making Recommendations in Spark (ver 0.3)\n",
    "Tymoteusz Wolodzko"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "from pyspark.sql.session import SparkSession\n",
    "from pyspark.sql.functions import *\n",
    "from pyspark.sql.types import *\n",
    "from pyspark.sql.window import Window\n",
    "from pyspark import StorageLevel"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "**< here we connect to Spark >**\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'2.2.1'"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "spark.version"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data\n",
    "\n",
    "This tutorial uses [The Movies Dataset](https://www.kaggle.com/rounakbanik/the-movies-dataset), hosted on Kaggle, which contains metadata on over 45,000 movies and 26 million ratings from over 270,000 users.\n",
    "\n",
    "The data consists of multiple files, but we're going to use only two of them. The first file, `ratings.csv`, consists of ratings of movies by users of the MovieLens movie recommendation site (we are actually using `ratings_small.csv`, which contains only a subset of the original data, to make the computations run faster for this demo). The second file, `movies_metadata.csv`, contains metadata about the rated movies; we are going to use it to extract the genre(s) of each movie.\n",
    "\n",
    "The Movies Dataset is a typical example of data that can be used by a recommender system, where we have a *users* $\\times$ *items* matrix of *ratings* $R_{n \\times k}$. The matrix is stored in a sparse form (the \"long\" format, as it is called by [Wickham in *Tidy Data*](http://vita.had.co.nz/papers/tidy-data.pdf)) of `(user, item, rating)` tuples.\n",
    "\n",
    "In this tutorial, we are going to use a different kind of information, since the recommender algorithm for *implicit ratings* will be presented. We are going to recommend movie genres, given data on the number of movies each user rated per genre. To prepare such data, we will count the rated movies grouped by user and genre, i.e. `(user, genre, count)`. The count is going to serve as an implicit preference indicator. This transformation discards information and is most likely suboptimal given that we have the explicit ratings; it was chosen to give an example of a recommender system used with implicit ratings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>genres</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>862</td>\n",
       "      <td>[Animation, Comedy, Family]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>8844</td>\n",
       "      <td>[Adventure, Fantasy, Family]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>15602</td>\n",
       "      <td>[Romance, Comedy]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>31357</td>\n",
       "      <td>[Comedy, Drama, Romance]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>11862</td>\n",
       "      <td>[Comedy]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      id                        genres\n",
       "0    862   [Animation, Comedy, Family]\n",
       "1   8844  [Adventure, Fantasy, Family]\n",
       "2  15602             [Romance, Comedy]\n",
       "3  31357      [Comedy, Drama, Romance]\n",
       "4  11862                      [Comedy]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# The code below uses some ideas from:\n",
    "# https://www.kaggle.com/rounakbanik/movie-recommender-systems/notebook\n",
    "\n",
    "from ast import literal_eval\n",
    "\n",
    "movies = pd.read_csv('data/movies_metadata.csv', low_memory = False)\n",
    "movies = movies.loc[:, ['id', 'genres']]\n",
    "movies['genres'] = (\n",
    "    movies['genres']\n",
    "    .fillna('[]')\n",
    "    .apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])\n",
    ")\n",
    "movies.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>genres</th>\n",
       "      <th>Romance</th>\n",
       "      <th>Family</th>\n",
       "      <th>Drama</th>\n",
       "      <th>Vision View Entertainment</th>\n",
       "      <th>Adventure</th>\n",
       "      <th>Western</th>\n",
       "      <th>Animation</th>\n",
       "      <th>GoHands</th>\n",
       "      <th>...</th>\n",
       "      <th>Pulser Productions</th>\n",
       "      <th>Aniplex</th>\n",
       "      <th>Crime</th>\n",
       "      <th>Documentary</th>\n",
       "      <th>Horror</th>\n",
       "      <th>Telescene Film Group Productions</th>\n",
       "      <th>The Cartel</th>\n",
       "      <th>Music</th>\n",
       "      <th>Fantasy</th>\n",
       "      <th>Comedy</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>862</td>\n",
       "      <td>[Animation, Comedy, Family]</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>8844</td>\n",
       "      <td>[Adventure, Fantasy, Family]</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>15602</td>\n",
       "      <td>[Romance, Comedy]</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>31357</td>\n",
       "      <td>[Comedy, Drama, Romance]</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>11862</td>\n",
       "      <td>[Comedy]</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 34 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      id                        genres  Romance  Family  Drama  \\\n",
       "0    862   [Animation, Comedy, Family]        0       1      0   \n",
       "1   8844  [Adventure, Fantasy, Family]        0       1      0   \n",
       "2  15602             [Romance, Comedy]        1       0      0   \n",
       "3  31357      [Comedy, Drama, Romance]        1       0      1   \n",
       "4  11862                      [Comedy]        0       0      0   \n",
       "\n",
       "   Vision View Entertainment  Adventure  Western  Animation  GoHands   ...    \\\n",
       "0                          0          0        0          1        0   ...     \n",
       "1                          0          1        0          0        0   ...     \n",
       "2                          0          0        0          0        0   ...     \n",
       "3                          0          0        0          0        0   ...     \n",
       "4                          0          0        0          0        0   ...     \n",
       "\n",
       "   Pulser Productions  Aniplex  Crime  Documentary  Horror  \\\n",
       "0                   0        0      0            0       0   \n",
       "1                   0        0      0            0       0   \n",
       "2                   0        0      0            0       0   \n",
       "3                   0        0      0            0       0   \n",
       "4                   0        0      0            0       0   \n",
       "\n",
       "   Telescene Film Group Productions  The Cartel  Music  Fantasy  Comedy  \n",
       "0                                 0           0      0        0       1  \n",
       "1                                 0           0      0        1       0  \n",
       "2                                 0           0      0        0       1  \n",
       "3                                 0           0      0        0       1  \n",
       "4                                 0           0      0        0       1  \n",
       "\n",
       "[5 rows x 34 columns]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
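   ],
   "source": [
    "# collect every distinct genre name that appears in the data\n",
    "genres = set([item for row in movies['genres'].values for item in row])\n",
    "\n",
    "# one indicator column per genre, initialised to 0\n",
    "for g in genres:\n",
    "    movies[g] = 0\n",
    "\n",
    "# mark the genres each movie belongs to\n",
    "# (.at replaces DataFrame.set_value, which was deprecated and later removed from pandas)\n",
    "for row in range(movies.shape[0]):\n",
    "    for g in movies.loc[row, 'genres']:\n",
    "        movies.at[row, g] = 1\n",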
    "    \n",
    "movies.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(359, 2)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
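   ],
   "source": [
    "movies['genres_bitmap'] = movies.iloc[:, 2:].apply(lambda row : ''.join([str(x) for x in row]), axis = 1)\n",
    "counts = movies['genres_bitmap'].value_counts()\n",
    "\n",
    "# use .values so the frame gets a plain default index; newer pandas versions name the\n",
    "# value_counts() index 'genres_bitmap', which would make a later merge on that column ambiguous\n",
    "counts = pd.DataFrame({\n",
    "    'genres_bitmap' : counts.index,\n",
    "    'genres_counts' : counts.values\n",
    "})\n",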
    "\n",
    "# leave only the popular combinations of genres\n",
    "counts = counts.loc[counts['genres_counts'] >= 10, :]\n",
    "counts.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>genres</th>\n",
       "      <th>Romance</th>\n",
       "      <th>Family</th>\n",
       "      <th>Drama</th>\n",
       "      <th>Vision View Entertainment</th>\n",
       "      <th>Adventure</th>\n",
       "      <th>Western</th>\n",
       "      <th>Animation</th>\n",
       "      <th>GoHands</th>\n",
       "      <th>...</th>\n",
       "      <th>Horror</th>\n",
       "      <th>Telescene Film Group Productions</th>\n",
       "      <th>The Cartel</th>\n",
       "      <th>Music</th>\n",
       "      <th>Fantasy</th>\n",
       "      <th>Comedy</th>\n",
       "      <th>genres_bitmap</th>\n",
       "      <th>genres_counts</th>\n",
       "      <th>genreId</th>\n",
       "      <th>movieId</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>862</td>\n",
       "      <td>[Animation, Comedy, Family]</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>01000010000000000000000000000001</td>\n",
       "      <td>112</td>\n",
       "      <td>272</td>\n",
       "      <td>862</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>12233</td>\n",
       "      <td>[Animation, Comedy, Family]</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>01000010000000000000000000000001</td>\n",
       "      <td>112</td>\n",
       "      <td>272</td>\n",
       "      <td>12233</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>532</td>\n",
       "      <td>[Family, Animation, Comedy]</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>01000010000000000000000000000001</td>\n",
       "      <td>112</td>\n",
       "      <td>272</td>\n",
       "      <td>532</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>531</td>\n",
       "      <td>[Animation, Comedy, Family]</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>01000010000000000000000000000001</td>\n",
       "      <td>112</td>\n",
       "      <td>272</td>\n",
       "      <td>531</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>40688</td>\n",
       "      <td>[Animation, Comedy, Family]</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>01000010000000000000000000000001</td>\n",
       "      <td>112</td>\n",
       "      <td>272</td>\n",
       "      <td>40688</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 38 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      id                       genres  Romance  Family  Drama  \\\n",
       "0    862  [Animation, Comedy, Family]        0       1      0   \n",
       "1  12233  [Animation, Comedy, Family]        0       1      0   \n",
       "2    532  [Family, Animation, Comedy]        0       1      0   \n",
       "3    531  [Animation, Comedy, Family]        0       1      0   \n",
       "4  40688  [Animation, Comedy, Family]        0       1      0   \n",
       "\n",
       "   Vision View Entertainment  Adventure  Western  Animation  GoHands   ...     \\\n",
       "0                          0          0        0          1        0   ...      \n",
       "1                          0          0        0          1        0   ...      \n",
       "2                          0          0        0          1        0   ...      \n",
       "3                          0          0        0          1        0   ...      \n",
       "4                          0          0        0          1        0   ...      \n",
       "\n",
       "   Horror  Telescene Film Group Productions  The Cartel  Music  Fantasy  \\\n",
       "0       0                                 0           0      0        0   \n",
       "1       0                                 0           0      0        0   \n",
       "2       0                                 0           0      0        0   \n",
       "3       0                                 0           0      0        0   \n",
       "4       0                                 0           0      0        0   \n",
       "\n",
       "   Comedy                     genres_bitmap  genres_counts  genreId  movieId  \n",
       "0       1  01000010000000000000000000000001            112      272      862  \n",
       "1       1  01000010000000000000000000000001            112      272    12233  \n",
       "2       1  01000010000000000000000000000001            112      272      532  \n",
       "3       1  01000010000000000000000000000001            112      272      531  \n",
       "4       1  01000010000000000000000000000001            112      272    40688  \n",
       "\n",
       "[5 rows x 38 columns]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "movies = movies.merge(counts, on = 'genres_bitmap', how = 'inner')\n",
    "movies.rename(columns = {'movie id' : 'movie_id'}, inplace = True)\n",
    "movies['genreId'] = movies['genres_bitmap'].astype(\"category\").cat.codes\n",
    "movies['movieId'] = movies['id'].astype('int64')\n",
    "\n",
    "movies.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(100004, 4)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ratings = pd.read_csv('data/ratings_small.csv')\n",
    "\n",
    "ratings.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>userId</th>\n",
       "      <th>movieId</th>\n",
       "      <th>rating</th>\n",
       "      <th>timestamp</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>31</td>\n",
       "      <td>2.5</td>\n",
       "      <td>1260759144</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>1029</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1260759179</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>1061</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1260759182</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>1129</td>\n",
       "      <td>2.0</td>\n",
       "      <td>1260759185</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>1172</td>\n",
       "      <td>4.0</td>\n",
       "      <td>1260759205</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   userId  movieId  rating   timestamp\n",
       "0       1       31     2.5  1260759144\n",
       "1       1     1029     3.0  1260759179\n",
       "2       1     1061     3.0  1260759182\n",
       "3       1     1129     2.0  1260759185\n",
       "4       1     1172     4.0  1260759205"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ratings.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>userId</th>\n",
       "      <th>movieId</th>\n",
       "      <th>rating</th>\n",
       "      <th>genreId</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>1371</td>\n",
       "      <td>2.5</td>\n",
       "      <td>155</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4</td>\n",
       "      <td>1371</td>\n",
       "      <td>4.0</td>\n",
       "      <td>155</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>7</td>\n",
       "      <td>1371</td>\n",
       "      <td>3.0</td>\n",
       "      <td>155</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>19</td>\n",
       "      <td>1371</td>\n",
       "      <td>4.0</td>\n",
       "      <td>155</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>21</td>\n",
       "      <td>1371</td>\n",
       "      <td>3.0</td>\n",
       "      <td>155</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   userId  movieId  rating  genreId\n",
       "0       1     1371     2.5      155\n",
       "1       4     1371     4.0      155\n",
       "2       7     1371     3.0      155\n",
       "3      19     1371     4.0      155\n",
       "4      21     1371     3.0      155"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ratings_genres = (\n",
    "    ratings.loc[:, ['userId', 'movieId', 'rating']]\n",
    "    .merge(movies.loc[:, ['movieId', 'genreId']], on = 'movieId', how = 'inner')\n",
    ")\n",
    "\n",
    "ratings_genres.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>userId</th>\n",
       "      <th>genreId</th>\n",
       "      <th>counts</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>155</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>167</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>180</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>300</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   userId  genreId  counts\n",
       "0       1        1       1\n",
       "1       1      155       1\n",
       "2       1      167       1\n",
       "3       1      180       1\n",
       "4       1      300       1"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ratings_counts = (\n",
    "    ratings_genres.groupby(['userId', 'genreId'], as_index = False)['movieId'].count()\n",
    "    .rename(columns = {'movieId' : 'counts'})\n",
    ")\n",
    "\n",
    "ratings_counts.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>userId</th>\n",
       "      <th>movieId</th>\n",
       "      <th>rating</th>\n",
       "      <th>genreId</th>\n",
       "      <th>counts</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>182</th>\n",
       "      <td>1</td>\n",
       "      <td>2294</td>\n",
       "      <td>2.0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>1371</td>\n",
       "      <td>2.5</td>\n",
       "      <td>155</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>235</th>\n",
       "      <td>1</td>\n",
       "      <td>2455</td>\n",
       "      <td>2.5</td>\n",
       "      <td>167</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>47</th>\n",
       "      <td>1</td>\n",
       "      <td>1405</td>\n",
       "      <td>1.0</td>\n",
       "      <td>180</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>93</th>\n",
       "      <td>1</td>\n",
       "      <td>2105</td>\n",
       "      <td>4.0</td>\n",
       "      <td>300</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>140</th>\n",
       "      <td>1</td>\n",
       "      <td>2193</td>\n",
       "      <td>2.0</td>\n",
       "      <td>324</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2619</th>\n",
       "      <td>2</td>\n",
       "      <td>349</td>\n",
       "      <td>4.0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3004</th>\n",
       "      <td>2</td>\n",
       "      <td>377</td>\n",
       "      <td>3.0</td>\n",
       "      <td>6</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>951</th>\n",
       "      <td>2</td>\n",
       "      <td>161</td>\n",
       "      <td>3.0</td>\n",
       "      <td>28</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3903</th>\n",
       "      <td>2</td>\n",
       "      <td>500</td>\n",
       "      <td>4.0</td>\n",
       "      <td>28</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1366</th>\n",
       "      <td>2</td>\n",
       "      <td>222</td>\n",
       "      <td>5.0</td>\n",
       "      <td>35</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5046</th>\n",
       "      <td>2</td>\n",
       "      <td>588</td>\n",
       "      <td>3.0</td>\n",
       "      <td>42</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>282</th>\n",
       "      <td>2</td>\n",
       "      <td>17</td>\n",
       "      <td>5.0</td>\n",
       "      <td>49</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2804</th>\n",
       "      <td>2</td>\n",
       "      <td>364</td>\n",
       "      <td>3.0</td>\n",
       "      <td>57</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2431</th>\n",
       "      <td>2</td>\n",
       "      <td>314</td>\n",
       "      <td>4.0</td>\n",
       "      <td>62</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2017</th>\n",
       "      <td>2</td>\n",
       "      <td>296</td>\n",
       "      <td>4.0</td>\n",
       "      <td>102</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4610</th>\n",
       "      <td>2</td>\n",
       "      <td>551</td>\n",
       "      <td>5.0</td>\n",
       "      <td>127</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1640</th>\n",
       "      <td>2</td>\n",
       "      <td>253</td>\n",
       "      <td>4.0</td>\n",
       "      <td>134</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1173</th>\n",
       "      <td>2</td>\n",
       "      <td>168</td>\n",
       "      <td>3.0</td>\n",
       "      <td>140</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>822</th>\n",
       "      <td>2</td>\n",
       "      <td>153</td>\n",
       "      <td>4.0</td>\n",
       "      <td>155</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      userId  movieId  rating  genreId  counts\n",
       "182        1     2294     2.0        1       1\n",
       "0          1     1371     2.5      155       1\n",
       "235        1     2455     2.5      167       1\n",
       "47         1     1405     1.0      180       1\n",
       "93         1     2105     4.0      300       1\n",
       "140        1     2193     2.0      324       1\n",
       "2619       2      349     4.0        1       1\n",
       "3004       2      377     3.0        6       1\n",
       "951        2      161     3.0       28       2\n",
       "3903       2      500     4.0       28       2\n",
       "1366       2      222     5.0       35       1\n",
       "5046       2      588     3.0       42       1\n",
       "282        2       17     5.0       49       1\n",
       "2804       2      364     3.0       57       1\n",
       "2431       2      314     4.0       62       1\n",
       "2017       2      296     4.0      102       1\n",
       "4610       2      551     5.0      127       1\n",
       "1640       2      253     4.0      134       1\n",
       "1173       2      168     3.0      140       1\n",
       "822        2      153     4.0      155       7"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# sanity check\n",
    "(\n",
    "    ratings_genres\n",
    "    .merge(ratings_counts, on = ['userId', 'genreId'], how = 'left')\n",
    "    .sort_values(by = ['userId', 'genreId'])\n",
    ").head(20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What is interesting, is that the number of movies someone has seen does not correlate with his ratings of the movies (so it's not that you watch westerns, because you think they are good, but you watch a lot of westerns nonetheless that some of them appeared to be pretty bad)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>rating</th>\n",
       "      <th>counts</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>rating</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>-0.066247</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>counts</th>\n",
       "      <td>-0.066247</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          rating    counts\n",
       "rating  1.000000 -0.066247\n",
       "counts -0.066247  1.000000"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(\n",
    "    ratings_genres\n",
    "    .merge(ratings_counts, on = ['userId', 'genreId'], how = 'left')\n",
    "    .loc[:, ['rating', 'counts']]\n",
    ").corr()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYcAAAD8CAYAAACcjGjIAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAFy1JREFUeJzt3X+w3XWd3/Hnq0QpiwuLYO9kk9iwY7TDjy42dyitq3PbbEtWHcGOumFYgUqJDtRqS2cn2D+0dTIj7SpbZmp2olCCtfwY0IUR2C4LnrU7U2CDy8gvqUFCSRphBUu87soa990/zueSw/2ecG/uvXDCyfMxc+Z+z/v7+XzP57wH5pXz/X7vPakqJEka9DdGvQBJ0qHHcJAkdRgOkqQOw0GS1GE4SJI6DAdJUofhIEnqMBwkSR2GgySpY9moF7BQJ5xwQq1evXre43/yk59w9NFHv3ILeo2yL8PZl+Hsy3Cvpb7cf//9P6yqN8017jUbDqtXr2b79u3zHt/r9ZiamnrlFvQaZV+Gsy/D2ZfhXkt9SfLkfMZ5WkmS1GE4SJI6DAdJUofhIEnqmDMckqxK8s0kjyR5OMknWv2NSe5M8r3287iBOZcl2ZHksSRnDtTXJnmw7bsySVr9yCQ3tPq9SVYv/VuVJM3XfD457AMuraqTgDOAS5KcBGwC7qqqNcBd7Tlt3wbgZGA98MUkR7RjbQEuAta0x/pWvxD4UVW9BbgCuHwJ3pskaYHmDIeq2lNV327bPwYeBVYAZwHb2rBtwNlt+yzg+qp6oaqeAHYApydZDhxTVfdU/+vnrp01Z+ZYNwHrZj5VSJJefQf1ew7tdM/bgXuBiara03b9AJho2yuAewam7Wq1n7Xt2fWZOU8BVNW+JM8DxwM/nPX6G4GNABMTE/R6vXmvfXp6+qDGHy7sy3D2ZTj7Mtw49mXe4ZDkDcDNwCerau/gP+yrqpK84l9GXVVbga0Ak5OTdTC/dPJa+iWVV5N9Gc6+DGdfhhvHvswrHJK8jn4wfLWqvtbKTydZXlV72imjZ1p9N7BqYPrKVtvdtmfXB+fsSrIMOBZ4dgHvZ15Wb7ptaH3n597zSr2kJL2mzOdupQBXAY9W1RcGdt0KnN+2zwduGahvaHcgnUj/wvN97RTU3iRntGOeN2vOzLE+ANzdrktIkkZgPp8c3gF8GHgwyQOt9ingc8CNSS4EngQ+BFBVDye5EXiE/p1Ol1TVz9u8i4FrgKOAO9oD+uHzlSQ7gOfo3+0kSRqROcOhqv4EONCdQ+sOMGczsHlIfTtwypD6T4EPzrUWSdKrw9+QliR1GA6SpA7DQZLUYThIkjoMB0lSh+EgSeowHCRJHYaDJKnDcJAkdRgOkqQOw0GS1GE4SJI6DAdJUofhIEnqMBwkSR2GgySpw3CQJHXM5zukr07yTJKHBmo3JHmgPXbOfH1oktVJ/nJg3+8NzFmb5MEkO5Jc2b5HmvZd0ze0+r1JVi/925QkHYz5fHK4Blg/WKiq36yq06rqNOBm4GsDux+f2VdVHxuobwEuAta0x8wxLwR+VFVvAa4ALl/QO5EkLZk5w6GqvgU8N2xf+9f/h4DrXu4YSZYDx1TVPVVVwLXA2W33WcC2tn0TsG7mU4UkaTQWe83hncDTVfW9gdqJ7ZTSHyd5Z6utAHYNjNnVajP7ngKoqn3A88Dxi1yXJGkRli1y/jm89FPDHuDNVfVskrXA7yc5eZGv8aIkG4GNABMTE/R6vXnPnZ6efnH8pafuGzrmYI43Lgb7ov3sy3D2Zbhx7MuCwyHJMuCfAWtnalX1AvBC274/yePAW4HdwMqB6StbjfZzFbCrHfNY4Nlhr1lVW4GtAJOTkzU1NTXv9fZ6PWbGX7DptqFjdp47/+ONi8G+aD/7Mpx9GW4c+7KY00q/Dny3ql48XZTkTUmOaNu/Qv/C8/erag+wN8kZ7XrCecAtbdqtwPlt+wPA3e26hCRpROZzK+t1wP8C3pZkV5IL264NdC9Evwv4Tru19SbgY1U1czH7YuDLwA7gceCOVr8KOD7JDuDfAJsW
8X4kSUtgztNKVXXOAeoXDKndTP/W1mHjtwOnDKn/FPjgXOuQJL16/A1pSVKH4SBJ6jAcJEkdhoMkqcNwkCR1GA6SpA7DQZLUYThIkjoMB0lSh+EgSeowHCRJHYaDJKnDcJAkdRgOkqQOw0GS1GE4SJI6DAdJUsd8vib06iTPJHlooPaZJLuTPNAe7x7Yd1mSHUkeS3LmQH1tkgfbvivbd0mT5MgkN7T6vUlWL+1blCQdrPl8crgGWD+kfkVVndYetwMkOYn+d0uf3OZ8MckRbfwW4CJgTXvMHPNC4EdV9RbgCuDyBb4XSdISmTMcqupbwHPzPN5ZwPVV9UJVPQHsAE5Pshw4pqruqaoCrgXOHpizrW3fBKyb+VQhSRqNxVxz+HiS77TTTse12grgqYExu1ptRdueXX/JnKraBzwPHL+IdUmSFmnZAudtAT4LVPv5eeAjS7WoA0myEdgIMDExQa/Xm/fc6enpF8dfeuq+oWMO5njjYrAv2s++DGdfhhvHviwoHKrq6ZntJF8CvtGe7gZWDQxd2Wq72/bs+uCcXUmWAccCzx7gdbcCWwEmJydrampq3mvu9XrMjL9g021Dx+w8d/7HGxeDfdF+9mU4+zLcOPZlQaeV2jWEGe8HZu5kuhXY0O5AOpH+hef7qmoPsDfJGe16wnnALQNzzm/bHwDubtclJEkjMucnhyTXAVPACUl2AZ8GppKcRv+00k7gowBV9XCSG4FHgH3AJVX183aoi+nf+XQUcEd7AFwFfCXJDvoXvjcsxRuTJC3cnOFQVecMKV/1MuM3A5uH1LcDpwyp/xT44FzrkCS9evwNaUlSh+EgSeowHCRJHYaDJKnDcJAkdRgOkqQOw0GS1GE4SJI6DAdJUofhIEnqMBwkSR2GgySpw3CQJHUYDpKkDsNBktRhOEiSOgwHSVLHnOGQ5OokzyR5aKD2n5J8N8l3knw9yS+1+uokf5nkgfb4vYE5a5M8mGRHkivbd0nTvm/6hla/N8nqpX+bkqSDMZ9PDtcA62fV7gROqaq/C/xv4LKBfY9X1Wnt8bGB+hbgImBNe8wc80LgR1X1FuAK4PKDfheSpCU1ZzhU1beA52bV/rCq9rWn9wArX+4YSZYDx1TVPVVVwLXA2W33WcC2tn0TsG7mU4UkaTSW4prDR4A7Bp6f2E4p/XGSd7baCmDXwJhdrTaz7ymAFjjPA8cvwbokSQu0bDGTk/w7YB/w1VbaA7y5qp5Nshb4/SQnL3KNg6+3EdgIMDExQa/Xm/fc6enpF8dfeuq+oWMO5njjYrAv2s++DGdfhhvHviw4HJJcALwXWNdOFVFVLwAvtO37kzwOvBXYzUtPPa1sNdrPVcCuJMuAY4Fnh71mVW0FtgJMTk7W1NTUvNfb6/WYGX/BptuGjtl57vyPNy4G+6L97Mtw9mW4cezLgk4rJVkP/Dbwvqr6i4H6m5Ic0bZ/hf6F5+9X1R5gb5Iz2vWE84Bb2rRbgfPb9geAu2fCRpI0GnN+ckhyHTAFnJBkF/Bp+ncnHQnc2a4d39PuTHoX8B+S/Az4a+BjVTVzMfti+nc+HUX/GsXMdYqrgK8k2UH/wveGJXlnkqQFmzMcquqcIeWrDjD2ZuDmA+zbDpwypP5T4INzrUOS9OrxN6QlSR2GgySpw3CQJHUYDpKkDsNBktRhOEiSOgwHSVKH4SBJ6jAcJEkdhoMkqcNwkCR1GA6SpA7DQZLUYThIkjoMB0lSh+EgSeowHCRJHXOGQ5KrkzyT5KGB2huT3Jnke+3ncQP7LkuyI8ljSc4cqK9N8mDbd2X7LmmSHJnkhla/N8nqpX2LkqSDNZ9PDtcA62fVNgF3VdUa4K72nCQn0f8O6JPbnC8mOaLN2QJcBKxpj5ljXgj8qKreAlwBXL7QNyNJWhpzhkNVfQt4blb5LGBb294GnD1Qv76qXqiqJ4AdwOlJlgPHVNU9VVXAtbPmzBzrJmDdzKcKSdJoLPSaw0RV7WnbPwAm2vYK
4KmBcbtabUXbnl1/yZyq2gc8Dxy/wHVJkpbAssUeoKoqSS3FYuaSZCOwEWBiYoJerzfvudPT0y+Ov/TUfUPHHMzxxsVgX7SffRnOvgw3jn1ZaDg8nWR5Ve1pp4yeafXdwKqBcStbbXfbnl0fnLMryTLgWODZYS9aVVuBrQCTk5M1NTU17wX3ej1mxl+w6bahY3aeO//jjYvBvmg/+zKcfRluHPuy0NNKtwLnt+3zgVsG6hvaHUgn0r/wfF87BbU3yRntesJ5s+bMHOsDwN3tuoQkaUTm/OSQ5DpgCjghyS7g08DngBuTXAg8CXwIoKoeTnIj8AiwD7ikqn7eDnUx/TufjgLuaA+Aq4CvJNlB/8L3hiV5Z5KkBZszHKrqnAPsWneA8ZuBzUPq24FThtR/CnxwrnVIkl49/oa0JKnDcJAkdRgOkqQOw0GS1GE4SJI6DAdJUofhIEnqMBwkSR2GgySpw3CQJHUYDpKkDsNBktRhOEiSOgwHSVKH4SBJ6jAcJEkdhoMkqWPB4ZDkbUkeGHjsTfLJJJ9Jsnug/u6BOZcl2ZHksSRnDtTXJnmw7buyfc+0JGlEFhwOVfVYVZ1WVacBa4G/AL7edl8xs6+qbgdIchL974c+GVgPfDHJEW38FuAiYE17rF/ouiRJi7dUp5XWAY9X1ZMvM+Ys4PqqeqGqngB2AKcnWQ4cU1X3VFUB1wJnL9G6JEkLsFThsAG4buD5x5N8J8nVSY5rtRXAUwNjdrXairY9uy5JGpFliz1AktcD7wMua6UtwGeBaj8/D3xksa/TXmsjsBFgYmKCXq8377nT09Mvjr/01H1DxxzM8cbFYF+0n30Zzr4MN459WXQ4AL8BfLuqngaY+QmQ5EvAN9rT3cCqgXkrW213255d76iqrcBWgMnJyZqampr3Inu9HjPjL9h029AxO8+d//HGxWBftJ99Gc6+DDeOfVmK00rnMHBKqV1DmPF+4KG2fSuwIcmRSU6kf+H5vqraA+xNcka7S+k84JYlWJckaYEW9ckhydHAPwE+OlD+j0lOo39aaefMvqp6OMmNwCPAPuCSqvp5m3MxcA1wFHBHe0iSRmRR4VBVPwGOn1X78MuM3wxsHlLfDpyymLVIkpaOvyEtSeowHCRJHYaDJKnDcJAkdRgOkqQOw0GS1GE4SJI6DAdJUofhIEnqMBwkSR2GgySpw3CQJHUYDpKkDsNBktRhOEiSOgwHSVKH4SBJ6lhUOCTZmeTBJA8k2d5qb0xyZ5LvtZ/HDYy/LMmOJI8lOXOgvrYdZ0eSK9t3SUuSRmQpPjn8o6o6raom2/NNwF1VtQa4qz0nyUnABuBkYD3wxSRHtDlbgIuANe2xfgnWJUlaoFfitNJZwLa2vQ04e6B+fVW9UFVPADuA05MsB46pqnuqqoBrB+ZIkkZgseFQwB8luT/JxlabqKo9bfsHwETbXgE8NTB3V6utaNuz65KkEVm2yPm/VlW7k/wt4M4k3x3cWVWVpBb5Gi9qAbQRYGJigl6vN++509PTL46/9NR9Q8cczPHGxWBftJ99Gc6+DDeOfVlUOFTV7vbzmSRfB04Hnk6yvKr2tFNGz7Thu4FVA9NXttrutj27Puz1tgJbASYnJ2tqamrea+31esyMv2DTbUPH7Dx3/scbF4N90X72ZTj7Mtw49mXBp5WSHJ3kF2e2gX8KPATcCpzfhp0P3NK2bwU2JDkyyYn0Lzzf105B7U1yRrtL6byBOZKkEVjMJ4cJ4OvtrtNlwH+vqj9I8qfAjUkuBJ4EPgRQVQ8nuRF4BNgHXFJVP2/Huhi4BjgKuKM9JEkjsuBwqKrvA786pP4ssO4AczYDm4fUtwOnLHQtkqSl5W9IS5I6DAdJUofhIEnqMBwkSR2GgySpw3CQJHUYDpKkDsNBktRhOEiSOgwHSVKH4SBJ6jAcJEkdhoMkqcNwkCR1GA6SpA7DQZLUYThIkjoW8x3Sq5J8M8kjSR5O8olW/0yS3UkeaI93D8y5
LMmOJI8lOXOgvjbJg23fle27pCVJI7KY75DeB1xaVd9O8ovA/UnubPuuqKrfGRyc5CRgA3Ay8MvAHyV5a/se6S3ARcC9wO3AevweaUkamQV/cqiqPVX17bb9Y+BRYMXLTDkLuL6qXqiqJ4AdwOlJlgPHVNU9VVXAtcDZC12XJGnxluSaQ5LVwNvp/8sf4ONJvpPk6iTHtdoK4KmBabtabUXbnl2XJI3IYk4rAZDkDcDNwCeram+SLcBngWo/Pw98ZLGv015rI7ARYGJigl6vN++509PTL46/9NR9Q8cczPHGxWBftJ99Gc6+DDeOfVlUOCR5Hf1g+GpVfQ2gqp4e2P8l4Bvt6W5g1cD0la22u23PrndU1VZgK8Dk5GRNTU3Ne629Xo+Z8Rdsum3omJ3nzv9442KwL9rPvgxnX4Ybx74s5m6lAFcBj1bVFwbqyweGvR94qG3fCmxIcmSSE4E1wH1VtQfYm+SMdszzgFsWui5J0uIt5pPDO4APAw8meaDVPgWck+Q0+qeVdgIfBaiqh5PcCDxC/06nS9qdSgAXA9cAR9G/S8k7lSRphBYcDlX1J8Cw30e4/WXmbAY2D6lvB05Z6FokSUvL35CWJHUs+m6lcbL6QBeqP/eeV3klkjRafnKQJHUYDpKkDsNBktRhOEiSOgwHSVKH4SBJ6jAcJEkdhoMkqcNwkCR1GA6SpA7DQZLUYThIkjoMB0lSh3+VdR4O9Ndawb/YKmk8+clBktRxyIRDkvVJHkuyI8mmUa9Hkg5nh0Q4JDkC+C/AbwAn0f8e6pNGuypJOnwdKtccTgd2VNX3AZJcD5wFPDLSVc3Dy12PGMZrFJJeCw6VcFgBPDXwfBfw90e0llfUUoWJoSTplXSohMO8JNkIbGxPp5M8dhDTTwB+uPSremXl8lf8OK/JvrwK7Mtw9mW411Jf/vZ8Bh0q4bAbWDXwfGWrvURVbQW2LuQFkmyvqsmFLW982Zfh7Mtw9mW4cezLIXFBGvhTYE2SE5O8HtgA3DriNUnSYeuQ+ORQVfuS/EvgfwBHAFdX1cMjXpYkHbYOiXAAqKrbgdtfwZdY0Omow4B9Gc6+DGdfhhu7vqSqRr0GSdIh5lC55iBJOoSMfTj4Zzn6kqxK8s0kjyR5OMknWv2NSe5M8r3287hRr3UUkhyR5M+SfKM9P+z7kuSXktyU5LtJHk3yD+wLJPnX7f+hh5Jcl+RvjmNfxjoc/LMcL7EPuLSqTgLOAC5pvdgE3FVVa4C72vPD0SeARwee2xf4z8AfVNXfAX6Vfn8O674kWQH8K2Cyqk6hfwPNBsawL2MdDgz8WY6q+itg5s9yHHaqak9Vfbtt/5j+/+gr6PdjWxu2DTh7NCscnSQrgfcAXx4oH9Z9SXIs8C7gKoCq+quq+n8c5n1plgFHJVkG/ALwfxnDvox7OAz7sxwrRrSWQ0aS1cDbgXuBiara03b9AJgY0bJG6XeB3wb+eqB2uPflRODPgf/aTrd9OcnRHOZ9qardwO8A/wfYAzxfVX/IGPZl3MNBsyR5A3Az8Mmq2ju4r/q3rh1Wt68leS/wTFXdf6Axh2Nf6P/r+O8BW6rq7cBPmHWq5HDsS7uWcBb98Pxl4OgkvzU4Zlz6Mu7hMK8/y3G4SPI6+sHw1ar6Wis/nWR5278ceGZU6xuRdwDvS7KT/mnHf5zkv2FfdgG7qure9vwm+mFxuPfl14EnqurPq+pnwNeAf8gY9mXcw8E/y9EkCf3zx49W1RcGdt0KnN+2zwduebXXNkpVdVlVrayq1fT/+7i7qn4L+/ID4Kkkb2uldfT/hP5h3Rf6p5POSPIL7f+pdfSv341dX8b+l+CSvJv+OeWZP8uxecRLGokkvwb8T+BB9p9b/xT96w43Am8GngQ+VFXPjWSRI5ZkCvi3VfXeJMdzmPclyWn0L9K/Hvg+8M/p/4PycO/Lvwd+k/4dgH8G/AvgDYxZX8Y+HCRJB2/cTytJkhbAcJAkdRgOkqQOw0GS
1GE4SJI6DAdJUofhIEnqMBwkSR3/Hy9LCVcf+gBQAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x7f5d726b95c0>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "ratings_counts['counts'].hist(bins = 50)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "ratings_counts.dropna(axis = 0, inplace = True)\n",
    "ratings.dropna(axis = 0, inplace = True)\n",
    "\n",
    "ratings_train, ratings_test = train_test_split(ratings_counts, test_size = 0.2, random_state = 42)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "# exporting to Spark\n",
    "\n",
    "train_df = spark.createDataFrame(ratings_train)\n",
    "test_df = spark.createDataFrame(ratings_test)\n",
    "ratings_df = spark.createDataFrame(ratings_genres)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training\n",
    "\n",
    "### Recommendations with explicit ratings\n",
    "\n",
    "The Matrix Factorization model, as used for making recommendations, assumes that the matrix of ratings $R_{n \\times k}$, where $r_{ui}$ is the rating of $u$-th user of the $i$-th item, can be factorized into two matrices $P_{n \\times m}$ ($m$ latent features for users, the `rank` in the `ALS` Spark class) and $Q_{m \\times k}$ ($m$ latent features for items) , such that $r_{ui} \\approx q_i^T p_u$, what is achieved by minimizing the squared error\n",
    "\n",
    "$$\n",
    "\\min_{p_*, q_*} \\; \\sum_{u,i} (r_{ui} - q_i^T p_u)^2 + \\lambda\\, ( \\| p_u \\|^2 + \\| q_i \\|^2 )\n",
    "$$\n",
    "\n",
    "where $\\lambda \\ge 0$ is the $L_2$ regularization term (`regParam` in the `ALS` Spark class, `regParam=0.1` by default). The model is able to predict ratings for previously unrated item for a user as $\\hat r_{ui} = q_i^T p_u$. The factorization can be achieved using stochastic gradient descent as described in the classic paper by [Koren et al (2009) in *Matrix factorization techniques for recommender systems*](https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf) or alternating least squares (as implemented in Apache Spark ML library). Short comparison of the methods was given by [Aberger in *Recommender: An Analysis of Collaborative Filtering Techniques*](http://cs229.stanford.edu/proj2014/Christopher%20Aberger,%20Recommender.pdf).\n",
    "\n",
    "### Recommendations with implicit ratings\n",
    "\n",
    "The basic algorithm assumes explicit ratings, i.e. users directly express their ratings (e.g. five-star rating on movie recommendation site). Alternatively, the algorithm can be adapted for implicit ratings as described by [Hu et al (2008) in *Collaborative Filtering for Implicit Feedback Datasets*](http://yifanhu.net/PUB/cf.pdf). If we don't have explicit ratings, we can use implicit indicators of preferences (e.g. page views, clicks, number of purchases, etc.), that do not have to explicitely reflect the preferences and may depend on other factors as well, but still, can be assumed to be related to preferences. The adapted model is defined in terms of dummy variable for observing any action by user (e.g. any number of clicks on the page)\n",
    "\n",
    "$$\n",
    "d_{ui} = \\begin{cases}\n",
    "1 \\quad r_{ui} > 0 \\\\\n",
    "0 \\quad r_{ui} = 0\n",
    "\\end{cases}\n",
    "$$\n",
    "\n",
    "and the \"implicit ratings\" (the clicks themselves) are used as a weights (there are other possible choices for the weighting function, but the one below was originally proposed by Hu et al and is [used in Spark](https://github.com/apache/spark/blob/3e778f5a91b0553b09fe0e0ee84d771a71504960/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L1682))\n",
    "\n",
    "$$\n",
    "c_{ui} = 1 + \\alpha \\, r_{ui}\n",
    "$$\n",
    "\n",
    "where $\\alpha \\ge 0$ is weight (`alpha` in the `ALS` Spark class, `alpha=1.0` by default). The algorithm minimizes the weighted squared error\n",
    "\n",
    "$$\n",
    "\\min_{p_*, q_*} \\;  \\sum_{u,i} c_{ui}\\,(d_{ui} - q_i^T p_u)^2 + \\lambda \\, ( \\| p_u \\|^2 + \\| q_i \\|^2 )\n",
    "$$\n",
    "\n",
    "using the alternating least squares (ALS) algorithm. The model for implicit ratings *does not* predict the ratings, but instead [returns scores](https://stackoverflow.com/questions/46904078/spark-als-recommendation-system-have-value-prediction-greater-than-1/46913322#46913322) (usually in $[0, 1]$ range, but [can be negative or greater then one](https://stackoverflow.com/questions/44911349/why-does-spark-ml-als-model-returns-nan-and-negative-numbers-predictions/44928131#44928131)), that can be used for ranking the recommendations.\n",
    "\n",
    "Matrix Factorization module is provided in Apache Spark as [`pyspark.ml.recommendation.ALS`](http://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#module-pyspark.ml.recommendation) class in the [ML library](https://spark.apache.org/docs/latest/ml-guide.html). Introductory tutorial can be found in the [Spark documentation](https://spark.apache.org/docs/2.2.0/ml-collaborative-filtering.html) and additionally, nice step-by-step tutorial [*Movie recommender system with Spark machine learning* is provided by Harper and Konstan](https://datascience.ibm.com/exchange/public/entry/view/99b857815e69353c04d95daefb3b91fa). For implicit ratings the algorithm needs setting `implicitPrefs=True` and possibly also `nonnegative=True` (for [non-negative matrix factorization](https://stackoverflow.com/questions/44911349/why-does-spark-ml-als-model-returns-nan-and-negative-numbers-predictions/44928131#44928131), since we are usually dealing here with counts).\n",
    "\n",
    "The Sparks recommendation module works with sparse matrices in form of `(user, item, rating)` tuples and [only the available combinations need to be provided](https://stackoverflow.com/questions/49490872/data-format-for-spark-als-recommendation-system-with-implicit-feedback/49491798#49491798), so if user didn't buy the item, you **don't have to** pass it as `(user, item, 0)` rating."
   ]
  },
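  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the implicit-feedback objective more concrete, here is a minimal NumPy sketch on a made-up count matrix. It builds $d_{ui}$ and $c_{ui}$ as defined above, evaluates the weighted loss, and performs one closed-form update of the user factors with the item factors held fixed (the \"alternating\" step, following Hu et al). The toy matrix, the variable names and the plain $\\lambda$ penalty are assumptions made for illustration; this is not meant to reproduce how Spark implements ALS internally."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "rng = np.random.RandomState(0)\n",
    "\n",
    "# Hypothetical implicit ratings: rows are users, columns are items, values are counts\n",
    "R = np.array([[3, 0, 1],\n",
    "              [0, 2, 0],\n",
    "              [1, 0, 0],\n",
    "              [0, 0, 4]], dtype=float)\n",
    "\n",
    "alpha, lam, m = 1.0, 0.1, 2    # play the roles of `alpha`, `regParam` and `rank`\n",
    "\n",
    "D = (R > 0).astype(float)      # d_ui: was any action observed?\n",
    "C = 1.0 + alpha * R            # c_ui: weights\n",
    "\n",
    "P = rng.rand(R.shape[0], m)    # user factors, one row per user\n",
    "Q = rng.rand(R.shape[1], m)    # item factors, one row per item\n",
    "\n",
    "def loss(P, Q):\n",
    "    # weighted squared error plus an L2 penalty on all the factors\n",
    "    err = D - P.dot(Q.T)\n",
    "    return (C * err**2).sum() + lam * ((P**2).sum() + (Q**2).sum())\n",
    "\n",
    "def update_users(Q):\n",
    "    # with Q fixed, each user's factors solve a small weighted ridge regression\n",
    "    P_new = np.empty((R.shape[0], m))\n",
    "    for u in range(R.shape[0]):\n",
    "        Cu = np.diag(C[u])\n",
    "        A = Q.T.dot(Cu).dot(Q) + lam * np.eye(m)\n",
    "        b = Q.T.dot(Cu).dot(D[u])\n",
    "        P_new[u] = np.linalg.solve(A, b)\n",
    "    return P_new\n",
    "\n",
    "loss_before = loss(P, Q)\n",
    "P = update_users(Q)\n",
    "loss_after = loss(P, Q)\n",
    "\n",
    "# the update minimizes the loss over P exactly, so it cannot increase\n",
    "print(loss_before, loss_after)"
   ]
  },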
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "root\n",
      " |-- userId: long (nullable = true)\n",
      " |-- genreId: long (nullable = true)\n",
      " |-- counts: long (nullable = true)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "train_df.printSchema()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "ALS_47c7be005961efdc87e8"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from pyspark.ml.recommendation import ALS\n",
    "\n",
    "# set this if you don't want to get into StackOverflowError's in Java\n",
    "# see: https://stackoverflow.com/questions/31484460/spark-gives-a-stackoverflowerror-when-training-using-als\n",
    "sc.setCheckpointDir('checkpoint/')\n",
    "\n",
    "als = ALS(rank = 5, maxIter = 10, implicitPrefs = True, nonnegative = True,\n",
    "          userCol = \"userId\", itemCol = \"genreId\", ratingCol = \"counts\", seed = 42)\n",
    "mf = als.fit(train_df)\n",
    "mf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Predictions on test set\n",
    "\n",
    "For testing, we produce a set containing of all the *user (from the test set)* $\\times$ *genre (all)* pairs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "root\n",
      " |-- userId: long (nullable = true)\n",
      " |-- genreId: long (nullable = true)\n",
      " |-- counts: long (nullable = true)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "n_genres = test_df.select('genreId').distinct().count()\n",
    "\n",
    "test_full = (\n",
    "    test_df.select('userId').distinct()\n",
    "    .crossJoin(test_df.select('genreId').distinct())\n",
    "    .join(test_df, on = ['userId', 'genreId'], how = 'left')\n",
    "    .fillna(0, subset = ['counts'])\n",
    "    .cache()\n",
    ")\n",
    "\n",
    "test_full.printSchema()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we filter out the `null` predictions that appeared when *both* user and genre were absent in the training data. The `ALS` module needs users to rate at least some items, and items to be rated by at least some users, otherwise it cannot make a prediction. Instead of dropping rows with missing values, this can be dealt by setting `coldStartStrategy=\"drop\"` (instead of the default `coldStartStrategy=\"nan\"`). Notice however that returning `null`'s may be safer strategy for the future use in production."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+------+-------+------+-----------+\n",
      "|userId|genreId|counts| prediction|\n",
      "+------+-------+------+-----------+\n",
      "|    31|    243|     0| 0.02708055|\n",
      "|   516|    243|     0|  0.5684143|\n",
      "|   101|    243|     0| 0.07596533|\n",
      "|   115|    243|     0|    0.13874|\n",
      "|   126|    243|     0| 0.54502714|\n",
      "|    26|    243|     0| 0.26572812|\n",
      "|    27|    243|     0|0.044224445|\n",
      "|   332|    243|     0| 0.09251146|\n",
      "|   501|    243|     0| 0.24537466|\n",
      "|   577|    243|     0| 0.53813565|\n",
      "|   626|    243|     0|0.106144555|\n",
      "|   111|    243|     0|  0.5827691|\n",
      "|   224|    243|     0|  0.5046779|\n",
      "|   519|    243|     0| 0.07273325|\n",
      "|   654|    243|     1| 0.63475347|\n",
      "|   291|    243|     0|0.048557412|\n",
      "|   325|    243|     0| 0.10034398|\n",
      "|   386|    243|     0| 0.25733843|\n",
      "|   435|    243|     0|0.015305937|\n",
      "|   473|    243|     0| 0.22342081|\n",
      "+------+-------+------+-----------+\n",
      "only showing top 20 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "predictions = mf.transform(test_full).na.drop()\n",
    "predictions.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluating the results\n",
    "\n",
    "[Li et al (2008) in *Improving One-Class Collaborative Filtering by Incorporating Rich User Information*](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.228.7135&rep=rep1&type=pdf) and [Hu et al (2008) in *Collaborative Filtering for Implicit Feedback Datasets*](http://yifanhu.net/PUB/cf.pdf) propose using [Mean Percentage Ranking ($MPR$)](https://stackoverflow.com/questions/46462470/how-can-i-evaluate-the-implicit-feedback-als-algorithm-for-recommendations-in-ap/46490352#46490352) as an evaluation metric for Collaborative Filtering recommender systems. As noticed by Li et al,\n",
    "\n",
    "> Because of the nature of the One Class Recommendation, we don’t have reliable feedback for user’s\n",
    "> preference for items. A user who hasn’t bought an item doesn’t necessary mean he doesn’t like it.\n",
    "> On the other hand, due to the money commitment, the purchase behavior is a strong indication of\n",
    "> user preference over the purchased item.\n",
    "\n",
    "So recall-oriented metric called $MPR$ was proposed. It is calculated as an average percentile ranking $rank_{ui}$ over all users $u$ and items $i$. Percentile ranking $rank_{ui} = 0\\%$ means most prefered item and $rank_{ui} = 100\\%$, least prefered, and $d_{ui}$ are binary indicators is user $u$ has bought item $i$.\n",
    "\n",
    "$$\n",
    "MPR = \\frac{\\sum_{u,i} d_{ui} \\times rank_{ui}}{\\sum_{u,i} d_{ui}}\n",
    "$$\n",
    "\n",
    "where the *smaller* $MPR$ values, the better fit, and randomly produced rankings would result in $MPR \\approx 50\\%$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another possible metric is the [Mean Reciprocal Rank ($MRR$)](https://en.wikipedia.org/wiki/Mean_reciprocal_rank), that was used, for example, by Allain in his talk on [recommender systems with Tensorflow](https://youtu.be/vaJOlKxyKhA) during the PyData London 2017 conference. It is defined as\n",
    "\n",
    "$$\n",
    "MRR = \\frac{1}{n} \\sum_u \\min_{i \\in \\{ i \\,\\mid\\, r_{ui} > 0\\}}(\\,rank_{ui}\\,)^{-1}\n",
    "$$\n",
    "\n",
    "i.e. it is an inverse of the rank (where rank=1 is the best choice, rank=2 is second best, etc.) of the first correct answer. The *higher* $MRR$, the better, where $MRR \\approx 0$ would mean that all the predictions were wrong."
   ]
  },
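  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Both metrics can be checked by hand on a tiny made-up example. The pandas sketch below uses hypothetical scores for two users and four genres, ranks the genres within each user by score (rank 1 = highest score), and computes $MPR$ and $MRR$ over the pairs with `counts > 0`. The percentile rank is taken here as `rank / k`, which is an assumption of this sketch. For this data, $MPR = (1/4 + 3/4 + 2/4) / 3 = 0.5$ and $MRR = (1/1 + 1/2) / 2 = 0.75$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical scores for 2 users x 4 genres; counts > 0 marks an observed action\n",
    "toy = pd.DataFrame({\n",
    "    'userId':     [1, 1, 1, 1, 2, 2, 2, 2],\n",
    "    'genreId':    [10, 20, 30, 40, 10, 20, 30, 40],\n",
    "    'counts':     [2, 0, 1, 0, 0, 0, 0, 3],\n",
    "    'prediction': [0.9, 0.1, 0.4, 0.6, 0.2, 0.8, 0.3, 0.5],\n",
    "})\n",
    "k = toy['genreId'].nunique()\n",
    "\n",
    "# rank 1 = highest score within each user\n",
    "toy['rank'] = toy.groupby('userId')['prediction'].rank(ascending=False, method='first')\n",
    "\n",
    "hits = toy[toy['counts'] > 0]\n",
    "\n",
    "toy_mpr = (hits['rank'] / k).mean()                          # the lower the better\n",
    "toy_mrr = (1 / hits.groupby('userId')['rank'].min()).mean()  # the higher the better\n",
    "\n",
    "print(toy_mpr, toy_mrr)"
   ]
  },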
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------------------+-------------------+-------------------+-----------------+-----------------+\n",
      "|        avg 1-score|                MPR|                MRR|            MPR*k|            1/MRR|\n",
      "+-------------------+-------------------+-------------------+-----------------+-----------------+\n",
      "|0.49771369498925355|0.22664751714078532|0.32356995151122875|52.80887149380298|3.090521834087234|\n",
      "+-------------------+-------------------+-------------------+-----------------+-----------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "(\n",
    "    predictions\n",
    "    .withColumn('rank', row_number().over(Window.partitionBy('userId').orderBy(desc('prediction'))))\n",
    "    .where(col('counts') > 0) # Notice: this excludes users with no actions at all\n",
    "    .groupby('userId')\n",
    "    .agg(\n",
    "        count('*').alias('n'),\n",
    "        sum(1 - col('prediction')).alias('sum_pred'),\n",
    "        sum(col('rank') / n_genres).alias('sum_perc_rank'),\n",
    "        min('rank').alias('min_rank')\n",
    "    )\n",
    "    .agg(\n",
    "        (sum('sum_pred') / sum('n')).alias('avg 1-score'),\n",
    "        (sum('sum_perc_rank') / sum('n')).alias('MPR'), # the lower the better\n",
    "        mean(1 / col('min_rank')).alias('MRR')          # the higher the better\n",
    "    )\n",
    "    .withColumn('MPR*k', col('MPR') * n_genres)\n",
    "    .withColumn('1/MRR', 1/col('MRR'))\n",
    ").show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For comparison, below we look at $RMSE$ and $MAE$ of the scores and the percentile rankings (here: higher rank = better). Such comparison **can be misleading** since scores do not have any intuitive interpretation and cannot be interpreted as probabilities, so they do not have to reflect the percentile rankings. Moreover, we assume the scores to reflect implicit, latent preferences, that do not have to exactly match the explicit behavior.\n",
    "\n",
    "In general, the standard metrics for evaluating continuous predictions **are not recommended** for evaluating recommender systems with implicit ratings. The metrics might however be considered when dealing with explicit ratings, where missing ratings are treated as missing data (to be predicted). When predicting explicit ratings, we expect the model to accurately predict the ratings and this can be judged using metrics such as $RMSE$ and $MAE$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+------------------+------------------+\n",
      "|              RMSE|               MAE|\n",
      "+------------------+------------------+\n",
      "|0.3603028496866236|0.3054604930089585|\n",
      "+------------------+------------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "(\n",
    "    predictions\n",
    "    .withColumn('perc_rank', row_number().over(Window.partitionBy('userId').orderBy('prediction')) / n_genres)\n",
    "    .withColumn('squared_error', (col('prediction') - col('perc_rank'))**2)\n",
    "    .withColumn('absolute_error', abs(col('prediction') - col('perc_rank')))\n",
    "    .agg(\n",
    "        sqrt(mean('squared_error')).alias('RMSE'),\n",
    "        mean('absolute_error').alias('MAE')\n",
    "    )\n",
    ").show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Notice:** All the error estimates presented above are biased because of a leak: some of the `(user, genre)` pairs in the test set are also present in the training set. To correct this, we filter such pairs out of the test set using an anti join."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "145053"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "predictions.count()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "126978"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "predictions.join(train_df, on = ['userId', 'genreId'], how = 'left_anti').count()"
   ]
  },
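  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the two counts shown above, the size of the overlap works out as follows (simple arithmetic on those outputs):\n",
    "\n",
    "$$\n",
    "145053 - 126978 = 18075, \\qquad \\frac{18075}{145053} \\approx 12.5\\%\n",
    "$$\n",
    "\n",
    "so roughly one in eight of the scored pairs also appeared in the training set."
   ]
  },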
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------------------+-------------------+------------------+-----------------+-----------------+\n",
      "|        avg 1-score|                MPR|               MRR|            MPR*k|            1/MRR|\n",
      "+-------------------+-------------------+------------------+-----------------+-----------------+\n",
      "|0.49771369498925355|0.16006303743330938|0.6696293289814791|37.29468772196108|1.493363502343934|\n",
      "+-------------------+-------------------+------------------+-----------------+-----------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "(\n",
    "    predictions\n",
    "    .join(train_df, on = ['userId', 'genreId'], how = 'left_anti') # filter out the train set\n",
    "    .withColumn('rank', row_number().over(Window.partitionBy('userId').orderBy(desc('prediction'))))\n",
    "    .withColumn('count_rank', count('*').over(Window.partitionBy('userId')))\n",
    "    .where(col('counts') > 0) # Notice: this excludes users with no actions at all\n",
    "    .groupby('userId')\n",
    "    .agg(\n",
    "        count('*').alias('n'),\n",
    "        sum(1 - col('prediction')).alias('sum_pred'),\n",
    "        sum(col('rank') / col('count_rank')).alias('sum_perc_rank'),\n",
    "        min('rank').alias('min_rank')\n",
    "    )\n",
    "    .agg(\n",
    "        (sum('sum_pred') / sum('n')).alias('avg 1-score'),\n",
    "        (sum('sum_perc_rank') / sum('n')).alias('MPR'), # the lower the better\n",
    "        mean(1 / col('min_rank')).alias('MRR')          # the higher the better\n",
    "    )\n",
    "    .withColumn('MPR*k', col('MPR') * n_genres)\n",
    "    .withColumn('1/MRR', 1/col('MRR'))\n",
    ").show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What's next?\n",
    "\n",
    "Surprisingly, the algorithm gets even better evaluation scores when tested on entirely unseen data! This suggests that the model may be underfitting, so we can try tuning the parameters:\n",
    "\n",
    " - `regParam`: regularization; higher values penalize the parameters more, so they are shrunk towards zero unless they have a significant impact on the results,\n",
    " - `alpha`: weights; higher values give more credibility to the observed implicit ratings,\n",
    " - `rank`: number of latent dimensions; higher values could lead to greater overfitting, since the model becomes more complicated and has more parameters,\n",
    " - `maxIter`: number of iterations; increasing it could lead to a closer fit to the training data (since `ALS` performs matrix factorization, we could expect to see a near perfect fit on the training set<sup>&dagger;</sup> after some number of iterations). The ALS algorithm does not need a huge number of iterations to converge, whereas SGD (not implemented in the Spark ML library) may need more.\n",
    "\n",
    "<small><i><sup>&dagger;</sup> for explicit ratings, since the implicit-ratings algorithm does not predict the observed outcomes, but returns a score.</i></small>"
   ]
  },
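  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see where these parameters enter, it may help to look at the implicit-feedback objective of Hu et al. (2008), on which the implicit mode of `ALS` is based. This is a sketch in the paper's notation, not Spark's exact implementation (Spark may, for instance, scale the penalty term differently):\n",
    "\n",
    "$$\n",
    "\\min_{x,\\, y} \\; \\sum_{u,i} c_{ui} \\left( p_{ui} - x_u^\\top y_i \\right)^2 + \\lambda \\left( \\sum_u \\lVert x_u \\rVert^2 + \\sum_i \\lVert y_i \\rVert^2 \\right),\n",
    "\\qquad c_{ui} = 1 + \\alpha \\, r_{ui}\n",
    "$$\n",
    "\n",
    "Here $r_{ui}$ is the observed implicit rating, $p_{ui} = 1$ if $r_{ui} > 0$ and $0$ otherwise, $\\lambda$ corresponds to `regParam`, $\\alpha$ to `alpha`, and the length of the factor vectors $x_u$ and $y_i$ to `rank`."
   ]
  },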
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using log-weights\n",
    "\n",
    "[Hu et al. (2008), in *Collaborative Filtering for Implicit Feedback Datasets*](http://yifanhu.net/PUB/cf.pdf), also suggested an alternative method for calculating the weights for the implicit ratings,\n",
    "\n",
    "$$\n",
    "c_{ui} = 1 + \\alpha \\, \\log(1 + r_{ui}\\,/\\,\\epsilon)\n",
    "$$\n",
    "\n",
    "It may be reasonable to use it when the implicit ratings have a very skewed distribution. For example, your data may contain users who watch movies very often and users who are very selective and watch movies rarely; in such a case you might not want a single watched movie to carry equal weight for both kinds of users. Taking the log of the counts makes the weight for users who watch many movies grow more slowly than for those who watch movies rarely. The above weighting can be easily implemented by transforming the raw counts using the `log1p` function in Python or Spark, and passing the log-transformed counts into the model (this *may* have a significant effect with some real-life data). Notice that we do not have to transform the missing ratings, since $\\log(1 + 0) = \\log(1) = 0$.\n",
    "\n",
    "An example of using such weighting is shown below; in this case, however, it does not seem to influence the results."
   ]
  },
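  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick worked illustration (assuming $\\epsilon = 1$, which is what passing `log1p` of the counts amounts to, and natural logarithms), the log-transformed rating grows much more slowly than the raw count:\n",
    "\n",
    "$$\n",
    "\\log(1 + 1) \\approx 0.69, \\qquad \\log(1 + 10) \\approx 2.40, \\qquad \\log(1 + 100) \\approx 4.62\n",
    "$$\n",
    "\n",
    "so a hundredfold increase in the count raises the transformed rating by a factor of only about $6.7$."
   ]
  },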
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "root\n",
      " |-- userId: long (nullable = true)\n",
      " |-- genreId: long (nullable = true)\n",
      " |-- counts: long (nullable = true)\n",
      " |-- log_counts: double (nullable = true)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "train_df = (\n",
    "    train_df\n",
    "    .withColumn('log_counts', log1p('counts'))\n",
    "    .cache()\n",
    ")\n",
    "\n",
    "train_df.printSchema()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "root\n",
      " |-- userId: long (nullable = true)\n",
      " |-- genreId: long (nullable = true)\n",
      " |-- counts: long (nullable = true)\n",
      " |-- log_counts: double (nullable = true)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "test_df = (\n",
    "    test_df\n",
    "    .withColumn('log_counts', log1p('counts'))\n",
    "    .cache()\n",
    ")\n",
    "\n",
    "test_df.printSchema()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "ALS_4ee9bd8796e9803e0ac8"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sc.setCheckpointDir('checkpoint/')\n",
    "\n",
    "als2 = ALS(rank = 5, maxIter = 10, implicitPrefs = True, nonnegative = True,\n",
    "           userCol = \"userId\", itemCol = \"genreId\", ratingCol = \"log_counts\", seed = 42)\n",
    "mf2 = als2.fit(train_df)\n",
    "mf2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "root\n",
      " |-- userId: long (nullable = true)\n",
      " |-- genreId: long (nullable = true)\n",
      " |-- counts: long (nullable = true)\n",
      " |-- log_counts: double (nullable = false)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "test_full = (\n",
    "    test_df.select('userId').distinct()\n",
    "    .crossJoin(test_df.select('genreId').distinct())\n",
    "    .join(test_df, on = ['userId', 'genreId'], how = 'left')\n",
    "    .fillna(0, subset = ['counts', 'log_counts'])\n",
    "    .cache()\n",
    ")\n",
    "\n",
    "test_full.printSchema()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "predictions2 = (\n",
    "    mf2.transform(test_full)\n",
    "    .na.drop()\n",
    "    .join(train_df, on = ['userId', 'genreId'], how = 'left_anti')\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+------------------+-------------------+------------------+-----------------+-----------------+\n",
      "|       avg 1-score|                MPR|               MRR|            MPR*k|            1/MRR|\n",
      "+------------------+-------------------+------------------+-----------------+-----------------+\n",
      "|0.5428068000717147|0.15910291090884712|0.6850956209482943|37.07097824176138|1.459650258187057|\n",
      "+------------------+-------------------+------------------+-----------------+-----------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "(\n",
    "    predictions2\n",
    "    .withColumn('rank', row_number().over(Window.partitionBy('userId').orderBy(desc('prediction'))))\n",
    "    .withColumn('count_rank', count('*').over(Window.partitionBy('userId')))\n",
    "    .where(col('counts') > 0) # Notice: this excludes users with no actions at all\n",
    "    .groupby('userId')\n",
    "    .agg(\n",
    "        count('*').alias('n'),\n",
    "        sum(1 - col('prediction')).alias('sum_pred'),\n",
    "        sum(col('rank') / col('count_rank')).alias('sum_perc_rank'),\n",
    "        min('rank').alias('min_rank')\n",
    "    )\n",
    "    .agg(\n",
    "        (sum('sum_pred') / sum('n')).alias('avg 1-score'),\n",
    "        (sum('sum_perc_rank') / sum('n')).alias('MPR'), # the lower the better\n",
    "        mean(1 / col('min_rank')).alias('MRR')          # the higher the better\n",
    "    )\n",
    "    .withColumn('MPR*k', col('MPR') * n_genres)\n",
    "    .withColumn('1/MRR', 1/col('MRR'))\n",
    ").show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Last updated: 2018-05-16 11:09:22.587906\n"
     ]
    }
   ],
   "source": [
    "import datetime\n",
    "\n",
    "print('Last updated: ' + str(datetime.datetime.now()))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }