Table capturing ratings that users gave to items, with schema: user_id, item_id, rating, date
- group data by user_id
- remove groups with fewer than 10 items
- if a user has rated more than 500 items, keep only the 500 most recent items (by date)
- create a binary interaction matrix from the data (users correspond to rows, items correspond to columns; cell (i,j) is 1 if user i interacted with item j, 0 otherwise)
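The preprocessing steps above can be sketched in plain Scala collections. This is only an illustrative, single-machine sketch: the `Rating` case class, the `Preprocess` object, and the parameter names are hypothetical, the date is assumed to be a numeric timestamp, and each user is assumed to have rated an item at most once. The binary matrix is represented sparsely as one set of item ids per user.

```scala
// Hypothetical record type mirroring the schema user_id, item_id, rating, date.
// Assumption: date is a numeric timestamp, so "latest" means largest value.
final case class Rating(userId: Int, itemId: Int, rating: Double, date: Long)

object Preprocess {
  // Returns, per user, the set of items that user interacted with.
  // This is a sparse form of the binary matrix: cell (user, item) is 1
  // exactly when the item is contained in the user's set.
  def interactions(
      ratings: Seq[Rating],
      minItems: Int = 10,
      maxItems: Int = 500
  ): Map[Int, Set[Int]] =
    ratings
      .groupBy(_.userId)                                 // group data by user_id
      .filter { case (_, rs) => rs.size >= minItems }    // drop users with fewer than 10 items
      .map { case (user, rs) =>
        // keep only the most recent ratings when a user has more than maxItems
        val latest = rs.sortBy(r => -r.date).take(maxItems)
        user -> latest.map(_.itemId).toSet
      }
}
```

The rating value itself is dropped here, since only the fact that an interaction happened enters the binary matrix.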
Item-Based Collaborative Filtering: compute item similarity matrix using Jaccard similarity.
Input: binary interaction matrix A
Steps:
// vector containing number of interactions per item
val sums = colsums(A)
// cooccurrence matrix A'A between items
val C = A.transposeTimes(A)
// make the per-item counts available to every worker
broadcast(sums)
// compute matrix S containing jaccard similarity between items
val S = C.mapNonZeroEntries { case (itemA, itemB, numCooccurrences) =>
  // |A ∩ B| / |A ∪ B|; toDouble guards against integer division if the counts are integral
  numCooccurrences.toDouble / (sums(itemA) + sums(itemB) - numCooccurrences)
}
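As a rough single-machine illustration of the steps above, the following sketch computes the same quantities with plain Scala collections instead of distributed matrix operations. The `JaccardSimilarity` object and its method name are hypothetical, and the input is assumed to be a map from user id to the set of items that user interacted with (a sparse form of A). Only pairs with itemA < itemB are emitted, since the similarity is symmetric.

```scala
object JaccardSimilarity {
  // Input: per-user item sets (sparse form of the binary matrix A).
  // Output: Jaccard similarity for every item pair (a, b), a < b,
  // that co-occurs in at least one user's set.
  def similarities(interactions: Map[Int, Set[Int]]): Map[(Int, Int), Double] = {
    val rows: Seq[Set[Int]] = interactions.values.toSeq

    // number of interactions per item (column sums of A)
    val sums: Map[Int, Int] =
      rows.flatten.groupBy(identity).map { case (item, xs) => item -> xs.size }

    // co-occurrence counts: nonzero entries of A'A above the diagonal
    val cooccurrences: Map[(Int, Int), Int] =
      rows
        .flatMap { items =>
          val xs = items.toSeq
          for (a <- xs; b <- xs if a < b) yield (a, b)
        }
        .groupBy(identity)
        .map { case (pair, xs) => pair -> xs.size }

    // jaccard = |A ∩ B| / |A ∪ B| = cooc / (sumA + sumB - cooc)
    cooccurrences.map { case ((a, b), n) =>
      (a, b) -> n.toDouble / (sums(a) + sums(b) - n)
    }
  }
}
```

For example, if item 10 was seen by three users, item 20 by two users, and both by the same two users, the similarity is 2 / (3 + 2 - 2) = 2/3.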