# Run some recommendation experiments using MovieLens 100K
import pandas
import numpy
import scipy.sparse
import scipy.sparse.linalg
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error

data_dir = "data/ml-100k/"
data_shape = (943, 1682)
df = pandas.read_csv(data_dir + "ua.base", sep="\t", header=None)
values = df.values
values[:, 0:2] -= 1
X_train = scipy.sparse.csr_matrix((values[:, 2], (values[:, 0], values[:, 1])), dtype=numpy.float64, shape=data_shape)

df = pandas.read_csv(data_dir + "ua.test", sep="\t", header=None)
values = df.values
values[:, 0:2] -= 1
X_test = scipy.sparse.csr_matrix((values[:, 2], (values[:, 0], values[:, 1])), dtype=numpy.float64, shape=data_shape)
# Compute means of nonzero elements
X_row_mean = numpy.zeros(data_shape[0])
X_row_sum = numpy.zeros(data_shape[0])

train_rows, train_cols = X_train.nonzero()

# Iterate through nonzero elements to compute sums and counts of row elements
for i in range(train_rows.shape[0]):
    X_row_mean[train_rows[i]] += X_train[train_rows[i], train_cols[i]]
    X_row_sum[train_rows[i]] += 1

# Note that (X_row_sum == 0) is required to prevent divide by zero
X_row_mean /= X_row_sum + (X_row_sum == 0)

# Subtract mean rating for each user
for i in range(train_rows.shape[0]):
    X_train[train_rows[i], train_cols[i]] -= X_row_mean[train_rows[i]]

test_rows, test_cols = X_test.nonzero()
for i in range(test_rows.shape[0]):
    X_test[test_rows[i], test_cols[i]] -= X_row_mean[test_rows[i]]

X_train = X_train.toarray()
X_test = X_test.toarray()

ks = numpy.arange(2, 50)
train_mae = numpy.zeros(ks.shape[0])
test_mae = numpy.zeros(ks.shape[0])

train_scores = X_train[(train_rows, train_cols)]
test_scores = X_test[(test_rows, test_cols)]
# Now take SVD of X_train
U, s, Vt = numpy.linalg.svd(X_train, full_matrices=False)

for j, k in enumerate(ks):
    X_pred = U[:, 0:k].dot(numpy.diag(s[0:k])).dot(Vt[0:k, :])
    pred_train_scores = X_pred[(train_rows, train_cols)]
    pred_test_scores = X_pred[(test_rows, test_cols)]
    train_mae[j] = mean_absolute_error(train_scores, pred_train_scores)
    test_mae[j] = mean_absolute_error(test_scores, pred_test_scores)
    print(k, train_mae[j], test_mae[j])

plt.plot(ks, train_mae, 'k', label="Train")
plt.plot(ks, test_mae, 'r', label="Test")
plt.xlabel("k")
plt.ylabel("MAE")
plt.legend()
plt.show()
@charanpald The learning curve I see is flat for the test set; see your live notebook here: https://github.com/DSE-capstone-sharknado/main/blob/master/SVD%20RecSys.ipynb
The SVD routine is simply giving you U, s, and V, which are the components of R, the original sparse matrix with missing values. All you are doing is reconstructing the original R as a progressively better approximation as k increases. The empty values in R will still be empty in the reconstruction. You are not learning anything.
In summary:
The r^k_ij element of the matrix R_k is an approximation of the r_ij element of the original matrix R. In particular, if you were to use the U, s, and V matrices of the expansion as they are (without the dimension reduction), you would have r^k_ij = r_ij. As a corollary, it's impossible to use R_k as a prediction matrix, because you already have this information in matrix R. For example, if the p'th user did not rate the q'th product, r_pq will be 0. At the same time r^k_pq will be 0 too. Not a very insightful prediction, is it?
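As a concrete illustration of the "without the dimension reduction" case, here is a small self-contained check (the toy matrix is made up purely for illustration): when all singular values are kept, the reconstruction reproduces R exactly, zeros included.

import numpy

# Toy ratings matrix; zeros stand in for missing ratings (illustrative only).
R = numpy.array([[5.0, 3.0, 0.0, 1.0],
                 [4.0, 0.0, 0.0, 1.0],
                 [1.0, 1.0, 0.0, 5.0],
                 [0.0, 1.0, 5.0, 4.0]])

U, s, Vt = numpy.linalg.svd(R, full_matrices=False)

# No truncation: the reconstruction equals R to machine precision,
# i.e. r^k_ij = r_ij, and the zero (unrated) entries stay zero.
print(numpy.allclose(U.dot(numpy.diag(s)).dot(Vt), R))  # True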
Your statement "The empty values in R will still be empty in the reconstruction" is unfortunately wrong. Are you able to prove it for k < rank(R)?
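One way to check that empirically with the arrays already built in the gist above (names as in that code; k=10 is an arbitrary illustrative choice) is to look at the rank-k reconstruction at the positions that are zero in the mean-centred X_train:

# Assumes U, s, Vt and the dense X_train from the gist above are in scope.
k = 10  # illustrative; any k < rank(X_train)
X_k = U[:, 0:k].dot(numpy.diag(s[0:k])).dot(Vt[0:k, :])

# Positions that are zero in X_train (mostly the unrated cells after mean-centring).
empty_rows, empty_cols = numpy.where(X_train == 0)
vals = X_k[empty_rows, empty_cols]

# If the empty cells really stayed empty, every value here would be numerically zero.
print(numpy.abs(vals).max(), numpy.mean(numpy.abs(vals) > 1e-6))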
NumPy's SVD is quite different from the "Funk SVD" that you need in recommender systems.
Why? See https://github.com/aaw/IncrementalSVD.jl#great-but-julia-already-has-an-svd-function-ill-just-use-that (admirably clear).
cheers
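For contrast, here is a minimal sketch of the Funk-style approach (factor count, learning rate, regularisation, and epoch count are illustrative, not tuned): the factors are fitted by SGD on the observed ratings only, so the missing cells never enter the loss.

import numpy

def funk_svd(rows, cols, ratings, shape, k=20, lr=0.005, reg=0.02, n_epochs=20):
    # P holds user factors, Q holds item factors, both initialised small and random.
    rng = numpy.random.RandomState(0)
    P = rng.normal(scale=0.1, size=(shape[0], k))
    Q = rng.normal(scale=0.1, size=(shape[1], k))
    for _ in range(n_epochs):
        for u, i, r in zip(rows, cols, ratings):
            err = r - P[u].dot(Q[i])  # error on an observed rating only
            pu = P[u].copy()
            P[u] += lr * (err * Q[i] - reg * pu)
            Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q

# Possible usage with the (mean-centred) training triples from the gist above:
# P, Q = funk_svd(train_rows, train_cols, train_scores, data_shape)
# pred_test_scores = numpy.sum(P[test_rows] * Q[test_cols], axis=1)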
Thanks for your comments. Agreed that the model is overfitting. I wouldn't say the test curve is flat, though: it dips rapidly until k=7, then much more slowly until k=17, and rises again afterwards.
I'm not sure I understand your point that the model "isn't filling in the blanks in R or learning anything". The rank-k SVD is different from R, unless you are assuming R is low rank, and clearly that's not the case here. Also, I'm not sure how that implies I could have made the same predictions using the original R.
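For what it's worth, the low-rank question can be checked directly on the mean-centred training matrix built above (assuming the dense X_train from the gist is in scope; matrix_rank computes a full SVD under the hood, so it is not instant on a 943 x 1682 matrix):

# Numerical rank of the mean-centred training matrix versus the largest possible rank.
print(numpy.linalg.matrix_rank(X_train), min(X_train.shape))
# If the two numbers agree, X_train is full rank, so every k used above is a genuine truncation.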