First of all, make sure that your Treasure Data cluster is HDP2, not CDH4. Matrix Factorization is only supported in the up-to-date HDP2 cluster. HDP2 is allocated for users who signed Treasure Data after Feb 2015. CDH4 is allcoated for the others.

NOTE: please ask our customer support to use HDP2 if you get an error.

Data preparation

Download ml-20m.zip and unzip it.

Create database and importing raw rating data into TreasureData from CSV.

Note: --time-value is used to add a dummy time column, that is required in Treasure Data, for each row.

td db:create movielens20
td table:create movielens20m ratings

td import:auto --format csv --column-header --time-value `date +%s` --auto-create movielens20m.ratings ./ratings.csv

Preparation for making training/testing data from the original table.

td table:create movielens20m ratings2

td query -w -x \
-d movielens20m \
"INSERT OVERWRITE TABLE ratings2 \
select \
  rand(31) as rnd, \
  userid, \
  movieid, \
  rating \
from \
  ratings"

Made 80% for training and 20% for testing.

td table:create movielens20m training

td query -x \
--type hive \
-d movielens20m \
"INSERT OVERWRITE TABLE training \
select * from ratings2 \
order by rnd DESC \
limit 16000000"

Note: Please set v columns off for ratings2 through Web console to use SELECT * in the INSERT INTO.

td table:create movielens20m testing

td query -x \
--type hive \
-d movielens20m \
"INSERT OVERWRITE TABLE testing \
select * from ratings2 \
order by rnd ASC \
limit 4000263"

Training

Calculate the mean rating in the training dataset.

td query -w \
--type presto \
-d movielens20m \
"select avg(rating) from training"

3.52560165625

The above average value used in the following queries.

Run training of Matrix Factorization.

Note: TD result output (--result) is not recommended for issuing the following query. Use INSERT INTO table statement instead.

td table:create movielens20m sgd_model_f20

td query -x \
--type hive \
-d movielens20m \
"WITH training_amplified as ( \
select \
   amplify(3, cast(userid as int), cast(movieid as int), rating) as (userid, movieid, rating) \
from \
   training \
) \
INSERT OVERWRITE TABLE sgd_model_f20 \
select \
  idx, \
  array_avg(u_rank) as Pu, \
  array_avg(m_rank) as Qi, \
  avg(u_bias) as Bu, \
  avg(m_bias) as Bi \
from ( \
  select train_mf_sgd(cast(userid as int), cast(movieid as int), rating, '-factor 20 -mu 3.52560165625 -iter 50') as (idx, u_rank, m_rank, u_bias, m_bias) \
  from \
  	training_amplified \
) t \
group by idx"

Note: Currently, the signature of train_mf_sgd is train_mf_sgd(int userid, int itemid, numeric rating [, string options]). We will accept any numeric types in the next version (v0.3.2) of Hivemall.

Note: Please avoid ERROR metastore.RetryingHMSHandler: MetaException(message:NoSuchObjectException(message:Function xxxxx.train_mf_sgd does not exist)) in logs. Its not a logging issue (Hive 0.13 bug) and not a problem.

Predict

td table:create movielens20m sgd_predict_f20

td query -x \
--type hive \
-d movielens20m \
"INSERT OVERWRITE TABLE sgd_predict_f20 \
select \
    t2.actual, \
    mf_predict(if(size(t2.Pu)=0,null,t2.Pu),if(size(p2.Qi)=0,null,Qi), t2.Bu, p2.Bi, 3.52560165625) as predicted \
from ( \
    select \
      t1.userid, \
      t1.movieid, \
      t1.rating as actual, \
      p1.Pu, \
      p1.Bu \
    from \
      testing t1 LEFT OUTER JOIN sgd_model_f20 p1 \
      ON (t1.userid = p1.idx) \
) t2 \
LEFT OUTER JOIN sgd_model_f20 p2 \
ON (t2.movieid = p2.idx)"

Note: if(size(xxx)=0,null,xxx) is just a workaround and will be fixed to avoid it.

Evaluate (computes MAE and RMSE)

td query -w \
--type hive \
-d movielens20m \
"select \
  mae(predicted, actual) as mae, \
  rmse(predicted, actual) as rmse \
from \
  sgd_predict_f20"

+--------------------+--------------------+       
| mae                | rmse               |
+--------------------+--------------------+
| 0.6123907679836259 | 0.8027164481776642 |
+--------------------+--------------------+

myui/movielens_20m.md

Data preparation

Training

Predict

Evaluate (computes MAE and RMSE)