Makoto YUI myui

This article introduces how to find outliers using Local Outlier Factor (LOF) on Hivemall.

Data Preparation

create database lof;
use lof;

create external table hundred_balls (
  rowid int,

First of all, make sure that your Treasure Data cluster is HDP2, not CDH4. Matrix Factorization is only supported on the up-to-date HDP2 cluster. HDP2 is allocated to users who signed up for Treasure Data after Feb 2015. CDH4 is allocated to the others.

NOTE: if you get an error, please ask our customer support to move you to HDP2.

Data preparation

Download ml-20m.zip and unzip it.

Hivemall provides a batch learning scheme that builds prediction models on Apache Hadoop. The learning process itself is a batch process; however, online/real-time prediction can be achieved by carrying out the prediction on a transactional relational DBMS.

In this article, we explain how to run a real-time prediction using a relational DBMS. We assume that you have already run the a9a binary classification task.
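The idea of serving predictions from a relational DBMS can be sketched as follows. This is a minimal illustration only, not the article's actual setup: it uses an in-memory SQLite database in place of a production DBMS, and the `model(feature, weight)` table layout, the `request` table, the `predict` helper, and the weight values are all assumptions made up for the example. It scores one request by joining its features against the stored weights and applying the logistic function.

```python
import math
import sqlite3


def sigmoid(x: float) -> float:
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def predict(conn: sqlite3.Connection, features: dict) -> float:
    """Score one request against the (hypothetical) model table.

    `features` maps a feature index to its value. The weighted sum is
    computed inside the DBMS via a join, then squashed with sigmoid.
    """
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS request "
        "(feature INTEGER PRIMARY KEY, value REAL)"
    )
    conn.execute("DELETE FROM request")
    conn.executemany("INSERT INTO request VALUES (?, ?)", features.items())
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(m.weight * r.value), 0.0) "
        "FROM request r JOIN model m ON m.feature = r.feature"
    ).fetchone()
    return sigmoid(total)


# Assumed setup: weights exported from the batch-trained model.
# The values below are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE model (feature INTEGER PRIMARY KEY, weight REAL)")
conn.executemany(
    "INSERT INTO model VALUES (?, ?)",
    [(1, 0.5), (2, -1.0), (3, 2.0)],
)

prob = predict(conn, {1: 1.0, 3: 1.0})
print(f"predicted probability: {prob:.4f}")
```

Because the model is just a small lookup table, each prediction is a single indexed join, which is the kind of low-latency query a transactional DBMS handles well.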

Online/Offline Matrix of Machine Learning

The following table shows the type matrix of machine learning schemes and applications.

$ diff build.xml build.xml.orig
41,42c41,42
<       <echo message="Use Hadoop 2.6.0 by default" />
<       <property name="hadoopversion" value="260" />
---
>       <echo message="Use Hadoop 2.x by default" />
>       <property name="hadoopversion" value="200" />
188,201d187
<

	create table similarities
	as
	WITH test_rnd as (
	select
	rand(31) as rnd,
	id,
	features
	from
	test_hivemall
	),

	create table similarities
	as
	SELECT
	each_top_k(
	10, t2.id, angular_similarity(t2.features, t1.features),
	t2.id,
	t1.id,
	t1.y
	) as (rank, similarity, base_id, neighbor_id, y)
	FROM

	/*
	* Hivemall: Hive scalable Machine Learning Library
	*
	* Copyright (C) 2015 Makoto YUI
	* Copyright (C) 2013-2015 National Institute of Advanced Industrial Science and Technology (AIST)
	*
	* Licensed under the Apache License, Version 2.0 (the "License");
	* you may not use this file except in compliance with the License.
	* You may obtain a copy of the License at
	*

	This note describes the training parameters of Matrix Factorization in Hivemall.
	http://qiita.com/myui/items/dccb4f58799f080e24ab#%E3%83%90%E3%82%A4%E3%82%A2%E3%82%B9%E3%82%92%E8%80%83%E6%85%AE%E3%81%97%E3%81%9F-matrix-factorization

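	Parameters other than factor, mu, and iterations normally do not need to be specified. The order in which they are given does not matter.
	In some cases, it is better to specify eta as well.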

	1) "-factor 10"
	The number of latent factors [default: 10]
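As a rough illustration of what the number of latent factors controls, the sketch below computes a rating prediction in the usual bias-aware matrix factorization form (global mean plus user bias plus item bias plus the dot product of two factor vectors). It is not Hivemall's implementation: the function name `predict_rating`, the assumption that mu denotes the global mean rating, and all numeric values are hypothetical. The factor count is simply the length of each user and item vector.

```python
from typing import Sequence


def predict_rating(
    mu: float,
    user_bias: float,
    item_bias: float,
    user_factors: Sequence[float],
    item_factors: Sequence[float],
) -> float:
    """Bias-aware MF prediction: mu + b_u + b_i + dot(p_u, q_i)."""
    if len(user_factors) != len(item_factors):
        raise ValueError("user and item vectors must have the same length")
    interaction = sum(p * q for p, q in zip(user_factors, item_factors))
    return mu + user_bias + item_bias + interaction


# K plays the role of "-factor 10": each vector has K entries.
K = 10
p_u = [0.1] * K  # hypothetical user vector
q_i = [0.2] * K  # hypothetical item vector

rating = predict_rating(mu=3.5, user_bias=0.2, item_bias=-0.1,
                        user_factors=p_u, item_factors=q_i)
print(f"predicted rating: {rating:.2f}")
```

A larger factor count gives the model more capacity per user and item, at the cost of more parameters to train and store.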

	#cloud-config

	hostname: dcXX
	fqdn: dcXX.ec2.internal

	mounts:
	- [ xvdb, /mnt/disk1, "auto", "defaults,nobootwait,comment=cloudconfig", 0, 2]
	- [ xvdc, /mnt/disk2, "auto", "defaults,nobootwait,comment=cloudconfig", 0, 2]

	runcmd:

	#!/bin/bash

	# Licensed to the Apache Software Foundation (ASF) under one or more
	# contributor license agreements. See the NOTICE file distributed with
	# this work for additional information regarding copyright ownership.
	# The ASF licenses this file to You under the Apache License, Version 2.0
	# (the "License"); you may not use this file except in compliance with
	# the License. You may obtain a copy of the License at
	#
	# http://www.apache.org/licenses/LICENSE-2.0