This article introduces how to find outliers using Local Outlier Factor (LOF) on Hivemall.
create database lof;
use lof;
create external table hundred_balls (
  rowid int,

create table similarities
as
WITH test_rnd as (
  select
    rand(31) as rnd,
    id,
    features
  from
    test_hivemall
),

create table similarities
as
SELECT
  each_top_k(
    10, t2.id, angular_similarity(t2.features, t1.features),
    t2.id,
    t1.id,
    t1.y
  ) as (rank, similarity, base_id, neighbor_id, y)
FROM

/*
 * Hivemall: Hive scalable Machine Learning Library
 *
 * Copyright (C) 2015 Makoto YUI
 * Copyright (C) 2013-2015 National Institute of Advanced Industrial Science and Technology (AIST)
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *

First of all, make sure that your Treasure Data cluster is HDP2, not CDH4. Matrix Factorization is supported only on the up-to-date HDP2 cluster. HDP2 is allocated to users who signed up for Treasure Data after February 2015; CDH4 is allocated to the others.
NOTE: If you get an error, please ask our customer support to switch you to HDP2.
Download ml-20m.zip and unzip it.
This describes the parameters for Matrix Factorization training in Hivemall.
http://qiita.com/myui/items/dccb4f58799f080e24ab#%E3%83%90%E3%82%A4%E3%82%A2%E3%82%B9%E3%82%92%E8%80%83%E6%85%AE%E3%81%97%E3%81%9F-matrix-factorization
Other than factor, mu, and iterations, the options usually do not need to be specified. The order in which they are given does not matter.
In some cases it is better to specify eta as well.
1) "-factor 10"
The number of latent factors [default: 10]

Hivemall provides a batch learning scheme that builds prediction models on Apache Hadoop. The learning process itself is a batch process; however, online/real-time prediction can be achieved by running the prediction on a transactional relational DBMS.
In this article, we explain how to run a real-time prediction using a relational DBMS. We assume that you have already run the a9a binary classification task.
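The idea of serving predictions from a relational DBMS can be sketched as follows. This is a minimal illustration, not the procedure from the article: it assumes the trained model has been exported as a hypothetical `model(feature, weight)` table, uses SQLite (via Python's standard `sqlite3` module) as a stand-in for the transactional DBMS, and uses made-up weights rather than real a9a results. The weighted sum is computed with a SQL join, and the sigmoid is applied on the client side.

```python
import math
import sqlite3

# Stand-in for the transactional DBMS (assumption: any RDBMS would do).
conn = sqlite3.connect(":memory:")

# Hypothetical model table; in practice its rows would be exported from Hadoop.
conn.execute("CREATE TABLE model (feature INTEGER PRIMARY KEY, weight REAL)")
conn.executemany(
    "INSERT INTO model VALUES (?, ?)",
    [(1, 0.5), (2, -1.0), (3, 2.0)],  # made-up weights for illustration only
)


def predict(conn, features):
    """Return sigmoid(sum(weight * value)) for one sparse feature vector.

    `features` is a list of (feature_id, value) pairs. Features that are
    missing from the model table contribute nothing to the sum.
    """
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS request (feature INTEGER, value REAL)"
    )
    conn.execute("DELETE FROM request")
    conn.executemany("INSERT INTO request VALUES (?, ?)", features)
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(m.weight * r.value), 0.0) "
        "FROM request r JOIN model m ON m.feature = r.feature"
    ).fetchone()
    return 1.0 / (1.0 + math.exp(-total))


# One incoming request with features 1 and 3 set to 1.0.
prob = predict(conn, [(1, 1.0), (3, 1.0)])
print(prob)
```

Because the model is an ordinary table, a single indexed join answers each request, which is why the lookup can run at transactional latency even though training was a batch job.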
The following table shows the type matrix of machine learning schemes and applications.
$ diff build.xml build.xml.orig
41,42c41,42
< <echo message="Use Hadoop 2.6.0 by default" />
< <property name="hadoopversion" value="260" />
---
> <echo message="Use Hadoop 2.x by default" />
> <property name="hadoopversion" value="200" />
188,201d187
<

#cloud-config
hostname: dcXX
fqdn: dcXX.ec2.internal
mounts:
 - [ xvdb, /mnt/disk1, "auto", "defaults,nobootwait,comment=cloudconfig", 0, 2]
 - [ xvdc, /mnt/disk2, "auto", "defaults,nobootwait,comment=cloudconfig", 0, 2]
runcmd:

#!/bin/bash
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0