This is a mostly self-noted guide to setting up PredictionIO on a cluster of 3 servers.
You can also set up more servers and distribute the services mentioned here differently, but that is out of scope for this guide; the references given here should still help you do it.
- Requirements: --
Note: In this guide, all servers share all services, except PredictionIO, which runs only on the master server.
If you want to distribute PIO itself, you need to set up a load balancer in front of the EventServers.
- Hadoop 2.6.2 (Fully distributed mode)
- Spark 1.5.2 (Fully distributed mode)
- Elasticsearch 1.7.4 (Multi master cluster)
- HBase 1.1.2 (Multi master cluster)
- PredictionIO 0.9.6
- Universal Recommender Template Engine (Provided by ActionML)
- Setup User: --
1.1 Create a pio user for PredictionIO on each server
adduser pio # Give it some password
1.2 Give the pio user sudoers permissions
usermod -a -G sudo pio
1.3 Setup passwordless ssh between all servers of the cluster (i.e. add each server's public key to the others' authorized_keys); a sketch follows.
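A minimal sketch of one way to do this (an assumption: OpenSSH is installed and you are logged in as pio on each server; the hostnames are the ones defined in step 1.4):
ssh-keygen -t rsa        # accept the defaults, empty passphrase
ssh-copy-id pio@master
ssh-copy-id pio@slave-1
ssh-copy-id pio@slave-2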
1.4 Modify the /etc/hosts file and name each server
- Note: Avoid using "localhost" or "127.0.0.1".
# Change the IPs to match your servers
10.0.0.1 master
10.0.0.2 slave-1
10.0.0.3 slave-2
- Download services in all servers: --
Note: Download everything to a temp folder like /tmp/downloads; we will later move everything to its final destination. (A wget sketch for the tarballs follows the list below.)
2.1 Download Hadoop 2.6.2 (http://www.eu.apache.org/dist/hadoop/common/hadoop-2.6.2/hadoop-2.6.2.tar.gz)
2.2 Download Spark 1.5.2 (http://www.us.apache.org/dist/spark/spark-1.5.2/spark-1.5.2-bin-hadoop2.6.tgz)
2.3 Download Elasticsearch 1.7.4 (https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.7.4.tar.gz)
- DON'T USE 2.0 UNTIL PIO WORKS WITH IT. (Pat said there were some issues).
2.4 Download HBase 1.1.2 (https://www.apache.org/dist/hbase/1.1.2/hbase-1.1.2-bin.tar.gz - the binary release, not the -src tarball, since we won't build from source)
2.5 Clone PIO Enterprise
git clone https://github.com/actionml/PredictionIO-Enterprise.git predictionio
2.6 Clone Universal Recommender Template
# cd into /tmp/downloads/predictionio/templates (create it if it doesn't exist)
mkdir -p /tmp/downloads/predictionio/templates && cd $_
git clone https://github.com/pferrel/template-scala-parallel-universal-recommendation universal
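For the tarballs in 2.1-2.4, a sketch using wget (same URLs as listed above):
cd /tmp/downloads
wget http://www.eu.apache.org/dist/hadoop/common/hadoop-2.6.2/hadoop-2.6.2.tar.gz
wget http://www.us.apache.org/dist/spark/spark-1.5.2/spark-1.5.2-bin-hadoop2.6.tgz
wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.7.4.tar.gz
wget https://www.apache.org/dist/hbase/1.1.2/hbase-1.1.2-bin.tar.gz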
- Setup Java 1.7 or 1.8 (OpenJDK): -- (Using Debian-based distros)
3.1 Install Java.
sudo apt-get install openjdk-7-jdk
3.2 Check which versions of Java are installed and pick one (Ideally OpenJDK, PIO has issues with Oracle Java.)
sudo update-alternatives --config java
3.3 Set the JAVA_HOME env var.
- Note: Don't include the /bin folder in the path.
vim /etc/environment
export JAVA_HOME=/path/to/open/jdk/jre
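A quick sanity check after logging back in:
java -version        # should report the OpenJDK 1.7 / 1.8 you picked
echo $JAVA_HOME      # should print the JRE path, without /bin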
- Create Folders for services: --
4.1 Create folders in /opt
(You might as well place these services wherever your distro recommends, it's up to you to stick to this convention or not)
mkdir /opt/hadoop
mkdir /opt/spark
mkdir /opt/elasticsearch
mkdir /opt/hbase
mkdir /opt/pio
chown pio:pio /opt/hadoop
chown pio:pio /opt/spark
chown pio:pio /opt/elasticsearch
chown pio:pio /opt/hbase
chown pio:pio /opt/pio
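Equivalently, if you prefer a loop (same folders and ownership as above; shown with sudo, drop it if you are already root):
for d in hadoop spark elasticsearch hbase pio; do
  sudo mkdir -p /opt/$d
  sudo chown pio:pio /opt/$d
done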
- Extract Services: --
5.1 Inside the /tmp/downloads folder, extract all downloaded services.
5.2 Move extracted services to their folders
sudo mv /tmp/downloads/hadoop-2.6.2 /opt/hadoop/
sudo mv /tmp/downloads/spark-1.5.2-bin-hadoop2.6 /opt/spark/
sudo mv /tmp/downloads/elasticsearch-1.7.4 /opt/elasticsearch/
sudo mv /tmp/downloads/hbase-1.1.2 /opt/hbase/
sudo mv /tmp/downloads/predictionio /opt/pio/
5.3 NOTE: Keep the version numbers in the folder names; if we want to upgrade in the future without losing previous versions, we just need to re-symlink.
5.4 Symlink Folders
sudo ln -s /opt/hadoop/hadoop-2.6.2 /usr/local/hadoop
sudo ln -s /opt/spark/spark-1.5.2-bin-hadoop2.6 /usr/local/spark
sudo ln -s /opt/elasticsearch/elasticsearch-1.7.4 /usr/local/elasticsearch
sudo ln -s /opt/hbase/hbase-1.1.2 /usr/local/hbase
sudo ln -s /opt/pio/predictionio /usr/local/pio
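A quick check that each symlink points where you expect:
ls -l /usr/local/hadoop /usr/local/spark /usr/local/elasticsearch /usr/local/hbase /usr/local/pio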
- Setup clusterized services: --
6.1 Hadoop
- Read: http://www.tutorialspoint.com/hadoop/hadoop_multi_node_cluster.htm
- Files config:
- etc/hadoop/core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
- etc/hadoop/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>file:///usr/local/hadoop/dfs/name/data</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>file:///usr/local/hadoop/dfs/name</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
- etc/hadoop/mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>
</configuration>
- etc/hadoop/masters
master
- etc/hadoop/slaves
slave-1
slave-2
- etc/hadoop/hadoop-env.sh
export JAVA_HOME=${JAVA_HOME}
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}
for f in $HADOOP_HOME/contrib/capacity-scheduler/*.jar; do
  if [ "$HADOOP_CLASSPATH" ]; then
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
  else
    export HADOOP_CLASSPATH=$f
  fi
done
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"
export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_SECONDARYNAMENODE_OPTS"
export HADOOP_NFS3_OPTS="$HADOOP_NFS3_OPTS"
export HADOOP_PORTMAP_OPTS="-Xmx512m $HADOOP_PORTMAP_OPTS"
export HADOOP_CLIENT_OPTS="-Xmx512m $HADOOP_CLIENT_OPTS"
export HADOOP_SECURE_DN_USER=${HADOOP_SECURE_DN_USER}
export HADOOP_SECURE_DN_LOG_DIR=${HADOOP_LOG_DIR}/${HADOOP_HDFS_USER}
export HADOOP_PID_DIR=${HADOOP_PID_DIR}
export HADOOP_SECURE_DN_PID_DIR=${HADOOP_PID_DIR}
export HADOOP_IDENT_STRING=$USER
- Format the Namenode
`bin/hadoop namenode -format`
- Start just the dfs servers (from master)
`sbin/start-dfs.sh`
- Create the /hbase and /zookeeper folders (from master)
`bin/hdfs dfs -mkdir /hbase /zookeeper`
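A quick sanity check that HDFS came up and the folders exist (run from /usr/local/hadoop on the master; the expected output described in the comments is an assumption about a healthy 3-node setup):
bin/hdfs dfsadmin -report    # should list the datanodes
bin/hdfs dfs -ls /           # should show /hbase and /zookeeper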
NOTES:
- Follow the instructions in the guide and just set up the master and slaves, then start only HDFS (the Hadoop Distributed File System); DO NOT run start-all.sh.
6.2 Spark
- Pretty straightforward: http://spark.apache.org/docs/latest/spark-standalone.html
- Start all services (from master)
sbin/start-all.sh
- Start just the master
sbin/start-master.sh
- Start just the slaves (run on each slave)
sbin/start-slave.sh <master-spark-URL>
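Note that for sbin/start-all.sh to launch the workers from the master, the standalone scripts read the worker hostnames from conf/slaves. A sketch, assuming the hostnames used in this guide (add master too if it should also run a worker):
# conf/slaves (on the Spark master)
slave-1
slave-2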
6.3 Elasticsearch
- Change the conf/elasticsearch.yml file to reflect this:
cluster.name: elasticsearch-pio-poc
node.name: "master" # Change to the name of the slave if the server is a slave.
node.master: true # SET TO TRUE ONLY IN THE MASTER, FALSE IN THE OTHERS
node.data: true
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["master", "slave-1", "slave-2"] # ADD ALL THE SERVERS
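Once Elasticsearch has been started on every node, a quick way to check that the cluster formed (assuming the default HTTP port 9200):
bin/elasticsearch -d                               # start as a daemon on each node
curl "http://master:9200/_cluster/health?pretty"   # expect "number_of_nodes" : 3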
6.4 HBase
- THIS IS THE BEST GUIDE FOR THIS VERSION, TRUST NO OTHER (it is actually the official guide): https://hbase.apache.org/book.html#quickstart_fully_distributed
- Files config:
- conf/hbase-site.xml
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://master:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>hdfs://master:9000/zookeeper</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>master,slave-1,slave-2</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
</configuration>
- conf/regionservers
master
slave-1
slave-2
- conf/backupmasters
slave-1
- conf/hbase-env.sh
export JAVA_HOME=${JAVA_HOME}
export HBASE_OPTS="-XX:+UseConcMarkSweepGC"
export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS -XX:PermSize=128m -XX:MaxPermSize=128m"
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:PermSize=128m -XX:MaxPermSize=128m"
export HBASE_PID_DIR=/var/hbase/pids
export HBASE_MANAGES_ZK=true
- Start HBase (from master)
bin/start-hbase.sh
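After start-hbase.sh, a quick way to see that the daemons came up on each server is jps; roughly what to expect, given the regionservers and backupmasters files above (an assumption, not authoritative output):
jps
# master:  HMaster, HRegionServer, HQuorumPeer
# slave-1: HRegionServer, HQuorumPeer, HMaster (backup)
# slave-2: HRegionServer, HQuorumPeer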
- NOTE: I strongly recommend editing all these files only in the master's conf folder and then copying the whole conf/* folder to the slaves, as sketched below.
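For example (a sketch, assuming the passwordless ssh from step 1.3 and the /usr/local/hbase symlink):
cd /usr/local/hbase
scp -r conf/* pio@slave-1:/usr/local/hbase/conf/
scp -r conf/* pio@slave-2:/usr/local/hbase/conf/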
- Setup PIO --
7.1 Add /usr/local/pio/bin to the PATH
echo "export PATH=$PATH:/usr/local/pio/bin" >> ~/.bashrc
7.2 Start everything and check it works
pio-start-all # Starts basic services.
pio status # Should return all statuses ok.
7.3 Build (run from the template folder, which after step 5 lives at /usr/local/pio/templates/universal)
pio build
7.4 Create App
pio app new handmade
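The command prints an access key for the new app; if you need it again later, you can list it with:
pio app list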
7.5 Install pip, following its installation guide
7.6 Get the Python SDK
sudo pip install predictionio
7.7 Test everything works. (Be sure to have all services running.)
$ /usr/local/pio/examples/integration-test
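As an extra manual smoke test, you can POST an event straight to the eventserver (a sketch; YOUR_ACCESS_KEY is the key printed by pio app new, the eventserver is assumed to be on its default port 7070, and the event fields are just hypothetical sample data):
curl -i -X POST "http://localhost:7070/events.json?accessKey=YOUR_ACCESS_KEY" \
  -H "Content-Type: application/json" \
  -d '{"event":"purchase","entityType":"user","entityId":"u1","targetEntityType":"item","targetEntityId":"i1"}'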