@shivaram
Last active May 17, 2017 09:09
Installing SparkR + CDH5 on EC2

On master node

wget http://archive.cloudera.com/cdh5/one-click-install/redhat/6/x86_64/cloudera-cdh-5-0.x86_64.rpm
sudo yum --nogpgcheck localinstall cloudera-cdh-5-0.x86_64.rpm
sudo yum clean all
sudo yum install hadoop-hdfs-namenode
sudo yum install R git
sudo yum install spark-core spark-master spark-python

cd
wget http://cran.cnr.berkeley.edu/src/contrib/rJava_0.9-6.tar.gz
sudo R CMD INSTALL rJava_0.9-6.tar.gz
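# If the rJava install cannot find the JVM, reconfiguring R's Java settings usually helps (assumes a JDK is already installed on the instance):
sudo R CMD javareconf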
git clone https://github.com/amplab-extras/SparkR-pkg.git
cd SparkR-pkg
./install-dev.sh
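# Sanity check -- assuming install-dev.sh builds the package into ./lib, this should list the built SparkR R package:
ls lib/SparkR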

Copy SparkR from master to slave node

#Need to add ssh key in slave's authorized keys
ssh-keygen # This will generate keys
cat ~/.ssh/id_rsa.pub # Copy this to ~/.ssh/authorized_keys on the slave machine
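# Alternatively, append the key in one step (assumes you can already ssh to the slave as ec2-user):
cat ~/.ssh/id_rsa.pub | ssh ec2-user@<slave-hostname> 'cat >> ~/.ssh/authorized_keys'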
cd
rsync -az SparkR-pkg ec2-user@<slave-hostname>:~/

On slave node

# Make the home directory readable so the Spark worker (which runs as the spark user) can read SparkR from /home/ec2-user
chmod a+rx /home/ec2-user
wget http://archive.cloudera.com/cdh5/one-click-install/redhat/6/x86_64/cloudera-cdh-5-0.x86_64.rpm
sudo yum --nogpgcheck localinstall cloudera-cdh-5-0.x86_64.rpm
sudo yum -y clean all
sudo yum -y install hadoop-hdfs-datanode
sudo yum -y install R git
sudo yum -y install spark-core spark-worker spark-python

cd
wget http://cran.cnr.berkeley.edu/src/contrib/rJava_0.9-6.tar.gz
sudo R CMD INSTALL rJava_0.9-6.tar.gz

Start HDFS

On master

sudo cp -r /etc/hadoop/conf.empty /etc/hadoop/conf.my_cluster
# Copy core-site.xml and hdfs-site.xml from this gist (use the raw file URLs) into the new conf directory
wget -O core-site.xml <raw URL of core-site.xml from this gist>
sudo mv core-site.xml /etc/hadoop/conf.my_cluster/core-site.xml
wget -O hdfs-site.xml <raw URL of hdfs-site.xml from this gist>
sudo mv hdfs-site.xml /etc/hadoop/conf.my_cluster/hdfs-site.xml


sudo mkdir -p /mnt/ephemeral-hdfs/data
sudo chown -R hdfs:hdfs /mnt/ephemeral-hdfs
rsync -az /etc/hadoop/conf.my_cluster ec2-user@<slave-hostname>:~/

sudo alternatives --verbose --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50
sudo alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster
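# Verify the new configuration directory is the active alternative
sudo alternatives --display hadoop-conf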

sudo -s
sudo -u hdfs hadoop namenode -format
sudo service hadoop-hdfs-namenode start
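# Quick sanity check that the NameNode is serving requests (an empty listing is expected on a fresh cluster)
sudo -u hdfs hadoop fs -ls /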

On slave

sudo mkdir -p /mnt/ephemeral-hdfs/data
sudo chown -R hdfs:hdfs /mnt/ephemeral-hdfs

sudo mv conf.my_cluster /etc/hadoop/
sudo alternatives --verbose --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50 
sudo alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster
sudo service hadoop-hdfs-datanode start
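# Confirm the DataNode registered with the NameNode (run from the master; the report should show one live datanode)
sudo -u hdfs hdfs dfsadmin -report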

Configure Spark

On master node

Edit /etc/spark/conf/spark-env.sh -- fill in the master's EC2 hostname for STANDALONE_MASTER_HOST.
Also add a line: export SPARK_LOCAL_IP=`wget -q -O - http://169.254.169.254/latest/meta-data/public-hostname`
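For example, the added lines would look roughly like this (the master hostname below is the one used in core-site.xml; adjust for your cluster):

export STANDALONE_MASTER_HOST=ec2-54-197-12-119.compute-1.amazonaws.com
export SPARK_LOCAL_IP=`wget -q -O - http://169.254.169.254/latest/meta-data/public-hostname`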
rsync -az /etc/spark/conf/spark-env.sh ec2-user@<slave-hostname>:~/
sudo service spark-master start

On slave node

sudo mv spark-env.sh /etc/spark/conf/
sudo service spark-worker start
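# Optional check: the standalone master's web UI (port 8080 by default) should now list this worker
wget -q -O - http://<master-hostname>:8080 | grep -i worker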

Run Spark - check if it works

Launch spark-shell and run

val a = sc.parallelize(1 to 100, 2)
a.count

Check SparkR

On master run

cd ~/SparkR-pkg
source /etc/spark/conf/spark-env.sh
SPARK_HOME=/usr/lib/spark ./sparkR

Inside R console run

a <- parallelize(sc, 1:100, 2L)
count(a)
q()

Test pi.R

SPARK_HOME=/usr/lib/spark ./sparkR examples/pi.R <spark_master_url>
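For a standalone deployment the master URL has the form spark://<master-hostname>:7077 (7077 is Spark's default standalone port), so with the hostname from core-site.xml below the command would look like:

SPARK_HOME=/usr/lib/spark ./sparkR examples/pi.R spark://ec2-54-197-12-119.compute-1.amazonaws.com:7077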

core-site.xml

<?xml version="1.0"?>
<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/ephemeral-hdfs</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://ec2-54-197-12-119.compute-1.amazonaws.com:9000</value>
  </property>
  <property>
    <name>dfs.domain.socket.path</name>
    <value>/var/run/hadoop-hdfs/dn._PORT</value>
  </property>
</configuration>

hdfs-site.xml

<?xml version="1.0"?>
<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/mnt/ephemeral-hdfs</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/mnt/ephemeral-hdfs/data</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.domain.socket.path</name>
    <value>/var/run/hadoop-hdfs/dn._PORT</value>
  </property>
</configuration>