Deploy a minimal Spark cluster

###Why a minimal cluster

  1. Testing: verify code and configuration on a real cluster before committing to a full deployment

  2. Prototyping: run ad-hoc, exploratory data analysis that does not justify a permanent cluster

###Requirements

I need a cluster that lives for a short time and handles ad-hoc data analysis requests, or more specifically, runs Spark. I want it to be created quickly so data can be loaded into memory, and I don't want to keep the cluster running perpetually. Therefore, a public cloud may be the best fit for my needs.

  1. Intranet speed

    The cluster should copy data from one server to another easily. Hadoop always shuffles large chunks of data within HDFS, so the hard disks should be SSDs.

  2. Elasticity and scalability

    Before scaling the cluster out to more machines, the cloud should offer some elasticity to resize a node up or down.

  3. Locality of Hadoop

    Most importantly, the Hadoop cluster and the Spark cluster should have a one-to-one mapping between their components, as the table below shows.

| Hadoop    | Cluster Manager | Spark    | MapReduce    |
|-----------|-----------------|----------|--------------|
| Name Node | Master          | Driver   | Job Tracker  |
| Data Node | Slave           | Executor | Task Tracker |

###Choice of public cloud

I simply compare two cloud service providers: AWS and DigitalOcean. Both have Python-based management tools (Boto for AWS and python-digitalocean for DigitalOcean).

  1. From storage to computation

  2. DevOps tools:

    • AWS: spark-ec2.py

      • With the default settings, after running it you will get
        • 2 HDFSs: one persistent and one ephemeral
        • Spark 1.3 or any earlier version
        • Spark's stand-alone cluster manager
      • A minimal cluster with 1 master and 3 slaves consists of 4 m1.xlarge instances by default
        • Pros: large memory (15 GB per node)
        • Cons: no SSDs; expensive ($0.35 × 4 = $1.40 per hour)
    • DigitalOcean: https://digitalocean.mesosphere.com/

      • With the default settings, after running it you will get
        • HDFS
        • no Spark
        • Mesos
        • VPN: OpenVPN plays a significant role in securing the cluster
        • Pros: cheap ($0.12 per hour)
        • Cons: small memory (2 GB per node)
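
For completeness, here is a minimal sketch of what the Boto side looks like, since Boto is the Python tool mentioned above for AWS. It assumes boto 2 with credentials already configured; the region, AMI id, and key pair name are placeholders rather than values from the original setup. spark-ec2.py essentially wraps this kind of call and then installs HDFS and Spark on the instances it launches.

```python
# Sketch only: launch the 4 m1.xlarge nodes programmatically with boto 2.
# 'ami-xxxxxxxx' and 'my-key' are placeholders.
import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')   # credentials from ~/.boto or env vars

reservation = conn.run_instances(
    'ami-xxxxxxxx',                # placeholder AMI id
    min_count=4, max_count=4,      # 1 master + 3 slaves
    key_name='my-key',             # placeholder key pair
    instance_type='m1.xlarge')

for instance in reservation.instances:
    print instance.id, instance.state

# Rough on-demand cost of the minimal cluster
print 'hourly cost: $%.2f' % (4 * 0.35)
```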

###Add Spark to DigitalOcean cluster

# Inspect the Mesosphere droplets with python-digitalocean (Python 2)
import digitalocean

TOKEN = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

# Example of creating a single droplet, kept here for reference:
# droplet = digitalocean.Droplet(token=TOKEN,
#                                name='Example',
#                                region='nyc3',            # New York 3
#                                image='ubuntu-14-04-x64', # Ubuntu 14.04 x64
#                                size_slug='512mb',        # 512 MB
#                                backups=True)
# droplet.create()

manager = digitalocean.Manager(token=TOKEN)

# List the droplets created by the Mesosphere installer and grab the master's IP
my_droplets = manager.get_all_droplets()
for x in my_droplets:
    print x
current = my_droplets[0]
print current.ip_address

# List the SSH keys registered with the account
ssh = manager.get_all_sshkeys()
for x in ssh:
    print x
# Fabric deployment script (Fabric 1.x, Python 2): install Spark on the
# Mesosphere cluster and point it at the Mesos master.
from __future__ import with_statement
from fabric.api import cd, env, sudo, run, settings, local, abort
from fabric.contrib.files import exists

# The master's IP comes from the host list given to fab
IP = 'unknown'
if env.hosts:
    IP = env.hosts[0]
env.user = 'root'

CURRENT = 'spark-1.3.0-bin-hadoop2.4'
CURRENT_URL = 'http://d3kbcqa49mib13.cloudfront.net/spark-1.3.0-bin-hadoop2.4.tgz'
SPARK_MASTER = 'mesos://zk://{}:2181/mesos'.format(IP)
SPARK_EXECUTOR_URI = 'hdfs://{0}/spark/{1}.tgz'.format(IP, CURRENT)

# Lines appended to conf/spark-env.sh on the master
SPARK_ENV_PARAMETER = \
"""
export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
export SPARK_EXECUTOR_URI={0}
export MASTER={1}
export SPARK_LOCAL_IP={2}
export SPARK_PUBLIC_DNS={3}
""".format(SPARK_EXECUTOR_URI, SPARK_MASTER, IP, IP)

# Lines appended to /root/.profile on the master
MASTER_ENV_PARAMETER = \
"""
export PATH={0}/bin:$PATH
export SPARK_MASTER={1}
""".format(CURRENT, SPARK_MASTER)


def set_openvpn(ovpn_file):
    """Install OpenVPN locally if needed and connect to the cluster's VPN."""
    with settings(warn_only=True):
        reply = local('locate openvpn')
    if reply.failed:
        local('sudo apt-get install openvpn bridge-utils')
    local('sudo openvpn --config {}'.format(ovpn_file))


def dismiss_openvpn():
    """Disconnect from the VPN."""
    local('sudo killall openvpn')


def install_spark():
    """Download the Spark binaries on the master and upload them to HDFS."""
    if IP == 'unknown':
        abort('Master IP has to be specified')
    current_file = CURRENT + '.tgz'
    if not exists('/root/{}'.format(CURRENT)):
        run('wget {}'.format(CURRENT_URL))
        run('tar xf {}'.format(current_file))
    # Put the tarball into HDFS so the Mesos executors can fetch it
    with settings(warn_only=True):
        response = run('hadoop fs -ls /spark')
    if response.failed:
        run('hadoop fs -mkdir /spark')
        if exists('/root/{}'.format(CURRENT)):
            run('hadoop fs -put {} /spark'.format(current_file))


def configure_spark():
    """Point Spark at the Mesos master and the executor URI in HDFS."""
    with cd('/root/{}/conf'.format(CURRENT)):
        sudo('cp spark-env.sh.template spark-env.sh')
        sudo("cat >> spark-env.sh <<'EOF'{}EOF".format(SPARK_ENV_PARAMETER))
        sudo('cp spark-defaults.conf.template spark-defaults.conf')
        sudo('echo spark.executor.uri {} >> spark-defaults.conf'
             .format(SPARK_EXECUTOR_URI))
        # Less chatty console output for the interactive shell
        run('sed "s/log4j.rootCategory=INFO/log4j.rootCategory=WARN/" '
            '< log4j.properties.template > log4j.properties')


def install_ipython():
    """Install IPython on the master and expose a spark-ipython alias."""
    sudo('apt-get install -y vim')
    sudo('pip install ipython')
    with cd('/root'):
        run("cat >> .profile <<'EOF'{}EOF".format(MASTER_ENV_PARAMETER))
        run("echo 'alias spark-ipython=\"IPYTHON=1 ~/{}/bin/pyspark\"' >> .bashrc"
            .format(CURRENT))


def deploy_spark():
    """Top-level task: install Spark, configure it for Mesos, and add IPython."""
    install_spark()
    configure_spark()
    install_ipython()
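
Once deploy_spark has finished (for example by running the fabfile with something like fab -H <master-ip> deploy_spark after connecting to the VPN via set_openvpn), a quick way to confirm that Spark is really talking to Mesos is a tiny job in the spark-ipython shell set up above. This is only a sanity-check sketch, not part of the original fabfile:

```python
# Run inside pyspark / spark-ipython on the master; `sc` is the SparkContext
# the shell creates, configured by spark-env.sh to use the Mesos master and
# the executor tarball in HDFS.
data = sc.parallelize(xrange(1000000), 8)    # 8 partitions spread over the slaves
print data.map(lambda x: x * x).take(5)      # [0, 1, 4, 9, 16]
print data.count()                           # 1000000
```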