@ianchen06
Last active April 13, 2017 11:38
Notes on setting up a Spark 1.6.1 cluster with Vagrant
{
  "display_name": "pySpark (Spark 1.6.1)",
  "language": "python",
  "argv": [
    "python",
    "-m",
    "ipykernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "/Users/ian/Code/spark-1.6.1/",
    "PYTHONPATH": "/Users/ian/Code/spark-1.6.1/python/:/Users/ian/Code/spark-1.6.1/python/lib/py4j-0.9-src.zip",
    "PYTHONSTARTUP": "/Users/ian/Code/spark-1.6.1/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master spark://192.168.34.10:7077 pyspark-shell"
  }
}
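  1. Make sure each VM has enough RAM: at least 2 GB, otherwise the Java runtime throws an out-of-memory (OOM) error.
  2. Configure conf/slaves and conf/spark-env.sh.
  3. Run sbin/start-all.sh on the master to start the cluster. The script connects over SSH to every host listed in conf/slaves, so the master needs passwordless SSH access to the workers.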
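
Step 2 can be sketched as a small Python helper that renders both files. This is a minimal sketch, not the author's actual configuration: the worker list assumes all three Vagrant VMs run a worker, and the `1g` worker memory is a placeholder. `SPARK_MASTER_IP` and `SPARK_WORKER_MEMORY` are standard Spark 1.x standalone settings; the function names are hypothetical.

```python
# Sketch: render the contents of conf/slaves and conf/spark-env.sh.
# IPs mirror the Vagrantfile in this gist; everything else is an assumption.

MASTER_IP = "192.168.34.10"  # same host as in --master spark://192.168.34.10:7077
# Assumption: every VM, including the master, also runs a worker.
WORKER_IPS = ["192.168.34.10", "192.168.34.11", "192.168.34.12"]


def render_slaves(worker_ips):
    """Return conf/slaves content: one worker host per line."""
    return "\n".join(worker_ips) + "\n"


def render_spark_env(master_ip, worker_memory="1g"):
    """Return conf/spark-env.sh content as plain shell exports."""
    lines = [
        "#!/usr/bin/env bash",
        "export SPARK_MASTER_IP={}".format(master_ip),
        # Placeholder value; keep it below the 2 GB given to each VM.
        "export SPARK_WORKER_MEMORY={}".format(worker_memory),
    ]
    return "\n".join(lines) + "\n"


print(render_slaves(WORKER_IPS))
print(render_spark_env(MASTER_IP))
```

The printed text would be saved under the Spark directory as conf/slaves and conf/spark-env.sh on each node before running sbin/start-all.sh.
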
  • sqlContext.read.text() needs a path that every node can read, i.e. shared storage such as HDFS or NFS.
  • Place kernel.json in /Users/ian/.ipython/kernels/pyspark/kernel.json.
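
Placing the kernel spec can also be scripted. The following is a minimal sketch, assuming the same layout as the kernel.json in this gist; `build_kernel_spec` and `install_kernel_spec` are hypothetical helper names, and the py4j archive name is taken from the Spark 1.6.1 paths shown in that file.

```python
import json
import os


def build_kernel_spec(spark_home, master_url, py4j_zip="py4j-0.9-src.zip"):
    """Return a kernel spec dict shaped like the kernel.json in this gist."""
    python_dir = os.path.join(spark_home, "python")
    return {
        "display_name": "pySpark (Spark 1.6.1)",
        "language": "python",
        "argv": ["python", "-m", "ipykernel", "-f", "{connection_file}"],
        "env": {
            "SPARK_HOME": spark_home,
            # Spark's Python sources plus the bundled py4j archive.
            "PYTHONPATH": os.pathsep.join(
                [python_dir, os.path.join(python_dir, "lib", py4j_zip)]
            ),
            # shell.py creates sc and sqlContext when the kernel starts.
            "PYTHONSTARTUP": os.path.join(python_dir, "pyspark", "shell.py"),
            "PYSPARK_SUBMIT_ARGS": "--master {} pyspark-shell".format(master_url),
        },
    }


def install_kernel_spec(spec, kernels_dir, name="pyspark"):
    """Write spec to <kernels_dir>/<name>/kernel.json and return that path."""
    target_dir = os.path.join(kernels_dir, name)
    if not os.path.isdir(target_dir):
        os.makedirs(target_dir)
    path = os.path.join(target_dir, "kernel.json")
    with open(path, "w") as fh:
        json.dump(spec, fh, indent=2)
    return path
```

Calling `install_kernel_spec(build_kernel_spec("/Users/ian/Code/spark-1.6.1", "spark://192.168.34.10:7077"), os.path.expanduser("~/.ipython/kernels"))` would write the file to the location named in the note above.
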

# -*- mode: ruby -*-
# vi: set ft=ruby :
Vagrant.configure(2) do |config|
  config.vm.box = "ubuntu/trusty64"
  config.vm.define "worker1" do |worker|
    worker.vm.hostname = "worker1"
    worker.vm.network "private_network", ip: "192.168.34.10"
    worker.vm.provider "virtualbox" do |vb|
      vb.memory = "2048"
    end
    worker.vm.network "forwarded_port", guest: 8080, host: 8080
  end
  config.vm.define "worker2" do |worker|
    worker.vm.hostname = "worker2"
    worker.vm.network "private_network", ip: "192.168.34.11"
    worker.vm.provider "virtualbox" do |vb|
      vb.memory = "2048"
    end
  end
  config.vm.define "worker3" do |worker|
    worker.vm.hostname = "worker3"
    worker.vm.network "private_network", ip: "192.168.34.12"
    worker.vm.provider "virtualbox" do |vb|
      vb.memory = "2048"
    end
  end
  config.vm.provision "shell", inline: <<-SHELL
    sudo apt-get update
    # -y answers the confirmation prompt so provisioning does not wait for input
    sudo add-apt-repository -y ppa:webupd8team/java
    sudo apt-get update
    echo debconf shared/accepted-oracle-license-v1-1 select true | sudo debconf-set-selections
    echo debconf shared/accepted-oracle-license-v1-1 seen true | sudo debconf-set-selections
    sudo apt-get install -y make build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev
    sudo apt-get install -y oracle-java8-installer git
    curl -L https://raw.githubusercontent.com/yyuu/pyenv-installer/master/bin/pyenv-installer | bash
    # The installer does not change PATH, so expose pyenv before calling it
    export PATH="$HOME/.pyenv/bin:$PATH"
    pyenv install 2.7.11
  SHELL

  # Disable automatic box update checking. If you disable this, then
  # boxes will only be checked for updates when the user runs
  # `vagrant box outdated`. This is not recommended.
  # config.vm.box_check_update = false

  # Create a forwarded port mapping which allows access to a specific port
  # within the machine from a port on the host machine. In the example below,
  # accessing "localhost:8080" will access port 80 on the guest machine.
  # config.vm.network "forwarded_port", guest: 80, host: 8080

  # Create a private network, which allows host-only access to the machine
  # using a specific IP.
  # config.vm.network "private_network", ip: "192.168.33.10"

  # Create a public network, which generally matched to bridged network.
  # Bridged networks make the machine appear as another physical device on
  # your network.
  # config.vm.network "public_network"

  # Share an additional folder to the guest VM. The first argument is
  # the path on the host to the actual folder. The second argument is
  # the path on the guest to mount the folder. And the optional third
  # argument is a set of non-required options.
  # config.vm.synced_folder "../data", "/vagrant_data"

  # Provider-specific configuration so you can fine-tune various
  # backing providers for Vagrant. These expose provider-specific options.
  # Example for VirtualBox:
  #
  # config.vm.provider "virtualbox" do |vb|
  #   # Display the VirtualBox GUI when booting the machine
  #   vb.gui = true
  #
  #   # Customize the amount of memory on the VM:
  #   vb.memory = "1024"
  # end
  #
  # View the documentation for the provider you are using for more
  # information on available options.

  # Define a Vagrant Push strategy for pushing to Atlas. Other push strategies
  # such as FTP and Heroku are also available. See the documentation at
  # https://docs.vagrantup.com/v2/push/atlas.html for more information.
  # config.push.define "atlas" do |push|
  #   push.app = "YOUR_ATLAS_USERNAME/YOUR_APPLICATION_NAME"
  # end

  # Enable provisioning with a shell script. Additional provisioners such as
  # Puppet, Chef, Ansible, Salt, and Docker are also available. Please see the
  # documentation for more information about their specific syntax and use.
end