#Spark 0.9 on YARN (Hadoop 2.2)

##Using yarn as the resource manager, you can deploy a spark application in two modes:

  1. yarn-standalone mode, in which your driver program runs as a thread of the yarn application master, which itself runs on one of the node managers in the cluster. The yarn client just pulls status from the application master. This mode is the same as a mapreduce job, where the MR application master coordinates the containers to run the map/reduce tasks.

With this mode, your application actually runs on the remote machine where the Application Master runs. Thus applications that involve local interaction, e.g. spark-shell, will not work well.

  2. yarn-client mode, in which your driver program runs on the yarn client, i.e. the machine where you type the command to submit the spark application (which may not be a machine in the yarn cluster). In this mode, although the driver program runs on the client machine, the tasks are executed on the executors in the node managers of the YARN cluster.

Simply put:

With yarn-client mode, your driver program runs on your local machine. With yarn-standalone mode, your spark application is submitted to YARN's ResourceManager as a yarn ApplicationMaster, and it runs on the yarn node where that ApplicationMaster runs. In both cases, yarn serves as spark's cluster manager, and your application (SparkContext) sends tasks to yarn. The two launch styles are sketched below.

More info is available in the official Spark documentation on running on YARN.
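
As a quick side-by-side, the two launch styles look like this (a minimal sketch that reuses the Spark 0.9 commands detailed in the sections below; the paths in angle brackets are placeholders you must fill in):

```
# yarn-standalone: the driver runs inside the YARN ApplicationMaster on the cluster
SPARK_JAR=<path to spark assembly jar> \
HADOOP_CONF_DIR=/etc/hadoop/conf \
./bin/spark-class org.apache.spark.deploy.yarn.Client \
  --jar <path to your application jar> \
  --class <your main class> \
  --args yarn-standalone

# yarn-client: the driver stays in the local JVM (e.g. spark-shell), executors run on YARN
SPARK_YARN_MODE=true \
SPARK_JAR=<path to spark assembly jar> \
SPARK_YARN_APP_JAR=<path to your application jar> \
HADOOP_CONF_DIR=/etc/hadoop/conf \
MASTER=yarn-client ./bin/spark-shell
```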

##Download pre-built spark-0.9 for hadoop 2.2.0:

wget http://d3kbcqa49mib13.cloudfront.net/spark-0.9.1-bin-hadoop2.tgz
tar xzf spark-0.9.1-bin-hadoop2.tgz
ln -s spark-0.9.1-bin-hadoop2 spark

(or)

Manually build spark for a specific hadoop version, in this case 2.2.0:

wget http://d3kbcqa49mib13.cloudfront.net/spark-0.9.1.tgz
tar xzf spark-0.9.1.tgz
ln -s spark-0.9.1 spark
cd spark
SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt clean assembly
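
If the build succeeded, the YARN-enabled assembly should be sitting under assembly/target (the file name below matches the 0.9.1 / Hadoop 2.2.0 combination used in this guide; adjust it if you built different versions):

```
ls -lh assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar
```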

##Running an example spark job against YARN:

On a single worker:

SPARK_JAR=./assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar \
HADOOP_CONF_DIR=/etc/hadoop/conf \
./bin/spark-class org.apache.spark.deploy.yarn.Client \
--jar examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar \
--class org.apache.spark.examples.SparkPi \
--args yarn-standalone \
--num-workers 1 \
--master-memory 1g \
--worker-memory 2g \
--worker-cores 1

On multiple workers:

SPARK_JAR=./assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar \
HADOOP_CONF_DIR=/etc/hadoop/conf \
./bin/spark-class org.apache.spark.deploy.yarn.Client \
--jar examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar \
--class org.apache.spark.examples.SparkPi \
--args yarn-standalone \
--num-workers 3 \
--master-memory 1g \
--worker-memory 2g \
--worker-cores 1

To look at the output, replace APPLICATION_ID with the application id that was allotted to the launched spark application:

yarn logs -applicationId APPLICATION_ID
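
If you do not have the application id at hand, the standard YARN CLI can list it and report its status (a quick sketch; by default -list only shows applications that are still submitted/accepted/running, and output formatting varies between Hadoop releases):

```
# list applications currently known to the ResourceManager
yarn application -list

# check the state and final status of a specific application
yarn application -status APPLICATION_ID
```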

##Using yarn-client mode to start spark-shell

SPARK_YARN_MODE=true \
HADOOP_CONF_DIR=/etc/hadoop/conf \
SPARK_JAR=./assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar \
SPARK_YARN_APP_JAR=examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar \
MASTER=yarn-client ./bin/spark-shell

When running in yarn-client mode, it's important to specify local file URIs with file://. This is because in this mode, spark assumes by default that files are present in HDFS (under the /user/<username> directory).
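
For example, the same spark-shell invocation with the local jars spelled out as explicit file:// URIs (a sketch; /opt/spark is an assumed install location, substitute wherever you unpacked spark):

```
SPARK_YARN_MODE=true \
HADOOP_CONF_DIR=/etc/hadoop/conf \
SPARK_JAR=file:///opt/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar \
SPARK_YARN_APP_JAR=file:///opt/spark/examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar \
MASTER=yarn-client ./bin/spark-shell
```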

#Spark 1.0.0 on Hadoop 2.4.0

##Getting spark 1.0 and building it against hadoop version 2.4.0:

git clone https://github.com/apache/spark.git
cd spark
SPARK_HADOOP_VERSION=2.4.0 SPARK_YARN=true sbt/sbt clean assembly

##Running an example spark job against YARN:

SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.4.0.jar \
HADOOP_CONF_DIR=/etc/hadoop/conf \
./bin/spark-submit --master yarn \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
--num-executors 1 \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 1 \
examples/target/scala-2.10/spark-examples-1.0.0-SNAPSHOT-hadoop2.4.0.jar
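
To keep the driver on the submitting machine instead of running it inside the YARN ApplicationMaster, the same job can be launched in client deploy mode (a sketch with the same resource settings; only --deploy-mode changes):

```
SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.4.0.jar \
HADOOP_CONF_DIR=/etc/hadoop/conf \
./bin/spark-submit --master yarn \
--deploy-mode client \
--class org.apache.spark.examples.SparkPi \
--num-executors 1 \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 1 \
examples/target/scala-2.10/spark-examples-1.0.0-SNAPSHOT-hadoop2.4.0.jar
```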

##Running spark shell on YARN:

SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop2.4.0.jar \
HADOOP_CONF_DIR=/etc/hadoop/conf \
MASTER=yarn-client \
./bin/spark-shell