https://lipn.univ-paris13.fr/bigdata/index.php/How_to_use_Spark_on_Grid5000
https://github.com/mliroz/hadoop_g5k/wiki
https://github.com/mliroz/hadoop_g5k/wiki/spark_g5k
Prepare the needed files by downloading:
- Spark, e.g.: https://www.apache.org/dyn/closer.lua/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
- Compatible Hadoop, e.g.: https://archive.apache.org/dist/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
You will need them as archives, so do not extract them.
Install Execo Using Pip
(No proxy is needed, unlike what the old tutorial said; easy_install does not seem to be supported on g5k anymore.)
(frontend)$ python -m pip install --user execo
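You can quickly check the installation by importing the module; the command should exit silently if execo is available:
(frontend)$ python -c "import execo"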
Retrieve the hadoop_g5k sources from GitHub and unzip them.
(frontend)$ wget https://github.com/mliroz/hadoop_g5k/archive/master.zip
(frontend)$ unzip master.zip
Update util.py to avoid a Python error when checking the Java version (the format of the version string has changed since the package was released).
(frontend)$ nano hadoop_g5k-master/hadoop_g5k/util/util.py
Then make check_java_version return True, commenting out the rest of the function's body. Everything works with the OpenJDK version installed by default anyway, so the check is unnecessary.
... Edit util.py / check_java_version so that it always returns True (a sketch follows) ...
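A minimal sketch of the patched function, assuming you simply want to bypass the check (the real signature in util.py may differ; taking *args/**kwargs keeps any existing call sites working):

def check_java_version(*args, **kwargs):
    # The original body parsed the output of `java -version`, whose format
    # has changed in recent OpenJDK releases and breaks the parsing.
    # The default OpenJDK on Grid'5000 is recent enough, so skip the check.
    return True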
Inside the hadoop_g5k-master folder, run the Python setup command.
(frontend)$ python setup.py install --user
Depending on your Python configuration, the scripts may be installed in a different directory. You may add this directory to your PATH in order to call them from anywhere.
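If you are unsure which directory that is, Python can tell you the user base directory; with a --user install the scripts end up in its bin/ subfolder (typically /home/$USER/.local/bin on the frontends):
(frontend)$ python -m site --user-base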
To automatically add it to the PATH whenever connecting to g5k, add the following lines to your .bash_profile file.
PATH="/home/$USER/.local/bin:$PATH"
export PATH
From a frontend, reserve your nodes as usual. For example:
$ oarsub -I -t allow_classic_ssh -l nodes=4,walltime=2
Then, from inside your reservation, create and initialize the Hadoop cluster.
#--version 2 indicates that we are working with a Hadoop 2.x.y version
$ hg5k --create $OAR_NODEFILE --version 2
#Change the Hadoop archive path to yours
$ hg5k --bootstrap /home/$USER/hadoop-2.7.7.tar.gz
$ hg5k --initialize --start
Now create the Spark cluster in STANDALONE mode (hadoop_g5k no longer works well in YARN mode).
$ spark_g5k --create STANDALONE --hid 1
Then, install Spark (built against a Hadoop version compatible with the one installed in the previous steps) on every cluster node.
#Change the Spark archive path to yours, making sure the -hadoopX.Y suffix matches the Hadoop version deployed before
$ spark_g5k --bootstrap /home/$USER/spark-2.4.5-bin-hadoop2.7.tgz
Finally, initialize the Spark cluster and start it so that it is ready to process jobs.
$ spark_g5k --initialize --start
You are now ready to submit a job from its assembly jar. For example:
$ spark_g5k --scala_job /home/$USER/some-spark-assembly.jar --main_class Main
After all your jobs are done, you should clean up all the temporary files created during the previous phases.
$ spark_g5k --delete
$ hg5k --delete