mkdir spark_install && cd spark_install
wget http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz
tar -xzvf spark-2.1.0-bin-hadoop2.7.tgz
cd spark-2.1.0-bin-hadoop2.7/
./bin/spark-shell
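As a quick sanity check that the download works, you can also run the bundled SparkPi example, which should print an approximation of Pi:
./bin/run-example SparkPi 10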
If you want to install Python under your home directory, get the tarball from here and run ./configure --prefix=any/dir/of/your/choice/where/you/have/write/access followed by make install, then add Python's bin directory to the $PATH environment variable.
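As a rough sketch of that from-source install (Python 2.7.13 and the $HOME/python27 prefix here are only illustrative choices; use the tarball version you actually downloaded and any directory you can write to):
wget https://www.python.org/ftp/python/2.7.13/Python-2.7.13.tgz
tar -xzvf Python-2.7.13.tgz
cd Python-2.7.13/
./configure --prefix=$HOME/python27
make
make install
export PATH=$HOME/python27/bin:$PATH   # add this line to ~/.bashrc to make it permanent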
To install virtualenv, create a virtual environment, and install the packages you need:
pip install virtualenv
cd ~
virtualenv jupyter_pyspark
source jupyter_pyspark/bin/activate
pip install numpy
pip install scipy
pip install scikit-learn
pip install pandas
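Jupyter itself also has to be installed inside this virtualenv, since the PYSPARK_DRIVER_PYTHON setting below points at the jupyter executable in it:
pip install jupyter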
Open ~/.bashrc for editing (nano ~/.bashrc) and paste the following there; the same exports can also go into spark-2.1.0-bin-hadoop2.7/conf/spark-env.sh (this file doesn't exist by default; create it yourself, for example by copying conf/spark-env.sh.template):
export SPARK_HOME=/path/to/spark-2.1.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
export PYSPARK_DRIVER_PYTHON=/path/to/virtualenv/jupyter_pyspark/bin/jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.open_browser=False --NotebookApp.ip='*' --NotebookApp.port=8880"
export PYSPARK_PYTHON=/path/to/virtualenv/jupyter_pyspark/bin/python
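If you added these lines to ~/.bashrc, reload it (or open a new shell) so they take effect before launching PySpark:
source ~/.bashrc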
cd spark_install/spark-2.1.0-bin-hadoop2.7
./bin/pyspark --master local[4]
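If you prefer not to touch spark-env.sh, the same effect can be had for a single run by setting the variables inline; a sketch assuming the virtualenv lives at $HOME/jupyter_pyspark:
PYSPARK_DRIVER_PYTHON=$HOME/jupyter_pyspark/bin/jupyter \
PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.open_browser=False --NotebookApp.ip='*' --NotebookApp.port=8880" \
./bin/pyspark --master local[4]
Either way, the terminal prints a notebook URL containing a token; you will need that token in the tunneling step below.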
You can open an SSH tunnel as follows. This way, you can open the Jupyter notebook in your local browser instead of having to use the browser on the remote machine via ssh -X. With the tunnel below, open your local browser at http://localhost:8889 and enter the token printed in your terminal in the previous step.
ssh -N -f -L localhost:8889:localhost:8880 yourusername@remotehost
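Because -f sends the tunnel to the background, it keeps running after the command returns. One way to stop it later (matching the forwarding spec used above) is:
pkill -f "ssh -N -f -L localhost:8889"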
(The above gist has been successfully tested on Ubuntu 14.04 LTS with an Intel Xeon E5-2620 and an Intel Celeron N3160.)
Hi, thanks for this. Could you please tell me how to create spark-2.1.0-bin-hadoop2.7/conf/spark-env.sh (this file doesn't originally exist, you have to create it)?