Version 1.0 2016/11/14
Pierdomenico Fiadino | pierdomenico.fiadino@eurecat.org
Install Jupyter Notebook in a dedicated Python virtualenv and integrate it with the Spark shell pyspark on a cluster client (for this example, we will use the tourism-lab node).
This how-to assumes that we have SSH access to the machine and that pyspark is already installed and configured (try executing it and check that the variables sc, HiveContext and sqlContext are already defined).
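The check described above can be sketched as a small helper to paste into the pyspark shell. The function name missing_spark_names is hypothetical; the three expected names are the ones listed in this how-to.

```python
def missing_spark_names(namespace, expected=("sc", "HiveContext", "sqlContext")):
    """Return the expected names that are absent from the given namespace."""
    return [name for name in expected if name not in namespace]


# In the pyspark shell, pass the interactive namespace:
#     missing_spark_names(globals())
# An empty list means that all three names are defined.
```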
Log in to the target node my-cluster-node (running CentOS) and install the following packages (requires sudo):
sudo yum install gcc make git Cython blas-devel blas-static liblas-devel liblas-libs lapack-devel
Note: blas and lapack libraries are needed to compile scipy.
From now on we work in user space using a dedicated Python virtual environment. All Python modules will be installed in the virtualenv.
Create a virtual environment called venvtest (or any name you like):
virtualenv venvtest
"Enter" the virtual environment:
source venvtest/bin/activate
All Python-related commands from now on will be run in the virtual environment.
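To confirm that the interpreter in use really belongs to the virtual environment, a minimal sketch like the following can be run with python. The helper name in_virtualenv is hypothetical; it relies on sys.real_prefix (set by older virtualenv releases) and sys.base_prefix (set by venv and newer virtualenv releases).

```python
import sys


def in_virtualenv():
    """Return True if the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases make sys.base_prefix differ from sys.prefix.
    base = getattr(sys, "real_prefix", None) or getattr(sys, "base_prefix", sys.prefix)
    return base != sys.prefix


if __name__ == "__main__":
    print(in_virtualenv())
```

If this prints False after activation, the shell is most likely still picking up the system-wide python.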
Upgrade pip and setuptools:
pip install --upgrade pip
pip install --upgrade setuptools
Install dependencies:
pip install numpy scipy pandas pyzmq seaborn matplotlib
Note: this might take a while, because numpy and scipy will be compiled from source.
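Once the installation finishes, it may be worth checking that every dependency can actually be imported. The following is a minimal sketch; the helper name failed_imports is hypothetical, and note that pyzmq is imported under the module name zmq.

```python
import importlib

# Import names for the packages installed above; pyzmq is imported as "zmq".
DEPENDENCIES = ["numpy", "scipy", "pandas", "zmq", "seaborn", "matplotlib"]


def failed_imports(modules=DEPENDENCIES):
    """Return the module names that cannot be imported."""
    failed = []
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError:
            failed.append(name)
    return failed


# Inside the activated virtualenv:
#     print(failed_imports())
# An empty list means that every dependency was imported successfully.
```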
Install Jupyter:
pip install jupyter
Install the IPython kernel for Jupyter from inside the virtual environment:
jupyter kernelspec install-self --user
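This will place the kernel configuration file in /home/piero/.local/share/jupyter/kernels/python2. Note: this is a plain python2 kernel without Spark integration. On recent Jupyter releases, where install-self is no longer available, python -m ipykernel install --user is the replacement.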
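To see which kernels ended up in the per-user kernels directory, a short sketch like the one below reads each kernel.json file and reports its display name. The helper name list_kernels is hypothetical; the default path assumes the per-user location mentioned above.

```python
import json
import os


def list_kernels(kernels_dir):
    """Map each kernel directory name to the display_name in its kernel.json."""
    kernels = {}
    if not os.path.isdir(kernels_dir):
        return kernels
    for name in sorted(os.listdir(kernels_dir)):
        spec_path = os.path.join(kernels_dir, name, "kernel.json")
        if os.path.isfile(spec_path):
            with open(spec_path) as handle:
                spec = json.load(handle)
            # Fall back to the directory name if display_name is missing.
            kernels[name] = spec.get("display_name", name)
    return kernels


if __name__ == "__main__":
    # Assumed per-user kernels directory for the current user.
    print(list_kernels(os.path.expanduser("~/.local/share/jupyter/kernels")))
```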
Export PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS variables. Place the following two lines in .bashrc:
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip=127.0.0.1 --no-browser"
This tells the pyspark command to launch Jupyter Notebook as its driver instead of the plain Python shell.
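As a rough illustration of the effect of the two variables, the sketch below builds the command line that the driver front end ends up being started with. It only approximates the behaviour and is not Spark's actual launcher code; the helper name driver_command and the "python" fallback are assumptions.

```python
import os
import shlex


def driver_command(env=None):
    """Combine the two variables into the driver command line (approximation)."""
    env = os.environ if env is None else env
    # Assumed fallback when PYSPARK_DRIVER_PYTHON is not set.
    executable = env.get("PYSPARK_DRIVER_PYTHON", "python")
    options = shlex.split(env.get("PYSPARK_DRIVER_PYTHON_OPTS", ""))
    return [executable] + options


if __name__ == "__main__":
    # With the two exports above in place, this prints the jupyter command line.
    print(driver_command())
```

Running it in a shell where the variables are exported is a quick way to verify that .bashrc was actually reloaded.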
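Start Jupyter through pyspark. Open a new shell or run source ~/.bashrc first so that the variables are set, and keep the virtual environment active. Then it is as simple as: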
pyspark
Jupyter is now running on port 8888 (the default). Suggestion: run this command inside a screen session to leave the Jupyter server running in the background.
By default, Jupyter only accepts connections from localhost. Since Jupyter is now running on a remote machine, we need to tunnel a local port (e.g., 8889) to the remote machine's port 8888 through SSH in order to access it.
Run this on the local machine:
ssh -L 8889:localhost:8888 piero@my-cluster-node
Now local port 8889 maps to remote port 8888. We can now access Jupyter on http://localhost:8889.
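If the page does not load, a quick way to tell whether the tunnel is up is to test the local port from Python. This is a minimal sketch; the helper name port_is_open is hypothetical, and port 8889 is the local port chosen in this example.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        connection = socket.create_connection((host, port), timeout)
    except socket.error:
        # Connection refused, timed out, or host not reachable.
        return False
    connection.close()
    return True


if __name__ == "__main__":
    # Run on the local machine while the SSH tunnel is active.
    print(port_is_open("localhost", 8889))
```

A result of False on the local machine usually means that the SSH session holding the tunnel has been closed; the same check against port 8888 on the remote node shows whether Jupyter itself is still running.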
Enjoy.