Run Jupyter with Pyspark integration. This how-to assumes that pyspark is installed and correctly configured to access the cluster (or the stand-alone configuration). Jupyter and the other Python packages are executed in a virtualenv. #python #spark #bigdata #sysadmin

Jupyter + Pyspark how-to

Version 1.0 2016/11/14
Pierdomenico Fiadino | pierdomenico.fiadino@eurecat.org

Synopsis

Install Jupyter Notebook in a dedicated Python virtualenv and integrate it with the pyspark shell on a cluster client (in this example, the tourism-lab node).

This how-to assumes that we have SSH access to the machine and that pyspark is already installed and configured (try executing it and check whether the variables sc, HiveContext and sqlContext are already defined).

CentOS prerequisites (admin-part)

Log in on the target node my-cluster-node (running CentOS) and install the following packages (requires sudo):

sudo yum install gcc make git Cython blas-devel blas-static liblas-devel liblas-libs lapack-devel

Note: blas and lapack libraries are needed to compile scipy.

Set-up virtual environment (user-part)

From now on we work in user space using a dedicated Python virtual environment. All Python modules will be installed in the virtualenv.

Create a virtual environment called venvtest (or whatever you like):

virtualenv venvtest

"Enter" the virtual environment:

source venvtest/bin/activate

All Python-related commands from now on will be run inside the virtual environment.
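The two steps above, end to end, with a quick check that python now resolves inside the venv (a sketch; python3 -m venv is assumed as a fallback when the virtualenv tool is absent):

```shell
# Create the environment, falling back to the stdlib venv module if the
# virtualenv tool is not installed
virtualenv venvtest 2>/dev/null || python3 -m venv venvtest

# Activate it: this prepends venvtest/bin to PATH
source venvtest/bin/activate

# python should now resolve inside the environment
command -v python    # expect .../venvtest/bin/python
```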

Install Python modules

Upgrade pip and install setuptools:

pip install --upgrade pip
pip install --upgrade setuptools

Install dependencies:

pip install numpy scipy pandas pyzmq seaborn matplotlib

Note: this might take a while, because numpy and scipy will be compiled from source.
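A quick sanity check after the install (a sketch; python3 is used here for illustration, but inside the activated venv the plain python command is equivalent, and seaborn and pyzmq can be added to the tuple too):

```shell
# Report which of the core modules import cleanly in the active environment;
# after the pip install above, all four should print "ok"
python3 - <<'EOF'
import importlib
for mod in ("numpy", "scipy", "pandas", "matplotlib"):
    try:
        importlib.import_module(mod)
        print(mod, "ok")
    except ImportError as exc:
        print(mod, "missing:", exc)
EOF
```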

Install and configure Jupyter

Install Jupyter:

pip install jupyter

Install the IPython kernel for Jupyter from inside the virtual environment:

jupyter kernelspec install-self --user

This will place the kernel configuration file in /home/piero/.local/share/jupyter/kernels/python2 (piero being the example user). Note: this is a plain Python 2 kernel without Spark integration.
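To confirm the registration, list the kernels Jupyter knows about (guarded here so the snippet is a no-op where jupyter is not on the PATH):

```shell
# Should list the python2 kernel installed above, with its config path
if command -v jupyter >/dev/null 2>&1; then
  jupyter kernelspec list
fi
```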

Hook Pyspark and Python

Export the PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS variables by placing the following two lines in .bashrc:

export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip=127.0.0.1 --no-browser"

This tells pyspark to use Jupyter as its default driver.
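The exports above can be added idempotently with a small guard (a sketch, assuming .bashrc lives in the user's home directory):

```shell
# Append the driver exports to ~/.bashrc only once (skip if already present)
RC="$HOME/.bashrc"
if ! grep -q 'PYSPARK_DRIVER_PYTHON' "$RC" 2>/dev/null; then
  cat >> "$RC" <<'EOF'
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip=127.0.0.1 --no-browser"
EOF
fi
```

Reload with source ~/.bashrc (or log out and back in) so the current shell picks up the variables.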

Execute Spark-ready Jupyter

As simple as:

pyspark

Jupyter is now running on port 8888. Suggestion: run this command inside a screen session to leave the Jupyter server running in the background.
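The screen suggestion can be sketched as a one-liner (the session name "jupyter" is arbitrary; guarded so the snippet is a no-op where screen or pyspark is missing):

```shell
# Launch pyspark (which now starts Jupyter) in a detached screen session;
# reattach later with: screen -r jupyter
if command -v screen >/dev/null 2>&1 && command -v pyspark >/dev/null 2>&1; then
  screen -dmS jupyter pyspark
fi
```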

By default, Jupyter only accepts connections from localhost, but here it is running on a remote machine. To access it, we need to tunnel a local port (e.g., 8889) to the remote machine's port 8888 through SSH.

Run this on the local machine:

ssh -L 8889:localhost:8888 piero@my-cluster-node

Now local port 8889 maps to remote port 8888. We can now access Jupyter on http://localhost:8889.
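With the tunnel open, a quick check from the local machine (a sketch; curl's -I option just requests the response headers):

```shell
# Expect an HTTP status line from the Jupyter server through the tunnel;
# prints nothing (without aborting) when nothing is listening yet
curl -sI http://localhost:8889 | head -n 1
```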

Enjoy.
