Run Jupyter with Pyspark integration. This how-to assumes that pyspark is installed and correctly configured to access the cluster (or the stand-alone configuration). Jupyter and the other Python packages are executed in a virtualenv. #python #spark #bigdata #sysadmin

Jupyter + Pyspark how-to

Version 1.0 2016/11/14
Pierdomenico Fiadino | pierdomenico.fiadino@eurecat.org

Synopsis

Install Jupyter Notebook in a dedicated Python virtualenv and integrate it with the pyspark shell on a cluster client (in this example, the tourism-lab node).

This how-to assumes that we have SSH access to the machine and that pyspark is already installed and configured (try executing it and check whether the variables sc, HiveContext and sqlContext are already defined).

CentOS prerequisites (admin-part)

Log in on the target node my-cluster-node (running CentOS) and install the following packages (requires sudo):

sudo yum install gcc make git Cython blas-devel blas-static liblas-devel liblas-libs lapack-devel

Note: blas and lapack libraries are needed to compile scipy.

Set-up virtual environment (user-part)

From now on we work in user space using a dedicated Python virtual environment. All Python modules will be installed in the virtualenv.

Create a virtual environment called venvtest (or whatever you like):

virtualenv venvtest

"Enter" the virtual environment:

source venvtest/bin/activate

All Python-related commands from now on will be run inside the virtual environment.
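The two steps above, end to end, with a quick check that python now resolves inside the venv (a sketch; python3 -m venv is assumed as a fallback when the virtualenv tool is absent):

```shell
# Create the environment, falling back to the stdlib venv module if the
# virtualenv tool is not installed
virtualenv venvtest 2>/dev/null || python3 -m venv venvtest

# Activate it: this prepends venvtest/bin to PATH
source venvtest/bin/activate

# python should now resolve inside the environment
command -v python    # expect .../venvtest/bin/python
```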

Install Python modules

Upgrade pip and install setuptools:

pip install --upgrade pip
pip install --upgrade setuptools

Install dependencies:

pip install numpy scipy pandas pyzmq seaborn matplotlib

Note: this might take a while, because numpy and scipy will be compiled from source.
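A quick sanity check after the install (a sketch; python3 is used here for illustration, but inside the activated venv the plain python command is equivalent, and seaborn and pyzmq can be added to the tuple too):

```shell
# Report which of the core modules import cleanly in the active environment;
# after the pip install above, all four should print "ok"
python3 - <<'EOF'
import importlib
for mod in ("numpy", "scipy", "pandas", "matplotlib"):
    try:
        importlib.import_module(mod)
        print(mod, "ok")
    except ImportError as exc:
        print(mod, "missing:", exc)
EOF
```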

Install and configure Jupyter

Install Jupyter:

pip install jupyter

Install the IPython kernel for Jupyter from inside the virtual environment:

jupyter kernelspec install-self --user

This will place the kernel configuration file in /home/piero/.local/share/jupyter/kernels/python2 (piero being the example user). Note: this is a plain Python 2 kernel without Spark integration.
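To confirm the registration, list the kernels Jupyter knows about (guarded here so the snippet is a no-op where jupyter is not on the PATH):

```shell
# Should list the python2 kernel installed above, with its config path
if command -v jupyter >/dev/null 2>&1; then
  jupyter kernelspec list
fi
```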

Hook Pyspark and Python

Export the PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS variables by placing the following two lines in .bashrc:

export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip=127.0.0.1 --no-browser"

This tells pyspark to use Jupyter as its default driver.
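The exports above can be added idempotently with a small guard (a sketch, assuming .bashrc lives in the user's home directory):

```shell
# Append the driver exports to ~/.bashrc only once (skip if already present)
RC="$HOME/.bashrc"
if ! grep -q 'PYSPARK_DRIVER_PYTHON' "$RC" 2>/dev/null; then
  cat >> "$RC" <<'EOF'
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip=127.0.0.1 --no-browser"
EOF
fi
```

Reload with source ~/.bashrc (or log out and back in) so the current shell picks up the variables.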

Execute Spark-ready Jupyter

As simple as:

pyspark

Jupyter is now running on port 8888. Suggestion: run this command inside a screen session to leave the Jupyter server running in the background.
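The screen suggestion can be sketched as a one-liner (the session name "jupyter" is arbitrary; guarded so the snippet is a no-op where screen or pyspark is missing):

```shell
# Launch pyspark (which now starts Jupyter) in a detached screen session;
# reattach later with: screen -r jupyter
if command -v screen >/dev/null 2>&1 && command -v pyspark >/dev/null 2>&1; then
  screen -dmS jupyter pyspark
fi
```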

By default, Jupyter only accepts connections from localhost, but here it is running on a remote machine. To access it, we need to tunnel a local port (e.g., 8889) to the remote machine's port 8888 through SSH.

Run this on the local machine:

ssh -L 8889:localhost:8888 piero@my-cluster-node

Now local port 8889 maps to remote port 8888. We can now access Jupyter on http://localhost:8889.
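With the tunnel open, a quick check from the local machine (a sketch; curl's -I option just requests the response headers):

```shell
# Expect an HTTP status line from the Jupyter server through the tunnel;
# prints nothing (without aborting) when nothing is listening yet
curl -sI http://localhost:8889 | head -n 1
```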

Enjoy.
