You can use the computers in BC 07-08 or your own computer to do the labs.
If you want to use the computers in BC 07-08, note that they run virtual machines; the image that contains the software you will use during the course is
IC-CO-IN-SC
Many programs that will be useful during the course (such as Python, Jupyter) are not the "default" ones found in $PATH. You can find them in /opt/anaconda3/bin. (The same holds for the cluster.)
In particular, you will find these two commands useful:
/opt/anaconda3/bin/jupyter console # Launch a command-line interpreter
/opt/anaconda3/bin/jupyter notebook # Launch a notebook server
Be careful: running jupyter notebook (without the absolute path, as described above) seems to work at first, but many of the libraries that we use in the course will not be available (e.g., matplotlib).
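If you prefer not to type the absolute path every time, one optional workaround (not something the course requires) is to put /opt/anaconda3/bin at the front of your PATH for the current shell session:
export PATH=/opt/anaconda3/bin:$PATH # make the Anaconda binaries take precedence over the system ones
which jupyter # should now print /opt/anaconda3/bin/jupyter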
We only support the BC 07-08 machines. However, you are free to use your own machine.
To work on your own computer, you should have at least Python 3 with jupyter installed, plus any libraries you want to use for the labs. A simple way to do this is to install Anaconda.
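As a rough sketch of what such a setup could look like (the environment name and the library list below are only examples, adjust them to your needs), with Anaconda installed you can create a dedicated environment for the course:
conda create -n labs python=3 jupyter matplotlib numpy pandas # example environment; pick the libraries you need
conda activate labs # switch to the new environment
jupyter notebook # start a local notebook server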
The cluster uses your EPFL LDAP settings. You can change your EPFL-wide default shell at https://cadiwww.epfl.ch/cgi-bin/accountprefs/login. We recommend that you use bash, as this is the shell that we support in this course.
Note that the change may take some time to synchronize. To manually switch your current shell to bash, run:
exec /bin/bash
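If you want to double-check which shell you are currently running, you can print the name of the current process; in bash this prints bash or -bash:
echo $0 # shows the shell you are currently in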
The cluster can be accessed at iccluster040.iccluster.epfl.ch. You need to be on the EPFL network to be able to reach it (either on campus or connected via VPN). You can connect to the server using your GASPAR credentials. On the command line, type:
ssh -l USERNAME iccluster040.iccluster.epfl.ch
Replace USERNAME with your GASPAR username.
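Optionally, to avoid typing the full hostname every time, you can add an entry to the ~/.ssh/config file on your own machine (the alias name icc below is just an example):
Host icc
    HostName iccluster040.iccluster.epfl.ch
    User USERNAME
After that, ssh icc is enough to connect.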
Edit the file ~/.bashrc using your favorite text editor (e.g., vim), and add the following lines at the end:
export PYSPARK_PYTHON=/opt/anaconda3/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/anaconda3/bin/jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
alias pyspark='pyspark --conf spark.ui.port=xxxx0'
where xxxx0 is the last four digits of your SCIPER number followed by 0.
If the first of these four digits is 0, or if the four-digit number is larger than 6553, the port number will not be set properly. In this case, replace xxxx with any number between 1024 and 6553.
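For example, with the made-up SCIPER number 123456, the last four digits are 3456 and the alias would read:
alias pyspark='pyspark --conf spark.ui.port=34560'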
Then, reload the file using
source ~/.bashrc
or by logging out and in again on the cluster.
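To check that the new variables are visible in your session, you can list them (a simple sanity check, assuming you added the lines exactly as shown above):
env | grep PYSPARK # should show the three PYSPARK_* variables you just added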
Now, we will need to generate a Jupyter configuration. Use the command
/opt/anaconda3/bin/jupyter notebook --generate-config
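The command should print the location of the file it creates (under ~/.jupyter); you can verify that the file is there with:
ls ~/.jupyter/jupyter_notebook_config.py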
We will also protect the Jupyter notebook server with a password. To generate a password hash, use the following command.
/opt/anaconda3/bin/python -c "from notebook.auth import passwd; print(passwd())"
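You will be prompted for the password twice; the interaction looks roughly like this (the hash below is a made-up placeholder, yours will differ):
Enter password:
Verify password:
sha1:0123456789ab:0123456789abcdef0123456789abcdef01234567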
Remember the password that you chose. The last line of the output (starting with sha1:) will be used in the next step. Edit the file ~/.jupyter/jupyter_notebook_config.py and add the following lines at the top:
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.password = 'yyyy'
c.NotebookApp.port = xxxx
where yyyy is the password hash string that you generated previously (including the sha1: prefix), and xxxx is the last four digits of your SCIPER number.
Now, you should be all set. To test your configuration, run
pyspark
It should launch a new Jupyter notebook server. You can access it from any computer inside EPFL at the address http://iccluster040.iccluster.epfl.ch:xxxx (where xxxx is the last four digits of your SCIPER number).
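As a quick reachability check from the command line (again using the made-up digits 3456), you can ask the server for an HTTP status code; getting a response back (often a redirect to the login page) means the notebook server is up:
curl -s -o /dev/null -w '%{http_code}\n' http://iccluster040.iccluster.epfl.ch:3456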
By default, pyspark does not run in cluster mode, and all computations happen
locally. Most of the time, this is not what you want. In order to make Spark
run in cluster mode, you need to add --master yarn
to the command line.
You will also need to specify the number of executors (think of them as distributed processes that run in parallel across servers), the number of cores per executor, and the amount of memory per executor.
Keep in mind that the cluster comprises 10 machines, each of which has 24 cores and 182 GB of RAM. As a starting point, we recommend that you use the following command line to run pyspark:
pyspark --master yarn --num-executors 1 --executor-cores 1 --executor-memory 10G
This will allocate very few resources to you, but your classmates will thank you: they will be able to launch pyspark as well :-) If you are working on a bigger task, you can start increasing the number of executors, as well as the cores and memory per executor. For example:
pyspark --master yarn --num-executors 7 --executor-cores 4 --executor-memory 20G
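If you want to see which Spark applications are currently holding resources on the cluster (yours included), and assuming the Hadoop yarn client is available on the machine, you can list the running YARN applications from another terminal:
yarn application -list # shows running applications, including your pyspark session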