You can use the computers in BC 07-08 or your own computer to do the labs.
If you want to use the computers in BC 07-08, note that they run virtual machines; the image that contains the software you will use during the course is
IC-CO-IN-SC
Many programs that will be useful during the course (such as Python, Jupyter) are not the "default" ones found in $PATH. You can find them in /opt/anaconda3/bin. (The same holds for the cluster.)
In particular, you will find these two commands useful:
/opt/anaconda3/bin/jupyter console # Launch a command-line interpreter
/opt/anaconda3/bin/jupyter notebook # Launch a notebook server
Be careful: running jupyter notebook (without the absolute path, as described above) seems to work at first, but many of the libraries that we use in the course will not be available (e.g., matplotlib).
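If you prefer not to type the absolute path every time, one optional workaround (not something the course requires) is to put /opt/anaconda3/bin at the front of your PATH for the current shell session:
export PATH=/opt/anaconda3/bin:$PATH # make the Anaconda binaries take precedence over the system ones
which jupyter # should now print /opt/anaconda3/bin/jupyter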
We only support the BC 07-08 machines. However, you are free to use your own machine.
To work on your own computer, you should have at least Python 3 with jupyter installed, plus any libraries you want to use for the labs. A simple way to do this is to install Anaconda.
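As a rough sketch of what such a setup could look like (the environment name and the library list below are only examples, adjust them to your needs), with Anaconda installed you can create a dedicated environment for the course:
conda create -n labs python=3 jupyter matplotlib numpy pandas # example environment; pick the libraries you need
conda activate labs # switch to the new environment
jupyter notebook # start a local notebook server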
The cluster uses your EPFL LDAP settings. You can change your EPFL-wide default shell at https://cadiwww.epfl.ch/cgi-bin/accountprefs/login. We recommend that you use bash, as this is the shell that we support in this course.
Note that the change may take some time to synchronize. To manually switch your current shell to bash, run:
exec /bin/bash
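If you want to double-check which shell you are currently running, you can print the name of the current process; in bash this prints bash or -bash:
echo $0 # shows the shell you are currently in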
The cluster can be accessed at iccluster040.iccluster.epfl.ch. You need to be on the EPFL network to be able to reach it (either on campus or connected via VPN). You can connect to the server using your GASPAR credentials. On the command line, type:
ssh -l USERNAME iccluster040.iccluster.epfl.ch
Replace USERNAME with your GASPAR username.
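Optionally, to avoid typing the full hostname every time, you can add an entry to the ~/.ssh/config file on your own machine (the alias name icc below is just an example):
Host icc
    HostName iccluster040.iccluster.epfl.ch
    User USERNAME
After that, ssh icc is enough to connect.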
Edit the file ~/.bashrc using your favorite text editor (e.g., vim), and add the following lines at the end:
export PYSPARK_PYTHON=/opt/anaconda3/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/anaconda3/bin/jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
alias pyspark='pyspark --conf spark.ui.port=xxxx0'
where xxxx0 is the last four digits of your SCIPER number followed by 0.
If the first of these four digits is 0, or if the four-digit number is larger than 6553, the port number will not be set properly. In this case, replace xxxx with any number between 1024 and 6553.
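For example, with the made-up SCIPER number 123456, the last four digits are 3456 and the alias would read:
alias pyspark='pyspark --conf spark.ui.port=34560'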
Then, reload the file using
source ~/.bashrc
or by logging out and in again on the cluster.
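To check that the new variables are visible in your session, you can list them (a simple sanity check, assuming you added the lines exactly as shown above):
env | grep PYSPARK # should show the three PYSPARK_* variables you just added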
Now, we will need to generate a Jupyter configuration. Use the command
/opt/anaconda3/bin/jupyter notebook --generate-config
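The command should print the location of the file it creates (under ~/.jupyter); you can verify that the file is there with:
ls ~/.jupyter/jupyter_notebook_config.py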
We will also protect the Jupyter notebook server with a password. To generate a password hash, use the following command.
/opt/anaconda3/bin/python -c "from notebook.auth import passwd; print(passwd())"
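You will be prompted for the password twice; the interaction looks roughly like this (the hash below is a made-up placeholder, yours will differ):
Enter password:
Verify password:
sha1:0123456789ab:0123456789abcdef0123456789abcdef01234567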
Remember the password that you chose. The last line of the output (starting with sha1:) will be used in the next step. Edit the file ~/.jupyter/jupyter_notebook_config.py and add the following lines at the top:
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.password = 'yyyy'
c.NotebookApp.port = xxxx
where yyyy is the password hash string that you generated previously (including the sha1: prefix), and xxxx is the last four digits of your SCIPER number.
Now, you should be all set. To test your configuration, run
pyspark
It should launch a new Jupyter notebook server. You can access it from any computer inside EPFL at the address http://iccluster040.iccluster.epfl.ch:xxxx (where xxxx is the last four digits of your SCIPER number).
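As a quick reachability check from the command line (again using the made-up digits 3456), you can ask the server for an HTTP status code; getting a response back (often a redirect to the login page) means the notebook server is up:
curl -s -o /dev/null -w '%{http_code}\n' http://iccluster040.iccluster.epfl.ch:3456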
By default, pyspark does not run in cluster mode, and all computations happen
locally. Most of the time, this is not what you want. In order to make Spark
run in cluster mode, you need to add --master yarn
to the command line.
You will also need to specify the number of executors (think of them as distributed processes that run in parallel across servers), the number of cores per executor, and the amount of memory per executor.
Keep in mind that the cluster comprises 10 machines, each of which has 24 cores and 182 GB of RAM. As a starting point, we recommend that you use the following command line to run pyspark:
pyspark --master yarn --num-executors 1 --executor-cores 1 --executor-memory 10G
This will allocate very few resources to you, but your classmates will thank you: they will be able to launch pyspark as well :-) If you are working on a bigger task, you can start increasing the number of executors, as well as the cores and memory per executor. For example:
pyspark --master yarn --num-executors 7 --executor-cores 4 --executor-memory 20G
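If you want to see which Spark applications are currently holding resources on the cluster (yours included), and assuming the Hadoop yarn client is available on the machine, you can list the running YARN applications from another terminal:
yarn application -list # shows running applications, including your pyspark session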