This is a collection of information about how to set up a working environment using Python (and ROOT) for data analysis on an HPC cluster.
By using the paths effectively you can speed up your data analysis and development. As a first rule, please do not use or put anything directly in the main user home, e.g.:
/u/<USERNAME>
Anaconda or Miniconda Python will be installed in the user home; you don't need to change anything there manually:
/u/<USERNAME>/anaconda3
or
/u/<USERNAME>/miniconda3
There are some libraries that are of interest to everyone and can be used globally in conjunction with the Anaconda installation. Their git repositories go here:
/u/<USERNAME>/git
For your personal data of any kind you can use:
/u/<USERNAME>/personal_directories/<YOUR_NAME>
For your analysis code on HPC/Lustre you can also use a path on Lustre, something like the following. Here you can also put additional project-specific git repositories.
/lustre/astrum/<YOUR_NAME>
Analysis data can be found on:
/lustre/ap/<USERNAME>/...
The steps in the following section need to be done only once, and they have probably been done already, so please check before repeating them.
For the Python installation it is recommended to go straight for Anaconda or its smaller variant Miniconda. Just grab the latest version and run the installer in the home directory of your user. In our case:
/u/<USERNAME>/anaconda3
or
/u/<USERNAME>/miniconda3
After the installation finishes, you can proceed to install further libraries. Some of these libraries can be installed directly via pip, others using conda. In principle it is better to stay with the conda variant as long as possible and only later use pip to add missing packages. This ensures better compatibility with the conda ecosystem. So a good starting point on a freshly installed Anaconda/Miniconda is basically the following line, which will also simplify the later installation of libraries:
conda install -c conda-forge numpy scipy matplotlib uproot3 pyqt && pip install pytdms
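As a quick optional sanity check (just a minimal sketch), you can start Python afterwards and import the freshly installed packages; all names below match the install command above:
# Minimal sanity check: import the packages installed above and print versions.
import numpy
import scipy
import matplotlib
import uproot3
import pytdms  # pip-installed TDMS reader; just importing it is enough here

for mod in (numpy, scipy, matplotlib, uproot3):
    print(mod.__name__, mod.__version__)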
For data analysis of Schottky detectors you might need iqtools, iqgui and probably also barion. The best way to install the first two is to follow the instructions provided here. IQGui does not need any special installation, but you can check the info here. If you have already installed the main libraries using conda as described in the previous step, then you don't need to repeat the same steps with pip. Barion does not need an installation either; it just needs to be pulled from GitHub.
In general, such libraries of interest are / can be installed in the user git directory:
/u/<USERNAME>/git
But additional stuff can go into your personal directory, e.g.:
/lustre/astrum/<YOUR_NAME>
An example of how to use iqtools and ROOT can be found in this Jupyter notebook, and also on this page. For Barion you can find an example of usage in this notebook.
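For orientation, a minimal iqtools sketch might look like the following; the file name is a placeholder and the helper and attribute names follow the iqtools README, so please check the linked notebooks for the current API:
from iqtools import *

# Hypothetical file name; get_iq_object and the attribute names are taken
# from the iqtools README -- check the linked notebooks for the current API.
iq = get_iq_object('some_measurement.tiq')
iq.read_samples(1024)       # read the first 1024 IQ samples
print(iq.fs, iq.center)     # sampling rate and center frequency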
The easiest way to have the CERN ROOT library is to install it via conda inside a so-called virtual environment, as described in this tutorial. It basically boils down to running this command only once:
conda create -n my_root_env root -c conda-forge
That is basically it. Now you can go in and out of the virtual environment with:
conda activate my_root_env
and
conda deactivate
Remember: whenever you are inside an environment, you may need to install the libraries for your specific project again in that specific environment, since the idea of a virtual environment is to create isolated spaces. The concept of virtual environments is indeed quite cool and very effective when programming across many projects, each with different dependencies.
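A minimal sketch of what this isolation means in practice: run the following once inside my_root_env and once after deactivating, and you will see that a different Python installation (and hence a different package set) is used each time; the printed paths will of course depend on your setup:
import sys

# Inside an environment this points to e.g. .../miniconda3/envs/my_root_env,
# outside to the base installation, e.g. .../miniconda3
print(sys.prefix)
print(sys.executable)  # the interpreter actually being used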
The Lustre file system is the "hard disk" of the HPC cluster, a super nice place to store data, with fast read and write cycles etc. You can also use it for data analysis, so there is no need to copy data around; everything stays nicely in one place.
But Lustre has one single disadvantage: "directory listing" is super slow on Lustre. This means commands like "ls", "tree" etc. will be very slow or fail, and any code that tries to "open" a directory on Lustre, including your own scripts or GUI programs, issues a directory listing command. This is a problem with data from experiments with a lot of single files, like the E143 experiment.
But there is a simple trick to circumvent this problem. You can get a directory listing into a file by using this command:
find "$PWD" -iname *.tiq -type f > ~/listing.txt
Alternatively you can use echo and translate the spaces into newlines:
echo * | tr ' ' '\n'
or any other variant such as xargs -n 1 etc., as can be found here.
By doing this you can control exactly which kind of files, like TIQ, TDMS etc., you take for analysis: you simply take the individual file names from the listing file (in this case listing.txt), either for direct insertion into a GUI or for iterating over in your script, as in the sketch below.
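For example, a minimal sketch (the file name listing.txt matches the command above; everything else is a placeholder) of iterating over such a listing in Python instead of touching the Lustre directory itself:
# Iterate over the pre-made listing instead of listing the Lustre directory.
from pathlib import Path

listing = Path.home() / 'listing.txt'
filenames = [line.strip() for line in listing.read_text().splitlines() if line.strip()]

for filename in filenames:
    # ... open and process each file here, e.g. with iqtools ...
    print(filename)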
So the data are on Lustre. If you are working on a local LXG computer, then from your LXG Linux computer make a single jump (when you are at GSI, using a GSI computer):
ssh -X <USERNAME>@lustre.hpc.gsi.de
Now you can activate the ROOT environment:
conda activate my_root_env
Now you have access to ROOT as well as IQTools and IQGui.
You can use IQTools in your code:
from iqtools import *
also mixed with ROOT:
from ROOT import TCanvas, TH2D, ...
or
from iqtools import *
import matplotlib.pyplot as plt
%matplotlib inline
from ROOT import TGraph, TFile, TCanvas, TH2F
%jsroot off
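As an illustration of mixing the two worlds, here is a small sketch (the array, names and binning are made up) that fills a ROOT 2D histogram from a NumPy array, e.g. a spectrogram-like data set:
import numpy as np
from ROOT import TCanvas, TH2F

# Hypothetical 2D data, e.g. power vs. (frequency bin, time bin)
power = np.random.rand(100, 50)

h = TH2F('h_spec', 'spectrogram;frequency bin;time bin', 100, 0, 100, 50, 0, 50)
for ix in range(power.shape[0]):
    for iy in range(power.shape[1]):
        h.SetBinContent(ix + 1, iy + 1, power[ix, iy])  # ROOT bins start at 1

c = TCanvas('c', 'c')
h.Draw('colz')
c.SaveAs('spectrogram.png')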
If you need it, you can run the TBrowser in this environment directly from the command line:
root --web=off -e 'new TBrowser()'
Some more examples:
https://github.com/xaratustrah/iqtools/blob/main/doc/quick_introduction_iqtools.ipynb
You don't need to use iqtools with ROOT, only if you like; in that case I suggest looking at these examples:
https://gist.github.com/xaratustrah/474404d56b7664ab6ad2f8130eb1331e https://github.com/xaratustrah/iqtools/blob/main/doc/rootpy_iqtools_example.ipynb
So here you have all you need: the libraries, the data and powerful computers on the HPC. Due to some version conflicts, you might need to run iqgui in a new environment. Just create a new clean Python 3.9 environment and start from there.
Using Jupyter-Notebook for testing and analysing the data is very convenient. It is mainly suitable for testing procedures, whereas long-term data analysis is probably better done outside of Jupyter-Notebook in dedicated scripts. Nevertheless, if you would like to use Jupyter-Notebook on the data stored on the HPC cluster, you will notice that the remote connection will not be visible on your local machine. That is where SSH hopping comes into play.
So after activating the ROOT+Conda environment, you can start Jupyter-Notebook with:
jupyter notebook --no-browser --port=8889
Note that by running this command, you will see a long token string printed on the screen, which you will need later, as described below.
The classic way is to create tunnels:
You open a new terminal window on your local LXG machine:
ssh -N -f -L localhost:8888:localhost:8889 <USERNAME>@lustre.hpc.gsi.de
This creates a tunnel between the local computer and the HPC machine. Note that this tunnel stays there forever on your local machine. You can see that it is running with:
ps ax | grep ssh
If you create several tunnels by mistake, you can kill their processes by entering the corresponding process ID, which is printed in the leftmost column:
kill <PID>
So at this stage you know that Jupyter-Notebook is running on the HPC computer on port 8889, and you have created an SSH tunnel on your local LXG machine which connects port 8889 of the HPC machine to your local port 8888. Now if you open a browser on your local machine and type:
localhost:8888
you can see Jupyter-Notebook working. You just need to type the token for authentication, which was printed on the screen before. This means that you now have access to all analysis files, scripts, ROOT and the other libraries from your local machine's browser.
The alternative way is to use ProxyJump:
Single jump:
ssh -L 8888:localhost:8889 <USERNAME>@lx-pool.gsi.de jupyter notebook --no-browser --port=8889
Double jump:
ssh -L 8888:localhost:8889 -o ProxyJump=<USERNAME>@lx-pool.gsi.de <USERNAME>@lustre.hpc.gsi.de jupyter notebook --no-browser --port=8889
Then you paste the URL+token into your browser.
By the way, instead of doing the analysis inside the browser, I highly recommend using the free VSCodium text editor (a free fork of VSCode), which has a super nice integrated interface for all programming languages, LaTeX etc., but can also deal with such remote Jupyter servers as mentioned above, as well as with different Python environments at the same time.
If you need to activate an environment first, you may need to include it in the SSH command:
ssh -L 8888:localhost:8889 <USERNAME>@lx-pool.gsi.de "conda activate my_root_env; jupyter notebook --no-browser --port=8889"
or
ssh -L 8888:localhost:8889 -o ProxyJump=<USERNAME>@lx-pool.gsi.de <USERNAME>@lustre.hpc.gsi.de "conda activate my_root_env; jupyter notebook --no-browser --port=8889"
Thanks to the free / open source program PuTTY you can repeat the steps above on a Windows machine, but this machine needs to have access to the same network as the HPC, i.e. it should be a GSI device.
Like above, after activating the ROOT+Conda environment, you start Jupyter-Notebook on the HPC machine:
jupyter notebook --no-browser --port=8889
Again, note the long token string printed on the screen; you will need it later.
- Open PuTTY and enter the server URL or IP address as the hostname.
- Go to SSH at the bottom of the left pane to expand the menu, then click on Tunnels.
- Enter the port number you want to use to access Jupyter on your local machine, in this case 8888, and set the destination to localhost:8889, where 8889 is the port that Jupyter Notebook is running on.
- Now click the Add button, and the ports should appear in the Forwarded ports list.
- Click the Open button to open the terminal.
- Open a browser and enter localhost:8888 to see Jupyter, then enter the token.
A screenshot of the settings can be found here.
It is possible to do the analysis from Windows or Mac computers inside or outside of the institute.
Mac inside the institute:
- You can use
ssh -X
Mac or Windows outside of the institute, or Windows inside of the institute:
- You can use CITRIX
For that you have to apply for the activation of your CITRIX account by the IT department and follow the instructions on the IT page for installing the CITRIX receiver.
After that, on the CITRIX host, you can directly connect to the HPC cluster using either the X2GO or the XWin-32 client. There is no need to use a remote desktop connection to another local Windows machine. But you need to hop once over a Linux machine as described in the section above, since the HPC machines do not seem to be reachable from the receiver machines.
So the same applies here: inside the CITRIX receiver, you start either the X2GO or the XWin-32 client and connect to lx-pool or your own local LXG machine (if you have the permissions to). From there you do the rest as above. This also applies to SSH hopping and Jupyter-Notebook.
TBD.