Sometimes it is easier to start from a Docker image that already contains most of the things you need. You can use it as a base and then add everything else on top, for example:
Bootstrap: docker
From: rootproject/root
# ... plus all the Python packages, pip installs and environment variables you need
The next step is to build the image, which needs privileges:
docker run --rm --privileged -v ${PWD}:/PWD quay.io/singularity/singularity:v3.7.4-slim build /PWD/myimage.sif /PWD/myimage.def
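Before copying the image to the cluster, it can help to verify locally that the container provides what you expect; a quick sketch, assuming singularity is installed on your machine and numpy is one of the packages you added:
# check the Python interpreter inside the container
singularity exec myimage.sif python3 --version
# check one of your pip-installed packages (numpy here is just an example)
singularity exec myimage.sif python3 -c "import numpy; print(numpy.__version__)"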
Of course, the image should only contain the execution environment (interpreters, libraries, compiled code), not the analysis scripts themselves; otherwise you cannot easily debug and modify your own code.
First log in to the pool, then from there log in to virgo; from virgo you can access lustre.
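The login chain could look like this; the host names here are only placeholders, use the ones from your site documentation:
ssh USERNAME@pool-login-host     # placeholder for a pool machine
ssh USERNAME@virgo-submit-host   # placeholder for the virgo submit node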
For GPU nodes you need to choose differently, as stated in the documentation. Next, set up the environment variables, for example:
cd /lustre/some_experiment
export LUSTRE_HOME=$PWD
For input data, output files and scripts, imagine you have the following directory structure:
mkdir $LUSTRE_HOME/source_dir
mkdir $LUSTRE_HOME/output_dir
mkdir $LUSTRE_HOME/scripts_dir
Make sure $LUSTRE_HOME/output_dir is empty before running the job on the farm. Then, in order to work with SLURM arrays, you need to pass the input files as an array to the scripts. So assuming your data files are *.dat files, you need to create a file list first:
cd $LUSTRE_HOME/source_dir
find $PWD -type f -name "*.dat" | sort > $LUSTRE_HOME/scripts_dir/filelist.txt
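A quick sanity check that the list contains what you expect:
wc -l $LUSTRE_HOME/scripts_dir/filelist.txt   # number of files = number of array tasks later on
head -n 3 $LUSTRE_HOME/scripts_dir/filelist.txt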
In the scripts directory, you need at least the following files:
- The actual analysis script in Python
- The worker script
- The submitter script
- The image file myimage.sif
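Once all of the pieces described below are in place, the scripts directory should contain roughly the following:
ls $LUSTRE_HOME/scripts_dir
# arraysubmit.slurm  filelist.txt  myimage.sif  number_cruncher.py  work.slurm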
Here is number_cruncher.py; it needs command line arguments for the input files and the output directory, one should not hard-code these paths.
#!/usr/bin/env python3
"""
This is the number cruncher script in python.
"""
import argparse
import sys
import os
...

def process(filename, outdir):
    pass  # this one does the actual job for a single input file

...

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('filenames', nargs='+', type=str,
                        help='Name of the input files.')
    parser.add_argument('-o', '--outdir', type=str, default='.',
                        help='Output directory.')
    parser.add_argument('-d', type=int, default=0,
                        help='Additional parameter (SOME_PARAMETER in the worker script).')
    ....
    args = parser.parse_args()
    # call the process function for each file; accepting several files per call
    # avoids starting the Python interpreter over and over again
    for filename in args.filenames:
        process(filename, args.outdir)

if __name__ == '__main__':
    main()
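You can try the script by hand on one or two files first (the .dat file names here are placeholders); if the required packages are not installed locally, run it through the container instead:
python3 number_cruncher.py -o /tmp/test_output some_file.dat
# or, through the container:
singularity exec myimage.sif python3 number_cruncher.py -o /tmp/test_output some_file.dat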
Now you need a worker script, which we call work.slurm:
#!/bin/bash
if [ $# -eq 0 ]; then
    echo "No file list provided, aborting."
    exit 1
fi
FILELISTNAME=$1
# read the file list into a bash array; the array task ID picks one entry
FILES=($(cat "$FILELISTNAME"))
SOME_PARAMETER=107
OUTDIR=$LUSTRE_HOME/output_dir
singularity exec $LUSTRE_HOME/scripts_dir/myimage.sif /bin/sh -c "python3 $LUSTRE_HOME/scripts_dir/number_cruncher.py -d $SOME_PARAMETER -o $OUTDIR ${FILES[$SLURM_ARRAY_TASK_ID]}"
Here you can also set SOME_PARAMETER, which is passed on to your calculation script.
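Before submitting a full array, you can also run the worker by hand for a single array index, for example:
cd $LUSTRE_HOME/scripts_dir
SLURM_ARRAY_TASK_ID=0 bash work.slurm filelist.txt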
Now the submitter script arraysubmit.slurm looks like this:
#!/bin/bash
if [ $# -eq 0 ]; then
echo "No file list provided, aborting."
exit 1
fi
FILELISTNAME=$1
NUM_TASKS=$(wc -l < "$FILELISTNAME")  # number of lines in the file list = number of array tasks
MAX_ID=$((NUM_TASKS - 1))
sbatch --array=0-$MAX_ID --mem=60G --time=15:00 --partition=main --ntasks=1 --job-name=calc work.slurm $FILELISTNAME
In the last line of the submitter script you can see how to change the amount of RAM, the wall time and the partition. The number of tasks and other related settings can be adjusted here as well.
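If you do not want all array tasks to start at once, SLURM lets you throttle the array with a % suffix on the array range; for example, to run at most 10 tasks simultaneously:
sbatch --array=0-$MAX_ID%10 --mem=60G --time=15:00 --partition=main --ntasks=1 --job-name=calc work.slurm $FILELISTNAME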
The worker and the submitter may look similar with respect to the file list input, but the worker handles a single array task, while the submitter submits the whole set of jobs as an array.
Now you can submit the job:
cd $LUSTRE_HOME/scripts_dir
./arraysubmit.slurm filelist.txt
There are many SLURM commands that can be used to check the status of your calculations, for example:
sacct -u USERNAME
or to check your queue and a specific job ID:
squeue --user USERNAME && sacct --format=Elapsed -j JOB_ID
And of course many more commands.
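If an array job misbehaves, you can cancel it by its job ID, or cancel all your jobs at once:
scancel JOB_ID
# or cancel everything you have submitted:
scancel --user USERNAME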
In order to avoid trouble, first make sure your code works correctly when called with a single input file. Then create a file list that contains only 2 or 3 files and submit that as an array, before moving on to larger chunks.
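A small test round could look like this, reusing the file list from above:
cd $LUSTRE_HOME/scripts_dir
head -n 3 filelist.txt > filelist_test.txt
./arraysubmit.slurm filelist_test.txt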