Sometimes it is easier to start from a Docker image that already contains most of the things you need. You can use it as a base and then add everything else on top, for example:
Bootstrap: docker
From: rootproject/root
# ... plus all the Python packages, pip installs and environment variables you need
The next step is to build the image, which needs privileges:
docker run --rm --privileged -v ${PWD}:/PWD quay.io/singularity/singularity:v3.7.4-slim build /PWD/myimage.sif /PWD/myimage.def
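Before copying the image to the cluster, it can help to verify locally that the container provides what you expect; a quick sketch, assuming singularity is installed on your machine and numpy is one of the packages you added:
# check the Python interpreter inside the container
singularity exec myimage.sif python3 --version
# check one of your pip-installed packages (numpy here is just an example)
singularity exec myimage.sif python3 -c "import numpy; print(numpy.__version__)"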
Of course, the image should only contain the execution environment (interpreters, libraries, compiled code), not the analysis scripts themselves; otherwise you cannot easily debug and modify your own code.
First log in to the pool, then from there log in to virgo; from virgo you can access lustre.
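The login chain could look like this; the host names here are only placeholders, use the ones from your site documentation:
ssh USERNAME@pool-login-host     # placeholder for a pool machine
ssh USERNAME@virgo-submit-host   # placeholder for the virgo submit node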
For GPU nodes you need to choose differently, as stated in the documentation. Next, set up the environment variables, for example:
cd /lustre/some_experiment
export LUSTRE_HOME=$PWD
For input data, output files and scripts, imagine you have the following directory structure:
mkdir $LUSTRE_HOME/source_dir
mkdir $LUSTRE_HOME/output_dir
mkdir $LUSTRE_HOME/scripts_dir
Make sure $LUSTRE_HOME/output_dir is empty before running the job on the farm. Then, in order to work with SLURM arrays, you need to pass the input files as an array to the scripts. So assuming your data files are *.dat files, you need to create a file list first:
cd $LUSTRE_HOME/source_dir
find $PWD -type f -name "*.dat" | sort > $LUSTRE_HOME/scripts_dir/filelist.txt
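A quick sanity check that the list contains what you expect:
wc -l $LUSTRE_HOME/scripts_dir/filelist.txt   # number of files = number of array tasks later on
head -n 3 $LUSTRE_HOME/scripts_dir/filelist.txt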
In the scripts directory, you need at least the following files:
- The actual analysis script in Python
- The worker script
- The submitter script
- The image file myimage.sif
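Once all of the pieces described below are in place, the scripts directory should contain roughly the following:
ls $LUSTRE_HOME/scripts_dir
# arraysubmit.slurm  filelist.txt  myimage.sif  number_cruncher.py  work.slurm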
Here is number_cruncher.py; it needs command line arguments for the input files and the output directory, one should not hard-code these paths.
#!/usr/bin/env python3
"""
This is the number cruncher script in python.
"""
import argparse
import sys
import os
...

def process(filename, outdir):
    pass  # this one does the actual job for a single input file

...

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('filenames', nargs='+', type=str,
                        help='Name of the input files.')
    parser.add_argument('-o', '--outdir', type=str, default='.',
                        help='Output directory.')
    parser.add_argument('-d', type=int, default=0,
                        help='Additional parameter (SOME_PARAMETER in the worker script).')
    ....
    args = parser.parse_args()
    # call the process function for each file; accepting several files per call
    # avoids starting the Python interpreter over and over again
    for filename in args.filenames:
        process(filename, args.outdir)

if __name__ == '__main__':
    main()
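You can try the script by hand on one or two files first (the .dat file names here are placeholders); if the required packages are not installed locally, run it through the container instead:
python3 number_cruncher.py -o /tmp/test_output some_file.dat
# or, through the container:
singularity exec myimage.sif python3 number_cruncher.py -o /tmp/test_output some_file.dat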
Now you need a worker script, which we call work.slurm:
#!/bin/bash
if [ $# -eq 0 ]; then
    echo "No file list provided, aborting."
    exit 1
fi
FILELISTNAME=$1
# read the file list into a bash array; the array task ID picks one entry
FILES=($(cat "$FILELISTNAME"))
SOME_PARAMETER=107
OUTDIR=$LUSTRE_HOME/output_dir
singularity exec $LUSTRE_HOME/scripts_dir/myimage.sif /bin/sh -c "python3 $LUSTRE_HOME/scripts_dir/number_cruncher.py -d $SOME_PARAMETER -o $OUTDIR ${FILES[$SLURM_ARRAY_TASK_ID]}"
Here you can also set SOME_PARAMETER, which is passed on to your calculation script.
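Before submitting a full array, you can also run the worker by hand for a single array index, for example:
cd $LUSTRE_HOME/scripts_dir
SLURM_ARRAY_TASK_ID=0 bash work.slurm filelist.txt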
Now the submitter script arraysubmit.slurm looks like this:
#!/bin/bash
if [ $# -eq 0 ]; then
echo "No file list provided, aborting."
exit 1
fi
FILELISTNAME=$1
NUM_TASKS=$(wc -l < "$FILELISTNAME")  # number of lines in the file list = number of array tasks
MAX_ID=$((NUM_TASKS - 1))
sbatch --array=0-$MAX_ID --mem=60G --time=15:00 --partition=main --ntasks=1 --job-name=calc work.slurm $FILELISTNAME
In the last line of the submitter script you can see how to change the amount of RAM, the wall time and the partition. The number of tasks and other related settings can be adjusted here as well.
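If you do not want all array tasks to start at once, SLURM lets you throttle the array with a % suffix on the array range; for example, to run at most 10 tasks simultaneously:
sbatch --array=0-$MAX_ID%10 --mem=60G --time=15:00 --partition=main --ntasks=1 --job-name=calc work.slurm $FILELISTNAME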
The worker and the submitter may look similar with respect to the file list input, but the worker handles a single array task, while the submitter submits the whole set of jobs as an array.
Now you can submit the job:
cd $LUSTRE_HOME/scripts_dir
./arraysubmit.slurm filelist.txt
There are many SLURM commands that can be used to check the status of your calculations, for example:
sacct -u USERNAME
or to check your queue and a specific job ID:
squeue --user USERNAME && sacct --format=Elapsed -j JOB_ID
And of course many more commands.
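If an array job misbehaves, you can cancel it by its job ID, or cancel all your jobs at once:
scancel JOB_ID
# or cancel everything you have submitted:
scancel --user USERNAME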
In order to avoid trouble, first make sure your code works correctly when called with a single input file. Then create a file list that contains only 2 or 3 files and submit that as an array, before moving on to larger chunks.
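A small test round could look like this, reusing the file list from above:
cd $LUSTRE_HOME/scripts_dir
head -n 3 filelist.txt > filelist_test.txt
./arraysubmit.slurm filelist_test.txt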