Sometimes it is easier to start from a Docker image that already contains most of the things you need. You use it as the base of your definition file and then add everything else on top, for example:
Bootstrap: docker
From: rootproject/root
... plus all the Python packages, pip installs and environment settings
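As an illustration only, a complete definition file (myimage.def, as used in the build command below) might look roughly like the following; the apt-get line and the pip packages are placeholders for whatever your analysis actually needs, and they assume the base image is Ubuntu/Debian based:
Bootstrap: docker
From: rootproject/root

%post
    # assumption: the base image is Ubuntu-based and may not ship pip
    apt-get update && apt-get install -y python3-pip
    # example packages only -- replace with what your analysis needs
    pip3 install numpy scipy uproot

%environment
    # hypothetical example variable, adjust or remove as needed
    export MY_ANALYSIS_OPTS="--fast"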
The next step is to build the image, which needs privileges:
docker run --rm --privileged -v ${PWD}:/PWD quay.io/singularity/singularity:v3.7.4-slim build /PWD/myimage.sif /PWD/myimage.def
Of course the image should contain only the executables and libraries, not the analysis scripts themselves; otherwise you cannot debug your own code easily.
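Once the build has finished, a quick sanity check that the container provides what you expect is worthwhile. For example, on a machine where singularity is available (e.g. on the cluster itself), and assuming the image contains Python and ROOT as above:
singularity exec myimage.sif python3 --version
singularity exec myimage.sif root -b -q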
First log in to the pool, from there log in to virgo, and from there you can access lustre. For GPU nodes you need to choose a different submit host, as stated in the documentation. Next, set up the environment variables you will need, for example:
cd /lustre/some_experiment
export LUSTRE_HOME=$PWD
For data input, output files and scripts, imagine you have the following structure:
mkdir $LUSTRE_HOME/source_dir
mkdir $LUSTRE_HOME/output_dir
mkdir $LUSTRE_HOME/scripts_dir
Make sure $LUSTRE_HOME/output_dir is empty before running the job on the farm, for example by cleaning it as sketched below.
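A minimal way to do that, assuming everything in the output directory really can be thrown away:
ls -A $LUSTRE_HOME/output_dir     # check what is in there first
rm -f $LUSTRE_HOME/output_dir/*   # then remove it if it is no longer needed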
To work with SLURM arrays, the input files are passed to the scripts via a file list. So assuming your data files are *.dat files, create that list first:
cd $LUSTRE_HOME/source_dir
find $PWD -type f -name "*.dat" | sort > $LUSTRE_HOME/scripts_dir/filelist.txt
In the scripts directory you need at least three scripts plus the image file:
- The actual analysis script in Python (number_cruncher.py)
- The worker script (work.slurm)
- The submitter script (arraysubmit.slurm)
- The image file (myimage.sif)
Here is number_cruncher.py. It needs command-line arguments for the input files and the output directory; do not hard-code these paths in the script.
#!/usr/bin/env python
"""
This is the number cruncher script in Python.
"""
import argparse
import sys
import os
...
def process(filename, outdir, parameter):
    pass  # this one does the actual job for a single input file
...
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('filenames', nargs='+', type=str,
                        help='Name of the input files.')
    parser.add_argument('-o', '--outdir', type=str, default='.',
                        help='Output directory.')
    # -d receives SOME_PARAMETER from the worker script below
    parser.add_argument('-d', '--parameter', type=float, default=0.0,
                        help='Extra parameter for the calculation.')
    ...
    args = parser.parse_args()
    # looping over the files lets one call handle several inputs,
    # which avoids starting the Python interpreter many times
    for filename in args.filenames:
        process(filename, args.outdir, args.parameter)

if __name__ == '__main__':
    main()
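Before wiring the script into SLURM, it helps to run it by hand on a single file. The file name and parameter value below are just placeholders:
cd $LUSTRE_HOME/scripts_dir
python3 number_cruncher.py -d 107 -o $LUSTRE_HOME/output_dir $LUSTRE_HOME/source_dir/run001.dat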
Now you need a worker script; we call it work.slurm:
#!/bin/bash
if [ $# -eq 0 ]; then
    echo "No file list provided, aborting."
    exit 1
fi
FILELISTNAME=$1
FILES=($(cat "$FILELISTNAME"))
SOME_PARAMETER=107
OUTDIR=$LUSTRE_HOME/output_dir
singularity exec $LUSTRE_HOME/scripts_dir/myimage.sif /bin/sh -c "python3 $LUSTRE_HOME/scripts_dir/number_cruncher.py -d $SOME_PARAMETER -o $OUTDIR ${FILES[$SLURM_ARRAY_TASK_ID]}"
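You can dry-run the worker outside of SLURM by setting the array task ID yourself; this is only a local test, assuming the file list already exists and singularity is available on the node you are logged in to:
cd $LUSTRE_HOME/scripts_dir
chmod +x work.slurm
SLURM_ARRAY_TASK_ID=0 ./work.slurm filelist.txt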
In the worker script you can also set SOME_PARAMETER, which is passed on to your calculation script. The submitter script arraysubmit.slurm looks like this:
#!/bin/bash
if [ $# -eq 0 ]; then
    echo "No file list provided, aborting."
    exit 1
fi
FILELISTNAME=$1
NUM_TASKS=$(wc -l < "$FILELISTNAME")  # one task per line (or read the list into an array and use ${#FILES[@]})
MAX_ID=$((NUM_TASKS - 1))
sbatch --array=0-$MAX_ID --mem=60G --time=15:00 --partition=main --ntasks=1 --job-name=calc work.slurm $FILELISTNAME
In the sbatch line of the submitter script you can adjust the amount of RAM, the wall time and the partition. The number of tasks, CPUs per task and other related settings can also be set here.
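It can also be convenient to give every array task its own log file. As a sketch, the sbatch call could be extended with the --output option, where %A is the array job ID and %a the task ID; the file name pattern is just an example:
sbatch --array=0-$MAX_ID --mem=60G --time=15:00 --partition=main --ntasks=1 --job-name=calc \
       --output=$LUSTRE_HOME/output_dir/calc_%A_%a.log work.slurm $FILELISTNAME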
The worker and submitter scripts may look similar with respect to the file-list input, but the worker runs a single task, while the submitter submits many such tasks as one array job.
Now you can submit the job:
cd $LUSTRE_HOME/scripts_dir
./arraysubmit.slurm filelist.txt
There are many SLURM commands that can be used to check the status of your calculations, for example:
sacct -u USERNAME
or check a specific job ID:
squeue --user USERNAME && sacct --format=Elapsed -j JOB_ID
And of course there are many more commands.
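Two more that tend to be useful, shown here only as examples, are asking sacct for a few extra columns such as the state and peak memory of each array task, and cancelling a job:
sacct -j JOB_ID --format=JobID,State,Elapsed,MaxRSS
scancel JOB_ID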
To avoid trouble, please make sure your code actually works correctly when called on single files first. Then create a file list that contains only two or three files and submit that as an array, before starting with larger chunks.
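A minimal way to do such a test run, assuming filelist.txt from above already exists, is:
cd $LUSTRE_HOME/scripts_dir
head -n 3 filelist.txt > testlist.txt
./arraysubmit.slurm testlist.txt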