@infotroph
Created March 11, 2025 21:27
model launcher script to run PEcAn jobs as Slurm arrays
#!/bin/bash
# $1: path to a file in the launch directory; joblist.txt and the log
# are read from / written to that same directory
launchdir=$(dirname "$1")
logfile="$launchdir"/slurm_submit_log.txt
if [[ -z ${SLURM_ARRAY_TASK_ID} ]]; then
    echo "SLURM_ARRAY_TASK_ID not set. Exiting." >> "$logfile"
    exit 1
fi
# joblist.txt has job script name on line 1, invocation dirs on lines 2-n
# => add 1 to each task ID to get its line number
jobscript=$(head -n1 "$launchdir"/joblist.txt)
task_line=$((SLURM_ARRAY_TASK_ID + 1))
taskdir=$(tail -n+"$task_line" "$launchdir"/joblist.txt | head -n1)
# run this task's job script, appending its output to the shared log
"$taskdir"/"$jobscript" >> "$logfile" 2>&1
if [[ "$?" != "0" ]]; then
    echo "ERROR IN MODEL RUN" >> "$logfile"
    exit 1
fi
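
For a concrete picture of the line lookup in the script above, here is a minimal sketch that builds a throwaway joblist.txt and resolves task IDs to run directories using the same "task ID + 1" rule. The file contents, the `lookup_taskdir` helper, and the empty-line guard are illustrative assumptions, not part of the gist. The guard shows one possible way to let an array task with no matching line exit cleanly, rather than trying to execute a path with an empty directory component.

```shell
#!/bin/bash
# Hypothetical helper (not in the gist): print the run directory for a
# given array task ID, or return 1 if joblist.txt has no line for it.
lookup_taskdir() {
    local joblist=$1 task_id=$2
    local task_line=$((task_id + 1))   # line 1 holds the job script name
    local taskdir
    taskdir=$(sed -n "${task_line}p" "$joblist")
    [[ -n $taskdir ]] || return 1
    printf '%s\n' "$taskdir"
}

# Throwaway joblist with made-up paths, in the layout the script expects:
# job script name first, then one invocation directory per line.
demo_dir=$(mktemp -d)
printf '%s\n' job.sh /tmp/run/ENS-001 /tmp/run/ENS-002 > "$demo_dir/joblist.txt"

lookup_taskdir "$demo_dir/joblist.txt" 2      # prints /tmp/run/ENS-002
lookup_taskdir "$demo_dir/joblist.txt" 3 || echo "no run for task 3, nothing to do"
```

With a two-run joblist, task 3 finds no line, so the helper returns non-zero and the caller can decide whether that counts as an error or as a harmless spare slot.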
@infotroph
Author

I'm using this with a <host> section in the PEcAn settings XML that looks like this:

 <host>
  <name>localhost</name>
  <outdir>output/out</outdir>
  <rundir>output/run</rundir>
  <qsub>sbatch -J @NAME@ -o @STDOUT@ -e @STDERR@</qsub>
  <qsub.jobid>.*job ([0-9]+).*</qsub.jobid>
  <qstat>squeue -j @JOBID@ || echo DONE</qstat>
  <modellauncher>
    <binary>tools/slurm_array_submit.sh</binary>
    <qsub.extra>-a 1-@NJOBS@</qsub.extra>
  </modellauncher>
 </host>
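
On a successful submission, sbatch prints one line of the form `Submitted batch job <id>`, which is what the <qsub.jobid> pattern above is written to match. As a rough check of that pattern outside PEcAn, the sketch below applies it as a bash regex to a sample line. The job ID is made up, and this says nothing about how PEcAn itself applies the pattern internally.

```shell
#!/bin/bash
# Sample line in the format sbatch prints on success; the ID is made up.
sbatch_output="Submitted batch job 12345"

# Same pattern as in <qsub.jobid>, used here as a bash extended regex.
jobid_pattern='.*job ([0-9]+).*'

if [[ $sbatch_output =~ $jobid_pattern ]]; then
    # The first capture group holds the numeric job ID.
    jobid=${BASH_REMATCH[1]}
    echo "parsed job ID: $jobid"
else
    echo "could not parse a job ID from: $sbatch_output" >&2
fi
```

For an array submission this ID is the parent job ID shared by all tasks, so a single `squeue -j` call on it covers the whole array.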

@infotroph
Author

Next steps

  • When submitting multiple batches (i.e., when the number of jobs in a run exceeds settings$host$modellauncher$Njobmax), every batch currently gets Njobmax array slots even if the last batch isn't full. I don't know how much wasted overhead this causes; maybe the extras just exit immediately? If needed, we could precalculate how many jobs each batch needs and size the arrays accordingly.
  • Feels a little silly (though maybe nice for debugging?) to do all the work of writing out separate joblists rather than just reading lines from rundir/runs.txt, which PEcAn always generates upstream. If making that switch, we should consider whether to make the existing modellauncher work the same way.
  • Calling it this way makes slurm_array_submit.sh a wrapper around the launcher.sh wrapper that PEcAn already writes. Can we build array support into, say, setup_modellauncher() instead?
  • Seems wise to have some guardrails to avoid collision between standard qsub and array mode, lest we launch N arrays of N jobs each.
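
The first item above comes down to a little arithmetic at submission time. Here is a minimal sketch, assuming each batch gets its own joblist.txt with run directories starting on line 2 (as the launcher script expects), so every array can start at 1. `array_ranges`, `n_runs`, and `njobmax` are hypothetical names, and this is not how PEcAn currently builds its `-a` argument.

```shell
#!/bin/bash
# Hypothetical helper: print one "-a 1-N" range per batch, where the last
# batch only gets as many array slots as there are runs left.
array_ranges() {
    local n_runs=$1 njobmax=$2
    local remaining=$n_runs size
    while (( remaining > 0 )); do
        # Full batches get njobmax slots; the final batch gets the remainder.
        size=$(( remaining < njobmax ? remaining : njobmax ))
        echo "-a 1-$size"
        remaining=$(( remaining - size ))
    done
}

# 25 runs with at most 10 per batch: two full arrays and one of 5.
array_ranges 25 10
```

Sizing the last array this way would mean no task ever looks up a line that joblist.txt doesn't have, which also sidesteps the question of what the spare slots do.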

@infotroph
Author

Also: Before getting too far into the weeds on editing this, evaluate whether we can adopt an existing framework - see especially future.batchtools.
