In addition to Harvard's fantastic list, here are some other convenient SLURM commands.
Delaying a job's start time is useful for letting other enqueued jobs run first, without having to kill and re-run jobs that are already running. To delay a job for 7 days:
scontrol update JobID=<JOB ID> StartTime=now+7days
The NODELIST(REASON) field reported by squeue for the delayed jobs will become (BeginTime).
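If you'd like to push back every job you currently have pending in one go, here's a sketch built on the same command (squeue's -t PENDING, -h, and -o "%i" flags print just your pending job IDs, one per line):
for jobid in $(squeue -u $USER -t PENDING -h -o "%i"); do scontrol update JobID=$jobid StartTime=now+7days; done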
When suspend and hold don't seem to do anything!
You may want to stop running jobs and requeue them further down the queue (i.e. avoid immediately re-running them). This is useful for freeing up nodes to let other jobs run without having to resubmit your running jobs.
To requeue a job and delay it for one day:
(export jobid=<JOB ID>; scontrol requeue $jobid; scontrol update JobID=$jobid StartTime=now+1day)
If you have many jobs (with unique job IDs), you'll want to type out a list of jobs to requeue and delay using a for loop:
for jobid in <SPACE SEPARATED LIST OF JOB IDS>; do scontrol requeue $jobid; scontrol update JobID=$jobid StartTime=now+1day; done
If many of your jobs share a common prefix which you don't want to retype, export it!
(export prefix=<COMMON JOB ID PREFIX>; for suffix in <SPACE SEPARATED LIST OF JOB ID SUFFIXES>; do scontrol requeue ${prefix}${suffix}; scontrol update JobID=${prefix}${suffix} StartTime=now+1day; done)
For example, given the following list of job IDs...
1234567_10
1234567_11
1234567_12
1234567_13
1234567_14
if you want to requeue + delay jobs 1234567_11 and 1234567_12 for 2 days, you'd call
(export prefix=1234567_1; for suffix in 1 2; do scontrol requeue ${prefix}${suffix}; scontrol update JobID=${prefix}${suffix} StartTime=now+2days; done)
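Those job IDs look like tasks of a job array (1234567); if that's the case and you'd rather not type out the suffixes at all, here's a sketch that pulls the running task IDs straight from squeue instead (-r lists one array task per line, -h drops the header):
for jobid in $(squeue -j 1234567 -t RUNNING -r -h -o "%i"); do scontrol requeue $jobid; scontrol update JobID=$jobid StartTime=now+2days; done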
Note that SLURM will often not list the re-queued jobs in squeue, but rest assured, they're still enqueued!
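If you'd like to double-check, scontrol show job still reports requeued jobs; for instance, for one of the requeued job IDs:
scontrol show job <JOB ID> | grep -E 'JobState|StartTime'
You should see JobState=PENDING together with the delayed StartTime.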
Take care to ensure your jobs have everything they need (e.g. files) when they're eventually re-run.
Keep in mind that re-queued jobs may behave differently when re-run; think carefully about, e.g., your random seeding!
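One way to handle seeding, sketched here assuming your job is an sbatch script: SLURM sets SLURM_RESTART_COUNT when a job has been requeued, so you can fold it into the seed (train.py and --seed are placeholders for your actual workload):
#!/bin/bash
#SBATCH --requeue
# SLURM_RESTART_COUNT counts how many times this job has been requeued
# (it's unset on the first run, hence the :-0 default).
restart=${SLURM_RESTART_COUNT:-0}
seed=$(( ${SLURM_ARRAY_TASK_ID:-0} * 1000 + restart ))
echo "restart ${restart}; using seed ${seed}"
python train.py --seed "${seed}"   # placeholder for your actual command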