- ~60 nodes
- mixture of:
  - 24 cores x 250GB memory / node
  - 32 cores x 252GB memory / node
- No Swap
- Dedicated 10Gbps connection to Bucket / node
- IB Backend for MPI - not available on 'redshirt' machines
- Slurm Scheduler is used for resource management
- Treats CPU, memory, and time as consumable resources. Time is divided into a handful of bands and job priority is skewed based on the amount of time requested for the job (TEST_MINS = 5, NORMAL_MINS = 240, LONG_MINS = 2880, VLONG_MINS = 5760). Shorter jobs get higher priority, and no job can run for more than 5760 minutes (4 days).
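For example (a quick sketch; 'myjob.sh' is just a placeholder for your own submission script), the band a job falls into follows from the time it requests:
sbatch -t 5 myjob.sh      # 5 minutes or less: the test band, highest priority
sbatch -t 240 myjob.sh    # up to 240 minutes: the normal band
sbatch -t 2880 myjob.sh   # up to 2880 minutes: the long band
sbatch -t 5760 myjob.sh   # 5760 minutes (4 days) is the most the scheduler will accept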
This tool is used for the actual submission of batch tasks to the cluster: supplied with a correctly formatted shell script, it will execute the contents of that script on a cluster node. While the flags that control sbatch can be supplied on the command line, we strongly recommend you include them in the submitted file. Doing so makes it much easier for us to help debug issues you are seeing, since all the information is in one place.
Example sbatch enabled script:
#!/usr/bin/env bash
# name the job pybench33 and place its output in a file named slurm-<jobid>.out
# allow 40 minutes to run (it should not take 40 minutes however)
# set partition to 'all' so it runs on any available node on the cluster
#SBATCH -J 'pybench33'
#SBATCH -o slurm-%j.out
#SBATCH -p all
#SBATCH -t 40
#SBATCH --mail-user=gmcgrath@princeton
#SBATCH --mail-type=END,FAIL
#SBATCH -c 4
module load anacondapy/3.4
. activate pybench33
./pybench33.py
This will submit a single task to the scheduler to run a Python-based benchmark and then quit.
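Assuming the script above is saved as pybench33.sh (the file name is a placeholder), submitting it is just:
sbatch pybench33.sh
sbatch prints the job id it assigned, e.g. 'Submitted batch job 123456', and that number is what appears in the slurm-<jobid>.out file name.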
In its simplest form this provides the current state and load of the cluster on request. It will tell you how many nodes are offline, in use, fully utilized, or idle at the point in time when you run the command.
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug up infinite 1 idle spock-c0-1
all* up infinite 32 mix spock-c0-[2,4-16],spock-c1-[1,3-12,14-16],spock-c2-[1-4]
all* up infinite 3 alloc spock-c0-3,spock-c1-[2,13]
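If you want a per-node view rather than the partition summary above, sinfo can also print node-oriented output (a quick illustration, not an exhaustive tour of its options):
sinfo -N -l      # one line per node, including CPUs, memory, and state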
This displays job status information as seen by the scheduler itself. This tool is extremely useful when trying to determine the current status of your individual jobs and the current status of the cluster as a whole. It has a great many options, so reviewing the main documentation can prove very helpful. For example, a list of all pending jobs and their associated priority scores can be retrieved with:
squeue -t PENDING -o "%.6i %p %u"
JOBID PRIORITY USER
522875 0.00000245543197 adamsc
522874 0.00000245543197 adamsc
516646 0.00000241841190 anqiw
516645 0.00000241841190 anqiw
516647 0.00000241841190 anqiw
...
522876 0.00000110408291 gmcgrath
510384 0.00000011664815 janc
513010 0.00000011664815 janc
513013 0.00000011664815 janc
Adding the '-u <username>' flag will filter the list down to just the jobs belonging to that user. This is extremely useful for predicting where you are in line to run a job (my job will be there for a while).
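For instance, reusing the format string from above and the gmcgrath user from the listing purely as an illustration:
squeue -u gmcgrath -t PENDING -o "%.6i %p %u"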
Not covered in depth today, but this tool exists to run an array of tasks simultaneously. For example, you can supply an sbatch script that creates multiple tasks at once and srun will run the subtasks simultaneously for you:
multiprog.conf
0 echo 'I am the Primary Process'
1-3 bash -c 'printenv SLURM_PROCID'
could be invoked with srun --multi-prog multiprog.conf
to run all tasks in parallel as one job via sbatch.
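A minimal sketch of such an sbatch wrapper, assuming the multiprog.conf above and placeholder values for the job name and time limit:
#!/usr/bin/env bash
#SBATCH -J 'multiprog-demo'
#SBATCH -p all
#SBATCH -t 10
# request 4 tasks, one per rank (0-3) defined in multiprog.conf
#SBATCH -n 4
srun --multi-prog multiprog.conf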
srun is capable of many more things; this is just a quick example of what it can do.
- Basic Resource Management
- Matlab on Spock
- Python on Spock
- Julia on Spock
- Module usage on Spock and Scotty
Many times the software isn't actually missing, but its availability isn't announced either. This software typically requires a module load command to enable it, as we've limited the number of 'default' tools in the path. This helps avoid collisions and surprises when we update things, like suddenly getting a new version of matlab that doesn't quite work the way you expected.
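For example, to see what can be loaded and then enable a specific version rather than whatever default happens to be in the path (the matlab version string here is purely illustrative; check the avail listing for what actually exists):
module avail                 # list every module that can be loaded
module load matlab/R2016b    # enable one specific version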
Unlike memory, the time limit isn't quite as aggressive: you get a small grace period after hitting your time limit. During this time a generic stop signal is sent to the process to tell it to shut down. If the job doesn't shut down, then at the end of the grace period the cgroup is terminated and all of its processes go with it.
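If your job can save its own state, one way to use that grace period is to catch the stop signal in the batch script. A rough sketch, assuming the signal is SIGTERM and that checkpoint.sh and long_task.sh are placeholders for your own scripts:
# when the scheduler tells us to stop, save state and exit cleanly
trap './checkpoint.sh; exit 0' TERM
# run the real work in the background so the trap can fire while we wait
./long_task.sh &
wait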
This is the most common reason a job is killed. It is also the largest source of confusion among end users, because the memory reports may not reflect the actual maximum amount of RAM used. Unfortunately the individual system kernels can still see that information even if the accounting system cannot, and they will kill jobs because of it.
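Once the job has finished, you can compare what the accounting system recorded against what was requested; a quick sketch using sacct (the job id is a placeholder), keeping in mind the caveat above that the recorded maximum can understate the true peak:
sacct -j <jobid> --format=JobID,State,ReqMem,MaxRSS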