- Check the status of licenses with `lstc_qrun`:
$ ./lstc_qrun
Defaulting to server 1 specified by LSTC_LICENSE_SERVER variable
Running Programs
All-or-nothing scaling is useful when you need to run MPI jobs that can't start until all N instances have joined the cluster.
Slurm launches instances in a best-effort fashion, i.e. if you request 10 instances but it can only get 9, it'll provision those 9 and keep trying to acquire the last one. This incurs cost for jobs that can't start until all 10 instances are available.
For example, if you submit a job like:
sbatch -N 10 ...
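With best-effort scaling, Slurm holds this job until the 10th node arrives while the first 9 sit idle. For reference, a 10-node MPI batch script might look like the following sketch (my_mpi_app is a placeholder application):

```bash
#!/bin/bash
#SBATCH --job-name=mpi-all-or-nothing
#SBATCH --nodes=10            # the job cannot start with fewer than 10 nodes
#SBATCH --exclusive

# my_mpi_app is a placeholder for your MPI application
srun ./my_mpi_app
```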
You can dynamically create a filesystem per job. This is useful for jobs that need a fast filesystem but where you don't want to pay to keep that filesystem running 24/7.
To avoid wasting time waiting for the filesystem to be created (~15 minutes), we've separated this into three separate jobs:
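The three jobs can be chained with Slurm job dependencies; a minimal sketch, assuming the stages are create, run, and delete (the script names are hypothetical placeholders):

```bash
# Create the filesystem, run the workload once creation succeeds,
# then tear the filesystem down regardless of the job's outcome.
create_id=$(sbatch --parsable create-filesystem.sh)
run_id=$(sbatch --parsable --dependency=afterok:${create_id} run-job.sh)
sbatch --dependency=afterany:${run_id} delete-filesystem.sh
```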
The following Slurm Prolog starts the CUDA MPS server on each compute node before the job is started.
cat << EOF > /opt/slurm/etc/prolog.sh
#!/bin/sh
# start the CUDA MPS daemon
nvidia-cuda-mps-control -d
EOF
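For Slurm to run the prolog, the script needs to be executable and registered in slurm.conf; a minimal sketch, assuming slurm.conf also lives under /opt/slurm/etc:

```bash
chmod +x /opt/slurm/etc/prolog.sh
# point Slurm at the prolog script and reload the configuration
echo "Prolog=/opt/slurm/etc/prolog.sh" >> /opt/slurm/etc/slurm.conf
scontrol reconfigure
```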
So naturally the first thing I wanted to do when we got fiber internet was to rename the wifi network to something sexier than "CenturyLink0483". I decided on 🚀.
To do so I navigated to the router setup page at 192.168.0.1, cringing at all the '90s tech it employs.
Then I added 🚀 and tried to update.
In AWS ParallelCluster you can set up a cluster with two queues, one for Spot pricing and one for On-Demand. When a job fails due to a Spot reclamation, you can automatically requeue it on the On-Demand queue.
To set that up, first create a cluster with a Spot and an On-Demand queue:
- Name: od
  ComputeResources:
    - Name: c6i-od-c6i32xlarge
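Only the On-Demand (od) queue is shown above; the Spot queue is defined the same way with Spot capacity. For the requeue step itself, a minimal sketch using standard Slurm commands, assuming the queues are named spot and od and the job was submitted with --requeue:

```bash
#!/bin/bash
# Requeue a job that failed on the Spot queue onto the On-Demand queue.
# $1 is the Slurm job ID of the failed job.
job_id=$1
scontrol requeue "${job_id}"
scontrol update jobid="${job_id}" partition=od
```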
Due to the EPYC architecture, it makes more sense to disable specific cores rather than let the scheduler choose which cores to run on. Each Zen 3 core belongs to a core complex (CCX) that shares an L3 cache; by disabling 1, 2, or 3 cores in the same complex, we leave more cache and memory bandwidth to the remaining cores.
To do this, you can run the attached disable-cores.sh script on each instance:
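The attached script isn't reproduced here; as a rough illustration of the approach (hypothetical, not the attached script), cores can be taken offline through sysfs:

```bash
#!/bin/bash
# Hypothetical sketch: take every other logical CPU offline via sysfs.
# Adjust the selection to match how many cores per complex you want to disable.
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    id=$(basename "${cpu}" | sed 's/cpu//')
    # CPU 0 cannot be taken offline; skip it and keep even-numbered CPUs online
    if [ "${id}" -ne 0 ] && [ $((id % 2)) -eq 1 ]; then
        echo 0 | sudo tee "${cpu}/online" > /dev/null
    fi
done
```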