Action | SGE | Slurm | Torque |
---|---|---|---|
Submit Interactive Job | qlogin | srun | qsub -I |
Submit Batch Job | qsub | sbatch | qsub |
Number of Slots | -pe mpi [n] | -n [n] | -l ppn=[n] |
Number of Nodes | -pe mpi [slots * n] | -N [n] | -l nodes=[n] |
Cancel Job | qdel | scancel | qdel |
See Queue | qstat | squeue | qstat |
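For example, submitting the same 8-slot MPI batch job looks like this under each scheduler (flags taken from the table above; `job.sh` is just a placeholder script name):

```bash
# SGE: request 8 slots in the "mpi" parallel environment
qsub -pe mpi 8 job.sh

# Slurm: request 8 tasks (add -N to also fix the node count)
sbatch -n 8 job.sh

# Torque: request 1 node with 8 processors per node
qsub -l nodes=1:ppn=8 job.sh
```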
- Set up a cluster with DCV enabled
- Install the native client from the NICE DCV download page
- Create a script `pcluster-dcv-connect.py` with the contents as shown below
- Execute that script
# make sure you have pcluster installed
$ pcluster list --color
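With the CLI working, the last step is to run the connect script against your cluster. The exact arguments depend on how you write `pcluster-dcv-connect.py`, so the `--cluster-name` flag below is only an assumed example:

```bash
# hypothetical invocation; adjust the argument to whatever CLI the script exposes
$ python3 pcluster-dcv-connect.py --cluster-name mycluster
```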
This binary cache is a subset of the Exascale Computing Project's Extreme-Scale Scientific Software Stack (E4S) (https://oaciss.uoregon.edu/ecp/).
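Before the install commands in the table below can pull from the cache, spack needs the cache registered as a mirror. A minimal sketch, with a placeholder URL (substitute the one published for this cache):

```bash
# register the binary cache as a mirror (placeholder URL)
spack mirror add binary-cache https://example-bucket.s3.us-east-2.amazonaws.com/spack-mirror

# optionally trust the cache's signing keys instead of passing --no-check-signature
spack buildcache keys --install --trust
```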
Package | Install command | Working? |
---|---|---|
openfoam | spack install --no-check-signature --cache-only openfoam | ✅ |
gromacs | spack install --no-check-signature --cache-only gromacs | ✅ |
gromacs without SLURM/PMI support | spack install --no-check-signature --cache-only gromacs ^openmpi~pmi schedulers=none | ✅ |
ior | spack install --no-check-signature --cache-only ior | ✅ |
osu-micro-benchmarks | spack install --no-check-signature --cache-only osu-micro-benchmarks | ✅ |
#!/usr/bin/env python3
import json
from base64 import b64decode, b64encode
from pprint import pprint

import boto3
import botocore
import yaml
import requests


def sigv4_request(method, host, path, params, headers, body):
When DCV is enabled, the default behaviour of AWS ParallelCluster is to run a single DCV session on the head node. This is a quick and easy way to visualize the results of your simulations or to run a desktop application such as StarCCM+.
A common ask is to run DCV sessions on a compute queue instead of the head node. This has several advantages, namely:
- Run multiple sessions on the same instance (possibly with a different user per session)
- Run a smaller head node and only spin up the more expensive DCV instances when needed. We set a 12-hour timer below that automatically kills sessions after we leave.
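As a minimal sketch of the idea (assuming the compute nodes boot from a DCV-enabled image and sit in a queue named `dcv`; both are assumptions, not part of the setup above), a session on a compute node can be requested with a batch job whose time limit enforces the 12-hour cutoff:

```bash
#!/bin/bash
#SBATCH --job-name=dcv-session
#SBATCH --partition=dcv        # assumed queue backed by a DCV-enabled AMI
#SBATCH --exclusive
#SBATCH --time=12:00:00        # Slurm reclaims the node after 12 hours

# start a virtual DCV session owned by the submitting user
dcv create-session --owner "$USER" "${USER}-session"

# print where to point the native client
hostname
dcv list-sessions

# hold the allocation open until the time limit ends the job
sleep infinity
```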
In this example we're going to set up an HPC environment with AWS ParallelCluster and connect it to Microsoft AD, an AWS service that lets you create managed Active Directory user pools. You can read more about it in the AD Tutorial.
You have three options for the AD provider; we're going with Microsoft AD because of its regional availability, which lets us run it in the same region (Ohio) as our hpc6a.48xlarge instances.
Type | Description |
---|---|
Simple AD | Open AD protocol, supported in only a [few](https://docs.aws.amazon.com/directoryservice/ |
FSx for NetApp ONTAP is a multi-protocol filesystem: it mounts as SMB on Windows, NFS on Linux, and either protocol on macOS. This lets cluster users bridge their Windows and Linux machines with the same filesystem, potentially running both Windows and Linux machines in a post-processing workflow.
Pros
- Multi-Protocol
- Hybrid support
- Multi-AZ (for High Availability)
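To make the multi-protocol point concrete, the same volume can be attached from both sides. The SVM endpoint and volume names below are placeholders; use the endpoints shown in the FSx console:

```bash
# Linux: mount the volume over NFS (placeholder SVM DNS name and volume path)
sudo mkdir -p /mnt/fsx
sudo mount -t nfs svm-example.fs-0123456789abcdef0.fsx.us-east-2.amazonaws.com:/vol1 /mnt/fsx
```

On the Windows side the same volume is exposed as an SMB share on the SVM, which can be mapped as a network drive (for example with `net use`).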
This guide describes how to mount an FSx for Lustre filesystem. I give an example CloudFormation stack to create the AWS Batch resources.
I loosely follow this guide.
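For reference, the Batch compute environment typically mounts the filesystem on the container host (for example through launch template user data) with the standard FSx for Lustre mount command; a sketch with placeholder DNS and mount names:

```bash
# install the Lustre client first, then mount (DNS name and mount name are placeholders)
sudo mkdir -p /fsx
sudo mount -t lustre -o relatime,flock \
    fs-0123456789abcdef0.fsx.us-east-2.amazonaws.com@tcp:/abcdefgh /fsx
```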
For the parameters, it's important that the Subnet, Security Group, FSx ID, and FSx Mount Name follow the guidelines below:
Parameter | Description |
---|---|
Spot interruption gives a 2-minute warning before the instance is terminated. This window allows you to gracefully save data in order to resume later.
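As a rough sketch of how that warning can be caught (assuming IMDSv1 is reachable; with IMDSv2 you would first request a session token), a node can poll the instance metadata service for the interruption notice:

```bash
# the spot/instance-action endpoint returns 404 until an interruption is scheduled
while true; do
  if curl -fs http://169.254.169.254/latest/meta-data/spot/instance-action > /dev/null; then
    echo "Spot interruption notice received, checkpointing..."
    # save simulation state here, e.g. trigger the solver's checkpoint/abort file
    break
  fi
  sleep 5
done
```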
In the following I describe how this can be done with StarCCM+ in AWS ParallelCluster 3.X:
- Create a post-install script `spot.sh` like so: