In this tutorial we will work through setting up Slurm Accounting. This enables many features within Slurm, including job resource tracking, and provides a necessary building block for Slurm federation.
The first requirement is to set up an external database that Slurm can use to store the accounting data.
Use the following CloudFormation Quick-Create link to create the database in
your AWS account. Note that if you would like to create the database in a
different region, change the value of the region parameter in the URL to the
region of your choice and reload the page.
When you're creating the stack, be sure to set the VPC ID and Subnets
parameters to the VPC and subnets where you will create your cluster. All other
values should be suitable as defaults; however, feel free to change the database
instance type depending on your workload needs.
Change the region in the URL to create the stack in a region other than us-east-1.
Once the stack has reached a Completed state, go to the Outputs
tab of the stack and make note of the properties, as they will be used in the creation of your cluster.
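If you prefer to collect the outputs programmatically rather than copying them from the console, the following is a minimal sketch. It assumes you have saved the JSON returned by `aws cloudformation describe-stacks` for the stack; the helper name `stack_outputs` is hypothetical, and the sample values are placeholders rather than real output. The output keys are the ones referenced later in this tutorial.

```python
import json


def stack_outputs(describe_stacks_json: str) -> dict:
    """Flatten the Outputs of the first stack into {OutputKey: OutputValue}."""
    response = json.loads(describe_stacks_json)
    outputs = response["Stacks"][0].get("Outputs", [])
    return {item["OutputKey"]: item["OutputValue"] for item in outputs}


# Placeholder response in the shape returned by `describe-stacks`.
# All values below are made up for illustration.
SAMPLE = json.dumps({
    "Stacks": [{
        "StackName": "slurm-accounting",
        "Outputs": [
            {"OutputKey": "SlurmDbSecurityGroupId",
             "OutputValue": "sg-12345678910"},
            {"OutputKey": "SlurmDbPasswordSecretArn",
             "OutputValue": "arn:aws:secretsmanager:us-east-2:1234567890:secret:DbPasswdSecret"},
            {"OutputKey": "SlurmDbEndpoint",
             "OutputValue": "slurmdb-rds-instance.c123456.us-east-2.rds.amazonaws.com"},
        ],
    }]
})

if __name__ == "__main__":
    for key, value in stack_outputs(SAMPLE).items():
        print(f"{key} = {value}")
```

Keeping the outputs in a dictionary makes it easy to paste the right value into each field of the cluster wizard later on.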
In order to allow our cluster access to secrets we need to add an additional IAM policy.
- Go to the Lambda Console (deeplink) and search for PclusterManagerFunction
- Select the function, then Configuration > Permissions > click on the role under Role name.
- Select Add permissions > Attach policies > search for SecretsManagerPolicy
- Click Attach policies
Next, go to Pcluster Manager and choose the Create option to create a new cluster.
- Select the Wizard option and click Next
Choose a suitable name for your cluster, and then in the Cluster Properties window, be sure to choose the VPC that you used when creating the slurm-accounting CloudFormation stack.
You will need to enable the Virtual Console option, as that allows Pcluster Manager to interact with the cluster directly.
Be sure to also enable the Security Group referenced in the CloudFormation outputs so that the HeadNode can access the database.
Next we'll enable a pre-built script that will install Slurm accounting on the HeadNode.
- Choose the advanced options
- Under the On Configured option, choose the Multi-Script Runner, which has some pre-programmed scripts in it
- In the search box choose Slurm Accounting
- Fill in the values for the Secret ARN and RDS Endpoint from the CloudFormation output
- Under IAM Policies add the ARN from the CloudFormation stack output SecretsManagerPolicy so that the HeadNode can access the password to the database. Be sure to actually click Add so that it is added to the list.
After you've configured the HeadNode, Filesystem and Queues, you'll be asked to review the config. The following parameters must be set:
Parameter | Description |
---|---|
AdditionalSecurityGroups | SlurmDbSecurityGroupId (CloudFormation) |
AdditionalIamPolicies | AmazonSSMManagedInstanceCore, SecretsManagerPolicy (CloudFormation) |
CustomActions/OnNodeConfigured | multi-runner.py |
Arg 0: Accounting Script | slurm-accounting.sh |
Arg 1: SECRET_ARN | SlurmDbPasswordSecretArn (CloudFormation) |
Arg 2: RDS Endpoint | SlurmDbEndpoint (CloudFormation) |
Arg 3: Port | Default is 3306 |
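As an illustration of how the arguments in the table fit together, here is a small sketch that assembles the OnNodeConfigured section from the CloudFormation values. The function name `accounting_custom_action` is hypothetical; the script URLs and the leading `-` on each value are copied from the example config in this tutorial, and the sketch assumes that prefix is how values are attached to the preceding script.

```python
# Script locations as they appear in the example config of this tutorial.
MULTI_RUNNER_URL = (
    "https://raw.githubusercontent.com/aws-samples/pcluster-manager/"
    "main/resources/scripts/multi-runner.py"
)
ACCOUNTING_SCRIPT_URL = (
    "https://raw.githubusercontent.com/aws-samples/pcluster-manager/"
    "main/resources/scripts/slurm-accounting.sh"
)


def accounting_custom_action(secret_arn: str, rds_endpoint: str, port: int = 3306) -> dict:
    """Build the OnNodeConfigured mapping for the Slurm accounting script.

    Argument order follows the table: script, secret ARN, RDS endpoint, port.
    Each value after the script gets a leading '-' as in the example config.
    """
    return {
        "Script": MULTI_RUNNER_URL,
        "Args": [
            ACCOUNTING_SCRIPT_URL,
            f"-{secret_arn}",
            f"-{rds_endpoint}",
            f"-{port}",
        ],
    }


if __name__ == "__main__":
    # Placeholder values; substitute the outputs of your own stack.
    action = accounting_custom_action(
        "arn:aws:secretsmanager:us-east-2:1234567890:secret:DbPasswdSecret",
        "slurmdb-rds-instance.c123456.us-east-2.rds.amazonaws.com",
    )
    for arg in action["Args"]:
        print(arg)
```

The port defaults to 3306 to match the table, so it only needs to be passed when the database listens elsewhere.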
Here's an example config file for reference. Take a look at the comments to see what's required:
HeadNode:
  InstanceType: t2.micro
  Networking:
    SubnetId: subnet-12345678910
    AdditionalSecurityGroups:
      - sg-12345678910 # Security Group `SlurmDbSecurityGroup`
  Iam:
    AdditionalIamPolicies:
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
      - Policy: arn:aws:iam::12345678910:policy/accounting-SecretsManagerPolicy-1ULXG84GWOZ05 # Policy `SecretsManagerPolicy`
  CustomActions:
    OnNodeConfigured:
      Script: >-
        https://raw.githubusercontent.com/aws-samples/pcluster-manager/main/resources/scripts/multi-runner.py
      Args:
        - >-
          https://raw.githubusercontent.com/aws-samples/pcluster-manager/main/resources/scripts/slurm-accounting.sh
        - '-arn:aws:secretsmanager:us-east-2:1234567890:secret:DbPasswdSecret' # `SlurmDbPasswordSecretArn`
        - '-slurmdb-rds-instance.c123456.us-east-2.rds.amazonaws.com' # RDS Endpoint `SlurmDbEndpoint`
        - '-3306' # Default Port 3306
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: queue0
      ComputeResources:
        - Name: queue0-t2-micro
          MinCount: 0
          MaxCount: 4
          InstanceType: t2.micro
      Networking:
        SubnetIds:
          - subnet-12345678910
Region: us-east-2
Image:
  Os: alinux2
Once the cluster has been successfully created, go to the Scheduling tab and select Submit Job.
Choose a name for your job and the number of nodes to run on, choose to Run a command, and provide a simple sleep 30 command.
Once you've submitted a job, you can see the job information under the Accounting tab.
You can use any of the filters at the top to narrow down the number of jobs in the view to select specific jobs.
If you choose the Job ID in the left column, you can see further details about the job.
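The same accounting records can also be queried on the HeadNode with Slurm's `sacct` command. As a rough sketch, assuming you have captured the output of `sacct --parsable2 --format=JobID,JobName,State,Elapsed`, the snippet below turns the pipe-delimited text into a list of dictionaries. The helper name `parse_sacct` is hypothetical, and the sample text is illustrative rather than captured from a real cluster.

```python
from typing import Dict, List


def parse_sacct(text: str) -> List[Dict[str, str]]:
    """Parse `sacct --parsable2` output: a header row, then one row per job step."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split("|")
    return [dict(zip(header, line.split("|"))) for line in lines[1:]]


# Illustrative output only; real job IDs, names, and timings will differ.
SAMPLE = """\
JobID|JobName|State|Elapsed
1|sleep-test|COMPLETED|00:00:30
1.batch|batch|COMPLETED|00:00:30
"""

if __name__ == "__main__":
    for row in parse_sacct(SAMPLE):
        print(row["JobID"], row["State"], row["Elapsed"])
```

If `sacct` returns rows for the job you submitted, the accounting database connection is working end to end.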