AWS ParallelCluster + AWS Batch

Today I'm going to demonstrate running High Performance Conjugate Gradients (HPCG) as a containerized workload. This takes advantage of AWS ParallelCluster, AWS Batch, and OpenMPI.

First install aws-parallelcluster:

$ pip install aws-parallelcluster

Edit the ParallelCluster config file to include the awsbatch cluster configuration:

$ vim ~/.parallelcluster/config

Add the following to this file. You'll need a public and a private subnet; see Public Private Networking for instructions on how to set that up.

[global]
update_check = true
sanity_check = true
cluster_template = awsbatch

[aws]
aws_region_name = us-east-1

[cluster awsbatch]
scheduler = awsbatch
key_name = [your key]
min_vcpus = 72
desired_vcpus = 72
max_vcpus = 288
vpc_settings = public-private
master_instance_type = c5.xlarge
compute_instance_type = c5n.18xlarge

[vpc public-private]
vpc_id = vpc-00d2e489741609bc2
master_subnet_id = subnet-0152608e422c75189
compute_subnet_id = subnet-0baadf9781f59a6a1

Now, create the cluster:

$ pcluster create hpcg
Creating stack named: parallelcluster-hpcg
Status: parallelcluster-hpcg - CREATE_COMPLETE
ClusterUser: ec2-user
MasterPublicIP: 54.35.249.0
MasterPrivateIP: 10.0.0.35

Once that's completed, ssh in. You may have to specify the key path with the -i flag if you're not using a default key.

$ pcluster ssh hpcg -i ~/.ssh/id_rsa

Running awsbhosts shows you the hosts that are running:

[ec2-user@ip-10-0-0-182 ~]$ awsbhosts
ec2InstanceId        instanceType    privateIpAddress    publicIpAddress      runningJobs
-------------------  --------------  ------------------  -----------------  -------------
i-07148c539c09ae9b8  c5n.18xlarge    10.0.1.171          -                              0

You can see there's one c5n.18xlarge instance running. This is because we set min_vcpus = 72; had we set min_vcpus = 0, there would be no hosts running.
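
The relationship between the vCPU settings and the number of running hosts can be sketched with a little shell arithmetic. This is a simplification: it assumes every instance is a c5n.18xlarge with 72 vCPUs and that partial instances round up. The helper name instances_for is made up for illustration.

```shell
#!/bin/sh
# Sketch: how many c5n.18xlarge instances a vCPU count corresponds to.
# Assumption: 72 vCPUs per c5n.18xlarge; partial instances round up.
VCPUS_PER_INSTANCE=72

instances_for() {
    # Ceiling division: (n + d - 1) / d
    echo $(( ($1 + VCPUS_PER_INSTANCE - 1) / VCPUS_PER_INSTANCE ))
}

echo "min_vcpus=72  -> $(instances_for 72) instance(s)"
echo "min_vcpus=0   -> $(instances_for 0) instance(s)"
echo "max_vcpus=288 -> $(instances_for 288) instance(s)"
```

Under these assumptions, the config above keeps one instance running at all times and allows scaling up to four.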

For a basic hello world example that demonstrates how MPI jobs work with AWS Batch, see the tutorial:

https://aws-parallelcluster.readthedocs.io/en/latest/tutorials/03_batch_mpi.html

Now, on the master instance, clone the parallelcluster repo:

$ git clone https://github.com/aws/aws-parallelcluster.git
$ cd aws-parallelcluster/cli/pcluster/resources/batch/docker/

Create a Makefile with the following contents:

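# Makefile
# Note: recipe lines must be indented with a tab, not spaces.
distro=alinux
uri=[URI from ECR console]

.PHONY: build tag push

# Build the ParallelCluster base image, then the HPCG image on top of it
build:
        docker build -f $(distro)/Dockerfile -t pcluster-$(distro) .
        docker build -t $(uri) .

# Tag the HPCG image with the distro name
tag:
        docker tag $(uri) $(uri):$(distro)

push: build tag
        docker push $(uri):$(distro)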

To get that URI, go to the ECR Console and find a repository with a name similar to paral-docke-t6ayh0ia49nm (you can sort by latest created).

Grab that URI. It should look like: 112850485306.dkr.ecr.us-east-1.amazonaws.com/paral-docke-t6ajh0ia39nm
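
The URI packs several pieces of information into one string. As a rough sketch, it can be split with plain shell parameter expansion; the variable names here are illustrative, and the layout (account.dkr.ecr.region.amazonaws.com/repository) is inferred from the example URI above.

```shell
#!/bin/sh
# Sketch: split an ECR repository URI into its parts.
# The URI below is the example from this guide; substitute your own.
uri="112850485306.dkr.ecr.us-east-1.amazonaws.com/paral-docke-t6ajh0ia39nm"

registry="${uri%%/*}"      # everything before the first "/"
repository="${uri#*/}"     # everything after the first "/"
account="${registry%%.*}"  # first dot-separated field of the host
# Assumption: the region is the fourth dot-separated field of the host.
region="$(echo "$registry" | cut -d. -f4)"

echo "registry:   $registry"
echo "repository: $repository"
echo "account:    $account"
echo "region:     $region"
```

The registry host is the part that docker login expects. On AWS CLI v2, where aws ecr get-login is no longer available, the login step becomes aws ecr get-login-password --region "$region" | docker login --username AWS --password-stdin "$registry".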

Install Docker:

$ sudo yum install -y docker
$ sudo service docker start

Add the AmazonEC2ContainerRegistryFullAccess IAM policy to the IAM role of the master EC2 instance.

Now, create a Dockerfile with the following contents:

FROM pcluster-alinux:latest

# Set the working directory to /work
WORKDIR /work

# Copy the current directory contents into the container at /work
COPY . /work
ENV PATH=$PATH:/usr/lib64/openmpi/bin/

# Install the build tools and OpenMPI
RUN yum -y install awscli wget unzip gzip tar gcc gcc-c++ make
RUN yum -y install openmpi openmpi-devel
RUN yum groupinstall "Development Tools" -y

RUN wget https://github.com/hpcg-benchmark/hpcg/archive/master.zip

RUN unzip master.zip
RUN hpcg-master/configure Linux_MPI
RUN make
RUN chmod 755 /work/run.s

# Define environment variables
ENV INSTANCETYPE c5n.18xlarge
ENV CASE_CORES 36
ENV CASE_NAME run1
ENV CASE_SIZE 16
ENV CASE_TIME 20


ENTRYPOINT ["/parallelcluster/bin/entrypoint.sh"]
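
The ENV lines act as defaults for the CASE_* variables that run.s reads. The same "default unless already set" behaviour can be sketched in plain shell. This is only an illustration of the idea, not part of the image: apply_defaults is a hypothetical helper, and it assumes that a value supplied at job submission takes precedence over the image default.

```shell
#!/bin/sh
# Sketch: defaults that apply only when a variable is not already set,
# mirroring the ENV defaults in the Dockerfile above.
apply_defaults() {
    : "${CASE_CORES:=36}"
    : "${CASE_NAME:=run1}"
    : "${CASE_SIZE:=16}"
    : "${CASE_TIME:=20}"
}

unset CASE_CORES CASE_NAME CASE_SIZE CASE_TIME
CASE_CORES=72   # pretend the job submission overrode this one
apply_defaults
echo "$CASE_NAME $CASE_CORES $CASE_SIZE $CASE_TIME"
```

Only CASE_CORES keeps its overridden value; the other three fall back to the defaults.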

And a file run.s with the following contents:

#!/bin/sh

echo "case time, size and cores"
echo "CASE_NAME, $CASE_NAME"
echo "CASE_TIME, $CASE_TIME"
echo "CASE_SIZE, $CASE_SIZE"
echo "CASE_CORES, $CASE_CORES"

export PATH=.:$PATH
export OMPI_MCA_btl_vader_single_copy_mechanism=none

/usr/lib64/openmpi/bin/mpirun --allow-run-as-root -np $CASE_CORES -hostfile ${HOME}/hostfile /work/bin/xhpcg --nx=$CASE_SIZE --ny=$CASE_SIZE --nz=$CASE_SIZE --rt=$CASE_TIME

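# Find the summary line that reports the GFLOP/s rating
rating_string=$( grep "with a GFLOP/s rating" HPCG*)

# The rating value starts at character 62 of that line
length=${#rating_string}
rating=$(echo "$rating_string" | cut -c62-"$length" )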

echo "rating=, $rating"
middle="_"
filename=$CASE_NAME$middle$CASE_CORES$middle$CASE_SIZE
echo "$CASE_NAME, $CASE_CORES, $CASE_SIZE, $CASE_TIME, $rating" > $filename
echo $filename
cat $filename
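
The script locates the rating by a fixed character position, which breaks if the line gains a prefix (for example, a filename when grep matches more than one file). A possible alternative, sketched below, strips everything up to the "of=" marker instead. It assumes the summary line ends in "rating of=<number>"; the sample line and its value are illustrative, not real benchmark output, and extract_rating is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: extract the GFLOP/s rating without relying on a fixed column.
# Assumption: the summary line ends in "rating of=<number>".
extract_rating() {
    # Print whatever follows "rating of=" on the first matching line.
    sed -n 's/.*with a GFLOP\/s rating of=//p' | head -n 1
}

# Illustrative sample line, not real benchmark output.
sample="Final Summary::HPCG result is VALID with a GFLOP/s rating of=12.3456"
rating=$(printf '%s\n' "$sample" | extract_rating)
echo "rating=, $rating"
```

In run.s this could replace the cut step, for example with rating=$(cat HPCG* | extract_rating).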

Build and push that image with:

$ $(aws ecr get-login --no-include-email --region us-east-1) # log in to ECR (AWS CLI v1; get-login was removed in v2)
$ make push

Now you can submit an HPCG run like:

$ awsbsub -e CASE_CORES=36 -n 2 -jn hpcg /work/run.s

Watch the job to see when it transitions into running:

$ watch awsbstat
...
jobId                                 jobName    status    startedAt    stoppedAt    exitCode
------------------------------------  ---------  --------  -----------  -----------  ----------
222e21bb-a955-42c8-a45a-6d195db740b6  hpcg       RUNNABLE  -            -            -

Once it transitions to RUNNING, get the output with:

$ awsbout 222e21bb-a955-42c8-a45a-6d195db740b6
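
Rather than watching the table by eye, the status column can be pulled out with awk. The sketch below runs against the sample table from this guide; job_status is a hypothetical helper, and it assumes the columns are whitespace-separated in the order shown (jobId, jobName, status, ...) and that the job name contains no spaces.

```shell
#!/bin/sh
# Sketch: pull the status column for one job out of awsbstat-style output.
# Assumption: whitespace-separated columns in the order jobId, jobName, status.
job_status() {
    # $1 = job ID; the table is read from stdin.
    awk -v id="$1" '$1 == id { print $3 }'
}

# Sample table copied from the awsbstat output shown in this guide.
sample='jobId                                 jobName    status    startedAt    stoppedAt    exitCode
------------------------------------  ---------  --------  -----------  -----------  ----------
222e21bb-a955-42c8-a45a-6d195db740b6  hpcg       RUNNABLE  -            -            -'

status=$(printf '%s\n' "$sample" | job_status 222e21bb-a955-42c8-a45a-6d195db740b6)
echo "$status"
```

On the cluster, piping awsbstat into job_status inside a loop with a sleep would give a simple wait-until-RUNNING check; such a loop should also stop on a failed job so it does not wait forever.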

sean-smith/hpcg.md
