Today I'm going to demonstrate running the High Performance Conjugate Gradients (HPCG) benchmark as a containerized workload. This takes advantage of AWS ParallelCluster, AWS Batch, and OpenMPI.
First, install aws-parallelcluster:
$ pip install aws-parallelcluster
Edit the ParallelCluster config file to include the awsbatch cluster configuration:
$ vim ~/.parallelcluster/config
Add the following to this file. You'll need a public and a private subnet; see Public Private Networking for instructions on how to set that up.
[global]
update_check = true
sanity_check = true
cluster_template = awsbatch
[aws]
aws_region_name = us-east-1
[cluster awsbatch]
scheduler = awsbatch
key_name = [your key]
min_vcpus = 72
desired_vcpus = 72
max_vcpus = 288
vpc_settings = public-private
master_instance_type = c5.xlarge
compute_instance_type = c5n.18xlarge
[vpc public-private]
vpc_id = vpc-00d2e489741609bc2
master_subnet_id = subnet-0152608e422c75189
compute_subnet_id = subnet-0baadf9781f59a6a1
Now, create the cluster:
$ pcluster create hpcg
Creating stack named: parallelcluster-hpcg
Status: parallelcluster-hpcg - CREATE_COMPLETE
ClusterUser: ec2-user
MasterPublicIP: 54.35.249.0
MasterPrivateIP: 10.0.0.35
Once that's completed, ssh in. You may have to specify the key path with the -i flag if you're not using a default key.
$ pcluster ssh hpcg -i ~/.ssh/id_rsa
Running awsbhosts shows you the hosts that are running:
[ec2-user@ip-10-0-0-182 ~]$ awsbhosts
ec2InstanceId        instanceType    privateIpAddress    publicIpAddress    runningJobs
-------------------  --------------  ------------------  -----------------  -------------
i-07148c539c09ae9b8  c5n.18xlarge    10.0.1.171          -                  0
You can see there's one c5n.18xlarge instance running. This is because we set min_vcpus = 72; had we set min_vcpus = 0, there would be no hosts running.
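To make the link between the vCPU settings and the number of hosts concrete, here is a rough sketch of the arithmetic. It assumes 72 vCPUs per c5n.18xlarge and simply rounds up to whole instances; the helper name instances_for is made up for this illustration, and AWS Batch makes the real scaling decisions.

```shell
#!/bin/sh
# Sketch: how many c5n.18xlarge instances a vCPU setting corresponds to.
# Assumption: 72 vCPUs per c5n.18xlarge instance.
VCPUS_PER_INSTANCE=72

# instances_for <vcpus>: round up to a whole number of instances.
# (Hypothetical helper, not part of ParallelCluster.)
instances_for() {
    echo $(( ($1 + VCPUS_PER_INSTANCE - 1) / VCPUS_PER_INSTANCE ))
}

echo "min_vcpus=72  -> $(instances_for 72) instance(s)"
echo "max_vcpus=288 -> $(instances_for 288) instance(s)"
echo "min_vcpus=0   -> $(instances_for 0) instance(s)"
```

With the config above, that works out to one instance kept running at minimum and up to four at maximum.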
Now let's run through a basic hello world example to demonstrate how it works:
https://aws-parallelcluster.readthedocs.io/en/latest/tutorials/03_batch_mpi.html
Now, on the master instance, clone the parallelcluster repo:
$ git clone https://github.com/aws/aws-parallelcluster.git
$ cd aws-parallelcluster/cli/pcluster/resources/batch/docker/
Create a Makefile with the following contents:
# Makefile
# Note: recipe lines must start with a tab character, not spaces.
distro=alinux
uri=[URI from ECR console]

build:
	docker build -f $(distro)/Dockerfile -t pcluster-$(distro) .
	docker build -t $(uri) .

tag:
	docker tag $(uri) $(uri):$(distro)

push: build tag
	docker push $(uri):$(distro)
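For readers less familiar with make, the sketch below spells out the Docker commands that the push target runs, in order: build the base image, build the HPCG image on top of it, tag it, then push it. It is a dry run that only prints the commands. The uri value is a placeholder and the function name is made up for this illustration.

```shell
#!/bin/sh
# Dry-run sketch of what `make push` expands to (build, then tag, then push).
# Placeholder value: replace uri with the URI from your ECR console.
distro=alinux
uri="ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/REPO_NAME"

# Print the commands instead of running them.
# (Hypothetical helper for illustration only.)
make_push_dry_run() {
    echo "docker build -f ${distro}/Dockerfile -t pcluster-${distro} ."
    echo "docker build -t ${uri} ."
    echo "docker tag ${uri} ${uri}:${distro}"
    echo "docker push ${uri}:${distro}"
}

make_push_dry_run
```

With GNU make, running make -n push prints the same list straight from the Makefile without executing anything.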
To get that URI, go to the ECR Console and find a repository with a name similar to paral-docke-t6ayh0ia49nm (you can sort by latest created).
Grab its URI. It should look like: 112850485306.dkr.ecr.us-east-1.amazonaws.com/paral-docke-t6ajh0ia39nm
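As a quick sanity check on the URI you copied, the sketch below splits it into account ID, region, and repository name using plain shell string handling. It uses the example URI from this post; the variable names are only for illustration.

```shell
#!/bin/sh
# Sketch: split an ECR repository URI into its parts.
# The URI below is the example from this post; substitute your own.
uri="112850485306.dkr.ecr.us-east-1.amazonaws.com/paral-docke-t6ajh0ia39nm"

registry=${uri%%/*}       # everything before the first "/"
repo=${uri#*/}            # everything after the first "/"
account=${registry%%.*}   # leading account ID
# The region is the 4th dot-separated field of the registry host name.
region=$(echo "$registry" | cut -d. -f4)

echo "account=$account region=$region repo=$repo"
```

The region printed here should match aws_region_name in the cluster config and the --region passed when logging in to ECR.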
Install Docker and start the service:
$ sudo yum install -y docker
$ sudo service docker start
Add the AmazonEC2ContainerRegistryFullAccess IAM policy to the IAM role of the master EC2 instance.
Now, create a Dockerfile with the following contents:
FROM pcluster-alinux:latest
# Set the working directory to /work
WORKDIR /work
# Copy the current directory contents into the container at /work
COPY . /work
ENV PATH=$PATH:/usr/lib64/openmpi/bin/
# Install build tools and OpenMPI
RUN yum -y install awscli wget unzip gzip tar gcc gcc-c++ make
RUN yum -y install openmpi openmpi-devel
RUN yum groupinstall "Development Tools" -y
# Download HPCG and build it with the Linux_MPI configuration
RUN wget https://github.com/hpcg-benchmark/hpcg/archive/master.zip
RUN unzip master.zip
RUN hpcg-master/configure Linux_MPI
RUN make
RUN chmod 755 /work/run.s
# Default case parameters, read by run.s
ENV INSTANCETYPE c5n.18xlarge
ENV CASE_CORES 36
ENV CASE_NAME run1
ENV CASE_SIZE 16
ENV CASE_TIME 20
ENTRYPOINT ["/parallelcluster/bin/entrypoint.sh"]
Then create a file named run.s with the following contents:
#!/bin/sh
# Print the case parameters
echo "case time, size and cores"
echo "CASE_NAME, $CASE_NAME"
echo "CASE_TIME, $CASE_TIME"
echo "CASE_SIZE, $CASE_SIZE"
echo "CASE_CORES, $CASE_CORES"
export PATH=.:$PATH
export OMPI_MCA_btl_vader_single_copy_mechanism=none
# Run HPCG across the hosts listed in the hostfile
/usr/lib64/openmpi/bin/mpirun --allow-run-as-root -np "$CASE_CORES" -hostfile "${HOME}/hostfile" /work/bin/xhpcg --nx="$CASE_SIZE" --ny="$CASE_SIZE" --nz="$CASE_SIZE" --rt="$CASE_TIME"
# Pull the GFLOP/s rating out of the HPCG output file,
# keeping everything from character 62 of the matching line onward
rating_string=$(grep "with a GFLOP/s rating" HPCG*)
rating=$(echo "$rating_string" | cut -c62-)
echo "rating=, $rating"
# Write a one-line summary to a file named <name>_<cores>_<size>
filename="${CASE_NAME}_${CASE_CORES}_${CASE_SIZE}"
echo "$CASE_NAME, $CASE_CORES, $CASE_SIZE, $CASE_TIME, $rating" > "$filename"
echo "$filename"
cat "$filename"
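The script above pulls the rating out by a fixed character offset, which breaks if the summary line ever changes length. A slightly more forgiving sketch is shown below. It assumes the summary line ends in "rating of=<value>". The function name extract_rating is made up for this illustration, and the sample number is invented, not a measured result.

```shell
#!/bin/sh
# Sketch: extract the GFLOP/s value without relying on a fixed column offset.
# Assumption: the summary line ends in "rating of=<value>".
extract_rating() {
    # Keep the first matching line from stdin, then drop everything
    # up to and including the last "=".
    line=$(grep "with a GFLOP/s rating" | head -n 1)
    echo "${line##*=}"
}

# Illustrative input only; the number is made up, not a benchmark result.
sample="Final Summary::HPCG result is VALID with a GFLOP/s rating of=12.3456"
echo "$sample" | extract_rating
```

In run.s this could be used as rating=$(cat HPCG* | extract_rating), though that is an untested suggestion rather than what the original script does.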
Build and push the image with:
$ $(aws ecr get-login --no-include-email --region us-east-1) # login w/ ecr
$ make push
Now you can submit an HPCG run like:
$ awsbsub -e CASE_CORES=36 -n 2 -jn hpcg /work/run.s
Watch the job to see when it transitions into running:
$ watch awsbstat
...
jobId                                 jobName    status    startedAt    stoppedAt    exitCode
------------------------------------  ---------  --------  -----------  -----------  ----------
222e21bb-a955-42c8-a45a-6d195db740b6  hpcg       RUNNABLE  -            -            -
Once it transitions to RUNNING, get the output with:
$ awsbout 222e21bb-a955-42c8-a45a-6d195db740b6
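If you would rather script the wait than watch the table by eye, a small helper along these lines can pick one job's status out of awsbstat-style output. This is a sketch that assumes the column order shown in the listing above (jobId, jobName, status, ...). The function name job_status is made up for this illustration.

```shell
#!/bin/sh
# Sketch: pull one job's status out of awsbstat-style table output.
# Assumption: columns are jobId, jobName, status, ... as in the listing above.
job_status() {
    # $1 = job ID; the table text is read from stdin.
    awk -v id="$1" '$1 == id { print $3 }'
}

# Sample row copied from the listing above.
sample="222e21bb-a955-42c8-a45a-6d195db740b6 hpcg RUNNABLE - - -"
echo "$sample" | job_status 222e21bb-a955-42c8-a45a-6d195db740b6
```

On the master instance, piping awsbstat into job_status inside a loop with a sleep would let a script call awsbout once the status reads RUNNING.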