@csghone
Last active September 14, 2020
Parallel Jobs using AWS Batch

Summary

  • This page outlines how to configure AWS Batch to create an LSF/Slurm-like setup for running many jobs in parallel on spot instances
  • Components
    • Create a custom Amazon ECS AMI (controls available disk space)
    • Create a custom Docker image in Amazon ECR (contains all needed software; we use Ubuntu as the starting point)
      • This contains a generic /run_job.sh which will be our entry point
    • Create an EFS instance (this provides space for input/output data for your jobs)

VPC Setup

  • Enable DNS hostname and DNS resolution on your VPC
  • Create multiple dedicated subnets in your VPC for AWS Batch (eg: aws_batch_subnet_1a, aws_batch_subnet_1b)
  • Each availability zone should have only one subnet for AWS Batch
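
The subnet layout above can be sketched with the AWS CLI. The VPC ID, CIDR blocks, and us-east-1 region below are placeholder assumptions, and each command is echoed rather than executed, so this is a dry run:

```shell
#!/bin/sh
# Dry-run sketch: one dedicated Batch subnet per availability zone.
# VPC_ID, the CIDRs, and the us-east-1 region are placeholder assumptions.
VPC_ID="vpc-0123456789abcdef0"

create_batch_subnet() {
  az_suffix="$1"; cidr="$2"
  # 'echo' makes this a dry run; drop it to actually create the subnet
  echo aws ec2 create-subnet \
    --vpc-id "$VPC_ID" \
    --availability-zone "us-east-1${az_suffix}" \
    --cidr-block "$cidr" \
    --tag-specifications \
    "ResourceType=subnet,Tags=[{Key=Name,Value=aws_batch_subnet_1${az_suffix}}]"
}

# one subnet per AZ -- never two Batch subnets in the same AZ
create_batch_subnet a 10.0.101.0/24
create_batch_subnet b 10.0.102.0/24
```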

EFS Setup

  • Create an EFS instance
    • Use 'General Purpose' and 'Bursting' configuration
    • Enable 'Lifecycle management' to optimize costs.
  • Create mount targets in all availability zones
  • Setup crons to clear older data - EFS costs can be significant
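
The cleanup cron can be a minimal find-based script like the sketch below; the mount point and the 14-day retention window are assumptions, not values from this setup:

```shell
#!/bin/sh
# Hypothetical EFS cleanup helper for a daily cron (e.g. /etc/cron.daily/).
# Mount point and retention window are placeholders -- adjust to your setup.

cleanup_old_files() {
  mount_point="$1"; retention_days="$2"
  # delete files older than the retention window, then prune empty dirs
  find "$mount_point" -type f -mtime +"$retention_days" -delete
  find "$mount_point" -mindepth 1 -type d -empty -delete
}

# the cron entry would run, for example:
# cleanup_old_files /mnt/efs 14
```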

EC2 Setup

  • Create a separate key pair for spot instances: aws_batch_spot_key
    • EC2->Key Pairs->Create Key
  • Create a separate security group aws_batch_spot_sg
    • This should allow inbound traffic on TCP port 2049 (NFS) for accessing EFS
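
The key pair and security group can also be created from the CLI. The VPC ID and source CIDR below are placeholders, and the commands are echoed as a dry run:

```shell
#!/bin/sh
# Dry-run sketch (commands are echoed). VPC_ID and the CIDR are placeholders.
VPC_ID="vpc-0123456789abcdef0"
SG_NAME="aws_batch_spot_sg"

# create the spot key pair; redirect the output to aws_batch_spot_key.pem
echo aws ec2 create-key-pair --key-name aws_batch_spot_key \
  --query KeyMaterial --output text

echo aws ec2 create-security-group --group-name "$SG_NAME" \
  --description "AWS Batch spot instances" --vpc-id "$VPC_ID"

# EFS is NFS, i.e. TCP 2049; scope the source to your VPC CIDR, not 0.0.0.0/0
echo aws ec2 authorize-security-group-ingress --group-name "$SG_NAME" \
  --protocol tcp --port 2049 --cidr 10.0.0.0/16
```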

Custom Amazon ECS AMI for AWS Batch

  • EC2->Launch Instance->Community AMIs->amzn-ami-2018.03.a-amazon-ecs-optimized (any latest should be ok)
    • Select t2.micro
    • Add storage (use 8GB + 500GB configuration)
      • Make sure both of these EBS volumes are marked for 'delete on termination'
      • The 500 GB comes from: a maximum 16-core spot instance ==> 16 parallel jobs ==> ~30 GB of disk for each job
      • You can reduce this based on your use-case.
    • Specify aws_batch_spot_key as key
  • Launch and connect to the instance using ssh -i aws_batch_spot_key.pem ec2-user@<ip>
    • sudo yum update - Upgrade system
    • sudo yum install telnet - Useful later for debugging
    • Append --storage-opt dm.basesize=30G to
      • DOCKER_STORAGE_OPTIONS in /etc/sysconfig/docker-storage
      • EXTRA_DOCKER_STORAGE_OPTIONS in /etc/sysconfig/docker-storage-setup
  • Go to the EC2 console and save the AMI as aws_batch_ami
    • Again, make sure both EBS volumes are marked for 'delete on termination'
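
The storage-option edits can be scripted. The helper below (our own, not part of the AMI) appends an assignment that extends the existing variable; the file paths are those used by the ECS-optimized Amazon Linux AMI:

```shell
#!/bin/sh
# Helper to extend a VAR="..." setting in a sourced sysconfig file by
# appending a new assignment that references the old value.
append_storage_opt() {
  file="$1"; var="$2"; opt="$3"
  printf '%s="$%s %s"\n' "$var" "$var" "$opt" >> "$file"
}

# On the instance (as root), matching the steps above:
# append_storage_opt /etc/sysconfig/docker-storage \
#   DOCKER_STORAGE_OPTIONS '--storage-opt dm.basesize=30G'
# append_storage_opt /etc/sysconfig/docker-storage-setup \
#   EXTRA_DOCKER_STORAGE_OPTIONS '--storage-opt dm.basesize=30G'
```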

Create Docker Image in AWS ECR

  • https://docs.aws.amazon.com/AmazonECS/latest/developerguide/docker-basics.html
  • Create a repo: aws ecr create-repository --repository-name batch-base-docker
  • You can either pull an existing image from AWS ECR or create a new one
    • To pull an existing one: docker pull <old_account_id>.dkr.ecr.us-east-1.amazonaws.com/batch-base-docker
  • Create a docker image locally (Use 16.04 Ubuntu image + additional stuff as needed.)
    • docker pull ubuntu:16.04
    • docker run -it ubuntu:16.04 /bin/bash
    • Install required tools/packages inside the container
    • Create /run_job.sh
      • This can contain sudo su -c "$RUN_COMMAND" dummy_user, where RUN_COMMAND is set up during job submission
      • You can set this up to download scripts from S3 and run them to initialize the instance
      • This can also be used to ensure jobs do not run as root/superuser
      • You can generate custom job statistics and upload to S3 from this wrapper script
    • docker commit <commit_hash> batch-base-docker
  • Push to AWS ECR
    • You will need to edit the following commands to select the correct AWS region
    • docker tag <image_hash> <account_id>.dkr.ecr.us-east-1.amazonaws.com/batch-base-docker
    • Login to ECR: $(aws ecr get-login --region us-east-1 | sed "s:\-e none::")
    • Push image to ECR: docker push <account_id>.dkr.ecr.us-east-1.amazonaws.com/batch-base-docker
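
A minimal /run_job.sh could look like the sketch below. USER_NAME and USER_RUN_COMMAND are the environment variables the job definition injects; the S3 bootstrap bucket is a placeholder:

```shell
#!/bin/sh
# Hypothetical /run_job.sh for the batch-base-docker image.

check_env() {
  # both variables must be provided via the job definition / overrides
  [ -n "${USER_NAME:-}" ] && [ -n "${USER_RUN_COMMAND:-}" ]
}

run_job() {
  check_env || { echo "USER_NAME / USER_RUN_COMMAND not set" >&2; return 1; }
  # create an unprivileged user so the job never runs as root
  id "$USER_NAME" >/dev/null 2>&1 || useradd -m "$USER_NAME"
  # optional bootstrap: fetch per-job init scripts (placeholder bucket)
  # aws s3 cp s3://my-batch-bootstrap/init.sh /tmp/init.sh && sh /tmp/init.sh
  # run the job as the unprivileged user; custom job statistics could be
  # generated and uploaded to S3 right after this line
  su -c "$USER_RUN_COMMAND" "$USER_NAME"
}

# the container entry point would end with:
# run_job
```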

Setting up IAM for AWS Batch

  • Create AWSBatchServiceRole
    • IAM->Create Role->Service: Batch->Use case: Batch->Next->Next->Next->Role name: AWSBatchServiceRole->Create
  • Create AmazonEC2ContainerServiceforEC2Role
    • IAM->Create Role->Service: Elastic Container Service->Use case: EC2 Role for Elastic Container Service->Next->Next->Next->Role name: AmazonEC2ContainerServiceforEC2Role->Create
    • Attach S3 access policies and any other policies the spot instances will need
  • Create AmazonEC2SpotFleetRole
    • IAM->Create Role->Service: EC2->Use case: EC2 Spot Fleet Role->Next->Next->Next->Role name: AmazonEC2SpotFleetRole->Create
  • Create role AwsBatchJobRole
    • IAM->Create Role->Service: Elastic Container Service->Use case: Elastic Container Service Task->Next->Next->Next->Role name: AwsBatchJobRole->Create
    • Attach S3 access policies and any other policies the jobs will need
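
The console clicks for the first role have a CLI equivalent, sketched below as a dry run (commands echoed). The trust-policy file path is arbitrary; the attached policy ARN is AWS's managed AWSBatchServiceRole policy:

```shell
#!/bin/sh
# Dry-run sketch of creating AWSBatchServiceRole from the CLI.
TRUST_DOC=/tmp/batch-trust.json
cat > "$TRUST_DOC" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "batch.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
EOF

echo aws iam create-role --role-name AWSBatchServiceRole \
  --assume-role-policy-document "file://$TRUST_DOC"
echo aws iam attach-role-policy --role-name AWSBatchServiceRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole
```

The other three roles follow the same pattern with their respective service principals (ec2.amazonaws.com, spotfleet.amazonaws.com, ecs-tasks.amazonaws.com).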

Setting up AWS Batch

  • Create Batch Compute Environment - AWS-Batch-M5CE1
    • Managed setup
    • Spot instances with a maximum price of 100% of the on-demand price
    • Try to use latest family instances and avoid 16/32 core machines.
      • As of 2020, M5 seems most optimal in terms of cost
    • Size: min 0, max 1024
    • Use roles created in IAM in previous steps as applicable
    • Select AMI as aws_batch_ami which was created in previous steps
    • Use aws_batch_spot_key as the key
    • Use aws_batch_spot_sg as the security group
    • Add tag Name = AWS_BATCH_SPOT for easy tracking later
    • Select all batch subnets created in previous steps. (eg: aws_batch_subnet_1a, aws_batch_subnet_1b)
      • Ensure you have only one subnet in each availability zone - otherwise spot launches will fail
  • Create a Batch Job Definition
    • Use AwsBatchJobRole
    • Point to the ECR image created previously. eg: <account_id>.dkr.ecr.us-east-1.amazonaws.com/batch-base-docker
    • Add environment variables as needed: eg: USER_NAME, USER_RUN_COMMAND filled with dummy values
      • This can be used by /run_job.sh to run with correct environment/commands/credentials
    • Enable Privileged mode
  • Create Batch Job Queue
    • Name: common-queue, Priority: 50, Select compute env = AWS-Batch-M5CE1
  • Spot Usage Datafeed subscriptions
    • aws ec2 create-spot-datafeed-subscription --bucket AWSDOC-EXAMPLE-BUCKET1
    • This will push daily spot usage statistics to the specified bucket
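
A submission against this queue could look like the dry-run sketch below. The job-definition name and the command are placeholders; the environment overrides are what /run_job.sh consumes:

```shell
#!/bin/sh
# Dry-run sketch (echoed): submit one job to common-queue.
# 'batch-base-job-def' is a placeholder job-definition name.
OVERRIDES='{"environment":[
  {"name":"USER_NAME","value":"dummy_user"},
  {"name":"USER_RUN_COMMAND","value":"echo hello from batch"}]}'

echo aws batch submit-job \
  --job-name example-job-001 \
  --job-queue common-queue \
  --job-definition batch-base-job-def \
  --container-overrides "$OVERRIDES"
```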

Cloudwatch

  • A log group /aws/batch/job is created automatically - set an appropriate retention period for this group.
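
The retention period can be set from the CLI; 30 days is an example value, and the command is echoed as a dry run:

```shell
#!/bin/sh
# Dry-run sketch: cap retention on the Batch log group (30 days is an example).
LOG_GROUP=/aws/batch/job
echo aws logs put-retention-policy \
  --log-group-name "$LOG_GROUP" \
  --retention-in-days 30
```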

Debugging

  • When jobs are launched, ssh to the spot instance
    • ssh -i aws_batch_spot_key ec2-user@10.x.x.x
    • docker ps
    • tail -F /var/log/ecs/ecs-agent.log
  • The amazon/amazon-ecs-agent:latest Docker image launches first and, in turn, launches the batch-base-docker image. /var/log/ecs/ecs-agent.log helps identify errors in the setup (especially IAM permissions)

Performance optimizations

  • Parallel execution
    • Although an unlimited number of jobs can run in parallel, an unlimited number cannot be triggered in parallel
    • Typically 300-400 job runs can be triggered in parallel
    • This constraint arises from API throttling (ECR pull requests) in AWS
    • Solutions:
      • Avoid array size more than 300
      • Stagger job submissions to avoid hitting the API throttling limit.
      • Create a compute environment with 300-400 VCPUs (this will limit your parallel job limit to 300-400)
      • Use multiple accounts - throttling is at account level.
  • EFS I/O throughput
    • Use Cloudwatch to track your available EFS I/O bandwidth.
    • If you are running several jobs in parallel and each is writing large files to EFS, it is easy to run out of I/O bandwidth
    • Solutions
      • Temporarily turn on 'Provisioned' mode instead of 'Bursting' in EFS configuration - this can be toggled only once a day.
      • Change your job to write data in the main drive instead of EFS. Tar.gz the data to minimize EFS traffic.
  • EC2 instance utilizations
    • AWS Batch manages instances automatically based on pending jobs.
      • eg: If you fire 16 jobs, it launches a 16-core machine and runs all 16 jobs in parallel on that box
    • However, depending on the use case, such an approach might not be optimal. It usually works out better to avoid 16/32-core machines in your compute environment.
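
Two of the mitigations above can be sketched in shell: staggering submissions to stay under the API throttling limit, and writing results to local disk before shipping a single tarball to EFS. submit_one is a stub, and all paths and limits are placeholder assumptions:

```shell
#!/bin/sh
# (1) Stagger job submissions in chunks to avoid API throttling.
submit_one() {
  # stub -- replace with a real 'aws batch submit-job' call
  echo "submit job $1"
}

stagger_submit() {
  total="$1"; chunk="$2"; pause_seconds="$3"
  i=0
  while [ "$i" -lt "$total" ]; do
    submit_one "$i"
    i=$((i + 1))
    if [ $((i % chunk)) -eq 0 ]; then
      sleep "$pause_seconds"   # let the per-account API limits recover
    fi
  done
}
# e.g. stagger_submit 1000 300 60

# (2) Write results to the instance's local disk, then push one
# compressed tarball to EFS to minimize EFS I/O traffic.
archive_to_efs() {
  workdir="$1"    # local scratch, e.g. /scratch/job-123
  efs_dest="$2"   # EFS target, e.g. /mnt/efs/results/job-123.tar.gz
  tar -czf "$efs_dest" -C "$workdir" .
}
# e.g. archive_to_efs /scratch/job-123 /mnt/efs/results/job-123.tar.gz
```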