Summary
- This page outlines how to configure AWS Batch to create an LSF/Slurm-like setup for running several jobs in parallel on spot instances
- Components
- Create a custom Amazon ECS AMI (controls available disk space)
- Create a custom Docker image in Amazon ECR (contains all needed software; we use Ubuntu as the starting point)
- This contains a generic /run_job.sh which will be our entry point
- Create an EFS instance (this provides space for input/output data for your jobs)
VPC Setup
- Enable DNS hostname and DNS resolution on your VPC
- Create multiple dedicated subnets in your VPC for AWS Batch (eg: aws_batch_subnet_1a, aws_batch_subnet_1b)
- Each availability-zone should have only one subnet for AWS Batch
EFS Setup
- Create an EFS instance
- Use 'General Purpose' and 'Bursting' configuration
- Enable 'Lifecycle management' to optimize costs.
- Create mount targets in all availability zones
- Set up cron jobs to clear out older data - EFS costs can be significant
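The cleanup cron could look like the sketch below. The mount path and the 14-day cutoff are assumptions; point it at your actual EFS mount and tune the age to your retention needs.

```shell
# Sketch of an EFS cleanup helper; the mount point and cutoff are
# assumptions, not fixed names from this setup.
# cleanup_efs DIR DAYS: delete files older than DAYS under DIR,
# then prune any directories left empty.
cleanup_efs() {
  root="$1"
  days="${2:-14}"
  find "$root" -type f -mtime +"$days" -delete
  find "$root" -mindepth 1 -type d -empty -delete
}
```

Installed via a daily crontab entry, e.g. `0 3 * * * /usr/local/bin/efs_cleanup.sh` where the script calls `cleanup_efs /mnt/efs 14` (path is a placeholder).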
EC2 Setup
- Create a separate key pair for spot instances: aws_batch_spot_key
- EC2->Key Pairs->Create Key
- Create a separate security group aws_batch_spot_sg
- This should allow inbound traffic on port 2049 (NFS) for accessing EFS
Custom Amazon ECS AMI for AWS Batch
- EC2->Launch Instance->Community AMIs->amzn-ami-2018.03.a-amazon-ecs-optimized (any recent version should be ok)
- Select t2.micro
- Add storage (use 8GB + 500GB configuration)
- Make sure both of these EBS instances are marked for 'delete on termination'
- The 500 GB comes from: maximum 16-core spot instance ==> 16 parallel jobs ==> 30 GB of disk for each job
- You can reduce this based on your use-case.
- Specify aws_batch_spot_key as key
- Launch and connect to the instance using
ssh -i aws_batch_spot_key.pem ec2-user@<ip>
- Upgrade system
sudo yum update
- Install telnet (useful later for debugging)
sudo yum install telnet
- Append --storage-opt dm.basesize=30G to
- DOCKER_STORAGE_OPTIONS in /etc/sysconfig/docker-storage
- EXTRA_DOCKER_STORAGE_OPTIONS in /etc/sysconfig/docker-storage-setup
- Go to the EC2 console and save the AMI as aws_batch_ami
Create Docker Image in AWS ECR
- https://docs.aws.amazon.com/AmazonECS/latest/developerguide/docker-basics.html
- Create a repo:
aws ecr create-repository --repository-name batch-base-docker
- You can either pull an existing image from AWS ECR or create a new one
- To pull an existing one:
docker pull <old_account_id>.dkr.ecr.us-east-1.amazonaws.com/batch-base-docker
- Create a docker image locally (use the Ubuntu 16.04 image + additional stuff as needed)
docker pull ubuntu:16.04
docker run -it ubuntu:16.04 /bin/bash
- Install required tools/packages inside the container
- Create /run_job.sh
- This can contain
sudo su -c "$RUN_COMMAND" dummy_user
where RUN_COMMAND is set at job submission
- You can set this up to download scripts from S3 and run them to initialize the instance
- This can also be used to ensure jobs do not run as root/superuser
- You can generate custom job statistics and upload to S3 from this wrapper script
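A minimal /run_job.sh body might look like the following sketch. The function form, the fallback when not running as root, and the S3 bucket name are assumptions for illustration; dummy_user is the placeholder account from this page.

```shell
# Sketch of the body of /run_job.sh, the container entry point.
# RUN_COMMAND and USER_NAME arrive as environment variables set at job
# submission; dummy_user is a placeholder unprivileged account.
run_job() {
  run_user="${USER_NAME:-dummy_user}"
  # Optionally pull per-job bootstrap scripts from S3 first, e.g.:
  #   aws s3 cp "s3://my-batch-bucket/bootstrap.sh" /tmp/ && sh /tmp/bootstrap.sh
  # (bucket name is hypothetical)
  if [ "$(id -u)" -eq 0 ] && id "$run_user" >/dev/null 2>&1; then
    # Drop privileges so jobs never run as root/superuser
    su -c "$RUN_COMMAND" "$run_user"
  else
    # Already unprivileged (or the user does not exist): run directly
    sh -c "$RUN_COMMAND"
  fi
}
```

Job statistics can be collected around the `su`/`sh` call (e.g. with `time`) and uploaded to S3 before the function returns.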
- Commit the container as an image
docker commit <container_id> batch-base-docker
- Push to AWS ECR
- You will need to edit the following commands to select the correct AWS region
docker tag <image_hash> <account_id>.dkr.ecr.us-east-1.amazonaws.com/batch-base-docker
- Login to ECR:
$(aws ecr get-login --region us-east-1 | sed "s:\-e none::")
- Push image to ECR:
docker push <account_id>.dkr.ecr.us-east-1.amazonaws.com/batch-base-docker
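Put together, the tag/login/push sequence might look like this sketch. The account id is a placeholder, and the actual docker/aws calls only fire when DO_PUSH=1 so the script can be dry-run safely.

```shell
# Sketch: tag a local image and push it to ECR. ACCOUNT_ID is a
# placeholder; set DO_PUSH=1 to actually run the docker/aws commands.
ACCOUNT_ID="${ACCOUNT_ID:-123456789012}"   # placeholder account id
REGION="${REGION:-us-east-1}"
REPO="batch-base-docker"
ECR_URI="${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${REPO}"
echo "target: $ECR_URI"

if [ "${DO_PUSH:-0}" = "1" ]; then
  docker tag "$REPO" "$ECR_URI"
  # older awscli v1 login helper, as used on this page:
  $(aws ecr get-login --region "$REGION" | sed "s:\-e none::")
  docker push "$ECR_URI"
fi
```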
Setting up IAM for AWS Batch
- Create AWSBatchServiceRole
- IAM->Create Role->Service: Batch->Use case: Batch->Next->Next->Next->Role name: AWSBatchServiceRole->Create
- Create AmazonEC2ContainerServiceforEC2Role
- IAM->Create Role->Service: Elastic Container Service->Use case: EC2 Role for Elastic Container Service->Next->Next->Next->Role name: AmazonEC2ContainerServiceforEC2Role->Create
- Attach S3 access policies and any other policies the spot instances will need
- Create AmazonEC2SpotFleetRole
- IAM->Create Role->Service: EC2->Use case: EC2 Spot Fleet Role->Next->Next->Next->Role name: AmazonEC2SpotFleetRole->Create
- Create role AwsBatchJobRole
- IAM->Create Role->Service: Elastic Container Service->Use case: Elastic Container Service Task->Next->Next->Next->Role name: AwsBatchJobRole->Create
- Attach S3 access policies and any other policies the spot instances will need
Setting up AWS Batch
- Create Batch Compute Environment - AWS-Batch-M5CE1
- Managed setup
- Spot instances with the maximum price set to 100% of the on-demand price
- Try to use latest family instances and avoid 16/32 core machines.
- As of 2020, M5 seems most optimal in terms of cost
- Size: min 0, max 1024
- Use roles created in IAM in previous steps as applicable
- Select AMI as aws_batch_ami which was created in previous steps
- Use aws_batch_spot_key as the key
- Use aws_batch_spot_sg as the security group
- Add tag Name = AWS_BATCH_SPOT for easy tracking later
- Select all batch subnets created in previous steps. (eg: aws_batch_subnet_1a, aws_batch_subnet_1b)
- Ensure you have only one subnet in each availability zone - otherwise spot launches will fail
- Create a Batch Job Definition
- Use AwsBatchJobRole
- Point to ECR image created previously. eg:
<account_id>.dkr.ecr.us-east-1.amazonaws.com/batch-base-docker
- Add environment variables as needed: eg: USER_NAME, USER_RUN_COMMAND filled with dummy values
- This can be used by /run_job.sh to run with correct environment/commands/credentials
- Enable Privileged mode
- Create Batch Job Queue
- Name: common-queue, Priority: 50, Select compute env = AWS-Batch-M5CE1
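With the queue and job definition in place, a submission might look like the sketch below. The job-definition name and the command value are assumptions, and the aws call is guarded by DO_SUBMIT=1 so the script can be dry-run.

```shell
# Sketch: submit a job that feeds USER_NAME/USER_RUN_COMMAND through to
# /run_job.sh. The job definition name "batch-base-job-def" and the
# command path are placeholders; set DO_SUBMIT=1 to actually call AWS.
JOB_QUEUE="common-queue"
JOB_DEF="batch-base-job-def"
OVERRIDES='{"environment":[{"name":"USER_NAME","value":"alice"},{"name":"USER_RUN_COMMAND","value":"/mnt/efs/jobs/run_analysis.sh"}]}'
echo "$OVERRIDES"

if [ "${DO_SUBMIT:-0}" = "1" ]; then
  aws batch submit-job \
    --job-name "example-job" \
    --job-queue "$JOB_QUEUE" \
    --job-definition "$JOB_DEF" \
    --container-overrides "$OVERRIDES"
fi
```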
- Spot Usage Datafeed subscriptions
aws ec2 create-spot-datafeed-subscription --bucket AWSDOC-EXAMPLE-BUCKET1
- This will push daily spot usage statistics to the specified bucket
Cloudwatch
- A log group /aws/batch/job is created automatically
- Set an appropriate retention period for this group.
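Setting the retention can be scripted; a sketch follows (30 days is an arbitrary choice, and the aws call is guarded so this can be dry-run).

```shell
# Sketch: cap retention on the AWS Batch log group so old job logs
# expire. 30 days is an arbitrary choice; DRY_RUN guards the AWS call.
LOG_GROUP="/aws/batch/job"
RETENTION_DAYS=30
echo "setting ${LOG_GROUP} retention to ${RETENTION_DAYS} days"

if [ "${DRY_RUN:-1}" = "0" ]; then
  aws logs put-retention-policy \
    --log-group-name "$LOG_GROUP" \
    --retention-in-days "$RETENTION_DAYS"
fi
```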
Debugging
- When jobs are launched, ssh to the spot instance
ssh -i aws_batch_spot_key.pem ec2-user@10.x.x.x
docker ps
tail -F /var/log/ecs/ecs-agent.log
- The amazon/amazon-ecs-agent:latest docker image launches first. This, in turn, launches the batch-base-docker image.
- /var/log/ecs/ecs-agent.log will help identify errors in the setup (especially IAM permissions)
Performance optimizations
- Parallel execution
- Although an unlimited number of jobs can run in parallel, an unlimited number cannot be triggered in parallel
- Typically 300-400 job runs can be triggered in parallel
- This constraint arises from API throttling (ECR pull requests) in AWS
- Solutions:
- Avoid array sizes greater than 300
- Stagger job submissions to avoid hitting the API throttling limit.
- Create a compute environment with 300-400 VCPUs (this will limit your parallel job limit to 300-400)
- Use multiple accounts - throttling is at account level.
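The staggering idea above can be sketched as a small loop that splits work into chunks of at most 300 and pauses between them; the submit step is stubbed with echo and would be replaced by the real aws batch submit-job call.

```shell
# Sketch: split a large batch into arrays of <=300 and stagger the
# submissions to stay under the API throttling limit. The submit step is
# an echo stub standing in for:
#   aws batch submit-job --array-properties size=$n ...
stagger_submit() {
  total="$1"
  chunk="${2:-300}"
  pause="${3:-0}"   # use e.g. 60 (seconds) in real runs
  submitted=0
  while [ "$submitted" -lt "$total" ]; do
    n=$(( total - submitted ))
    if [ "$n" -gt "$chunk" ]; then n=$chunk; fi
    echo "submit array of size $n"
    submitted=$(( submitted + n ))
    sleep "$pause"
  done
}
```

e.g. `stagger_submit 1000 300 60` submits 1000 jobs as arrays of at most 300 with 60-second pauses.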
- EFS I/O throughput
- Use Cloudwatch to track your available EFS I/O bandwidth.
- If you are running several jobs in parallel and each is writing large files to EFS, it is easy to run out of I/O bandwidth
- Solutions
- Temporarily turn on 'Provisioned' mode instead of 'Bursting' in EFS configuration - this can be toggled only once a day.
- Change your job to write data to the instance's local disk instead of EFS; tar.gz the results before copying them to EFS to minimize traffic.
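The local-scratch-then-tarball pattern could look like this sketch; the helper name and directory layout are assumptions for illustration.

```shell
# Sketch: write job output to local scratch, then ship one compressed
# tarball to EFS instead of many small writes. Paths and the helper
# name are placeholders.
pack_to_efs() {
  scratch="$1"   # local working dir holding the job's output
  efs_dir="$2"   # destination dir on the EFS mount
  job_id="$3"
  tar -C "$scratch" -czf "${efs_dir}/${job_id}.tar.gz" .
}
```

A job would run entirely against `$scratch` (e.g. on the 500 GB volume) and call `pack_to_efs` once at the end.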
- EC2 instance utilizations
- AWS BATCH manages instances based on pending jobs automatically.
- eg: If you fire 16 jobs, it launches a 16 core machine and runs all 16 jobs in parallel on that box
- However, depending on the use case, such an approach might not be optimal. It usually turns out better to avoid 16/32-core machines in your compute environment.