@csghone
Last active September 14, 2020
Parallel Jobs using AWS Batch

Summary

  • This page outlines how to configure AWS Batch to create an LSF/Slurm-like setup for running many jobs in parallel on spot instances
  • Components
    • Create a custom Amazon ECS AMI (controls available disk space)
    • Create a custom Docker image in Amazon ECR (contains all needed software; we use Ubuntu as the starting point)
      • This contains a generic /run_job.sh which will be our entry point
    • Create an EFS instance (this provides space for input/output data for your jobs)

VPC Setup

  • Enable DNS hostname and DNS resolution on your VPC
  • Create multiple dedicated subnets in your VPC for AWS Batch (eg: aws_batch_subnet_1a, aws_batch_subnet_1b)
  • Each availability zone should have only one subnet for AWS Batch
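
The subnet layout above can be sketched with the AWS CLI. The VPC ID, CIDR blocks, and us-east-1 region below are placeholder assumptions, and each command is echoed rather than executed, so this is a dry run:

```shell
#!/bin/sh
# Dry-run sketch: one dedicated Batch subnet per availability zone.
# VPC_ID, the CIDRs, and the us-east-1 region are placeholder assumptions.
VPC_ID="vpc-0123456789abcdef0"

create_batch_subnet() {
  az_suffix="$1"; cidr="$2"
  # 'echo' makes this a dry run; drop it to actually create the subnet
  echo aws ec2 create-subnet \
    --vpc-id "$VPC_ID" \
    --availability-zone "us-east-1${az_suffix}" \
    --cidr-block "$cidr" \
    --tag-specifications \
    "ResourceType=subnet,Tags=[{Key=Name,Value=aws_batch_subnet_1${az_suffix}}]"
}

# one subnet per AZ -- never two Batch subnets in the same AZ
create_batch_subnet a 10.0.101.0/24
create_batch_subnet b 10.0.102.0/24
```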

EFS Setup

  • Create an EFS instance
    • Use 'General Purpose' and 'Bursting' configuration
    • Enable 'Lifecycle management' to optimize costs.
  • Create mount targets in all availability zones
  • Setup crons to clear older data - EFS costs can be significant
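
The cleanup cron can be a minimal find-based script like the sketch below; the mount point and the 14-day retention window are assumptions, not values from this setup:

```shell
#!/bin/sh
# Hypothetical EFS cleanup helper for a daily cron (e.g. /etc/cron.daily/).
# Mount point and retention window are placeholders -- adjust to your setup.

cleanup_old_files() {
  mount_point="$1"; retention_days="$2"
  # delete files older than the retention window, then prune empty dirs
  find "$mount_point" -type f -mtime +"$retention_days" -delete
  find "$mount_point" -mindepth 1 -type d -empty -delete
}

# the cron entry would run, for example:
# cleanup_old_files /mnt/efs 14
```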

EC2 Setup

  • Create a separate key pair for spot instances: aws_batch_spot_key
    • EC2->Key Pairs->Create Key
  • Create a separate security group aws_batch_spot_sg
    • This should allow inbound traffic on TCP port 2049 (NFS) for accessing EFS
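
The key pair and security group can also be created from the CLI. The VPC ID and source CIDR below are placeholders, and the commands are echoed as a dry run:

```shell
#!/bin/sh
# Dry-run sketch (commands are echoed). VPC_ID and the CIDR are placeholders.
VPC_ID="vpc-0123456789abcdef0"
SG_NAME="aws_batch_spot_sg"

# create the spot key pair; redirect the output to aws_batch_spot_key.pem
echo aws ec2 create-key-pair --key-name aws_batch_spot_key \
  --query KeyMaterial --output text

echo aws ec2 create-security-group --group-name "$SG_NAME" \
  --description "AWS Batch spot instances" --vpc-id "$VPC_ID"

# EFS is NFS, i.e. TCP 2049; scope the source to your VPC CIDR, not 0.0.0.0/0
echo aws ec2 authorize-security-group-ingress --group-name "$SG_NAME" \
  --protocol tcp --port 2049 --cidr 10.0.0.0/16
```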

Custom Amazon ECS AMI for AWS Batch

  • EC2->Launch Instance->Community AMIs->amzn-ami-2018.03.a-amazon-ecs-optimized (any latest should be ok)
    • Select t2.micro
    • Add storage (use 8GB + 500GB configuration)
      • Make sure both of these EBS volumes are marked for 'delete on termination'
      • The 500 GB comes from: a maximum 16-core spot instance ==> 16 parallel jobs ==> ~30 GB of disk for each job
      • You can reduce this based on your use-case.
    • Specify aws_batch_spot_key as key
  • Launch and connect to the instance using ssh -i aws_batch_spot_key.pem ec2-user@<ip>
    • sudo yum update - Upgrade system
    • sudo yum install telnet - Useful later for debugging
    • Append --storage-opt dm.basesize=30G to
      • DOCKER_STORAGE_OPTIONS in /etc/sysconfig/docker-storage
      • EXTRA_DOCKER_STORAGE_OPTIONS in /etc/sysconfig/docker-storage-setup
  • Go to the EC2 console and save the AMI as aws_batch_ami
    • Again, make sure both EBS volumes are marked for 'delete on termination'
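
The storage-option edits can be scripted. The helper below (our own, not part of the AMI) appends an assignment that extends the existing variable; the file paths are those used by the ECS-optimized Amazon Linux AMI:

```shell
#!/bin/sh
# Helper to extend a VAR="..." setting in a sourced sysconfig file by
# appending a new assignment that references the old value.
append_storage_opt() {
  file="$1"; var="$2"; opt="$3"
  printf '%s="$%s %s"\n' "$var" "$var" "$opt" >> "$file"
}

# On the instance (as root), matching the steps above:
# append_storage_opt /etc/sysconfig/docker-storage \
#   DOCKER_STORAGE_OPTIONS '--storage-opt dm.basesize=30G'
# append_storage_opt /etc/sysconfig/docker-storage-setup \
#   EXTRA_DOCKER_STORAGE_OPTIONS '--storage-opt dm.basesize=30G'
```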

Create Docker Image in AWS ECR

  • https://docs.aws.amazon.com/AmazonECS/latest/developerguide/docker-basics.html
  • Create a repo: aws ecr create-repository --repository-name batch-base-docker
  • You can either pull an existing image from AWS ECR or create a new one
    • To pull an existing one: docker pull <old_account_id>.dkr.ecr.us-east-1.amazonaws.com/batch-base-docker
  • Create a docker image locally (Use 16.04 Ubuntu image + additional stuff as needed.)
    • docker pull ubuntu:16.04
    • docker run -it ubuntu:16.04 /bin/bash
    • Install required tools/packages inside the container
    • Create /run_job.sh
      • This can contain sudo su -c "$RUN_COMMAND" dummy_user, where RUN_COMMAND is set up during job submission
      • You can set this up to download scripts from S3 and run them to initialize the instance
      • This can also be used to ensure jobs do not run as root/superuser
      • You can generate custom job statistics and upload to S3 from this wrapper script
    • docker commit <commit_hash> batch-base-docker
  • Push to AWS ECR
    • You will need to edit the following commands to select the correct AWS region
    • docker tag <image_hash> <account_id>.dkr.ecr.us-east-1.amazonaws.com/batch-base-docker
    • Login to ECR: $(aws ecr get-login --region us-east-1 | sed "s:\-e none::")
    • Push image to ECR: docker push <account_id>.dkr.ecr.us-east-1.amazonaws.com/batch-base-docker
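
A minimal /run_job.sh could look like the sketch below. USER_NAME and USER_RUN_COMMAND are the environment variables the job definition injects; the S3 bootstrap bucket is a placeholder:

```shell
#!/bin/sh
# Hypothetical /run_job.sh for the batch-base-docker image.

check_env() {
  # both variables must be provided via the job definition / overrides
  [ -n "${USER_NAME:-}" ] && [ -n "${USER_RUN_COMMAND:-}" ]
}

run_job() {
  check_env || { echo "USER_NAME / USER_RUN_COMMAND not set" >&2; return 1; }
  # create an unprivileged user so the job never runs as root
  id "$USER_NAME" >/dev/null 2>&1 || useradd -m "$USER_NAME"
  # optional bootstrap: fetch per-job init scripts (placeholder bucket)
  # aws s3 cp s3://my-batch-bootstrap/init.sh /tmp/init.sh && sh /tmp/init.sh
  # run the job as the unprivileged user; custom job statistics could be
  # generated and uploaded to S3 right after this line
  su -c "$USER_RUN_COMMAND" "$USER_NAME"
}

# the container entry point would end with:
# run_job
```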

Setting up IAM for AWS Batch

  • Create AWSBatchServiceRole
    • IAM->Create Role->Service: Batch->Use case: Batch->Next->Next->Next->Role name: AWSBatchServiceRole->Create
  • Create AmazonEC2ContainerServiceforEC2Role
    • IAM->Create Role->Service: Elastic Container Service->Use case: EC2 Role for Elastic Container Service->Next->Next->Next->Role name: AmazonEC2ContainerServiceforEC2Role->Create
    • Attach S3 access policies and any other policies the spot instances will need
  • Create AmazonEC2SpotFleetRole
    • IAM->Create Role->Service: EC2->Use case: EC2 Spot Fleet Role->Next->Next->Next->Role name: AmazonEC2SpotFleetRole->Create
  • Create role AwsBatchJobRole
    • IAM->Create Role->Service: Elastic Container Service->Use case: Elastic Container Service Task->Next->Next->Next->Role name: AwsBatchJobRole->Create
    • Attach S3 access policies and any other policies the jobs will need
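
The console clicks for the first role have a CLI equivalent, sketched below as a dry run (commands echoed). The trust-policy file path is arbitrary; the attached policy ARN is AWS's managed AWSBatchServiceRole policy:

```shell
#!/bin/sh
# Dry-run sketch of creating AWSBatchServiceRole from the CLI.
TRUST_DOC=/tmp/batch-trust.json
cat > "$TRUST_DOC" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "batch.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
EOF

echo aws iam create-role --role-name AWSBatchServiceRole \
  --assume-role-policy-document "file://$TRUST_DOC"
echo aws iam attach-role-policy --role-name AWSBatchServiceRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole
```

The other three roles follow the same pattern with their respective service principals (ec2.amazonaws.com, spotfleet.amazonaws.com, ecs-tasks.amazonaws.com).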

Setting up AWS Batch

  • Create Batch Compute Environment - AWS-Batch-M5CE1
    • Managed setup
    • Spot instances with a maximum price of 100% of the on-demand price
    • Try to use latest family instances and avoid 16/32 core machines.
      • As of 2020, M5 seems most optimal in terms of cost
    • Size: min 0, max 1024
    • Use roles created in IAM in previous steps as applicable
    • Select AMI as aws_batch_ami which was created in previous steps
    • Use aws_batch_spot_key as the key
    • Use aws_batch_spot_sg as the security group
    • Add tag Name = AWS_BATCH_SPOT for easy tracking later
    • Select all batch subnets created in previous steps. (eg: aws_batch_subnet_1a, aws_batch_subnet_1b)
      • Ensure you have only one subnet in each availability zone - otherwise spot launches will fail
  • Create a Batch Job Definition
    • Use AwsBatchJobRole
    • Point to the ECR image created previously. eg: <account_id>.dkr.ecr.us-east-1.amazonaws.com/batch-base-docker
    • Add environment variables as needed: eg: USER_NAME, USER_RUN_COMMAND filled with dummy values
      • This can be used by /run_job.sh to run with correct environment/commands/credentials
    • Enable Privileged mode
  • Create Batch Job Queue
    • Name: common-queue, Priority: 50, Select compute env = AWS-Batch-M5CE1
  • Spot Usage Datafeed subscriptions
    • aws ec2 create-spot-datafeed-subscription --bucket AWSDOC-EXAMPLE-BUCKET1
    • This will push daily spot usage statistics to the specified bucket
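
A submission against this queue could look like the dry-run sketch below. The job-definition name and the command are placeholders; the environment overrides are what /run_job.sh consumes:

```shell
#!/bin/sh
# Dry-run sketch (echoed): submit one job to common-queue.
# 'batch-base-job-def' is a placeholder job-definition name.
OVERRIDES='{"environment":[
  {"name":"USER_NAME","value":"dummy_user"},
  {"name":"USER_RUN_COMMAND","value":"echo hello from batch"}]}'

echo aws batch submit-job \
  --job-name example-job-001 \
  --job-queue common-queue \
  --job-definition batch-base-job-def \
  --container-overrides "$OVERRIDES"
```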

Cloudwatch

  • A log group /aws/batch/job is created automatically - set an appropriate retention period for this group.
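
The retention period can be set from the CLI; 30 days is an example value, and the command is echoed as a dry run:

```shell
#!/bin/sh
# Dry-run sketch: cap retention on the Batch log group (30 days is an example).
LOG_GROUP=/aws/batch/job
echo aws logs put-retention-policy \
  --log-group-name "$LOG_GROUP" \
  --retention-in-days 30
```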

Debugging

  • When jobs are launched, ssh to the spot instance
    • ssh -i aws_batch_spot_key ec2-user@10.x.x.x
    • docker ps
    • tail -F /var/log/ecs/ecs-agent.log
  • The amazon/amazon-ecs-agent:latest Docker image launches first and, in turn, launches the batch-base-docker image. /var/log/ecs/ecs-agent.log helps identify errors in the setup (especially IAM permissions)

Performance optimizations

  • Parallel execution
    • Although an unlimited number of jobs can run in parallel, an unlimited number cannot be triggered in parallel
    • Typically 300-400 job runs can be triggered in parallel
    • This constraint arises from API throttling (ECR pull requests) in AWS
    • Solutions:
      • Avoid array size more than 300
      • Stagger job submissions to avoid hitting the API throttling limit.
      • Create a compute environment with 300-400 VCPUs (this will limit your parallel job limit to 300-400)
      • Use multiple accounts - throttling is at account level.
  • EFS I/O throughput
    • Use Cloudwatch to track your available EFS I/O bandwidth.
    • If you are running several jobs in parallel and each is writing large files to EFS, it is easy to run out of I/O bandwidth
    • Solutions
      • Temporarily turn on 'Provisioned' mode instead of 'Bursting' in EFS configuration - this can be toggled only once a day.
      • Change your job to write data in the main drive instead of EFS. Tar.gz the data to minimize EFS traffic.
  • EC2 instance utilizations
    • AWS Batch manages instances automatically based on pending jobs.
      • eg: If you fire 16 jobs, it launches a 16-core machine and runs all 16 jobs in parallel on that box
    • However, depending on the use case, such an approach might not be optimal. It usually works out better to avoid 16/32-core machines in your compute environment.
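
Two of the mitigations above can be sketched in shell: staggering submissions to stay under the API throttling limit, and writing results to local disk before shipping a single tarball to EFS. submit_one is a stub, and all paths and limits are placeholder assumptions:

```shell
#!/bin/sh
# (1) Stagger job submissions in chunks to avoid API throttling.
submit_one() {
  # stub -- replace with a real 'aws batch submit-job' call
  echo "submit job $1"
}

stagger_submit() {
  total="$1"; chunk="$2"; pause_seconds="$3"
  i=0
  while [ "$i" -lt "$total" ]; do
    submit_one "$i"
    i=$((i + 1))
    if [ $((i % chunk)) -eq 0 ]; then
      sleep "$pause_seconds"   # let the per-account API limits recover
    fi
  done
}
# e.g. stagger_submit 1000 300 60

# (2) Write results to the instance's local disk, then push one
# compressed tarball to EFS to minimize EFS I/O traffic.
archive_to_efs() {
  workdir="$1"    # local scratch, e.g. /scratch/job-123
  efs_dest="$2"   # EFS target, e.g. /mnt/efs/results/job-123.tar.gz
  tar -czf "$efs_dest" -C "$workdir" .
}
# e.g. archive_to_efs /scratch/job-123 /mnt/efs/results/job-123.tar.gz
```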