@DaisukeMiyamoto
Last active June 7, 2020 20:07

Tutorial: AWS Batch setup for Neuron

Introduction

AWS Batch is a fully managed batch job scheduler service on AWS. It can easily manage large-scale job queueing and execution. This tutorial shows how to use Inferentia within a job on AWS Batch.

Steps Overview:

  1. Create a launch template for the base EC2 instance of the AWS Batch Compute Environment
  2. Create an AWS Batch Compute Environment and Job Queue
  3. Build a container image with TensorFlow-Neuron
  4. Push the docker image to Elastic Container Registry
  5. Submit an inference job to AWS Batch

Steps:

Step 1: Create a launch template for the base EC2 instance of the AWS Batch Compute Environment

  • create a userdata.txt file containing the following
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="

--==MYBOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash
tee /etc/yum.repos.d/neuron.repo > /dev/null <<EOF
[neuron]
name=Neuron YUM Repository
baseurl=https://yum.repos.neuron.amazonaws.com
enabled=1
metadata_expire=0
EOF

rpm --import https://yum.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB

yum install -y aws-neuron-runtime-base aws-neuron-runtime aws-neuron-tools python3 gcc-c++ unzip

--==MYBOUNDARY==--
  • create a launch template with the UserData and the AMI ID of the ECS-optimized AMI

You need to replace AMI_ID=ami-0aee8ced190c05726 with the AMI ID of the ECS-optimized AMI for your region. You can find the ID at https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html .

$ AMI_ID=ami-0aee8ced190c05726
$ aws ec2 create-launch-template --launch-template-name neuron-sdk-template --launch-template-data '{"ImageId": "'${AMI_ID}'", "UserData": "'$(cat userdata.txt | base64 --wrap=0)'"}'
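To confirm the UserData was embedded correctly, you can read it back from the new template and decode it (a sketch; assumes the `neuron-sdk-template` name from the command above):

```shell
# Read the stored UserData back from version 1 of the template and decode it.
aws ec2 describe-launch-template-versions \
  --launch-template-name neuron-sdk-template \
  --versions 1 \
  --query 'LaunchTemplateVersions[0].LaunchTemplateData.UserData' \
  --output text | base64 --decode
```

The decoded output should match your local userdata.txt.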

Step 2: Create AWS Batch Compute Environment and Job Queue

  • create a Compute Environment and Job Queue of AWS Batch.

During the creation of a Compute Environment, all parameters can be left at their defaults except the launch template and the instance type. Set the launch template to the one created in the previous step, and select inf1.xlarge as the instance type.

You also need to create a Job Queue associated with the Compute Environment.
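The console steps above can also be done from the CLI. This is a hedged sketch: `neuron-ce`, `neuron-job-queue`, `SUBNET_ID`, `SG_ID`, `ecsInstanceRole`, and `AWSBatchServiceRole` are placeholder names you would replace with values from your own account.

```shell
# Sketch of the equivalent CLI calls for the console steps above.
# SUBNET_ID, SG_ID, and the role names are placeholders for your account.
aws batch create-compute-environment \
  --compute-environment-name neuron-ce \
  --type MANAGED \
  --compute-resources '{
    "type": "EC2",
    "minvCpus": 0, "maxvCpus": 16, "desiredvCpus": 0,
    "instanceTypes": ["inf1.xlarge"],
    "subnets": ["SUBNET_ID"],
    "securityGroupIds": ["SG_ID"],
    "instanceRole": "ecsInstanceRole",
    "launchTemplate": {"launchTemplateName": "neuron-sdk-template"}
  }' \
  --service-role AWSBatchServiceRole

# Attach a Job Queue to the Compute Environment once it is VALID.
aws batch create-job-queue \
  --job-queue-name neuron-job-queue \
  --priority 1 \
  --compute-environment-order order=1,computeEnvironment=neuron-ce
```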

Step 3: Build a container image with TensorFlow-Neuron

  • Create a Dockerfile containing the following.
# Example neuron-container dockerfile for AWS Batch

# To build:
#    docker build -t neuron-container .

# Prepare application:
# before creating the docker image, you need to prepare some files based on the tutorial
# https://github.com/aws/aws-neuron-sdk/blob/master/docs/tensorflow-neuron/tutorial-compile-infer.md
# resnet50_neuron.zip
# infer_resnet50.py

# Note: the container must start with CAP_SYS_ADMIN + CAP_IPC_LOCK capabilities in order
# to map the memory needed from the Inferentia devices. These capabilities will
# be dropped following initialization.

# i.e. To start the container with required capabilities:
#   docker run --env AWS_NEURON_VISIBLE_DEVICES="0" -v /run:/run -it neuron-container

FROM amazonlinux:2

COPY resnet50_neuron.zip /tmp/
COPY infer_resnet50.py /tmp/

RUN echo $'[neuron] \n\
name=Neuron YUM Repository \n\
baseurl=https://yum.repos.neuron.amazonaws.com \n\
enabled=1' > /etc/yum.repos.d/neuron.repo

RUN rpm --import https://yum.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB

RUN yum install -y \
    aws-neuron-runtime-base \
    aws-neuron-runtime \
    aws-neuron-tools \
    python3 \
    gcc-c++ \
    unzip tar gzip

RUN python3 -m venv neuron_venv && \
    source neuron_venv/bin/activate && \
    pip install -U pip && \
    echo $'[global] \n\
extra-index-url = https://pip.repos.neuron.amazonaws.com' > $VIRTUAL_ENV/pip.conf && \
    pip install pillow && \
    pip install neuron-cc && \
    pip install tensorflow-neuron

RUN echo $'\
#!/bin/bash -xe\n\
source neuron_venv/bin/activate \n\
cd /tmp \n\
curl -O https://raw.githubusercontent.com/awslabs/mxnet-model-server/master/docs/images/kitten_small.jpg \n\
unzip resnet50_neuron.zip \n\
python infer_resnet50.py'\
> /tmp/job.sh &&\
  chmod +x /tmp/job.sh

ENV PATH="/opt/aws/neuron/bin:${PATH}"
CMD /tmp/job.sh
  • complete the TensorFlow-Neuron ResNet-50 tutorial to produce resnet50_neuron.zip and infer_resnet50.py

To use Inferentia, this tutorial depends on Getting Started with TensorFlow-Neuron (ResNet-50 Tutorial). Before building the docker image, complete the TensorFlow-Neuron tutorial and put resnet50_neuron.zip and infer_resnet50.py in the same directory as the Dockerfile.

  • build docker image
$ docker build -t neuron-container .

Step 4: Push the docker image to Elastic Container Registry

  • create a repository on Elastic Container Registry.
$ aws ecr create-repository --repository-name neuron-container
  • push docker image to the repository

You should replace repository_uri=xxxxxxxxxx.dkr.ecr.us-east-1.amazonaws.com/neuron-container with your repository URI. The URI can be found in the output of the create-repository command above.

$ repository_uri=xxxxxxxxxx.dkr.ecr.us-east-1.amazonaws.com/neuron-container
$ docker tag neuron-container ${repository_uri}
$ aws ecr get-login-password | docker login --username AWS --password-stdin ${repository_uri}
$ docker push ${repository_uri}
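After the push completes, you can confirm the image landed in the repository (a sketch; assumes the `neuron-container` repository created above):

```shell
# List the image tags stored in the ECR repository.
aws ecr describe-images \
  --repository-name neuron-container \
  --query 'imageDetails[].imageTags'
```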

Step 5: Submit an inference job to AWS Batch

  • create a job definition

This command uses the ${repository_uri} set above.

$ aws batch register-job-definition --job-definition-name neuron-job-def --type container --container-properties '{"image": "'${repository_uri}'", "vcpus": 4, "memory": 4096, "volumes": [{"host": {"sourcePath": "/run"}, "name": "run"}], "mountPoints": [{"containerPath": "/run","sourceVolume": "run"}]}'
  • submit a job and check result

Replace JOB_QUEUE_NAME with the name of your job queue for the inf1 compute environment.

$ aws batch submit-job --job-name neuron-job --job-queue JOB_QUEUE_NAME --job-definition neuron-job-def
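If you prefer to wait on the job from the shell, you can capture the job ID from submit-job and poll its status (a sketch; JOB_QUEUE_NAME is again your own queue name):

```shell
# Submit the job, capture its ID, and poll until it reaches a terminal state.
job_id=$(aws batch submit-job --job-name neuron-job \
  --job-queue JOB_QUEUE_NAME --job-definition neuron-job-def \
  --query jobId --output text)
while true; do
  status=$(aws batch describe-jobs --jobs "$job_id" \
    --query 'jobs[0].status' --output text)
  echo "$status"
  [ "$status" = "SUCCEEDED" ] || [ "$status" = "FAILED" ] && break
  sleep 15
done
```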
  • check job result

After the execution, you can find the inference result in the CloudWatch Logs for the job.

[('n02123045', 'tabby', 0.69945353), ('n02127052', 'lynx', 0.1215847), ('n02123159', 'tiger_cat', 0.08367486), ('n02124075', 'Egyptian_cat', 0.064890705), ('n02128757', 'snow_leopard', 0.009392076)]
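The result above is a Python-style list of (class_id, label, probability) tuples, printed as one log line. A quick way to pull the top-1 label out of such a line with sed (a sketch; the sample line below is a shortened copy of the expected output):

```shell
# Extract the second field (the label) of the first tuple in the log line.
result="[('n02123045', 'tabby', 0.69945353), ('n02127052', 'lynx', 0.1215847)]"
top_label=$(echo "$result" | sed -E "s/^\[\('[^']*', '([^']*)'.*/\1/")
echo "$top_label"   # tabby
```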