Heap dumping a Java process running in AWS ECS

Note: this guide is designed for AWS ECS services, but from Step 4 onward the steps apply equally to any Docker container on a Linux host.

Step 1. Look up the ECS service

Log in to the AWS Console with the appropriate AWS account
Navigate to AWS ECS service clusters (https://console.aws.amazon.com/ecs/home)
Make sure you are in the correct region; if not, switch to it (second drop-down menu in the top-right corner)
Select the correct cluster (for example: https://console.aws.amazon.com/ecs/home?region=us-east-1#/clusters//services)
In the Services tab, type the name of the service in the 'Filter in this page' text box
Click the appropriate service name

Step 2. Find the IP address of the EC2 Container Instance

Click the Tasks tab
Click one of the Task items, such as 95f7f552-5515-4997-81b7-8b6784a504b1. Copy this down as your Task ID.
Click the Container instance, such as 3040f458-b8e7-45d6-b3bb-af4101fce510
Copy the Private IP, such as 172.31.45.106
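Steps 1 and 2 can also be done from the command line. The following is a minimal sketch, assuming the AWS CLI is installed and configured for the right account (set the region with AWS_DEFAULT_REGION), that the service uses the EC2 launch type (Fargate tasks have no container instance), and that the first task returned is acceptable. The function name ecs_task_private_ip is hypothetical.

```shell
# Hypothetical helper: print "<task ARN> <private IP>" for the first task of an
# ECS service, mirroring Steps 1 and 2 (task -> container instance -> EC2 IP).
# Assumes the AWS CLI is configured for the correct account and region.
ecs_task_private_ip() {
  local cluster="$1" service="$2"
  local task_arn ci_arn ec2_id ip

  # First task of the service (any of its tasks will do for Step 2).
  task_arn=$(aws ecs list-tasks --cluster "$cluster" --service-name "$service" \
    --query 'taskArns[0]' --output text) || return 1
  if [ -z "$task_arn" ] || [ "$task_arn" = "None" ]; then
    echo "no running tasks found for service $service" >&2
    return 1
  fi

  # Container instance hosting the task (only set for the EC2 launch type).
  ci_arn=$(aws ecs describe-tasks --cluster "$cluster" --tasks "$task_arn" \
    --query 'tasks[0].containerInstanceArn' --output text) || return 1

  # EC2 instance behind that container instance.
  ec2_id=$(aws ecs describe-container-instances --cluster "$cluster" \
    --container-instances "$ci_arn" \
    --query 'containerInstances[0].ec2InstanceId' --output text) || return 1

  # Private IP to connect to in Step 3.
  ip=$(aws ec2 describe-instances --instance-ids "$ec2_id" \
    --query 'Reservations[0].Instances[0].PrivateIpAddress' --output text) || return 1

  echo "$task_arn $ip"
}
```

For example, after `export AWS_DEFAULT_REGION=us-east-1`, run `ecs_task_private_ip <my-cluster> <My_ECS_Service_Name>`. The Task ID is the last path segment of the printed task ARN.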

Step 3. Logging into the EC2 Container Instance

If you use a bastion host to connect to your EC2 Container Instances, you might use a command like the following:

```
ssh -v \
    -o ProxyCommand='ssh ec2-user@<my-bastion-hostname> -i ~/.ssh/<my-bastion-key> -N -W %h:%p' \
    -o ForwardAgent=yes \
    -i ~/.ssh/<ec2-container-instance-key> \
    -l ec2-user \
    172.31.45.106
```

The last line is the Private IP you copied before.

Step 4. Find the container ID that matches the service & task

Option 1. Grep for the service name

List docker processes and grep for the service name.

```
docker ps | grep -i <My_ECS_Service_Name>
8e1f316737c3        <my-container-repo-name-tag>             "/some/long/command…"   2 months ago        Up 2 months         8080/tcp, 0.0.0.0:11720->8080/tcp   <some-long-service-name-string-d4e99688fca48e89fd01>
```

If it looks like the right service, copy the first part of the output (8e1f316737c3). This is the Docker container ID.

Option 2. Grep for the task ARN

List every container's ID and ECS task ARN, then grep for the Task ID from Step 2:

```
docker ps -q \
    | xargs -n1 docker inspect -f '{{.Id}} {{index .Config.Labels "com.amazonaws.ecs.task-arn"}}' \
    | grep 95f7f552-5515-4997-81b7-8b6784a504b1
8e1f316737c34a57243c87444b8ba2ad6b5eea140df8988dc44f0692fc3fc90a arn:aws:ecs:us-east-1:1234567890:task/95f7f552-5515-4997-81b7-8b6784a504b1
```

The first field of the output (before the space) is the full container ID. You only need its first 12 characters.
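Option 2 can also be done with a single docker ps call, because docker ps --format can print one label with {{.Label "..."}} and shows 12-character container IDs by default. A minimal sketch, using a hypothetical filter named containers_for_task that reads that output on standard input (kept as a pure filter so it can be tried without Docker):

```shell
# Hypothetical filter: read "<container ID> <task ARN>" lines, e.g. from
#   docker ps --format '{{.ID}} {{.Label "com.amazonaws.ecs.task-arn"}}'
# and print the ID of every container whose task ARN contains the task ID.
containers_for_task() {
  local task_id="$1"
  # index() is a plain substring test, so the UUID needs no regex escaping.
  # Containers without the label have an empty $2 and are skipped.
  awk -v task="$task_id" 'index($2, task) > 0 { print $1 }'
}
```

For example: `docker ps --format '{{.ID}} {{.Label "com.amazonaws.ecs.task-arn"}}' | containers_for_task 95f7f552-5515-4997-81b7-8b6784a504b1`. A task with several containers (such as sidecars) prints several IDs; pick the one running Java.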

Step 5. Get the host process ID of the container

Ask Docker for the host process ID (PID) of the container's main process.

```
docker inspect -f '{{.State.Pid}}' 8e1f316737c3
28780
```
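.State.Pid is the container's main process. If the image starts Java through a wrapper shell or an init process, that PID is not the JVM, and jattach needs the JVM's own PID. A minimal sketch for picking the JVM's host PID out of docker top output (which lists host PIDs), assuming the JVM's process name is java; java_pid_from_top is a hypothetical name:

```shell
# Hypothetical filter: read `docker top <container> -eo pid,comm` output on
# stdin and print the host PID of the first process whose command is "java".
# Prints nothing if the container has no such process.
java_pid_from_top() {
  # NR > 1 skips the "PID COMMAND" header line.
  awk 'NR > 1 && $2 == "java" { print $1; exit }'
}
```

For example: `docker top 8e1f316737c3 -eo pid,comm | java_pid_from_top`. If it prints nothing, run `docker top 8e1f316737c3` on its own to see what the container is actually running.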

Step 6. Download the heap-dumping tools

To take a heap dump or core dump, you need tools that can extract the application's memory. If curl isn't available on your system, try wget.

Options for dumping the heap of Java applications

For Java applications, download the tools jattach, gcore and gdb:

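```
for i in jattach gcore gdb ; do
    curl -fL -o "$i" "https://github.com/pwillis-els/docker-build-static/releases/download/v0.1.0/$i"
    # Without curl: wget -O "$i" "https://github.com/pwillis-els/docker-build-static/releases/download/v0.1.0/$i"
done
chmod 755 jattach gcore gdb
```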

Options for core dumping any other program

Download just gcore and gdb as shown above.

Step 7. Attempt to heap-dump the process

Option 1: Java processes

There are two ways to generate heap dumps for Java processes. The first asks the JVM to write the heap dump itself. If the JVM is not responding, this will not work, and you'll have to switch to Option 2: a core dump of the process.

The reasons jattach is used (and not jmap or jcmd) are as follows:

jattach is a tiny binary that's easy to download or copy, so you don't need to provide an entire JDK, as you do with jmap and jcmd.
jattach runs on the host operating system and enters the container's namespaces before opening a socket to the running Java process.
If jmap or jcmd fails to attach to the JVM, it falls back to copying memory through ptrace() one byte at a time, which is very slow. In that case, the gdb method below is much faster, though its output is potentially larger.

Attempt a jattach heap dump using the host PID from Step 5. The JVM resolves the dump path in its own filesystem, so this creates the heap dump file inside the container (until we come up with a revised method).
```
sudo ./jattach 28780 dumpheap /tmp/dumpheap-28780.dump
```
Copy the dump file out of the container onto the host's local disk, using the container ID from Step 4.
```
docker cp 8e1f316737c3:/tmp/dumpheap-28780.dump host-dumpheap-28780.dump
```

Remove the heap dump from the container.

```
docker exec 8e1f316737c3 rm /tmp/dumpheap-28780.dump
```

Note: you may want to add the option -live to the end of the jattach command (sudo ./jattach 28780 dumpheap /tmp/dumpheap-28780.dump -live). This collects only live (reachable) objects, which makes the heap dump significantly smaller, but the JVM runs a full garbage collection first. It's up to the developers to decide which they'd prefer: -live or -all. Passing the option explicitly avoids relying on the JVM's default.
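The three Option 1 commands can be wrapped in one function so the file names stay consistent and the in-container copy is removed only after docker cp succeeds. A minimal sketch, assuming jattach sits at ./jattach (override with JATTACH) and that the container ID and host PID come from Steps 4 and 5; heap_dump_container is a hypothetical name, and it passes -all explicitly unless you ask for -live.

```shell
# Hypothetical helper wrapping Option 1: ask the JVM (via jattach) to dump its
# heap inside the container, copy the dump to the host, then delete the copy
# inside the container. Prints the host file name on success.
# Usage: heap_dump_container CONTAINER_ID HOST_PID [-live|-all]
heap_dump_container() {
  local cid="$1" pid="$2" mode="${3:--all}"
  local jattach="${JATTACH:-./jattach}"
  local inner="/tmp/dumpheap-$pid.dump"   # path as the JVM sees it (in the container)
  local outer="host-dumpheap-$pid.dump"   # destination on the host
  case "$mode" in
    -live|-all) ;;
    *) echo "mode must be -live or -all" >&2; return 2 ;;
  esac
  # The JVM writes the file itself, so this path is inside the container.
  sudo "$jattach" "$pid" dumpheap "$inner" "$mode" || return 1
  # Only remove the in-container file once the copy has succeeded.
  docker cp "$cid:$inner" "$outer" || return 1
  docker exec "$cid" rm -f "$inner" || return 1
  echo "$outer"
}
```

For example, `heap_dump_container 8e1f316737c3 28780 -live` prints host-dumpheap-28780.dump when all three steps succeed.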

Option 2: Any kind of process

If the jattach method above didn't work or doesn't apply, take a core dump of the process with gcore instead.

Attempt a gcore core dump using the host PID from Step 5.
```
sudo ./gcore -a 28780
```
You will probably see a lot of output, including many errors.
Wait until you see a line that says "Saved corefile core.28780" or similar. The file is written to your current directory on the host; use it in place of host-dumpheap-28780.dump in the steps below.
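GDB's gcore is a shell script that runs gdb, so if the script itself fails, one possible fallback is to drive gdb directly. A minimal sketch, assuming the gdb binary from Step 6 is at ./gdb (override with GDB) and is recent enough to support the two set commands, which approximate what gcore -a enables; gdb_core_dump is a hypothetical name.

```shell
# Hypothetical helper: write core.<pid> to the current directory by running gdb
# directly with roughly the settings `gcore -a` uses. Assumes gdb from Step 6
# is at ./gdb (override with GDB) and that sudo allows ptrace of the target.
gdb_core_dump() {
  local pid="$1" gdb="${GDB:-./gdb}"
  case "$pid" in
    ''|*[!0-9]*) echo "usage: gdb_core_dump PID" >&2; return 2 ;;
  esac
  # Detaching before quitting leaves the target process running.
  sudo "$gdb" -nx -batch \
    -ex 'set pagination off' \
    -ex 'set use-coredump-filter off' \
    -ex 'set dump-excluded-mappings on' \
    -ex "attach $pid" \
    -ex "gcore core.$pid" \
    -ex 'detach' \
    -ex 'quit'
}
```

For example, `gdb_core_dump 28780` should leave core.28780 in the current directory.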

Step 8. Compress the heap dump

You have a couple of options here:

Compressing with xz (slower than gzip, but much better compression)
```
xz -v -7 host-dumpheap-28780.dump
```
Compressing with zstd. You will have to download this tool, but it is much faster (50 seconds rather than 10 minutes) and compresses about as well as xz. However, it may use more memory and/or CPU.
```
./zstd -v -7 host-dumpheap-28780.dump
```
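Both tools can spread the work over several CPU cores with -T0, which helps with multi-gigabyte dumps. Note that xz replaces the input with a .xz file by default, while zstd keeps the input unless --rm is given. A minimal sketch, assuming xz 5.2 or newer and, optionally, a zstd binary on the PATH or at the path in ZSTD (such as ./zstd); compress_dump is a hypothetical name.

```shell
# Hypothetical helper: compress a dump with zstd if available (on PATH or at
# $ZSTD, e.g. ./zstd), otherwise with xz. Prints the compressed file name.
# -T0 sizes the worker pool from the CPU count; --rm makes zstd delete the
# input like xz does, so only the compressed file remains afterwards.
compress_dump() {
  local f="$1" zstd="${ZSTD:-zstd}"
  [ -f "$f" ] || { echo "no such file: $f" >&2; return 1; }
  if command -v "$zstd" >/dev/null 2>&1; then
    "$zstd" -q -T0 -7 --rm "$f" && echo "$f.zst"
  else
    xz -q -T0 -7 "$f" && echo "$f.xz"
  fi
}
```

For example, `ZSTD=./zstd compress_dump host-dumpheap-28780.dump` prints host-dumpheap-28780.dump.zst, or host-dumpheap-28780.dump.xz if no zstd binary is found.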

Step 9. Transfer to an S3 bucket
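One way to transfer the compressed dump is the AWS CLI's aws s3 cp, which switches to multipart uploads for large files automatically. A minimal sketch, assuming the AWS CLI is installed on the EC2 Container Instance and that its IAM role (or your credentials) may write to the bucket; the bucket name, the heap-dumps prefix, and the function name upload_dump are hypothetical.

```shell
# Hypothetical helper: upload a dump to s3://<bucket>/<prefix>/<file name>.
# Assumes the AWS CLI is installed and allowed to write to the bucket.
upload_dump() {
  local file="$1" bucket="$2" prefix="${3:-heap-dumps}"
  [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
  [ -n "$bucket" ] || { echo "usage: upload_dump FILE BUCKET [PREFIX]" >&2; return 2; }
  aws s3 cp "$file" "s3://$bucket/$prefix/$(basename "$file")"
}
```

For example: `upload_dump host-dumpheap-28780.dump.xz <my-bucket>`. The developers can then download it with aws s3 cp in the other direction.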

Extra Options

From the EC2 Container Instance, you can open a shell inside the container using its container ID:

```
docker exec -it 8e1f316737c3 sh
/deployments $
```

pwillis-els/Heap_Dump_Java_AWS_ECS.md