- How to log into slaves
One can log into the slave nodes using SSH agent forwarding. Suppose that I find the following in the YARN log:
java.nio.channels.CancelledKeyException
14/09/25 23:12:58 ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId(ip-172-31-xx-xx.ec2.internal,49972)
java.io.IOException: Connection reset by peer
...
I would like to SSH to the slave ip-172-31-xx-xx.ec2.internal to check its logs. To do so, first we add the EMR identity (the EC2 key pair) to the local agent:
you@localhost: ssh-add ~/EC2_KEY.pem
and then SSH to the master node with agent forwarding enabled:
you@localhost: ssh -A [email protected] -i ~/EC2_KEY.pem
Once on the master, we are able to log into the slave node:
hadoop@master: ssh [email protected]
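If the hop to the slave is rejected, a quick sanity check (assuming the standard OpenSSH agent) is to list the keys visible on the master; with forwarding working, the EC2 key added above should appear:
hadoop@master: ssh-add -l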
- Find the lost block manager/container
Sometimes we find that something is wrong and the job ends up with a FetchFailed error:
FetchFailed(BlockManagerId(81, ip-172-31-xx-xx.ec2.internal, 52487, 0), shuffleId=83, mapId=44, reduceId=49)
First we check the application in the ResourceManager UI:
http://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:9026/cluster/app/application_1413826885383_0001
And click on Nodes
http://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:9026/cluster/nodes
then find ip-172-31-xx-xx.ec2.internal in the list and click through to its NodeManager UI:
http://ip-172-31-xx-xx.ec2.internal:9035/
In the List of Containers we can see that no container is using BlockManagerId(81, ip-172-31-xx-xx.ec2.internal, 52487, 0).
This tells us that our block manager is gone. Now we want to find the container that was using the killed block manager.
Go to Tools > Local logs in the menu on the left:
http://ip-172-31-xx-xx.ec2.internal:9035/logs/userlogs/application_1413826885383_0001/
We check each of the container folders and find the one that used the block manager, say container_1413826885383_0001_01_000086. The block manager shows up in the following log:
http://ip-172-31-xx-xx.ec2.internal:9035/logs/userlogs/application_1413826885383_0001/container_1413826885383_0001_01_000086/stderr
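As an aside, if the application has already finished and YARN log aggregation is enabled (both are assumptions that do not always hold), the same container logs can also be pulled with the YARN CLI instead of the web UI:
hadoop@master: yarn logs -applicationId application_1413826885383_0001 | less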
Now we would like to know why the container container_1413826885383_0001_01_000086 was killed. To do so we investigate the NodeManager log file at
http://ip-172-31-xx-xx.ec2.internal:9035/logs/yarn-hadoop-nodemanager-ip-172-31-xx-xx.ec2.internal.log.2014-10-20-18
And we find out that the container was killed because of a problem, e.g., the physical memory limit was reached:
Limit=11274289152, current usage = 11443380224
2014-10-20 18:00:20,624 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Container [pid=3377,containerID=container_1413826885383_0001_01_000086] is running beyond physical memory limits. Current usage: 10.7 GB of 10.5 GB physical memory used; 11.3 GB of 52.5 GB virtual memory used. Killing container.
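A common mitigation for this particular failure (a sketch, assuming Spark 1.1+ on YARN; the jar name and the numbers are placeholders) is to give the executor JVM more headroom inside its YARN container by raising spark.yarn.executor.memoryOverhead when submitting the job:
hadoop@master: spark-submit --master yarn-cluster \
    --executor-memory 10g \
    --conf spark.yarn.executor.memoryOverhead=2048 \
    my_job.jar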
- Profile a live container
Once we have identified the container that causes the problem, we can dump the memory of the process that is running it. After we log into the internal node, we can check the logs from the NodeManager, e.g.,
hadoop@internal:tail /mnt/var/log/hadoop/yarn-hadoop-nodemanager-ip-172-31-xx-xx.ec2.internal.log
There we are able to find the status of each container, including its memory usage. For example, for the container container_1414015957471_0001_01_000015:
2014-10-22 22:22:26,574 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl (Container Monitor): Memory usage of ProcessTree 3929 for container-id container_1414015957471_0001_01_000015: 7.3 GB of 20.5 GB physical memory used; 21.5 GB of 102.5 GB virtual memory used
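To follow only this container's memory trend, a simple filter over the same NodeManager log (the path from the tail command above) is enough:
hadoop@internal:grep container_1414015957471_0001_01_000015 /mnt/var/log/hadoop/yarn-hadoop-nodemanager-ip-172-31-xx-xx.ec2.internal.log | grep "Memory usage"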
Suppose we want to analyze the memory usage of this container. The first thing we need is the system process id of the process running it. To find it, we search the system process list:
hadoop@internal:ps aux | grep container_1414015957471_0001_01_000015
And we are able to find
hadoop 3929 0.0 0.0 112916 1216 ? Ss 22:14 0:00 /bin/bash -c /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms20480m -Xmx20480m -Djava.io.tmpdir=/mnt/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1414015957471_0001/container_1414015957471_0001_01_000015/tmp '-Dspark.akka.timeout=200' '-Dspark.akka.frameSize=100' org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://[email protected]:37254/user/CoarseGrainedScheduler 25 ip-172-31-32-193.ec2.internal 6 1> /mnt/var/log/hadoop/userlogs/application_1414015957471_0001/container_1414015957471_0001_01_000015/stdout 2> /mnt/var/log/hadoop/userlogs/application_1414015957471_0001/container_1414015957471_0001_01_000015/stderr
hadoop 3940 58.6 14.4 22484956 8887668 ? Sl 22:14 5:59 /usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms20480m -Xmx20480m -Djava.io.tmpdir=/mnt/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1414015957471_0001/container_1414015957471_0001_01_000015/tmp -Dspark.akka.timeout=200 -Dspark.akka.frameSize=100 org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://[email protected]:37254/user/CoarseGrainedScheduler 25 ip-172-31-32-193.ec2.internal 6
hadoop 8006 0.0 0.0 110280 860 pts/0 S+ 22:24 0:00 grep container_1414015957471_0001_01_000015
where the first one is the bash wrapper script and the last one is the grep we just ran. Thus the pid of the process running the container is 3940.
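If we only want the pid itself, a small filter over the same process list (a sketch; it simply drops the bash wrapper and the grep) extracts it directly:
hadoop@internal:ps aux | grep container_1414015957471_0001_01_000015 | grep -v bash | grep -v grep | awk '{print $2}'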
We can now use the standard Java utilities, e.g., jstack to observe the current thread stacks:
hadoop@internal:jstack 3940 | vim -
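Before taking a full heap dump, a lighter-weight check is the jmap class histogram, which often already points at the objects filling the heap (the :live option forces a full GC first):
hadoop@internal:jmap -histo:live 3940 | head -30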
or we can dump the current memory content to a file
hadoop@internal:jmap -dump:format=b,file=/mnt/3940.dump 3940
and use jhat or VisualVM to analyze the memory status. For example, if we run the following command
hadoop@internal:jhat /mnt/3940.dump
jhat will analyze the dump file and start a web server on port 7000.
Then, through a SOCKS proxy connection to the cluster, we can access the page at
http://ip-172-31-xx-xx.ec2.internal:7000/
and show instance counts:
http://ip-172-31-xx-xx.ec2.internal:7000/showInstanceCounts/
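A minimal way to set up that SOCKS proxy (a sketch; port 8157 is arbitrary, and the browser has to be pointed at the proxy, e.g. with FoxyProxy as suggested in the EMR documentation) is a dynamic SSH tunnel through the master:
you@localhost: ssh -i ~/EC2_KEY.pem -N -D 8157 [email protected]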