@cloventt
Created July 27, 2017 05:05
Hadoop Inside Docker, DataNode Fails to Replicate Data Correctly

I just spent several hours trying to configure a pseudo-distributed Hadoop cluster inside a Docker container. I wanted to post our experience in case someone else makes the mistake of trying to do this themselves.

Problem

When we tried to save a file to HDFS with the Java client, the NameNode appeared to save the file. Using hdfs dfs -ls we could see the file was represented in HDFS, but it had a size of 0, indicating no data had made it into the cluster.
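For illustration, the listing looked something like the following (the path, owner, and timestamp here are placeholders, not our real file):

    $ hdfs dfs -ls /foo.csv
    -rw-r--r--   1 hadoop supergroup          0 2017-07-27 05:05 /foo.csv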

A really unhelpful stack trace was also issued to the client. The error was similar to this:

org.apache.hadoop.ipc.RemoteException(java.io.IOException):
    File foo.csv could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) excluded in this operation.

This is just the stack trace from Hadoop, piped back to the client.

Investigation

Our first clue was in this stack trace. The DataNode was labelled 127.0.0.1:50010, or something similar. This was wrong: our client was in a Docker container with IP address 172.18.0.3, and the DataNode was in a container with IP 172.18.0.4. From outside Docker, HDFS operations worked fine.

This told us the problem was probably a networking issue. At first we thought the DataNode was returning its localhost IP as its routable IP, which would fail inside Docker. After wading through the hopelessly useless Hadoop docs and support threads, we worked out that this was basically true, but it wasn't actually the problem.

The NameNode determines the DataNode IP by capturing the IP it registers from, which was localhost in our case. By default it then passes this IP along to the client.

Solution

The issue ultimately came down to a barely-documented configuration option in the HDFS client, dfs.client.use.datanode.hostname.
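A minimal sketch of setting this on the client side, in the client's hdfs-site.xml (the property name is the real one; where the file lives depends on your setup):

    <!-- hdfs-site.xml on the client: ask the NameNode to hand back
         DataNode hostnames instead of the IPs they registered from -->
    <property>
      <name>dfs.client.use.datanode.hostname</name>
      <value>true</value>
    </property>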

The way that HDFS handles client operations is this:

  1. Client connects to HDFS NameNode and asks to save a file
  2. NameNode creates a record of the file and decides where to save it in the cluster
  3. NameNode tells client the address of a DataNode to save the file to
  4. Client connects to the DataNode and writes the file there

The problem was happening in step 3. Without the property above set to true, the NameNode told the client that the DataNode was reachable at 127.0.0.1, which obviously failed from a Docker container with a different IP. In a standard installation this wouldn't matter, but it breaks when the client and the cluster live in separate Docker containers. With the property set to true, the NameNode instead returned the DataNode's hostname, which in our case was the Docker container ID. Luckily for us, inside a Docker network, container IDs are routable hostnames from anywhere in the network.
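The option can also be set programmatically on the client. Here is a minimal sketch of a Java client doing this; the NameNode address, path, and payload are placeholders, not from our actual setup:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder NameNode address; use your own container name and port
            conf.set("fs.defaultFS", "hdfs://namenode:8020");
            // The fix: ask the NameNode for DataNode hostnames rather than IPs
            conf.setBoolean("dfs.client.use.datanode.hostname", true);

            try (FileSystem fs = FileSystem.get(conf);
                 FSDataOutputStream out = fs.create(new Path("/foo.csv"))) {
                out.writeBytes("hello,world\n");
            }
        }
    }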
