@cloventt
Created July 27, 2017 05:05
Hadoop Inside Docker, DataNode Fails to Replicate Data Correctly

I just spent several hours trying to configure a pseudo-distributed Hadoop cluster inside a Docker container. I wanted to post our experience in case someone else makes the mistake of trying to do this themselves.

Problem

When we tried to save a file to HDFS with the Java client, the NameNode appeared to save the file. Using hdfs dfs -ls we could see the file was represented in HDFS, but it had a size of 0, indicating no data had made it into the cluster.
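For illustration, the listing looked something like the following (the path, owner, and timestamp here are placeholders, not our real file):

    $ hdfs dfs -ls /foo.csv
    -rw-r--r--   1 hadoop supergroup          0 2017-07-27 05:05 /foo.csv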

A really unhelpful stack trace was also issued to the client. The error was similar to this:

org.apache.hadoop.ipc.RemoteException(java.io.IOException):
    File foo.csv could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) excluded in this operation.

This is just the stack trace from Hadoop, piped back to the client.

Investigation

Our first clue was in this stack trace. The DataNode was labelled 127.0.0.1:50010, or something similar. This was wrong: our client was in a Docker container with IP address 172.18.0.3, and the DataNode was in a container with IP 172.18.0.4. From outside Docker, HDFS operations worked fine.

This told us the problem was probably a networking issue. At first we thought the DataNode was returning its localhost IP as its routable IP, which would fail inside Docker. After wading through the hopelessly useless Hadoop docs and support threads, we worked out that this was basically true, but it wasn't actually the problem.

The NameNode determines the DataNode IP by capturing the IP it registers from, which was localhost in our case. By default it then passes this IP along to the client.

Solution

The issue ultimately came down to a barely-documented configuration option in the HDFS client, dfs.client.use.datanode.hostname.
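A minimal sketch of setting this on the client side, in the client's hdfs-site.xml (the property name is the real one; where the file lives depends on your setup):

    <!-- hdfs-site.xml on the client: ask the NameNode to hand back
         DataNode hostnames instead of the IPs they registered from -->
    <property>
      <name>dfs.client.use.datanode.hostname</name>
      <value>true</value>
    </property>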

The way that HDFS handles client operations is this:

  1. Client connects to HDFS NameNode and asks to save a file
  2. NameNode creates a record of the file and decides where to save it in the cluster
  3. NameNode tells client the address of a DataNode to save the file to
  4. Client connects to the DataNode and writes the file there

The problem was happening in step 3. Without the property above set to true, the NameNode told the client that the DataNode was reachable at 127.0.0.1, which obviously failed from a Docker container with a different IP. In a standard installation this wouldn't matter, but it breaks when the client and the cluster live in separate Docker containers. With the property set to true, the NameNode instead returned the DataNode's hostname, which in our case was the Docker container ID. Luckily for us, inside a Docker network, container IDs are routable hostnames from anywhere in the network.
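The option can also be set programmatically on the client. Here is a minimal sketch of a Java client doing this; the NameNode address, path, and payload are placeholders, not from our actual setup:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder NameNode address; use your own container name and port
            conf.set("fs.defaultFS", "hdfs://namenode:8020");
            // The fix: ask the NameNode for DataNode hostnames rather than IPs
            conf.setBoolean("dfs.client.use.datanode.hostname", true);

            try (FileSystem fs = FileSystem.get(conf);
                 FSDataOutputStream out = fs.create(new Path("/foo.csv"))) {
                out.writeBytes("hello,world\n");
            }
        }
    }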
