I looked at all the process logs and took notes. For the MiddleManager I copied:
Connection timed out 10.204.148.226:8081
Retries exhausted, couldn't fulfill request to [http://10.204.148.226:8081/druid/indexer/v1/taskStatus]
And wrote a note:
This is the overlord IP address [found by looking at the ECS UI]. The overlord does not have this port open. This looks like it is for the router. Could it be trying to reach the wrong thing?
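A quick way to test a suspicion like this is a plain TCP connect from the machine that logged the timeout. This is a minimal Python sketch, not something from the original investigation; the helper name `port_is_open` is made up for illustration, and it only tells you whether the port accepts connections, not which Druid process is behind it.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out)
        # when the port cannot be reached.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Address and port taken from the MiddleManager log lines above.
    print(port_is_open("10.204.148.226", 8081))
```

Run from the MiddleManager host, a `False` here would match the "Connection timed out" message in the log.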
I used 2.zookeeper:8080/commands/dump and 3.zookeeper:8080/commands/dump to see what ZooKeeper thinks is around. 1.zookeeper returned "This ZooKeeper instance is not currently serving requests"... maybe normal? Not sure.
In the other 2 responses I saw:
"/druid/internal-discovery/OVERLORD/10.204.148.226:8081"
Before this I wondered, "did the overlord and coordinator switch EC2 instances and somehow the zookeeper data is stale?" But this confirmed that 10.204.148.226:8081 knew it was the overlord. So now I was wondering, "why is something trying to reach the overlord on 8081?" That's the coordinator's open port. I Googled and looked in the documentation for port 8081. All I could find was documentation for the Coordinator referencing that port.
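The ZooKeeper lookup above can be scripted instead of read by eye. The following is a rough Python sketch under a few assumptions: the dump endpoint returns JSON, the discovery entries appear somewhere in it as strings shaped like /druid/internal-discovery/ROLE/host:port (as in the line quoted above), and the function names are invented for this example. It deliberately walks the whole response rather than relying on specific field names.

```python
import json
import urllib.request

DISCOVERY_PREFIX = "/druid/internal-discovery/"


def discovery_paths(dump):
    """Collect every string in a parsed dump response that starts with DISCOVERY_PREFIX."""
    found = set()
    stack = [dump]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            stack.extend(node.keys())
            stack.extend(node.values())
        elif isinstance(node, list):
            stack.extend(node)
        elif isinstance(node, str) and node.startswith(DISCOVERY_PREFIX):
            found.add(node)
    return sorted(found)


def announced_nodes(dump):
    """Group the discovery paths into {role: [host:port, ...]}."""
    nodes = {}
    for path in discovery_paths(dump):
        role, _, address = path[len(DISCOVERY_PREFIX):].partition("/")
        nodes.setdefault(role, []).append(address)
    return nodes


def fetch_dump(host, port=8080, timeout=5.0):
    """Fetch and parse /commands/dump from one ZooKeeper node (assumes a JSON response)."""
    url = f"http://{host}:{port}/commands/dump"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Something like `announced_nodes(fetch_dump("2.zookeeper"))` would then show which address each role has registered.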
It occurred to me that the port setting (druid.plaintextPort) was not namespaced at all: each process type must have a value for it.
I turned to the ultimate source of truth (the source code on GitHub) and searched for "8081" to see if this was being set in the default config, since I couldn't find it in the documentation. The first hit was the coordinator-overlord config for the cluster setup. Then I wondered how the Docker script handles this new type and found the line that references it.
At this point I decided to open port 8081 on the overlord process and deployed that change.
During my GitHub search I also found this issue discussing the removal of the overlord. I see the phrase "breaking change" :-O
So it seems that both the coordinator and overlord processes were advertising themselves as available on port 8081 because they share the exact same druid.plaintextPort value. However, we did not have that port open on the overlord process because it was not documented. Therefore the overlord could not be reached at 8081. I removed the overlord process and the UI started working and tasks started running. One less process to worry about.
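The mismatch can be written down as a tiny check: compare the port each role advertises with the ports actually opened for it. This is only an illustrative Python sketch. The function name and data layout are invented, and the 8090 below is an assumption (Druid's documented default overlord port) standing in for whatever port had really been opened on the overlord.

```python
def unreachable_roles(advertised, open_ports):
    """Return the roles whose advertised port is not among the ports opened for them."""
    return sorted(
        role
        for role, port in advertised.items()
        if port not in open_ports.get(role, set())
    )


# Both roles inherit the same druid.plaintextPort value from the shared config.
advertised = {"COORDINATOR": 8081, "OVERLORD": 8081}

# Ports opened per process; 8090 for the overlord is an assumed value.
open_ports = {"COORDINATOR": {8081}, "OVERLORD": {8090}}

print(unreachable_roles(advertised, open_ports))
```

With these inputs only the overlord is flagged, which is the situation described above.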