Two Storm supervisors running in a 3-node configuration: one "head" node (nimbus) and two worker nodes. Each worker node runs the Storm supervisor daemon, itself kept alive under process supervision.
Multi-lang topology using Python and our streamparse library. It works fine when running on a single machine. The spout reads off Kafka and emits tuples into the topology; downstream bolts do various analyses with latencies ranging from 10ms to 3,000ms. We run with a spout parallelism of 4-8 and a max.spout.pending value of 500-1,000.
We have been testing the topology for months using a two-node cluster (a single supervisor plus nimbus). We have tweaked worker parallelism settings; on a single supervisor node, we often have 8-12 worker JVM processes running, and the topology has handled this fine.
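For context on the knobs mentioned above: the number of worker JVM slots a supervisor node offers and the spout back-pressure cap are ordinary Storm settings. Here is a minimal sketch of the relevant storm.yaml entries; the path, port numbers, and values are illustrative assumptions, not our actual configuration:
# Illustrative only: each entry under supervisor.slots.ports is one worker
# JVM slot on this node; topology.max.spout.pending caps un-acked tuples
# per spout task. The config path is an assumption for this sketch.
cat >> /opt/storm/conf/storm.yaml <<'EOF'
supervisor.slots.ports:
    - 6700
    - 6701
    - 6702
    - 6703
topology.max.spout.pending: 500
EOF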
The second we try to run the topology on multiple nodes, however, lots of weird behavior starts happening. Instead of a steady stream of about 1-2k reads per second from Kafka, we get spiky behavior: about 1k reads happen, then the topology enters a frozen state and no more tuples get pushed through the spout.
This seems to happen in both 0.9.1-incubating and the new 0.9.2-incubating release candidate, and it seems to affect both the ZeroMQ and Netty transports.
Download the stack traces in stacktraces.zip below.
Included with this report are stack traces gathered from all the worker JVMs (+ supervisor) at two points in time on each supervisor node.
The first state (dump1) is from when I saw that the spout had stopped sending tuples and failures had started showing up in the Storm UI. The second state (dump2) is from when I forcibly deactivated the spout via the Storm UI, waited about 10 seconds, and then ran the dump commands again (shown below).
I then experimented a little: I killed the supervisor on the second node, watched as new workers spun up on the first node, and saw that the topology was still frozen. So I took a stack trace of all the pids on that node only (supervisor1). That is dump3.
The most interesting thing about this is that the machines aren't at 0% CPU or frozen. Instead, a few of the bolts are still doing a little work, and the JVM worker processes are at 20% CPU usage each. But the load is nowhere near where it should be if this thing were running full speed.
The other odd part is that when I deactivated the spout, nothing about the load of the machine changed. Normally, when the cluster is healthy, deactivating the spout has the effect of dropping the load dramatically, especially on the JVM worker machines. But in this case, it basically had no effect. I thought that odd enough to take the second set of stack trace dumps (dump2).
It was at that point that I had the idea to force the workers onto a single node, which is when I snapshotted dump3, noting that the machine load had stayed pretty much the same (20% CPU per worker JVM and basically no CPU usage at the spouts).
I used a combination of jps and jstack. For example:
# Collect the pids of every running JVM (workers + supervisor), excluding jps itself
sudo jps -l | grep -v 'Jps' | cut -f 1 -d' ' > pids

# Force a thread dump from each JVM into <pid>.dump
for pid in $(cat pids); do
    sudo jstack -F "$pid" > "$pid.dump"
done
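Not part of the original capture procedure, but a rough way to triage a pile of dumps like this is to count the threads jstack marks as blocked, which gives a quick signal of which workers look stuck:
# Count blocked threads per dump. Note that jstack -F output marks threads
# with "(state = BLOCKED)" fairly liberally, so treat the counts as a rough
# signal rather than a precise measure.
grep -c 'BLOCKED' *.dump | sort -t: -k2 -rn | head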
The issue had nothing to do with Storm and everything to do with Ubuntu 14.04 and its interaction with Xen network kernel drivers in EC2.
I was staring at the results of this research and thinking, "What could possibly cause the network subsystem of Storm to just hang?"
My first impulse: firewalls. Maybe as the network traffic ramped up, I was running up against a firewall rule?
I checked our munin monitoring graphs and noticed a bunch of eth0 errors correlated with our topologies running. I checked our production Storm 0.8.2 cluster -- no errors. Ah hah! It must be firewall rules or something!
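(As an aside, the same eth0 error counters that munin was graphing can be read directly on each node; this is a generic check rather than a step from the original investigation:)
# Per-interface RX/TX error and drop counters, straight from the kernel
ip -s link show eth0
# or, with the older net-tools on Ubuntu 14.04
ifconfig eth0 | grep -E 'errors|dropped'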
That led me to run dmesg on the supervisor nodes. I found a bunch of entries like this:
xen_netfront: xennet: skb rides the rocket: 20 slots
xen_netfront: xennet: skb rides the rocket: 19 slots
That's odd. I also saw some entries related to ufw (Ubuntu's firewall service). So, I tried running ufw disable. No change.
I then dug into these error messages some more and came across this open bug on Launchpad:
https://bugs.launchpad.net/ubuntu/+source/linux-lts-raring/+bug/1195474
I dug in there and came across the current workaround, which is to run:
sudo ethtool -K eth0 sg off
on the server. I issued that command, restarted my topology, and VOILA, it was now running at full performance.
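For anyone hitting the same bug: you can verify that the offload setting actually changed, and make the workaround survive a reboot. The interface name and the interfaces-file approach below are assumptions based on a stock Ubuntu 14.04 setup, not something taken from the Launchpad thread:
# Confirm that scatter-gather offload is now off on the interface
sudo ethtool -k eth0 | grep 'scatter-gather'
# One way to persist the workaround across reboots on Ubuntu 14.04 is to add
# a post-up hook to the interface stanza in /etc/network/interfaces, e.g.
#   iface eth0 inet dhcp
#       post-up /sbin/ethtool -K eth0 sg off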
Back in my earliest days as a professional programmer, I had a friend named Jimmy. I once spent 3 days debugging a JVM garbage collection issue with him. We ran profilers, did detailed code traces, added extensive logging, and so on. And in the end, the fix was a single-line code change -- a mistaken allocation of an expensive object that was happening in a tight loop. At that moment, I coined "Jimmy's Law", which is:
"The amount of time it takes to discover a bug's fix is inversely proportional to the lines of code changed by the fix, with infinite time converging to one line."
After hours of investigating and debugging this issue, that's certainly how I feel. Shame on me for upgrading my Storm cluster and Ubuntu version simultaneously!
Nice catch.
We had the same problem on a cluster with NIC-bonded nodes, where one of the bonded NICs was mistakenly configured at 100Mbps instead of 1000Mbps, which saturated it and ultimately caused the infamous
[ERROR] Async loop died! java.lang.RuntimeException: java.lang.RuntimeException: Remote address is not reachable. We will close this client Netty-Client****
and
java.lang.RuntimeException: java.lang.RuntimeException: Client is being closed, and does not take requests any more
exceptions, neither of which Storm recovers from.
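For anyone checking for the same misconfiguration, the negotiated speed and duplex of a NIC (bonded slave or otherwise) can be read with ethtool; eth0 below is just a placeholder interface name:
# Prints the negotiated link settings, e.g. "Speed: 100Mb/s" / "Duplex: Full"
sudo ethtool eth0 | grep -E 'Speed|Duplex'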