We recently started focusing on telemetry at work. Creating data to help drive
good business decisions. Inasmuch, we are encouraging engineers to start
instrumenting their code to test and utilize our stack. We decided on an
App => Statsite => Graphite => Grafana flow for our metrics.
If you aren't aware, Statsite is a port of StatsD into C but using better and smarter algorithms for computing statistics. Both libraries listen on a port for simple strings to come through, aggregate the data and flush it to Graphite every so often.
I was troubleshooting an interesting issue today in regards to onboarding apps to using the stack. A php application was not able to get metrics through our pipeline and into graphite. They were using a php library built for StatsD to send measurements to Statsite.
The first thing was to make sure metrics were being sent to Statsite.
tcpdump -i lo -X udp and dst port 8125I was looking for UDP traffic on the loopback adapter destined for port 8125. Good news! I saw traffic.
19:16:59.062532 IP6 localhost.56309 > localhost.8125: UDP, length 23
0x0000: 6000 0000 001f 1140 0000 0000 0000 0000 `......@........
0x0010: 0000 0000 0000 0001 0000 0000 0000 0000 ................
0x0020: 0000 0000 0000 0001 dbf5 1fbd 001f 4cf2 ..............L.
0x0030: 6170 692e 616e 6765 6c2e 7375 6363 6573 api.angel.succes
0x0040: 7365 733a 317c 63 ses:1|c
This meant the library was sending metrics. According to the ASCII it looks properly formatted too. So lets see if Statsite is sending it's TCP traffic to graphite.
tcpdump -i eth1 -X tcp and dst port 2003Here we are looking for TCP traffic destined for port 2003 on the public eth1 adapter. Unlucky for us, it looks like some traffic is flowing but I don't see any of our metrics. Something is failing in between.
listening on eth1, link-type EN10MB (Ethernet), capture size 65535 bytes
17:30:43.478514 IP xxx.49260 > xxx2: Flags [S], seq 3824488193, win 14600, options [mss 1460,sackOK,TS val 1336162366 ecr 0,nop,wscale 7], length 0
0x0000: 4500 003c 373f 4000 4006 a796 ae24 91d3 E..<7?@.@....$..
0x0010: 3691 e55d c06c 07d3 e3f5 0f01 0000 0000 6..].l..........
0x0020: a002 3908 5c15 0000 0204 05b4 0402 080a ..9.\...........
0x0030: 4fa4 383e 0000 0000 0103 0307 O.8>........
I wanted to test a little further so I logged into the PHP machine and tried to send a metric on my own via the cmd line.
echo "omg:1|c" | nc -u localhost 8125Voila! The metric appeared in Graphite. Lets look at the network packet for that communication.
listening on lo, link-type EN10MB (Ethernet), capture size 65535 bytes
17:31:30.361318 IP localhost.54538 > localhost.8125: UDP, length 8
0x0000: 4500 0024 6607 4000 4011 d6bf 7f00 0001 E..$f.@.@.......
0x0010: 7f00 0001 d50a 1fbd 0010 fe23 6f6d 673a ...........#omg:
0x0020: 317c 630a 1|c.
Ah ha! There are 2 sneaky little bits at the end of the hex code that are
different. 0a aka \n aka newline. Unbeknownst to me, the StatsD library
doesn't add a newline to the end of a single stats but it does when you send
multiple metrics. Statsite requires the newline.
This small incompatibility proved difficult to track down but was easy to solve once we found it.
Unfortunately adding the newline did not make metrics flow through the
pipeline. They still weren't coming out the other end of Statsite. So let's
kill Statsite and set up an nc server instead so we can see whats going on.
nc -lup 8125With netcat listening, I refreshed a page that should have generated PHP
metrics. Nothing... tcpdump was happy and proudly showed some data destined
for port 8125 but netcat never got it. Weird. This points to some routing
issue. Lets take a peek from another perspective.
sysdig is a great to for troubleshooting almost anything. I want to use it to
watch the traffic generated by PHP through Apache2 to the Statsite port 8125.
sysdig -s2000 -A -c echo_fds fd.port=8125Hmmmm.... Even when tcpdump shows traffic sysdig doesn't get it. If we
follow apache2 instead of the port:
sysdig -s2000 -A -c echo_fds proc.name=apache2then we see apache get the metric but we don't see it forward to a port.
------ Write 34B to (apache2)
omg:1|c
The PHP library is configured to send UDP traffic to localhost port 8125.
In the past I have had problems with the differences in translation between
localhost and 127.0.0.1 so on a whim I changed the configuration to
127.0.0.1.
------ Write 36B to 127.0.0.1:40158->127.0.0.1:8125 (apache2)
omg:1|c
IT WORKS! ![]()
So why is this the case? Based on some research it seems that as long as we have an alias for localhost pointing to the IPv6 ::1 then our app listening on 0.0.0.0 won't get the traffic. For some reason `fsockopen prefers IPv6.
This was a great catch. I'd have messed around with the PHP forever and never found anything.