var http = require('http')
var fs = require('fs')

http.createServer(function (req, res) {
  res.writeHead(200, {'Content-Type': 'text/plain'})
  res.end("hello world")
}).listen(3000);
This is just run with node server.js
ab -n 100000 -c100 http://localhost:3000/
Requests per second: 1339.35 [#/sec] (mean)
Requests per second: 1019.17 [#/sec] (mean)
Requests per second: 2899.45 [#/sec] (mean)
Requests per second: 4085.26 [#/sec] (mean)
for x in xrange(100000000): x + 1
time python test.py
real 0m13.531s user 0m13.487s sys 0m0.060s
real 0m10.337s user 0m10.302s sys 0m0.054s
real 0m9.177s user 0m9.120s sys 0m0.040s
real 0m8.909s user 0m8.897s sys 0m0.012s
The first iterations of this had node clustered with 2 forks, incrementing a remote redis counter, with ab being run from a remote machine. While the numbers were higher (about 50%), they were proportionally the same. This version just removes all the variables.
I understand ab isn't exact, but its margin of error is tight enough to make it suitable for a ballpark. I've worked with Joyent and Amazon in the past. In fact, I pulled 100 machines off AWS and put them on Joyent precisely because my own benchmarks - much more thorough than this - showed Joyent being 5-10 times faster.
When I ran the above I expected Joyent to completely demolish AWS.
I'm happy to be told I've done something terribly wrong. In fact I'm hoping for it.
The benchmark isn't testing what it may seem to test. It's launching a flood of TCP connections from a single client IP address, which is causing a problem.
On SmartOS, these new connections clash with old ones still in TIME_WAIT, causing client retransmits and delays. The problem is that TCP ports are 16-bit, and the four-tuple (or three-tuple) of client-IP:client-port:server-IP:server-port is used as the unique TCP connection identifier. Since this benchmark has only one client IP, one server IP, and one server port, the only variable left to uniquely identify connections is the 16-bit client ephemeral port (which by default is restricted to 32k-65k, so only 15 bits). So after (only) tens of thousands of connections, the chances of colliding with an old session in TIME_WAIT become great, which slows down ab. Linux has a different mechanism for recycling TIME_WAIT sessions, which is better at avoiding this.
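To put rough numbers on that, here's a back-of-the-envelope sketch. The ~32k ephemeral port range and the 60-second TIME_WAIT interval are assumptions (typical defaults); the connection rate is taken from the ab runs above:

// Back-of-the-envelope: how many recent four-tuples compete for the client's ephemeral ports.
// Assumptions: ephemeral range ~32768-65535, TIME_WAIT of ~60 seconds.
var ephemeralPorts = 65535 - 32768 + 1   // ~32768 usable client ports
var connsPerSec    = 1300                // roughly what ab measured above
var timeWaitSecs   = 60                  // how long a closed connection lingers
var recent         = connsPerSec * timeWaitSecs
console.log(recent, 'recent connections vs', ephemeralPorts, 'ports')
// ~78000 recent connections chasing ~32768 ports, so ephemeral port reuse
// (and therefore TIME_WAIT collisions) is effectively guaranteed.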
While the test is running on SmartOS, the node server is largely idle:
In that interval, node is spending 83% of its time sleeping, as the ab client becomes blocked on TCP retransmits.
The best way to benchmark node.js is to use multiple clients, which avoids the TIME_WAIT clashes and is what usually happens in the real world.
I'd like to show an easy workaround for anyone trying this form of single-client benchmarking: have the node server listen on several ports, and have ab test them all simultaneously, with the aim of driving the node CPU time to 100%. However, I don't see a way to have ab test more than one URL at a time. :(
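For what it's worth, here's a sketch of the server side of that idea (the port list is arbitrary, and each port would need to be driven by its own ab process, since a single ab can't hit multiple URLs):

var http = require('http')

// Hypothetical set of ports; each one gets its own server instance.
// Spreading connections across server ports widens the four-tuple space,
// so the client's ephemeral ports collide with TIME_WAIT sessions less often.
var ports = [3000, 3001, 3002, 3003]

ports.forEach(function (port) {
  http.createServer(function (req, res) {
    res.writeHead(200, {'Content-Type': 'text/plain'})
    res.end("hello world")
  }).listen(port)
})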
Here's a different tactic: adjust the server to support keep-alive:
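A minimal sketch of that adjustment (assuming the same hello-world server as above; setting Content-Length gives the response a known length, so node can keep the HTTP/1.0 connections from ab open):

var http = require('http')

var body = "hello world"

http.createServer(function (req, res) {
  res.writeHead(200, {
    'Content-Type': 'text/plain',
    // With a known length, node can honor the client's Connection: Keep-Alive
    // and reuse the TCP connection instead of closing it after every response.
    'Content-Length': body.length
  })
  res.end(body)
}).listen(3000);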
Then use ab -k
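For example, repeating the earlier run with keep-alive enabled:

ab -k -n 100000 -c100 http://localhost:3000/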
The idea of using keep-alive is to take the single-client connection-flood problem off the table.
Again, the use of multiple clients should avoid the clash, as usually happens in production.