node.js fs.readFile analysis

Here's some more perf analysis: what was the limiter for node.js in the original fs.readFile benchmark,
causing the large performance difference?

Examining thread microstates using prstat (SmartOS):

# prstat -mLcp 92432 1
Please wait...
[...]
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 92432 root      90  10 0.0 0.0 0.0 0.0 0.0 0.0   7   9 42K   0 node/1

One thread is 100% on-CPU (USR + SYS), which is likely the limiter (this alone isn't enough information to
confirm it; however, given the node.js architecture, it is very likely). That thread is spending 90% of its
time in user-land. Some straightforward profiling with DTrace can tell us what it's doing.

# dtrace -n 'profile-99 /pid == $1/ { @[jstack(100, 8000)] = count(); }' 92432
dtrace: description 'profile-99 ' matched 1 probe
^C
[...]
node`_ZN2v88internal19FlexibleBodyVisitorINS0_20StaticMarkingVisitorENS0_10FixedArray14BodyDescriptorEvE5VisitEPNS0_3MapEPNS0_10HeapObjectE+0x5b
node`_ZN2v88internal20MarkCompactCollector17EmptyMarkingStackEv+0x6f
node`_ZN2v88internal13GlobalHandles18IterateStrongRootsEPNS0_13ObjectVisitorE+0x51
node`_ZN2v88internal4Heap18IterateStrongRootsEPNS0_13ObjectVisitorENS0_9VisitModeE+0x139
node`_ZN2v88internal20MarkCompactCollector15MarkLiveObjectsEv+0x83
node`_ZN2v88internal20MarkCompactCollector14CollectGarbageEv+0x24
node`_ZN2v88internal4Heap11MarkCompactEPNS0_8GCTracerE+0x92
node`_ZN2v88internal4Heap24PerformGarbageCollectionENS0_16GarbageCollectorEPNS0_8GCTracerE+0x33b
node`_ZN2v88internal4Heap14CollectGarbageENS0_15AllocationSpaceENS0_16GarbageCollectorE+0x23c
node`_ZN2v88internal7Factory11NewJSObjectENS0_6HandleINS0_10JSFunctionEEENS0_13PretenureFlagE+0x90
node`_ZN2v88internal17Runtime_NewObjectENS0_9ArgumentsEPNS0_7IsolateE+0x1a2
<< internal code >>
<< constructor >>
allocNewPool at fs.js position 53610
(anon) as ReadStream._read at fs.js position 57236
(anon) as fs.ReadStream at fs.js position 56236
<< internal >>
<< entry >>
node`_ZN2v88internalL6InvokeEbNS0_6HandleINS0_10JSFunctionEEENS1_INS0_6ObjectEEEiPPPS4_Pb+0x1d5
node`_ZN2v88internal9Execution4CallENS0_6HandleINS0_6ObjectEEES4_iPPPS3_Pbb+0x57
node`_ZN2v88Function4CallENS_6HandleINS_6ObjectEEEiPNS1_INS_5ValueEEE+0xd5
node`_ZN4nodeL5AfterEP7uv_fs_s+0x14d
node`uv__fs_after+0x271
node`eio_finish+0x28
node`etp_poll+0x130
node`eio_poll+0xb
node`uv_eio_do_poll+0x31
node`uv__idle+0x29
node`ev_invoke_pending+0x71
node`ev_run+0x452
node`uv_run+0x17
node`_ZN4node5StartEiPPc+0xa6
node`main+0x1b
node`_start+0x83
15
[...]

I've included one of the common stacks above. It shows node.js is spending its time in garbage collection.
It also shows some other interesting details, which I'll get to in a moment.
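
For context, here is a minimal sketch of the kind of per-request code that produces the stack above. This
is not the actual benchmark harness (the file name and HTTP framing here are assumptions); it just shows
how each fs.ReadStream read can end up in allocNewPool(), and from there in the garbage collector:

// Hypothetical sketch, not the benchmark code: serve a small file per request
// using a read stream. Each ReadStream._read() carves its chunk out of a shared
// pool Buffer, and calls allocNewPool() when that pool runs low, so a high
// request rate becomes a high Buffer-allocation (and GC) rate.
var http = require('http');
var fs = require('fs');

http.createServer(function (req, res) {
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  fs.createReadStream('foo.txt').pipe(res);   // 'foo.txt' is a placeholder
}).listen(8080);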

Quantifying garbage collection time, per second:

# dtrace -n 'node$target:::gc-start { self->start = timestamp; }
    node$target:::gc-done /self->start/ { @["ms"] = sum(timestamp - self->start); self->start = 0; }
    tick-1s { normalize(@, 1000000); printa("GC: %@d ms", @); trunc(@); }' -p 92432
dtrace: description 'node$target:::gc-start ' matched 3 probes
CPU     ID                    FUNCTION:NAME
 12  64930                        :tick-1s GC: 649 ms
 12  64930                        :tick-1s GC: 155 ms
 12  64930                        :tick-1s GC: 115 ms
 12  64930                        :tick-1s GC: 551 ms
 12  64930                        :tick-1s GC: 552 ms
 12  64930                        :tick-1s GC: 132 ms
 12  64930                        :tick-1s GC: 219 ms
 12  64930                        :tick-1s GC: 633 ms
 12  64930                        :tick-1s GC: 264 ms

The time in GC is indeed significant, and varies from second to second. Using both sampling and tracing,
I measured the GC time to be between 32% and 54% of runtime. Since this GC is caused by the allocation
of fs objects (e.g., the "allocNewPool at fs.js" frame in the above stack), there is additional overhead
from the object creation itself.

This means this is "a" problem, and one that may lead to a 2x improvement should the GC/object time be
eliminated (removing roughly half the runtime roughly doubles throughput). There may well be other
problems in how the fs package actually operates, leading to further wins on top of that.

I've heard it was already known that the fs package doesn't perform well. I'm guessing it wasn't fixed
sooner because it hadn't really shown up in the real world. I've DTrace'd production node.js many times, and
haven't come across this before myself.

I'll take a quick look at the code behind the "allocNewPool at fs.js" frame shown earlier (lib/fs.js):

[...]
  38 var kPoolSize = 40 * 1024;
[...]
1014 function allocNewPool() {
1015   pool = new Buffer(kPoolSize);
1016   pool.used = 0;
1017 }
[...]
1096 ReadStream.prototype._read = function() {
[...]
1102   if (!pool || pool.length - pool.used < kMinPoolSpace) {
1103     // discard the old pool. Can't add to the free list because
1104     // users might have references to slices on it.
1105     pool = null;
1106     allocNewPool();
1107   }
[...]
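
The comment at lines 1103-1104 is why the pool is discarded rather than reused: Buffer slices share memory
with their parent. A tiny illustration of that (my own example, not from fs.js):

// Hypothetical illustration: a slice handed to a user shares memory with, and
// keeps alive, the entire parent pool Buffer, so the pool can't safely be reused.
var pool = new Buffer(40 * 1024);
var chunk = pool.slice(0, 100);    // like a chunk a ReadStream hands out
chunk[0] = 42;
console.log(pool[0]);              // 42: same underlying memory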

Line 38 sets the buffer size to 40k. Setting it to 1k improved performance (for this benchmark, on my system)
by about 30%.

Before, with kPoolSize @ 40k:

134986 Rate: count/sec: 5298.013245033113 Average rate: 4726.416072777918
138006 Rate: count/sec: 5298.013245033113 Average rate: 4738.92439459154
141026 Rate: count/sec: 5298.013245033113 Average rate: 4750.8969977167335
144046 Rate: count/sec: 0.0 Average rate: 4651.291948405371
147066 Rate: count/sec: 5298.013245033113 Average rate: 4664.572368868399
150086 Rate: count/sec: 5298.013245033113 Average rate: 4677.3183374865075
153106 Rate: count/sec: 5298.013245033113 Average rate: 4689.561480281635

It's averaging around 4700/sec (last column; the count/sec column is erratic due to how the vert.x
RateCounter is updated).

Now with 1k:

193475 Rate: count/sec: 5298.013245033113 Average rate: 6109.316449153637
196495 Rate: count/sec: 5298.013245033113 Average rate: 6096.847248021578
199515 Rate: count/sec: 5298.013245033113 Average rate: 6084.755532165502
202535 Rate: count/sec: 10596.026490066226 Average rate: 6152.023107117288
205555 Rate: count/sec: 5298.013245033113 Average rate: 6139.4760526379805
208575 Rate: count/sec: 5298.013245033113 Average rate: 6127.292340884574
211595 Rate: count/sec: 5298.013245033113 Average rate: 6115.456414376521

The average is now about 6100/sec.

Someone will need to look at fs.js carefully to see what the best way to improve it is (which may be a
different value for kPoolSize, or a different strategy altogether), assuming this is considered a priority
to fix based on real-world usage.
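
For reference, the 1k experiment described above amounts to a one-line change in lib/fs.js. This is a sketch
of the experiment only, not a recommended fix; the right value, or a different allocation strategy, is
exactly what needs that closer look:

// lib/fs.js (sketch of the experiment only; 40k is the shipped default)
var kPoolSize = 1 * 1024;    // was: 40 * 1024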

...

Some were wondering if the main issue was different FS I/O latency. That's not the case on my system (ZFS),
which can be easily confirmed using DTrace:

node.js:

# dtrace -n 'fbt::zfs_read:entry /pid == $1/ { self->start = timestamp; }
    fbt::zfs_read:return /self->start/ { @["ns"] = quantize(timestamp - self->start); self->start = 0; }' 22900
dtrace: description 'fbt::zfs_read:entry ' matched 2 probes
^C
  ns
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@@@@@@@@@@                      20410
            2048 |@@@@@@@@@@@@@@@                          16156
            4096 |@@@@@                                    5475
            8192 |                                         231
           16384 |                                         9
           32768 |                                         2
           65536 |                                         0

The output shows the distribution of FS read latency, in nanoseconds.

vert.x:

# dtrace -n 'fbt::zfs_read:entry /pid == $1/ { self->start = timestamp; }
    fbt::zfs_read:return /self->start/ { @["ns"] = quantize(timestamp - self->start); self->start = 0; }' 22900
dtrace: description 'fbt::zfs_read:entry ' matched 2 probes
^C
  ns
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@@@@@@@@@@                      56674
            2048 |@@@@@@@@@@@@@@@@                         48668
            4096 |@@@@                                     12820
            8192 |                                         555
           16384 |                                         21
           32768 |                                         2
           65536 |                                         0