node.js fs.readFile analysis

Here's some more perf analysis: what was the limiter for node.js in the original fs.readFile benchmark,
causing the large performance difference?

Examining thread microstates using prstat (SmartOS):

# prstat -mLcp 92432 1
Please wait...
[...]
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 92432 root      90  10 0.0 0.0 0.0 0.0 0.0 0.0   7   9 42K   0 node/1

One thread is 100% on-CPU (USR + SYS), which is likely the limiter (this alone isn't enough information to
confirm it; however, given the node.js architecture, it is very likely). That thread is spending 90% of its
time in user-land. Some straightforward profiling with DTrace can tell us what it's doing.

# dtrace -n 'profile-99 /pid == $1/ { @[jstack(100, 8000)] = count(); }' 92432
dtrace: description 'profile-99 ' matched 1 probe
^C
[...]
node`_ZN2v88internal19FlexibleBodyVisitorINS0_20StaticMarkingVisitorENS0_10FixedArray14BodyDescriptorEvE5VisitEPNS0_3MapEPNS0_10HeapObjectE+0x5b
node`_ZN2v88internal20MarkCompactCollector17EmptyMarkingStackEv+0x6f
node`_ZN2v88internal13GlobalHandles18IterateStrongRootsEPNS0_13ObjectVisitorE+0x51
node`_ZN2v88internal4Heap18IterateStrongRootsEPNS0_13ObjectVisitorENS0_9VisitModeE+0x139
node`_ZN2v88internal20MarkCompactCollector15MarkLiveObjectsEv+0x83
node`_ZN2v88internal20MarkCompactCollector14CollectGarbageEv+0x24
node`_ZN2v88internal4Heap11MarkCompactEPNS0_8GCTracerE+0x92
node`_ZN2v88internal4Heap24PerformGarbageCollectionENS0_16GarbageCollectorEPNS0_8GCTracerE+0x33b
node`_ZN2v88internal4Heap14CollectGarbageENS0_15AllocationSpaceENS0_16GarbageCollectorE+0x23c
node`_ZN2v88internal7Factory11NewJSObjectENS0_6HandleINS0_10JSFunctionEEENS0_13PretenureFlagE+0x90
node`_ZN2v88internal17Runtime_NewObjectENS0_9ArgumentsEPNS0_7IsolateE+0x1a2
<< internal code >>
<< constructor >>
allocNewPool at fs.js position 53610
(anon) as ReadStream._read at fs.js position 57236
(anon) as fs.ReadStream at fs.js position 56236
<< internal >>
<< entry >>
node`_ZN2v88internalL6InvokeEbNS0_6HandleINS0_10JSFunctionEEENS1_INS0_6ObjectEEEiPPPS4_Pb+0x1d5
node`_ZN2v88internal9Execution4CallENS0_6HandleINS0_6ObjectEEES4_iPPPS3_Pbb+0x57
node`_ZN2v88Function4CallENS_6HandleINS_6ObjectEEEiPNS1_INS_5ValueEEE+0xd5
node`_ZN4nodeL5AfterEP7uv_fs_s+0x14d
node`uv__fs_after+0x271
node`eio_finish+0x28
node`etp_poll+0x130
node`eio_poll+0xb
node`uv_eio_do_poll+0x31
node`uv__idle+0x29
node`ev_invoke_pending+0x71
node`ev_run+0x452
node`uv_run+0x17
node`_ZN4node5StartEiPPc+0xa6
node`main+0x1b
node`_start+0x83
15
[...]

I've included one of the common stacks above. It shows node.js is spending its time in garbage collection.
It also shows some other interesting details, which I'll get to in a moment.
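
For context, here is a minimal sketch of the kind of per-request code that produces the stack above. This
is not the actual benchmark harness (the file name and HTTP framing here are assumptions); it just shows
how each fs.ReadStream read can end up in allocNewPool(), and from there in the garbage collector:

// Hypothetical sketch, not the benchmark code: serve a small file per request
// using a read stream. Each ReadStream._read() carves its chunk out of a shared
// pool Buffer, and calls allocNewPool() when that pool runs low, so a high
// request rate becomes a high Buffer-allocation (and GC) rate.
var http = require('http');
var fs = require('fs');

http.createServer(function (req, res) {
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  fs.createReadStream('foo.txt').pipe(res);   // 'foo.txt' is a placeholder
}).listen(8080);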

Quantifying garbage collection time, per second:

# dtrace -n 'node$target:::gc-start { self->start = timestamp; }
    node$target:::gc-done /self->start/ { @["ms"] = sum(timestamp - self->start); self->start = 0; }
    tick-1s { normalize(@, 1000000); printa("GC: %@d ms", @); trunc(@); }' -p 92432
dtrace: description 'node$target:::gc-start ' matched 3 probes
CPU     ID                    FUNCTION:NAME
 12  64930                        :tick-1s GC: 649 ms
 12  64930                        :tick-1s GC: 155 ms
 12  64930                        :tick-1s GC: 115 ms
 12  64930                        :tick-1s GC: 551 ms
 12  64930                        :tick-1s GC: 552 ms
 12  64930                        :tick-1s GC: 132 ms
 12  64930                        :tick-1s GC: 219 ms
 12  64930                        :tick-1s GC: 633 ms
 12  64930                        :tick-1s GC: 264 ms

The time in GC is indeed significant, and varies from second to second. Using both sampling and tracing,
I measured the GC time to be between 32% and 54% of runtime. Since this GC is caused by the allocation
of fs objects (e.g., the "allocNewPool at fs.js" frame in the above stack), there is additional overhead
from the object creation itself.

This means this is "a" problem, and one that may lead to a 2x improvement should the GC/object time be
eliminated (removing roughly half the runtime roughly doubles throughput). There may well be other
problems in how the fs package actually operates, leading to further wins on top of that.

I've heard it was already known that the fs package doesn't perform well. I'm guessing it wasn't fixed
sooner because it hadn't really shown up in the real world. I've DTrace'd production node.js many times, and
haven't come across this before myself.

I'll take a quick look at the code behind the "allocNewPool at fs.js" frame shown earlier (lib/fs.js):

[...]
  38 var kPoolSize = 40 * 1024;
[...]
1014 function allocNewPool() {
1015   pool = new Buffer(kPoolSize);
1016   pool.used = 0;
1017 }
[...]
1096 ReadStream.prototype._read = function() {
[...]
1102   if (!pool || pool.length - pool.used < kMinPoolSpace) {
1103     // discard the old pool. Can't add to the free list because
1104     // users might have references to slices on it.
1105     pool = null;
1106     allocNewPool();
1107   }
[...]
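
The comment at lines 1103-1104 is why the pool is discarded rather than reused: Buffer slices share memory
with their parent. A tiny illustration of that (my own example, not from fs.js):

// Hypothetical illustration: a slice handed to a user shares memory with, and
// keeps alive, the entire parent pool Buffer, so the pool can't safely be reused.
var pool = new Buffer(40 * 1024);
var chunk = pool.slice(0, 100);    // like a chunk a ReadStream hands out
chunk[0] = 42;
console.log(pool[0]);              // 42: same underlying memory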

Line 38 sets the buffer size to 40k. Setting it to 1k improved performance (for this benchmark, on my system)
by about 30%.

Before, with kPoolSize @ 40k:

134986 Rate: count/sec: 5298.013245033113 Average rate: 4726.416072777918
138006 Rate: count/sec: 5298.013245033113 Average rate: 4738.92439459154
141026 Rate: count/sec: 5298.013245033113 Average rate: 4750.8969977167335
144046 Rate: count/sec: 0.0 Average rate: 4651.291948405371
147066 Rate: count/sec: 5298.013245033113 Average rate: 4664.572368868399
150086 Rate: count/sec: 5298.013245033113 Average rate: 4677.3183374865075
153106 Rate: count/sec: 5298.013245033113 Average rate: 4689.561480281635

It's averaging around 4700/sec (last column; the count/sec column is erratic due to how the vert.x
RateCounter is updated).

Now with 1k:

193475 Rate: count/sec: 5298.013245033113 Average rate: 6109.316449153637
196495 Rate: count/sec: 5298.013245033113 Average rate: 6096.847248021578
199515 Rate: count/sec: 5298.013245033113 Average rate: 6084.755532165502
202535 Rate: count/sec: 10596.026490066226 Average rate: 6152.023107117288
205555 Rate: count/sec: 5298.013245033113 Average rate: 6139.4760526379805
208575 Rate: count/sec: 5298.013245033113 Average rate: 6127.292340884574
211595 Rate: count/sec: 5298.013245033113 Average rate: 6115.456414376521

The average is now about 6100/sec.

Someone will need to look at fs.js carefully to see what the best way to improve it is (which may be a
different value for kPoolSize, or a different strategy altogether), assuming this is considered a priority
to fix based on real-world usage.
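
For reference, the 1k experiment described above amounts to a one-line change in lib/fs.js. This is a sketch
of the experiment only, not a recommended fix; the right value, or a different allocation strategy, is
exactly what needs that closer look:

// lib/fs.js (sketch of the experiment only; 40k is the shipped default)
var kPoolSize = 1 * 1024;    // was: 40 * 1024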

...

Some were wondering if the main issue was different FS I/O latency. That's not the case on my system (ZFS),
which can be easily confirmed using DTrace:

node.js:

# dtrace -n 'fbt::zfs_read:entry /pid == $1/ { self->start = timestamp; }
    fbt::zfs_read:return /self->start/ { @["ns"] = quantize(timestamp - self->start); self->start = 0; }' 22900
dtrace: description 'fbt::zfs_read:entry ' matched 2 probes
^C
  ns
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@@@@@@@@@@                      20410
            2048 |@@@@@@@@@@@@@@@                          16156
            4096 |@@@@@                                    5475
            8192 |                                         231
           16384 |                                         9
           32768 |                                         2
           65536 |                                         0

The output shows the distribution of FS read latency, in nanoseconds.

vert.x:

# dtrace -n 'fbt::zfs_read:entry /pid == $1/ { self->start = timestamp; }
    fbt::zfs_read:return /self->start/ { @["ns"] = quantize(timestamp - self->start); self->start = 0; }' 22900
dtrace: description 'fbt::zfs_read:entry ' matched 2 probes
^C
  ns
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@@@@@@@@@@                      56674
            2048 |@@@@@@@@@@@@@@@@                         48668
            4096 |@@@@                                     12820
            8192 |                                         555
           16384 |                                         21
           32768 |                                         2
           65536 |                                         0