To try this out:
wget https://www.fec.gov/files/bulk-downloads/2018/indiv18.zip
unzip indiv18.zip
node find-most-frequent-name.js itcont.txt
The interesting thing is that while other solutions to the problem require increasing NodeJs' memory limit, we can even restrict the limit and get by with less memory:
node --max-old-space-size=300 find-most-frequent-name.js itcont.txt
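If you want to watch the memory consumption yourself, Node's built-in `process.memoryUsage()` can report the heap size while the file is being processed. Here is a small sketch of such a helper (it is not part of find-most-frequent-name.js, and the once-per-million-lines threshold is just made up for illustration):

```js
// Hypothetical helper, not part of the actual script: report heap usage
// every million lines so you can watch memory stay flat while streaming.
let lineCount = 0;

function reportMemory() {
  lineCount++;
  if (lineCount % 1000000 === 0) {
    const heapMb = Math.round(process.memoryUsage().heapUsed / 1024 / 1024);
    console.error(`processed ${lineCount} lines, heap used: ~${heapMb} MB`);
  }
}
```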
Also worth noting is that no JavaScript dependencies are needed to achieve this. We don't even need to copy or reimplement anything from a library – NodeJs simply has everything needed on board.
I got the idea for writing this example after reading through the code in this blogpost. The author notes that only one of her solutions (using the EventStream library) could handle the large file without running out of memory. The way she used Node's readline module seemed a bit suspicious to me (for example, there is an unused output stream whose purpose I couldn't figure out), so I thought I would try to show that pure NodeJs can handle that data really well.
Giving the problem of finding the most common first name a fresh look, I decided to go directly for a solution which aggregates name counts as the data is read. This of course uses much less memory than the original program. (In her defense, the original problem statement actually demands that we "read all names into an array", so the bad performance is caused by an overspecification of the problem. Not so uncommon in the field, haha!)
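For illustration, here is a minimal sketch of that aggregating approach (not the actual find-most-frequent-name.js). It assumes the contributor name is the eighth pipe-separated column of itcont.txt, formatted as "LAST, FIRST MIDDLE"; adjust the parsing if your data differs:

```js
// Minimal sketch of the streaming approach, assuming the name is the eighth
// pipe-separated column and is formatted "LAST, FIRST MIDDLE".
const fs = require('fs');
const readline = require('readline');

async function mostFrequentFirstName(path) {
  const rl = readline.createInterface({
    input: fs.createReadStream(path), // only an input stream is needed, no output
    crlfDelay: Infinity,
  });

  const counts = new Map(); // first name -> count; only distinct first names stay in memory

  for await (const line of rl) {
    const name = line.split('|')[7]; // e.g. "SMITH, JOHN A"
    if (!name) continue;
    const parts = name.split(', ');
    if (parts.length < 2) continue;
    const firstName = parts[1].split(' ')[0];
    counts.set(firstName, (counts.get(firstName) || 0) + 1);
  }

  // Pick the winner without ever holding all 19 million names in memory.
  let best = { firstName: null, count: 0 };
  for (const [firstName, count] of counts) {
    if (count > best.count) best = { firstName, count };
  }
  return best;
}

mostFrequentFirstName(process.argv[2]).then(({ firstName, count }) =>
  console.log(`Most common first name: ${firstName} (${count} occurrences)`)
);
```

The key point is that each line is processed and then discarded; only the map of distinct first names grows, and that stays tiny compared to the file itself.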
Also in defense of Node's readline: even though that solution crashes even on a reduced data set of only 5 million entries, the EventStream-based solution also crashes with the full data set of 19 million entries, and in fact already with just 10 million.
So the takeaways are:
- NodeJs has great built-in support for processing data way bigger than the available RAM (which can translate into real savings if you run in the cloud and can get away with smaller, cheaper RAM allocations)
- more than the technology, rephrasing the problem can lead to a better solution (that of course requires an open discussion with whoever stated the problem, and depending on the circumstances the problem might not be rephrasable after all)
My fork of Paige's repository contains some hints on how to easily reduce the size of the data set, plus fixes so the programs can deal with that stripped-down data.