A very simple vanilla implementation was done in Javascript (main.js
), it might not be the simplest approach. The simplest approach would probably consists of a function that loads the full input file on memory (with the consequent impact it might have depending on the file size and system capacity). The implemented function uses the readline
read stream and Promises approach to resolve the result as early as possible, while the file is read line by line.
I've explored additional scenarios in Golang, providing reference implementations for different possible strategies. Main reason is that I'm pretty familiar with Golang benchmark tooling. These were the results:
BenchmarkLineByLine-8 4276 259729 ns/op 330681 B/op 1713 allocs/op
BenchmarkFullRead-8 4486 262854 ns/op 330681 B/op 1713 allocs/op
BenchmarkIndexed-8 27449694 43.70 ns/op 16 B/op 1 allocs/op
The BenchmarkLineByLine
replicates the same implemented logic from the Javascript sample.
The BenchmarkFullRead
is different to the previous one as it loads the full input file into memory before reading line-by-line. This is one of the fastest approaches and the memory usage is pretty much the same (see number of allocs
). The main disadvantage is that we're limited to the amount of memory that's available, which should be enough to temporarily store the file on memory. Could be a good option if we know what the file size looks like. It's slightly faster than the line by line approach.
The BenchmarkIndexed
follows a completely different approach. Even though the file is also completely loaded into memory like the full read approach, this is only done once in order to generate an index -all benchmarks are ran multiple times-. The index consists of a key (chromosome+position) while the reference is used as the value. For every lookup, we rely on a hashmap lookup allowing way faster reads. It's multiple times faster than any of the previous approaches.
To summarize there might be different possible optimizations depending on the constraints (speed requirements, memory requirements, etc.) and context of the application. In the context of this small experiment the full read or indexing approach might be the fastest available approaches despite the involved memory penalty they may present.