This is a small experiment to see whether one can:
- Lex a file efficiently, retaining line/column and indentation information,
- Consume little or no memory (aside from the input itself), and
- Still have the flexibility to perform zero-cost, SAX-style operations such as folds (counting tokens), doing nothing (a simple pass), or printing.
This proves that one could, e.g., run in ST and write to a mutable Storable vector, allowing the caller to process the set of tokens later, with the cost of computing each token's line/column/indentation already paid.
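As a rough illustration of that idea, here is a minimal sketch (not the lexer itself): a second pass runs in ST, writes each "word" start offset into a mutable unboxed array, and freezes it for the caller. The name wordStarts is mine, and I use an STUArray from the array package (which ships with GHC) in place of a Storable vector; the whitespace set is simplified.

```haskell
import Data.Array.ST (newArray, writeArray, runSTUArray)
import Data.Array.Unboxed (UArray, elems)
import qualified Data.ByteString.Char8 as S

-- Sketch: record the byte offset of each word start into a mutable
-- unboxed array inside ST, then freeze it for later processing.
wordStarts :: S.ByteString -> UArray Int Int
wordStarts bs = runSTUArray $ do
  arr <- newArray (0, n - 1) 0
  let go i k inWord
        | i >= S.length bs = return arr
        | isSpace (S.index bs i) = go (i + 1) k False
        | inWord = go (i + 1) k True
        | otherwise = writeArray arr k i >> go (i + 1) (k + 1) True
  go 0 0 False
  where
    -- First pass: count the words so we can size the array exactly.
    n = fst (S.foldl' step (0 :: Int, False) bs)
    step (k, inWord) c
      | isSpace c = (k, False)
      | inWord = (k, True)
      | otherwise = (k + 1, True)
    isSpace c = c == ' ' || c == '\n' || c == '\t' || c == '\r'

main :: IO ()
main = print (elems (wordStarts (S.pack "war and\npeace")))
-- prints [0,4,8]
```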
The input file is war-and-peace.txt, which is 6MB. Simply reading the file takes 27ms. Counting all words (non-space runs) in the file takes 36ms, so the counting itself costs about 9ms, in the absence of more rigorous gauge-based benchmarking. There are 1,132,619 "words" in the file.
chris@precision:~/Work/chrisdone/sandbox$ stack ghc -- -O2 hask-tok3.hs -o tok -rtsopts && ./tok ~/Work/chrisdone/writer/test/assets/war-and-peace.txt silent +RTS -s
[1 of 2] Compiling Zepto ( Zepto.hs, Zepto.o )
[2 of 2] Compiling Main ( hask-tok3.hs, hask-tok3.o )
Linking tok ...
6,655,592 bytes allocated in the heap
11,184 bytes copied during GC
46,720 bytes maximum residency (1 sample(s))
31,104 bytes maximum slop
9 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1 colls, 0 par 0.000s 0.000s 0.0000s 0.0000s
Gen 1 1 colls, 0 par 0.000s 0.000s 0.0001s 0.0001s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.027s ( 0.030s elapsed)
GC time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.027s ( 0.031s elapsed)
%GC time 0.3% (0.3% elapsed)
Alloc rate 250,210,225 bytes per MUT second
Productivity 99.4% of total user, 99.5% of total elapsed
Counting all words in the file:
chris@precision:~/Work/chrisdone/sandbox$ stack ghc -- -O2 hask-tok3.hs -o tok -rtsopts && ./tok ~/Work/chrisdone/writer/test/assets/war-and-peace.txt count +RTS -s
[2 of 2] Compiling Main ( hask-tok3.hs, hask-tok3.o )
Linking tok ...
1132619
6,665,680 bytes allocated in the heap
11,256 bytes copied during GC
55,016 bytes maximum residency (1 sample(s))
35,096 bytes maximum slop
9 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1 colls, 0 par 0.000s 0.000s 0.0000s 0.0000s
Gen 1 1 colls, 0 par 0.000s 0.000s 0.0001s 0.0001s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.035s ( 0.044s elapsed)
GC time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.036s ( 0.044s elapsed)
%GC time 0.4% (0.4% elapsed)
Alloc rate 189,279,872 bytes per MUT second
Productivity 99.0% of total user, 99.2% of total elapsed
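The word-counting side of this can be reproduced in miniature with a strict, SAX-style fold that consumes tokens as it finds them and never materializes a list. This is my own toy version, not the lexer from the article, and the whitespace set is simplified:

```haskell
import qualified Data.ByteString.Char8 as S

-- Count maximal non-space runs ("words") with a strict left fold.
-- SAX-style: each word is counted as it is encountered, never stored.
countWords :: S.ByteString -> Int
countWords = snd . S.foldl' step (False, 0)
  where
    step (inWord, n) c
      | isSpace c = (False, n)        -- leaving (or staying out of) a word
      | inWord = (inWord, n)          -- still inside the current word
      | otherwise = (True, n + 1)     -- entering a new word: count it
    isSpace c = c == ' ' || c == '\n' || c == '\t' || c == '\r'

main :: IO ()
main = print (countWords (S.pack "It was in July, 1805"))
-- prints 5
```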
Here is a tight loop in the fast file lexer. I'm just adding up all the integer values to make sure that the values produced by the couple parser are actually used.
simple_count :: P (State Int) ()
simple_count = do
  (Token { byteString
         , start = Point {line, column, indentation}
         , end = Point { line = line1
                       , column = column2
                       , indentation = indentation2
                       }
         }, end) <- couple
  lift
    (modify
       (+ (line + column + indentation + line1 + column2 + indentation2)))
  unless end simple_count
It runs with very good time and space characteristics (the file is 6MB, hence the 6MB allocation):
$ stack ghc -- -O2 hask-tok3.hs -o tok -rtsopts && ./tok ~/Work/chrisdone/writer/test/assets/war-and-peace.txt count +RTS -s
[2 of 2] Compiling Main ( hask-tok3.hs, hask-tok3.o )
Linking tok ...
13106856
6,665,720 bytes allocated in the heap
11,256 bytes copied during GC
55,016 bytes maximum residency (1 sample(s))
35,096 bytes maximum slop
9 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1 colls, 0 par 0.000s 0.000s 0.0000s 0.0000s
Gen 1 1 colls, 0 par 0.000s 0.000s 0.0001s 0.0001s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.028s ( 0.032s elapsed)
GC time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.028s ( 0.032s elapsed)
%GC time 0.3% (0.3% elapsed)
Alloc rate 237,146,719 bytes per MUT second
Productivity 99.5% of total user, 99.5% of total elapsed
We used constant memory to do our work and had 99.5% productivity. Just one garbage collection!
Which is expected, as the code compiles down to a few state variables:
$s$wsimple_count_rcRY
= \ (sc_scFc :: GHC.Prim.Int#)
(sc1_scFb :: GHC.Prim.Int#)
(sc2_scFa :: GHC.Prim.Int#)
(sc3_scF9 :: GHC.Prim.Int#)
(sc4_scF8 :: GHC.Prim.Int#)
(sc5_scF7 :: GHC.Prim.Int#)
(sc6_scF6 :: GHC.ForeignPtr.ForeignPtrContents)
(sc7_scF5 :: GHC.Prim.Addr#) ->
...
So the nested data structures I was using are a "for free" abstraction.
However, watch what happens if I put an INLINE pragma on the declaration:
$ stack ghc -- -O2 hask-tok3.hs -o tok -rtsopts && ./tok ~/Work/chrisdone/writer/test/assets/war-and-peace.txt count +RTS -s
[2 of 2] Compiling Main ( hask-tok3.hs, hask-tok3.o )
Linking tok ...
13106856
653,869,784 bytes allocated in the heap
319,353,904 bytes copied during GC
85,539,056 bytes maximum residency (7 sample(s))
1,447,176 bytes maximum slop
191 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 609 colls, 0 par 0.082s 0.089s 0.0001s 0.0010s
Gen 1 7 colls, 0 par 0.053s 0.080s 0.0114s 0.0355s
INIT time 0.000s ( 0.000s elapsed)
MUT time 0.103s ( 0.112s elapsed)
GC time 0.135s ( 0.169s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.238s ( 0.281s elapsed)
%GC time 56.7% (60.2% elapsed)
Alloc rate 6,343,261,939 bytes per MUT second
Productivity 43.3% of total user, 39.8% of total elapsed
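For reference, the change in question is nothing more than a pragma above the definition; something like this (hypothetical placement, assuming it goes on simple_count):

```haskell
{-# INLINE simple_count #-}  -- strongly encourages GHC to inline at call
                             -- sites; here it evidently disrupts the
                             -- allocation-free loop shown in the Core above
simple_count :: P (State Int) ()
```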
The moral: be careful, and measure your performance, as demonstrated here.