Can compiled Haskell programs benefit from any pointer compression schemes? The goal would be to lower memory residency significantly, while hopefully seeing an improvement in other runtime metrics. V8 successfully implemented a pointer compression scheme; the challenge for them seems to have been recovering the performance lost after the change, and after optimizing they observed single-digit improvements to CPU time.
In the analysis below I'm using this ghc-debug analysis on hasura, running huge_schema.
First...
Conservative estimate of size of heap pointers + info pointers:
1264 MB
...while profiling shows ~1400 MB residency. So it seems the heap is mostly, if not nearly all, pointers all the way down.
Some basic data on closures and pointers (taken individually):
Out of 50388864 total closures...
...34% have 1 pointer fields
...46% have 2 pointer fields
...7% have 3 pointer fields
...2% have 4 pointer fields
...7% have 5 pointer fields
...4% have 6 or more pointer fields
Out of 113246923 total heap pointers...
...15% of heap pointers are in objects with 1 pointers
...41% of heap pointers are in objects with 2 pointers
...10% of heap pointers are in objects with 3 pointers
...3% of heap pointers are in objects with 4 pointers
...16% of heap pointers are in objects with 5 pointers
...15% of heap pointers are in objects with 6 or more pointers
There's a bit of a long tail here (the widest 11% of closures take 31% of the space), but it seems we mostly care about small heap objects.
On its face this seems feasible, since the original full-pointer form that we have now can be recovered just by looking at the closure itself. A tag bit in the header might be used to indicate compressed or uncompressed (and perhaps also 32- vs. 16-bit width).
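As a rough sketch of what decoding might look like under such a scheme (this is not GHC's actual header layout; the tag-bit positions and names below are invented purely for illustration):

```haskell
import Data.Bits (testBit)
import Data.Int  (Int16, Int32)
import Data.Word (Word64)

-- Hypothetical: one spare header bit marks the closure as compressed,
-- a second picks 16- vs 32-bit offsets. Bit positions are made up.
data FieldForm = Full64 | Offset32 | Offset16

fieldForm :: Word64 -> FieldForm
fieldForm header
  | not (testBit header 0) = Full64     -- uncompressed closure
  | testBit header 1       = Offset32
  | otherwise              = Offset16

-- Recover a full pointer from a stored field, given the closure's address.
-- Offsets are signed; Word64 addition wraps, so sign extension works out.
decodeField :: FieldForm -> Word64 -> Word64 -> Word64
decodeField Full64   _    field = field
decodeField Offset32 base field = base + fromIntegral (fromIntegral field :: Int32)
decodeField Offset16 base field = base + fromIntegral (fromIntegral field :: Int16)
```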
We probably need to assume all pointers within a particular heap object must have the same width, so…
If all heap pointers in an object must have same width compressed (i.e. as offset),
... 0% of closures could use offsets of width 8
... 32% of closures could use offsets of width 16
... 24% of closures could use offsets of width 32
... 44% of closures must use offsets of width 64
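For reference, the classification behind these buckets amounts to something like the following (my own reconstruction, not the actual ghc-debug analysis code; offsets treated as signed two's-complement values):

```haskell
import Data.Int  (Int64)
import Data.Word (Word64)

data OffsetWidth = W8 | W16 | W32 | W64 deriving (Eq, Ord, Show)

-- Does a signed offset fit in n bits (two's complement)?
fitsIn :: Int -> Int64 -> Bool
fitsIn n off = o >= negate (2 ^ (n - 1)) && o < 2 ^ (n - 1)
  where o = toInteger off

-- Narrowest width that can represent every pointer field of a closure as a
-- signed offset from the closure's own address (all fields share one width).
closureWidth :: Word64 -> [Word64] -> OffsetWidth
closureWidth base fields
  | all (fitsIn 8)  offs = W8
  | all (fitsIn 16) offs = W16
  | all (fitsIn 32) offs = W32
  | otherwise            = W64
  where
    offs = [ fromIntegral field - fromIntegral base :: Int64 | field <- fields ]
```

E.g. `closureWidth 0x100000 [0x100200, 0x0fff00]` is `W16` (offsets +512 and -256).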
Digging into the raw data (below), and taking alignment into consideration, let's try to quantify the effects of different schemes. The numbers below reflect any required padding at the end of heap objects.
Total memory consumed by compressible (to 1-, 2-, or 4-byte offsets) closures (including info pointer), currently:
668 MB (53% of total)
Assuming info pointers stay 64 bits ...
With 16-bit compressed offsets this shrinks to: 568 MB
With 32-bit offsets: 490 MB
With a combination of the two: 485 MB
...and if we assume that we can stuff info pointers into 32 bits (while maintaining 64 bit alignment of heap objects):
With 16-bit compressed offsets this shrinks to: 440 MB
With 32-bit offsets: 405 MB
With a combination of the two: 315 MB <---- best case
So in the best case a 28% reduction in memory usage.
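That is, taking the 1264 MB conservative estimate above as the baseline:

```haskell
-- best-case saving, relative to the 1264 MB conservative estimate above
(668 - 315) / 1264  -- ≈ 0.28
```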
This is optimistic in that we are treating the closures with 6 or more pointer fields as if they had exactly 6. It is pessimistic in that we expect that, as we compress the heap, more closures will become compressible.
On our heap we see:
Out of 50388864 total closures...
...34% have 1 pointer fields
...46% have 2 pointer fields
Out of 113246923 total heap pointers...
...15% of heap pointers are in objects with 1 sibling pointers
...41% of heap pointers are in objects with 2 sibling pointers
For every 2.25 heap pointers we have an info pointer, so the space used by info pointers is quite significant.
If info pointers could be compressed to 32 bits, this could significantly complement a compression scheme for heap pointers in the body (even if we still must align heap objects to 64-bit boundaries, and possibly pad; see above). E.g. what would otherwise be a relatively common heap object like
[8-byte info pointer][2-byte offset][2-byte offset]
(two words, after padding) becomes
[4-byte info pointer][2-byte offset][2-byte offset]
and fits in one word instead of two.
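For illustration, here is the bit-level packing of that one-word form (a hypothetical layout, just to show the arithmetic; not a proposed RTS representation):

```haskell
import Data.Bits ((.|.), (.&.), shiftL, shiftR)
import Data.Word (Word16, Word32, Word64)

-- Pack a 32-bit (compressed) info pointer plus two 16-bit field offsets
-- (stored as raw two's-complement bit patterns) into one 64-bit word.
packClosure :: Word32 -> Word16 -> Word16 -> Word64
packClosure info f1 f2 =
      fromIntegral info
  .|. (fromIntegral f1 `shiftL` 32)
  .|. (fromIntegral f2 `shiftL` 48)

unpackClosure :: Word64 -> (Word32, Word16, Word16)
unpackClosure w =
  ( fromIntegral (w .&. 0xffffffff)
  , fromIntegral ((w `shiftR` 32) .&. 0xffff)
  , fromIntegral (w `shiftR` 48)
  )
```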
On the other hand, if there's no way (or no plan) to compress any of an object’s heap pointers then, due to alignment, I think there would be no benefit to compressing the info pointer.
We see the (dynamic?) linker places tables and code over the entire address space:
infotables + code range, in bits: 46.997
Is there a way to constrain this to a particular range, using a linker script or custom ld.so, or... (waves hands)?
Above we explored storing heap pointers as offsets relative to the originating heap object. But it occurred to me that there's not really a good reason to expect a heap object to be close to the child objects it has pointers to: my intuition is that (unless all closures contain a single pointer), as GC proceeds in the copying collector, scavenged objects are placed farther and farther away from their parent in to-space.
But I realized (in my mental model at least) scavenged sibling objects ought to end up adjacent to each other in the heap. So what about trying to compress all sibling pointers relative to each other?
I added this analysis, and it turns out my intuition about sibling objects being compacted together doesn't hold up at all: most require more than 14 bits to represent an offset between them. I even tried disabling parallel garbage collection, to see if that was the issue, and made sure to force major GC several times... but got the same result.
Likewise when measuring relative to the last pointer in a closure.
This seems surprising to me, and it would be great to be able to explain this at least.
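For reference, the sibling measurement I have in mind is roughly the following (a simplified sketch over a closure's raw pointer fields; the 6/14/30/64 buckets mirror the dump below, with offsets treated as signed):

```haskell
import Data.Int  (Int64)
import Data.Word (Word64)

-- Smallest of the dump's bucket widths that holds a signed offset.
bitsFor :: Int64 -> Int
bitsFor off = head [ n | n <- [6, 14, 30, 64]
                       , o >= negate (2 ^ (n - 1)), o < 2 ^ (n - 1) ]
  where o = toInteger off

-- Offsets between consecutive sibling objects pointed to by one closure,
-- i.e. each sibling's address measured relative to the previous sibling's.
siblingOffsetBits :: [Word64] -> [Int]
siblingOffsetBits fields =
  [ bitsFor (fromIntegral b - fromIntegral a)
  | (a, b) <- zip fields (drop 1 fields) ]
```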
Some other open questions:

- On paper, closures with pointers stored as offsets would seem to have the nice property that a contiguous subgraph can be copied to a new region of memory as a valid unit without fixing up the pointers; could this be used to optimize the evacuate/scavenge loop? (See the toy sketch after this list.)
- I assume most of the heap I'm looking at is resident in the oldest generation; how do these numbers compare for different parts of the garbage collection life cycle, and for other programs?
- I assume most of the pointers that are representable by 16-bit offsets are within the same block; is there a less naive way of representing offsets that takes into consideration some knowledge of blocks and where they live?
- If we make room in a block using pointer compression, how much of a “virtuous cycle” effect will we see, with more compression opportunities exposed?
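Regarding the first question above, here's a toy model of what I mean by copying without fixups (word-granularity offsets; nothing like the real closure layout):

```haskell
-- Toy heap: a flat list of words. A word that represents a pointer stores
-- the offset (in words) to its target, relative to its own index.
type Heap = [Int]

-- Follow a self-relative pointer stored at index i.
deref :: Heap -> Int -> Int
deref heap i = i + heap !! i

-- Copy a contiguous region [lo, lo+len) verbatim, with no pointer fixups.
copyRegion :: Heap -> Int -> Int -> Heap
copyRegion heap lo len = take len (drop lo heap)
```

If the word at index 5 points 3 words ahead (to index 8), then after copyRegion heap 4 6 that word sits at index 1 and still points 3 words ahead, to index 4, which is exactly the copy of its old target; as long as the whole subgraph lies inside the copied region, nothing needs fixing up.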
I looked at the block-to-block graph and was a little surprised by the high number of out-edges. I wonder if there are a small number of “hot” target blocks.
Out of 352619 blocks...
... 0% have edges to exactly 1 distinct other blocks
... 0% have edges to exactly 2 distinct other blocks
... 1% have edges to exactly 3 distinct other blocks
... 2% have edges to exactly 4 distinct other blocks
... 6% have edges to exactly 5 distinct other blocks
... 3% have edges to exactly 6 distinct other blocks
... 2% have edges to exactly 7 distinct other blocks
... 2% have edges to exactly 8 distinct other blocks
... 1% have edges to exactly 9 distinct other blocks
... 1% have edges to exactly 10 distinct other blocks
... 1% have edges to exactly 11 distinct other blocks
... 4% have edges to exactly 12 distinct other blocks
... 2% have edges to exactly 13 distinct other blocks
... 1% have edges to exactly 14 distinct other blocks
... 1% have edges to exactly 15 distinct other blocks
... 1% have edges to exactly 16 distinct other blocks
... 1% have edges to exactly 17 distinct other blocks
... 1% have edges to exactly 18 distinct other blocks
... 1% have edges to exactly 19 distinct other blocks
... 68% have edges to 20 or more other blocks
Most pointers point forwards in the address space (shrug):
Percentage of pointers representable as positive offset: 76
Have not read...
In V8:
misc:
- https://news.ycombinator.com/item?id=22220342 (discusses above)
- https://llvm.org/pubs/2005-06-12-MSP-PointerComp.pdf - 32-bit pointers, "macroscopic" for an entire data structure?
- https://github.com/oilshell/oil/wiki/Compact-AST-Representation - a nice collection of research notes
For my huge_schema heap:
========= Analysis ======================================
infotables + code range, in bits: 46.99734750754052
----------------------------------
Percentage of pointers representable as positive offset: 76
----------------------------------
Out of 113246923 total heap pointers...
...15% of heap pointers are in objects with 1 sibling pointers
...41% of heap pointers are in objects with 2 sibling pointers
...10% of heap pointers are in objects with 3 sibling pointers
...3% of heap pointers are in objects with 4 sibling pointers
...16% of heap pointers are in objects with 5 sibling pointers
...15% of heap pointers are in objects with 6 or more sibling pointers
----------------------------------
Out of 50388864 total closures...
...34% have 1 pointer fields
...46% have 2 pointer fields
...7% have 3 pointer fields
...2% have 4 pointer fields
...7% have 5 pointer fields
...4% have 6 or more pointer fields
----------------------------------
Conservative estimate of size of heap pointers + info pointers:
1264 MB
----------------------------------
If all heap pointers in an object must have same width compressed (i.e. as offset),
... 0% of closures could use offsets of width 8
... 32% of closures could use offsets of width 16
... 24% of closures could use offsets of width 32
... 44% of closures could use offsets of width 64
----------------------------------
Out of 352619 blocks...
... 0% have edges to exactly 1 distinct other blocks
... 0% have edges to exactly 2 distinct other blocks
... 1% have edges to exactly 3 distinct other blocks
... 2% have edges to exactly 4 distinct other blocks
... 6% have edges to exactly 5 distinct other blocks
... 3% have edges to exactly 6 distinct other blocks
... 2% have edges to exactly 7 distinct other blocks
... 2% have edges to exactly 8 distinct other blocks
... 1% have edges to exactly 9 distinct other blocks
... 1% have edges to exactly 10 distinct other blocks
... 1% have edges to exactly 11 distinct other blocks
... 4% have edges to exactly 12 distinct other blocks
... 2% have edges to exactly 13 distinct other blocks
... 1% have edges to exactly 14 distinct other blocks
... 1% have edges to exactly 15 distinct other blocks
... 1% have edges to exactly 16 distinct other blocks
... 1% have edges to exactly 17 distinct other blocks
... 1% have edges to exactly 18 distinct other blocks
... 1% have edges to exactly 19 distinct other blocks
... 68% have edges to 20 or more other blocks
-
offset bits buckets for first and last pointers:
-64,5220660 -30,5700677 -14,3280952 -6,270143 6,450223 14,18040789 30,6980271 64,10445149
-64,5220660 -30,5700677 -14,3280952 -6,270143 6,450223 14,18040789 30,6980272 64,10445148
-
count of heap pointers by (offset bits bucket, pointers in closure):
(-64,1),2072994 (-64,2),3499989 (-64,3),1250323 (-64,4),143811 (-64,5),671592 (-64,6),1004615 (-30,1),1904876 (-30,2),3937939 (-30,3),981790 (-30,4),405707 (-30,5),1198154 (-30,6),2581246 (-14,1),1045442 (-14,2),2752133 (-14,3),225390 (-14,4),333855 (-14,5),1672882 (-14,6),873397 (-6,1),73411 (-6,2),190813 (-6,3),9252 (-6,4),82464 (-6,5),1572 (-6,6),35882 (6,1),122370 (6,2),570572 (6,3),12200 (6,4),28980 (6,5),75774 (6,6),1068 (14,1),3628883 (14,2),25321422 (14,3),4643580 (14,4),851502 (14,5),6121834 (14,6),7508721 (30,1),2210526 (30,2),5144066 (30,3),1211565 (30,4),601420 (30,5),962642 (30,6),4277196 (64,1),6038417 (64,2),5088038 (64,3),2607356 (64,4),835237 (64,5),7248795 (64,6),1185230
-
count of closures by (offset bits reqd. for all fields, pointers in closure):
(6,1),195781 (6,2),27649 (6,3),32 (6,4),1 (6,6),1 (14,1),4674325 (14,2),10864244 (14,3),353339 (14,4),69804 (14,5),57870 (14,6),118235 (30,1),4115402 (30,2),5667822 (30,3),800243 (30,4),241340 (30,5),407888 (30,6),863629 (64,1),8111411 (64,2),6692771 (64,3),2493538 (64,4),509599 (64,5),3124891 (64,6),999049
(14,1),4674325+195781
(14,2),10864244+27649
(14,3),353339+32
(14,4),69804+1
(14,5),57870+0
(14,6),118235+1
(30,1),4115402+4674325+195781
(30,2),5667822+10864244+27649
(30,3),800243+353339+32
(30,4),241340+69804+1
(30,5),407888+57870+0
(30,6),863629+118235+1
16: 1 4870106 2 10891893 3 353371 4 69805 5 57870 6 118236
8*
((2*(4870106 + 10891893 + 353371 + 69805)) +
(3*( 57870 + 118236)))
263 MB (+ 305 MB)
8*
((1*(4870106 + 10891893)) +
(2*(353371 + 69805 + 57870 + 118236)))
135 MB (+ 305 MB)
32: (and 16 and 8): 1 8985508 2 16559715 3 1153614 4 311145 5 465758 6 981865
8 {- bytes -} *
((2*(8985508 + 16559715)) +
(3*(1153614 + 311145)) +
(4*(465758 + 981865)))
490 MB
8 {- bytes -} *
((1*(8985508)) +
(2*(16559715 + 1153614)) +
(3*(311145 + 465758)) +
(4 * 981865))
405 MB
32: not inclusive: (these are 305 MB uncompressed) 1 4115402 2 5667822 3 800243 4 241340 5 407888 6 863629
8*
((2*(4115402 + 5667822)) +
(3*(800243 + 241340)) +
(4*(407888 + 863629)))
222 MB
8 * (
(1*4115402) +
(2*(5667822 + 800243)) +
(3*(241340 + 407888)) +
(4*863629))
180 MB