Can compiled Haskell programs benefit from any pointer compression schemes? The goal would be to lower memory residency significantly, while hopefully seeing an improvement in other runtime metrics. V8 successfully implemented a pointer compression scheme; the challenge for them seems to have been recovering the performance lost after the change, and after optimizing they observed single-digit improvements to CPU time.
In the analysis below I'm using this ghc-debug analysis on hasura, running huge_schema.
First...
Conservative estimate of size of heap pointers + info pointers:
1264 MB
...while profiling shows ~1400 MB residency. So it seems the heap is mostly, if not nearly all, pointers all the way down.
Some basic data on closures and pointers (taken individually):
Out of 50388864 total closures...
...34% have 1 pointer fields
...46% have 2 pointer fields
...7% have 3 pointer fields
...2% have 4 pointer fields
...7% have 5 pointer fields
...4% have 6 or more pointer fields
Out of 113246923 total heap pointers...
...15% of heap pointers are in objects with 1 pointers
...41% of heap pointers are in objects with 2 pointers
...10% of heap pointers are in objects with 3 pointers
...3% of heap pointers are in objects with 4 pointers
...16% of heap pointers are in objects with 5 pointers
...15% of heap pointers are in objects with 6 or more pointers
There's a bit of a long tail here (the widest 11% of closures take 31% of the space), but it seems we mostly care about small heap objects.
On its face this seems feasible, since the original full-pointer form that we have now can be recovered just by looking at the closure itself. A tag bit in the header might be used to indicate compressed or uncompressed (and perhaps also 32- vs. 16-bit width).
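As a rough sketch of what decoding might look like under such a scheme (this is not GHC's actual header layout; the tag-bit positions and names below are invented purely for illustration):

```haskell
import Data.Bits (testBit)
import Data.Int  (Int16, Int32)
import Data.Word (Word64)

-- Hypothetical: one spare header bit marks the closure as compressed,
-- a second picks 16- vs 32-bit offsets. Bit positions are made up.
data FieldForm = Full64 | Offset32 | Offset16

fieldForm :: Word64 -> FieldForm
fieldForm header
  | not (testBit header 0) = Full64     -- uncompressed closure
  | testBit header 1       = Offset32
  | otherwise              = Offset16

-- Recover a full pointer from a stored field, given the closure's address.
-- Offsets are signed; Word64 addition wraps, so sign extension works out.
decodeField :: FieldForm -> Word64 -> Word64 -> Word64
decodeField Full64   _    field = field
decodeField Offset32 base field = base + fromIntegral (fromIntegral field :: Int32)
decodeField Offset16 base field = base + fromIntegral (fromIntegral field :: Int16)
```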
We probably need to assume all pointers within a particular heap object must have the same width, so…
If all heap pointers in an object must have same width compressed (i.e. as offset),
... 0% of closures could use offsets of width 8
... 32% of closures could use offsets of width 16
... 24% of closures could use offsets of width 32
... 44% of closures must use offsets of width 64
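For reference, the classification behind these buckets amounts to something like the following (my own reconstruction, not the actual ghc-debug analysis code; offsets treated as signed two's-complement values):

```haskell
import Data.Int  (Int64)
import Data.Word (Word64)

data OffsetWidth = W8 | W16 | W32 | W64 deriving (Eq, Ord, Show)

-- Does a signed offset fit in n bits (two's complement)?
fitsIn :: Int -> Int64 -> Bool
fitsIn n off = o >= negate (2 ^ (n - 1)) && o < 2 ^ (n - 1)
  where o = toInteger off

-- Narrowest width that can represent every pointer field of a closure as a
-- signed offset from the closure's own address (all fields share one width).
closureWidth :: Word64 -> [Word64] -> OffsetWidth
closureWidth base fields
  | all (fitsIn 8)  offs = W8
  | all (fitsIn 16) offs = W16
  | all (fitsIn 32) offs = W32
  | otherwise            = W64
  where
    offs = [ fromIntegral field - fromIntegral base :: Int64 | field <- fields ]
```

E.g. `closureWidth 0x100000 [0x100200, 0x0fff00]` is `W16` (offsets +512 and -256).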
Digging into the raw data (below), and taking alignment into consideration, let's try to quantify the effects of different schemes. The numbers below reflect any required padding at the end of heap objects.
Total memory consumed by compressible (to 1-, 2-, or 4-byte offsets) closures (including info pointer), currently:
668 MB (53% of total)
Assuming info pointers stay 64 bits ...
With 16-bit compressed offsets this shrinks to: 568 MB
With 32-bit offsets: 490 MB
With a combination of the two: 485 MB
...and if we assume that we can stuff info pointers into 32 bits (while maintaining 64 bit alignment of heap objects):
With 16-bit compressed offsets this shrinks to: 440 MB
With 32-bit offsets: 405 MB
With a combination of the two: 315 MB <---- best case
So in the best case a 28% reduction in memory usage.
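That is, taking the 1264 MB conservative estimate above as the baseline:

```haskell
-- best-case saving, relative to the 1264 MB conservative estimate above
(668 - 315) / 1264  -- ≈ 0.28
```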
This is optimistic in that we are treating the closures with 6 or more pointer fields as if they had exactly 6. It is pessimistic in that we expect that, as we compress the heap, more closures will become compressible.
On our heap we see:
Out of 50388864 total closures...
...34% have 1 pointer fields
...46% have 2 pointer fields
Out of 113246923 total heap pointers...
...15% of heap pointers are in objects with 1 sibling pointers
...41% of heap pointers are in objects with 2 sibling pointers
For every 2.25 heap pointers we have an info pointer, so the space used by info pointers is quite significant.
If info pointers could be compressed to 32 bits, this could significantly complement a compression scheme for heap pointers in the body (even if we still must align heap objects to 64-bit boundaries, and possibly pad; see above). E.g. what would otherwise be a relatively common heap object like
[8-byte info pointer][2-byte offset][2-byte offset]
(two words, after padding) becomes
[4-byte info pointer][2-byte offset][2-byte offset]
and fits in one word instead of two.
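For illustration, here is the bit-level packing of that one-word form (a hypothetical layout, just to show the arithmetic; not a proposed RTS representation):

```haskell
import Data.Bits ((.|.), (.&.), shiftL, shiftR)
import Data.Word (Word16, Word32, Word64)

-- Pack a 32-bit (compressed) info pointer plus two 16-bit field offsets
-- (stored as raw two's-complement bit patterns) into one 64-bit word.
packClosure :: Word32 -> Word16 -> Word16 -> Word64
packClosure info f1 f2 =
      fromIntegral info
  .|. (fromIntegral f1 `shiftL` 32)
  .|. (fromIntegral f2 `shiftL` 48)

unpackClosure :: Word64 -> (Word32, Word16, Word16)
unpackClosure w =
  ( fromIntegral (w .&. 0xffffffff)
  , fromIntegral ((w `shiftR` 32) .&. 0xffff)
  , fromIntegral (w `shiftR` 48)
  )
```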
On the other hand, if there's no way (or no plan) to compress any of an object’s heap pointers then, due to alignment, I think there would be no benefit to compressing the info pointer.
We see the (dynamic?) linker places tables and code over the entire address space:
infotables + code range, in bits: 46.997
Is there a way to constrain this to a particular range, using a linker script or custom ld.so, or... (waves hands)?
Above we explored storing heap pointers as offsets relative to the originating heap object. But it occurred to me that there's not really a good reason to expect a heap object to be close to the child objects it has pointers to: my intuition is that (unless all closures contain a single pointer), as GC proceeds in the copying collector, scavenged objects are placed farther and farther away from their parent in to-space.
But I realized (in my mental model at least) scavenged sibling objects ought to end up adjacent to each other in the heap. So what about trying to compress all sibling pointers relative to each other?
I added this analysis, and it turns out my intuition about sibling objects being compacted together doesn't hold up at all: most require more than 14 bits to represent an offset between them. I even tried disabling parallel garbage collection, to see if that was the issue, and made sure to force major GC several times... but got the same result.
Likewise when measuring relative to the last pointer in a closure.
This seems surprising to me, and it would be great to be able to explain this at least.
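For reference, the sibling measurement I have in mind is roughly the following (a simplified sketch over a closure's raw pointer fields; the 6/14/30/64 buckets mirror the dump below, with offsets treated as signed):

```haskell
import Data.Int  (Int64)
import Data.Word (Word64)

-- Smallest of the dump's bucket widths that holds a signed offset.
bitsFor :: Int64 -> Int
bitsFor off = head [ n | n <- [6, 14, 30, 64]
                       , o >= negate (2 ^ (n - 1)), o < 2 ^ (n - 1) ]
  where o = toInteger off

-- Offsets between consecutive sibling objects pointed to by one closure,
-- i.e. each sibling's address measured relative to the previous sibling's.
siblingOffsetBits :: [Word64] -> [Int]
siblingOffsetBits fields =
  [ bitsFor (fromIntegral b - fromIntegral a)
  | (a, b) <- zip fields (drop 1 fields) ]
```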
Some other open questions:

- On paper, closures with pointers stored as offsets would seem to have the nice property that a contiguous subgraph can be copied to a new region of memory as a valid unit without fixing up the pointers; could this be used to optimize the evacuate/scavenge loop? (See the toy sketch after this list.)
- I assume most of the heap I'm looking at is resident in the oldest generation; how do these numbers compare for different parts of the garbage collection life cycle, and for other programs?
- I assume most of the pointers that are representable by 16-bit offsets are within the same block; is there a less naive way of representing offsets that takes into consideration some knowledge of blocks and where they live?
- If we make room in a block using pointer compression, how much of a “virtuous cycle” effect will we see, with more compression opportunities exposed?
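Regarding the first question above, here's a toy model of what I mean by copying without fixups (word-granularity offsets; nothing like the real closure layout):

```haskell
-- Toy heap: a flat list of words. A word that represents a pointer stores
-- the offset (in words) to its target, relative to its own index.
type Heap = [Int]

-- Follow a self-relative pointer stored at index i.
deref :: Heap -> Int -> Int
deref heap i = i + heap !! i

-- Copy a contiguous region [lo, lo+len) verbatim, with no pointer fixups.
copyRegion :: Heap -> Int -> Int -> Heap
copyRegion heap lo len = take len (drop lo heap)
```

If the word at index 5 points 3 words ahead (to index 8), then after copyRegion heap 4 6 that word sits at index 1 and still points 3 words ahead, to index 4, which is exactly the copy of its old target; as long as the whole subgraph lies inside the copied region, nothing needs fixing up.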
I looked at the block-to-block graph and was a little surprised by the high number of out-edges. I wonder if there are a small number of “hot” target blocks.
Out of 352619 blocks...
... 0% have edges to exactly 1 distinct other blocks
... 0% have edges to exactly 2 distinct other blocks
... 1% have edges to exactly 3 distinct other blocks
... 2% have edges to exactly 4 distinct other blocks
... 6% have edges to exactly 5 distinct other blocks
... 3% have edges to exactly 6 distinct other blocks
... 2% have edges to exactly 7 distinct other blocks
... 2% have edges to exactly 8 distinct other blocks
... 1% have edges to exactly 9 distinct other blocks
... 1% have edges to exactly 10 distinct other blocks
... 1% have edges to exactly 11 distinct other blocks
... 4% have edges to exactly 12 distinct other blocks
... 2% have edges to exactly 13 distinct other blocks
... 1% have edges to exactly 14 distinct other blocks
... 1% have edges to exactly 15 distinct other blocks
... 1% have edges to exactly 16 distinct other blocks
... 1% have edges to exactly 17 distinct other blocks
... 1% have edges to exactly 18 distinct other blocks
... 1% have edges to exactly 19 distinct other blocks
... 68% have edges to 20 or more other blocks
Most pointers point forwards in the address space (shrug):
Percentage of pointers representable as positive offset: 76
Have not read...
In V8:
misc:
- https://news.ycombinator.com/item?id=22220342 (discusses above)
- https://llvm.org/pubs/2005-06-12-MSP-PointerComp.pdf - 32-bit pointers, "macroscopic" for an entire data structure?
- https://github.com/oilshell/oil/wiki/Compact-AST-Representation - a nice collection of research notes
For my huge_schema heap:
========= Analysis ======================================
infotables + code range, in bits: 46.99734750754052
----------------------------------
Percentage of pointers representable as positive offset: 76
----------------------------------
Out of 113246923 total heap pointers...
...15% of heap pointers are in objects with 1 sibling pointers
...41% of heap pointers are in objects with 2 sibling pointers
...10% of heap pointers are in objects with 3 sibling pointers
...3% of heap pointers are in objects with 4 sibling pointers
...16% of heap pointers are in objects with 5 sibling pointers
...15% of heap pointers are in objects with 6 or more sibling pointers
----------------------------------
Out of 50388864 total closures...
...34% have 1 pointer fields
...46% have 2 pointer fields
...7% have 3 pointer fields
...2% have 4 pointer fields
...7% have 5 pointer fields
...4% have 6 or more pointer fields
----------------------------------
Conservative estimate of size of heap pointers + info pointers:
1264 MB
----------------------------------
If all heap pointers in an object must have same width compressed (i.e. as offset),
... 0% of closures could use offsets of width 8
... 32% of closures could use offsets of width 16
... 24% of closures could use offsets of width 32
... 44% of closures could use offsets of width 64
----------------------------------
Out of 352619 blocks...
... 0% have edges to exactly 1 distinct other blocks
... 0% have edges to exactly 2 distinct other blocks
... 1% have edges to exactly 3 distinct other blocks
... 2% have edges to exactly 4 distinct other blocks
... 6% have edges to exactly 5 distinct other blocks
... 3% have edges to exactly 6 distinct other blocks
... 2% have edges to exactly 7 distinct other blocks
... 2% have edges to exactly 8 distinct other blocks
... 1% have edges to exactly 9 distinct other blocks
... 1% have edges to exactly 10 distinct other blocks
... 1% have edges to exactly 11 distinct other blocks
... 4% have edges to exactly 12 distinct other blocks
... 2% have edges to exactly 13 distinct other blocks
... 1% have edges to exactly 14 distinct other blocks
... 1% have edges to exactly 15 distinct other blocks
... 1% have edges to exactly 16 distinct other blocks
... 1% have edges to exactly 17 distinct other blocks
... 1% have edges to exactly 18 distinct other blocks
... 1% have edges to exactly 19 distinct other blocks
... 68% have edges to 20 or more other blocks
-
offset bits buckets for first and last pointers:
-64,5220660 -30,5700677 -14,3280952 -6,270143 6,450223 14,18040789 30,6980271 64,10445149
-64,5220660 -30,5700677 -14,3280952 -6,270143 6,450223 14,18040789 30,6980272 64,10445148
-
count of heap pointers by (offset bits bucket, pointers in closure):
(-64,1),2072994 (-64,2),3499989 (-64,3),1250323 (-64,4),143811 (-64,5),671592 (-64,6),1004615 (-30,1),1904876 (-30,2),3937939 (-30,3),981790 (-30,4),405707 (-30,5),1198154 (-30,6),2581246 (-14,1),1045442 (-14,2),2752133 (-14,3),225390 (-14,4),333855 (-14,5),1672882 (-14,6),873397 (-6,1),73411 (-6,2),190813 (-6,3),9252 (-6,4),82464 (-6,5),1572 (-6,6),35882 (6,1),122370 (6,2),570572 (6,3),12200 (6,4),28980 (6,5),75774 (6,6),1068 (14,1),3628883 (14,2),25321422 (14,3),4643580 (14,4),851502 (14,5),6121834 (14,6),7508721 (30,1),2210526 (30,2),5144066 (30,3),1211565 (30,4),601420 (30,5),962642 (30,6),4277196 (64,1),6038417 (64,2),5088038 (64,3),2607356 (64,4),835237 (64,5),7248795 (64,6),1185230
-
count of closures by (offset bits reqd. for all fields, pointers in closure):
(6,1),195781 (6,2),27649 (6,3),32 (6,4),1 (6,6),1 (14,1),4674325 (14,2),10864244 (14,3),353339 (14,4),69804 (14,5),57870 (14,6),118235 (30,1),4115402 (30,2),5667822 (30,3),800243 (30,4),241340 (30,5),407888 (30,6),863629 (64,1),8111411 (64,2),6692771 (64,3),2493538 (64,4),509599 (64,5),3124891 (64,6),999049
(14,1),4674325+195781
(14,2),10864244+27649
(14,3),353339+32
(14,4),69804+1
(14,5),57870+0
(14,6),118235+1
(30,1),4115402+4674325+195781
(30,2),5667822+10864244+27649
(30,3),800243+353339+32
(30,4),241340+69804+1
(30,5),407888+57870+0
(30,6),863629+118235+1
16: 1 4870106 2 10891893 3 353371 4 69805 5 57870 6 118236
8*
((2*(4870106 + 10891893 + 353371 + 69805)) +
(3*( 57870 + 118236)))
263 MB (+ 305 MB)
8*
((1*(4870106 + 10891893)) +
(2*(353371 + 69805 + 57870 + 118236)))
135 MB (+ 305 MB)
32: (and 16 and 8): 1 8985508 2 16559715 3 1153614 4 311145 5 465758 6 981865
8 {- bytes -} *
((2*(8985508 + 16559715)) +
(3*(1153614 + 311145)) +
(4*(465758 + 981865)))
490 MB
8 {- bytes -} *
((1*(8985508)) +
(2*(16559715 + 1153614)) +
(3*(311145 + 465758)) +
(4 * 981865))
405 MB
32: not inclusive: (these are 305 MB uncompressed) 1 4115402 2 5667822 3 800243 4 241340 5 407888 6 863629
8*
((2*(4115402 + 5667822)) +
(3*(800243 + 241340)) +
(4*(407888 + 863629)))
222 MB
8 * (
(1*4115402) +
(2*(5667822 + 800243)) +
(3*(241340 + 407888)) +
(4*863629))
180 MB