Garbage collection with zero cost at non-GC time

Every once in a while I investigate low-level backend options for PLs, although so far I haven't actually written any such backend for my projects. Recently I've been looking at precise garbage collection in popular backends, and I've been (as on previous occasions) annoyed by the limitations and compromises.

I was compelled to think about a system which accommodates precise relocating GC as much as possible. In one extreme configuration, described in this note, there is zero cost at non-GC time: a program that never actually triggers a collection pays no runtime overhead for GC support.

# I happened to be looking at some of Cranelift's code, and I noticed that their constant-time dominates()
# check was using a somewhat more ad-hoc version of a hidden gem from the data structures literature called the
# parenthesis representation for trees. As far as I know, this was invented by Jacobson in his 1989 paper
# Space-Efficient Static Trees and Graphs. I first learned about it from the slightly later paper by Munro and Raman
# called Succinct Representations of Balanced Parentheses and Static Trees. I figured I'd give it an extremely
# quick intro and then show how it leads to a (slightly better) version of Cranelift's algorithm.
#
# This parenthesis representation of trees is surprisingly versatile, but its most striking feature is that
# it lets us query the ancestor relationship between two nodes in a tree in constant time, with a few instructions.
# And the idea is extremely simple and intuitive if you just draw the right kind of picture.
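To make that picture concrete, here is a minimal sketch (my names, not Cranelift's): one DFS assigns every node the positions of its opening and closing parentheses, and the ancestor query is then a two-comparison interval containment test.

#include <stdint.h>
#include <stdbool.h>

// Positions of a node's opening and closing parentheses, assigned by a single
// DFS: tick a counter and record it on entry ('(') and again on exit (')').
typedef struct { uint32_t open, close; } Paren;

// a is an ancestor of b (counting a node as its own ancestor) exactly when
// a's parentheses enclose b's.
static inline bool is_ancestor(Paren a, Paren b) {
    return a.open <= b.open && b.close <= a.close;
}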
// This can grow a Robin Hood linear probing hash table at near word-at-a-time memcpy speeds. If you're confused why I use 'keys'
// to describe the hash values, it's because my favorite perspective on Robin Hood (which I learned from Paul Khuong)
// is that it's just a sorted gap array which is MSB bucketed and insertion sorted per chain:
// https://pvk.ca/Blog/2019/09/29/a-couple-of-probabilistic-worst-case-bounds-for-robin-hood-linear-probing/
// The more widely known "max displacement" picture of Robin Hood hashing also has strengths since the max displacement
// can be stored very compactly. You can see a micro-optimized example of that here for small tables where the max displacement
// can fit in 4 bits: Sub-nanosecond Searches Using Vector Instructions, https://www.youtube.com/watch?v=paxIkKBzqBU
void grow(Table *table) {
    u64 exp = 64 - table->shift;
    // We grow the table downward in place by a factor of 2 (not counting the overflow area at table->end).
    // ...
}
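To illustrate what the sorted-order view buys during growth, here's a hedged sketch of the streaming rehash (simplified to copy into a fresh EMPTY-filled array with overflow slack at the end, rather than growing downward in place; EMPTY, old_cap, and new_shift are my names, not the gist's). As in the comment above, the keys are their own hash values.

#include <stdint.h>
#include <stddef.h>

#define EMPTY 0 // assumed sentinel for unoccupied slots

// Stream the old (hash-ordered) keys into the new table: each key goes to the
// later of its preferred slot and the slot after the previously placed key.
// One linear pass preserves sorted order and minimal displacement.
void rehash(const uint64_t *old_keys, size_t old_cap,
            uint64_t *new_keys, int new_shift) {
    uint64_t *dst = new_keys;
    for (size_t i = 0; i < old_cap; i++) {
        if (old_keys[i] == EMPTY) continue;                     // skip gaps
        uint64_t *pref = new_keys + (old_keys[i] >> new_shift); // preferred slot
        if (pref > dst) dst = pref;                             // leave a gap; never move a key earlier
        *dst++ = old_keys[i];
    }
}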
INLINE
u32 index(Ref r) {
    return r >> 4;
}

INLINE
u32 tag(Ref r) {
    return r & 15;
}

(Originally written as a reply to an HN submission of this article: https://www.cs.virginia.edu/~lat7h/blog/posts/434.html)

There's a simple recipe for arithmetically encoding recursive algebraic data types (in the functional programming sense) which is related to this.

What you might have seen is Goedel numbering, where a finite sequence of natural numbers a_0, a_1, ..., a_n (where n isn't fixed but can vary per sequence, with a_n > 0 so trailing zeros don't collide) is mapped injectively to p_0^a_0 p_1^a_1 ... p_n^a_n, where p_0, p_1, ... is an enumeration of the primes.

However, if you want to represent trees instead of sequences, you have a better, simpler option. The key is the existence of a bijective pairing function between N^2 and N, which you can write as <m, n> for m, n in N.

You have a lot of choices for how to construct the pairing function. But a curious fact is that there is essentially one polynomial pairing function, and it's the one you saw in class when you learned that the rationals are countable: https://en.wikipedia.org/wiki/Fueter%E2%80%93P%C3%B3lya_theorem
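Concretely, here's a sketch with my own names (the article doesn't fix any of these): the Cantor pairing function, its inverse, and the recipe applied to unlabeled binary trees, where a leaf encodes to 0 and a node to 1 + <encode(left), encode(right)>. Since <,> is a bijection between N^2 and N, this gives a bijection between trees and N.

#include <stdint.h>

// Cantor pairing <m, n> = (m + n)(m + n + 1)/2 + n, a bijection N^2 -> N.
uint64_t pair(uint64_t m, uint64_t n) {
    uint64_t s = m + n;
    return s * (s + 1) / 2 + n;
}

// Inverse: find the diagonal s with s(s+1)/2 <= z < (s+1)(s+2)/2.
void unpair(uint64_t z, uint64_t *m, uint64_t *n) {
    uint64_t s = 0;
    while ((s + 1) * (s + 2) / 2 <= z) s++; // linear scan; fine for a demo
    *n = z - s * (s + 1) / 2;
    *m = s - *n;
}

typedef struct Tree { struct Tree *left, *right; } Tree; // NULL = leaf

// Leaf |-> 0, Node(l, r) |-> 1 + <encode(l), encode(r)>.
uint64_t encode(const Tree *t) {
    return t ? 1 + pair(encode(t->left), encode(t->right)) : 0;
}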

#include <assert.h>
#include <stdint.h>
#include <tuple>
#include <vector>
#include <string>
typedef uint32_t Str;
std::vector<const char*> strs;
// Length-segregated string tables for length < 16. You use a separate overflow table for length >= 16.
// By segregating like this you can pack the string data in the table itself tightly without any padding. The datapath
// is uniform and efficient for all lengths < 16 by using unaligned 16-byte SIMD loads/compares and masking off the length prefix.
// One of the benefits of packing string data tightly for each length table is that you can afford to reduce the load factor
// on shorter length tables without hurting space utilization too much. This can push hole-in-one rates into the 95% range without
// too much of a negative impact on cache utilization.
// Since get() takes the key in a vector register, you want to shape the upstream code so the string to be queried
// is naturally in a vector. For example, an optimized identifier lexer should already have a SIMD fast path for lengths < 16.
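As a sketch of what the uniform datapath might look like (one plausible reading; the slot layout and names here are my assumptions, not the gist's): for the table segregated at length L, strings are packed at stride L with no padding, so a probe overreads with a 16-byte load and masks the compare down to the L bytes that belong to the slot. The table allocation needs 16 bytes of slack at the end for the overread to be safe.

#include <stdint.h>
#include <stdbool.h>
#include <immintrin.h>

// Compare one slot of a length-L table (L < 16) against the probe vector.
static inline bool slot_matches(__m128i probe, const char *slot, uint32_t L) {
    __m128i v = _mm_loadu_si128((const __m128i *)slot); // unaligned load; overreads past the slot
    uint32_t eq = (uint32_t)_mm_movemask_epi8(_mm_cmpeq_epi8(probe, v));
    uint32_t mask = (1u << L) - 1;                      // keep only this slot's L bytes
    return (eq & mask) == mask;
}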
// We have five kinds of nodes: literal, negate, not, add, xor.
// In this case we only need 3 bits for the tag but you can use as many as you need.
enum Tag {
LIT,
NEG,
NOT,
ADD,
XOR,
};
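A plausible constructor tying this to the index()/tag() accessors shown earlier (the Ref typedef and the exact packing are my assumptions, chosen to be consistent with those helpers):

typedef uint32_t Ref; // 28-bit node index in the high bits, 4-bit tag in the low bits

static inline Ref make_ref(uint32_t index, enum Tag t) {
    return (Ref)((index << 4) | (uint32_t)t);
}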
// The two sweet spots are 8-bit and 4-bit tags. In the latter case you can fit 14 32-bit keys in
// a cacheline-sized bucket and still have one leftover byte for metadata.
// As for how to choose tags for particular problems, Google's Swiss table and Facebook's F14 table
// both use hash bits that weren't used for the bucket indexing. This is ideal from an entropy perspective
// but it can be better to use some particular feature of the key that you'd otherwise check against anyway.
// For example, for 32-bit keys (with a usable sentinel value) you could use the 8 low bits as the tag
// while storing the remaining 24 bits in the rest of the bucket; that fits 16 keys per bucket. Or if the keys
// are strings you could store the length as the discriminator: with an 8-bit tag, 0 means an empty slot,
// 1..254 means a string of that length, and 255 means a string of length 255 or longer. With a 4-bit tag, 0 can
// similarly mean an empty slot, 1..14 a string of that length, and 15 a string of length 15 or longer.
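For instance, the 32-bit-key scheme above could lay out a bucket like this (a hypothetical sketch; the field names are mine): the low byte of each key doubles as the tag, the high 24 bits are stored alongside, and 16 tags plus 16 x 3 high bytes fill exactly one 64-byte cache line.

#include <stdint.h>

typedef struct {
    uint8_t tag[16];   // low 8 bits of each key; 0 doubles as the empty sentinel
    uint8_t hi[16][3]; // remaining 24 bits of each key
} Bucket;              // 16 + 48 = 64 bytes, one cache line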
// This is another take on the mark-compact collector from https://gist.github.com/pervognsen/7fe51bef8977cb249ac4c6f830f818a5
// To avoid having to do global compaction, our object indices will have two parts: a block index and a block offset.
// Within a block we have the same linear older-to-newer ordering by offset. But now blocks are allowed to have different ages.
// The block ages are defined by their position in a linked list: There's oldest_block and newest_block indices and then
// previous_block[block_index] for each block. This enables newest-to-oldest block iteration and the linked-list structure
// means that we can free an empty block by unlinking it. When a block is reused, it becomes the newest_block. Now, instead
// of only compacting within a block we will actually be coalescing across an age range of blocks. By doing so, we will usually
// be able to empty out entire blocks from the newer part of that age range, so they can be reused. This should have very similar
// performance characteristics to the original collector.
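A hedged sketch of the block age-list bookkeeping described above (sizes and names are my assumptions): unlinking an emptied block, and relinking a reused block as the newest.

#include <stdint.h>

enum { MAX_BLOCKS = 1024 }; // assumed capacity
#define NO_BLOCK UINT16_MAX

// Newest-to-oldest age list: previous_block points from each block to the next-older one.
static uint16_t oldest_block = NO_BLOCK;
static uint16_t newest_block = NO_BLOCK;
static uint16_t previous_block[MAX_BLOCKS];

// Unlink an emptied block. The caller passes the next-newer block (NO_BLOCK if
// `block` is the newest), which it knows because empty blocks are discovered
// while walking the list newest-to-oldest.
static void unlink_block(uint16_t block, uint16_t next_newer) {
    if (next_newer == NO_BLOCK)
        newest_block = previous_block[block];
    else
        previous_block[next_newer] = previous_block[block];
    if (oldest_block == block)
        oldest_block = next_newer;
}

// A reused block always re-enters the list as the newest block.
static void push_newest(uint16_t block) {
    previous_block[block] = newest_block;
    newest_block = block;
    if (oldest_block == NO_BLOCK)
        oldest_block = block;
}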