Partial redundancy elimination (PRE) is an optimization where you eliminate computations that are redundant on some but not all control flow paths.
Example:
if (...) {
    // ...
    int a = x + y;
    // ...
} else {
    // ...
}
// ...
int b = x + y; // partially redundant: recomputed even on the path where the then-branch already computed x + y
void gen_mul(Value *dest, Value *src) {
    if (isconst(dest)) {
        gen_swap(dest, src); // this doesn't generate instructions and just swaps descriptors ("value renaming")
    }
    val_to_reg(dest); // the post-condition is that dest is allocated to a register
    if (isconst(src) && isimm32(src->ival)) {
        if (src->ival == 0) {
            int_to_val(dest, 0);
        } else if (src->ival == 1) {
            // do nothing
typedef enum {
    CMD_INT = NODE_INT,
    CMD_NEG = NODE_NEG,
    CMD_NOT = NODE_NOT,
    CMD_ADD = NODE_ADD,
    CMD_SUB = NODE_SUB,
    CMD_RET = 128,
    CMD_SETREF,
    CMD_GETREF,
} Cmd;
# Here's a probably-not-new data structure I discovered after implementing weight-balanced trees with
# amortized subtree rebuilds (e.g. http://jeffe.cs.illinois.edu/teaching/algorithms/notes/10-scapegoat-splay.pdf)
# and realizing it was silly to turn a subtree into a sorted array with an in-order traversal only as an
# aid in constructing a balanced subtree in linear time. Why not replace the subtree by the sorted array and
# use binary search when hitting that leaf array? Then you'd defer any splitting of the array until inserts and
# deletes hit that leaf array. Only in hindsight did I realize this is similar to the rope data structure for strings.
# Unlike ropes, it's a key-value search tree for arbitrary ordered keys.
#
# The main attraction of this data structure is its simplicity (on par with a standard weight-balanced tree) and that it
# coalesces subtrees into contiguous arrays, which reduces memory overhead and boosts the performance of in-order traversals
// Heavily based on ideas from https://github.com/LuaJIT/LuaJIT/blob/v2.1/src/lj_opt_fold.c
// The most fundamental deviation is that I eschew the big hash table and the lj_opt_fold()
// trampoline for direct tail calls. The biggest problem with a trampoline is that you lose
// the control flow context. Another problem is that there's too much short-term round-tripping
// of data through memory. It's also easier to do ad-hoc sharing between rules with my approach.
// From what I can tell, it also isn't possible to do general reassociation with LJ's fold engine
// since that requires non-tail recursion, so LJ does cases like (x + n1) + n2 => x + (n1 + n2)
// but not (x + n1) + (y + n2) => x + (y + (n1 + n2)) which is common in address generation. The
// code below has some not-so-obvious micro-optimizations for register passing and calling conventions,
// e.g. the unary_cse/binary_cse parameter order, the use of long fields in ValueRef.
ValueRef add(ValueRef left, ValueRef right) {
    // Canonical commutation: x + y => y + x (constants have negative pos and move to the left)
    if (left.pos > right.pos) SWAP(left, right);
    switch (left.op) {
    case NEG:
        // Strength reduction: (-x) + y => y - x
        return sub(right, getleft(left));
    case NUM:
        // Strength reduction: 0 + x => x
        if (getnum(left) == 0) return right;
struct Object {
    Key key; // The key is any piece of data that uniquely identifies the object.
    // ...
};
struct Handle {
    Key key;
    Index index; // This caches a speculative table index for an object with the corresponding key.
};
// Linear-scan mark-compact collector for causal data structures, i.e. the reference graph is acyclic and the objects
// are ordered by age in the heap (which happens if you use a linear allocator) so they are automatically topologically sorted.
// This algorithm has very high memory-level parallelism which can be exploited by modern out-of-order processors, and should
// be memory throughput limited.
void collect(void) {
    // Initialize marks from roots.
    memset(marks, 0, num_nodes * sizeof(uint32_t));
    int newest = 0, oldest = num_nodes;
    for (int i = 0; i < num_roots; i++) {
        marks[roots[i]] = 1;
// This is another take on the mark-compact collector from https://gist.github.com/pervognsen/7fe51bef8977cb249ac4c6f830f818a5
// To avoid having to do global compaction, our object indices will have two parts: a block index and a block offset.
// Within a block we have the same linear older-to-newer ordering by offset. But now blocks are allowed to have different ages.
// The block ages are defined by their position in a linked list: There's oldest_block and newest_block indices and then
// previous_block[block_index] for each block. This enables newest-to-oldest block iteration and the linked-list structure
// means that we can free an empty block by unlinking it. When a block is reused, it becomes the newest_block. Now, instead
// of only compacting within a block we will actually be coalescing across an age range of blocks. By doing so, we will usually
// be able to empty out entire blocks from the newer part of that age range, so they can be reused. This should have very similar
// performance characteristics.
// The two sweetspots are 8-bit and 4-bit tags. In the latter case you can fit 14 32-bit keys in
// a cacheline-sized bucket and still have one leftover byte for metadata.
// As for how to choose tags for particular problems, Google's Swiss table and Facebook's F14 table
// both use hash bits that weren't used for the bucket indexing. This is ideal from an entropy perspective
// but it can be better to use some particular feature of the key that you'd otherwise check against anyway.
// For example, for 32-bit keys (with a usable sentinel value) you could use the 8 low bits as the tag
// while storing the remaining 24 bits in the rest of the bucket; that fits 16 keys per bucket. Or if the keys
// are strings you could store the length as the discriminator: with an 8-bit tag, 0 means an empty slot,
// 1..254 means a string of that length, and 255 means a string of length 255 or longer. With a 4-bit tag