For the past two weeks or so, I've been working on a little compiler project in C, mostly for educational purposes, i.e. to understand how a compiler really works. I'm not using any libraries, other than the C runtime library.
I have a hand-written lexer and parser, and a simple code generator targeting the Nintendo Game Boy Advance. I don't support named variables yet, but I do have pointer dereference and assignment, so I can write programs like this:
# Test Program: Collatz Conjecture
0x02000000.* = 27 # Starting number
0x02000004.* = 0  # Step counter
loop {
    if 0x02000000.* & 1 {
        0x02000000.* *= 3
        0x02000000.* += 1
    } else {
        0x02000000.* = 0x02000000.* >> 1
    }
    0x02000004.* += 1
} while 0x02000000.* - 1
I say I'm targeting the GBA, rather than "the ARMv4T instruction set", because I'm only generating valid ROM images, not any other type of executable file. I'm also assuming the absence of memory protection features, and the GBA's memory map - address 0x02000000 is where EWRAM (external work RAM) is mapped.
As you can see from the snippet above, I have loops and if statements for control flow, but no functions or type checking. Those features are planned, of course, but I wanted to take a step back and talk about what I've learned so far.
I've been primarily referring to two sources on compiler construction: Crafting Interpreters by Bob Nystrom, and Programming an x64 compiler from scratch by Per Vognsen.
The former is a freely available, though (as of the time of writing) unfinished book that assumes familiarity with imperative programming, but not much more. It's concise, accessible, and a fun read.
The latter is a series of recordings of Per live streaming development, totaling almost 15 hours of video, with periods of silent thinking, debugging, and refactoring thrown in. There is a whole host of valuable knowledge in there, but in his explanations, Per assumes not only familiarity with various C-specific idioms, but also some amount of familiarity with compiler-specific terminology. It's not the most accessible source, to say the least.
I don't want to retread the ground covered in Crafting Interpreters here, so I'll assume you've read at least the first few chapters. Instead, consider this a complementary work, where I'll try to relate the new information from Per's videos, as well as my own findings.
With all that said, let's get started. We'll work our way through the compiler stages, starting with the lexer:
The lexer we see in Crafting Interpreters works as a precursory pass to parsing; it calls scanToken() until it reaches the end of the file, and then passes an array of all tokens to the parser. This is nice in principle, because having random access into the token stream allows us to parse pretty much any grammar we want.
In practice, it is totally unnecessary; language designers generally frown upon languages that require unbounded lookahead, and some languages, like Lisp, can be parsed without any lookahead at all.
Knowing this, Per opted to only keep the most recently lexed token around, and to call ReadToken() directly from the parser. With this setup, we have a strict, forward streaming interface. If we want lookahead, we have to manually keep track of previous tokens in local variables.
I didn't want to do that either, though, because I knew I'd need at least two tokens of lookahead for the grammar I wanted, and wouldn't be able to leverage the call stack to cleanly manage that state. As a compromise between the two methods, I implemented a ring buffer of tokens, which is refilled completely whenever the parser tries to look ahead farther than has been lexed.
static Token lookahead(TokenStream * tokens, u32 offset) {
    // We want LookaheadBufferSize to be a power of two so that the modulo
    // reduction of ring buffer indices can be optimized to a bitwise AND.
    StaticAssert(IsPowerOf2(LookaheadBufferSize));
    Assert(offset <= MaxTokenLookahead);
    Assert(tokens->read_cursor <= tokens->write_cursor);
    u32 index = tokens->read_cursor + offset;
    if (index < tokens->write_cursor) {
        return tokens->lookahead_buffer[index % LookaheadBufferSize];
    }
    { // Refill lookahead buffer:
        while (tokens->write_cursor - tokens->read_cursor < LookaheadBufferSize) {
            u32 write_index = tokens->write_cursor++ % LookaheadBufferSize;
            Token * write_ptr = &tokens->lookahead_buffer[write_index];
            bool stop_lexing = read_token(tokens, write_ptr);
            if (stop_lexing) break;
        }
        Assert(tokens->read_cursor <= tokens->write_cursor);
        Assert(tokens->write_cursor - tokens->read_cursor > offset);
    }
    return tokens->lookahead_buffer[index % LookaheadBufferSize];
}
If I really wanted to micro-optimize this, I could tweak LookaheadBufferSize. For now, I arbitrarily chose 32, but the order of magnitude might really matter for parsing speed: make it too large, and you won't be reading tokens from the data cache; make it too small, and you'll be refilling the ring buffer very often, constantly switching between lexer and parser code, and potentially thrashing the instruction cache.
Realistically, it doesn't matter that much. Compilers generally don't spend the majority of their time in lexing or parsing; if this does end up being a problem, profiling will show it, and I'll have one avenue of optimization figured out already.
While this is strictly a research project to me, I want to learn what goes into a production-grade compiler, and one thing I'd definitely want there is clean handling of Unicode.
I figured requiring source files to be UTF-8 encoded would be okay, but what if a file was using a different encoding? Just producing encoding errors, or worse, producing (and attempting to parse) garbage tokens would make for a, shall we say, interesting user experience.
So, when we begin lexing a file, before we even fill the lookahead buffer with tokens, we do this:
{ // Detect Unicode encodings other than UTF-8
    const char *EncodingError =
        "Encoding Error in '%s': source files must be UTF-8! (Detected %s)\n";
    if (file_size >= 4) {
        const char *UTF_32_BE_BOM = "\x00\x00\xFE\xFF";
        const char *UTF_32_LE_BOM = "\xFF\xFE\x00\x00";
        if (memcmp(file_memory, UTF_32_BE_BOM, 4) == 0) {
            printf(EncodingError, file_name, "UTF-32 BE byte order mark");
            return true;
        }
        if (memcmp(file_memory, UTF_32_LE_BOM, 4) == 0) {
            printf(EncodingError, file_name, "UTF-32 LE byte order mark");
            return true;
        }
    }
    if (file_size >= 2) {
        const char *UTF_16_BE_BOM = "\xFE\xFF";
        const char *UTF_16_LE_BOM = "\xFF\xFE";
        if (memcmp(file_memory, UTF_16_BE_BOM, 2) == 0) {
            printf(EncodingError, file_name, "UTF-16 BE byte order mark");
            return true;
        }
        if (memcmp(file_memory, UTF_16_LE_BOM, 2) == 0) {
            printf(EncodingError, file_name, "UTF-16 LE byte order mark");
            return true;
        }
    }
    if (file_size >= 3) {
        const char *UTF_8_BOM = "\xEF\xBB\xBF";
        if (memcmp(file_memory, UTF_8_BOM, 3) == 0) {
            // Skip byte order mark in UTF8 files
            file_memory += 3;
        }
    }
}
In the context of this block, return true basically tells the calling function "something went wrong, don't even bother attempting parsing". This code is adapted from the example code in the "Check for BOM markers" section of this chapter of the Programming with Unicode book.
For the actual UTF-8 decoding, I'll just refer you to this article written by Bjoern Hoehrmann, where he describes a really nice technique based on finite state machines.
There are some corners of Unicode that can cause trouble for programmers, though. Currently, I detect zero-width spaces, as well as a few other characters that could be used for code obfuscation. I'm told jslint does similar checks, but I couldn't find a sensible list of characters to forbid. An important one is the Greek question mark (U+037E), which is rendered with the same glyph as the semicolon in most fonts.
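For illustration, a check along these lines could look like this. The list below is just a sample I put together, not a vetted set:

static bool is_forbidden_codepoint(u32 codepoint) {
    switch (codepoint) {
        case 0x200B: // zero-width space
        case 0x200C: // zero-width non-joiner
        case 0x200D: // zero-width joiner
        case 0x2060: // word joiner
        case 0xFEFF: // zero-width no-break space (a stray BOM)
        case 0x037E: // Greek question mark, rendered like ';'
            return true;
        default:
            return false;
    }
}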
In order to do name resolution, you need some way of comparing two identifier tokens for equality. The easiest way to do that is to store a slice (i.e. a pointer and a length) into the file buffer directly on the token. This is bad, because it forces you to keep the file in memory for at least the lifetime of all tokens, which makes reclamation of resources a pain.
You could just copy the string, but then you're making a lot of small allocations, and don't forget that a string compare is quite slow: two pointer dereferences, at least one of which is likely to be a cache miss, and O(n) comparisons.
I picked up the solution to this problem from Jonathan Blow live streaming work on his Jai compiler: string interning. The idea is to maintain a table of unique strings you've seen before, deduplicating as early as possible. This allows us to replace any and all string comparisons we might otherwise have to do with a single integer comparison, at the cost of doing some upfront work for each identifier token. Since pretty much every identifier in a program will be looked up in some table eventually, this trick is definitely worth it.
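To make that concrete, here's a minimal sketch of an intern function using a fixed-size open-addressing hash table. Everything here - the table layout, the FNV-1a hash, the names - is my own illustration, not necessarily how Per or Jonathan implement it:

typedef struct {
    char * string; // NULL marks an empty slot
    u32 length;
    u32 atom;      // sequential unique ID, starting at 0
} InternSlot;

static InternSlot intern_table[4096]; // power of two => modulo is a bitwise AND
static u32 intern_count;

static u32 hash_bytes(const char * bytes, u32 length) { // FNV-1a
    u32 hash = 2166136261u;
    for (u32 i = 0; i < length; i++) {
        hash = (hash ^ (u8)bytes[i]) * 16777619u;
    }
    return hash;
}

static u32 intern_string(const char * string, u32 length) {
    // Note: a real implementation would grow the table before it fills up.
    for (u32 i = hash_bytes(string, length); ; i++) {
        InternSlot * slot = &intern_table[i % 4096];
        if (!slot->string) { // not seen before: copy it into the table
            slot->string = memcpy(malloc(length), string, length);
            slot->length = length;
            slot->atom = intern_count++;
            return slot->atom;
        }
        if (slot->length == length &&
            memcmp(slot->string, string, length) == 0) {
            return slot->atom; // seen before: all copies share one ID
        }
    }
}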
Another cool thing is that this allows you to do keyword recognition with the same table lookup: At program startup, intern all your keyword strings, and remember their unique representation. After lexing an identifier, intern it, and compare it to all the keywords. If your "unique representation" is a sequential integer ID, you can even do this with just one branch and one addition:
u32 atom = intern_string(token_start, length_in_bytes);
if (atom < KeywordCount) {
    result->token_type = KeywordIf + atom;
} else {
    result->token_type = IdentifierToken;
    result->value = atom;
}
That's pretty much free. You have to be careful to intern your keywords in the same order they appear in the TokenType enumeration, though.
This topic is pretty trivial if you've worked with assembly language before, but it's not covered in Crafting Interpreters, so here's a 10,000 ft overview:
Conceptually, you can think of your CPU as a state machine. Its internal state consists of a number of registers: tiny pieces of memory, typically 32 or 64 bits each on modern machines.
One of those registers serves a special purpose: it holds the address of the next instruction to be executed. This register is called, unsurprisingly, the instruction pointer, or sometimes the program counter, depending on the architecture.
Executing an instruction really means changing the state of the processor. We can group different instructions based on how they affect the processor state: arithmetic instructions such as ADD, SUB, or MUL will move the data from two registers into the arithmetic logic unit, where the result of the operation is calculated, and then move the result into a third register. Memory access instructions such as LDR and STR will move a register-sized piece of data from RAM into a register, or vice-versa. The CPU cannot do arithmetic on values in memory; they have to first go through a register. The third important category are branch instructions, which affect the instruction pointer.
Different instructions are systematically assigned binary representations. This underlying system is called the instruction encoding. The x64 CPU inside your desktop machine uses a variable-length encoding, where different instructions take up a varying number of bytes, whereas the ARM CPU inside the GBA uses a simpler fixed-length encoding, where all instructions are the same size.
It makes sense to split the compiler back-end into two parts: the code generator, which walks the abstract syntax tree and determines what instructions should be used, and the instruction emitter, which is just a big collection of functions that encode instructions and insert their binary representation into an output stream.
If you have a suitable output stream abstraction, writing an emitter is mostly mechanical work. Just refer to the instruction set manual for the CPU you're targeting, which should include relevant sections on instruction encoding. Here's a snippet from my emitter:
static inline void
Emit3OperandSum(u16 op, u16 dst_register, u16 src_register, u16 operand) {
    Assert(dst_register < 8);
    Assert(src_register < 8);
    Assert(operand < 8);
    EmitRaw(0x1800 | op << 9 | operand << 6 |
            src_register << 3 | dst_register);
}
#define EmitADD_R_R_R(Dst, Src, Operand) Emit3OperandSum(0, (Dst), (Src), (Operand))
#define EmitSUB_R_R_R(Dst, Src, Operand) Emit3OperandSum(1, (Dst), (Src), (Operand))
#define EmitADD_R_R_I(Dst, Src, Operand) Emit3OperandSum(2, (Dst), (Src), (Operand))
#define EmitSUB_R_R_I(Dst, Src, Operand) Emit3OperandSum(3, (Dst), (Src), (Operand))
The idea here is that one function handles one instruction encoding, and macros provide more convenient ways of calling the appropriate function to emit a specific instruction. This is helpful because, in my case, there's also an EmitADD_R_I and an EmitADD_R_R, both of which call different functions, for example.
Now imagine about 150 more lines of very similar code, and you know what my emitter looks like.
Now, figuring out which instructions to emit, and in what order, that's a little more challenging. There are many algorithms for doing this with varying degrees of sophistication; I mostly coded along with Per's videos here, so I ended up with something more on the simple side.
You should already be familiar with the simple tree-walk approach to interpretation:
static u32 eval_expression(const AstExpr * expr) {
    switch (expr->type) {
        case PrimaryExpression: {
            return expr->primary.value;
        } break;
        case BinaryExpression: {
            u32 lhs = eval_expression(expr->binary.lhs);
            u32 rhs = eval_expression(expr->binary.rhs);
            if (expr->binary.op == '+') {
                return lhs + rhs;
            } else if (expr->binary.op == '-') {
                return lhs - rhs;
            } /* ... */
        } break;
    }
}
We can use a perfectly analogous approach to code generation, where every return statement is replaced with some Emit calls, and functions instead return the index of the register that will contain the result of the operation:
static u16 emit_expression(const AstExpr * expr) {
    switch (expr->type) {
        case PrimaryExpression: {
            u16 result = allocate_register();
            // MOVes the constant value ("Immediate") into the register result
            EmitMOV_R_I(result, expr->primary.value);
            return result;
        } break;
        case BinaryExpression: {
            u16 lhs_reg = emit_expression(expr->binary.lhs);
            u16 rhs_reg = emit_expression(expr->binary.rhs);
            if (expr->binary.op == '+') {
                // ADDs the value in register rhs_reg to the value in lhs_reg,
                // and stores the result in lhs_reg
                EmitADD_R_R_R(lhs_reg, lhs_reg, rhs_reg);
            } else if (expr->binary.op == '-') {
                EmitSUB_R_R_R(lhs_reg, lhs_reg, rhs_reg);
            } /* ... */
            free_register(rhs_reg);
            return lhs_reg;
        } break;
    }
}
You may have heard scary things about register allocation, but that really only becomes hard when you try to keep a function's local variables in registers - just like the tree walk interpreter becomes more complicated when you add state and statements.
I won't show the implementation here, since it's a lot of bit-fiddling, but the basic idea is to maintain 8 bits of state - one for each register - indicating which registers are occupied with temporary values in the execution context of the instruction we emit next. To allocate a register, you find the index of any zero bit, set it to one, and return the index. To free the register, you mask out the bit at the specified index.
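Still, a rough sketch of that bookkeeping fits in a dozen lines. The names allocate_register and free_register match the snippets above; occupied_registers is a name I'm making up here:

static u8 occupied_registers; // bit i set => register ri holds a temporary

static u16 allocate_register(void) {
    Assert(occupied_registers != 0xFF); // all 8 registers are in use
    u16 reg = 0;
    while (occupied_registers & (1 << reg)) reg++; // find a zero bit
    occupied_registers |= (1 << reg);              // mark it occupied
    return reg;
}

static void free_register(u16 reg) {
    Assert(occupied_registers & (1 << reg));
    occupied_registers &= ~(1 << reg); // mask out the bit
}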
This approach is nice and simple, but it doesn't exactly generate spectacular code. For example, for the expression 1 + (2 + 3), it will generate the following piece of assembly:
MOV r0, 0x1
MOV r1, 0x2
MOV r2, 0x3
ADD r1, r1, r2
ADD r0, r0, r1
At the very least, I'd like to see something more like this:
MOV r0, 0x1
MOV r1, 0x2
ADD r1, 0x3 ; small constants can be encoded in the same instruction!
ADD r0, r0, r1
We can achieve this by delaying code generation for primary expressions until we know more about how they'll be used, i.e. on our way back up the call stack. To do that, we cannot simply return register indices anymore, we need to introduce a new type:
enum { ImmediateOperand, RegisterOperand };

typedef struct {
    u32 type;
    union {
        u32 imm;
        u16 reg;
    };
} Operand;
static Operand emit_expression(const AstExpr * expr) {
    switch (expr->type) {
        case PrimaryExpression: {
            Operand result;
            result.type = ImmediateOperand;
            result.imm = expr->primary.value;
            return result;
        } break;
        case BinaryExpression: {
            Operand lhs = emit_expression(expr->binary.lhs);
            emit_as_register(&lhs);
            Operand rhs = emit_expression(expr->binary.rhs);
            if (expr->binary.op == '+') {
                if (rhs.type == ImmediateOperand && rhs.imm < 0x100) {
                    // We only have 8 bits to encode immediate operands...
                    EmitADD_R_I(lhs.reg, rhs.imm);
                } else {
                    emit_as_register(&rhs);
                    EmitADD_R_R_R(lhs.reg, lhs.reg, rhs.reg);
                    free_register(rhs.reg);
                }
            } /* ... */
            return lhs;
        } break;
    }
}
emit_as_register is a helper function that will do nothing when called with a RegisterOperand, but when called with an ImmediateOperand, it will allocate a register, emit the appropriate load instruction, and modify the Operand to reflect that the value is now held in a register.
This approach is really flexible; we have enough wiggle room to rathole on instruction selection (the above code will call EmitADD_R_I for 1000 + 1, but not for 1 + 1000, for example), and we can get constant folding pretty much for free by checking whether both operands are immediates.
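As a sketch, if we move the emit_as_register(&lhs) call down so that lhs can stay an immediate, the folding check for '+' could look like this:

Operand lhs = emit_expression(expr->binary.lhs);
Operand rhs = emit_expression(expr->binary.rhs);
if (expr->binary.op == '+') {
    if (lhs.type == ImmediateOperand && rhs.type == ImmediateOperand) {
        lhs.imm += rhs.imm; // fold at compile time; nothing is emitted
        return lhs;
    }
    emit_as_register(&lhs);
    /* ... emit code as before ... */
}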
If we restructure our code a bit, such that we do the check on expr->binary.op before recursing to get lhs and rhs, we can even do operator-specific optimizations, such as eliminating the right sub-expression of a multiplication if the left sub-expression is a constant 0, or emitting a SUB instruction for expressions of the shape a + (-b).
If it isn't obvious from the code snippet in the introduction, I denote a pointer dereference A.*. I chose this syntax for several reasons, none of which really matter right now; all that matters is that this expression evaluates to the value stored at address A. We can shoehorn that into the emit_expression function from the previous section pretty easily:
case UnaryExpression: {
    Operand addr = emit_expression(expr->unary.operand);
    emit_as_register(&addr);
    if (expr->unary.op == OperatorDereference) {
        // Load from the address in addr.reg (offset by 0),
        // and store the value in the same register:
        EmitLDR_R_R_OFF(addr.reg, addr.reg, 0);
    } /* ... */
    return addr;
} break;
That does not allow us to store anything, though. For that, we need assignment operators. I don't really like C's idea of assignments being expressions, and wouldn't want users of my language to think of them as such. But if you were to look at my parser, you'd see that I still treat them that way internally.
That's in part because it made my life a little bit easier, but also because it allows for better error reporting. Think Error: assignments are not expressions, rather than Syntax Error: unexpected token '='. This technique of parsing a superset of the language you intend to actually accept is generally pretty good, and similar techniques can be applied to other parts of a compiler as well.
That said, let's look at two assignment operators, plain-old = and +=. Here, I'll assume that we have not yet emitted code for the two sub-expressions by the time we know we're looking at an assignment:
/* ... */
} else if (expr->binary.op == '=') {
    const AstExpr * lhs = expr->binary.lhs;
    Assert(lhs->type == UnaryExpression &&
           lhs->unary.op == OperatorDereference);
    Operand addr = emit_expression(lhs->unary.operand);
    emit_as_register(&addr);
    Operand value = emit_expression(expr->binary.rhs);
    emit_as_register(&value);
    EmitSTR_R_R_OFF(value.reg, addr.reg, 0);
    free_register(value.reg);
    free_register(addr.reg);
} else if (expr->binary.op == OperatorPlusEqual) {
    const AstExpr * lhs = expr->binary.lhs;
    Assert(lhs->type == UnaryExpression &&
           lhs->unary.op == OperatorDereference);
    Operand addr = emit_expression(lhs->unary.operand);
    emit_as_register(&addr);
    u16 tmp_reg = allocate_register();
    EmitLDR_R_R_OFF(tmp_reg, addr.reg, 0);
    Operand value = emit_expression(expr->binary.rhs);
    emit_as_register(&value);
    EmitADD_R_R_R(tmp_reg, tmp_reg, value.reg);
    EmitSTR_R_R_OFF(tmp_reg, addr.reg, 0);
    free_register(value.reg);
    free_register(tmp_reg);
    free_register(addr.reg);
} /* ... */
The code duplication, and the fact that assignments are not actually expressions and don't evaluate to anything (and therefore don't need to return a RegisterOperand), both suggest that this code should be split into a separate function, which is in fact what I did in my code.
Factoring style aside, though, the basic = assignment works pretty much like the + and - operators we've seen before, modulo the reaching into lhs to skip the pointer dereference.
The += assignment is a bit more interesting, I think. Because we need both the address and the value at that address, we allocate a temporary register and load into that before generating code for rhs.
Note that I'm omitting the optimization of the addition itself; if value was an ImmediateOperand, we could save another instruction here. Furthermore, we're always emitting load and store instructions with 0 offset; we could pack pointer arithmetic into the same instructions by again reaching into lhs.unary.operand.
Now that we have side effects at our disposal, we may become more interested in control flow, i.e. if and while statements. The basic framework will assume that a statement will point to the statement immediately following it; the last statement in a scope has no successor, even if the scope is part of a statement that does:
static void emit_statement(const AstStmt * stmt) {
    switch (stmt->type) {
        case IfElseStatement: {
            // ...
        } break;
        case LoopStatement: {
            // ...
        } break;
        case ExpressionStatement: {
            emit_expression(stmt->expr);
        } break;
    }
    if (stmt->next) {
        emit_statement(stmt->next);
    }
}
To figure out what we need to fill in the blanks here, we need to take a look at how control flow works at the machine level: in addition to the instruction pointer, there's another special register called the status register. It's not accessible directly by most instructions. Instead, it holds information about previous operations, such as the carry flag (telling us whether the last arithmetic instruction caused an overflow), the zero flag (which is exactly what you think it is), and a few others.
The CMP instruction is essentially the same as the SUB instruction, except that the result of the operation is discarded; only the flags are affected. The other piece of the puzzle are conditional branches, which, depending on the status register and the condition encoded in the instruction, may or may not add some constant offset to the instruction pointer. (These are called relative jumps, because the address of the code we want to jump to is encoded relative to the location of the jump instruction itself. There are also jumps to absolute addresses, and variable target jumps, where the target location is read from a general purpose register.)
With that, we are equipped to look at the kind of code we want to generate. For instance, for the statement if x % 2 { /* ... */ } else { /* ... */ }, we want to generate this structure:
MOV r1, 0x1
AND r0, r1 ; assume x is in r0
CMP r0, 0x0 ; C-style implicit conversion of condition
BEQ else
; ... execute then-branch statements here
B end_if
else:
; ...
end_if:
; ...
There's a problem here: because we don't know how many instructions we'll emit for the two branch bodies, we don't know the locations of the two labels, and thus cannot encode the instructions BEQ else and B end_if. The solution here is called backpatching: we emit a placeholder and remember its position. When we reach the label we want to backpatch, we emit the missing instruction.
In my own code, I use a macro called Backpatch for that, which temporarily sets the write cursor to the specified value, emits one instruction, and then restores the cursor. We can write something like this:
case IfElseStatement: {
    Operand cond = emit_expression(stmt->if_else.cond);
    emit_as_register(&cond);
    EmitCMP_R_I(cond.reg, 0);
    free_register(cond.reg);
    u16 * beq_else = EmitPlaceholder();
    emit_statement(stmt->if_else.then_branch);
    u16 * b_end_if = EmitPlaceholder();
    s16 jmp_offset = JumpOffsetFwd(beq_else);
    Backpatch(beq_else, EmitBEQ_OFF(jmp_offset));
    emit_statement(stmt->if_else.else_branch);
    jmp_offset = JumpOffsetFwd(b_end_if);
    Backpatch(b_end_if, EmitB_OFF(jmp_offset));
} break;
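In case you're wondering about EmitPlaceholder and Backpatch themselves, here's a minimal sketch based on the description above, assuming a global emit_cursor that EmitRaw writes to and advances (the loop snippet below uses the same cursor):

static u16 * EmitPlaceholder(void) {
    u16 * location = emit_cursor;
    EmitRaw(0); // reserve one instruction slot, to be overwritten later
    return location;
}

#define Backpatch(Location, EmitCall) do { \
        u16 * saved_cursor = emit_cursor;  \
        emit_cursor = (Location);          \
        EmitCall;                          \
        emit_cursor = saved_cursor;        \
    } while (0)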
Great, now let's do loops. There are a couple different ways to write loops in assembly, and they have slightly different properties. The one you see the most often is this:
B cond ; pre-check: jump to the _end_ of the loop
loop:
; ... loop body goes here
cond: ; condition is evaluated at the bottom of the loop
MOV r1, 0x1
AND r0, r1
CMP r0, 0x0
BNE loop ; Branch if Not Equal
To get a post-checked loop, we can just skip emitting the first instruction, B cond. I'll only show the pre-checked variant here:
case LoopStatement: {
    u16 * b_cond = EmitPlaceholder();
    u16 * loop = emit_cursor;
    emit_statement(stmt->loop.body);
    s16 jmp_offset = JumpOffsetFwd(b_cond);
    Backpatch(b_cond, EmitB_OFF(jmp_offset));
    Operand cond = emit_expression(stmt->loop.cond);
    emit_as_register(&cond);
    EmitCMP_R_I(cond.reg, 0);
    EmitBNE_OFF(JumpOffsetBkwd(loop));
    free_register(cond.reg);
} break;
Now the code I show here works, but if you play around with it a little bit, you'll see that it generates some super awkward code in some cases. Things like jumps to jump instructions - which we could replace with a jump to the actual target, without any indirection.
You could do that with the same "reaching in" trick I've discussed before (i.e. looking at the interior nodes of the AST node you're currently looking at), but there are many different combinations where this can occur, and the necessary checks will get more and more messy.
I haven't actually done this myself yet, but a trick I've learned from a paper (recommended by Fabian Giesen on his blog) is called destination-driven code generation. The idea is similar to the delayed code generation we're using for expressions: We introduce the notion of a control destination that we pass around as we traverse the syntax tree, and delay the backpatching of jump instructions until we know where they should go.
If you've never read a computer science paper before, this one might make a good starting point. There are some scary-looking equations in there, but they are explained in a very nice way.
So far, the code I've shown is pretty much instruction set agnostic. It applies to an x64 backend just as well as to the THUMB instruction set we've been using. You'd need to replace the emitter and adjust the calls to any Emit functions accordingly, but other than that, it'll "just work". But now I want to talk about an annoyance that is instruction set specific, and the solution I came up with:
As mentioned before, a THUMB instruction is two bytes, and thus the MOV rd, imm instruction you'd use inside emit_as_register can only encode immediate values up to 255 - there's only so much encoding space. What you're supposed to do instead is use a PC-relative load, LDR rd, off, which loads a 32-bit value from an address close to the instruction pointer.
Since we don't want the CPU to try and execute that data, we need to store it in a suitable location, called a literal pool. That is to say, you generate code until you reach a good place to dump all the values you couldn't fit into the instruction stream, and then backpatch a bunch of load instructions.
Any given expression might require any number of these, so we need to maintain a buffer of backpatch locations that we can push requests into. The question then becomes where to check whether emitting a literal pool is necessary; if you checked after every single statement, you'd need to emit jump instructions to prevent the pool from being executed. That's not only ugly; those jumps aren't free, so you're also making the generated program slower.
Instead, you want to flush the backpatch location buffer, as I call it, only in places where you're emitting jumps anyway. In a pre-checked loop, we can put it between the unconditional jump to the condition and the loop body; in an if/else statement, we can fit it in between the unconditional jump at the end of the first branch and the beginning of the second branch.
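To sketch the bookkeeping involved - all names below are my own, and EmitLDR_R_PC / PcRelativeOffset stand in for whatever the emitter actually provides:

typedef struct {
    u16 * load_instruction; // placeholder emitted in place of the load
    u16 dst_register;       // register the LDR should load into
    u32 value;              // constant that needs a slot in the pool
} PendingLiteral;

static PendingLiteral pending_literals[64];
static u32 pending_literal_count;

// Called only at points where an unconditional jump was just emitted:
static void flush_literal_pool(void) {
    if (pending_literal_count == 0) return;
    if ((uintptr_t)emit_cursor & 2) EmitRaw(0); // pools are 4-byte aligned
    for (u32 i = 0; i < pending_literal_count; i++) {
        PendingLiteral * lit = &pending_literals[i];
        u16 * slot = emit_cursor;
        EmitRaw(lit->value & 0xFFFF); // write the 32-bit literal...
        EmitRaw(lit->value >> 16);    // ...as two halfwords
        // The literal's address is now known; encode the actual load:
        Backpatch(lit->load_instruction,
                  EmitLDR_R_PC(lit->dst_register,
                               PcRelativeOffset(lit->load_instruction, slot)));
    }
    pending_literal_count = 0;
}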
What's nice about this approach is that we can repurpose it to also handle break and continue statements. What's not so nice is that it doesn't solve a fundamental robustness problem: just like the size of an immediate value is limited, so is the size of the offset we can encode in a load instruction - there's a hard limit to how many instructions we can emit between putting a location into the buffer and flushing the buffer. That limit is quite large (about 500 instructions), and we're not likely to run into any problems for simple programs. But there are pathological cases where this will blow up and fail to generate valid code.
I don't have a nice solution for this problem yet, and I don't want to dedicate too much effort to it for now. Going forward, I'd rather work on more language features.
And there you have it, all the interesting things I've done and learned so far. I hope this will be useful to somebody - if so, maybe I'll do a similar write-up once I've made some progress on the project.
I will say that this has already made me a much better programmer. Not using any libraries means you cannot rely on Stackoverflow to solve your cookie-cutter problems. Since you're laying the foundations of a larger system, you cannot rely on hacky bug fixes, you need to get down to the root cause. Successfully solving challenging problems in this context has been a real boost to my confidence in my skills.
At the same time, it was a really humbling experience. When you have a high-level understanding of how compilers work, it's easy to fall into thinking that they are simple. This illusion is shattered as soon as you get into the trenches and fight with all the little details.
All in all, this has been a very rewarding project so far, and I'd recommend doing something similar to anyone even remotely interested. I certainly don't intend to quit anytime soon.