Bootstrapping Zig from below C level.
The first computer programs were written directly in machine code. To make things easier, automated assemblers were developed, initially in machine code as well and then in their own input language. To make things easier again, high-level (for the time) languages such as BCPL, B, and eventually C were developed, which abstracted some of the details of the underlying architecture and allowed more complex compound logic. These languages were implemented by first writing compilers for subsets of them in assembly, then using those compilers to compile compilers for the full languages. Thus, there was a concrete bootstrap chain extending right down to assembly.
Nowadays, this is a lost art. Quite literally all languages that bootstrap themselves nowadays start from a pre-existing high-level language, usually C or C++, and then reject their roots as soon as they can. Zig stands out by maintaining its bootstrap chain, but this chain begins at a system C++ compiler, and currently there are no plans to build on firmer ground than a system C compiler. Granted, this is by no means a bad choice -- almost every platform is targeted by several C compilers, and it means we can have a single unified codebase for at least the first stage of bootstrapping. However, it also means we are forced into lockstep with C compiler standards (not to mention we have to assume that the system compiler actually follows these), and that we can't implement Zig on a completely unsupported platform, such as a personal project architecture.
We could mitigate these problems, and gain some serious nerd cred, if we could pull ourselves up from bare metal. This could be done by defining a strict subset of Zig (herein referred to as tZig, for "thin Zig", i.e. a thin layer over assembly), which is easier to translate to any machine code, then writing the stage 1 compiler in that subset, such that any language could be used for bootstrapping. This is more complex internally than the current plan, but it does not create more work for the end user -- in fact, if a platform does not have a C compiler for whatever reason, it is much less work to implement a tZig compiler than a full C compiler, especially if you only have assembly to work with. If a platform does have a C compiler, we can simply write a tZig compiler in C, and we don't even have to distribute this with the main Zig distribution.
- No forward references -- global declarations may only reference previously defined symbols
- No comptime -- only global constants are comptime-known, and see below
- No type or error set inference
- Declarations must not contain ALU ops, function calls, dereferences, or indexes
- Non-primitive types must be bound to a global constant before use
- No
union (enum)
-- enum type must be explicitly declared
- No
- Functions must be bound to a global constant before being called or referenced (once #1717 comes through)
- No member functions or types
- Local variables must all be declared at the start of a function
- No compound operations -- at most (one assignment and (one ALU op/function call/index/dereference OR one complex literal (array, struct, string))) OR one
switch
/if
statement per line/switch
arm/if
clause- In particular, this means that
switch
andif
statements cannot return values -- extracting values must be done through assignment
- In particular, this means that
- No
defer
orerrdefer
- No floating point types or values
- No
for
loops --while
loops only, with optional continue expressions, subject to the previously stated statement restrictions - No
async
(this one is definitely negotiable) - No inline assembly (this makes code non-portable without comptime switches, and requires an integrated assembler -- see below for an alternative)
- Only allowed import is the special-case
@import("bootstrap")
, which provides hooks into the implementation for heap allocation, file I/O, and other necessary functionality- When compiling tZig as Zig, this aliases selected standard library functionality
- No builtin functions that would contradict previous rules
These restrictions will allow the language to be compiled line by line in one pass, without the need for complex register allocation, and maintaining full compatibility with the final stage compiler.
Notes:
- The tight restrictions on single statements were chosen such that a typical RISC system can compile them into single instructions.
- An index/dereference may appear on either the left or the right of an assignment, but there must be only one.
defer
/errdefer
require a compiler to know a future address, which may depend on future statements -- thus, implementing them with a single pass is impossible. (I could be wrong about that. If so, they should be allowed.)- In contrast,
try
only requires knowledge of a return address, which typically has a designated register. - Floating point values are not available on all platforms, and not necessary for a compiler.
- Any
for
loop can be implemented as awhile
loop, andfor
loops have hidden state and functionality (namely, the current index and bounds check). async
is a deliberate departure from platform ABIs, hence it might not be obvious how to implement it. However, if it is particularly helpful, it can be accommodated.- Inline assembly makes code non-portable without comptime switches, and requires an integrated assembler, hurting simplicity.
- Not only is
@import("bootstrap")
ugly, it's an exception to both the declaration rule and the compound operation rule. However, at some point we do need to hook into the host system, and without inline assembly, the best we can do is put all the magic in one place. Also considered was placing a singlebootstrap
module instd
so as to avoid another@import
special case -- however, that would have required even more special casing on the tZig side, as well as necessitating lazy evaluation (and therefore the full complexity of Zig syntax) to avoid processing the whole standard library. - I have definitely missed details. Most of this was written in 2 hours, and the rest at midnight. I might do a proper proposal tomorrow.