adamnew123456/rpnjit-1.md

Created March 8, 2020 21:53

Star () You must be signed in to star a gist
Fork () You must be signed in to fork a gist

Learn more about clone URLs
Clone this repository at <script src="https://gist.github.com/adamnew123456/00cdaa77aabbd8b6db02d53d9c0b0659.js"></script>
Save adamnew123456/00cdaa77aabbd8b6db02d53d9c0b0659 to your computer and use it in GitHub Desktop.

Download ZIP

A Programmable RPN Calculator and JIT Compiler

Raw

rpnjit-1.md

A Basic RPN Calculator

We're going to start with a basic RPN interpreter and add function definitions to it. That will serve as a good starting point for adding on JIT compilation later. There are only going to be a few primitives to start with:

Integer constants
Arithmetic operators: +, -, *, /, %
Stack manipulation: swap, dup, drop
Display: print

These are simple enough to implement if we don't mind ignoring a lot of error handling:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LINE_SZ 4096

typedef enum {
    WORD_NONE,
    WORD_INT,
    WORD_COMMAND
} wordtype_t;

typedef struct {
    int vals[LINE_SZ];
    int idx;
} stack_t;

void eval_command(stack_t *stack, char *command) {
    if (!strcmp(command, "+")) {
        int b = stack->vals[--stack->idx];
        int a = stack->vals[--stack->idx];
        stack->vals[stack->idx++] = a + b;
    } else if (!strcmp(command, "-")) {
        int b = stack->vals[--stack->idx];
        int a = stack->vals[--stack->idx];
        stack->vals[stack->idx++] = a - b;
    } else if (!strcmp(command, "*")) {
        int b = stack->vals[--stack->idx];
        int a = stack->vals[--stack->idx];
        stack->vals[stack->idx++] = a * b;
    } else if (!strcmp(command, "/")) {
        int b = stack->vals[--stack->idx];
        int a = stack->vals[--stack->idx];
        stack->vals[stack->idx++] = a / b;
    } else if (!strcmp(command, "%")) {
        int b = stack->vals[--stack->idx];
        int a = stack->vals[--stack->idx];
        stack->vals[stack->idx++] = a % b;
    } else if (!strcmp(command, "swap")) {
        int b = stack->vals[stack->idx - 1];
        int a = stack->vals[stack->idx - 2];
        stack->vals[stack->idx - 1] = a;
        stack->vals[stack->idx - 2] = b;
    } else if (!strcmp(command, "dup")) {
        int a = stack->vals[stack->idx - 1];
        stack->vals[stack->idx++] = a;
    } else if (!strcmp(command, "drop")) {
        stack->idx--;
    } else if (!strcmp(command, "print")) {
        for (int i = 0; i < stack->idx; i++) {
            printf("[%d] = %d\n", i, stack->vals[i]);
        }
    }
}

int main() {
    char line[LINE_SZ];
    char command[128];
    int command_idx = 0;

    int number = 0;
    stack_t stack;
    stack.idx = 0;

    while (fgets(line, LINE_SZ, stdin) != NULL) {
        bool line_done = false;
        bool word_done = false;
        wordtype_t word_type = WORD_NONE;

        int i = 0;
        while (!line_done) {
            char ch = line[i];

            if (!ch) {
                line_done = true;
                word_done = true;
            } else if (ch == ' ' || ch == '\t' || ch == '\r' || ch == '\n') {
                word_done = true;
            } else if (ch >= '0' && ch <= '9') {
                word_type = WORD_INT;
                number = (number * 10) + (ch - '0');
            } else {
                word_type = WORD_COMMAND;
                command[command_idx++] = ch;
            }

            if (word_done) {
                switch (word_type) {
                case WORD_INT:
                    stack.vals[stack.idx++] = number;
                    break;
                case WORD_COMMAND:
                    command[command_idx++] = 0;
                    eval_command(&stack, command);
                    break;
                case WORD_NONE:
                    break;
                }

                number = 0;
                command_idx = 0;
                word_type = WORD_NONE;
                word_done = false;
            }

            i++;
        }
    }

    return 0;
}

After compiling this, we have a basic calculator that's ready to use. For example, we can compute the factorial of 5 by hand:

$ gcc rpn1.c -o rpn1
$ ./rpn1
1 2 3 4 5
print
[0] = 1
[1] = 2
[2] = 3
[3] = 4
[4] = 5
* * * *
print
[0] = 120

Allowing User Definitions

There are already a few things here which will have to change in order to allow users to define their own commands:

Right now all commands are executed as they are entered. In order to define commands, we'll have to support not only the immediate mode that we support now, but also a compile mode where we buffer commands within a function definition instead of running them.
Right now commands are only run within the fixed eval_command function, which statically dispatches on command names into corresponding parts of code. We'll have to have keep something like this since the primitives will still be there, but there will also have to be a place to store function definitions.
Right now all commands are implemented implicitly using code defined within the program itself. The work of implementing these commands on a lower level is done by the C compiler; for example, when we implement + we're relying on the compiler to generate the right code and run it within the branch that tests that the command really is +.

User-defined commands can't be run this way because they won't be implemented in C (even if the primitives they depend upon will be). We'll have to add an interpreter to execute those user-defined commands, and how it works will depend upon how user-defined functions are stored.

To address each of these concerns, we're going to use a simple representation that lets us reuse a lot of the program as it exists now:

The immediate mode syntax will be nothing special, just the same commands that we usually run without any extra adornment. We're going to activate compile mode using a syntax similar to Forth, where : starts a definition and it goes to the end of the line. For example, : incr 1 + defines a word incr which can be used to increment the top value on the stack.
We can add a simple list that stores both command names and the code associated with them (what that code actually looks like will be covered in the next point). To find a definition, we just have to iterate this list and check each command name. There are more efficient ways to do this but we're shooting for brevity here.
We can reuse most of the existing command evaluator if we store user-defined commands as text. We'll still have to rework the dispatching mechanism as described in the last point, but the parsing will mostly stay the same.

With all these changes applied, we get the second iteration of the RPN calculator. Again, this ignores lots of error handling as well as not covering cases which might be useful, like redefining commands within the same session.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LINE_SZ 4096
#define COMMAND_SZ 128
#define COMMAND_CNT 32

typedef enum {
    WORD_NONE,
    WORD_INT,
    WORD_COMMAND
} wordtype_t;

typedef struct {
    int vals[LINE_SZ];
    int idx;
} stack_t;

typedef struct {
    bool used;
    char name[COMMAND_SZ];
    char defn[LINE_SZ];
} userdef_t;

void eval_primitive(stack_t *stack, char *command) {
    if (!strcmp(command, "+")) {
        int b = stack->vals[--stack->idx];
        int a = stack->vals[--stack->idx];
        stack->vals[stack->idx++] = a + b;
    } else if (!strcmp(command, "-")) {
        int b = stack->vals[--stack->idx];
        int a = stack->vals[--stack->idx];
        stack->vals[stack->idx++] = a - b;
    } else if (!strcmp(command, "*")) {
        int b = stack->vals[--stack->idx];
        int a = stack->vals[--stack->idx];
        stack->vals[stack->idx++] = a * b;
    } else if (!strcmp(command, "/")) {
        int b = stack->vals[--stack->idx];
        int a = stack->vals[--stack->idx];
        stack->vals[stack->idx++] = a / b;
    } else if (!strcmp(command, "%")) {
        int b = stack->vals[--stack->idx];
        int a = stack->vals[--stack->idx];
        stack->vals[stack->idx++] = a % b;
    } else if (!strcmp(command, "swap")) {
        int b = stack->vals[stack->idx - 1];
        int a = stack->vals[stack->idx - 2];
        stack->vals[stack->idx - 1] = a;
        stack->vals[stack->idx - 2] = b;
    } else if (!strcmp(command, "dup")) {
        int a = stack->vals[stack->idx - 1];
        stack->vals[stack->idx++] = a;
    } else if (!strcmp(command, "drop")) {
        stack->idx--;
    } else if (!strcmp(command, "print")) {
        for (int i = 0; i < stack->idx; i++) {
            printf("[%d] = %d\n", i, stack->vals[i]);
        }
    }
}

void eval_line(stack_t *stack, userdef_t *defs, char *line) {
    bool line_done = false;
    bool word_done = false;
    wordtype_t word_type = WORD_NONE;

    char command[128];
    int command_idx = 0;
    int number = 0;

    int i = 0;
    while (!line_done) {
        char ch = line[i];

        if (!ch) {
            line_done = true;
            word_done = true;
        } else if (ch == ' ' || ch == '\t' || ch == '\r' || ch == '\n') {
            word_done = true;
        } else if (ch >= '0' && ch <= '9') {
            word_type = WORD_INT;
            number = (number * 10) + (ch - '0');
        } else {
            word_type = WORD_COMMAND;
            command[command_idx++] = ch;
        }

        if (word_done) {
            switch (word_type) {
            case WORD_INT:
                stack->vals[stack->idx++] = number;
                break;
            case WORD_COMMAND: {
                command[command_idx++] = 0;

                bool found_cmd = false;
                for (int i = 0; i < COMMAND_CNT; i++) {
                  if (defs[i].used && !strcmp(command, defs[i].name)) {
                    eval_line(stack, defs, defs[i].defn);
                    found_cmd = true;
                  }
                }

                if (!found_cmd) {
                    eval_primitive(stack, command);
                }
                break;
            }
            case WORD_NONE:
                break;
            }

            number = 0;
            command_idx = 0;
            word_type = WORD_NONE;
            word_done = false;
        }

        i++;
    }
}

void compile_line(userdef_t *defs, char *line) {
  int def_target;
  for (def_target = 0; def_target < COMMAND_CNT; def_target++) {
    if (!defs[def_target].used) break;
  }

  defs[def_target].used = true;
  int start_cmd;
  int end_cmd;

  for (int i = 1; i < LINE_SZ; i++) {
    if (line[i] != ' ' && line[i] != '\n' && line[i] != '\r' && line[i] != '\t') {
      start_cmd = i;
      break;
    }
  }

  for (int i = start_cmd; i < LINE_SZ; i++) {
    if (line[i] == ' ' || line[i] == '\n' || line[i] == '\r' || line[i] == '\t') {
      end_cmd = i;
      break;
    } else {
      defs[def_target].name[i - start_cmd] = line[i];
    }
  }

  memcpy(defs[def_target].defn, line + end_cmd, 4096 - end_cmd);
}

int main() {
    char line[LINE_SZ];
    stack_t stack = {0};
    userdef_t userdefs[COMMAND_CNT] = {0};

    while (fgets(line, LINE_SZ, stdin) != NULL) {
        if (line[0] == ':') {
            compile_line(userdefs, line);
        } else {
            eval_line(&stack, userdefs, line);
        }
    }

    return 0;
}

Once compiled, we can use this to define and run our own functions!

$ gcc rpn2.c -o rpn2
$ ./rpn2
: double dup +
5 double print
[0] = 10

Compiling More Efficiently

Right now the compiled representation is fairly inefficient. Every time you invoke a function its code has to be loaded and parsed, which includes wasting time on insignificant characters like whitespace.

Instead of working with the raw command line, we can compile all functions into a series of basic operations ahead of time and store that. What kind of operations do we need?

Pushing an integer constant
Invoking a primitive
Invoking a user-defined function
Ending a function

We can represent each of these instructions using a simple code, consisting of one prefix byte and some trailing data:

bytes

Integer constant:

0        1        2        3        4        5
+--------|--------|--------|--------|--------+
|    0   |   32-bit integer constant         |
+--------|--------|--------|--------|--------+

0        1        2        3        4        5
+--------|--------|--------+
|    1   |   16-bit int    |
+--------|--------|--------+

0        1        2
+--------|--------+
|    2   | 8-bit  |
+--------|--------+

Primitives:

0        1
+--------+
|    3   | +
+--------+
|    4   | -
+--------+
|    5   | *
+--------+
|    6   | /
+--------+
|    7   | %
+--------+
|    8   | swap
+--------+
|    9   | dup
+--------+
|   10   | drop
+--------+
|   11   | print
+--------+


User-defined functions:

0        1        2
+--------|--------+
|   12   |  cmd   |
+--------|--------+

Exit function:

0        1
+--------+
|   13   |
+--------+

This allows us to take something like the double function from before:

: double dup +

Which was executed by evaluating the 5-byte string dup + every time, and convert it to this compact 3-byte representation:

{11, 3, 13}

One thing to note is that we could have defined a simpler encoding for integers which stores all values as 32-bit constants. While this would work, this has the effect of significantly bloating certain definitions like this alternative version of double:

: double 2 *

In this case, the fixed-width 32-bit encoding would produce something like this which is larger than the textual version by 5 bytes:

{0, 0, 0, 0, 2, 6, 13}

Instead, we can do much better if we use the more compact integer encoding. This is half the size of the original encoding and only one byte larger than the textual version:

{2, 2, 6, 13}

Also, it's important to notice that by choosing one byte to use for referencing user-defined functions, we're enshrining a limit of 127 functions at a deeper layer than we have up until this point. Before it was easy to change that limit by adjusting a #define, but now that limit is baked into our function representation.

This is the next version of the RPN calculator utilizing a byte-code compiler:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LINE_SZ 4096
#define COMMAND_SZ 128
#define COMMAND_CNT 32

typedef enum { WORD_NONE, WORD_INT, WORD_COMMAND } wordtype_t;

typedef enum {
  CODE_I32,
  CODE_I16,
  CODE_I8,
  CODE_ADD,
  CODE_SUB,
  CODE_MUL,
  CODE_DIV,
  CODE_MOD,
  CODE_SWAP,
  CODE_DUP,
  CODE_DROP,
  CODE_PRINT,
  CODE_UDF,
  CODE_EXIT
} cmdtype_t;

typedef struct {
  int vals[LINE_SZ];
  int idx;
} stack_t;

typedef struct {
  bool used;
  char name[COMMAND_SZ];
  char defn[LINE_SZ];
} userdef_t;

void exec_code(stack_t *stack, userdef_t *defs, char *codes) {
  int code_idx = 0;

  while (true) {
    cmdtype_t command = codes[code_idx++];
    switch (command) {
    case CODE_I8: {
      int value = codes[code_idx++];
      stack->vals[stack->idx++] = value;
      break;
    }
    case CODE_I16: {
      int value = 0;
      value |= (codes[code_idx++] << 8);
      value |= codes[code_idx++];
      stack->vals[stack->idx++] = value;
      break;
    }
    case CODE_I32: {
      int value = 0;
      value |= (codes[code_idx++] << 24);
      value |= (codes[code_idx++] << 16);
      value |= (codes[code_idx++] << 8);
      value |= codes[code_idx++];
      stack->vals[stack->idx++] = value;
      break;
    }
    case CODE_ADD: {
      int b = stack->vals[--stack->idx];
      int a = stack->vals[--stack->idx];
      stack->vals[stack->idx++] = a + b;
      break;
    }
    case CODE_SUB: {
      int b = stack->vals[--stack->idx];
      int a = stack->vals[--stack->idx];
      stack->vals[stack->idx++] = a - b;
      break;
    }
    case CODE_MUL: {
      int b = stack->vals[--stack->idx];
      int a = stack->vals[--stack->idx];
      stack->vals[stack->idx++] = a * b;
      break;
    }
    case CODE_DIV: {
      int b = stack->vals[--stack->idx];
      int a = stack->vals[--stack->idx];
      stack->vals[stack->idx++] = a / b;
      break;
    }
    case CODE_MOD: {
      int b = stack->vals[--stack->idx];
      int a = stack->vals[--stack->idx];
      stack->vals[stack->idx++] = a % b;
      break;
    }
    case CODE_SWAP: {
      int b = stack->vals[stack->idx - 1];
      int a = stack->vals[stack->idx - 2];
      stack->vals[stack->idx - 1] = a;
      stack->vals[stack->idx - 2] = b;
      break;
    }
    case CODE_DUP: {
      int a = stack->vals[stack->idx - 1];
      stack->vals[stack->idx++] = a;
      break;
    }
    case CODE_DROP: {
      stack->idx--;
      break;
    }
    case CODE_PRINT: {
      for (int i = 0; i < stack->idx; i++) {
        printf("[%d] = %d\n", i, stack->vals[i]);
      }
      break;
    }
    case CODE_UDF: {
      int value = codes[code_idx++];
      exec_code(stack, defs, defs[value].defn);
      break;
    }
    case CODE_EXIT:
      return;
    }
  }
}

int compile_command(userdef_t *defs, char *command) {
  for (int i = 0; i < COMMAND_CNT; i++) {
    if (defs[i].used && !strcmp(command, defs[i].name)) {
      return i;
    }
  }

  if (!strcmp(command, "+")) {
    return -CODE_ADD;
  } else if (!strcmp(command, "-")) {
    return -CODE_SUB;
  } else if (!strcmp(command, "*")) {
    return -CODE_MUL;
  } else if (!strcmp(command, "/")) {
    return -CODE_DIV;
  } else if (!strcmp(command, "%")) {
    return -CODE_MOD;
  } else if (!strcmp(command, "swap")) {
    return -CODE_SWAP;
  } else if (!strcmp(command, "dup")) {
    return -CODE_DUP;
  } else if (!strcmp(command, "drop")) {
    return -CODE_DROP;
  } else if (!strcmp(command, "print")) {
    return -CODE_PRINT;
  } else {
    return -CODE_EXIT;
  }
}

void compile_line(userdef_t *defs, char *codes, char *line, int cmd_start) {
  bool line_done = false;
  bool word_done = false;
  wordtype_t word_type = WORD_NONE;

  char command[128];
  int command_idx = 0;
  int number = 0;
  int code_idx = 0;

  int i = cmd_start;
  while (!line_done) {
    char ch = line[i];

    if (!ch) {
      line_done = true;
      word_done = true;
    } else if (ch == ' ' || ch == '\t' || ch == '\r' || ch == '\n') {
      word_done = true;
    } else if (ch >= '0' && ch <= '9') {
      word_type = WORD_INT;
      number = (number * 10) + (ch - '0');
    } else {
      word_type = WORD_COMMAND;
      command[command_idx++] = ch;
    }

    if (word_done) {
      switch (word_type) {
      case WORD_INT:
        if (abs(number) <= (1 << 7) - 1) {
          codes[code_idx++] = CODE_I8;
          codes[code_idx++] = number & 0xFF;
        } else if (abs(number) <= (1 << 15) - 1) {
          codes[code_idx++] = CODE_I16;
          codes[code_idx++] = (number >> 8) & 0xFF;
          codes[code_idx++] = number & 0xFF;
        } else {
          codes[code_idx++] = CODE_I32;
          codes[code_idx++] = (number >> 24) & 0xFF;
          codes[code_idx++] = (number >> 16) & 0xFF;
          codes[code_idx++] = (number >> 8) & 0xFF;
          codes[code_idx++] = number & 0xFF;
        }
        break;
      case WORD_COMMAND: {
        command[command_idx++] = 0;

        int command_code = compile_command(defs, command);
        if (command_code >= 0) {
          codes[code_idx++] = CODE_UDF;
          codes[code_idx++] = (char)command_code;
        } else {
          codes[code_idx++] = (char)-command_code;
        }
        break;
      }
      case WORD_NONE:
        break;
      }

      number = 0;
      command_idx = 0;
      word_type = WORD_NONE;
      word_done = false;
    }

    i++;
  }

  codes[code_idx++] = CODE_EXIT;
}

int main() {
  char line[LINE_SZ];
  char def[COMMAND_SZ];
  stack_t stack = {0};
  userdef_t userdefs[COMMAND_CNT] = {0};
  char line_defn[LINE_SZ];

  while (fgets(line, LINE_SZ, stdin) != NULL) {
    if (line[0] == ':') {
      int def_target;
      for (def_target = 0; def_target < COMMAND_CNT; def_target++) {
        if (!userdefs[def_target].used)
          break;
      }

      userdefs[def_target].used = true;
      int start_cmd;
      int end_cmd;

      for (int i = 1; i < LINE_SZ; i++) {
        if (line[i] != ' ' && line[i] != '\n' && line[i] != '\r' &&
            line[i] != '\t') {
          start_cmd = i;
          break;
        }
      }

      for (int i = start_cmd; i < LINE_SZ; i++) {
        if (line[i] == ' ' || line[i] == '\n' || line[i] == '\r' ||
            line[i] == '\t') {
          end_cmd = i;
          break;
        } else {
          userdefs[def_target].name[i - start_cmd] = line[i];
        }
      }

      compile_line(userdefs, userdefs[def_target].defn, line, end_cmd);
    } else {
      compile_line(userdefs, line_defn, line, 0);
      exec_code(&stack, userdefs, line_defn);
    }
  }

  return 0;
}

Raw

rpnjit-2.md

In the previous post we walked through constructing a basic programmable RPN calculator, from the most basic form with no user definitions at all, to a more advanced interpreter which runs user code after it's been compiled to byte-code.

We can use the byte-code calculator as a skeleton for our JIT calculator, but getting to that point requires a lot of build-up to discuss issues of instruction encoding and the subtleties of integrating our C driver with code that we're going to produce and execute at runtime.

x64 Instruction Layouts

There are two big options we can take in determining how user-defined functions and primitives work together. One is similar to the method we've been using up to this point, where user-defined code is implemented in our target language (machine code in this case) and primitives are implemented in C:

Driver (parser/compiler)       C
+-- User-defined code          x64
    +-- Primitive operations   C

The problem with this approach is that C has a very specific calling convention which we would rather avoid embedding into our compiled output. Since the C compiler is allowed to clobber several registers we may want to use and has a specific expected stack layout, our machine code would need to interface with that calling convention. That bloats the target code and makes it harder to write our JIT compiler.

The alternative would be to inline the primitive operations into the code we generate for user functions, so that there's no separation and no need to adhere to the C calling convention:

Driver (parser/compiler)                    C
+-- User-defined code + Primitives          x64

This leaves us with the burden of understanding the encoding of more instructions than we would need to know for the first approach, but it's a worthy trade-off to keep our compiler simple.

The instructions that we're going to need here are:

32-bit addition, subtraction, multiplication, division and modulo
Loading constant values
Pushing and popping values from the data stack

One quirk of a lot of CPU architectures is that division and modulo are fused into one operation that calculates both. That means that we really just need to know how to do division, and how to pull out the division result or the remainder result as need be.

In addition to knowing the instructions involved, we'll also have to pick out registers so that we know what the operations will be using. Since we'll be going to this level of detail it will make sense to write out each primitive operation as a series of x64 assembly instructions and then hand-compile them to figure out what should go in our program.

The first simple operation is addition which takes two values off of the stack, adds them, and then pushes the result back onto the stack. At this point we've left the idea of "the stack" implicit, but we'll have to decide now where the stack is actually going to live. We'll also have to decide where to keep the current stack offset pointer.

%rdi and %rsi are good choices for the stack top and offset because they will be easy to fill directly from the driver-side C code. The C calling convention on Linux specifies that the first argument goes into %rdi and the second into %rsi, so it will be easy to call a UDF from C: ((void(*)(int*)) udf_code)(stack_top, stack_idx)

Keeping that in mind, we'll write addition this way:

decl %esi
movl (%rdi, %rsi, 4), %eax
decl %esi
movl (%rdi, %rsi, 4), %edx
addl %eax, %edx
movl %edx, (%rdi, %rsi, 4)
incl %esi

In x64 instructions that operate on 64-bit operands require a special prefix called the RAX prefix. To avoid having to write that down, we use 32-bit increments and decrements when managing the stack offset. This is safe to do partially because the stack offset is not very large, and partially because x64 will clear the upper 32-bits of a 64-bit register when you write to the bottom 32-bits. In this case, decrementing %esi will clear the upper 32-bits of %rsi

Now we have to compile this into machine code. This will require some knowledge of not only how instructions like dec and mov are encoded, but also about the scaled addressing scheme we're using here. The main reference for all these encodings is the IA-64 Architecture Manual which you can get from Intel. I won't go into the details of these encodings, but suffice to say they were by far the most difficult part to decipher!

FF CE    // DEC ModRM(%esi, opcode: 1)
8B 04 B7 // MOV ModRM(%eax) SIB(%rdi + 4*%rsi)
FF CE    // DEC ModRM(%esi, opcode: 1)
8B 14 B7 // MOV ModRM(%edx) SIB(%rdi + 4*%rsi)
01 C2    // ADD ModRM(%edx, %eax)
89 14 B7 // MOV ModRM(%edx) SIB(%rdi + 4*%rsi)
FF C6    // INC ModRM(%esi, opcode: 0)

Subtraction is very similar to addition:

FF CE    // DEC ModRM(%esi, opcode: 1)
8B 04 B7 // MOV ModRM(%eax) SIB(%rdi + 4*%rsi)
FF CE    // DEC ModRM(%esi, opcode: 1)
8B 14 B7 // MOV ModRM(%edx) SIB(%rdi + 4*%rsi)
29 C2    // SUB ModRM(%edx, %eax)
89 14 B7 // MOV ModRM(%edx) SIB(%rdi + 4*%rsi)
FF C6    // INC ModRM(%esi, opcode: 0)

Multiplication:

FF CE    // DEC ModRM(%esi, opcode: 1)
8B 04 B7 // MOV ModRM(%eax) SIB(%rdi + 4*%rsi)
FF CE    // DEC ModRM(%esi, opcode: 1)
8B 14 B7 // MOV ModRM(%edx) SIB(%rdi + 4*%rsi)
0F AF D0 // IMUL ModRM(%edx, %eax)
89 14 B7 // MOV ModRM(%edx) SIB(%rdi + 4*%rsi)
FF C6    // INC ModRM(%esi, opcode: 0)

Signed division is special because there are no two argument forms like the other operations. There's only one argument which serves as the second half of the division, with the first half and the result/remainder going into fixed registers.

That means that we're going to have to adjust our assembly template slightly to ensure that edx is clear:

decl %esi
movl (%rdi, %rsi, 4), %ecx
decl %esi
movl (%rdi, %rsi, 4), %eax
xorl %edx, %edx
idiv %ecx
movl %eax, (%rdi, %rsi, 4)
incl %esi

Which we can encode as:

FF CE    // DEC ModRM(%esi, opcode: 1)
8B 0C B7 // MOV ModRM(%ecx) SIB(%rdi + 4*%rsi)
FF CE    // DEC ModRM(%esi, opcode: 1)
8B 04 B7 // MOV ModRM(%eax) SIB(%rdi + 4*%rsi)
31 D2    // XOR ModRM(%edx, %edx)
F7 F9    // IDIV ModRM(%ecx, opcode: 7)
89 04 B7 // MOV ModRM(%eax) SIB(%rdi + 4*%rsi)
FF C6    // INC ModRM(%esi, opcode: 0)

Modulo is almost the same as division, except that we take the opposite result register:

FF CE    // DEC ModRM(%esi, opcode: 1)
8B 0C B7 // MOV ModRM(%ecx) SIB(%rdi + 4*%rsi)
FF CE    // DEC ModRM(%esi, opcode: 1)
8B 04 B7 // MOV ModRM(%eax) SIB(%rdi + 4*%rsi)
31 D2    // XOR ModRM(%edx, %edx)
F7 F9    // IDIV ModRM(%ecx, opcode: 7)
89 14 B7 // MOV ModRM(%edx) SIB(%rdi + 4*%rsi)
FF C6    // INC ModRM(%esi, opcode: 0)

The stack operations are all very simple, and in fact we've seen most of them at this point. First, drop:

FF CE      // DEC ModRM(%esi, opcode: 1)

Then dup:

FF CE    // DEC ModRM(%esi, opcode: 1)
8B 14 B7 // MOV ModRM(%edx) SIB(%rdi + 4*%rsi)
FF C6    // INC ModRM(%esi, opcode: 0)
89 14 B7 // MOV ModRM(%edx) SIB(%rdi + 4*%rsi)
FF C6    // INC ModRM(%esi, opcode: 0)

And finally swap, which we'll implement like this:

8B 54 B7 FC // MOV ModRM(%edx) SIB(%rdi + 4*%rsi) - 4
8B 44 B7 F8 // MOV ModRM(%eax) SIB(%rdi + 4*%rsi) - 8
89 44 B7 FC // MOV ModRM(%eax) SIB(%rdi + 4*%rsi) - 4
89 54 B7 F8 // MOV ModRM(%edx) SIB(%rdi + 4*%rsi) - 8

We also have to implement pushing 32-bit integer constants, which we'll do like this:

B8 XX XX XX XX // MOV(%eax) IMM32(x)
89 04 B7 // MOV ModRM(%eax) SIB(%rdi + 4*%rsi)
FF C6    // INC ModRM(%esi, opcode: 0)

Calling Conventions

With all of the "direct" primitives down, we can focus on the two remaining hard cases: calling user-defined functions and implementing print.

Since we're implementing our own function calls, we'll have to decide on our own calling convention. We could borrow the one from C, but there's not much point since we don't have things like local variables or efficient register allocation which C's calling convention is explicitly designed to support. Instead, we can avoid most of that work by setting a couple of ground rules:

%rdi always contains the address of the top of the stack (the least recently pushed value)
%rsi always contains the current stack offset in terms of integer cells

Encoding UDF

If we keep to these rules by not using %rdi or %rsi except for these purposes then the calling convention is easy. Just use the native call and ret instructions to manage the return address and don't clobber %rdi or %rsi in functions. We satisfied the second of those rules in the above encodings so we just have to implement the first here:

movabs %rax, <udf-jit-address>
call *%rax

Which assembles into:

48 B8 XX XX XX XX XX XX XX XX // REX.W MOVABS(%rax) <udf-jit-address-be>
FF D0                         // CALL ModRM(%rax, opcode: 2)

On the other end, returning from a function is just the ret instruction, which is encoded as:

C3 // RET

Encoding Print

We're in a different situation with print because we have to deal with the C calling convention, regardless of how we handle it. To avoid bloating the generated code with the assembly version of the printing loop, we'll write that bit in C and pass the address of the function down.

That means that we'll need another reserved register. %rbp is a good choice for this purpose since it's usually used for implementing local variables, which our language doesn't support. Also, the C calling convention says that called functions must preserve it which means that there's one less register we have to take care of.

Since %rbp is taken care of by the callee, all we have to do is save %rdi and %rsi. We don't even have to assign them different values. Because the print function will have the prototype void print(int *stack, int offset), the stack will go in %rdi (where it already is) and the offset will go in %rsi (where it also already is).

The only other concern we have to worry about here is stack alignment. When we call print the stack has to be aligned to 16 bytes, which won't automatically be the case if we just pushed %rdi and %rsi with no padding:

0  ------------ previous function alignment boundary
8  return addr  (pushed implicitly by call)
16 saved %rdi
24 saved %rsi

We'll need an extra 8 bytes to get the right alignment:

0  ------------ previous function alignment boundary
8  return addr  (pushed implicitly by call)
16 saved %rdi
24 saved %rsi
32 padding

With all that in mind, this is what the complete call to print looks like:

pushq %rdi
pushq %rsi
subq %rsp, 8 
call *%rbp
addq %rsp, 8
popq %rsi
popq %rdi

After assembling, it looks like this:

57                            // PUSH(%rdi)
56                            // PUSH(%rsi)
48 83 EC 08                   // REX.W SUB ModRM(%rsp, opcode: 5)
FF D5                         // CALL ModRM(%rax, opcode: 2)
48 83 C4 08                   // REX.W ADD ModRM(%rsp, opcode: 0)
5E                            // POP(%rsi)
5F                            // POP(%rdi)

Gluing Our Calling Convention With The C Calling Convention

Because we made the choice to use %rbp for storing the address of print, we'll have to do a few things to make sure that C code can call our JIT output safely:

We'll have to preserve the %rbp from the outer code, since the C calling convention requires it and we'll definitely be overwriting it
We'll have to take in the address of print via the third standard argument register, %rdx, and move it into %rbp

Since this stub is going to be the same everywhere, it makes sense to make it generic. That means that we'll actually be taking 4 arguments instead of 3:

The stack base address (%rdi)
The current stack offset (%rsi)
The address of the print function (%rdx)
The address of the JIT code to chain to (%rcx)

We'll also have to get the current stack offset back out of this function, which just requires copying the stack offset into %eax and using it as our value.

We can use this as our stub:

pushq %rbp
movq %rdx, %rbp
call *%rcx
popq %rbp
movq %rsi, %rax
ret

Which turns into this after assembling:

55       // PUSH(%rbp)
48 89 D5 // REX.W MOV ModRM(%rbp, %rdx)
FF D1    // CALL ModRM(%rcx, opcode: 2)
5D       // POP(%rbp)
48 89 F0 // REX.W MOV ModRM(%rax, %rsi)
C3       // RET

Missing Padding?

After the discussion of padding in the previous sections, one thing stands out about the UDF calling convention we established earlier. It doesn't ensure that the padding of the stack is preserved across calls.

Consider what happens if we have a definition like this:

: myprint print
: dbgsquare dup * myprint

If we invoke it then the call stack will start out 16-byte aligned at the end of our JIT transfer stub:

0   ---------- C code alignment boundary
8   return addr
16  saved %rbp

Then we'll get into dbqsquare and get out of alignment. Everything's going to plan so far: if we called print directly within dbqsquare, the padding we add would align us properly again.

0   ---------- C code alignment boundary
8   return addr
16  saved %rbp
24  stub addr

Instead we call myprint and suddenly we're back in alignment again! At this point it's clear what the problem is: if we want to ensure that we have a specific alignment at one point, we have to preserve that alignment all the way to that point. Otherwise different code paths could put the stack in and out of alignment at times we don't expect.

0   ---------- C code alignment boundary
8   return addr
16  saved %rbp
24  stub addr
32  dbgsquare addr

The easy way to fix this is to advance the stack pointer at the start of every user-defined function and fix it before we return. This means that our function prologue (which we didn't previously have before!) looks like this:

48 83 EC 08 // REX.W SUB ModRM(%rsp, opcode: 5)

Our return code looks like this:

48 83 C4 08 // REX.W ADD ModRM(%rsp, opcode: 0)
C3          // RET

This simplifies the code we use to call other functions though. Now that the stack is properly aligned by the function prologue, each individual call doesn't have to worry about setting up the alignment on its own:

57    // PUSH(%rdi)
56    // PUSH(%rsi)
FF D5 // CALL ModRM(%rbp, opcode: 2)
5E    // POP(%rsi)
5F    // POP(%rdi)

Let's run through the stack mapping exercise one more time to make sure that this change does actually fix things:

0   ---------- C code alignment boundary
8   return addr
16  saved %rbp
24  stub addr
32  padding
40  dbgsquare addr
48  padding

Success!

Invoking JIT Code At Runtime

At this point the last piece of our puzzle is how to figure out how to store this code in such a way that we can execute it. The obvious solution would be to use stack memory:

#include <stdio.h>

void write_machinecode(char *buffer) {
    // MOVL %eax, 42 | MOV(%eax) IMM32(42)
    //   B8 2A 00 00 00
    // RET
    //   C3
    buffer[0] = 0xb8;
    buffer[1] = 0x2a;
    buffer[2] = 0x00;
    buffer[3] = 0x00;
    buffer[4] = 0x00;
    buffer[5] = 0xc3;
}

int main() {
    char code[6];
    write_machinecode(code);

    int (*fptr)() = (int (*)()) code;
    int x = (*fptr)();
    printf("code = %d\n", x);
}

The problem is that this doesn't work if you try it:

$ gcc rpn4.c -o rpn4
$ ./rpn4
segmentation fault

The reason for this failure is due to a policy known as W^X. W^X is meant to avoid attacks where somebody can exploit a buffer overflow by writing some machine code (or shellcode as the security community calls it) to the stack and then run that code. The reason this interferes here is that we're taking that exact strategy to run our own code, minus the part about it being an exploit.

The usual way to avoid this is to use something called mprotect, which lets us set permissions on a chunk of memory that we own. We can't do it with malloc because the permissions are on the level of a memory page, and malloc won't necessarily hand us back our own page. Instead we'll have to use mmap:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

void write_machinecode(char *buffer) {
    // MOVL %eax, 42 | MOV(%eax) IMM32(42)
    //   B8 2A 00 00 00
    // RET
    //   C3
    buffer[0] = 0xb8;
    buffer[1] = 0x2a;
    buffer[2] = 0x00;
    buffer[3] = 0x00;
    buffer[4] = 0x00;
    buffer[5] = 0xc3;
}

int main() {
    char *code = mmap(NULL, 6, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    write_machinecode(code);

    int (*fptr)() = (int (*)()) code;
    int x = (*fptr)();
    printf("code = %d\n", x);
}

Note that we map the memory with full permissions upfront. This is generally considered bad practice since that defeats the point of W^X (if someone can get a reference to the page, they can run arbitrary code) but we're going to ignore that concern for brevity.

Bringing It All Together

Now that we have all the pieces in place, we can bring them together into the final result: a programmable RPN calculator backed by a JIT compiler:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define LINE_SZ 4096
#define COMMAND_SZ 128
#define COMMAND_CNT 32

typedef enum { WORD_NONE, WORD_INT, WORD_COMMAND } wordtype_t;

typedef struct {
  bool used;
  char name[COMMAND_SZ];
  unsigned char *defn;
} userdef_t;

void print(int *stack_top, int stack_offset) {
  for (int i = 0; i < stack_offset; i++) {
    printf("[%d] = %d\n", i, stack_top[i]);
  }
}

int compile_command(userdef_t *defs, char *command, unsigned char *code_base,
                    int code_idx) {
  for (int i = 0; i < COMMAND_CNT; i++) {
    if (defs[i].used && !strcmp(command, defs[i].name)) {
        // REX.W MOVABS(%rax) <udf-address>
        code_base[code_idx++] = 0x48;
        code_base[code_idx++] = 0XB8;

        *((long *)(code_base + code_idx)) = (long)defs[i].defn;
        code_idx += 8;

        // CALL *%rax
        code_base[code_idx++] = 0XFF;
        code_base[code_idx++] = 0XD0;
        return code_idx;
    }
  }

  if (!strcmp(command, "+")) {
    // DEC ModRM(%esi, opcode: 1)
    code_base[code_idx++] = 0xFF;
    code_base[code_idx++] = 0xCE;
    // MOV ModRM(%eax) SIB(%rdi + 4*%rsi)
    code_base[code_idx++] = 0x8B;
    code_base[code_idx++] = 0x04;
    code_base[code_idx++] = 0xB7;
    // DEC ModRM(%esi, opcode: 1)
    code_base[code_idx++] = 0xFF;
    code_base[code_idx++] = 0xCE;
    // MOV ModRM(%edx) SIB(%rdi + 4*%rsi)
    code_base[code_idx++] = 0x8B;
    code_base[code_idx++] = 0x14;
    code_base[code_idx++] = 0xB7;
    // ADD ModRM(%edx, %eax)
    code_base[code_idx++] = 0x01;
    code_base[code_idx++] = 0xC2;
    // MOV ModRM(%edx) SIB(%rdi + 4*%rsi)
    code_base[code_idx++] = 0x89;
    code_base[code_idx++] = 0x14;
    code_base[code_idx++] = 0xB7;
    // INC ModRM(%esi, opcode: 0)
    code_base[code_idx++] = 0xFF;
    code_base[code_idx++] = 0xC6;
    return code_idx;
  } else if (!strcmp(command, "-")) {
    // DEC ModRM(%esi, opcode: 1)
    code_base[code_idx++] = 0xFF;
    code_base[code_idx++] = 0xCE;
    // MOV ModRM(%eax) SIB(%rdi + 4*%rsi)
    code_base[code_idx++] = 0x8B;
    code_base[code_idx++] = 0x04;
    code_base[code_idx++] = 0xB7;
    // DEC ModRM(%esi, opcode: 1)
    code_base[code_idx++] = 0xFF;
    code_base[code_idx++] = 0xCE;
    // MOV ModRM(%edx) SIB(%rdi + 4*%rsi)
    code_base[code_idx++] = 0x8B;
    code_base[code_idx++] = 0x14;
    code_base[code_idx++] = 0xB7;
    // SUB ModRM(%edx, %eax)
    code_base[code_idx++] = 0x29;
    code_base[code_idx++] = 0xC2;
    // MOV ModRM(%edx) SIB(%rdi + 4*%rsi)
    code_base[code_idx++] = 0x89;
    code_base[code_idx++] = 0x14;
    code_base[code_idx++] = 0xB7;
    // INC ModRM(%esi, opcode: 0)
    code_base[code_idx++] = 0xFF;
    code_base[code_idx++] = 0xC6;
    return code_idx;
  } else if (!strcmp(command, "*")) {
    // DEC ModRM(%esi, opcode: 1)
    code_base[code_idx++] = 0xFF;
    code_base[code_idx++] = 0xCE;
    // MOV ModRM(%eax) SIB(%rdi + 4*%rsi)
    code_base[code_idx++] = 0x8B;
    code_base[code_idx++] = 0x04;
    code_base[code_idx++] = 0xB7;
    // DEC ModRM(%esi, opcode: 1)
    code_base[code_idx++] = 0xFF;
    code_base[code_idx++] = 0xCE;
    // MOV ModRM(%edx) SIB(%rdi + 4*%rsi)
    code_base[code_idx++] = 0x8B;
    code_base[code_idx++] = 0x14;
    code_base[code_idx++] = 0xB7;
    // IMUL ModRM(%edx, %eax)
    code_base[code_idx++] = 0x0F;
    code_base[code_idx++] = 0xAF;
    code_base[code_idx++] = 0xD0;
    // MOV ModRM(%edx) SIB(%rdi + 4*%rsi)
    code_base[code_idx++] = 0x89;
    code_base[code_idx++] = 0x14;
    code_base[code_idx++] = 0xB7;
    // INC ModRM(%esi, opcode: 0)
    code_base[code_idx++] = 0xFF;
    code_base[code_idx++] = 0xC6;
    return code_idx;
  } else if (!strcmp(command, "/")) {
    // DEC ModRM(%esi, opcode: 1)
    code_base[code_idx++] = 0xFF;
    code_base[code_idx++] = 0xCE;
    // MOV ModRM(%ecx) SIB(%rdi + 4*%rsi)
    code_base[code_idx++] = 0x8B;
    code_base[code_idx++] = 0x0C;
    code_base[code_idx++] = 0xB7;
    // DEC ModRM(%esi, opcode: 1)
    code_base[code_idx++] = 0xFF;
    code_base[code_idx++] = 0xCE;
    // MOV ModRM(%eax) SIB(%rdi + 4*%rsi)
    code_base[code_idx++] = 0x8B;
    code_base[code_idx++] = 0x04;
    code_base[code_idx++] = 0xB7;
    // XOR ModRM(%edx, %edx)
    code_base[code_idx++] = 0x31;
    code_base[code_idx++] = 0xD2;
    // IDIV ModRM(%ecx, opcode: 7)
    code_base[code_idx++] = 0xF7;
    code_base[code_idx++] = 0xF9;
    // MOV ModRM(%eax) SIB(%rdi + 4*%rsi)
    code_base[code_idx++] = 0x89;
    code_base[code_idx++] = 0x04;
    code_base[code_idx++] = 0xB7;
    // INC ModRM(%esi, opcode: 0)
    code_base[code_idx++] = 0xFF;
    code_base[code_idx++] = 0xC6;
    return code_idx;
  } else if (!strcmp(command, "%")) {
    // DEC ModRM(%esi, opcode: 1)
    code_base[code_idx++] = 0xFF;
    code_base[code_idx++] = 0xCE;
    // MOV ModRM(%ecx) SIB(%rdi + 4*%rsi)
    code_base[code_idx++] = 0x8B;
    code_base[code_idx++] = 0x0C;
    code_base[code_idx++] = 0xB7;
    // DEC ModRM(%esi, opcode: 1)
    code_base[code_idx++] = 0xFF;
    code_base[code_idx++] = 0xCE;
    // MOV ModRM(%eax) SIB(%rdi + 4*%rsi)
    code_base[code_idx++] = 0x8B;
    code_base[code_idx++] = 0x04;
    code_base[code_idx++] = 0xB7;
    // XOR ModRM(%edx, %edx)
    code_base[code_idx++] = 0x31;
    code_base[code_idx++] = 0xD2;
    // IDIV ModRM(%ecx, opcode: 7)
    code_base[code_idx++] = 0xF7;
    code_base[code_idx++] = 0xF9;
    // MOV ModRM(%edx) SIB(%rdi + 4*%rsi)
    code_base[code_idx++] = 0x89;
    code_base[code_idx++] = 0x14;
    code_base[code_idx++] = 0xB7;
    // INC ModRM(%esi, opcode: 0)
    code_base[code_idx++] = 0xFF;
    code_base[code_idx++] = 0xC6;
    return code_idx;
  } else if (!strcmp(command, "swap")) {
    // MOV ModRM(%edx) SIB(%rdi + 4*%rsi) - 4
    code_base[code_idx++] = 0x8B;
    code_base[code_idx++] = 0x54;
    code_base[code_idx++] = 0xB7;
    code_base[code_idx++] = 0xFC;
    // MOV ModRM(%eax) SIB(%rdi + 4*%rsi) - 8
    code_base[code_idx++] = 0x8B;
    code_base[code_idx++] = 0x44;
    code_base[code_idx++] = 0xB7;
    code_base[code_idx++] = 0xF8;
    // MOV ModRM(%eax) SIB(%rdi + 4*%rsi) - 4
    code_base[code_idx++] = 0x89;
    code_base[code_idx++] = 0x44;
    code_base[code_idx++] = 0xB7;
    code_base[code_idx++] = 0xFC;
    // MOV ModRM(%edx) SIB(%rdi + 4*%rsi) - 8
    code_base[code_idx++] = 0x89;
    code_base[code_idx++] = 0x54;
    code_base[code_idx++] = 0xB7;
    code_base[code_idx++] = 0xF8;
    return code_idx;
  } else if (!strcmp(command, "dup")) {
    // DEC ModRM(%esi, opcode: 1)
    code_base[code_idx++] = 0xFF;
    code_base[code_idx++] = 0xCE;
    // MOV ModRM(%edx) SIB(%rdi + 4*%rsi)
    code_base[code_idx++] = 0x8B;
    code_base[code_idx++] = 0x14;
    code_base[code_idx++] = 0xB7;
    // INC ModRM(%esi, opcode: 0)
    code_base[code_idx++] = 0xFF;
    code_base[code_idx++] = 0xC6;
    // MOV ModRM(%edx) SIB(%rdi + 4*%rsi)
    code_base[code_idx++] = 0x89;
    code_base[code_idx++] = 0x14;
    code_base[code_idx++] = 0xB7;
    // INC ModRM(%esi, opcode: 0)
    code_base[code_idx++] = 0xFF;
    code_base[code_idx++] = 0xC6;
    return code_idx;
  } else if (!strcmp(command, "drop")) {
    // DEC ModRM(%esi, opcode: 1)
    code_base[code_idx++] = 0xFF;
    code_base[code_idx++] = 0xCE;
    return code_idx;
  } else if (!strcmp(command, "print")) {
    // PUSH(%rdi)
    code_base[code_idx++] = 0x57;
    // PUSH(%rsi)
    code_base[code_idx++] = 0x56;
    // CALL ModRM(%rbp, opcode: 2)
    code_base[code_idx++] = 0xFF;
    code_base[code_idx++] = 0xD5;
    // POP(%rsi)
    code_base[code_idx++] = 0x5E;
    // POP(%rdi)
    code_base[code_idx++] = 0x5F;
    return code_idx;
  } else {
    // REX.W ADD ModRM(%rsp, opcode: 0)
    code_base[code_idx++] = 0x48;
    code_base[code_idx++] = 0x83;
    code_base[code_idx++] = 0xC4;
    code_base[code_idx++] = 0x08;
    // RET
    code_base[code_idx++] = 0xC3;
    return code_idx;
  }
}

void compile_line(userdef_t *defs, unsigned char *code_base, char *line, int cmd_start) {
  bool line_done = false;
  bool word_done = false;
  wordtype_t word_type = WORD_NONE;

  char command[128];
  int command_idx = 0;
  int number = 0;
  int code_idx = 0;

  // REX.W SUB ModRM(%rsp, opcode: 5)
  code_base[code_idx++] = 0x48;
  code_base[code_idx++] = 0x83;
  code_base[code_idx++] = 0xEC;
  code_base[code_idx++] = 0x08;

  int i = cmd_start;
  while (!line_done) {
    char ch = line[i];

    if (!ch) {
      line_done = true;
      word_done = true;
    } else if (ch == ' ' || ch == '\t' || ch == '\r' || ch == '\n') {
      word_done = true;
    } else if (ch >= '0' && ch <= '9') {
      word_type = WORD_INT;
      number = (number * 10) + (ch - '0');
    } else {
      word_type = WORD_COMMAND;
      command[command_idx++] = ch;
    }

    if (word_done) {
      switch (word_type) {
      case WORD_INT:
        // MOV(%eax) IMM32(x)
        code_base[code_idx++] = 0xB8;
        *((int*) (code_base + code_idx)) = number;
        code_idx += 4;
        // MOV ModRM(%eax) SIB(%rdi + 4*%rsi)
        code_base[code_idx++] = 0x89;
        code_base[code_idx++] = 0x04;
        code_base[code_idx++] = 0xB7;
        // INC ModRM(%esi, opcode: 0)
        code_base[code_idx++] = 0xFF;
        code_base[code_idx++] = 0xC6;
        break;
      case WORD_COMMAND: {
        command[command_idx++] = 0;
        code_idx = compile_command(defs, command, code_base, code_idx);
        break;
      }
      case WORD_NONE:
        break;
      }

      number = 0;
      command_idx = 0;
      word_type = WORD_NONE;
      word_done = false;
    }

    i++;
  }

  // REX.W ADD ModRM(%rsp, opcode: 0)
  code_base[code_idx++] = 0x48;
  code_base[code_idx++] = 0x83;
  code_base[code_idx++] = 0xC4;
  code_base[code_idx++] = 0x08;
  // RET
  code_base[code_idx++] = 0xC3;
}

int main() {
  char line[LINE_SZ];
  char def[COMMAND_SZ];
  int stack[4096];
  int stack_offset = 0;

  unsigned char* glue_page = mmap(NULL, 11, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  int glue_offset = 0;
  // PUSH(%rbp)
  glue_page[glue_offset++] = 0x55;
  // REX.W MOV ModRM(%rbp, %rdx)
  glue_page[glue_offset++] = 0x48;
  glue_page[glue_offset++] = 0x89;
  glue_page[glue_offset++] = 0xD5;
  // CALL ModRM(%rcx, opcode: 2)
  glue_page[glue_offset++] = 0xFF;
  glue_page[glue_offset++] = 0xD1;
  // POP(%rbp)
  glue_page[glue_offset++] = 0x5D;
  // REX.W MOV ModRM(%rax, %rsi)
  glue_page[glue_offset++] = 0x48;
  glue_page[glue_offset++] = 0x89;
  glue_page[glue_offset++] = 0xF0;
  // RET
  glue_page[glue_offset++] = 0xC3;
  int (*glue_func)(int*, int, void(*)(int*, int), unsigned char*) = (void*) glue_page;

  unsigned char *interp_page = mmap(NULL, LINE_SZ, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

  userdef_t userdefs[COMMAND_CNT] = {0};
  char line_defn[LINE_SZ];

  while (fgets(line, LINE_SZ, stdin) != NULL) {
    if (line[0] == ':') {
      int def_target;
      for (def_target = 0; def_target < COMMAND_CNT; def_target++) {
        if (!userdefs[def_target].used)
          break;
      }

      userdefs[def_target].used = true;
      userdefs[def_target].defn = mmap(NULL, LINE_SZ, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      int start_cmd;
      int end_cmd;

      for (int i = 1; i < LINE_SZ; i++) {
        if (line[i] != ' ' && line[i] != '\n' && line[i] != '\r' &&
            line[i] != '\t') {
          start_cmd = i;
          break;
        }
      }

      for (int i = start_cmd; i < LINE_SZ; i++) {
        if (line[i] == ' ' || line[i] == '\n' || line[i] == '\r' ||
            line[i] == '\t') {
          end_cmd = i;
          break;
        } else {
          userdefs[def_target].name[i - start_cmd] = line[i];
        }
      }

      compile_line(userdefs, userdefs[def_target].defn, line, end_cmd);
    } else {
      compile_line(userdefs, interp_page, line, 0);
      stack_offset = (*glue_func)(stack, stack_offset, print, interp_page);
    }
  }

  return 0;
}

It doesn't add any extra capabilities, but knowing how it works internally it's cool just to watch it function:

$ gcc -g rpn5.c -o rpn5
$ ./rpn5
: square dup *
6 square
print
[0] = 36

It's even more interesting single-stepping through the generated assembly in GDB; it's entirely clear that we've left C land by the fact that GDB can't even decipher our calling convention:

$ gdb ./rpn5
(gdb) b 392
(gdb) r
: square dup *
6 square print

Breakpoint 1, main () at rpn5.c:392
392           stack_offset = (*glue_func)(stack, stack_offset, print, interp_page);
(gdb) layout asm
(gdb) si # Repeat this until you're past the callq instruction
(gdb) bt
#0  0x00007ffff7fcf000 in ?? ()
#1  0x0000555555556622 in main () at rpn5.c:392
(gdb) si # Continue until you're past the next callq instruction
(gdb) bt
0x00007ffff7fcf001 in ?? ()
0x00007ffff7fcf004 in ?? ()
0x00007ffff7fce000 in ?? ()
(gdb) q

Performance

Looking back, we've certainly ended up with a more complex program here than we started with. The justification for this increase in complexity has been performance, that at each level by trading code simplicity for more efficient repesentations we could run user programs more efficiently.

The question is, has that actually happened? Or have we been deluding ourselves?

First we need a good test case. I'm going to be taking another page out of the security playbook here and make a highly recursive test based upon the Billion Laughs Attack. The core idea of this test case is that by combining several layers of recursion we can easily run the same test millions of times without having to write all of those invocations:

$ cat stress.txt
: stress 1 1 + drop
: stressA stress stress stress stress stress stress stress stress stress stress
: stressB stressA stressA stressA stressA stressA stressA stressA stressA stressA stressA
: stressC stressB stressB stressB stressB stressB stressB stressB stressB stressB stressB
: stressD stressC stressC stressC stressC stressC stressC stressC stressC stressC stressC
: stressE stressD stressD stressD stressD stressD stressD stressD stressD stressD stressD
: stressF stressE stressE stressE stressE stressE stressE stressE stressE stressE stressE
: stressG stressF stressF stressF stressF stressF stressF stressF stressF stressF stressF
: stressH stressG stressG stressG stressG stressG stressG stressG stressG stressG stressG
stressH

First, let's run it against the naive interpreter. We'll recompile it with full optimizations to give it a fighting chance:

$ gcc -O3 rpn2.c -o rpn2
$ time rpn2 < stress.txt
./rpn2 < stress.txt  23.07s user 0.00s system 99% cpu 23.101 total

Ouch, this test pegs one of the cores of my i5-6500 for over 23 seconds! This leaves a lot of room for improvement with the bytecode interpreter:

$ gcc -O3 rpn3.c -o rpn3
$ time rpn3 < stress.txt
./rpn3 < stress.txt  1.28s user 0.00s system 99% cpu 1.282 total

This is already a substantial improvment from the naive interpreter. It turns out that parsing and reparsing all those command lines really has a cost!

Finally, the JIT compiler:

$ gcc -O3 rpn5.c -o rpn5
$ time rpn5 < stress.txt
./rpn5 < stress.txt  0.19s user 0.00s system 99% cpu 0.192 total

I had to double check this to make sure that it was actually running the code I had provided. A fraction of a second seems way too fast for doing this operation 100 million times. However, after a trip through GDB this time does indeed check out - it performs the sum and drops the result each time.

Just to give an overview of this speed difference, we can line up the numbers together to see how dramatic the result is:

rpn2: 23.101s / 1.000x
rpn3: 1.282s / 0.055x
rpn5: 0.192s / 0.008x