We're going to start with a basic RPN interpreter and add function definitions to it. That will serve as a good starting point for adding on JIT compilation later. There are only going to be a few primitives to start with:
- Integer constants
- Arithmetic operators:
+
,-
,*
,/
,%
- Stack manipulation:
swap
,dup
,drop
- Display:
print
These are simple enough to implement if we don't mind ignoring a lot of error handling:
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define LINE_SZ 4096
typedef enum {
WORD_NONE,
WORD_INT,
WORD_COMMAND
} wordtype_t;
typedef struct {
int vals[LINE_SZ];
int idx;
} stack_t;
void eval_command(stack_t *stack, char *command) {
if (!strcmp(command, "+")) {
int b = stack->vals[--stack->idx];
int a = stack->vals[--stack->idx];
stack->vals[stack->idx++] = a + b;
} else if (!strcmp(command, "-")) {
int b = stack->vals[--stack->idx];
int a = stack->vals[--stack->idx];
stack->vals[stack->idx++] = a - b;
} else if (!strcmp(command, "*")) {
int b = stack->vals[--stack->idx];
int a = stack->vals[--stack->idx];
stack->vals[stack->idx++] = a * b;
} else if (!strcmp(command, "/")) {
int b = stack->vals[--stack->idx];
int a = stack->vals[--stack->idx];
stack->vals[stack->idx++] = a / b;
} else if (!strcmp(command, "%")) {
int b = stack->vals[--stack->idx];
int a = stack->vals[--stack->idx];
stack->vals[stack->idx++] = a % b;
} else if (!strcmp(command, "swap")) {
int b = stack->vals[stack->idx - 1];
int a = stack->vals[stack->idx - 2];
stack->vals[stack->idx - 1] = a;
stack->vals[stack->idx - 2] = b;
} else if (!strcmp(command, "dup")) {
int a = stack->vals[stack->idx - 1];
stack->vals[stack->idx++] = a;
} else if (!strcmp(command, "drop")) {
stack->idx--;
} else if (!strcmp(command, "print")) {
for (int i = 0; i < stack->idx; i++) {
printf("[%d] = %d\n", i, stack->vals[i]);
}
}
}
int main() {
char line[LINE_SZ];
char command[128];
int command_idx = 0;
int number = 0;
stack_t stack;
stack.idx = 0;
while (fgets(line, LINE_SZ, stdin) != NULL) {
bool line_done = false;
bool word_done = false;
wordtype_t word_type = WORD_NONE;
int i = 0;
while (!line_done) {
char ch = line[i];
if (!ch) {
line_done = true;
word_done = true;
} else if (ch == ' ' || ch == '\t' || ch == '\r' || ch == '\n') {
word_done = true;
} else if (ch >= '0' && ch <= '9') {
word_type = WORD_INT;
number = (number * 10) + (ch - '0');
} else {
word_type = WORD_COMMAND;
command[command_idx++] = ch;
}
if (word_done) {
switch (word_type) {
case WORD_INT:
stack.vals[stack.idx++] = number;
break;
case WORD_COMMAND:
command[command_idx++] = 0;
eval_command(&stack, command);
break;
case WORD_NONE:
break;
}
number = 0;
command_idx = 0;
word_type = WORD_NONE;
word_done = false;
}
i++;
}
}
return 0;
}
After compiling this, we have a basic calculator that's ready to use. For example, we can compute the factorial of 5 by hand:
$ gcc rpn1.c -o rpn1
$ ./rpn1
1 2 3 4 5
print
[0] = 1
[1] = 2
[2] = 3
[3] = 4
[4] = 5
* * * *
print
[0] = 120
There are already a few things here which will have to change in order to allow users to define their own commands:
-
Right now all commands are executed as they are entered. In order to define commands, we'll have to support not only the immediate mode that we support now, but also a compile mode where we buffer commands within a function definition instead of running them.
-
Right now commands are only run within the fixed
eval_command
function, which statically dispatches on command names into corresponding parts of code. We'll have to have keep something like this since the primitives will still be there, but there will also have to be a place to store function definitions. -
Right now all commands are implemented implicitly using code defined within the program itself. The work of implementing these commands on a lower level is done by the C compiler; for example, when we implement
+
we're relying on the compiler to generate the right code and run it within the branch that tests that the command really is+
.User-defined commands can't be run this way because they won't be implemented in C (even if the primitives they depend upon will be). We'll have to add an interpreter to execute those user-defined commands, and how it works will depend upon how user-defined functions are stored.
To address each of these concerns, we're going to use a simple representation that lets us reuse a lot of the program as it exists now:
-
The immediate mode syntax will be nothing special, just the same commands that we usually run without any extra adornment. We're going to activate compile mode using a syntax similar to Forth, where
:
starts a definition and it goes to the end of the line. For example,: incr 1 +
defines a wordincr
which can be used to increment the top value on the stack. -
We can add a simple list that stores both command names and the code associated with them (what that code actually looks like will be covered in the next point). To find a definition, we just have to iterate this list and check each command name. There are more efficient ways to do this but we're shooting for brevity here.
-
We can reuse most of the existing command evaluator if we store user-defined commands as text. We'll still have to rework the dispatching mechanism as described in the last point, but the parsing will mostly stay the same.
With all these changes applied, we get the second iteration of the RPN calculator. Again, this ignores lots of error handling as well as not covering cases which might be useful, like redefining commands within the same session.
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define LINE_SZ 4096
#define COMMAND_SZ 128
#define COMMAND_CNT 32
typedef enum {
WORD_NONE,
WORD_INT,
WORD_COMMAND
} wordtype_t;
typedef struct {
int vals[LINE_SZ];
int idx;
} stack_t;
typedef struct {
bool used;
char name[COMMAND_SZ];
char defn[LINE_SZ];
} userdef_t;
void eval_primitive(stack_t *stack, char *command) {
if (!strcmp(command, "+")) {
int b = stack->vals[--stack->idx];
int a = stack->vals[--stack->idx];
stack->vals[stack->idx++] = a + b;
} else if (!strcmp(command, "-")) {
int b = stack->vals[--stack->idx];
int a = stack->vals[--stack->idx];
stack->vals[stack->idx++] = a - b;
} else if (!strcmp(command, "*")) {
int b = stack->vals[--stack->idx];
int a = stack->vals[--stack->idx];
stack->vals[stack->idx++] = a * b;
} else if (!strcmp(command, "/")) {
int b = stack->vals[--stack->idx];
int a = stack->vals[--stack->idx];
stack->vals[stack->idx++] = a / b;
} else if (!strcmp(command, "%")) {
int b = stack->vals[--stack->idx];
int a = stack->vals[--stack->idx];
stack->vals[stack->idx++] = a % b;
} else if (!strcmp(command, "swap")) {
int b = stack->vals[stack->idx - 1];
int a = stack->vals[stack->idx - 2];
stack->vals[stack->idx - 1] = a;
stack->vals[stack->idx - 2] = b;
} else if (!strcmp(command, "dup")) {
int a = stack->vals[stack->idx - 1];
stack->vals[stack->idx++] = a;
} else if (!strcmp(command, "drop")) {
stack->idx--;
} else if (!strcmp(command, "print")) {
for (int i = 0; i < stack->idx; i++) {
printf("[%d] = %d\n", i, stack->vals[i]);
}
}
}
void eval_line(stack_t *stack, userdef_t *defs, char *line) {
bool line_done = false;
bool word_done = false;
wordtype_t word_type = WORD_NONE;
char command[128];
int command_idx = 0;
int number = 0;
int i = 0;
while (!line_done) {
char ch = line[i];
if (!ch) {
line_done = true;
word_done = true;
} else if (ch == ' ' || ch == '\t' || ch == '\r' || ch == '\n') {
word_done = true;
} else if (ch >= '0' && ch <= '9') {
word_type = WORD_INT;
number = (number * 10) + (ch - '0');
} else {
word_type = WORD_COMMAND;
command[command_idx++] = ch;
}
if (word_done) {
switch (word_type) {
case WORD_INT:
stack->vals[stack->idx++] = number;
break;
case WORD_COMMAND: {
command[command_idx++] = 0;
bool found_cmd = false;
for (int i = 0; i < COMMAND_CNT; i++) {
if (defs[i].used && !strcmp(command, defs[i].name)) {
eval_line(stack, defs, defs[i].defn);
found_cmd = true;
}
}
if (!found_cmd) {
eval_primitive(stack, command);
}
break;
}
case WORD_NONE:
break;
}
number = 0;
command_idx = 0;
word_type = WORD_NONE;
word_done = false;
}
i++;
}
}
void compile_line(userdef_t *defs, char *line) {
int def_target;
for (def_target = 0; def_target < COMMAND_CNT; def_target++) {
if (!defs[def_target].used) break;
}
defs[def_target].used = true;
int start_cmd;
int end_cmd;
for (int i = 1; i < LINE_SZ; i++) {
if (line[i] != ' ' && line[i] != '\n' && line[i] != '\r' && line[i] != '\t') {
start_cmd = i;
break;
}
}
for (int i = start_cmd; i < LINE_SZ; i++) {
if (line[i] == ' ' || line[i] == '\n' || line[i] == '\r' || line[i] == '\t') {
end_cmd = i;
break;
} else {
defs[def_target].name[i - start_cmd] = line[i];
}
}
memcpy(defs[def_target].defn, line + end_cmd, 4096 - end_cmd);
}
int main() {
char line[LINE_SZ];
stack_t stack = {0};
userdef_t userdefs[COMMAND_CNT] = {0};
while (fgets(line, LINE_SZ, stdin) != NULL) {
if (line[0] == ':') {
compile_line(userdefs, line);
} else {
eval_line(&stack, userdefs, line);
}
}
return 0;
}
Once compiled, we can use this to define and run our own functions!
$ gcc rpn2.c -o rpn2
$ ./rpn2
: double dup +
5 double print
[0] = 10
Right now the compiled representation is fairly inefficient. Every time you invoke a function its code has to be loaded and parsed, which includes wasting time on insignificant characters like whitespace.
Instead of working with the raw command line, we can compile all functions into a series of basic operations ahead of time and store that. What kind of operations do we need?
- Pushing an integer constant
- Invoking a primitive
- Invoking a user-defined function
- Ending a function
We can represent each of these instructions using a simple code, consisting of one prefix byte and some trailing data:
bytes
Integer constant:
0 1 2 3 4 5
+--------|--------|--------|--------|--------+
| 0 | 32-bit integer constant |
+--------|--------|--------|--------|--------+
0 1 2 3 4 5
+--------|--------|--------+
| 1 | 16-bit int |
+--------|--------|--------+
0 1 2
+--------|--------+
| 2 | 8-bit |
+--------|--------+
Primitives:
0 1
+--------+
| 3 | +
+--------+
| 4 | -
+--------+
| 5 | *
+--------+
| 6 | /
+--------+
| 7 | %
+--------+
| 8 | swap
+--------+
| 9 | dup
+--------+
| 10 | drop
+--------+
| 11 | print
+--------+
User-defined functions:
0 1 2
+--------|--------+
| 12 | cmd |
+--------|--------+
Exit function:
0 1
+--------+
| 13 |
+--------+
This allows us to take something like the double function from before:
: double dup +
Which was executed by evaluating the 5-byte string dup +
every time, and
convert it to this compact 3-byte representation:
{11, 3, 13}
One thing to note is that we could have defined a simpler encoding for integers which stores all values as 32-bit constants. While this would work, this has the effect of significantly bloating certain definitions like this alternative version of double:
: double 2 *
In this case, the fixed-width 32-bit encoding would produce something like this which is larger than the textual version by 5 bytes:
{0, 0, 0, 0, 2, 6, 13}
Instead, we can do much better if we use the more compact integer encoding. This is half the size of the original encoding and only one byte larger than the textual version:
{2, 2, 6, 13}
Also, it's important to notice that by choosing one byte to use for referencing user-defined functions, we're enshrining a limit of 127 functions at a deeper layer than we have up until this point. Before it was easy to change that limit by adjusting a #define, but now that limit is baked into our function representation.
This is the next version of the RPN calculator utilizing a byte-code compiler:
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define LINE_SZ 4096
#define COMMAND_SZ 128
#define COMMAND_CNT 32
typedef enum { WORD_NONE, WORD_INT, WORD_COMMAND } wordtype_t;
typedef enum {
CODE_I32,
CODE_I16,
CODE_I8,
CODE_ADD,
CODE_SUB,
CODE_MUL,
CODE_DIV,
CODE_MOD,
CODE_SWAP,
CODE_DUP,
CODE_DROP,
CODE_PRINT,
CODE_UDF,
CODE_EXIT
} cmdtype_t;
typedef struct {
int vals[LINE_SZ];
int idx;
} stack_t;
typedef struct {
bool used;
char name[COMMAND_SZ];
char defn[LINE_SZ];
} userdef_t;
void exec_code(stack_t *stack, userdef_t *defs, char *codes) {
int code_idx = 0;
while (true) {
cmdtype_t command = codes[code_idx++];
switch (command) {
case CODE_I8: {
int value = codes[code_idx++];
stack->vals[stack->idx++] = value;
break;
}
case CODE_I16: {
int value = 0;
value |= (codes[code_idx++] << 8);
value |= codes[code_idx++];
stack->vals[stack->idx++] = value;
break;
}
case CODE_I32: {
int value = 0;
value |= (codes[code_idx++] << 24);
value |= (codes[code_idx++] << 16);
value |= (codes[code_idx++] << 8);
value |= codes[code_idx++];
stack->vals[stack->idx++] = value;
break;
}
case CODE_ADD: {
int b = stack->vals[--stack->idx];
int a = stack->vals[--stack->idx];
stack->vals[stack->idx++] = a + b;
break;
}
case CODE_SUB: {
int b = stack->vals[--stack->idx];
int a = stack->vals[--stack->idx];
stack->vals[stack->idx++] = a - b;
break;
}
case CODE_MUL: {
int b = stack->vals[--stack->idx];
int a = stack->vals[--stack->idx];
stack->vals[stack->idx++] = a * b;
break;
}
case CODE_DIV: {
int b = stack->vals[--stack->idx];
int a = stack->vals[--stack->idx];
stack->vals[stack->idx++] = a / b;
break;
}
case CODE_MOD: {
int b = stack->vals[--stack->idx];
int a = stack->vals[--stack->idx];
stack->vals[stack->idx++] = a % b;
break;
}
case CODE_SWAP: {
int b = stack->vals[stack->idx - 1];
int a = stack->vals[stack->idx - 2];
stack->vals[stack->idx - 1] = a;
stack->vals[stack->idx - 2] = b;
break;
}
case CODE_DUP: {
int a = stack->vals[stack->idx - 1];
stack->vals[stack->idx++] = a;
break;
}
case CODE_DROP: {
stack->idx--;
break;
}
case CODE_PRINT: {
for (int i = 0; i < stack->idx; i++) {
printf("[%d] = %d\n", i, stack->vals[i]);
}
break;
}
case CODE_UDF: {
int value = codes[code_idx++];
exec_code(stack, defs, defs[value].defn);
break;
}
case CODE_EXIT:
return;
}
}
}
int compile_command(userdef_t *defs, char *command) {
for (int i = 0; i < COMMAND_CNT; i++) {
if (defs[i].used && !strcmp(command, defs[i].name)) {
return i;
}
}
if (!strcmp(command, "+")) {
return -CODE_ADD;
} else if (!strcmp(command, "-")) {
return -CODE_SUB;
} else if (!strcmp(command, "*")) {
return -CODE_MUL;
} else if (!strcmp(command, "/")) {
return -CODE_DIV;
} else if (!strcmp(command, "%")) {
return -CODE_MOD;
} else if (!strcmp(command, "swap")) {
return -CODE_SWAP;
} else if (!strcmp(command, "dup")) {
return -CODE_DUP;
} else if (!strcmp(command, "drop")) {
return -CODE_DROP;
} else if (!strcmp(command, "print")) {
return -CODE_PRINT;
} else {
return -CODE_EXIT;
}
}
void compile_line(userdef_t *defs, char *codes, char *line, int cmd_start) {
bool line_done = false;
bool word_done = false;
wordtype_t word_type = WORD_NONE;
char command[128];
int command_idx = 0;
int number = 0;
int code_idx = 0;
int i = cmd_start;
while (!line_done) {
char ch = line[i];
if (!ch) {
line_done = true;
word_done = true;
} else if (ch == ' ' || ch == '\t' || ch == '\r' || ch == '\n') {
word_done = true;
} else if (ch >= '0' && ch <= '9') {
word_type = WORD_INT;
number = (number * 10) + (ch - '0');
} else {
word_type = WORD_COMMAND;
command[command_idx++] = ch;
}
if (word_done) {
switch (word_type) {
case WORD_INT:
if (abs(number) <= (1 << 7) - 1) {
codes[code_idx++] = CODE_I8;
codes[code_idx++] = number & 0xFF;
} else if (abs(number) <= (1 << 15) - 1) {
codes[code_idx++] = CODE_I16;
codes[code_idx++] = (number >> 8) & 0xFF;
codes[code_idx++] = number & 0xFF;
} else {
codes[code_idx++] = CODE_I32;
codes[code_idx++] = (number >> 24) & 0xFF;
codes[code_idx++] = (number >> 16) & 0xFF;
codes[code_idx++] = (number >> 8) & 0xFF;
codes[code_idx++] = number & 0xFF;
}
break;
case WORD_COMMAND: {
command[command_idx++] = 0;
int command_code = compile_command(defs, command);
if (command_code >= 0) {
codes[code_idx++] = CODE_UDF;
codes[code_idx++] = (char)command_code;
} else {
codes[code_idx++] = (char)-command_code;
}
break;
}
case WORD_NONE:
break;
}
number = 0;
command_idx = 0;
word_type = WORD_NONE;
word_done = false;
}
i++;
}
codes[code_idx++] = CODE_EXIT;
}
int main() {
char line[LINE_SZ];
char def[COMMAND_SZ];
stack_t stack = {0};
userdef_t userdefs[COMMAND_CNT] = {0};
char line_defn[LINE_SZ];
while (fgets(line, LINE_SZ, stdin) != NULL) {
if (line[0] == ':') {
int def_target;
for (def_target = 0; def_target < COMMAND_CNT; def_target++) {
if (!userdefs[def_target].used)
break;
}
userdefs[def_target].used = true;
int start_cmd;
int end_cmd;
for (int i = 1; i < LINE_SZ; i++) {
if (line[i] != ' ' && line[i] != '\n' && line[i] != '\r' &&
line[i] != '\t') {
start_cmd = i;
break;
}
}
for (int i = start_cmd; i < LINE_SZ; i++) {
if (line[i] == ' ' || line[i] == '\n' || line[i] == '\r' ||
line[i] == '\t') {
end_cmd = i;
break;
} else {
userdefs[def_target].name[i - start_cmd] = line[i];
}
}
compile_line(userdefs, userdefs[def_target].defn, line, end_cmd);
} else {
compile_line(userdefs, line_defn, line, 0);
exec_code(&stack, userdefs, line_defn);
}
}
return 0;
}