This project aims to build a compiler for the (yet to be defined) programming language Oak in less than 10.000 lines of Java code. With a focus on simplicity and elegance - not on the performance of the produced assembly (or the pretty error messages).
##Notes on targets, etc. The intended targets are x86 (with stdclib support) and MIPS (MARS and SPIM dialect) assembler. For simplicity reasons, no values are kept in the registers except the ones that are required for each operation.
##Development notes The development is documentation driven (every feature of Oak should first by defined in this document) and semi test driven (every non trivial method of the Code should be tested via a JUnit test).
- The error messages are not pretty and the parser can't recover from syntax errors (like an ANTLR generated one), but that's intended, as it keeps the code base and the level complexity small.
###Compiler backend notes
- It will be splitted into two seperate backends: one for x86 and one for MIPS and other targets.
- The x86 target:
- compiles to C code (or C++ code, if C turns out to be too nasty)
- the built in functions are declared as C functions
- it allows to compile Oak, to lots of other "real world" targets like x64
- is the simplest options and produces probably performant x86 assembly
##License This (and all other texts of this project) are CC-BY licensed. All code is GNU GPL v3 licensed.
##Roadmap Things to be done in chronological order.
- lexer (done)
- parser (done)
- semantic analyser
- compiler frontend
- generic compiler backend
- compiler backends for the intended targets
- x86 (probably compiles first to C)
- MIPS
- (optional) additional compiler backends for targets like
- MIMA
- (optional) simple AST based interpreter
- (optional) higher level language that compiles to Oak
###Status ####Current work
- Finish testing the parser.
- Start the semantic analyser
The development of this project consumes currently more time, than I can effort. It's therefore almost halting for the next two month. But I'm planning to finish the part "generic compiler backend" in mid august.
##Oak Oak is a simple language, that can be parsed using only Java's standard library, without the need of a heavy parser generator (like ANTLR) or a PEG parser (like parboiled).
Oak is
- procedural
- static typed
- whitespace insensitive
###Supported Types
| name | size (in bytes) | notable properties | used for | declaration syntax (as a regular expression) |
|---|---|---|---|---|
| bool | 1 | boolean arithmetic | true¦false |
|
| byte | 1 | signed | strings | [+-]?[0-9]+b |
| int | 4 | signed | calculations | [+-]?[0-9]+ |
| float | 4 | IEE-754 floating point | calculations | [+-]?[0-9]+\.[0-9]+(E[+-]?[0-9]+)? |
| pointer | 8 | range checks | ||
| string | 8 | pointer to bytes | "([^\\"] ¦ \\" ¦ \\n ¦ \\r ¦ \\t ¦ \\\\)*" |
####Pointer In general, a pointer references a memory area as bytes. A pointer consists of two 4 byte pointers, the first pointing at the first byte and the last pointing at last byte of the referenced memory area. The 4 byte pointers are not accesable to the user. Invalid pointers consist of two 4 byte pointers, the last 4 byte pointer pointing before the first 4 byte pointer.
###File structure
Statements are separated by new lines (\n).
A file consist of several function and global variable declarations.
The main function is the entry point into the programm. It's required for code that should be run directly. It's not required for library code.
###Variable declaration and assignment
A variable is declared in the following way:
TYPE NAME = VALUE
TYPE is the type of the variable (see Supported Types).
NAME =~ /[A-Za-z][A-Za-z_0-9]*[!]?/ (NAME has the given format, stated as a regular expression.)
Type names (like int and void) and boolean literals (true and false) can't be used as a NAME.
VALUE is the initial value of the variable (see Supported Types) or a name of another variable (declared in this or an outer scope before).
The VALUE has to have the type TYPE.
A variable can only declared once in a scope and variable declarations of an inner scope hide the declarations of an outer scope.
After a variable is declared, it can be assigned with the following syntax:
NAME = VALUE
NAME is a previously declared variable (in this or an outer scope).
VALUE is the new value of the variable (see Supported Types) or a name of another variable (declared in this or an outer scope before) and has the same type as NAME.
In general, a variable can only be accessed after it has been declared in the same scope or parent scopes.
Be aware that a memory area referenced by a string or a pointer should be freed after use.
####Global variables Global variables are declared in the global scope – they can be accessed from every other scope. Variable assignments and declarations are evaluated from top to down.
###Functions A function can be declared with the following.
_function RETURN_TYPE NAME : [type of argument 1] [name of argument 1], [...], [type of argument N] [name of argument N]
[statement_1]
[...]
[statement m]
[return statement]
_end
It declares the function NAME with a set of arguments, which returns a value of the type RETURN_TYPE. Functions must be declared with a return type but can have zero arguments. The arguments are mapped to variables with the appropriate (given) name inside the function base scope.
RETURN_TYPE can also be void, meaning that the function doesn't return anything.
The last statement has to be a return statement (omittable if the function has the return type void).
A function can also be declared without any arguments:
_function RETURN_TYPE NAME
[see above]
_end
It's allowed to name two functions the same name if they have different sets of argument types, as well as to name a variable and a function the same. But it's not allowed to use a type name (e.g. int or void) or a bool literal (true or false) as a name.
###Statements
Statements are separated by newlines and are only allowed in function bodies.
####Function calls
NAME [first argument] [...] [last argument]
Or without any arguments:
NAME
This calls a function NAME. The return value is stored in the function local variable ret_TYPE with TYPE being the return type of the function NAME (e.g. the return value of a function returning a bool is stored in the variable ret_bool). But this is not the case if the return type is void.
The default values of the ret_TYPE variables are the following:
| variable | default value |
|---|---|
| ret_bool | false |
| ret_byte | 0 |
| ret_int | 0 |
| ret_float | +0 |
| ret_string | "" |
| ret_pointer | pointer to an empty memory area |
####Function return statement This statement returns from its enclosing function to the calling function.
_return VALUE
VALUE is either a valid variable name or a value representation (see Supported Types) and must have the same type as the return type of its enclosing function.
If the return type is void, VALUE has to be omitted.
###Control structures (statements)
It's assumed in the following, that BOOL is a bool variable or a bool value.
####Conditionals
_if BOOL
[block of statements executed if BOOL is true]
_else
[block of statements executed otherwise]
_end
Or
_if BOOL
[block of statements executed if BOOL is true]
_end
####While loop
_while BOOL
[block of statements]
_end
The BOOL is checked at the start of each iteration of the loop, the [block of statements] is then executed if BOOL is true.
###Built in functions The following are the builtin functions, with their function headers (argument names are omitted, as well as the collon between name and argument list) and a short description.
####Type conversions
boolto_boolbyteConverts a byte into a bool,0turns intofalse, every other number intotrue.boolto_boolintConverts an int into a bool,0turns intofalse, every other number intotrue.boolto_boolfloatConverts an float into a bool,+0,-0andNaNturn intofalse, every other number intotrue.byteto_byteboolConvert a bool into an int,trueturns into1andfalseto0.byteto_byteintConvert a byte into an int. If theintis to large for abytethen it throws an exception.intto_intboolConvert a bool into an int,trueturns into1andfalseto0.intto_intbyteConvert a byte into an int.intto_intfloatConvert a float into an int, the actual conversion details are target specific and therefore not specified here.floatto_floatintConvert a int into a float.introundint floatRounds the float to the nearest integer and returns it,-InfinitybecomesINT_MINand+InfinitybecomesINT_MAX. It throws an exception if thefloatisNaN.intfloorfloatReturns the biggest integer smaller than the float,-Infinityturns intoINT_MINand+InfinityintoINT_MAX. It throws an exception if thefloatisNaN.pointerto_pointerstringConverts a string into an equivalent pointer (that references the bytes of the string).stringto_stringpointerConverts a pointer into an equivalent string (treating the pointer as a pointer referencing bytes).stringto_stringbyteConverts a byte into its string representation, thereby allocating a new string on the heap.stringto_stringintConverts an int into its string representation, thereby allocating a new string on the heap.stringto_stringfloatConverts a float into its string representation, thereby allocating a new string on the heap.
####Arithmetic operations
-
byteaddbyte byteCalculates the sum of two bytes (overflows can occur). -
intaddint intCalculates the sum of two ints (overflows can occur). -
floataddfloat floatCalculates the sum of two floats. -
bytesubbyte byteSubstracts the second byte from the first (overflows can occur). -
intsubint intSubstracts the second int from the first (overflows can occur). -
floatsubfloat floatSubstracts the second float from the first (over- and underflows can occur). -
bytemulbyte byteMultiplies the two bytes (overflows can occur). -
intmulint intMultiplies the two ints (overflows can occur). -
floatmulfloat floatMultiplies the two floats (over- and underflows can occur). -
bytedivbyte byteDivides the first byte by the second (over- and underflows can occur). Throws an exception if the divisor is zero. -
intdivint intDivides the first int by the second (over- and underflows can occur). Throws an exception if the divisor is zero. -
floatdivfloat floatDivides the first float by the second. Throws an exception if the divisor is zero. -
bytemodbyte byteCalculatea mod b. Throws an exception if the secondbyteis zero. (Uses the mathematical correct version of this operation.) -
intmodint intCalculatea mod b. Throws an exception if the secondintis zero. (Uses the mathematical correct version of this operation.)
####Boolean operations
boolnotboolNegates the given bool.boolandbool bool. Returnstrueif both given bools aretrue, otherwise it returnsfalse.boolorbool bool. Returnsfalseif both given bools arefalse, otherwise it returnstrue.boolxorbool bool. Returnstrueif both given bools have different values, otherwise it returnstrue.
####Compare operations
-
boolequalbool boolReturnstrueif both bools have the same value, otherwise it returnsfalse. -
boolequalbyte byteReturnstrueif both bytes have the same value, otherwise it returnsfalse. -
boolequalint intReturnstrueif both ints have the same value, otherwise it returnsfalse. -
boolequalfloat floatReturnstrueif both floats have exactly the same value, otherwise it returnsfalse. It also returnsfalseif one of thefloats isNaN.+0and-0are the same. -
boolequalpointer pointerReturnstrueif both pointers reference the same memory area, otherwise it returnsfalse. -
boolequalstring stringReturnstrueif both strings consist of the same sequence of bytes, otherwise it returnsfalse. -
boollessbyte byteReturnstrueif the first byte is smaller than the second, otherwise it returnsfalse. -
boollessint intReturnstrueif the first int is smaller than the second, otherwise it returnsfalse. -
boollessfloat floatReturnstrueif the first float is smaller than the second, otherwise it returnsfalse. It also returns false if one of thefloats isNaNand if one is+0and the other-0.
####Variable references
Variable references are pointers pointing to a single variable.
Use the following function call statement to get a variable reference (stored in ret_pointer) for a variable VAR:
get_ref VAR
The function get_ref is defined for all data types.
get_ref VAR is equivalent to &VAR in C or C++.
####Pointers and heap allocation
-
pointermallocintAllocate a memory area of the given size on the heap and return a referencing pointer. -
voidfreepointerThis function takes a reference to a pointer, frees the memory designated to this pointer and invalidates it. -
pointerclone_mempointerCreate a copy of the memory area referenced by the pointer and return the pointer to the copy. -
voidcopy_mempointer pointerCopy the memory referenced by the first pointer to the one referenced by the second pointer. The last memory area must be at least as big as the first. -
pointerappendpointer pointerCreate a new pointer referencing a memory area with the sum of the sizes of both given pointers as its size and copy the first pointers area to the beginning and copy the second pointers area after it. -
intmem_sizepointerReturns the size of the memory area referenced by the pointer or-1if the pointer is invalid. -
pointeraddpointer intReturns a pointer with the end 4 byte pointer being equivalent to the one of the given pointer and the begin 4 byte pointer incremented byint. There will be an exception if the resulting pointer will be outside the memory area of the first or if the int is negative.
Accessing the area directly referenced by the pointer as a data type:
The area referenced by the pointer has to be at least [size of data type] bytes wide, otherwise there will be an exception.
The first [size of data type] bytes are returned as the data type (in get_TYPE functions), or set to the given value (in set_TYPE functions).
boolget_boolpointerboolset_boolpointerbyteget_bytepointerbyteset_bytepointerintget_intpointerintset_intpointerfloatget_floatpointerfloatset_floatpointerpointerget_pointerpointerpointerset_pointerpointerstringget_stringpointerstringset_stringpointer
####Strings
stringappendstring stringAppend the second string to the first and return a pointer to the new string.
Convert them into pointers, work with them and then convert them back.
####IO
-
voidprintstringPrint the string on the standard output. -
voidprint_errstringPrint the string on the error output. Note: If the error out isn't provided by the target, it uses the standard out. -
byteread_byte Reads a character from standard input. -
stringread_string Reads a string from standard input.
####Misc
intexitintExits the program. The integer is an error code if its not zero.voidassertboolExits and prints some context information (e.g. line and function) if bool isfalse.voidassertbool stringExits and prints the string (with some context information) if bool isfalse.
###Comments
Comments are lines preceeded by a # character.
Function comments are comment blocks (several lines preceeded by a # character) in front of functions, they can be used to clarify the functionality and usage of the function.