Hello! This is a tutorial for the Simple 8-bit Assembler Simulator in Javascript.
The CPU has a few pieces of memory stored inside of it called registers. In this case, each register holds a single byte (8 bits) of memory. So at any given time, each of these 8-bit registers holds a single value from 0 to 255, or $00 to $FF in hexadecimal.
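To make the two notations concrete, here is a small Python sketch (Python is used purely for illustration, and `as_hex` is a made-up helper name) that prints a few byte values in decimal and in the `$`-prefixed hexadecimal notation used in this tutorial.

```python
def as_hex(value):
    """Format a byte (0..255) in the $-prefixed hex notation used here."""
    if not 0 <= value <= 255:
        raise ValueError("an 8-bit register can only hold 0..255")
    return f"${value:02X}"

# A few sample byte values, first in decimal and then in hexadecimal.
for value in (0, 31, 232, 255):
    print(value, "=", as_hex(value))
```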
This CPU also has a few pieces of memory inside of it called flags, each of which holds a single bit of memory and is used to represent a boolean value. So at any given time, each of these 1-bit flags holds a value of either TRUE or FALSE.
These registers and flags together constitute the internal state of the CPU at any given time, and have various purposes that you'll see!
This CPU has four general purpose registers called A, B, C, and D. They are called general purpose because it is left up to the programmer to decide how to use them. It is often convenient and even necessary to have a temporary space to hold values being manipulated and that's where these registers come in handy.
This CPU also has two special purpose registers called the instruction pointer (IP) and the stack pointer (SP). They are called pointers because they hold a value that represents a location in RAM. That is to say that they point to a place in memory.
The instruction pointer points to the next program instruction in memory to be executed and the stack pointer points to the current top of the stack (more on both of these later).
This CPU has three flags called the zero flag (Z), the carry flag (C), and ... honestly I have no idea what the F flag here is for. XD So let's just ignore that one for now. These flags are used to store results from carrying out various operations. The programmer can read these results and use them to decide what to do next. For example, when subtracting two numbers, the zero flag is set to TRUE
if the result is 0. The exact uses of these flags depend on the instruction, so we'll come back to that later.
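As a rough illustration of how flags record the outcome of an operation, here is a Python sketch of an 8-bit subtraction. The zero flag behaves as described above. The carry flag behaviour (set when the subtraction has to borrow) is an assumption made for this sketch, and `subtract` is a made-up name.

```python
def subtract(x, y):
    """Return (result, zero_flag, carry_flag) for the 8-bit operation x - y."""
    raw = x - y
    result = raw & 0xFF   # keep only the low 8 bits, like a register would
    zero = result == 0    # TRUE when the result is 0
    carry = raw < 0       # assumption: set when the subtraction borrows
    return result, zero, carry

print(subtract(5, 5))   # equal operands give a result of 0
print(subtract(3, 5))   # the result wraps around below 0
```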
This simulated CPU has a block of 256 bytes of RAM attached to it. These bytes are arranged in order from top-left to bottom-right, and each has an assigned number, which is the memory address of that byte. For example, in the screenshot, the value at memory location $00
is $1F
, and the value at memory location $12
is $06
. The programmer is able to instruct the CPU to read and write byte values to and from each memory address at random (hence the name "random access memory").
Actually, this simulated CPU has no means of input, but it does have some form of output: a 24-cell ASCII character display!
This display simply shows the characters corresponding to the ASCII values present in a specific portion of memory, namely the memory locations from $E8 to $FF. This is an example of memory-mapped I/O, which means that some form of system input and/or output can be accessed by reading or writing specific locations in memory.
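Here is a minimal Python sketch of the idea: RAM is modelled as 256 bytes, and the "display" is nothing more than a view of the last 24 of them. The names are made up for illustration, and showing zero bytes as blanks is an assumption of this sketch rather than a statement about how the simulator renders them.

```python
RAM_SIZE = 256
DISPLAY_START = 0xE8   # first display cell
DISPLAY_END = 0xFF     # last display cell

ram = bytearray(RAM_SIZE)   # 256 bytes, all initially 0

def render_display(memory):
    """Turn the display region into text, showing 0 bytes as spaces."""
    cells = memory[DISPLAY_START:DISPLAY_END + 1]
    return "".join(chr(b) if b != 0 else " " for b in cells)

# Writing ASCII codes into the display region makes text "appear".
for offset, char in enumerate("Hi!"):
    ram[DISPLAY_START + offset] = ord(char)

print(repr(render_display(ram)))
```

Nothing special happens when writing to those addresses. The display just reads whatever bytes are stored there.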
A program is a sequence of instructions that tell the CPU what to do. Most instructions consist of an operation and one or more operands, depending on the operation.
An operation is like a function that is built into the CPU and provided for the programmer to use right away. Each operation has a short memorable name called its mnemonic. In written assembly language, operations are referred to by this mnemonic.
An operand is like an argument to an operation. An operand might refer to a CPU register, a location in memory, or a literal value.
Ultimately the CPU only understands the numbers. When you press the "Assemble" button, the assembly code is converted into a numerical representation of the program called machine code and then placed into memory.
In addition to its mnemonic, every operation has an associated numerical representation called its opcode. Every opcode corresponds to exactly one mnemonic, although in this simulator a single mnemonic such as MOV can have several opcodes, one for each combination of operand types. When an instruction is assembled into machine code, mnemonics are systematically replaced with their corresponding opcodes.
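To give a feel for what assembling means, here is a toy assembler in Python. It is a simplified sketch and not the simulator's actual assembler: the opcode numbers and register indices are placeholders, each mnemonic gets just one opcode, and only a register name or a plain decimal number is accepted as an operand.

```python
# Placeholder tables; the numbers are illustrative only.
OPCODES = {"HLT": 0x00, "INC": 0x12, "JMP": 0x1F}
REGISTERS = {"A": 0, "B": 1, "C": 2, "D": 3}

def assemble_line(line):
    """Translate one 'MNEMONIC [operand]' line into machine code bytes."""
    parts = line.split()
    code = [OPCODES[parts[0].upper()]]
    for operand in parts[1:]:
        operand = operand.upper()
        # A register name becomes its index; anything else is a number.
        code.append(REGISTERS[operand] if operand in REGISTERS else int(operand))
    return code

program = ["INC C", "JMP 0", "HLT"]
machine_code = [byte for line in program for byte in assemble_line(line)]
print(machine_code)
```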
An addressing mode is a way of referring to the actual value used as an operand.
Immediate addressing is when the value is given directly after specifying the operation. It is called immediate addressing, because the encoded value is placed immediately following the opcode in the machine code.
Direct addressing is when the value to be used is located somewhere in memory. Instead of directly specifying the value, the address of the value in memory is specified instead. It is called direct addressing in contrast to indirect addressing.
Indirect addressing is when the value to be used is located somewhere in memory, and the address of this value is also located somewhere in memory. Rather than specifying the value directly, or specifying the address of the value in memory, the address of the address is specified. It is called indirect addressing because... well... it's rather indirect, don't you think? XD It's used less often than the others, but it is still useful in situations when you don't know where a value is located ahead of time.
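The three modes can be compared side by side with a small Python sketch. The memory contents and function names below are made up for illustration. The same operand, $20, yields a different value under each mode.

```python
memory = bytearray(256)
memory[0x20] = 0x30   # the byte at $20 holds an address...
memory[0x30] = 42     # ...and the byte at $30 holds the value

def immediate(operand):
    """The operand is the value itself."""
    return operand

def direct(operand):
    """The operand is the address of the value."""
    return memory[operand]

def indirect(operand):
    """The operand is the address of the address of the value."""
    return memory[memory[operand]]

print(immediate(0x20), direct(0x20), indirect(0x20))
```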
A stack is a data structure that looks like a physical stack of items. It is a LIFO (last in first out) structure, meaning that the first thing you get out of the stack is the last thing you put in. Imagine a stack of blocks. The only block you can take off of the top of the stack is the last block that was placed on top of it.
A stack in memory is implemented as a sequence of values constituting the items in the stack plus a stack pointer. The stack pointer is a pointer that always points to what is considered the top of the stack.
Every time you want to push an item onto the stack (in other words, place a new item on top of the stack), you simply copy the value to the place in memory pointed to by the stack pointer, and then you increment (or decrement, depending on which direction the stack grows) the stack pointer to point to the next free space, the place right above the item you just added to the stack.
Every time you want to pop an item off of the stack (in other words, remove the last item that was placed on top of the stack), you simply move the stack pointer back down. If you want the value you just popped, just reference the value in memory now pointed to by the stack pointer.
This CPU has a single built-in stack! The stack pointer of this built-in stack is contained in the CPU's very own stack pointer register. In this case, the stack grows downward and when the CPU is reset, the stack pointer is initialized to $E7
, which means the bottom of the stack is located at memory location $E7
. But these details aren't important for actually using the stack.
The CPU provides the PUSH
and POP
operations which allow you to push and pop values onto and from the stack. It's up to the programmer to decide how to use the stack, but common uses are to preserve values temporarily, to pass arguments between functions, or to keep track of return addresses (more on these things later).
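Here is a minimal Python sketch of a downward-growing stack like the one described above, with the stack pointer starting at $E7. The function names are made up, and the sketch leaves out any checks for pushing or popping too many items.

```python
memory = bytearray(256)
sp = 0xE7   # reset value of the stack pointer

def push(value):
    """Store the value at the stack pointer, then move the pointer down."""
    global sp
    memory[sp] = value & 0xFF
    sp -= 1

def pop():
    """Move the pointer back up, then read the value it points to."""
    global sp
    sp += 1
    return memory[sp]

push(10)
push(20)
first = pop()    # the last value pushed comes out first
second = pop()
print(first, second, hex(sp))
```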
When the CPU is running, it functions by executing instructions in memory one by one. First it looks up the instruction in memory pointed to by the instruction pointer. This includes the opcode as well as any operand byte values that may follow depending on the operation. Then it carries out this instruction, possibly affecting the internal state of the CPU and/or the contents of memory. Finally, the instruction pointer is set to the location immediately following the instruction that was just executed, and the process continues.
When the CPU is reset, the instruction pointer is set to 0, which means the first instruction that gets executed is the instruction located at the very start of memory. When the CPU is run, it will continue running until encountering a halt (HLT
) instruction, at which point it will freeze. Alternatively you can execute a single instruction at a time (called stepping).
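The cycle described above can be sketched as a loop in Python. This is a toy model under stated assumptions: it knows only three instructions, the opcode numbers are placeholders, and the instruction lengths (one byte for HLT, two bytes for INC and JMP) are assumed for the sake of the example.

```python
HLT, INC, JMP = 0x00, 0x12, 0x1F   # placeholder opcodes

def run(memory, max_steps=100):
    """Execute instructions from address 0 until HLT; return the registers."""
    registers = [0, 0, 0, 0]   # A, B, C, D
    ip = 0                     # reset: start at the beginning of memory
    for _ in range(max_steps):
        opcode = memory[ip]    # look up the instruction at the instruction pointer
        if opcode == HLT:
            break              # the CPU freezes here
        elif opcode == INC:
            reg = memory[ip + 1]
            registers[reg] = (registers[reg] + 1) & 0xFF
            ip += 2            # continue right after this instruction
        elif opcode == JMP:
            ip = memory[ip + 1]   # continue at the target address instead
        else:
            raise ValueError(f"unknown opcode {opcode:#04x} at address {ip}")
    return registers

# JMP 3 skips over the first HLT; then A is incremented twice and C once.
program = bytes([JMP, 3, HLT, INC, 0, INC, 0, INC, 2, HLT])
print(run(program))
```

Stepping corresponds to running a single pass of the loop body and then pausing.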
An instruction is written in assembly language starting with a mnemonic followed by any operands separated by commas.
Literal values can be used as operands by simply writing a numeric value, or by writing an ASCII character enclosed in single quotes. This is immediate addressing.
Contents of CPU registers can be used as operands by simply writing the name of the CPU register. For example, A
refers to the A register.
Values present in memory can be used as operands by enclosing another value in [
and ]
. For example, the value at memory address $20
can be used as an operand by writing [$20]
. This is direct addressing.
You can also place a register name between [ and ] in order to use the value in memory located at the address contained in that register as an operand. For example, if the A register contains $5 then [A] refers to the value located in memory at address $5.
Rather than specifying memory addresses explicitly, it's much more common and convenient to use labels to mark memory locations in the program code. You can place a label in the assembly code to mark the assembled address of what immediately follows by writing a name followed by a colon. For example, start:
creates a label called "start". Then you can use the name start
anywhere else in the program in place of a memory address to refer to that place in the code.
You can include arbitrary data in your program using the DB
directive. This stands for data byte and isn't a mnemonic for a CPU operation, but rather instructs the assembler to include some binary data at that point rather than to assemble code there. This is useful for including predefined constant values in your program.
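Here is a rough Python sketch of how labels and DB might be resolved while assembling the first few lines of the Hello World example covered below. It assumes that JMP occupies two bytes (an opcode followed by the target address), and the opcode value is a placeholder. With those assumptions, hello ends up at address 2 and start at address 15.

```python
JMP = 0x1F   # placeholder opcode for JMP

labels = {}
code = []

# JMP start  (assumed to occupy two bytes: opcode + target address)
code += [JMP, None]              # the target is unknown until 'start' is seen

labels["hello"] = len(code)      # hello:
code += list(b"Hello World!")    # DB "Hello World!" emits one byte per character
code += [0]                      # DB 0 emits the terminator byte

labels["start"] = len(code)      # start:

code[1] = labels["start"]        # fill in the jump target now that it is known

print(labels)
```

A label is just a name for a number: the address at which the next byte will be placed.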
Lastly, comments in assembly language are generally marked with a semicolon.
Let's walk through the Hello World! example!
For reference, here is the example code from the simulator:
; Simple example
; Writes Hello World to the output
JMP start
hello: DB "Hello World!" ; Variable
DB 0 ; String terminator
start:
MOV C, hello ; Point to var
MOV D, 232 ; Point to output
CALL print
HLT ; Stop execution
print: ; print(C:*from, D:*to)
PUSH A
PUSH B
MOV B, 0
.loop:
MOV A, [C] ; Get char from var
MOV [D], A ; Write to output
INC C
INC D
CMP B, [C] ; Check if end
JNZ .loop ; jump if not
POP B
POP A
RET
As mentioned previously, when the CPU is reset, the instruction pointer is set to 0, which means the start of the program is the top of the file. So the first instruction in the program is this:
JMP start
The JMP
operation simply sets the instruction pointer to its operand. That is to say that it jumps to a given place in memory and continues executing the program from there. In this case, the operand is start
, which is a label marking this portion of code:
start:
MOV C, hello ; Point to var
MOV D, 232 ; Point to output
CALL print
HLT ; Stop execution
After executing the jump, the next instruction is then this:
MOV C, hello ; Point to var
The MOV
instruction copies its second operand into the place described by its first operand. That is to say that it moves data around. In this case, the second operand is hello
and the first operand is C
. This means the memory address marked by the hello
label is copied into the C register. hello
marks this in the code:
hello: DB "Hello World!" ; Variable
This is the raw string "Hello World!" that will be printed. So now, the C register contains the location of the string we are going to print. In other words, the C register points to the string.
The next instruction is another MOV
:
MOV D, 232 ; Point to output
This places the value 232
into the D register. 232
($E8
in hexadecimal) is the memory address of our memory-mapped character output display. So if we write the string into memory starting at this location, it will show up on our display.
To recap, at this point in execution, the C register points to the string we want to display, and the D register points to the display memory itself. All we have to do now is copy the string pointed to by C to the location pointed to by D.
The next instruction is a call to our print function:
CALL print
The CALL
operation is very similar to the JMP
operation in that it also jumps to another location in memory to continue execution from. The only difference is that before it jumps, it pushes the return address (the address of the instruction immediately following the call) onto the stack. The memory location jumped to is intended to be the beginning of a function (also called a subroutine in assembly parlance).
Generally the last instruction in a function is a RET
instruction. The RET
operation, you guessed it, returns from the function. It does this by setting the instruction pointer to the return address retrieved by popping it off the stack. This is possible because the CALL
operation pushes the return address before jumping to the function.
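Here is a small Python sketch of how the two operations cooperate. It reuses the downward-growing stack described earlier. The addresses are made up, and the two-byte length of the call instruction is an assumption of this sketch.

```python
memory = bytearray(256)
ip = 0
sp = 0xE7

def call(target, instruction_length=2):
    """Push the return address, then jump to the function."""
    global ip, sp
    memory[sp] = ip + instruction_length   # address right after the CALL
    sp -= 1
    ip = target

def ret():
    """Pop the return address back into the instruction pointer."""
    global ip, sp
    sp += 1
    ip = memory[sp]

ip = 0x13          # pretend a CALL instruction sits at $13
call(0x16)         # jump to a function at $16
inside = ip        # we are now executing the function
ret()              # and now we are back, right after the CALL
print(hex(inside), hex(ip))
```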
There are multiple ways of passing arguments to functions in assembly language. The way it's done here is by preloading CPU registers with values to pass to the function. In this case, our print function takes two arguments: a pointer to the string to print, and a pointer to the location in memory to print it to. It expects these arguments to be given via the C and D registers respectively. That's why we previously loaded those registers with those values!
So in this case, it is calling the function marked with the label print
, so the return address is pushed onto the stack, and then the instruction pointer is set to the start of the print function, and we continue from there.
The first couple of instructions in the print function look like this:
PUSH A
PUSH B
MOV B, 0
This is typical of a function prologue. A function prologue is some initial code in a function that prepares the stack and CPU registers for use.
In this case, we are pushing the A and B registers onto the stack in order to preserve their values. We do this because the body of this function is going to overwrite the contents of these registers. By preserving their values first, we can restore them before returning from the function. This way any part of the code that calls this function does not have to worry about the contents of any registers being modified after calling the function.
After the pushes, we are also initializing the B register to contain the value 0
. You'll see why soon.
After the function prologue, the first thing in the body of our function is a loop:
.loop:
MOV A, [C] ; Get char from var
MOV [D], A ; Write to output
INC C
INC D
CMP B, [C] ; Check if end
JNZ .loop ; jump if not
The .loop
label marks the beginning of the loop. (Don't worry about the dot at the beginning of the name. It's simply a conventional way of indicating local labels, labels whose relevance is confined to some localized portion of code. For example, you may have many loops in your program, none of which need to refer to each other, and you may not want to give each loop a unique name.)
I'll tell you in advance what this loop does: It copies each character from the source string to the destination.
Since C currently points to the source string, the first thing to do is grab the first character, the character at the address of the pointer:
MOV A, [C] ; Get char from var
This copies the value in memory pointed to by the contents of the C register into the A register.
The next step is to copy this retrieved character to the output display, which is currently pointed to by the D register:
MOV [D], A ; Write to output
This copies the character (in A) to the memory pointed to by D.
Now that we've successfully copied the first character to the output display, it's time to do the next one! To do this we simply need to increment the source and destination pointers by one so that we can retrieve the next character from the source string and write it to the next cell in the character display. The INC operation does just that: it increments the contents of a register by one:
INC C
INC D
At this point we could simply use a JMP
operation to jump back to the start of the loop at .loop
in order to copy the rest of the characters. The only problem with this is that this would create an infinite loop as there would be no way to know to stop when the end of the string is reached.
Instead we need a way to only jump back to the beginning of the loop if we aren't done copying the string. Furthermore, we need a way to know whether we are done copying the string or not. The way we know is by marking the end of the string with a null terminator which is a byte value of 0
placed directly after the string in memory. This is why the string value in the program is followed by a DB 0
:
hello: DB "Hello World!" ; Variable
DB 0 ; String terminator
Now all we have to do is check and see if the next character is a 0
before continuing the loop. If it is not a 0
, jump back to the beginning of the loop. If it is a 0
, then we are done, so we continue running the program past the loop.
In order to check whether the next character is a 0
, we can use the CMP
operation:
CMP B, [C] ; Check if end
The CMP
operation compares its two operands and sets the CPU flags accordingly. For our case, all we need to know is that if the two operands are numerically equal, the zero flag is set to TRUE
. (The logic behind it is that in order to compare the two numbers, the CMP
operation internally subtracts the second one from the first, and if the result is 0
, it sets the zero flag. Of course, a difference of 0
means that the two operands are equivalent.)
In this case, the zero flag gets set to TRUE
if the next character in the string is 0
. (Remember, we just incremented the contents of C to point to the next character in the string, and in the function prologue we initialized B to 0
. This is the reason we did that!)
Finally, we use the JNZ
operation to jump back to the beginning of the loop only if the zero flag is still FALSE
:
JNZ .loop ; jump if not
The JNZ
operation is just like the JMP
operation except that it only performs the jump if the zero flag is currently FALSE
. In other words, it (j)umps if (n)ot (z)ero. If the zero flag is currently TRUE
, then nothing happens at all, and the program continues with the next instruction.
Which, in this case, is the function epilogue! The function epilogue is the counterpart to the function prologue. Here the stack and CPU registers are prepared for returning from the function. In our case, we simply restore the previously preserved values of the A and B registers so that the part of the code that originally called this function will have no knowledge that the A and B registers were used at all. This is important in case the calling code happens to be using A and B for purposes of its own:
POP B
POP A
Finally, we return from the function with the RET
operation, which, as previously mentioned, pops the return address from the stack that was originally pushed by the corresponding CALL
instruction, then puts this popped address back into the instruction pointer register.
The very last instruction of our program is then the HLT
instruction:
HLT ; Stop execution
The HLT
instruction halts the CPU, marking the end of program execution.
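To tie the walkthrough together, here is a Python sketch that mirrors the print function line by line, with each assembly instruction shown in a comment. It is a model of the logic, not of the simulator itself: the address 2 for the hello label assumes the layout in which a two-byte JMP precedes the string, and Python's local variables stand in for the A and B registers, so the PUSH and POP steps have no equivalent here.

```python
DISPLAY_START = 232   # $E8, the first display cell

def print_string(memory, c, d):
    """Mirror of the print function: c points to the string, d to the output."""
    b = 0                          # MOV B, 0
    while True:                    # .loop:
        a = memory[c]              # MOV A, [C]
        memory[d] = a              # MOV [D], A
        c += 1                     # INC C
        d += 1                     # INC D
        zero = ((b - memory[c]) & 0xFF) == 0   # CMP B, [C]
        if zero:                   # JNZ .loop does not jump once Z is TRUE
            break

memory = bytearray(256)
hello = 2                          # assumed address of the hello label
text = b"Hello World!\x00"         # the string followed by its terminator
memory[hello:hello + len(text)] = text

print_string(memory, hello, DISPLAY_START)   # C = hello, D = 232
shown = memory[DISPLAY_START:DISPLAY_START + 12].decode("ascii")
print(shown)
```

Like the assembly version, the loop copies one character before it checks for the terminator, so it relies on the string containing at least one character.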