Note: this was originally several Reddit posts, chained and linked. But now that Reddit is dying I've finally moved them out. Sorry about the mess.
URL: https://www.reddit.com/r/ProgrammingLanguages/comments/up206c/stack_machines_for_compilers/i8ikupw/ Summary: stack-based vs register-based in general.
There are a wide variety of machines that can be described as "stack-based" or "register-based", but not all of them are practical. And there are a lot of other decisions that affect that practicality (do variables have names or only addresses/indices? fixed-width or variable-width instructions? are you interpreting the bytecode (and if so, are you using machine stack frames?) or turning it into machine code? how many registers are there, and how many are special? how do you represent multiple types of variable? how many scopes are there (various kinds of global, local, member, ...)? how much effort/complexity can you afford to put into your machine? etc.)
- a pure stack VM can only access the top element at any time (some instructions pop multiple elements). No practical language uses this; it is only suitable for use in a calculator, and has difficulty even dealing with variables efficiently. (likely they will be done by name, which is inefficient - though I suppose globals could have indices if you're careful)
- a stack VM plus "load/store at offset from top of stack". In languages that use this, functions are likely nothing special (i.e. there are no stack frames). Disassembling the stack code by hand can be confusing, since the ever-changing stack size means the offset to a given variable (= stack slot) changes (but programs can deal with it just as easily as the next type). It is possible for the stack to get "misaligned" if incorrect code is generated.
- a stack VM plus "load/store at offset from bottom of stack" or equivalently "load/store at index within local variables" (which they pretend are not on the stack, but in practice they are). This kind of language has stack frames, making life much saner (both for programmers, and in case of incorrect codegen), though it could be argued that you're paying for an extra hardware register (but if you're writing an interpreter you probably shouldn't care about this, and if you're generating machine code you can do better if you really have to).
- most well-known "stack-based" language VMs use some form of this.
- a register-based VM with an infinite number of "registers" (= local variables or temporaries, = stack slots). This turns out to be very similar to the previous, except expressions generate anonymous local variables rather than having push/pop instructions everywhere - the entire stack frame gets created at function start.
- most well-known "register-based" language VMs use some form of this. Within this category, however, there is a very wide variety of designs.
- a register-based VM where there are only a finite number of registers, with explicit stack spills, like real machine code would do. This is a lot more complexity, and since you have to hard-code the VM ISA's number of registers, it is impossible to generate an architecture-independent interpreter for, since real machines have widely varying numbers of registers (sometimes even the same hardware if you use it differently). I suppose you could say that compilers generating machine code use this as a temporary representation, after all architecture-independent optimizations have been done.
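To make the stack-vs-register distinction concrete, here is a toy sketch (mine, not from the original thread - the opcode names and encodings are invented) of the same expression `(a + b) * c` run both ways. Note how the register form allocates its temporaries as extra slots in a pre-sized frame instead of pushing and popping:

```python
def run_stack(code, consts):
    # pure stack machine: operands live on an implicit stack
    stack = []
    for op, *args in code:
        if op == "PUSH":
            stack.append(consts[args[0]])
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack.pop()

def run_regs(code, slots):
    # infinite-register machine: the whole "frame" (locals plus anonymous
    # temporaries) exists up front; every instruction names its slots
    for op, dst, a, b in code:
        if op == "ADD":
            slots[dst] = slots[a] + slots[b]
        elif op == "MUL":
            slots[dst] = slots[a] * slots[b]
    return slots[dst]

stack_code = [("PUSH", 0), ("PUSH", 1), ("ADD",), ("PUSH", 2), ("MUL",)]
reg_code = [("ADD", 3, 0, 1), ("MUL", 4, 3, 2)]  # slots 3 and 4 are temporaries

print(run_stack(stack_code, [2, 3, 4]))           # (2 + 3) * 4 = 20
print(run_regs(reg_code, [2, 3, 4, None, None]))  # 20
```

The two programs compute the same thing; the difference is purely where the intermediate values live.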
I'll be back to write more later, but at least the basic post is complete.
URL: https://www.reddit.com/r/ProgrammingLanguages/comments/up206c/stack_machines_for_compilers/i8jtxpz/ Summary: general instruction encoding, N-argument, ancillary tables
1/3 Okay, writing more "later" now (ping /u/bare_et_spoergsmaal ; see also replies to this)
You quoted:

> large number instructions doing register-to-register moves to prepare for operations
This only makes sense for finite-register machines (and specifically, only for those that use two-address instructions), which are mostly used for real hardware rather than VMs (though I suppose they could be used if an infinite-register VM also supports a finite number of "special" registers).
Now, let's talk about instruction encodings. In some form or another, an instruction consists of an opcode (e.g. "add") and some number of arguments (usually 0-3). Arguments might specify something to "read", "write", or "modify", or might just specify a special flag. We might also speak of instructions having implicit arguments (fixed registers, hidden state, stack slots, etc.), but when counting, those often aren't considered arguments.
It is usually a good idea to make it easy to figure out how many arguments (and of what type) an instruction has (thus, it is not solely for neatness that arithmetic instructions are usually grouped together, differing only in a few bits). For simplicity's sake, let us mostly assume that the opcode is one byte and each argument is one byte, though there may be reasons to use other encodings in certain places, and opcodes will be carefully-grouped.
In a stack-based machine, most instructions do not take any immediate arguments; instead, they affect the stack (both the stack pointer and the value currently at the top of the stack). Sometimes the only instruction that takes an argument might be "push immediate integer constant"; I have even seen VMs that do not allow any arguments at all (instead there are several instructions that push specific small constants; larger constants have to be built using arithmetic). Practically, however, you'll want to allow arguments for the "load/store at offset" instructions, and possibly more. Notably, in most stack-based VMs, arithmetic instructions never take any arguments (though some stack-based VMs might have separate "add immediate"-style instructions to simplify "push immediate; add" - which I mostly discuss below in the context of register VMs).
In a register-based machine, most instructions do take arguments. Particularly, arithmetic instructions usually take 2 or 3 arguments ("modify, read" or "write, read, read"). However, encoding that directly would make most instructions take several bytes, so often some arguments are implicit. For a finite-register machine, you might be able to encode each register in a few bits, but practical VMs must have infinite registers, so we need a whole byte per argument (to get from "256 registers" to "infinite registers", either have an "extra data" instruction before it (which will require normal immediate reading to use `|=` rather than `=`), or else set a bit in the opcode to read some bytes following, or else use indirection, or else switch out the data tables (see below). This should be done in opcode-agnostic code. Note that your choices here may affect how easy it is to detect jumping into the middle of an instruction).
My preferred VM design uses a single special register called the "accumulator"; this is an implicit "modify" argument to all arithmetic instructions, so that no opcode ever needs more than 1 explicit argument (and instructions that take 0 arguments are pretty rare). Thus, the bytecode is completely fixed-width (16-bit), except if the "extra data" stuff comes into play (depending on design: if extra data is needed before an instruction, all jumps must verify that the previous instruction was not "extra-data"; if extra data is needed after an instruction, verify that it decodes as a NOP/TRAP. One possibility is "if the high bit is set, it is a NOP/TRAP". TRAP is probably a safer default - and besides, unless somebody is patching bytecode, is NOP even useful?). (note that the design of UTF-8 is interesting here, though I don't recommend following it too closely, since it assumes most data is ASCII but our instruction frequencies are not going to be that nice). Since there's only one machine-like register, we can declare it clobbered by every control flow instruction, which also keeps life simple/sane (it might have a meaningful value after return, but for gotos we should clobber it explicitly to prevent idiots from doing something weird).
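As a concrete sketch of this accumulator design (entirely illustrative - the opcode numbers, names, and locals layout here are invented, not prescribed above), every instruction is exactly two bytes: one opcode byte and one argument byte, with the accumulator as the implicit "modify" operand:

```python
# invented opcode numbering for illustration
OP_LOAD, OP_STORE, OP_ADD, OP_RETURN = 0, 1, 2, 3

def run(code, locals_):
    acc, pc = 0, 0
    while True:
        op, arg = code[pc], code[pc + 1]
        pc += 2                      # fixed-width: no per-opcode decode logic
        if op == OP_LOAD:
            acc = locals_[arg]       # arg indexes the local variables
        elif op == OP_STORE:
            locals_[arg] = acc
        elif op == OP_ADD:
            acc += locals_[arg]      # accumulator is the implicit operand
        elif op == OP_RETURN:
            return acc               # the result travels in the accumulator

# locals: [x=10, y=32, result]; computes result = x + y
bytecode = bytes([OP_LOAD, 0, OP_ADD, 1, OP_STORE, 2, OP_RETURN, 0])
print(run(bytecode, [10, 32, 0]))   # 42
```

Because the format is fixed-width, `pc += 2` is the entire decoder; the "extra data" mechanisms discussed above are what you bolt on when an argument doesn't fit in one byte.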
Additionally, I strongly recommend having external data tables that the argument is usually looked up in. This severely reduces the need for "extra data" arguments, since e.g. while it's easy to have a function that's more than 256 instructions long, it's much rarer to have one with more than 256 branches. A list of tables you probably want:
- a table of integer constants
- a table of string constants
- (possibly other tables of constants if you support other types. But those two deserve dedicated tables regardless)
- a table of local variable names (solely for debugging)
- a table of global variables that this function accesses
- for simplicity, these are often names since each function usually doesn't know how many total global variables exist, but you could intern them into indices as you load/generate the code
- a table of other (user-defined) functions that this function calls
- side note: I recommend implementing this without touching the machine stack
- a table of builtin/intrinsic functions that this function uses (or, if there are few enough, this table might be global. That said, it's probably worth merging identical (or small) tables regardless???)
- a table of jump targets within this function
- possibly store/generate a name for debugging reasons
- if you want to eliminate "extra data" opcodes entirely, this table might also include data for "switch out the data tables", though this may make it harder to debug - alternatively, you might have a "jump to function" that acts similarly (I admit this idea is a recent development on my part ... wait, how do local variables work if you do this?)
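Here is one possible shape for these per-function tables, as a sketch (the field names and the flat-record layout are my own, not prescribed by the post):

```python
from dataclasses import dataclass

@dataclass
class Function:
    bytecode: bytes
    int_consts: list    # "_IMM" arguments index into this
    str_consts: list
    local_names: list   # solely for debugging
    globals_used: list  # names, or interned indices once loaded
    callees: list       # other user-defined functions this one calls
    intrinsics: list    # builtin/intrinsic functions this one uses
    jump_targets: list  # branch arguments index into this

# the payoff: a 1-byte argument can still name an arbitrarily large constant,
# because the byte is only an index into the table
f = Function(b"", [1 << 40], [], [], [], [], [], [])
print(f.int_consts[0])  # 1099511627776
```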
It should be noted that this encoding is optimized for execution, not manipulation. If you do optimizations, they should probably be done earlier in 3-address form (though note that it is quite possible to go from this form back to 3-address form if needed), and likely in SSA as well (but be aware that going back from SSA to mutable-register form isn't entirely trivial if you don't want to waste registers - which we don't, since we have the soft limit of 256 locals)
URL: https://www.reddit.com/r/ProgrammingLanguages/comments/up206c/stack_machines_for_compilers/i8ju0cg/ Summary: Opcode details - binary and unary, register and immediate arguments, commutative, mirrored, and reversed.
2/3 Finally, let's talk about what opcodes to provide:
- Binary operators:
- Boolean (6 mandatory opcodes, 20 possible opcodes): `EQ`, `NE`, `LT`, `LE`, `GT`, `GE`
  - I've seen VM designs where these are merged into a single `CMP` opcode, with an argument specifying the kind of comparison. But I think that's pretty dumb for various reasons (note: it does, however, make sense during optimization), and in any case it's not possible since we only get one explicit argument and one implied argument in my design. That said, these are probably encoded adjacent to each other, so if we really wanted we could say "the type of comparison is an argument that is embedded in the opcode byte itself".
  - Note that jump instructions can embed a comparison directly; these comparison instructions will only appear if you assign the result of a comparison into a variable or use it in a further arithmetic expression.
  - You might need to split out signed vs unsigned comparisons (not needed for `EQ` and `NE`). But unlike division/modulus below, you could always just xor the top bit.
  - optionally `LOGICAL_AND`, but note that the frontend normally won't generate it (the `&&` and `||` operators usually create branches) - there's less need for `LOGICAL_OR` since often any nonzero value is okay. But if storing a bool back in an integer variable, supporting this removes a coercion of the output if you don't know the input is already a strict bool
  - some people might find `LOGICAL_XOR` and `LOGICAL_XNOR` interesting, but they are just equivalent to `NE` and `EQ` after coercing both inputs to bool (if you don't know they already are)
  - if you're really pushing it, you could add `LOGICAL_{NAND,NOR,LT,LE,GT,GE}` as well so that all the 10 nontrivial truth tables are accessible, but who does that?
  - there is no need for a reversed version of any of these. `LT` is the reverse of `GT`, `LE` is the reverse of `GE`, and the rest are commutative
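The "xor the top bit" trick mentioned above is worth seeing concretely. This sketch (8-bit, function names mine) shows that flipping the sign bit maps signed order onto unsigned order, so one family of comparison opcodes can serve both at the cost of an XOR:

```python
import itertools

def as_unsigned8(x):
    # reinterpret a signed byte (-128..127) as unsigned (0..255)
    return x & 0xFF

def signed_lt_via_unsigned(a, b):
    # flip the top bit, then compare as unsigned:
    # -128..-1 map to 0..127, and 0..127 map to 128..255
    return (as_unsigned8(a) ^ 0x80) < (as_unsigned8(b) ^ 0x80)

# exhaustively check all 8-bit pairs against the real signed comparison
for a, b in itertools.product(range(-128, 128), repeat=2):
    assert signed_lt_via_unsigned(a, b) == (a < b)
print("sign-bit flip matches signed < on all 8-bit pairs")
```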
- Arithmetic, commutative (5 mandatory opcodes, 14 possible opcodes): `ADD`, `MUL`
  - if statically-typed (please), also `FADD` and `FMUL`
  - `BITWISE_XOR`, `BITWISE_OR`, `BITWISE_AND` (I use long names so `AND` doesn't get visually confused with `ADD`)
  - optionally `BITWISE_XNOR` (implement as `~(a ^ b)`)
  - if you're really pushing it, `BITWISE_{NAND,NOR,LT,LE,GT,GE}` (technically these last 4 aren't commutative, but as above their inverse is in the set)
- Arithmetic, noncommutative (2x 7 mandatory opcodes, 2x 11 possible opcodes): `SUB`
  - there is no point in an x86-style `CMP` instruction. Since we don't have a flags register, we need to change the value in the accumulator and branch off of that (overflow? what overflow?)
  - `BITWISE_SHL`, `BITWISE_SHR`, and optionally `ARITHMETIC_SHR` (let's be honest, do people ever really use that intentionally?)
  - `SDIV`, `UDIV`, `SMOD`, `UMOD`
    - Yes, division/modulus are different for signed and unsigned integers. Unsigned div/mod cannot be implemented in terms of signed div/mod. Signed div/mod can be implemented in terms of unsigned by adding branches, but ick.
    - Note that this is distinct from the problem of "what's the sign of the result of modulus" (which finally got standardized by C to what the hardware actually does; note that many famous languages do something weird instead)
    - You can only merge the signed/unsigned versions (and skip the second trap check) if your language only supports BigIntegers.
    - Don't forget to handle division by 0 and division of the most negative number by -1
  - `FDIV` if you support floating-point numbers at all. If your language is dynamically-typed (read: chooses to perform badly) you do not need any other floating-point-specific opcodes; if it is statically-typed you also need `FADD`, `FMUL` (these 2 are commutative), `FSUB`, and `FMOD`. Note that `FDIV` and `FMOD` do not support the divmod identity. Note that all floating-point opcodes should be grouped together separately from the integer instructions, since they take a different type of argument.
  - if statically-typed, you also need separate opcodes for "adding" strings and other important builtin data types (e.g. lists). But you could instead convert those to intrinsic calls - in fact, if you support operator overloading of user-defined types inside your language, those should get converted to function calls first, then those function calls get treated either as user-defined function calls or intrinsic function calls, so strings/lists are not really special at all (assuming they don't have dedicated opcodes, of course)
  - since these are noncommutative, you need a separate "reversed" instruction for all of these: `RSUB`, `RSDIV`, etc. The forward opcode is `accumulator = accumulator OP argument`, the reverse is `accumulator = argument OP accumulator`
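A tiny sketch of why the reversed opcodes must exist once there is only a single accumulator and one explicit argument (the function names just echo the opcode names above; this is illustration, not VM code):

```python
def SUB(acc, arg):   # forward:  acc = acc OP arg
    return acc - arg

def RSUB(acc, arg):  # reversed: acc = arg OP acc
    return arg - acc

acc = 10
print(SUB(acc, 3))   # 7
print(RSUB(acc, 3))  # -7
# RSUB with an immediate 0 is also how unary minus gets lowered: 0 - x
print(RSUB(5, 0))    # -5
```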
- For all of the above binary operators (including the reversed ones), there should be 2 actual opcode values (separated by a single bit, so the VM can handle them sanely) - one where the argument is interpreted as an index into the (runtime) local variables, and the other where the argument is an index into the appropriate constant data table (usually integer, sometimes float)
  - I call these "immediate" (suffix `_IMM`) even though it does a table lookup, since in the assembly language it is an immediate. I don't believe there's a point in separate opcodes that use the raw immediate directly.
- Unary operators:
  - there is no unary `+x`, that's a NOP and should not be emitted.
  - there is no unary `-x`, that's just `0-x` (`RSUB` with immediate argument)
  - there is no unary `!x`, that's just `x == 0`
  - there is no unary `~x`, that's just `x ^ -1`
  - Unlike real machines, `popcount(x)` is probably not important enough to have its own opcode in a VM; it can be an intrinsic. Note that `parity(x)` is just `popcount(x) & 1`.
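These unary lowerings fit in a small frontend table (a hypothetical sketch - the `_IMM` opcode names follow the convention above, but the table itself is mine). The asserts check that the underlying identities actually hold:

```python
def lower_unary(op):
    # each unary expression becomes one accumulator instruction:
    # acc already holds x; the table gives (opcode, immediate argument)
    return {
        "-": ("RSUB_IMM", 0),   # -x  ==  0 - x
        "!": ("EQ_IMM", 0),     # !x  ==  x == 0
        "~": ("XOR_IMM", -1),   # ~x  ==  x ^ -1
    }[op]

# sanity-check the identities themselves on a few values
for x in (-5, 0, 7):
    assert 0 - x == -x
    assert (x == 0) == (not x)
    assert x ^ -1 == ~x
print(lower_unary("-"))  # ('RSUB_IMM', 0)
```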
URL: https://www.reddit.com/r/ProgrammingLanguages/comments/up206c/stack_machines_for_compilers/i8ju1bm/ Summary: special instructions - memory, control flow, function arguments
3/3 opcode details, continued
- Memory stuff (5 mandatory instructions):
  - `LOAD`, `STORE` (taking an index into the local variables)
  - `LOAD_GLOBAL`, `STORE_GLOBAL` (taking an index into the current function's data tables, which in turn contains the name (or possibly interned index) of a global variable)
    - alternatively you could demote these to be intrinsics, in hopes that people stop using global variables
  - if you support closures, you might also need special load/store for captured arguments (but please, make everyone's life easier and use eager captures, not lazy ones!)
  - if you are a multiuser server or something, you probably want `LOAD_USERDATA` and `STORE_USERDATA` (in fact, I've seen 3 levels of this (account, character, session) - though it used sigil-based distinctions rather than something sane). Note there's nothing stopping you from having `user.foo = x` generate this automagically.
  - if you support general-purpose objects, you might need some sort of `LOAD_FIELD` and `STORE_FIELD`, but keep in mind that the latter requires 3 arguments no matter how you count it (maybe demote it to an intrinsic ... or maybe just store the extra arguments in the same place the intrinsic does. I am, after all, quite skeptical of trying to cram an Object in the accumulator)
  - `LOAD_IMM` (taking an index into the integer constant table)
  - I don't believe there's any point in having an explicit "zero the accumulator" instruction, since `x OP 0` will be encoded as `OP_IMM`, and `1 OP 2` will be constant-folded. There may, however, be some use in a `STORE_ZERO` instruction (but it is somewhat less useful if we zero out all local variables while setting up the stack frame, and let's be honest: we should)
    - but see the below discussion about `ARG`
  - I can imagine a `COPY` bytecode that treats both its explicit argument and the accumulator as local variable indices. But since you'd have to set the accumulator somehow in the first place, and there's no other context in which the accumulator might already be set to a local variable index, this currently isn't saving any instructions.
    - but again, see the `ARG` discussion
- Control flow:
  - `JUMP` (unconditional; argument is index in the data table of jump targets)
    - under some circumstances (e.g. if you use the "switch out the extra data" idea), you might wish to enforce that if you ever jump to a bytecode position, you always jump to it. This is as simple as checking that the target's previous instruction (be sure to consider whether extra data is possible or valid!) is some kind of jump (or you're at the start of the bytecode - but are you sure you want to be jumping there?). That said, I have seen VMs that have a `LABEL` instruction even though it does nothing at runtime. (may be useful for debugging too)
  - `JUMP_EQ`, `JUMP_NE`, `JUMP_LT`, `JUMP_LE`, `JUMP_GT`, `JUMP_GE` - only the first 2 are strictly necessary
    - note that if your frontend actually generates the other 4 (when the argument is not 0 already), you're probably declaring UB (using 8-bit integers for demonstration purposes: `127 > -1` is true, but `-128 > 0` is false). But if it gives us better codegen, EMBRACE UNDEFINED BEHAVIOR
      - this is one of the few times it's tempting to have multiple accumulator registers, but it's not worth it. Just generate the separate comparison operators, sigh
      - actually, what if we instead just made a "saturating subtraction" instruction? Actually 2, because now we care about unsigned. But call them SCMP and UCMP so we don't have to explain what saturation is? Or at that point should we just use a normal cmp? What generates better machine code?
    - these jump if `accumulator OP 0`, using signed comparisons. There is no need for separate unsigned instructions since the argument is hardcoded as 0:
      - `x ULT 0` is always false
      - `x ULE 0` becomes `x EQ 0`
      - `x UGT 0` becomes `x NE 0`
      - `x UGE 0` is always true
      - (what do you mean, you don't want UB for unsigned integers even if you want it for signed ones)
    - late edit: it is often worth considering these to take two branch targets and unconditionally jump, at least for analysis purposes. Note that if you actually implement the bytecode that way, `JUMP_LT` and `JUMP_GT` happen to merge into a single opcode.
    - late edit: it might also be useful to consider Fortran-style arithmetic `if`, with 3 total targets, but think: how would that be encoded?
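The UB being "declared" above is easy to demonstrate with 8-bit wrapping arithmetic (a standalone sketch, not VM code): rewriting `a > b` as `a - b > 0` breaks exactly when the subtraction overflows.

```python
def wrap8(x):
    # two's-complement wrap into [-128, 127]
    return (x + 128) % 256 - 128

a, b = 127, -1
print(a > b)             # True
print(wrap8(a - b) > 0)  # False: 127 - (-1) = 128, which wraps to -128
```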
- `RETURN` (unconditional; argument is ignored)
  - note that this counts as a jump instruction (technical term "terminator"), unlike calls
  - if the function returns an (integer) value, it should probably go in the accumulator, where the caller can store or use it as needed.
    - But (especially if it's a complex type) you might instead put it in the same place you put arguments. Or, for that matter, there's nothing stopping you from having an arbitrary number of `out` or `inout` arguments!
  - note that, unlike real machines, a VM should not read the return location from the VM data stack. Use a separate stack just for call/return (perhaps arrange the obligatory data structure into an implicit linked list?)
- `CALL_FUNCTION`, `CALL_INTRINSIC`
  - all functions and intrinsics should specify the expected number and type of arguments. See below for my suggestion for how arguments are passed (note that this is the part where you have the most choice)
  - these look identical at the bytecode level, other than which data table the target is looked up in.
  - in the runtime, however, `CALL_INTRINSIC` necessarily involves a function call (on the real machine stack), whereas `CALL_FUNCTION` is best implemented simply by tweaking the data of the VM stack and remaining in the same machine stack
    - besides eliminating stack overflows, this also means you can pause/resume the VM at any time, swap between tasks if needed, etc. You might interpret opcodes differently when resuming onto them - or you might require certain opcodes to appear in pairs, or who knows what.
  - note that intrinsics should NEVER call back into the VM. If you're tempted to do so, you should probably be adding a special-purpose opcode instead (you might have e.g. a "prompt the user with a list of menu options, and jump to that target" instruction)
  - likewise, while I'm thinking about it: if you support operator overloading, don't dispatch to it from an `ADD` opcode - only use that when you know both arguments are integers (maybe there's a way to avoid this, but it's sure to be complex and thus bug-prone)
- `ARG`, `ARG_IMM`, and possibly some helpers: store arguments in preparation for a function/intrinsic call
  - the most sensible place to store incoming arguments is at the bottom of the current stack frame, below "true" local variables; the most sensible place to store outgoing arguments is at the top of the current stack frame, above all the temporaries (note: in the IR, you might not know how many temporaries there are yet, so you should refer to them by argument index rather than local-variable index). No copying is necessary if you simply use overlapping (data) stack frames.
  - now I hope you're paying enough attention to ask "but wait, how do you specify both the value and the index of the argument?". Well, there are several ways to do it - you could use dedicated opcode bits (see the technicality for what to do if you have more than N (8?) arguments), you could use extra-data instructions, you could use a separate `arg_data` table that contains the pairs, or you could use the accumulator (initialize it to zero (or some other value?), then have each `ARG` instruction increment (or decrement) it as a side-effect - this does mean you can't evaluate arguments as you go (though if the inner functions have more arguments you can't do that anyway), since all calls clobber it (hm, but do they really have to?)). Consider also whether arguments are passed in forward or backward order - and also, in what order `foo(bar(), baz())` and `foo(bar(1, 2, 3), baz(4, 5, 6))` are evaluated.
    - depending on how exactly these instructions (and their helpers) are implemented, some of them might be generally useful as well
  - technically these aren't needed at all, but if we don't have them, there would be a lot of register-register copies, which take 2 instructions each
  - there is no need for an "arg from accumulator" instruction, since outgoing arguments are just ordinary (nameless) local variables.
  - it may be tempting to use the accumulator for the first (or last) argument, but that adds a surprising amount of complexity, so I'm not sure it's worth it.
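The overlapping-frame trick can be sketched in a few lines (the layout details here are my own; the point is just that the caller's outgoing-argument slots literally *are* the callee's first locals):

```python
# one flat data stack shared by all VM frames
stack = [0] * 32

def call(caller_base, caller_size, n_args):
    # the outgoing args occupy the top n_args slots of the caller's frame...
    callee_base = caller_base + caller_size - n_args
    # ...and that position becomes the base of the callee's frame, so the
    # outgoing slots are the callee's locals 0..n_args-1 with no copying
    return callee_base

caller_base, caller_size = 0, 8
stack[6], stack[7] = 11, 31          # ARG 0, ARG 1 (top of the caller's frame)
callee_base = call(caller_base, caller_size, 2)
print(callee_base)                   # 6
print(stack[callee_base + 0], stack[callee_base + 1])  # 11 31
```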
If you add all of the above together, including the weird ones, this is around 120 opcodes, which is still under 7 bits (so we can dedicate a single bit for extra data), though with not much to spare for future expansion. If we only use the sensible ones it's about 64, but remember to preserve space for future expansion and not cram it (so give up on the idea of there being two dedicated bits for flagging extra data - that means you cannot specify both "this is extra data, so trap instead of executing" and "this instruction needs extra data, so remember to go look for it" if you want to handle 15 bits of extra data per instruction)
Edit: if you've gotten here, you might also read my thoughts on static typing in relation to VMs (which links back to here)
URL: https://www.reddit.com/r/ProgrammingLanguages/comments/v7q9xg/how_to_implement_static_typing_in_a_c_bytecode_vm/ibmexzq/ Summary: static typing, inside and outside the VM, and their tension. Tagging and how to avoid it, even with unwinding.
There is some tension between "statically-typed inside the VM" and "statically-typed outside the VM".
Most VM languages make the decision to YOLO inside the VM, and use some kind of tagged union, tagged pointer, or NAN tagging outside the VM. Using tagging is mandatory since dynamic languages don't know ahead of time which member will be valid.
However, if you have full static type info, you can use untagged unions to represent VM objects. This is a slight penalty to debugging, and makes unwinding tricky, but it is worth it. Besides, you can still have a debug mode where the full type (or partial type) is stored as part of the object itself (which is then a 2-word-wide tagged union rather than a 1-word-wide raw union as it is in release mode) and asserted on reads, but otherwise the type is never relied upon. But if you support any kind of suspending, you probably need to store the partial types somewhere. You could have an explicit unwind table, or you could just use a parallel bit-vector (as in: struct-of-array, not array-of-struct - with the understanding that array-of-bit is optimized).
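One way to realize that debug mode, as a sketch (names, the `DEBUG` switch, and the two-partial-type choice are all illustrative): in release builds a slot is a bare word; in debug builds it carries a partial-type tag that is asserted on every read but never *relied* upon.

```python
DEBUG = True
PRIM, OBJ = 0, 1          # the two partial types you can usually get away with

class Slot:
    def __init__(self):
        self.value = 0
        self.tag = PRIM   # stored only so that debug-mode reads can assert it

    def write(self, value, tag):
        self.value = value
        if DEBUG:
            self.tag = tag

    def read(self, expected_tag):
        if DEBUG:
            # the static types say what to expect; debug mode just checks
            assert self.tag == expected_tag, "partial-type mismatch"
        return self.value

s = Slot()
s.write(42, PRIM)
print(s.read(PRIM))       # 42; s.read(OBJ) would assert in debug mode
```

The parallel bit-vector variant stores these same tags out-of-line (one per slot, struct-of-array style) instead of widening every slot.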
Chances are that you can get away with only 2 partial types, but it may be convenient to use 3 or 4, depending on what exact decisions you want to make:
- A primitive type. This includes integers, floats, maybe decimals or fixed-point numbers if you want. These require no lifetime-tracking and no deallocation.
- A simple allocated type. You may want to do refcounting on these, but otherwise there is nothing fancy about them; the only thing you need to do with them is deallocate them when their lifetime ends. Strings belong here, as do arrays of primitives ... but if you think about it, you should be able to handle this as a special vtable for one of the below.
- An object whose class is defined in the in-VM language. For these, you fundamentally cannot use the host's vtable mechanism, but you can (and should) create your own vtable system to deallocate the types. You should NOT use a dict-based structure - since you have static type info, you can transform field names into field indices (if you need reflection, make your vtable support that).
- A VM-support object, if this really needs to be handled specially. This includes dicts and arrays for non-primitive types.
- A host object type, strongly owned. If your host exposes any objects to the VM, this is what they'll use.
- A host object type, weakly owned. This may be useful if you're thinking in terms of a multiplayer game, and the player may log out while the script is suspended. Alternatively, you may hard-code setting these in the VM's "context" struct.
- A function pointer (host or in-VM, with or without capture). If these are allowed at all, you should probably desugar them in the frontend to use vtables. For once, Java is a useful inspiration.
(yes, I listed more than 4 things above. You should be able to eliminate/merge enough easily so your bitvector uses only 2 bits for the partial types. But life will get MUCH easier if you can reduce it to only the 2: "primitives and standard VM objects" - you should be able to support all of 2..6 with a single custom vtable design if you think about it.)
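The field-name-to-index transformation mentioned for in-VM classes might look like this sketch (the vtable shape is my own, purely illustrative): the compiler interns each class's field names once, so runtime objects are plain arrays, and the name-to-index dict survives only in the vtable where reflection can still find it.

```python
class VTable:
    def __init__(self, class_name, field_names):
        self.class_name = class_name
        # name -> index mapping, consulted at compile time (or for reflection)
        self.field_index = {name: i for i, name in enumerate(field_names)}

    def new_instance(self):
        return [0] * len(self.field_index)  # no per-object dict at runtime

point_vt = VTable("Point", ["x", "y"])
obj = point_vt.new_instance()
# a real frontend would resolve "y" to 1 at compile time and emit STORE_FIELD 1
obj[point_vt.field_index["y"]] = 5
print(obj)  # [0, 5]
```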
Note that if you use a stack VM, you'll need to use range analysis to determine the type of temporaries on the stack. Depending on what exact unwinding strategy you want, this may be obvious anyway. But I am a strong proponent of register-based VMs (with a single special accumulator) anyway. I wrote up a lengthy thing a while back about them - I recommend reading it all! (note that it links back here now)
Note that within the language that gets compiled to run in the VM, you may offer additional types. However, these should desugar into simpler accesses in the frontend.
For example, if you have a magic `player` object, the code `player.attack = player.strength` might desugar into `player_set_attack(player_get_strength())` or even `player_set(PLAYER_ATTACK, player_get(PLAYER_STRENGTH))` before the VM sees it at all. A key observation is that the host should probably never expose enums directly into the VM language (an annoyance I learned the hard way from other people's game VM code).
Important note: in general, I assume you can trust your bytecode not to be malicious (since otherwise, VM escapes are trivial), which mostly means "bytecode is generated at startup, not from a cached file", but could also mean "I trust the cached file" or "I have a way to verify the bytecode statically as well" (which is quite doable, but that's more work).
URL: none
Just a brief later update - in the above, I would again emphasize: the thing that is by far the least settled is what exactly the function-call sequence looks like. To avoid the need to potentially change the stack map for function calls (which is otherwise not needed for a register-based language - though even here it is only needed if interruption between arguments is a concern, so doing extra copies is also feasible), perhaps allow the CALL opcode to specify a start-of-arguments slot anywhere in the frame (and always do a copy to the new frame?). (note that the number/types of arguments can be read off the function descriptor)