Jake Taylor yupferris

A traditional table-based DFA implementation looks like this:

uint8_t table[NUM_STATES][256]

uint8_t run(const uint8_t *start, const uint8_t *end, uint8_t state) {
    for (const uint8_t *s = start; s != end; s++)
        state = table[state][*s];
    return state;
}

	// Example: Opcode dispatch in a bytecode VM. Assume the opcode case dispatching is mispredict heavy,
	// and that pc, ins, next_ins, next_opcase are always in registers.

	#define a ((ins >> 8) & 0xFF)
	#define b ((ins >> 16) & 0xFF)
	#define c ((ins >> 24) & 0xFF)

	// Version 1: Synchronous instruction fetch and opcode dispatch. The big bottleneck is that given how light
	// the essential work is for each opcode case (e.g. something like ADD is typical), you're dominated
	// by the cost of the opcode dispatch branch mispredicts. When there's a mispredict, the pipeline restarts