So, now we have some notation that defines our instruction set, and we understand how each instruction operates. Last time, in our simple interpreter, we stored our instructions as JS strings. But of course, a real CPU doesn’t have text-based representations of assembly code floating around in it! Instead, machine code is “just a bunch of binary” that the CPU decodes into instructions.
Encoding instructions
So, what do we have to encode? Recall our three-address code notation:
instruction dr sr0 sr1
We have an instruction (one of around twenty), a destination register (one of 64 general purpose registers), and 2 source registers (each of which could be a general purpose register as well). That’s 4 components total, each with a small range. Since we have plenty of space in our 32-bit machine, we’ll encode the instruction in 1 octet, the destination register in 1 octet, and each source register in 1 octet, for a total of 4 octets.
For example, if add
is instruction 0x13
, and we are encoding add r0 r1 r2
, we
get 0x13000102
.
Let’s try it out:
Once again, error handling somewhat falls by the wayside. Preview source »
Implementing this is just a few simple bitwise operations: assuming a valid operation
as well as valid values for each of dr
, sr0
, and sr1
, our encoded instruction is just:
(operation << 24) | (dr << 16) | (sr0 << 8) | (sr1 << 0)
This is a very wasteful encoding – we’re using 32 bits of memory to encode far fewer bits of information. Out there in “real life”, using too many bytes to encode instructions has real ramifications. For instance, less code might fit in a cache, or it might take longer to load a set of instructions. Since we don’t even have a cache, and we’re writing this in fucking TypeScript, we’ll take the easy road.
Constants
Constants are a bit different. We want to be able to encode any 32-bit constant in our
constant
instruction, so we’ll need more than 32-bits. Therefore, instead of encoding
the immediate value in the instruction, we’ll have the immediate value follow it…
immediately! For example, constant r3 0xdeadbeef
will encode to 0x02030000
for the
“instruction” part and 0xdeadbeef
for the immediate part.
Decoding instructions
Now that we understand how instructions are encoded, decoding a number i
that encodes
an instruction is simply the reverse of the encoding process:
operation = i >> 24 & 0xff
dr = i >> 16 & 0xff
sr0 = i >> 8 & 0xff
sr1 = i >> 0 & 0xff
Let’s try it out:
The decoder ignores unused destination and source registers, hence for example
0x4
– that is, 0x00000004
– decodes to halt
. Preview source »
Of course, we can’t really tell the difference between an instruction
and an immediate that happens to equal the value that an instruction encodes
to; instead, we need to know whether the instruction “before” the one we
are decoding is a constant
.
Updating our interpreter
Now we can update our virtual machine’s “fetch” step to simply read from a list of numbers, and the “decode” step to decode each number using the above approach. Then execution proceeds in much the same way.
// Constants for our instructions, since they are no longer strings.
enum Instructions {
halt = 0,
add,
// ...
jmp,
// ...
}
function interpret(instructions: number[]): number {
const registers: number[] = new Array(64).fill(0);
const memory: number[] = new Array(0x1000).fill(0);
let ip = 0;
// Load program into memory.
instructions.forEach((instruction, i) => {
memory[i] = instruction;
});
while(true){
// Fetch.
const instruction = memory[ip];
ip++;
// Decode.
const decoded = decode(instruction);
if(decoded === undefined){
throw new Error(`invalid instruction ${instruction}`);
}
// Execute.
switch(decoded.op){
case Halt:
return registers[0];
case Add:
registers[decoded.dr] = registers[decoded.sr0] + registers[decoded.sr1];
break;
// ...
case Constant:
registers[decoded.dr] = memory[ip];
ip++
break;
// ...
case Jmp:
ip = registers[decoded.sr0];
break;
default:
throw new Error(`invalid instruction operation ${decoded.op}`);
}
}
}
See how we load the program into main memory at the outset? Now our code
and our data live in the same space, a true Von Neumann architecture. For
now we load the program at address 0x0000
.
Pretty straightforward. But, since we won’t have a list of encoded instructions at hand, we’re hard-pressed to try out our improved interpreter (which, by the way, is looking more and more like a proper virtual machine). We need a tool to convert our text instructions to a binary program… enter the assembler!