Our virtual machine can now fetch, decode, and execute the instructions of a binary program. But how are we supposed to write such a program? Enter the assember – its job will be to convert programs written using the mnemonic instructions we’ve already discussed into binary programs.
While we could just assemble the language we’ve already seen, let’s also take this opportunity to improve the language a bit, with “labels”!
Motivating labels
Revisiting the simple counter program we saw previously, have a look at this section:
; ...snip...
eq r6 r2 r1 ; check if we reached the total number of iterations.
constant r5 ; the address of the start of the loop.
0xc
jnz r6 r5 ; if they are not equal, jump to the start of the loop.
; ...snip...
Note the literal value 0xc
, which we use to return to the beginning of
the loop. Suppose we were to add a new instruction to the beginning of
our program, before the loop. Then, because the address of the branch
is absolute, we’d need to change that literal to 0xd
. Annoying.
There are a few improvements we might make here. We could provide for a
relative addressing mode to use with our branching instructions.
In x86-64 for instance you can perform a relative jump to an address a
signed number of bytes away using 0xeb
followed by the number of bytes
you’d like to jump, itself a byte. E.g. 0xeb02
to jump forward 2 bytes.
Using negative numbers to represent backwards jumps, and encoding them
using 2’s complement,
instead of jmp 0xc
we’d have a “relative jump” jmpr 0xfffffffd
to
jump back 3 bytes.
But relative addressing won’t help us if we change the number of instructions within the loop itself. Instead, let’s allow ourselves to avoid hardcoding these addresses entirely, and use labels instead.
Defining and using labels
We define labels in our program in the same way that you might in C: @some_label:
.
We reference labels, for now, only when writing down an immediate. Hence,
our new loop becomes:
; ...snip...
@loop: ; Label definition.
add r4 r4 r3 ;
add r2 r2 r0 ;
eq r6 r2 r1 ;
constant r5 ;
@loop ; Label reference, used with `constant` like an immediate.
jnz r6 r5
; ...snip...
Now we can add instructions wherever we want, and our program will continue to work.
Implementing labels
We’ll implement our assembler, with labels, using two passes. In the first pass, we’ll collect all of the labels defined in the program, along with their addresses. In the second pass, we’ll emit the code, replacing label references with their addresses, and encoding instructions using our encoder.
The only tricky bit is that we must take care to correctly calculate the addresses of the labels in our program. Instructions take up 1 byte, as do immediates, but label definitions take up 0 bytes in the emitted code.
First, we define types for each of the kinds of “directives” that we’ll
parse – label definitions, normal instructions, and numeric or label
immediates (that we’ll use with constant
):
interface Label { type: 'label'; label: string; }
// Now we have both numeric literals *and* label literals.
interface Immediate {
type: 'immediate',
label?: string;
immediate?: number;
}
interface Instruction {
type: 'instruction';
instruction: DecodedInstruction;
}
// Each directive is one of the above types, along with a `size`.
type Directive = (Label | Immediate | Instruction) & {
size: number;
};
Note that that each of our Directive
types has a type
field whose type is not string
but is in fact a string literal. This makes Directive
a discriminated union, which TypeScript treats specially – for instance,
having checked that a particular directive is of type “instruction”, we can directly make use of
directive.instruction
, without checking that it is defined. Cool!
We also use a union type to require that each of our directives include size information, instead of adding it to each of our directive types – the size field is “distributed” to each of the types in the intersection.
Next we perform our passes to assemble a list of directives into a list of encoded instructions:
function assemble(directives: Directive[]): number[] {
// First pass: collect the addresses.
let address = 0;
const labelAddresses: {[label: string]: number} = {};
directives.forEach((directives, i) => {
if(directives.type === 'label'){
labelAddresses[directives.label] = address;
}
address += directives.size
});
// Second pass: emit the encoded instructions.
const encoded: number[] = [];
directives.forEach((directive) => {
switch(directive.type){
case 'instruction':
// Instructions we simply encode.
const i = encode(directive.instruction);
encoded.push(i!);
break;
case 'immediate':
// If we see a label reference, look up
// the address in the table we built.
encoded.push(directive.label ?
labelAddresses[directive.label] :
directive.immediate || 0
);
break;
case 'label':
// We do nothing for label definitions,
// they are not present in the emitted code.
break;
}
});
return encoded;
}
I’ve elided parsing because it’s kind of long not especially
interesting. Since we’ve no matching parentheses or anything else of a context-free nature,
we can get away with hacky text munging for this simple example; the current
version of qasm
actually uses PEG.js for the parser.
Give it a try! You can edit the code below:
Ready.