Version: unreleased (0.0.0-0)
Boot is a low-level 'assembly language'-style virtual machine (VM) that is easy to implement.
Boot is a target language that is easy to implement on a wide variety of platforms, even 'on top of'/within an existing high-level language such as Python.
Highlights:
Note that:
two primary datatypes:
ptr has two subtypes:
- ptrd (data pointers)
- ptrc (code pointers)
The notation $n refers to the n-th int32 register, and &n refers to the n-th ptr register. For example, the first and last registers in each bank are $0, $7, &0, and &7.
The zero-th register in the int32 bank, $0, is constant zero, and the zero-th register in the ptr bank, &0, is the constant null pointer; writes to these registers have no effect.
Registers $1 and &1 are called the 'smallstack' registers and have special behavior; writes to these registers push the written value onto a stack, and reads from these registers pop a value from a stack. There are two stacks, one for integers and one for pointers; these stacks are called "smallstacks". They each have a capacity of 4 items. At all times there must be at least 1 item on each smallstack; an attempt to pop the last item off the stack is illegal. If more than one operand specifies a smallstack register, they are applied in this order: op2, op1, op0.
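The register rules above can be modelled in a few lines. The following Python sketch is only an illustration, not part of Boot. The names `SmallStack`, `IntBank`, and `add32` are hypothetical. Raising exceptions on an illegal pop or on a push beyond capacity, starting the arbitrary initial item at 0, and treating add32 as "op0 = op1 + op2" are all assumptions made for the example.

```python
MASK32 = 0xFFFFFFFF


class SmallStack:
    """Model of one smallstack: capacity 4, never fewer than 1 item."""
    CAPACITY = 4

    def __init__(self, initial=0):
        # The initial item is arbitrary in Boot; 0 is an assumption for determinism.
        self.items = [initial]

    def push(self, value):
        if len(self.items) >= self.CAPACITY:
            # Assumption: this model rejects pushes beyond the stated capacity.
            raise OverflowError("smallstack holds at most 4 items")
        self.items.append(value)

    def pop(self):
        if len(self.items) <= 1:
            raise IndexError("popping the last smallstack item is illegal")
        return self.items.pop()


class IntBank:
    """Model of the int32 register bank $0..$7."""

    def __init__(self):
        self.regs = [0] * 8
        self.stack = SmallStack()

    def read(self, n):
        if n == 0:
            return 0                  # $0 is constant zero
        if n == 1:
            return self.stack.pop()   # reading $1 pops
        return self.regs[n]

    def write(self, n, value):
        if n == 0:
            return                    # writes to $0 have no effect
        if n == 1:
            self.stack.push(value & MASK32)  # writing $1 pushes
            return
        self.regs[n] = value & MASK32


def add32(bank, op0, op1, op2):
    # Smallstack operands are applied in the order op2, op1, op0.
    b = bank.read(op2)
    a = bank.read(op1)
    bank.write(op0, (a + b) & MASK32)


bank = IntBank()
bank.write(1, 3)
bank.write(1, 4)
add32(bank, 1, 1, 1)      # pops 4 (op2), pops 3 (op1), pushes 7 (op0)
print(bank.stack.items)   # [0, 7]
```

In this model an instruction whose three operands are all $1 behaves like a stack-machine addition, which is the point of the op2, op1, op0 ordering.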
At the beginning of the program, the smallstacks have one item, and the value of this item and of every register is arbitrary.
For each instruction in the program, the depth of each smallstack at the end of executing that instruction must be the same in every possible execution of the program in which control reaches that instruction.
48 instructions
Category | Mnemonics |
--- | --- |
annotation | ann |
load constants | lm6 lm22 lm32 lpc6 lpc22 lpc32 |
loads and stores and copies | l8 l8u l16 l16u l32 lp s8 s16 s32 sp cp cpp |
arithmetic of ints | add32 sub32 mul32 add32m |
bitwise arithmetic | and or xor shl shrs shru |
adding ints to pointers | ap |
comparison control flow | bne blt bltu beq bnep beqp |
other control flow | j9 j25 j32 jy |
system library call | lib |
stack manipulation | dup_int dup_ptr swap_int swap_ptr over_int over_ptr |
misc | sinfo |
(Notation for the instruction tables below) #imm3, #imm6, #imm9, #imm32 are signed immediate constants, #imm3u, #imm6u, #imm9u are immediate constants interpreted as unsigned, $X is an integer register or its contents, &X is a pointer register or its contents, 0 is an unused argument that must always be 0 (other values are RESERVED for future use). All signed immediate constants are two's-complement.
From left to right, the arguments go into operands op0, op1, op2. Immediate operands are always on the right (the highest-numbered operand). When two immediate operands are combined into an #imm6 (as with instruction lm6), op1 holds the high-order bits and op2 holds the low-order bits (imm6 = (op1 << 3) + op2). Similarly, for #imm9, imm9 = (op0 << 6) + (op1 << 3) + op2.
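The combination formulas can be checked with a short Python sketch. The function names are hypothetical. The sketch assumes that each operand field holds a raw unsigned 3-bit value and that a signed immediate is obtained by sign-extending the combined value as two's complement.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def imm6(op1, op2, signed=True):
    # op1 holds the high-order 3 bits, op2 the low-order 3 bits.
    raw = (op1 << 3) + op2
    return sign_extend(raw, 6) if signed else raw


def imm9(op0, op1, op2, signed=True):
    raw = (op0 << 6) + (op1 << 3) + op2
    return sign_extend(raw, 9) if signed else raw


print(imm6(0b011, 0b111))                     # 31
print(imm6(0b111, 0b111))                     # -1
print(imm9(0b001, 0b000, 0b000))              # 64
print(imm9(0b111, 0b111, 0b111, signed=False))  # 511
```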
Jump and branch immediate offsets are in units of bytes in the Boot code, where '0' indicates the following instruction location (in the case of j32, that means the location after the embedded 32-bit immediate offset). Jump and branch immediates may not jump into the middle of an instruction. Platforms which compile or represent Boot code in memory in ways such that one instruction spans more or fewer than 4 memory locations must adjust the jump/branch immediate offsets accordingly before executing them.
The 'm' in the mnemonics lm6, lm22, lm32, and add32m stands for 'iMmediate' (although not all instructions with immediates have an 'm' in the mnemonic).
annotation:
load constants:
register loads and stores and copies:
stack manipulation:
arithmetic of ints (result always defined and all results mod 2^32):
bitwise arithmetic:
Adding ints to Pointers (only valid on data pointers, not code pointers):
conditional branches:
unconditional jumps and other control flow:
system library call:
misc:
when query = ..., this returns in &dest ...:
Note that the sinfo query results defined above (and possibly others) are static -- they must never change during the execution of a program.
libfn numbers 0 through 1 are defined below, and 2 through 255 are RESERVED for extensions. libfn numbers 256 through 511 are implementation-defined and may be used to access linked libraries, if the implementation supports that.
Terminate program, with RESULT code passed in $6. The result code is interpreted in a platform-specific way (however, most typically, success is indicated with a result code of 0).
The caller must assume that the values in registers $4, $5, $6, $7, and &4, &5, &6, &7 may be overwritten during the call.
The following CSV-formatted table contains tuples of the form:
(opcode (as found in the opcode field of the instruction), reference_opcode (a number uniquely identifying the instruction), mnemonic, 1 if the instruction has an embedded 16-bit immediate word following it and 0 otherwise, 1 if the instruction has an embedded 32-bit immediate word following it and 0 otherwise, type of op0, type of op1, type of op2, 1 if the instruction might write to the register specified by op0, 1 if the instruction might read from the register specified by op0, the same write and read flags for op1 and then for op2, 1 if the instruction might write to the PC, 1 if the instruction might read the PC, 1 if the instruction might write to memory, 1 if the instruction might read from memory)
Type identifiers in the following table:
Note that the opcode field is written in decimal notation (not hexadecimal).
The fields are: (opcode field value) (op0 field value) (op1 field value) (op2 field value): (reference opcode) mnemonic. In this listing, unlike the CSV table, field values are written in hexadecimal; reference opcodes are written in decimal, and '*' stands for any value.
0 * * *: 0 j9
1 * * *: 1 j25
2 * * *: 2 lib
3 * * *: 3 lpc6
4 * * *: 4 lpc22
5 * * *: 5 lm6
6 * * *: 6 lm22
7 * * *: 7 ann
8 * * *: 8 beq
9 * * *: 9 bne
A * * *: 10 blt
B * * *: 11 bltu
C * * *: 12 beqp
D * * *: 13 bnep
E * * *: 14 s32
F * * *: 15 s16
10 * * *: 16 s8
11 * * *: 17 sp
12 * * *: 18 add32m
14 * * *: 20 shl
15 * * *: 21 shru
16 * * *: 22 shrs
17 * * *: 23 add32
18 * * *: 24 sub32
19 * * *: 25 mul32
1A * * *: 26 and
1B * * *: 27 or
1C * * *: 28 xor
1D * * *: 29 ap
1E 0 * *: 30 l32
1E 1 * *: 31 l16
1E 2 * *: 32 l16u
1E 3 * *: 33 l8
1E 4 * *: 34 l8u
1E 5 * *: 35 lp
1E 6 * *: 36 cp
1E 7 * *: 37 cpp
1F 0 * *: 38 sinfo
1F 1 * *: 39 in1 (todo add to table below, and above)
1F 2 * *: 40 out1 (todo)
1F 3 * *: 41
1F 4 * *: 42
1F 5 * *: 43
1F 6 * *: 44
1F 7 0 *: 45 lm32
1F 7 1 *: 46 lpc32
1F 7 2 *: 47 jy
1F 7 3 *: 48
1F 7 4 *: 49
1F 7 5 *: 50
1F 7 6 *: 51
1F 7 7 0: 52 j32
1F 7 7 1: 53 dup_int
1F 7 7 2: 54 dup_ptr
1F 7 7 3: 55 swap_int
1F 7 7 4: 56 swap_ptr
1F 7 7 5: 57 over_int
1F 7 7 6: 58 over_ptr
1F 7 7 7: 50 break (todo)
opcode, reference_opcode, mnemonic, has_i16_data, has_i32_data, op0_type, op1_type, op2_type, op0_w, op0_r, op1_w, op1_r, op2_w, op2_r, PC_w, PC_r, mem_w, mem_r
0,0,j9,0,0,i9,_,_,0,1,0,0,0,0,1,1,0,0
1,1,j25,1,0,i9,_,_,0,1,0,0,0,0,1,1,0,0
2,2,lib,0,0,u9,_,_,0,0,0,0,0,0,1,1,1,1
3,3,lpc6,0,0,rp,i6,_,1,0,0,0,0,0,0,1,0,0
4,4,lpc22,1,0,rp,i6,_,1,0,0,0,0,0,0,1,0,0
5,5,lm6,0,0,ri,i6,_,1,0,0,0,0,0,0,0,0,0
6,6,lm22,1,0,ri,i6,_,1,0,0,0,0,0,0,0,0,0
7,7,ann,0,0,?,?,?,0,0,0,0,0,0,0,0,0,0
8,8,beq,0,0,ri,ri,i3,0,1,0,1,0,0,1,1,0,0
9,9,bne,0,0,ri,ri,i3,0,1,0,1,0,0,1,1,0,0
10,10,blt,0,0,ri,ri,i3,0,1,0,1,0,0,1,1,0,0
11,11,bltu,0,0,ri,ri,i3,0,1,0,1,0,0,1,1,0,0
12,12,beqp,0,0,rp,rp,i3,0,1,0,1,0,0,1,1,0,0
13,13,bnep,0,0,rp,rp,i3,0,1,0,1,0,0,1,1,0,0
14,14,s32,0,0,ri,rp,i3,0,1,0,1,0,0,0,0,1,0
15,15,s16,0,0,ri,rp,i3,0,1,0,1,0,0,0,0,1,0
16,16,s8,0,0,ri,rp,i3,0,1,0,1,0,0,0,0,1,0
17,17,sp,0,0,rp,rp,i3,0,1,0,1,0,0,0,0,1,0
18,18,add32m,0,0,ri,ri,i3,1,0,0,1,0,0,0,0,0,0
20,20,shl,0,0,ri,ri,u3,1,0,0,1,0,0,0,0,0,0
21,21,shru,0,0,ri,ri,u3,1,0,0,1,0,0,0,0,0,0
22,22,shrs,0,0,ri,ri,u3,1,0,0,1,0,0,0,0,0,0
23,23,add32,0,0,ri,ri,ri,1,0,0,1,0,1,0,0,0,0
24,24,sub32,0,0,ri,ri,ri,1,0,0,1,0,1,0,0,0,0
25,25,mul32,0,0,ri,ri,ri,1,0,0,1,0,1,0,0,0,0
26,26,and,0,0,ri,ri,ri,1,0,0,1,0,1,0,0,0,0
27,27,or,0,0,ri,ri,ri,1,0,0,1,0,1,0,0,0,0
28,28,xor,0,0,ri,ri,ri,1,0,0,1,0,1,0,0,0,0
29,29,ap,0,0,rp,rp,ri,1,0,0,1,0,1,0,0,0,0
30,30,l32,0,0,0,ri,rp,1,0,0,1,0,0,0,0,0,1
30,31,l16,0,0,1,ri,rp,1,0,0,1,0,0,0,0,0,1
30,32,l16u,0,0,2,ri,rp,1,0,0,1,0,0,0,0,0,1
30,33,l8,0,0,3,ri,rp,1,0,0,1,0,0,0,0,0,1
30,34,l8u,0,0,4,ri,rp,1,0,0,1,0,0,0,0,0,1
30,35,lp,0,0,5,rp,rp,1,0,0,1,0,0,0,0,0,1
30,36,cp,0,0,6,ri,ri,1,0,0,1,0,1,0,0,0,0
30,37,cpp,0,0,7,rp,rp,1,0,0,1,0,1,0,0,0,0
31,38,sinfo,0,0,0,ri,u6,1,0,0,0,0,0,0,0,0,0
31,39,lm32,0,1,7,0,ri,0,0,0,0,1,0,0,0,0,0
31,40,lpc32,0,1,7,1,rp,0,0,0,0,1,0,0,1,0,0
31,41,jy,0,0,7,2,rp,0,0,0,0,0,1,1,0,0,0
31,42,j32,0,1,7,7,0,0,0,0,0,0,0,1,1,0,0
31,43,dup_int,0,0,7,7,1,0,0,0,0,0,0,0,0,0,0
31,44,dup_ptr,0,0,7,7,2,0,0,0,0,0,0,0,0,0,0
31,43,swap_int,0,0,7,7,3,0,0,0,0,0,0,0,0,0,0
31,44,swap_ptr,0,0,7,7,4,0,0,0,0,0,0,0,0,0,0
31,45,over_int,0,0,7,7,5,0,0,0,0,0,0,0,0,0,0
31,46,over_ptr,0,0,7,7,6,0,0,0,0,0,0,0,0,0,0
todo the following may need to have their numbers updated:
Note that the reference opcodes have the following properties:
Note that later additions that assign instruction(s) to the RESERVED opcodes may break the 'only such' parts of these properties.
For example, for multiplication, imagine if we had 3-bit integers instead of 32-bit integers. If we multiply the unsigned representations of 2*3, that is, 010*011, the result is 6, that is, 110. In two's complement, 010 represents 2 and 011 also represents 3, and 110 represents -2; and 2*3 = 6 = -2 mod 2^3. To give another example, if we multiply the unsigned representations of 6*2, that is, 110*010, the result is 12, and 12 mod 2^3 is 4, that is, 100 in binary. In two's complement, 110 represents -2 and 010 represents 2, and the result, 100, represents -4; and -2*2 = -4 mod 2^3 = 4. Do note that these are only correct modulo the bitwidth; for example, 2*3 = 6, but mul 010 011 = 110, which when viewed as two's complement yields 2*3 = -2, an incorrect result in ordinary arithmetic, but -2 is equivalent to 6 mod 8, so the result is correct in mod 8 arithmetic. In the examples in this paragraph we used mod 8 for 3-bit integers, but in reality, we are using mod 2^32 for 32-bit integers, not mod 8.
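The 3-bit examples in the preceding paragraph can be reproduced with a small Python sketch. The helper names `mul_wrapped` and `to_signed` are hypothetical and exist only for this illustration.

```python
def mul_wrapped(a, b, bits):
    """Multiply two bit patterns and keep only the low `bits` bits."""
    return (a * b) & ((1 << bits) - 1)


def to_signed(value, bits):
    """Read a `bits`-wide bit pattern as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >= 1 << (bits - 1) else value


# 010 * 011 = 110: unsigned 2 * 3 = 6, and 110 read as signed is -2 (= 6 mod 8).
p = mul_wrapped(0b010, 0b011, 3)
print(format(p, "03b"), p, to_signed(p, 3))   # 110 6 -2

# 110 * 010 = 100: unsigned 6 * 2 = 12 = 4 mod 8, signed -2 * 2 = -4.
q = mul_wrapped(0b110, 0b010, 3)
print(format(q, "03b"), q, to_signed(q, 3))   # 100 4 -4

# The same bit-level operation at the real width of 32 bits.
r = mul_wrapped(0xFFFFFFFE, 2, 32)
print(hex(r), to_signed(r, 32))               # 0xfffffffc -4
```

The same bit pattern comes out whichever interpretation is applied to the inputs, which is why a single multiply instruction serves both signed and unsigned values.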
On many platforms, it may be easiest to implement add32, add32m, sub32, mul32 by viewing the int32s as unsigned integers and then applying unsigned arithmetic operations, because many platforms don't implement wrap-around signed numbers.
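As one illustration of this approach, the sketch below implements the four operations in Python, whose integers never wrap on their own, by masking every result to 32 bits. The function signatures are hypothetical. The sketch assumes that each instruction computes a result from two source values and that the add32m immediate has already been decoded to a signed Python integer.

```python
MASK32 = 0xFFFFFFFF  # keeps the low 32 bits, i.e. reduces mod 2^32


def add32(a, b):
    return (a + b) & MASK32


def sub32(a, b):
    return (a - b) & MASK32


def mul32(a, b):
    return (a * b) & MASK32


def add32m(a, imm):
    # `imm` may be negative; masking maps the sum back into 0..2^32-1.
    return (a + imm) & MASK32


print(hex(add32(0xFFFFFFFF, 1)))   # 0x0
print(hex(sub32(0, 1)))            # 0xffffffff
print(hex(mul32(0x80000000, 2)))   # 0x0
print(hex(add32m(0, -1)))          # 0xffffffff
```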
Note that many arithmetic operations are provided only for integers; the only arithmetic you can do to pointers is add signed integers to them.
The instructions l8, l8u, l16, l16u guarantee that the numbers read into registers are in ranges that fit in 8 bits (l8, l8u) or 16 bits (l16, l16u). l8 and l16 sign-extend the number read to 32 bits, and l8u and l16u zero-extend the number read to 32 bits.
However, l8 and l16 produce signed two's-complement representations in the destination register; note that the bit pattern of a small negative number in a 32-bit register, when coded in signed two's complement, is equivalent to a number wider than 16 bits if interpreted as unsigned. For example, a -1 in a register, signed, would be viewed as 2^32 - 1 = 4294967295 unsigned.
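The difference between the sign-extending and zero-extending loads can be sketched in Python. The helper names `zero_extend` and `sign_extend_to_32` are hypothetical, and the sketch assumes the loaded value is available as a raw 8-bit or 16-bit pattern.

```python
MASK32 = 0xFFFFFFFF


def zero_extend(raw, bits):
    """Keep the low `bits` bits; the upper bits of the register are zero."""
    return raw & ((1 << bits) - 1)


def sign_extend_to_32(raw, bits):
    """Copy the sign bit of a `bits`-wide value into the upper register bits."""
    raw &= (1 << bits) - 1
    if raw & (1 << (bits - 1)):
        raw -= 1 << bits          # negative in two's complement
    return raw & MASK32           # 32-bit register pattern


print(zero_extend(0xFF, 8))                 # 255
print(sign_extend_to_32(0xFF, 8))           # 4294967295, the pattern for -1
print(hex(sign_extend_to_32(0x8000, 16)))   # 0xffff8000
print(sign_extend_to_32(0x7F, 8))           # 127
```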
Boot guarantees that bytes (8-bit integers) have a size of 1 in memory (meaning that values stored with s8 occupy one memory location). INT8_SIZE is not defined because it would always be 1.
Furthermore there is no guarantee that 32-bit integers occupy more than one memory location, or that larger integer bitwidths occupy more memory locations than smaller; it's possible for both of INT16_SIZE, INT32_SIZE to be identically 1 (this can happen if the implementation chooses to make each single memory location large enough to store 32 bits of data). Larger integer bitwidths are guaranteed to occupy at least as many memory locations as smaller.
The smallstacks can be pushed and popped from/to other registers by using cp/cpp and the special smallstack pseudoregister, 1. 'DUP' can be accomplished by pushing a copy of the top-of-stack register, 2, onto the smallstack using cp/cpp and writing to the smallstack pseudoregister, 1, e.g. 'cp $1 $2'.
'DROP' can be accomplished by popping the top-of-stack register, 2, and discarding it by using cp/cpp and writing to the zero pseudoregister, e.g. 'cp $0 $2'.
These lists are probably accidentally incomplete right now, but we hope to make them comprehensive as time goes on.
The following are undefined behaviors in Boot. Any program containing undefined behavior on any codepath has undefined behavior as a whole:
The following do not cause undefined behavior and do not make the whole program invalid, but do not define the resulting values of certain operations:
Another category is items which are reserved for future use. These items are reserved for use in future versions of Boot itself, and should not be used by either extensions or by implementations.
Implementations must not define or use items which are reserved for extensions, or items which are reserved for future use; if they do so, they risk incompatibility with extensions or future Boot versions. Extension languages must not define or use items which are defined in Boot to be implementation-dependent.
Boot Assembly is ASCII text. Each line is processed separately; lines are delimited by the newline character, '\n' (a byte with the value 10). Whitespace is defined as one of the characters: ' \t\n\r\f\v' (where \t indicates tab, \n indicates newline, etc). Lines which are all whitespace, or which begin with a semicolon, are skipped. Trailing whitespace on any line is ignored.
Some lines may begin with '.d' (a 'data line'). This is followed by a space and then a 16-bit unsigned hexadecimal number (4 characters which are each digits or letters within a-f). This may be followed by whitespace which may be followed by a semicolon. After a semicolon the rest of the line is a comment (all characters except newline are ignored), up to the first newline, which still terminates the line.
Therefore, data lines must match the following regular expression (regex): ^\.d [0-9a-f][0-9a-f][0-9a-f][0-9a-f]\s*(;.*)?$
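A data-line check based on this regular expression might look like the following Python sketch. The function name `parse_data_line` is hypothetical, and a capture group has been added around the four hex digits so the value can be extracted; otherwise the pattern is the one given above.

```python
import re

# Same pattern as in the text, with one added capture group for the digits.
DATA_LINE = re.compile(r"^\.d ([0-9a-f][0-9a-f][0-9a-f][0-9a-f])\s*(;.*)?$")


def parse_data_line(line):
    """Return the 16-bit value of a '.d' line, or None if it is not a data line."""
    m = DATA_LINE.match(line.rstrip("\n"))
    if m is None:
        return None
    return int(m.group(1), 16)


print(parse_data_line(".d 00ff ; low byte set"))  # 255
print(parse_data_line(".d ffff"))                 # 65535
print(parse_data_line(".d 00FF"))                 # None (uppercase is not accepted)
print(parse_data_line(".d 12345"))                # None (exactly 4 digits required)
```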
Other lines ('instruction lines') begin with an instruction mnemonic. Instruction mnemonics are at most 12 characters, and are all lowercase alphanumeric. Mnemonics begin with an alphabetic character and are followed by one or more alphanumeric characters. This is followed by a space, possibly followed by a hexadecimal number (a string of digits from 0 to 9 and lowercase letters from a to f, possibly prefixed by one of '-' or '+') denoting the first operand, op0. This may be followed by another space and a second number (op1), and maybe by another space and a third number (op2). After three operands the usable information in the line is exhausted and the assembler may skip to the next line. If the instruction is followed by a 32-bit embedded immediate constant, this must be included manually on the next line using '.d'. The usable information in the line may be followed by whitespace which may be followed by a semicolon. After a semicolon the rest of the line is a comment (all characters except newline are ignored), up to the first newline, which still terminates the line.
Operands are hexadecimal integers (base 16) and may be prefixed by '+' or '-' to indicate sign. Immediate operand ranges are (note that these ranges are written in hexadecimal):
Operands of register type must be in the range 0 to 7, inclusive, since each register bank has eight registers.
Therefore, instruction lines must match the following regular expression (regex): ^([a-z][a-z0-9]+)( ([-+]?[0-9a-f]+))?( ([-+]?[0-9a-f]+))?( ([-+]?[0-9a-f]+))?\s*(;.*)?$
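As an illustration, the instruction-line pattern can be applied with Python's `re` module as sketched below. The function name `parse_instruction_line` and the choice to return a `(mnemonic, operands)` tuple are assumptions of this example. The sketch only checks the line syntax and converts the operands from hexadecimal; it does not validate mnemonics or operand ranges.

```python
import re

# The instruction-line pattern from the text, unchanged.
INSTRUCTION_LINE = re.compile(
    r"^([a-z][a-z0-9]+)( ([-+]?[0-9a-f]+))?( ([-+]?[0-9a-f]+))?"
    r"( ([-+]?[0-9a-f]+))?\s*(;.*)?$"
)


def parse_instruction_line(line):
    """Return (mnemonic, [operands...]) or None if the line does not match."""
    m = INSTRUCTION_LINE.match(line.rstrip("\n"))
    if m is None:
        return None
    # Groups 3, 5 and 7 hold op0, op1 and op2; absent operands are None.
    operands = [int(text, 16) for text in m.group(3, 5, 7) if text is not None]
    return m.group(1), operands


print(parse_instruction_line("add32 2 3 4 ; sum"))  # ('add32', [2, 3, 4])
print(parse_instruction_line("lm6 2 -a"))           # ('lm6', [2, -10])
print(parse_instruction_line("j9 +1f"))             # ('j9', [31])
print(parse_instruction_line("ADD32 2 3 4"))        # None
```

Python's `int(text, 16)` accepts an optional leading '+' or '-', so the sign prefix needs no separate handling.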
The last line in the file must end in a newline, unless the last line is all whitespace.