proj-oot-bootReferenceOld201026

  1. Boot (Oot Boot) reference

Version: unreleased (0.0.0-0)

Boot is a low-level 'assembly language'-style virtual machine (VM) that is easy to implement.

  1. # Introduction

Boot is a target language that is easy to implement on a wide variety of platforms, even 'on top of'/within an existing high-level language such as Python.

Highlights:


  1. # Architecture
      1. Instruction encoding

2 bytes per instruction. The bit fields are:

Note that:

      1. Datatypes ##

There are two primary datatypes: int32 and ptr.

ptr has two subtypes:
- ptrd (data pointers)
- ptrc (code pointers)

  1. ## Registers ##

Two banks of 8 registers each; one for int32, one for ptr.

The notation $n refers to the n-th int32 register, and &n refers to the n-th ptr register. For example, the first and last registers in each bank are $0, $7, &0, and &7.

The zero-th register in the int32 bank, $0, is constant zero, and the zero-th register in the ptr bank, &0, is the constant null pointer; writes to these registers have no effect.

Registers $1 and &1 are called the 'smallstack' registers and have special behavior; writes to these registers push the written value onto a stack, and reads from these registers pop a value from a stack. There are two stacks, one for integers and one for pointers; these stacks are called "smallstacks". They each have a capacity of 5 items. At all times there must be at least 1 item on each smallstack; an attempt to pop the last item off the stack is illegal. If more than one operand specifies a smallstack register, they are applied in this order: op2, op1, op0.

Registers $2 and &2 are called the 'TOS registers' and have special behavior; they are aliased to the item on top of the respective smallstacks. That is to say, reading and writing to these registers read and write the most recently pushed item on the stack (without pushing or popping).
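The register behavior described above can be modeled roughly as follows. This is only a sketch of the int32 bank under stated assumptions: the class name `IntBank`, its method names, the choice to raise exceptions for illegal operations, and treating a push onto a full smallstack as an error are all illustrative and are not defined by Boot. The ptr bank would follow the same pattern.

```python
class IntBank:
    """Illustrative model of the int32 register bank ($0..$7).

    Hypothetical helper, not part of Boot: names and error handling
    are assumptions made for this sketch.
    """

    CAPACITY = 5  # smallstack capacity stated in the text

    def __init__(self, initial=0):
        # Initial register and stack values are arbitrary in Boot;
        # this sketch simply uses `initial` and zeros.
        self.regs = [0] * 8
        self.stack = [initial]  # there is always at least one item

    def read(self, n):
        if n == 0:
            return 0  # $0 is constant zero
        if n == 1:
            # Reading $1 pops; popping the last item is illegal.
            if len(self.stack) <= 1:
                raise RuntimeError("illegal: pop of last smallstack item")
            return self.stack.pop()
        if n == 2:
            return self.stack[-1]  # $2 aliases the top of the stack
        return self.regs[n]

    def write(self, n, value):
        value &= 0xFFFFFFFF  # keep values within 32 bits
        if n == 0:
            return  # writes to $0 have no effect
        if n == 1:
            # Writing $1 pushes (overflow handling is an assumption).
            if len(self.stack) >= self.CAPACITY:
                raise RuntimeError("smallstack capacity exceeded")
            self.stack.append(value)
        elif n == 2:
            self.stack[-1] = value  # overwrite top of stack in place
        else:
            self.regs[n] = value


bank = IntBank()
bank.write(1, 7)             # push 7
bank.write(1, bank.read(2))  # push a copy of the top item
```

After these two writes the modeled smallstack holds three items, with 7 on top twice; a subsequent `bank.read(1)` pops one of them.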

At the beginning of the program, the smallstacks have one item, and the value of this item and of every register is arbitrary.

For each instruction in the program, the depth of each smallstack at the end of executing that instruction must be the same in every possible execution of the program in which control reaches that location.

  1. # Instructions ##

46 instructions

------
annotation: ann
load constants: lm6 lm22 lm32 lpc6 lpc22 lpc32
loads and stores and copies: l8 l8u l16 l16u l32 lp s8 s16 s32 sp cp cpp
arithmetic of ints: add32 sub32 mul32 add32m
bitwise arithmetic: and or xor shl shrs shru
adding ints to pointers: ap apm app ap32 ap16 appm ap32m ap16m
comparison control flow: bne blt bltu beq bnep beqp
other control flow: j9 j25 j32 jy
system library call: lib
misc: sinfo

(Notation for the instruction tables below) #imm3, #imm6, #imm9, #imm32 are signed immediate constants, #imm3u, #imm6u, #imm9u are immediate constants interpreted as unsigned, $X is an integer register or its contents, &X is a pointer register or its contents, 0 is an unused argument that must always be 0 (other values are RESERVED for future use). All signed immediate constants are two's-complement.

From left to right, the arguments go into operands op0, op1, op2. Immediate operands are always on the right (the highest-numbered operand). When two immediate operands are combined into an #imm6 (as with instruction lm6), op1 holds the high-order bits and op2 holds the low-order bits (imm6 = (op1 << 3) + op2). Similarly for #imm9 (imm9 = (op0 << 6) + (op1 << 3) + op2).
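The operand-combining formulas above can be sketched in a few lines. The function names (`sign_extend`, `imm6`, `imm9`) and the `signed` flag are illustrative assumptions, not part of Boot; the sketch only applies the stated formulas and then interprets the result as two's complement when the immediate is a signed one.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as two's complement."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value


def imm6(op1, op2, signed=True):
    """Combine two 3-bit operand fields: imm6 = (op1 << 3) + op2."""
    raw = (op1 << 3) + op2
    return sign_extend(raw, 6) if signed else raw


def imm9(op0, op1, op2, signed=True):
    """Combine three 3-bit fields: imm9 = (op0 << 6) + (op1 << 3) + op2."""
    raw = (op0 << 6) + (op1 << 3) + op2
    return sign_extend(raw, 9) if signed else raw
```

For instance, `imm6(7, 7)` is -1 when read as signed and 63 when read as unsigned (`signed=False`).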

Jump and branch immediate offsets are in units of bytes in the Boot code, where '0' indicates the following instruction location (in the case of j32, that means the location after the embedded 32-bit immediate offset). Jump and branch immediates may not jump into the middle of an instruction. Platforms which compile or represent Boot code in memory in ways such that one instruction spans more or fewer than 2 memory locations must adjust the jump/branch immediate offsets accordingly before executing them.

The 'm' in the mnemonics lm6, lm32, add32m, apm, appm, ap32m, ap16m stands for 'iMmediate' (although not all instructions with immediates have an 'm' in the mnemonic).

annotation:

load constants:

register loads and stores and copies:

stack manipulation:

arithmetic of ints (result always defined and all results mod 2^32):

bitwise arithmetic:

Adding ints to Pointers (only valid on data pointers, not code pointers):

conditional branches:

unconditional jumps and other control flow:

system library call:

misc:

      1. Notes on certain instructions ###
    1. sinfo queries ##

when query = ..., this returns in &dest ...:

Note that the sinfo query results defined above (and possibly others) are static -- they must never change during the execution of a program.

  1. # lib calls ##

The argument specifies which library function is called.

Numbers 0 through 1 are defined below, and 2 through 255 are RESERVED for extensions. libfn numbers 256 through 511 are implementation-defined and may be used to access linked libraries, if the implementation supports that.

  1. ## lib 0: halt(RESULT: $6) ###

Terminate program, with RESULT code passed in $6. The result code is interpreted in a platform-specific way (however, most typically, success is indicated with a result code of 0).

  1. ## lib 1: memcpy(DST: &5, SRC: &6, SIZE: $5) ###

Copy SIZE bytes starting at memory location SRC to memory starting at memory location DST.

The caller must assume that the values in registers $4, $5, $6, $7, and &4, &5, &6, &7 may be overwritten during the call.
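As a rough illustration of the two defined lib calls, an interpreter hosted in Python might dispatch them as below. Everything about the representation is an assumption made for the sketch: registers are plain lists indexed by register number, pointers are integer indices into a Python list standing in for byte-addressed memory, and `Halt` and `lib_call` are hypothetical names.

```python
class Halt(Exception):
    """Raised by the sketch to signal program termination."""

    def __init__(self, result):
        super().__init__(result)
        self.result = result


def lib_call(libfn, int_regs, ptr_regs, memory):
    """Dispatch a lib call (hypothetical host-side helper)."""
    if libfn == 0:
        # halt: the RESULT code is passed in $6
        raise Halt(int_regs[6])
    if libfn == 1:
        # memcpy: DST in &5, SRC in &6, SIZE in $5
        dst, src, size = ptr_regs[5], ptr_regs[6], int_regs[5]
        # The source slice is copied before assignment.
        memory[dst:dst + size] = memory[src:src + size]
        return
    raise NotImplementedError(f"libfn {libfn} is not defined in this sketch")


memory = list(range(10))
int_regs = [0] * 8
ptr_regs = [0] * 8
ptr_regs[5], ptr_regs[6], int_regs[5] = 6, 1, 3
lib_call(1, int_regs, ptr_regs, memory)  # copy 3 bytes from index 1 to index 6
```

A real implementation is also free to clobber $4 through $7 and &4 through &7 during the call, which this sketch does not do.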

  1. # Table of opcodes ##

The following CSV-formatted table contains tuples of the form:

- opcode: the value found in the opcode field of the instruction
- reference_opcode: a number uniquely identifying the instruction
- mnemonic
- has_i16_data: 1 if the instruction has an embedded 16-bit immediate word following it and 0 otherwise
- has_i32_data: 1 if the instruction has an embedded 32-bit immediate word following it and 0 otherwise
- op0_type, op1_type, op2_type: the types of op0, op1, and op2
- op0_w, op0_r: 1 if the instruction might write to, or respectively read from, the register specified by op0
- op1_w, op1_r, op2_w, op2_r: likewise for op1 and op2
- PC_w, PC_r: 1 if the instruction might write to, or respectively read, the PC
- mem_w, mem_r: 1 if the instruction might write to, or respectively read from, memory

Type identifiers in the following table:

Note that in the instruction metadata table, the opcode field is written in decimal notation (not hexadecimal).

  1. ## Instruction decode table ###

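The fields are: (opcode field value) (op0 field value) (op1 field value) (op2 field value): (reference opcode) mnemonic. The opcode field value is written in hexadecimal and the reference opcode in decimal; '*' means that the field does not take part in the dispatch.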

     0 * * *: 0  j9
     1 * * *: 1  j25
     2 * * *: 2  lib
     3 * * *: 3  lpc6
     4 * * *: 4  lpc22
     5 * * *: 5  lm6
     6 * * *: 6  lm22
     7 * * *: 7  sinfo
     8 * * *: 8  ann
     9 * * *: 9  beq
     A * * *: 10 bne
     B * * *: 11 blt
     C * * *: 12 bltu
     D * * *: 13 beqp
     E * * *: 14 bnep
     F * * *: 15 s32
    10 * * *: 16 s16
    11 * * *: 17 s8
    12 * * *: 18 sp
    13 * * *: 19 l32
    14 * * *: 20 l16
    15 * * *: 21 l16u
    16 * * *: 22 l8
    17 * * *: 23 l8u
    18 * * *: 24 lp
    19 * * *: 25 add32m
    1A * * *: 26 apm
    1B * * *: 27 appm
    1C * * *: 28 ap32m
    1D * * *: 29 ap16m
    1E * * *: 30 shl
    1F * * *: 31 shru
    20 * * *: 32 shrs
    21 * * *: 33 add32
    22 * * *: 34 sub32
    23 * * *: 35 mul32
    24 * * *: 36 and
    25 * * *: 37 or
    26 * * *: 38 xor
    27 * * *: 39 cp
    28 * * *: 40 cpp
    29 * * *: 41 ap
    2A * * *: 42 app
    2B * * *: 43 ap32
    2C * * *: 44 ap16
    3F 7 0 *: 70 lm32
    3F 7 1 *: 71 lpc32
    3F 7 2 *: 73 jy
    
    3F 7 7 0: 80 j32
    3F 7 7 1: 81 swap_int
    3F 7 7 2: 82 swap_ptr
    3F 7 7 5: 83 over_int
    3F 7 7 6: 84 over_ptr
  1. ## Instruction metadata table ###
    opcode, reference_opcode, mnemonic, has_i16_data, has_i32_data, op0_type, op1_type, op2_type, op0_w, op0_r, op1_w, op1_r, op2_w, op2_r, PC_w, PC_r, mem_w, mem_r
    2,2,j9,0,0,i9,_,_,0,1,0,0,0,0,1,1,0,0
    3,3,j25,1,0,i9,_,_,0,1,0,0,0,0,1,1,0,0
    4,4,lib,0,0,u9,_,_,0,0,0,0,0,0,1,1,1,1
    5,5,lpc6,0,0,rp,i6,_,1,0,0,0,0,0,0,1,0,0
    6,6,lm6,0,0,ri,i6,_,1,0,0,0,0,0,0,0,0,0
    7,7,sinfo,0,0,ri,u6,_,1,0,0,0,0,0,0,0,0,0
    8,8,ann,0,0,?,?,?,0,0,0,0,0,0,0,0,0,0
    9,9,beq,0,0,ri,ri,i3,0,1,0,1,0,0,1,1,0,0
    10,10,bne,0,0,ri,ri,i3,0,1,0,1,0,0,1,1,0,0
    11,11,blt,0,0,ri,ri,i3,0,1,0,1,0,0,1,1,0,0
    12,12,bltu,0,0,ri,ri,i3,0,1,0,1,0,0,1,1,0,0
    13,13,beqp,0,0,rp,rp,i3,0,1,0,1,0,0,1,1,0,0
    14,14,bnep,0,0,rp,rp,i3,0,1,0,1,0,0,1,1,0,0
    15,15,s32,0,0,ri,rp,i3,0,1,0,1,0,0,0,0,1,0
    16,16,s16,0,0,ri,rp,i3,0,1,0,1,0,0,0,0,1,0
    17,17,s8,0,0,ri,rp,i3,0,1,0,1,0,0,0,0,1,0
    18,18,sp,0,0,rp,rp,i3,0,1,0,1,0,0,0,0,1,0
    19,19,l32,0,0,ri,rp,i3,1,0,0,1,0,0,0,0,0,1
    20,20,l16,0,0,ri,rp,i3,1,0,0,1,0,0,0,0,0,1
    21,21,l16u,0,0,ri,rp,i3,1,0,0,1,0,0,0,0,0,1
    22,22,l8,0,0,ri,rp,i3,1,0,0,1,0,0,0,0,0,1
    23,23,l8u,0,0,ri,rp,i3,1,0,0,1,0,0,0,0,0,1
    24,24,lp,0,0,rp,rp,i3,1,0,0,1,0,0,0,0,0,1
    25,25,add32m,0,0,ri,ri,i3,1,0,0,1,0,0,0,0,0,0
    26,26,apm,0,0,rp,rp,i3,1,0,0,1,0,0,0,0,0,0
    27,27,appm,0,0,rp,rp,i3,1,0,0,1,0,0,0,0,0,0
    28,28,ap32m,0,0,rp,rp,i3,1,0,0,1,0,0,0,0,0,0
    29,29,ap16m,0,0,rp,rp,i3,1,0,0,1,0,0,0,0,0,0
    30,30,shl,0,0,ri,ri,u3,1,0,0,1,0,0,0,0,0,0
    31,31,shru,0,0,ri,ri,u3,1,0,0,1,0,0,0,0,0,0
    32,32,shrs,0,0,ri,ri,u3,1,0,0,1,0,0,0,0,0,0
    33,33,add32,0,0,ri,ri,ri,1,0,0,1,0,1,0,0,0,0
    34,34,sub32,0,0,ri,ri,ri,1,0,0,1,0,1,0,0,0,0
    35,35,mul32,0,0,ri,ri,ri,1,0,0,1,0,1,0,0,0,0
    36,36,and,0,0,ri,ri,ri,1,0,0,1,0,1,0,0,0,0
    37,37,or,0,0,ri,ri,ri,1,0,0,1,0,1,0,0,0,0
    38,38,xor,0,0,ri,ri,ri,1,0,0,1,0,1,0,0,0,0
    39,39,cp,0,0,ri,ri,ri,1,0,0,1,0,1,0,0,0,0
    40,40,cpp,0,0,rp,rp,ri,1,0,0,1,0,1,0,0,0,0
    41,41,ap,0,0,rp,rp,ri,1,0,0,1,0,1,0,0,0,0
    42,42,app,0,0,rp,rp,ri,1,0,0,1,0,1,0,0,0,0
    43,43,ap32,0,0,rp,rp,ri,1,0,0,1,0,1,0,0,0,0
    44,44,ap16,0,0,rp,rp,ri,1,0,0,1,0,1,0,0,0,0
    
    
    
    
    63,63,,0,0,0,,,0,0,0,0,0,0,0,0,0,0
    63,64,,0,0,1,?,?,0,0,0,0,0,0,0,0,0,0
    63,65,,0,0,2,?,?,0,0,0,0,0,0,0,0,0,0
    63,66,,0,0,3,?,?,0,0,0,0,0,0,0,0,0,0
    63,67,,0,0,4,?,?,0,0,0,0,0,0,0,0,0,0
    63,68,,0,0,5,?,?,0,0,0,0,0,0,0,0,0,0
    63,69,,0,0,6,?,?,0,0,0,0,0,0,0,0,0,0
    
    63,70,lm32,0,1,7,0,ri,0,0,0,0,1,0,0,0,0,0
    63,71,lpc32,0,1,7,1,rp,0,0,0,0,1,0,0,1,0,0
    63,73,jy,0,0,7,2,rp,0,0,0,0,0,1,1,0,0,0
    
    
    63,80,j32,0,1,7,7,0,0,0,0,0,0,0,1,1,0,0
    63,81,swap_int,0,0,7,7,1,0,0,0,0,0,0,1,1,0,0
    63,82,swap_ptr,0,0,7,7,2,0,0,0,0,0,0,1,1,0,0
    63,83,over_int,0,0,7,7,3,0,0,0,0,0,0,1,1,0,0
    63,84,over_ptr,0,0,7,7,4,0,0,0,0,0,0,1,1,0,0
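Because the metadata table is plain CSV, it can be loaded with a standard CSV reader. The following is a minimal sketch using Python's `csv` module on the header and three rows copied from the table above; the names `SAMPLE`, `FLAG_COLUMNS`, and `load_metadata` are illustrative, and keying the result by reference_opcode is just one possible choice (it is used here because some rows have an empty mnemonic).

```python
import csv
import io

# Header and three sample rows copied from the metadata table.
SAMPLE = """\
opcode, reference_opcode, mnemonic, has_i16_data, has_i32_data, op0_type, op1_type, op2_type, op0_w, op0_r, op1_w, op1_r, op2_w, op2_r, PC_w, PC_r, mem_w, mem_r
9,9,beq,0,0,ri,ri,i3,0,1,0,1,0,0,1,1,0,0
19,19,l32,0,0,ri,rp,i3,1,0,0,1,0,0,0,0,0,1
63,70,lm32,0,1,7,0,ri,0,0,0,0,1,0,0,0,0,0
"""

FLAG_COLUMNS = [
    "has_i16_data", "has_i32_data",
    "op0_w", "op0_r", "op1_w", "op1_r", "op2_w", "op2_r",
    "PC_w", "PC_r", "mem_w", "mem_r",
]


def load_metadata(text):
    """Parse the CSV text into a dict keyed by reference_opcode."""
    # skipinitialspace drops the blank after each comma in the header.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    table = {}
    for row in reader:
        for column in FLAG_COLUMNS:
            row[column] = int(row[column])
        row["opcode"] = int(row["opcode"])
        row["reference_opcode"] = int(row["reference_opcode"])
        table[row["reference_opcode"]] = row
    return table


metadata = load_metadata(SAMPLE)
```

With this, `metadata[70]["mnemonic"]` is `"lm32"` and `metadata[70]["has_i32_data"]` is 1.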
  1. ## Notes about opcodes ###

When the opcode field contains 63, the instruction is further dispatched based on the value of op0, interpreted as a u3. The reference_opcode field is used to identify each instruction with a unique integer, even though multiple instructions share an encoding with a 63 in the 'opcode' field. When the opcode field contains 63 and op0 is 7, the instruction is further dispatched based on the value of op1, and when the opcode field contains 63 and op0 is 7 and op1 is 7, the instruction is further dispatched based on the value of op2. In this way, an additional 6 two-operand, 6 one-operand, and 7 zero-operand instructions can be encoded.
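The nested dispatch described above can be sketched as a lookup over already-extracted field values. The sketch takes the opcode, op0, op1, and op2 fields as integers, so it makes no assumption about the bit layout of the instruction word. The mnemonics and field values follow the instruction decode table; the names `PRIMARY`, `BY_OP1`, `BY_OP2`, and `decode`, and returning `None` for unassigned encodings, are illustrative choices.

```python
# Mnemonics for opcode field values 0x00..0x2C, in decode-table order.
PRIMARY = [
    "j9", "j25", "lib", "lpc6", "lpc22", "lm6", "lm22", "sinfo",
    "ann", "beq", "bne", "blt", "bltu", "beqp", "bnep", "s32",
    "s16", "s8", "sp", "l32", "l16", "l16u", "l8", "l8u",
    "lp", "add32m", "apm", "appm", "ap32m", "ap16m", "shl", "shru",
    "shrs", "add32", "sub32", "mul32", "and", "or", "xor", "cp",
    "cpp", "ap", "app", "ap32", "ap16",
]

# Opcode 0x3F with op0 == 7: dispatch on op1.
BY_OP1 = {0: "lm32", 1: "lpc32", 2: "jy"}

# Opcode 0x3F with op0 == 7 and op1 == 7: dispatch on op2.
BY_OP2 = {0: "j32", 1: "swap_int", 2: "swap_ptr", 5: "over_int", 6: "over_ptr"}


def decode(opcode, op0, op1, op2):
    """Return the mnemonic for the given fields, or None if unassigned."""
    if opcode < len(PRIMARY):
        return PRIMARY[opcode]
    if opcode != 0x3F:
        return None
    if op0 != 7:
        return None  # no instructions are listed for op0 in 0..6
    if op1 != 7:
        return BY_OP1.get(op1)
    return BY_OP2.get(op2)
```

For example, `decode(0x3F, 7, 7, 0)` yields `"j32"`, while `decode(0x3F, 7, 0, 3)` yields `"lm32"` regardless of the op2 value, since op2 is an ordinary operand there.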

todo the following may need to have their numbers updated:

Note that the reference opcodes have the following properties:

Note that later additions that assign instruction(s) to the RESERVED opcodes may break the 'only such' parts of these properties.

  1. # Semantics details ##
      1. Arithmetic ###

Int32 overflow on addition, subtraction, and multiplication wraps around (that is, mathematically the operations are done mod 2^32). Note that the operations add32, add32m, sub32, and mul32 give valid results whether you consider the int32 operands to be signed two's complement or unsigned, as long as you interpret the result the same way.

For example, for multiplication, imagine if we had 3-bit integers instead of 32-bit integers. If we multiply the unsigned representations of 2*3, that is, 010*011, the result is 6, that is, 110. In two's complement, 010 represents 2 and 011 also represents 3, and 110 represents -2; and 2*3 = 6 = -2 mod 2^3. To give another example, if we multiply the unsigned representations of 6*2, that is, 110*010, the result is 12, and 12 mod 2^3 is 4, that is, 100 in binary. In two's complement, 110 represents -2 and 010 represents 2, and the result, 100, represents -4; and -2*2 = -4 mod 2^3 = 4. Do note that these are only correct modulo the bitwidth; for example, 2*3 = 6, but mul 010 011 = 110, which when viewed as two's complement yields 2*3 = -2, an incorrect result in ordinary arithmetic, but -2 is equivalent to 6 mod 8, so the result is correct in mod 8 arithmetic. In the examples in this paragraph we used mod 8 for 3-bit integers, but in reality, we are using mod 2^32 for 32-bit integers, not mod 8.

On many platforms, it may be easiest to implement add32, add32m, sub32, mul32 by viewing the int32s as unsigned integers and then applying unsigned arithmetic operations, because many platforms don't implement wrap-around signed numbers.
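The unsigned-then-mask approach can be sketched in Python, whose integers are unbounded, by masking every result to 32 bits. The helper names (`MASK32`, `as_signed`, and the Python functions named after the Boot mnemonics) are illustrative only. The last lines redo the 3-bit multiplication examples from the text with a 3-bit mask.

```python
MASK32 = 0xFFFFFFFF


def add32(a, b):
    return (a + b) & MASK32


def sub32(a, b):
    return (a - b) & MASK32


def mul32(a, b):
    return (a * b) & MASK32


def as_signed(value, bits=32):
    """Reinterpret an unsigned bit pattern as two's complement."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):
        return value - (1 << bits)
    return value


# The 3-bit examples from the text, using a mask of 0b111 (mod 2^3).
product_a = (0b010 * 0b011) & 0b111   # 2 * 3
product_b = (0b110 * 0b010) & 0b111   # 6 * 2, or -2 * 2 when signed
```

Here `product_a` is `0b110` (6 unsigned, -2 signed) and `product_b` is `0b100` (4 unsigned, -4 signed), matching the worked examples.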

Note that many arithmetic operations are provided only for integers; the only arithmetic you can do to pointers is add signed integers to them.

  1. ## Integer bitwidths ###

The instructions l8, l8u, l16, l16u guarantee that the numbers read into registers are in ranges that fit in 8 bits (l8, l8u) and 16 bits (l16, l16u), respectively. l8 and l16 sign-extend the number read to 32 bits, and l8u and l16u zero-extend the number read to 32 bits.

However, l8 and l16 result in signed two's complement representations in the destination register; note that the bit pattern of a small negative number in a 32-bit register, when coded with signed two's complement, is equivalent to a number larger than 16 bits if interpreted as unsigned. For example, a -1 in a register, signed, would be viewed as (2^32 - 1 = 4294967295) unsigned.

Boot guarantees that bytes (8-bit integers) have a size of 1 in memory (meaning that values that are stored with s8 occupy one memory location). INT8_SIZE is not defined because it would always be 1.
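The difference between sign extension and zero extension can be shown on register bit patterns. This sketch represents a 32-bit register as a non-negative Python integer; the function names are illustrative and not part of Boot.

```python
def zero_extend(raw, bits):
    """Keep the low `bits` bits; the upper register bits become 0."""
    return raw & ((1 << bits) - 1)


def sign_extend_to_32(raw, bits):
    """Copy the sign bit of a `bits`-wide value into the upper bits."""
    raw &= (1 << bits) - 1
    if raw & (1 << (bits - 1)):
        raw |= 0xFFFFFFFF & ~((1 << bits) - 1)
    return raw


byte_minus_one = 0xFF  # the 8-bit pattern for -1
signed_load = sign_extend_to_32(byte_minus_one, 8)
unsigned_load = zero_extend(byte_minus_one, 8)
```

Here `signed_load` is `0xFFFFFFFF`, which is 4294967295 when viewed as unsigned, while `unsigned_load` is 255.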

  1. ### (not) mixing integer bitwidths ####

When in registers and being operated upon, the internal representation of int32s is a defined sequence of bits; however, when in memory, the internal representation of integers is opaque. For example, if a memory location x contains a 32-bit integer, and you read it with l16 or l16u, the value that is read is unspecified other than that it's no larger than 16 bits. Similarly, if a memory location contains an 8-bit integer and you read it using l32, the value that is read is unspecified other than that it's some int32 (also, reading a byte using l32 near the edge of accessible memory will cause undefined behavior if there are fewer than INT32_SIZE memory locations in accessible memory, starting with the location read). You cannot write a sequence of bytes (8-bit integers) into memory and then usefully read it back using l32, and you cannot write a 32-bit integer into memory and then usefully read out its component bytes.

Furthermore there is no guarantee that 32-bit integers occupy more than one memory location, or that larger integer bitwidths occupy more memory locations than smaller; it's possible for both of INT16_SIZE, INT32_SIZE to be identically 1 (this can happen if the implementation chooses to make each single memory location large enough to store 32 bits of data). Larger integer bitwidths are guaranteed to occupy at least as many memory locations as smaller.

  1. ## Stack manipulation tips ###

Constants can be pushed onto the smallstacks by using the load-immediate instructions and writing to the special smallstack pseudoregister, 1. For example, to push constant 1 onto the integer stack, 'lm6 $1 1'.

The smallstacks can be pushed and popped from/to other registers by using cp/cpp and the special smallstack pseudoregister, 1. 'DUP' can be accomplished by pushing a copy of the top-of-stack register, 2, onto the smallstack using cp/cpp and writing to the smallstack pseudoregister, 1, e.g. 'cp $1 $2'.

'DROP' can be accomplished by popping the top item via the smallstack pseudoregister, 1, and discarding it by using cp/cpp and writing to the zero pseudoregister, e.g. 'cp $0 $1'. (Reading the top-of-stack register, 2, would not pop the item.)

  1. ## Undefined behavior and arbitrary values ###

These lists are probably accidentally incomplete right now, but we hope to make them comprehensive as time goes on.

  1. ### Undefined behavior ####

The following are undefined behaviors in Boot. Any program containing undefined behavior on any codepath has undefined behavior as a whole:

        1. Arbitrary values ####

The following do not cause undefined behavior and do not make the whole program invalid, but do not define the resulting values of certain operations:

      1. Reserved and implementation-defined items ###
        1. Reserved for extensions, vs. implementation-defined, vs. reserved for future use ####

There is a distinction between items that are reserved for extensions and items that are implementation-defined. The former are expected to be defined in extension languages such as BootX. By contrast, implementation-defined items are allocated for the implementation to do what it wishes and are not expected to be used by extensions such as BootX.

Another category is items which are reserved for future use. These items are reserved for use in future versions of Boot itself, and should not be used by either extensions or by implementations.

Implementations must not define or use items which are reserved for extensions, or items which are reserved for future use; if they do so, they risk incompatibility with extensions or future Boot versions. Extension languages must not define or use items which are defined in Boot to be implementation-defined.

  1. ### Reserved instruction encoding space ####

The 0 in the most-significant bit is intended to allow Boot to be made a part of other instruction formats which use a 0 in this bit to indicate a Boot instruction and a 1 to indicate something else (for example, instructions of different lengths). That is to say, instructions with a 1 in the most-significant bit are reserved for extensions.

    1. Boot Assembly ##

Boot Assembly is a plaintext syntax for Boot.

Boot Assembly is ASCII text. Each line is processed separately; lines are delimited by the newline character, '\n' (a byte with the value 10). Whitespace is defined as one of the characters: ' \t\n\r\f\v' (where \t indicates tab, \n indicates newline, etc). Lines which are all whitespace, or which begin with a semicolon, are skipped. Trailing whitespace on any line is ignored.

Some lines may begin with '.d' (a 'data line'). This is followed by a space and then a 16-bit unsigned hexadecimal number (4 characters which are each digits or letters within a-f). This may be followed by whitespace which may be followed by a semicolon. After a semicolon the rest of the line is a comment (all characters except newline are ignored), up to the first newline, which still terminates the line.

Therefore, data lines must match the following regular expression (regex): ^\.d [0-9a-f][0-9a-f][0-9a-f][0-9a-f]\s*(;.*)?$
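As a quick illustration, the data-line regular expression above can be used directly from Python's `re` module. The names `DATA_LINE` and `parse_data_line` are hypothetical; the sketch returns the 16-bit value of a valid data line and `None` for anything that does not match.

```python
import re

# The data-line regular expression, as given in the text.
DATA_LINE = re.compile(r"^\.d [0-9a-f][0-9a-f][0-9a-f][0-9a-f]\s*(;.*)?$")


def parse_data_line(line):
    """Return the 16-bit value of a data line, or None if it is invalid."""
    line = line.rstrip("\n")  # the newline only terminates the line
    if DATA_LINE.match(line) is None:
        return None
    # The four hex digits always sit at columns 3..6, after '.d '.
    return int(line[3:7], 16)
```

For example, `parse_data_line(".d 00ff ; low byte set")` gives 255, while `parse_data_line(".d 00FF")` gives `None` because only lowercase hexadecimal digits are allowed.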

Other lines ('instruction lines') begin with an instruction mnemonic. Instruction mnemonics are at most 12 characters long and are all lowercase alphanumeric: each begins with an alphabetic character, which is followed by one or more alphanumeric characters. The mnemonic may be followed by a space and a hexadecimal number (a string of digits from 0 to 9 and lowercase letters from a to f, possibly prefixed by one of '-' or '+') denoting the first operand, op0. This may be followed by another space and a second number (op1), and possibly by another space and a third number (op2). After three operands the usable information in the line is exhausted and the assembler may skip to the next line. If the instruction is followed by a 32-bit embedded immediate constant, this must be included manually on the following lines using '.d'. The usable information in the line may be followed by whitespace, which may be followed by a semicolon. After a semicolon the rest of the line is a comment (all characters except newline are ignored), up to the first newline, which still terminates the line.

Operands are hexadecimal (base 16) integers and may be prefixed by '+' or '-' to indicate sign. Immediate operand ranges are (note that these ranges are written in hexadecimal):

Operands of register type must be in the range 0 to 7, inclusive (each bank has 8 registers).

Therefore, instruction lines must match the following regular expression (regex): ^([a-z][a-z0-9]+)( ([-+]?[0-9a-f]+))?( ([-+]?[0-9a-f]+))?( ([-+]?[0-9a-f]+))?\s*(;.*)?$
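The instruction-line regular expression can likewise be applied as is. In this sketch, `INSTR_LINE` and `parse_instruction_line` are hypothetical names; operands are converted from hexadecimal with their optional sign, and neither the 12-character mnemonic limit nor the per-operand ranges are checked.

```python
import re

# The instruction-line regular expression, as given in the text.
INSTR_LINE = re.compile(
    r"^([a-z][a-z0-9]+)"
    r"( ([-+]?[0-9a-f]+))?"
    r"( ([-+]?[0-9a-f]+))?"
    r"( ([-+]?[0-9a-f]+))?"
    r"\s*(;.*)?$"
)


def parse_instruction_line(line):
    """Return (mnemonic, [operands]) or None if the line does not match."""
    match = INSTR_LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    # Groups 3, 5 and 7 hold op0, op1 and op2 when present.
    operands = [
        int(text, 16)
        for text in (match.group(3), match.group(5), match.group(7))
        if text is not None
    ]
    return match.group(1), operands
```

For example, `parse_instruction_line("add32 3 4 5")` gives `("add32", [3, 4, 5])`, and `parse_instruction_line("j9 +a ; skip ahead")` gives `("j9", [10])`.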

The last line in the file must end in a newline, unless the last line is all whitespace.


  1. # TODO ##