proj-oot-bootReferenceOld201018

  1. Boot (Oot Boot) reference

Version: unreleased (0.0.0-0)

Boot is a low-level 'assembly language' virtual machine (VM) that is easy to implement.

  1. # Introduction

Boot is a target language that is easy to implement on a wide variety of platforms, even on very primitive bare metal, or 'on top of'/within an existing high-level language such as Python.

Highlights:

[TOC]

  1. # Instruction encoding ## 4 bytes per instruction. The bytes are:

Op0 is restricted to a maximum value of 63 (when interpreted as unsigned), meaning that the 2 most-significant-bits are always zero.
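
The restriction on op0 amounts to a mask test on one byte. The following is a minimal Python sketch of that check; the function name `op0_is_valid_boot` is hypothetical and not part of Boot.

```python
def op0_is_valid_boot(op0: int) -> bool:
    """Return True if an op0 byte satisfies Boot's restriction (0..63 unsigned)."""
    if not 0 <= op0 <= 0xFF:
        raise ValueError("op0 must fit in one byte")
    # The two most-significant bits (mask 0b1100_0000) must both be zero.
    return (op0 & 0b1100_0000) == 0
```

With this sketch, 63 passes and 64 does not, since 64 sets one of the two high bits.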

  1. # Datatypes ##

two primary datatypes:

ptr has two subtypes: - ptrd (data pointers) - ptrc (code pointers)

  1. # Registers ## Two banks of 8 registers each; one for int32, one for ptr. The first register in the int32 bank is constant zero, and the first register in the ptr bank is the null pointer; writes to these registers have no effect. The notation $n refers to the n-th int32 register, and &n refers to the n-th ptr register, for example the first and last registers in each bank are: $0, $15, &0, &15.
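
The hard-wired first register of each bank could be modeled as below. This is only a sketch: the class name `RegisterBank` is hypothetical, the null pointer is modeled as Python's `None` (an assumption), and the bank size is left as a parameter.

```python
class RegisterBank:
    """One bank of registers whose register 0 is a hard-wired constant."""

    def __init__(self, size, zero_value):
        self._regs = [zero_value] * size

    def read(self, n):
        return self._regs[n]

    def write(self, n, value):
        if n == 0:
            return  # writes to register 0 have no effect
        self._regs[n] = value


NUM_REGS = 8  # assumption: "two banks of 8 registers each"
int_bank = RegisterBank(NUM_REGS, 0)     # $0 is constant zero
ptr_bank = RegisterBank(NUM_REGS, None)  # &0 is the null pointer (modeled as None)

int_bank.write(0, 123)  # ignored: $0 stays 0
int_bank.write(1, 123)  # $1 now holds 123
```
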
    1. Instructions ##

47 instructions

==
annotation ann
load constants l32m
loads and stores and copies lb lbu lh lhu lp lw sb sh sp sw cp cpp
arithmetic of ints add sub mul addm
bitwise arithmetic and or xor not shl shrs shru
adding ints to pointers app ap32 ap16 ap8 appm ap32m ap16m ap8m
comparison control flow bne blt bltu beq bnep beqp
other control flow jrls jrlm jrll jy lpc
I/O in inp
interop xlib
misc break impl sinfo
==

(Notation for the instruction tables below) #imm6, #imm8, #imm16, #imm22 are immediate constants using two's complement encoding (#imm6 is 6 bits instead of 8 and #imm22 is 22 bits instead of 24 because op0 can only reach 63, as noted in 'Instruction encoding' above), #imm22u, #imm16u, #imm8u, #imm6u are immediate constants interpreted as unsigned, $X is an integer register, &X is a pointer register, _ is an unused argument that must always be 0 in proper Boot programs (Boot implementations are free to make use of these locations, however), and X is an untyped argument. Immediate constants without a 'u' suffix are signed two's-complement ints.

From left to right, the arguments go into operands op0, op1, op2. Immediate operands are always on the right (the highest-numbered operand). When two immediate operands are combined into an #imm16 (as with instruction lm), op1 is the high-order bits and op2 is the low-order bits (imm16 = (op1 << 8) + op2). Similarly for #imm22 (imm22 = (op0 << 16) + (op1 << 8) + op2).
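
The operand-combining formulas can be sketched in Python as follows. The function names and the `signed` flag are hypothetical; the flag selects between the two's-complement and the unsigned ('u') interpretation of the same bits.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret an unsigned bit pattern of the given width as two's complement."""
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def imm16(op1: int, op2: int, signed: bool = True) -> int:
    """Combine op1 (high-order byte) and op2 (low-order byte) into a 16-bit immediate."""
    raw = (op1 << 8) + op2
    return to_signed(raw, 16) if signed else raw


def imm22(op0: int, op1: int, op2: int, signed: bool = True) -> int:
    """Combine op0 (6 bits), op1 and op2 (8 bits each) into a 22-bit immediate."""
    raw = (op0 << 16) + (op1 << 8) + op2
    return to_signed(raw, 22) if signed else raw
```

For example, `imm16(255, 255)` is -1 when read as signed and 65535 when read as unsigned.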

JREL and branch immediates are in units of bytes in the Boot code. JREL and branch immediates may not jump into the middle of an instruction. Platforms which compile or represent Boot code in memory in ways such that one instruction spans more or less than 4 memory locations must adjust the jrel and branch offsets accordingly before executing them.
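
As an illustration of the adjustment described above, a platform that stores each instruction in some number of slots other than 4 might rescale byte offsets like this. It is a sketch only; `byte_offset_to_slots` and `slots_per_instruction` are hypothetical names.

```python
INSTRUCTION_BYTES = 4  # each Boot instruction is 4 bytes


def byte_offset_to_slots(byte_offset: int, slots_per_instruction: int = 1) -> int:
    """Convert a Boot byte offset into an offset measured in platform slots."""
    if byte_offset % INSTRUCTION_BYTES != 0:
        # Such an offset would land in the middle of an instruction.
        raise ValueError("offset is not a multiple of the instruction size")
    return (byte_offset // INSTRUCTION_BYTES) * slots_per_instruction
```

An interpreter that keeps decoded instructions in a list (one slot each) would turn a byte offset of 8 into 2 list positions, and -8 into -2.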

Mnemonics with a trailing 'm' represent instructions involving an 'iMmediate' (although not all instructions with immediates have a trailing 'm' in the mnemonic).

annotation:

load constants:

register loads and stores and copies:

arithmetic of ints (result always defined and all results mod 2^32):

bitwise arithmetic:

Adding ints to Pointers (only valid on data pointers, not code pointers):

conditional branches:

unconditional jumps and other control flow:

I/O:

interop:

misc:

      1. Notes on certain instructions ###
    1. Arithmetic ## Int32 overflow on addition, subtraction, multiplication wraps around (that is, mathematically the operations are done mod 2^32). Note that the operations add, addm, sub, mul give valid results whether you consider the int32 operands to be signed two's complement or unsigned, as long as you interpret the result the same way.

For example, for multiplication, imagine if we had 3-bit integers instead of 32-bit integers. If we multiply the unsigned representations of 2*3, that is, 010*011, the result is 6, that is, 110. In two's complement, 010 represents 2 and 011 also represents 3, and 110 represents -2; and 2*3 = 6 = -2 mod 2^3. To give another example, if we multiply the unsigned representations of 6*2, that is, 110*010, the result is 12, and 12 mod 2^3 is 4, that is, 100. In two's complement, 110 represents -2 and 010 represents 2, and the result, 100, represents -4; and -2*2 = -4 mod 2^3 = 4. Do note that these are only correct modulo the bitwidth; for example, 2*3 = 6, but mul 010 011 = 110, which when viewed as two's complement yields 2*3 = -2, an incorrect result in ordinary arithmetic, but -2 is equivalent to 6 mod 8, so the result is correct in mod 8 arithmetic. In the examples in this paragraph we used mod 8, but in reality, we are using mod 2^32, not mod 8.

On many platforms, it may be easiest to implement add, sub, addm by viewing the int32s as unsigned integers and then applying unsigned addition, subtraction, because many platforms don't implement wrap-around signed numbers.
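
The wrap-around behavior can be sketched in Python, whose integers are unbounded, by masking every result to the bit width. The helper names are hypothetical; the `bits` parameter exists only so that the 3-bit examples above can be reproduced.

```python
def wrap(value: int, bits: int = 32) -> int:
    """Reduce value mod 2^bits, giving the unsigned bit pattern."""
    return value & ((1 << bits) - 1)


def as_signed(value: int, bits: int = 32) -> int:
    """Read a bit pattern of the given width as signed two's complement."""
    value = wrap(value, bits)
    return value - (1 << bits) if value >> (bits - 1) else value


def add(a: int, b: int, bits: int = 32) -> int:
    return wrap(a + b, bits)


def sub(a: int, b: int, bits: int = 32) -> int:
    return wrap(a - b, bits)


def mul(a: int, b: int, bits: int = 32) -> int:
    return wrap(a * b, bits)


# The 3-bit examples from the text:
assert mul(0b010, 0b011, bits=3) == 0b110 and as_signed(0b110, bits=3) == -2
assert mul(0b110, 0b010, bits=3) == 0b100 and as_signed(0b100, bits=3) == -4
```

The same bit pattern comes out whether the operands are thought of as signed or unsigned; only the final interpretation (`as_signed` or not) differs.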

Note that many arithmetic operations are provided only for integers; the only arithmetic you can do to pointers is add or subtract integers to/from them.

  1. # (Not) mixing integer bitwidths ##

When in registers and being operated upon, the internal representation of int32s is a defined sequence of bits; however, when in memory, the internal representation of integers is opaque. For example, if a memory location x contains a 32-bit integer, and you read it with lh or lhu, the value that is read is unspecified other than that it's no larger than 16 bits. Similarly, if a memory location contains an 8-bit integer and you read it using lw, the value that is read is unspecified other than that it's some int32 (also, reading a byte using lw near the edge of accessible memory will cause undefined behavior if there are fewer than INT32_SIZE memory locations in accessible memory, starting with the location read). You cannot write a sequence of bytes (8-bit integers) into memory and then usefully read it back using lw, and you cannot write a 32-bit integer into memory and then usefully read out its component bytes.

Furthermore there is no guarantee that 32-bit integers occupy more than one memory location, or that larger integer bitwidths occupy more memory locations than smaller; it's possible for both of INT16_SIZE, INT32_SIZE to be identically 1 (this can happen if the implementation chooses to make each single memory location large enough to store 32 bits of data).

The instructions lb, lbu, lh, lhu guarantee that the numbers read into registers are in certain ranges that fit in 8 and 16 bits, respectively. However, lb and lh result in signed two's complement representations in the destination register; note that the bit pattern of a small negative number in a 32-bit register, when coded with signed two's complement, is equivalent to a number larger than 16 bits if interpreted as unsigned. For example, a -1 in a register, signed, would be viewed as 2^32 - 1 = 4294967295 unsigned.
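
The difference between the signed and unsigned loads can be illustrated on the register side. This sketch assumes the value stored was an 8- or 16-bit quantity and models only the resulting 32-bit register pattern; `sign_extend` and `zero_extend` are hypothetical helper names, not Boot instructions.

```python
def sign_extend(value: int, bits: int) -> int:
    """Register pattern after a signed load (like lb/lh) of a `bits`-wide value."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):      # top bit set: the value is negative
        value -= 1 << bits
    return value & 0xFFFFFFFF    # 32-bit two's complement bit pattern


def zero_extend(value: int, bits: int) -> int:
    """Register pattern after an unsigned load (like lbu/lhu) of a `bits`-wide value."""
    return value & ((1 << bits) - 1)


assert zero_extend(0xFF, 8) == 255
assert sign_extend(0xFF, 8) == 4294967295  # -1, viewed as unsigned
```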

Boot guarantees that bytes (8-bit integers) have a size 1 in memory (meaning that values that are stored with sb occupy one memory location). INT8_SIZE is always 1.

  1. # I/O ## If standard console streams STDIN, STDOUT, exist on the platform and are supported by the implementation, they must be devices #0, #1, respectively, and device #2 must be STDERR if it exists, and otherwise should be an alias to STDOUT or may be a null device (one which never emits anything and to which writing has no effect).

An implementation does not have to support INP, OUTP.

  1. # xlib calls ## Number 0 is defined below and 3 thru 127 are RESERVED for extensions. libfn numbers 128 thru 254 are implementation-defined.
      1. xlib 0: halt(result: int32) ###

End program with result code 'result'. The result code is interpreted as signed.

  1. # The Boot Calling Convention ##

Up to 3 integer arguments and up to 3 pointer arguments are passed in registers.

Registers 1, 2, 4, 5 (all banks) are caller-saved. Registers 3, 6, 7 (all banks) are callee-saved.

Pointer register 3 is used as a memory stack pointer when applicable (TODO what does 'when applicable' mean?), otherwise it is callee-saved.

Pointer register 5 is used as a return address pointer/link register. When using xlib, there is no need to set this register; the instruction will set it if needed.

Registers 1, 2, 4 (both integer bank and pointer bank) are used to pass arguments and return values (from lower to higher number get arguments from left to right). Upon making a call, up to 3 integer arguments are in integer registers 1, 2, 4, and up to 3 pointer arguments are in pointer registers 1, 2, 4, and the return address is found in pointer register 5.

Registers 5,6,7 (both banks) are caller-saved scratch registers and may be overwritten and used for any purpose by the callee.

Upon returning from a call, up to 3 integer and up to 3 pointer return values will be found in registers 1,2,4 using the same convention as for calling.

  1. # Undefined behavior and arbitrary values ##

These lists are probably accidentally incomplete right now, but we hope to make them comprehensive as time goes on.

  1. ## Undefined behavior ###

The following are undefined behaviors in Boot. Any program containing undefined behavior on any codepath has undefined behavior as a whole:

      1. Arbitrary values ###

The following do not cause undefined behavior and do not make the whole program invalid, but do not define the resulting values of certain operations:

    1. Boot Assembly ## Boot Assembly is a plaintext syntax for Boot.

Boot Assembly is ASCII text. Each line is processed separately; lines are delimited by the newline character, '\n' (a byte with the value 10). Whitespace is defined as one of the characters: ' \t\n\r\f\v' (where \t indicates tab, \n indicates newline, etc). Lines which are all whitespace, or which begin with a semicolon, are skipped. Trailing whitespace on any line is ignored.

Lines begin with an instruction mnemonic, which consists of lowercase letters and digits and is at most 5 characters long. This is followed by whitespace, followed by a number (a string of digits between 0 and 9, possibly prefixed by one of '-' or '+') denoting the first operand, op0. This may be followed by more whitespace and a second number (op1), and maybe by more whitespace and a third number (op2). This may be followed by whitespace which may be followed by a semicolon. After a semicolon the rest of the line is a comment (all characters are ignored), up to the first newline, which still terminates the line.

Operands are integers in base 10 and may be prefixed by '+' or '-' to indicate sign. An instruction with exactly one operand, where that operand is an unsigned immediate, may have any value from 0 thru 4194303, inclusive, in that operand. An instruction with two operands, where these are unsigned immediates, may have any value from 0 thru 63, inclusive, in the first operand, and any value from 0 thru 65535, inclusive, in the second operand. An instruction with three operands, where these are unsigned immediates, may have any value from 0 thru 63, inclusive, in the first operand and any value from 0 thru 255, inclusive, in each other operand. When an immediate operand's type is signed, the unsigned range 0 to 63 is replaced by the signed range -32 to 31, 0 to 255 by -128 to 127, 0 to 65535 by -32768 to 32767, and 0 to 4194303 by -2097152 to 2097151. Operands of register type must be in the range 0 to 15, inclusive.
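
All of the immediate ranges above follow from the operand's bit width (6, 8, 16 or 22) and its signedness. A small Python sketch, with the hypothetical helper `immediate_range`, makes the pattern explicit:

```python
def immediate_range(bits: int, signed: bool) -> range:
    """Allowed values for an immediate operand of the given width."""
    if signed:
        return range(-(1 << (bits - 1)), 1 << (bits - 1))
    return range(0, 1 << bits)


REGISTER_RANGE = range(0, 16)  # register operands: 0 to 15, inclusive

assert immediate_range(6, signed=False)[-1] == 63
assert immediate_range(22, signed=True)[0] == -2097152
```

An assembler could validate an operand with a membership test such as `value in immediate_range(16, signed=True)`.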

Therefore, instruction lines must match the following regular expression (regex): ^([a-z][a-z0-9]+)(\s+([-+]?[0-9]+))(\s+([-+]?[0-9]+))?(\s+([-+]?[0-9]+))?\s*(;.*)?$

The last line in the file must end in a newline, unless the last line is all whitespace.
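
A line-level parser following the rules above might look like this in Python. It is a sketch rather than a reference assembler: `parse_line` is a hypothetical name, the 5-character mnemonic limit is checked separately because the regex does not enforce it, and no operand ranges or mnemonics are validated.

```python
import re

# The regular expression given above, unchanged.
LINE_RE = re.compile(
    r"^([a-z][a-z0-9]+)(\s+([-+]?[0-9]+))(\s+([-+]?[0-9]+))?(\s+([-+]?[0-9]+))?\s*(;.*)?$"
)


def parse_line(line: str):
    """Return (mnemonic, operands) for an instruction line, or None if skipped."""
    line = line.rstrip(" \t\n\r\f\v")        # trailing whitespace is ignored
    if not line or line.startswith(";"):     # blank line or comment line
        return None
    m = LINE_RE.match(line)
    if m is None or len(m.group(1)) > 5:
        raise ValueError(f"not a valid Boot Assembly line: {line!r}")
    operands = [int(g) for g in (m.group(3), m.group(5), m.group(7)) if g is not None]
    return m.group(1), operands
```

For instance, `parse_line("add 1 2 3 ; sum\n")` yields `("add", [1, 2, 3])`, while a comment-only line yields `None`.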

  1. # Reserved and implementation-defined items ##
      1. Reserved for extensions, vs. implementation-defined, vs. reserved for future use ### There is a distinction between items that are reserved for extensions, vs. items that are implementation-defined. The former are expected to be defined in extension languages such as BootX. By contrast, implementation-defined items are allocated for the implementation to do what it wishes and are not expected to be used by extensions such as BootX.

Another category is items which are reserved for future use. These items are reserved for use in future versions of Boot itself, and should not be used by either extensions or by implementations.

Implementations must not define or use items which are reserved for extensions, or items which are reserved for future use; if they do so, they risk incompatibility with extensions or future Boot versions. Extension languages must not define or use items which are defined in Boot to be implementation-defined.

  1. ## Reserved instruction encoding space ### The limitation of op0 to have zeros in the two most-significant bits is intended to allow Boot to be made a part of other instruction formats which use zero values in these bits to indicate a Boot instruction, and non-zero values to indicate something else (for example, instructions of different lengths). That is to say, instructions with a 1 in either of the two most-significant bits in op0 are reserved for extensions.
      1. Reserved instruction opcodes ### Boot only defines opcodes under 64; higher opcodes are reserved for extensions. Opcode 62 is implementation-defined. Opcode 63 is reserved for future use.
      2. Reserved sinfo query types ### Boot defines sinfo query numbers 0-3. Queries 247 thru 254 inclusive are implementation-defined. Other query numbers are reserved for extensions.
      3. Reserved xlib functions ### Boot defines no xlib functions, but reserves numbers 0 thru 127 for extensions. Numbers 128 thru 254 are implementation-defined.
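
The opcode and sinfo partitions stated above can be written down as simple classifiers. This is a Python sketch with hypothetical function names; it only restates the numeric ranges given in this section and does not say which opcodes below 62 are actually assigned.

```python
def classify_opcode(opcode: int) -> str:
    """Classify an opcode number according to the ranges stated above."""
    if opcode >= 64:
        return "reserved for extensions"
    if opcode == 63:
        return "reserved for future use"
    if opcode == 62:
        return "implementation-defined"
    return "Boot opcode space"


def classify_sinfo_query(query: int) -> str:
    """Classify a sinfo query number according to the ranges stated above."""
    if 0 <= query <= 3:
        return "defined by Boot"
    if 247 <= query <= 254:
        return "implementation-defined"
    return "reserved for extensions"
```
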
    1. Misc ## Instruction mnemonics are at most 5 characters, and are all lowercase alphanumeric. Mnemonics begin with an alphabetic character and are followed by one or more alphanumeric characters.

Note that the opcodes (see table below) have the following properties:

Note that later additions that assign instruction(s) to the RESERVED opcode may break the 'only such' parts of these properties.

The short descriptions under Instructions, above, have been kept to at most 80 characters per line.

  1. # Opcodes and argument types ##

Type identifiers in the following table:

Opcode and argument type table:


  1. # TODO ##

later todos (transfer to other file):