Boot Extended (Oot BootX) reference

:hardbreaks:
:toc: macro
:toclevels: 0

Version: unreleased/in-progress/under development (0.0.0-0)

BootX extends Boot with more instructions (for a total of about 60), as well as concurrency semantics. Many of the new instructions are functions that could be written as subroutines in Boot, but some of them are OS or VM services.

This document extends the document boot_reference.adoc.

Instruction encoding

The instruction encoding is mixed-length.

Encodings are little-endian on disk, over the network, or when presented in serialized form, and are typically stored in host byte order in memory.

The least-significant bits of the instruction indicate the instruction's width in the following manner:

If the least-significant bit is 0, the instruction is 16 bits wide
If the least-significant two bits are 01, the instruction is 8 bits wide
If the least-significant three bits are 011, the instruction is 32 bits wide
If the least-significant three bits are 111, the instruction is encoded in some alternate extension encoding
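
The width rules above can be checked mechanically. The following is a minimal sketch in Python; the function name `instruction_width_bits` is hypothetical, and it assumes the caller passes the byte that holds the least-significant bits of the instruction (the first byte in the little-endian serialized form).

```python
def instruction_width_bits(first_byte):
    """Return the instruction width in bits, or None for the extension encoding.

    Assumption: first_byte holds the least-significant 8 bits of the instruction.
    """
    if first_byte & 0b1 == 0b0:
        return 16
    if first_byte & 0b11 == 0b01:
        return 8
    if first_byte & 0b111 == 0b011:
        return 32
    # The remaining case is low bits 111: some alternate extension encoding.
    return None


for sample in (0b00000100, 0b00000101, 0b00001011, 0b00000111):
    print(format(sample, "08b"), instruction_width_bits(sample))
```

The checks are ordered from the shortest bit pattern to the longest, so each test only needs to rule out the patterns that were not already matched.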

The format and semantics of the 16-bit instructions are identical to Boot.

The encoding format of the 8-bit instructions is (from LSBit to MSBit):

2 format bits
6 opcode bits

Opcodes 0 thru 31 (inclusive) are available for implementation-dependent or user use. The last 32 opcodes (32 thru 63) are RESERVED. Our intention is to wait until we have a number of BootX programs to analyze, and then define the 32 RESERVED 8-bit opcodes as 'shortcuts' for common sequences of one or more 16- or 32-bit instructions (dictionary compression).

The encoding format of the 32-bit instructions is (from LSBit to MSBit):

3 format bits
3 address mode bits for the opcode
5 operand_value bits for the opcode
3 address mode bits for the op2
4 operand_value bits for op2
3 address mode bits for the op1
4 operand_value bits for op1
3 address mode bits for the op0
4 operand_value bits for op0
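
As an illustration of the field layout listed above, here is a minimal sketch in Python that packs and unpacks a 32-bit instruction word. The names `FIELDS`, `encode32`, and `decode32` are hypothetical; only the field order and widths come from this document.

```python
# (name, width in bits), listed from least-significant to most-significant bit.
FIELDS = [
    ("format", 3),
    ("opcode_mode", 3), ("opcode_value", 5),
    ("op2_mode", 3), ("op2_value", 4),
    ("op1_mode", 3), ("op1_value", 4),
    ("op0_mode", 3), ("op0_value", 4),
]


def encode32(fields):
    """Pack a dict of field values into one 32-bit instruction word."""
    word = 0
    shift = 0
    for name, width in FIELDS:
        value = fields[name]
        if not 0 <= value < (1 << width):
            raise ValueError("field %s does not fit in %d bits" % (name, width))
        word |= value << shift
        shift += width
    return word


def decode32(word):
    """Unpack a 32-bit instruction word into a dict of field values."""
    fields = {}
    shift = 0
    for name, width in FIELDS:
        fields[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return fields


example = {
    "format": 0b011,                      # marks a 32-bit instruction
    "opcode_mode": 0, "opcode_value": 19,
    "op2_mode": 1, "op2_value": 5,
    "op1_mode": 1, "op1_value": 6,
    "op0_mode": 0, "op0_value": 7,
}
print(hex(encode32(example)))
```

Driving both directions from one table keeps the packer and unpacker in agreement if the field order changes, as the todo below contemplates.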

(todo: maybe op0 should be least-significant, since in the instructions it's op0 which tends to be an input, so we want this to be loaded first)

The instruction 'ldi-i16' could be considered to have its own encoding format; 5 bits of the operands are used to determine an output target in a unique way.

When operands are used as 'subopcodes' using the INSTR- instructions, then the address mode bits and the operand_value bits are combined for the subopcode (so these subopcodes are 7 bits, not 5).

N-bit instructions must be N-bit aligned. Branch and Jump targets are in units of N-bits. So, for example, 8-bit instructions must be 8-bit aligned, 16-bit instructions must be 16-bit aligned, 16-bit instruction branch and jump targets count in units of 16-bits, 32-bit instruction branch and jump targets count in units of 32-bits.
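
The unit conversion described above can be sketched as follows. This is an illustration only: the helper names are hypothetical, and it assumes the instruction stream is addressed in 8-bit bytes.

```python
def target_byte_offset(target_units, width_bits):
    """Convert a branch or jump target counted in N-bit units into bytes."""
    if width_bits not in (8, 16, 32):
        raise ValueError("unsupported instruction width")
    return target_units * (width_bits // 8)


def is_aligned(byte_address, width_bits):
    """Check that an N-bit instruction starts on an N-bit boundary."""
    return byte_address % (width_bits // 8) == 0


print(target_byte_offset(3, 16))  # a 16-bit instruction counts in 2-byte units
print(target_byte_offset(3, 32))  # a 32-bit instruction counts in 4-byte units
```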

Registers

Registers are as in Boot, except that in displacement or indexed addressing modes, a base register of 0 refers not to the zero register, but rather to the PC.

TODO: in BootX, the registers can each hold a 64-bit quantity, or a floating-point quantity -- however, coercing these without conversion has undefined results

TODO: change (in Boot!) to separate reg banks for ints and ptrs

Addressing modes

An "addressing mode" tells how to interpret an "operand_value" to specify how to obtain an input for an instruction, or how to place an output.

There are 8 addressing modes. From 0 to 7, they are:

immediate
register direct
register indirect
smallstack
constant
displacement
indexed
predecrement/postincrement

Sometimes an instruction input value (computed by interpreting an operand_value with an addressing mode) is called an 'effective value'. Sometimes an address (computed by interpreting an operand_value with an addressing mode) in which an instruction output is to be placed is called an 'effective address'; 'address' is used here loosely since registers and smallstack are included in potential 'effective addresses'.

In the following, the bits of an 'operand_value' are interpreted as an unsigned integer.

Here's what each addressing mode means. First we'll give an English-language summary, and then a formal definition in pseudocode.

If an effective address is given but not an effective value, then the effective value is the value found at the effective address.

immediate: when used as an input, the effective value is just the operand_value itself. When used as an output, this indicates that the value produced by the instruction will be sent to a special I/O output; see section IMMEDIATE ADDRESSING MODE DEVICE OUTPUTS.
register direct: the operand_value indicates a register; the effective address is that register
register indirect: the operand_value indicates a register, that register holds a pointer, and the effective address is that pointer
smallstack: like register direct, but the 16 locations within SMALLSTACK are used as registers
constant: like immediate, except that the operand_value is replaced by the n-th constant, where n is the operand_value
displacement: the bits in the operand_value are split in half. The upper half is the base register, and the lower half is the displacement, which is a non-negative integer interpreted in units of words. The base register contains a pointer, and the displacement is added to this pointer to get the effective address. If the base register is 0, then it is interpreted as the PC (program counter), rather than as constant 0.
indexed: the bits in the operand_value are split in half. The upper half is the base register. The number 2 is added to the lower half, and the result is the index register. The base register contains a pointer and the index register contains a non-negative integer interpreted in units of words. The contents of the base register and the index register are added together, and the result is the effective address. If the base register is 0, then it is interpreted as the PC (program counter), rather than as constant 0.
predecrement/postincrement: similar to register indirect, except when used as an input, the contents of the indicated register are incremented after reading (postincrement), and when used as an output, the contents of the indicated register are decremented before writing (predecrement).

In the above, whenever the operand_value is interpreted as a register number and that number is 2, smallstack is used instead of a register: a write to the register becomes a push onto smallstack, and a read from the register becomes a pop from smallstack.

Now we'll define the addressing modes more formally. A pseudocode notation based on Python is used. 'addressing_mode' is an integer between 0 and 7, inclusive, and operand_value is an unsigned (non-negative) integer.

....

# for when the operand is used as an input
def get(addressing_mode, operand_value):

  if addressing_mode == IMMEDIATE_ADDR_MODE:
      return operand_value

  if addressing_mode == REGISTER_DIRECT_ADDR_MODE:
      if operand_value == 2:
        return pop_smallstack()
      else:
        return registers[operand_value]

  if addressing_mode == REGISTER_INDIRECT_ADDR_MODE:
      if operand_value == 2:
        effective_address = pop_smallstack()
      else:
        effective_address = registers[operand_value]
      
      return memory[effective_address]

  if addressing_mode == STACK_ADDR_MODE:
      return smallstack_as_array[operand_value]

  if addressing_mode == CONSTANT_ADDR_MODE:
      return constants[operand_value]

  if addressing_mode == DISPLACEMENT_ADDR_MODE:
      base_address_register = operand_value >> 2
      displacement = operand_value & 3  # an immediate offset in words, not a register number

      if base_address_register == 0:
        base_address = pc  # base register 0 denotes the PC, not the zero register
      elif base_address_register == 2:
        base_address = pop_smallstack()
      else:
        base_address = registers[base_address_register]

      effective_address = base_address + displacement
      return memory[effective_address]

  if addressing_mode == INDEXED_ADDR_MODE:
      base_address_register = operand_value >> 2
      index_register = (operand_value & 3) + 2

      if index_register == 2:
        index = pop_smallstack()
      else:
        index = registers[index_register]

      if base_address_register == 0:
        base_address = pc  # base register 0 denotes the PC, not the zero register
      elif base_address_register == 2:
        base_address = pop_smallstack()
      else:
        base_address = registers[base_address_register]

      effective_address = base_address + index
      return memory[effective_address]

  if addressing_mode == PREDECREMENT_POSTINCREMENT_ADDR_MODE:
      if operand_value == 2:
        # todo what should we do here?
        return pop_smallstack()
      else:
        result = memory[registers[operand_value]]
        registers[operand_value] = registers[operand_value] + 1
        return result

# for when the operand is used as an output
def put(addressing_mode, operand_value, value):

  if addressing_mode == IMMEDIATE_ADDR_MODE:
      do_special_io_output(operand_value, value)

  if addressing_mode == REGISTER_DIRECT_ADDR_MODE:
      if operand_value == 2:
        push_smallstack(value)
      else:
        registers[operand_value] = value

  if addressing_mode == REGISTER_INDIRECT_ADDR_MODE:
      if operand_value == 2:
        effective_address = pop_smallstack()
      else:
        effective_address = registers[operand_value]
      
      memory[effective_address] = value

  if addressing_mode == STACK_ADDR_MODE:
      smallstack_as_array[operand_value] = value

  if addressing_mode == CONSTANT_ADDR_MODE:
      memory[constants[operand_value]] = value

  if addressing_mode == DISPLACEMENT_ADDR_MODE:
      base_address_register = operand_value >> 2
      displacement = operand_value & 3  # an immediate offset in words, not a register number

      if base_address_register == 0:
        base_address = pc  # base register 0 denotes the PC, not the zero register
      elif base_address_register == 2:
        base_address = pop_smallstack()
      else:
        base_address = registers[base_address_register]

      effective_address = base_address + displacement
      memory[effective_address] = value

  if addressing_mode == INDEXED_ADDR_MODE:
      base_address_register = operand_value >> 2
      index_register = (operand_value & 3) + 2

      if index_register == 2:
        index = pop_smallstack()
      else:
        index = registers[index_register]

      if base_address_register == 0:
        base_address = pc  # base register 0 denotes the PC, not the zero register
      elif base_address_register == 2:
        base_address = pop_smallstack()
      else:
        base_address = registers[base_address_register]

      effective_address = base_address + index
      memory[effective_address] = value

  if addressing_mode == PREDECREMENT_POSTINCREMENT_ADDR_MODE:
      if operand_value == 2:
        # TODO what should this be?
        push_smallstack(value)
      else:
        registers[operand_value] = registers[operand_value] - 1
        memory[registers[operand_value]] = value
....

Note that the use of addressing modes sometimes causes side-effects (that is, persistent changes or 'mutations' of system state in addition to the actual production of the input value or placement of the output value).

When there are addressing mode side-effects for inputs to an instruction, inputs are processed in the following order: first op0, then op1, then op2. All inputs are processed before any outputs. When there are addressing mode side-effects for multiple outputs to an instruction or when the same operand is both an input and an output, the ordering and multiplicity of addressing mode side-effects may vary depending on the instruction; refer to the instruction documentation.

If not otherwise stated, any side-effects of an instruction that are not addressing mode side-effects (for example, pushing a result onto smallstack) are applied before any addressing mode output side-effects.

For conditional instructions, side-effects are applied to all input operands regardless of whether the condition is satisfied.
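
To make the input ordering rule concrete, here is a small sketch in Python of an instruction whose op0 and op1 both use postincrement mode on the same register. The register number, addresses, and memory contents are made up for illustration; the increment of 1 follows the pseudocode for get().

```python
registers = {5: 100}            # hypothetical: register 5 holds a pointer
memory = {100: 11, 101: 22}     # hypothetical memory contents


def get_postincrement(register_number):
    """Read through the register, then increment it (postincrement input)."""
    value = memory[registers[register_number]]
    registers[register_number] += 1
    return value


def read_inputs(operand_registers):
    """Process inputs in the documented order: op0 first, then op1, then op2."""
    return [get_postincrement(r) for r in operand_registers]


# op0 and op1 both name register 5, so op0 sees the first word and op1 the second.
op0, op1 = read_inputs([5, 5])
print(op0, op1, registers[5])
```

Because op0 is processed first, it observes the pointer before any increment, and op1 observes the pointer after op0's side-effect has been applied.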

Immediate addressing mode device outputs

Here is where the result of an instruction is sent when the output operand is in IMMEDIATE mode:

0: /dev/null; that is to say, the result of the instruction is discarded
1: STDOUT, if it exists. This output is byte-oriented (see below)
2: STDERR, if it exists. This output is byte-oriented (see below)
3: log with loglevel 0, if logging is supported
4: log with loglevel 1, if logging is supported
5: log with loglevel 2, if logging is supported
6: assert == 0; if the value being output is not 0, this is a critical error
7: assert == 1; if the value being output is not 1, this is a critical error
8-13: RESERVED
14-15: implementation-dependent

(todo: is it a good idea to intermix 'real' stuff like STDOUT with debugging stuff like STDERR? this prevents compilers from easily removing all of the debugging output instructions)

For byte-oriented outputs, sending a value between 0 and 255 (inclusive) produces one byte of output. Sending a value less than 0 or greater than 255 results in either a critical error or in zero or more bytes of output; whether there is an error, how many bytes are output, and what those bytes contain are implementation-dependent.
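
The device table above could be modeled roughly as follows. This is a sketch under stated assumptions, not a definitive implementation: the `Devices` class and `CriticalError` exception are hypothetical, output is captured in in-memory buffers, and this sketch chooses to raise a critical error both for out-of-range bytes and for RESERVED or implementation-dependent devices (other choices are permitted by the text).

```python
import io


class CriticalError(Exception):
    """Hypothetical stand-in for a BootX critical error."""


class Devices:
    def __init__(self):
        self.stdout = io.BytesIO()
        self.stderr = io.BytesIO()
        self.log = []  # list of (loglevel, value) pairs

    def _write_byte(self, stream, value):
        # Assumption: this sketch treats out-of-range values as a critical error.
        if not 0 <= value <= 255:
            raise CriticalError("value %d is not a byte" % value)
        stream.write(bytes([value]))

    def do_special_io_output(self, device, value):
        if device == 0:
            return  # /dev/null: discard the result
        if device == 1:
            self._write_byte(self.stdout, value)
        elif device == 2:
            self._write_byte(self.stderr, value)
        elif 3 <= device <= 5:
            self.log.append((device - 3, value))
        elif device in (6, 7):
            expected = device - 6  # device 6 asserts == 0, device 7 asserts == 1
            if value != expected:
                raise CriticalError("assertion failed: %d != %d" % (value, expected))
        else:
            # Assumption: devices 8-15 are unsupported in this sketch.
            raise CriticalError("device %d is not supported" % device)


devices = Devices()
for byte in (72, 105):
    devices.do_special_io_output(1, byte)
print(devices.stdout.getvalue())
```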

Opcode addressing modes

The following opcodes are mapped to ordinary instructions, with the mapping given in the tables below:

when the addressing mode is IMMEDIATE
when the operand_value is 16 or greater, and the addressing mode is REGISTER_DIRECT, REGISTER_INDIRECT, STACK, DISPLACEMENT, INDEXED, or PREDECREMENT_POSTINCREMENT

This gives us a total of 128 three-operand instructions.

When the addressing mode is CONSTANT, or when the operand_value is less than 16 and the addressing mode is REGISTER_DIRECT, REGISTER_INDIRECT, STACK, DISPLACEMENT, INDEXED, or PREDECREMENT_POSTINCREMENT, the instruction is treated as a function call. The opcode is processed as an ordinary operand read according to its addressing mode, and the result is assumed to be a pointer to a function. The return address (the codepointer pointing to the instruction after the function call) is pushed onto SMALLSTACK. Then the other three operands of the function call instruction are processed as input operands, and the values read are pushed onto SMALLSTACK in the order op2, op1, op0 (so the top of SMALLSTACK now contains op0). Finally, a jump is executed to the function pointer.
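
The split between ordinary instructions and function calls can be sketched in Python as below. The function names are hypothetical; the mode numbers follow the 0 to 7 ordering given in the Addressing modes section, and SMALLSTACK is modeled as a Python list whose last element is the top.

```python
(IMMEDIATE, REGISTER_DIRECT, REGISTER_INDIRECT, STACK,
 CONSTANT, DISPLACEMENT, INDEXED, PREDECREMENT_POSTINCREMENT) = range(8)


def classify_opcode(addressing_mode, operand_value):
    """Return "instruction" or "call" for a 3-bit mode and 5-bit operand_value."""
    if addressing_mode == IMMEDIATE:
        return "instruction"
    if addressing_mode == CONSTANT:
        return "call"
    return "instruction" if operand_value >= 16 else "call"


def push_call_frame(smallstack, return_address, op0, op1, op2):
    """Push the return address, then op2, op1, op0, so op0 ends up on top."""
    smallstack.append(return_address)
    for value in (op2, op1, op0):
        smallstack.append(value)


instruction_count = sum(
    1
    for mode in range(8)
    for value in range(32)
    if classify_opcode(mode, value) == "instruction"
)
print(instruction_count)  # 32 IMMEDIATE opcodes plus 16 in each of 6 other modes
```

Counting the opcode space this way reproduces the total of 128 three-operand instructions stated above.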

'constant' type arguments

When an instruction takes a 'constant' argument (type 'c' in the tables below), the addressing mode bits are not interpreted as an addressing mode, and instead are concatenated with the operand_value bits to form an immediate constant (the addressing mode bits are the most-significant bits). Therefore these immediate constants are 7 bits, not just 4.
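
As a small worked example of the concatenation described above, the following Python sketch forms the 7-bit constant from the two fields. The function name `constant_argument` is hypothetical, and the result is treated as unsigned here; how signed constants are represented is not specified in this section.

```python
def constant_argument(addressing_mode_bits, operand_value_bits):
    """Concatenate 3 mode bits (most significant) with 4 operand_value bits."""
    if not 0 <= addressing_mode_bits < 8:
        raise ValueError("addressing mode must fit in 3 bits")
    if not 0 <= operand_value_bits < 16:
        raise ValueError("operand_value must fit in 4 bits")
    return (addressing_mode_bits << 4) | operand_value_bits


print(constant_argument(0b101, 0b0011))  # binary 1010011
```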

TODO: get rid of stack-aware operands; make predec/postinc addr mode register 2 push/pop from smallstack; leave register direct register 2 RESERVED for now.

Instructions

The 32 three-operand instructions in opcode addressing mode IMMEDIATE:

0. annotate: c c c (can be ignored)
1. INSTR-TWO-ARITH2: c ? ? (used to encode two-operand instructions)
2. INSTR-TWO-BOOTX-1: c ? ? (used to encode two-operand instructions)
3. INSTR-TWO-BOOTX-2: c ? ? (used to encode two-operand instructions)
4. jrel: c c c (unconditional relative jump by a constant signed amount)
5. ldi: c c oi (load immediate 14-bit int)
6. ldabs: c c o (2-operand absolute instruction format; load from absolute 14-bit memory location; negative locations are implementation-dependent, positive locations are slots from memory data segment start (if any))
7. stabs: c c i (2-operand absolute instruction format; store to absolute 14-bit memory location; negative locations are implementation-dependent, positive locations are slots from memory data segment start (if any))
8. ld: c o ip (load from memory address plus unsigned constant)
9. st: c ip i (store from register to memory address plus unsigned constant)
10. bne-int: c ii ii (branch-if-not-equal on ints)
11. ble-int: c ii ii (branch if less-than-or-equal on ints)
12. bne-ptr: c ip ip (branch-if-not-equal on pointers)
13. ble-ptr: c ip ip (branch if less-than-or-equal on pointers)
14. addi-int: c io ii (in-place addition of ints and immediate constant)
15. addi-ptr-int: c iop ii (in-place addition of ints and immediate constant to ptr)
16. sll: c io ii (shift left logical (multiplication by 2^c (mod MAX_INT+1)))
17. srl: c io ii (shift right logical (division by 2^c, rounding towards zero))
18. sra: c io ii (shift right arithmetic (division by 2^c, rounding towards negative infinity))
19. add-int: oi ii ii (addition of ints)
20. add-ptr-int: op ip ii (add an int to a pointer)
21. bitor: oi ii ii (bitwise OR)
22. bitand: oi ii ii (bitwise AND)
23. bitxor: oi ii ii (bitwise XOR)
24. sub-ptr: op ip ip (subtraction of pointers)
25. mul-int: oi ii ii (integer multiply)
26. add-f: of if if (float addition)
27. csel: o i i (if top of SMALLSTACK is zero, then CPY op1 into op2, otherwise CPY op0 into op2)
28. pop-from-one-stack-push-to-another-multi: sio sio ii (operands are the two stacks, and the number of items to move between them) (note: if either stack is in memory, atomicity may not be supported; see SYSINFO, todo)
29.
30.
31. CAS: iop i i

(todo: the instruction ordering rules haven't been enforced beyond this point)

The 16 three-operand instructions in opcode addressing mode REGISTER_DIRECT:

16. mul-f: of if if (float division)
17. eq-i: oi ii ii (test for equality, resulting in 0 or 1)
18. neq-i: oi ii ii
19. leq-i: oi ii ii
20. le-i: oi ii ii
21. eq-f: oi if if
22. neq-f: oi if if
23. leq-f: oi if if
24. le-f: oi if if
25. div-f: of if if (float division)
26. min-f: of if if (float min)
27. max-f: of if if (float max)
28. mul-f: of if if (float multiplication)
29. copysign-f: of if if (float op1's sign is changed to op2's sign)
30. negcopysign-f: of if if (float op1's sign is changed to the negation of op2's sign)
31. xorsign-f: of if if (float op1's sign is changed to xor of op1's and op2's signs)

The 16 three-operand instructions in opcode addressing mode REGISTER_INDIRECT:

16. loadc: c c o (load constant; the implementation of constant storage is up to the implementation, but one simple idea is to put them in the space before the start of the program)
17. jmp: c c c (jump to absolute address. Positive locations are bytes in the BootX instruction stream relative to program start. The interpretation of negative locations is implementation-dependent)
18. beq-int: c ii ii
19. blt-int: c ii ii
20. beq-ptr: c ii ii
21. blt-ptr: c ii ii
22. lev: op ii ii (load effective value; takes an addressing mode and an operand_value, and returns an effective value, possibly causing side-effects)
23. lea: op ii ii (load effective address; takes an addressing mode (other than immediate, register, stack, or constant) and an operand_value (other than 2), and returns an effective address, possibly causing side-effects)
24. ld-postincrement-scaled: o iop ii (load from op1, then add op0 to op1)
25. st-predecrement-scaled: iop ii i (add op1 to op2, then store op0 to the (new) op2)
26. ld-idx: o ip ii (load from (memory address plus integer))
27. st-idx: ip ii i (store to (memory address plus integer))
28. push-smallstack-idx-scaled: ip ii ii (push onto SMALLSTACK from (memory address plus index*scale))
29. pop-smallstack-idx-scaled: ip ii ii (pop from SMALLSTACK into (memory address plus index*scale))
30. mac-int: ioi ii ii (multiply-accumulate on integer: op2 = op2 + op1*op0)
31. mac-ptr: iop ii ii (multiply-accumulate on pointer: op2 = op2 + op1*op0)

The 16 three-operand instructions in opcode addressing mode STACK:

16. sub-f: of if if (float subtraction)
17. ld-multi: o ii ip (load multiple items starting at the pointer in op0 into a series of places starting with op2, where the number of items is op1) (this may or may not be atomic on a given platform; see SYSINFO, todo)
17. st-multi: op ii i (store multiple items starting at op0 into a series of memory locations starting with op2, where the number of items is op1) (this may or may not be atomic on a given platform; see SYSINFO, todo)
18. ldi-i16: c c c (load signed 16-bit immediate into indicated register or stack location (the most significant bit is register (0) (register addr mode) or smallstack (smallstack addr mode) (1), and the next most significant 4 bits are the register number or smallstack location; selecting register 2 (register/stack bit is 0, location is 2) indicates a push to smallstack; another way of looking at this is that op2 is donating 2 bits of its addr mode to the concatenated constant))
19. ccpy-nz: o i i (conditional copy; if op0 is non-zero, copy op1 to op2)
20. ccpy-z: o i i (conditional copy; if op0 is zero, copy op1 to op2)
21.
22. cadd-nz-int: ioi ii ii (conditional add; if op0 is non-zero, then op2 = op2 + op1)
23. cadd-z-int: ioi ii ii (conditional add; if op0 is zero, then op2 = op2 + op1)
24. cadd-nz-ptr: iop ii ii
25. cadd-z-ptr: iop ii ii
26. cneg-nz: oi ii ii (conditional negate; if op0 is non-zero, then op2 = -op1)
27. cneg-z: oi ii ii (conditional negate; if op0 is zero, then op2 = -op1)
28. cbitnot-nz: oi ii ii (conditional bitnot; if op0 is non-zero, then op2 = bitnot(op1))
29. cbitnot-z: oi ii ii (conditional bitnot; if op0 is zero, then op2 = bitnot(op1))
30. csel-pushsmallstack: i i i (conditional select and push to smallstack; if op0 is 0, then push op1 to smallstack; otherwise push op2)
31. csel-multi: o i ii (take 2 items from op1 in the same manner as st-multi, and conditionally select; if op0 is 0, then copy the first item to op2; otherwise copy the second item to op2)

The 16 three-operand instructions in opcode addressing mode DISPLACEMENT:

16. ld-idx2: o ip ii (multiply the index op0 by 2, then execute ld-idx)
17. ld-idx4: o ip ii (multiply the index op0 by 4, then execute ld-idx)
18. ld-idx8: o ip ii (multiply the index op0 by 8, then execute ld-idx)
19. st-idx2: ip ii i (multiply the index op1 by 2, then execute st-idx)
20. st-idx4: ip ii i (multiply the index op1 by 4, then execute st-idx)
21. st-idx8: ip ii i (multiply the index op1 by 8, then execute st-idx)
22.
23.
24.
25.
26.
27.
28.
29.
30.
31.

The 16 three-operand instructions in opcode addressing mode INDEXED:

16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31.

The 16 three-operand instructions in opcode addressing mode PREDECREMENT_POSTINCREMENT:

16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31.

The 16 two-operand INSTR-TWO-BOOTX-1 instruction opcodes and mnemonics and operand signatures are:

0. INSTR-ONE-BOOTX: c ? (used to encode one-operand instructions)
1.
2. pickk: c sio (pick c on stack)
3. rollk: c sio (roll c on stack)
4. csel-stack: sio ii (pop two items from stack op1; if op0 == 0, push the item that was popped first back onto op1, otherwise push the second item)
5. sysinfo: c o (query system metadata)
6. pop: o sim
7. push: som i
8. cpy: o i (copy from register to register)
9. neg: oi ii (arithmetic negation)
10. bitnot: oi ii (bitwise negation)
11. neg-f: of if (arithmetic negation of float)
12. cvt-int-f: of ii
13. fclass-f: oi if (classify a floating point number; see eg FCLASS in RISC-V, eg FCLASS.S in section 8.9)
14.
15.

The 16 two-operand INSTR-TWO-BOOTX-2 instruction opcodes and mnemonics and operand signatures are:

0. sqrt-f
1. roundeven-f-int ('round' in wasm)
2. trunc-f-int
3. ceil-f-int
4. roundawayfrom0-f-int ('round' in C)
5. sqrt-f
6.
7.
8.
9.
10. malloc: op ii
11. mrealloc: iop ii
12. cret-nz: sio ii (if op0 is nonzero, pop CODEPTR from indicated stack, then jmp to it)
13. cret-z: sio ii (if op0 is zero, pop CODEPTR from indicated stack, then jmp to it)
14.
15.
16.

The 16 two-operand INSTR-TWO-BOOTX-3 instructions are:

4. pushselif-eq: i i (pop two items from smallstack; if item0 == item1, push op0 to op2, otherwise, push op1 to op2)
5. pushselif-ne: i i (pop two items from smallstack; if item0 != item1, push op0 to op2, otherwise, push op1 to op2)
6. pushselif-le: i i (pop two items from smallstack; if item0 <= item1, push op0 to op2, otherwise, push op1 to op2)
7. pushselif-lt: i i (pop two items from smallstack; if item0 < item1, push op0 to op2, otherwise, push op1 to op2)
8.
9.
10.
11.
12.
13.
14.
15.

The 16 INSTR-ONE-BOOTX one-operand instruction opcodes and mnemonics and operand signatures are:

0. instr-syscall: c (used to encode zero-operand misc instructions)
1. instr-syscall2: c
2.
3.
4.
5.
6.
7.
8.
9.
10. ret: sio (pop codeptr from indicated stack, then jmp to it)
11.
12. jd: ipp (dynamic jump)
13.
14. mdealloc: ip
15.

The 16 SYSCALL2 zero-operand instructions are:

0.
1. exec
2. get
3. put
4. spawn
5. pctrl (process control, eg join/wait, kill, etc -- or should these each be separate?)
6. time
7. rand
8. environ
9. getpid
10. signal (?? not sure if we want to do it this way -- signal handler setup)
11. create
12. delete
13. floor-f-int
14.
15. divmod-int: (on SMALLSTACK; consume 2 items and push quotient, then push remainder)

The 16 SYSCALL zero-operand instructions are the same as in Boot.

Semantics

Concurrency semantics

In shared memory (memory allocated with malloc-shared), LOAD, STORE, CAS, and atomics (the operations labeled -atomic) are atomic within the chunk allocated by malloc in one allocation. All implicit loads and stores of single Boot words (ints or pointers) are tear-free/atomic. Nothing else in shared memory is guaranteed to be atomic. Some or all of the -atomic operations may not be available on all implementations. Some implementations may implement some atomics, or even all memory accesses to shared memory, via a lock at the granularity of the entire region allocated by one call to malloc-shared.
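
The lock-per-allocation strategy mentioned above could look roughly like this. It is a sketch under stated assumptions: the `SharedChunk` class and `atomic_increment` helper are hypothetical, one object stands for the region returned by one call to malloc-shared, and the convention that CAS reports success as a boolean is an assumption (this document does not specify what CAS returns).

```python
import threading


class SharedChunk:
    """Hypothetical model of one malloc-shared region guarded by a single lock."""

    def __init__(self, size_in_words):
        self._words = [0] * size_in_words
        self._lock = threading.Lock()

    def load(self, index):
        with self._lock:
            return self._words[index]

    def store(self, index, value):
        with self._lock:
            self._words[index] = value

    def cas(self, index, expected, new):
        """Compare-and-swap; returns True if the swap happened (assumed convention)."""
        with self._lock:
            if self._words[index] == expected:
                self._words[index] = new
                return True
            return False


def atomic_increment(chunk, index):
    """Build an atomic increment out of LOAD and CAS by retrying on conflict."""
    while True:
        old = chunk.load(index)
        if chunk.cas(index, old, old + 1):
            return old + 1


chunk = SharedChunk(1)
workers = [
    threading.Thread(target=lambda: [atomic_increment(chunk, 0) for _ in range(500)])
    for _ in range(4)
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
print(chunk.load(0))
```

A single lock per region is the coarsest granularity the text allows; it makes every access to the region atomic at the cost of serializing accesses to unrelated words within it.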

todos

go thru 180328 and copy the relevant todos into here
go thru 180328 and copy the useful stuff into here. I already copied the second block of instruction listings (but not yet the todos in that section) (by second block of instruction listings, i mean the one starting with "The 32 three-operand instruction opcodes and mnemonics and operand signatures are (note: in 32-bit instructions, constants ('c') are 7 bits, not 3 or 4 bits, because the addressing mode is treated as part of the constant):")
http://mrmgroup.cs.princeton.edu/papers/ctrippel_ASPLOS17.pdf describes some issues with the RISC-V memory consistency model; do these issues apply to us too? Ppl say that RISC-V is taking these comments into account and fixing the issues; what did they end up changing?
generalize FORKDONE. Probably using DEVOP.
consider stream versus packet (erlang style channel with mailbox?)
should DEVOP 15 or mb DEVOP 0 return a device type for any device? should DEVOP 15 at least be RESERVED?
should some of the first 255 devices be signal handler devices? maybe 16, 8 types reserved for system and 8 types for user?
specify little-endian
we probably want 'spawn' instead of 'fork', because some underlying archs dont offer fork?

probably won't do these b/c they can be emulated with existing instructions, and there isn't much opcode space here so these should go in OVM instead:

consider adding DIV and REM/MOD (division and remainder/mod)
consider adding a QADD saturated addition instruction
consider adding an integer MAC (multiply-accumulate; A = A + B*C) instruction
consider adding signed integer arithmetic
consider adding floats (half precision?)
secure mode, which can be dynamically turned on or off, during which the implementation ensures that this thread:
- does not leak data (eg when you POP SMALLSTACK, it actually overwrites the old value, it doesn't just move the pointer) (eg old values of registers/stacks are not cached) (eg caching is either turned off (esp. if other processes are running and could do a timing attack based on bus contention) or flushed at the exit of secure mode or before switching to any other thread, eg which memory addresses you access do not affect what is cached)
- does not vary timing based on data (no timing channel attacks)
- operates on a 'best effort' basis (eg even if it only makes the system more secure but does not do all of the above, it doesn't have to return an UNSUPPORTED error), but the feature support flags say whether or not the implementation purports to do all of the above
since the implementation is not required to keep track of the size of SMALLSTACK (and might use a ring buffer), there needs to be a way for the program to tell it "SMALLSTACK is only of size X now" to direct the implementation to actually erase everything above that so that it gets garbage collected/deallocated.
one way to do primitive polling (for read/write) would be to read "# of messages waiting in the mailbox" instantaneously, and then provide a "block until nonzero" primitive
32-bit note: replace the constant-0 register (R0) with a PC register. But a register direct write to PC is still 'discard', like it was with 0
format note:
when bit 0 (1st format bit) is 0, it's 16-bit format. (Boot)
when bit 0 is 1 and bit 1 is 0, it's 8-bit format. (reserved, probably for the 64 most common instructions (64 comes from the 6 free bits after the 2 format bits))
when bit 0 is 1 and bit 1 is 1, and bit 2 is 0, it's 32-bit format. ('BootX'? 3 format bits, 5 opcode bits, 3 * (3 addr mode bits + 4 value bits) operands = 32 bits total; addr modes are immediate, register, register indirect, stack, constant, displacement??, indirect index??, postincrement? -- note that displacement and register indirect are only a tiny bit useful when they can only specify the first 4 registers as the indirection register due to their being 'split' addressing modes (2 bits for the indirection register, and 2 for the displacement or indexRegister)! These modes will be more useful in 64-bit OVM, which has 8-bit operands (so 4 bits for each split); also these modes encourage use of ERR as a GPR, since it is in these 4; interestingly both roles of R2 (as SMALLSTACK and as ERR) can be used simultaneously in indirect indexed (the indirection register can be SMALLSTACK and ERR can hold the index); should it be illegal to assign a PTR to ERR? i think no, we can use it as an indirection reg in displaced mode or indirect mode; also for indirect indexed mode, simply make the index register start from 2 since it can never be the PC or data stack, since it must be an int rather than a pointer. Also allow the indirection register to be 2, meaning small stack, even though small stack is not really a pointer.)
- put DUP, SWAP, ROLL3, etc into the first few 8-bit commands. Look at JVM, 6502, etc, and frequent assembly instructions, for more ideas there.
note that zero register is replaced by (readonly) PC register; but a register direct write to PC is still 'discard', like it was with 0
when bit 0 is 1 and bit 1 is 1, and bit 2 is 1, and bit 3 is 0, it's 64-bit format (OVM)
when bit 0 is 1 and bit 1 is 1, and bit 2 is 1, and bit 3 is 1, it's extended format (reserved)
GETPID in SYSINFO
mb have a MEMCPY (note: this could be used to save and restore SMALLSTACK)
the instruction whose encoding consists of 32 1s is BAD1 and is illegal
PICK and ROLL ?
note somewhere that just because we have a 32-bit instruction encoding in BootX, this does NOT imply 32-bit registers; register size is guaranteed to be >= 16 bits, just like Boot.

so with the current plan, in 32-bit, we'll get 16 more 3-operand instructions plus 8 more 2-operand ones. This will allow us to make a lot more things 3-operand, so may as well rewrite the whole thing. So we got tons of space; after just moving stuff around (and freeing up loadpc), we have 8 3-operand opcodes, 9 2-operand opcodes, 7 1-operand opcodes are open. Ideas:

with addition of:

3-operand: 4 16-bit arithmetic, 5 f64 arith
2-operand: 4 f64 arith (including peek and poke for f64), 3 i16 arith, 2 stack ops
1-operand: 4 f64 arith, 1 syscall2 (instr-/misc), 2 i16 arith (peek and poke)
0-operand: 11 syscall2s, 1 divmod, 1 divmod i16
totals: 1 divmod, 7 16-bit arithmetic, 13 f64, 12 syscall2s

note: the f64 instructions operate on 16 separate floating point registers

note: what about using half-precision (16-bit) floats instead? then we wouldn't need the separate registers. 64-bit floats ('doubles') could be available in OVM.

we could even just do unspecified 'native precision', but i'd prefer not to -- the point of native precision is to have a portable way to express computations, which may involve distances between pointers, and we dont know the pointer size so we cant know the integer size. but once we move outside of that, it's probably better to know how much precision you're dealing with.

but a problem with this is that i bet there are more architectures with f64 support than f16 support. In fact, it's worse than that; a quick check shows that ARM Cortex-M4F only supports f32 (single-precision floating point), whereas Javascript and Lua and Python only support f64.

so, i think we'll stick with f64.

note: what about sign-extension and zero-extension of integers of various bitwidths? what about coerce-i16-i and coerce-i-i16? i guess those aren't very useful without knowing the bitwidth of the native ints (which a program could get from sysinfo and then have a switch statement based on the result, but it's probably simpler not to)

note: instead of all the 3-operand i16 and f64 arithmetic, perhaps we'd like more kinds of compare-and-branch instructions on native ints and ptrs?

new instructions:

==
arithmetic on native ints	divmod-int
stack ops	pickk, rollk (note: these obsolete the 1-operand dup, swap, over instructions)
64-bit floating point (optional)	add-f64, mul-f64, div-f64, bne-f64, ble-f64, neg-f64, peek-f64, poke-f64, fclass-f64, cvt-f64-i16, cvt-i16-f64, coerce-f64-i16, coerce-i16-f64
16-bit integers	add-i16, mul-i16, cvt-i16-i, cvt-i-i16, neg-i16, peek-i16, poke-i16, bne-i16, ble-i16, divmod-i16
syscalls (optional)	exec, delete, get, put, spawn, pctrl, time, rand, environ, getpid, signal
==
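
To illustrate why pickk and rollk would make the 1-operand dup, swap, and over instructions redundant, here is a small sketch on a Python list used as a stack (top of stack at the end). The exact operand convention for pickk/rollk is not settled in these notes; this assumes the Forth-style one where k counts items below the top.

```python
def pick(stack, k):
    """Push a copy of the item k positions below the top (k=0 is the top)."""
    stack.append(stack[-1 - k])

def roll(stack, k):
    """Remove the item k positions below the top and push it on top."""
    stack.append(stack.pop(-1 - k))

def dup(stack):
    pick(stack, 0)   # dup is pick 0

def over(stack):
    pick(stack, 1)   # over is pick 1

def swap(stack):
    roll(stack, 1)   # swap is roll 1

def roll3(stack):
    roll(stack, 2)   # ROLL3 (rot) is roll 2
```

Under this convention, roll 0 is a no-op, which is one wasted encoding.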

new critical errors: divide-by-zero, (??: conversion-out-of-range? should that really be a critical error?)

note: does float divide by zero cause a critical error, or does it just create Inf? Can you choose?

todo: specify that native ints are 2s complement, and also that you can give native ints as inputs to -i16 operations on native ints, and the effect is to take just the lowest-order 16 bits of the native int. You can also give -i16s as input to native int operations, and the effect is to (zero-extend or sign-extend?) the i16. This allows us to use bne-int to compare i16s (but watch out comparing an i16 to an int; it may compare as unequal even if the low-order bits are the same, if int has any high-order bits set). However, since the i16 has a sign bit where longer ints have a '32768' bit, ble-int won't work right on i16s; it will interpret i16 -32768 as int 32768. So, we want a ble-i16, but we have no room for that. Alternately, we could change the 'i16's to 'u16's; but then (a) we want a sub-i16, and we have no room in the 3-operands for that, and also (b) if the native ints happen to be 16-bit then ble-int won't work again. So, maybe we should say that we CANNOT use bne-int or ble-int on i16s (it's a type error), and that we must use cvt-i16-i first, which i guess sign-extends the i16 to an int. Is this a significant enough hit to i16 efficiency such that we should provide bne-i16 and ble-i16, and move mul-f64 and div-f64 (or something else) to make room? Should we consume our last 3-operand RESERVED? ok so far i consumed the RESERVED and moved divmod-int.
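
The ble-int problem above can be seen in a few lines. This is only a sketch: Python ints stand in for native ints wider than 16 bits, and the helper names (`low16`, `sign_extend16`, `zero_extend16`) are made up for illustration; `sign_extend16` is what cvt-i16-i is guessed to do above.

```python
def low16(x: int) -> int:
    """Keep only the lowest-order 16 bits (what an -i16 op sees of a native int)."""
    return x & 0xFFFF

def zero_extend16(x: int) -> int:
    """Treat the low 16 bits as unsigned."""
    return x & 0xFFFF

def sign_extend16(x: int) -> int:
    """Treat the low 16 bits as a 2s complement i16."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

minus_one_bits = low16(-1)                        # bit pattern 0xFFFF, i.e. i16 -1

# A wide ble-int sees the raw bits as 65535, so "-1 <= 1" comes out false.
wrong = zero_extend16(minus_one_bits) <= 1

# After sign-extension the comparison behaves as expected.
right = sign_extend16(minus_one_bits) <= 1

# Equality has its own pitfall: same low 16 bits, different native ints.
same_low_bits = low16(0x10005) == low16(5)
equal_as_ints = 0x10005 == 5
```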

note: we clearly can't use most of the addr modes on float regs, should we use the bits differently? like, mb just offer register direct, immediate (which are still interpreted as signed integers), and constant, and allow 5 value bits instead of 4? Should we then have 32 f64 regs instead of 16 (eg RISC-V has 32)? Alternately, if the addr mode has any indirection, we treat this as a PEEK or a POKE to normal memory through the normal registers. I guess the latter is more consistent.

todo: instead of offering PEEK and POKE for -i16 and -f64, we could just provide bitwise truncation and sign- and zero-extension, and say, just use those with indirect addressing modes to load and store. That makes more sense since these loads and stores are to ordinary memory, right? Well, no, not with f64; how many native memory spots it will occupy is non-portable (it will be 1 on 64-bit machines and 4 on 16-bit machines). So, maybe do that with i16 but not f64?

todo: hmm aside from f64 and i16 we're looking pretty stable. But there are still some decisions to be made then regarding f64 and i16. Throwing in f64 and i16 adds some complexity, is it worth it? If so, i16 or u16? How do f64s get read from and stored into memory, and how many spaces do they take up? How about u16s? What happens if you attempt an int operation on an i16? Are there sign-extend and zero-extend ops for i16? How do you convert from ints to i16s and could this cause a critical error? How many f64 operations do we expose? Is f64 divide by 0 a critical error or does it produce inf, and can this be configured?

note: Will we be IEEE 754 compliant? I don't think so; it seems to me that IEEE-754-2008 may require SQRT and ABS and multiple rounding modes. Also, the RISC-V spec comments, "The C99 language standard effectively mandates the provision of a dynamic rounding mode register". Perhaps the RISC-V floating point operations would be the simplest way that would support the standard. I guess we could say we support it if OUR standard required various assembler intrinsics that computed in 'software' for what the VM itself doesn't do at runtime. I'd rather just keep BootX simple, though, and say that we don't support IEEE-754-2008, although we do provide a subset of the operations defined there. If we wanted to be as complex as RISC-V, why would we even create BootX at all? OVM will have more opcode space and can have all those other operations.

note: in fact, it seems that even Python doesn't support IEEE-754 out of the box: [1]. And Python is used a lot for numerical computing. My motto: if Python doesn't support some numerical thing, then we really don't need it (at least not at the OVM level; maybe Oot stdlib could have it).

note: we need another feature flag in SYSINFO to state whether or not instructions (aside from CAS) are still atomic when they access memory multiple times, eg CPY (r6) (r7), which copies a word from the address pointed to by r7 to the address pointed to by r6; another example is LD R6 (R7), which loads a word from the address pointed to by the address pointed to by R7 into R6 (double indirection, b/c LD itself is already indirect once, and indirect addr mode was used).

note: we may want to add instructions for bitwise set, clear, test, and toggle; note that bitwise test would be another 3-operand branching instruction. And we may want to add bitwise rotate left, bitwise rotate right, and a bitwise rotate-through-carry.
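
For reference, a sketch of what these bit operations would compute. A 16-bit word is assumed here purely for illustration (register size is only guaranteed to be >= 16 bits), and all function names are hypothetical.

```python
WIDTH = 16                    # assumed word width for this sketch
MASK = (1 << WIDTH) - 1

def bit_set(x, n):    return (x | (1 << n)) & MASK
def bit_clear(x, n):  return x & ~(1 << n) & MASK
def bit_toggle(x, n): return (x ^ (1 << n)) & MASK
def bit_test(x, n):   return (x >> n) & 1 == 1   # would feed a branch

def rotl(x, n):
    """Rotate left by n bits within WIDTH bits."""
    n %= WIDTH
    return ((x << n) | (x >> (WIDTH - n))) & MASK

def rotr(x, n):
    """Rotate right by n bits within WIDTH bits."""
    n %= WIDTH
    return ((x >> n) | (x << (WIDTH - n))) & MASK

def rotl_through_carry(x, carry):
    """Rotate left by one bit through a carry flag; returns (result, new_carry)."""
    new_carry = (x >> (WIDTH - 1)) & 1
    return ((x << 1) | (carry & 1)) & MASK, new_carry
```

Rotate-through-carry needs a carry bit to live somewhere; since there is no status register, it is passed in and returned explicitly here.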

note: the 'float' below is a 'native float', just like the 'int' is a 'native int' and 'ptr' is a 'native pointer'. ints, pointers, and floats can all fit in one memory location. 'float' is guaranteed to have at least the precision and range of an IEEE half-precision (binary16) floating point number (eg integers between 0 and 2048 can be exactly represented; integers at least up to 65504 round to no more than a multiple of 32). On a 64-bit platform, pointers might be 64-bits and so C doubles might be used as 'floats'; on a 32-bit platform, C floats might be used as 'floats'. Note that, although both 'native ints' and 'native floats' must fit within one memory location, the bitwidth of our 'native ints' is not guaranteed to match that of our 'native floats'; e.g. on a platform such as Javascript in which the only number type is a 64-bit C double, C doubles might be used as 'floats' even though our 'ints' might be 32-bits (because 32 is the largest power of two that fits within the 53-bit significand precision of 64-bit double-precision floating point).
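
The half-precision guarantees quoted above can be checked with Python's `struct` module, whose `e` format code packs an IEEE binary16 value. A small sketch; the helper name `to_binary16` is made up for illustration.

```python
import struct

def to_binary16(x: float) -> float:
    """Round x to the nearest IEEE binary16 value, returned as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Every integer from 0 through 2048 survives the round trip unchanged.
exact = all(to_binary16(float(n)) == n for n in range(0, 2049))

# 2049 is not representable, so it rounds to a neighbouring representable value.
rounded_2049 = to_binary16(2049.0)

# From 32768 up to the largest finite binary16 value (65504),
# every integer rounds to a multiple of 32.
coarse = all(to_binary16(float(n)) % 32 == 0 for n in range(32768, 65505))
```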

The floating point instructions are optional.

We do not provide all floating point operations and modes required by IEEE 754-2008, but where provided, floating point operations match the behavior demanded by IEEE 754-2008. Regarding mode restrictions, we follow WebAssembly; quoting from them:

" The IEEE 754-2008 section 6.2 recommendation that operations propagate NaN bits from their operands is permitted but not required.
(we use) "non-stop" mode, and floating point exceptions are not otherwise observable. In particular, neither alternate floating point exception handling attributes nor the non-computational operators on status flags are supported. There is no observable difference between quiet and signalling NaN. However, positive infinity, negative infinity, and NaN are still always produced as result values to indicate overflow, invalid, and divide-by-zero conditions, as specified by IEEE 754-2008.
(we use) the round-to-nearest ties-to-even rounding attribute, except where otherwise specified. Non-default directed rounding attributes are not supported. " -- [2]

and

" When the result of any arithmetic operation other than neg, abs, or copysign is a NaN, the sign bit and the fraction field (which does not include the implicit leading digit of the significand) of the NaN are computed as follows:

If the fraction fields of all NaN inputs to the instruction all consist of 1 in the most significant bit and 0 in the remaining bits, or if there are no NaN inputs, the result is a NaN with a nondeterministic sign bit, 1 in the most significant bit of the fraction field, and all zeros in the remaining bits of the fraction field.

Otherwise the result is a NaN with a nondeterministic sign bit, 1 in the most significant bit of the fraction field, and nondeterministic values in the remaining bits of the fraction field. " -- [3]

and

" min and max operators treat -0.0 as being effectively less than 0.0.

In floating point comparisons, the operands are unordered if either operand is NaN, and ordered otherwise. " -- [4]

note: 'unordered' means that eq, lt, le/leq, gt, ge/geq return false, and ne/neq returns true (matching the behavior of [5])
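
Python floats already follow this rule for comparisons, so the behaviour can be shown directly. A small sketch; `compare_all` and `unordered` are illustrative names only.

```python
import math

def unordered(a: float, b: float) -> bool:
    """Operands are unordered if either one is NaN."""
    return math.isnan(a) or math.isnan(b)

def compare_all(a: float, b: float) -> dict:
    """Evaluate all six comparison predicates on a pair of floats."""
    return {
        "eq": a == b, "ne": a != b,
        "lt": a < b,  "le": a <= b,
        "gt": a > b,  "ge": a >= b,
    }

# With a NaN operand, only 'ne' is true, even when comparing NaN to itself.
nan_vs_one = compare_all(math.nan, 1.0)
nan_vs_nan = compare_all(math.nan, math.nan)
```

One consequence for branch instructions: a ble on unordered operands falls through, and so does the 'opposite' branch built by swapping the operands, so the two are not complements of each other.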

note: as of now we don't have round, min, or max, so some of the above is not applicable; however i've kept it in in case we add those later, or for me to copy-and-paste into OVM's spec.
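
As an illustration of the "non-stop" mode quoted above, here is a sketch of a divide that never traps and instead returns infinity or NaN. Python itself raises `ZeroDivisionError` for float division by zero, so the sketch catches that and builds the IEEE 754 result by hand; `div_f64` is a hypothetical helper name, not necessarily how the div-f64 instruction would be implemented.

```python
import math

def div_f64(a: float, b: float) -> float:
    """Divide without trapping: x/0 gives a signed infinity, 0/0 gives NaN."""
    try:
        return a / b
    except ZeroDivisionError:
        if a == 0.0 or math.isnan(a):
            return math.nan                      # invalid operation: 0/0, NaN/0
        # The result sign is the product of the operand signs (-0.0 counts as negative).
        sign = math.copysign(1.0, a) * math.copysign(1.0, b)
        return math.copysign(math.inf, sign)
```

Under this mode a float divide-by-zero would not be a critical error, unlike the integer divide-by-zero listed earlier.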

---

consider adding restriction of CVT-INT-CODEPTR to only come immediately after a sequence of LOADIs and SLLs and ADDs.
if you have any fixed-width types, do bytes, not i16
actually, yknow what we need, is not computation on bytes, but rather pointer arithmetic in units of bytes (in addition to the currently provided pointer arithmetic in units of native pointers). This would let us use LOADPC for jumps. Also, we may want unsigned arithmetic.
consider adding restriction that the map (types) of SMALLSTACK and of the registers be statically known
if we really do still have access to the 16-bit instructions too, then we dont need to duplicate any INSTR-ZERO instructions in the 32-bit ones.

---

note that the 8-bit instructions are RESERVED. The intent is to get working BootX programs, and then profile them to see which instructions and sequences of instructions frequently appear (static frequency) and are frequently executed (dynamic frequency, although i guess dynamic is not as important for this?) across many types of programs, and then assign those to the 64 8-bit instructions.
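
The static-frequency half of that profiling could look something like the sketch below: count every short instruction sequence across a corpus of programs and keep the best candidates. Programs are represented as plain lists of mnemonic strings, and the scoring rule (occurrences times sequence length, as a rough proxy for instructions saved) is an assumption for illustration, not something decided in these notes.

```python
from collections import Counter

def ngram_counts(program, max_len):
    """Count every contiguous instruction sequence of length 1..max_len."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(program) - n + 1):
            counts[tuple(program[i:i + n])] += 1
    return counts

def pick_shortcuts(programs, slots, max_len):
    """Return the `slots` best-scoring sequences across all programs."""
    total = Counter()
    for program in programs:
        total.update(ngram_counts(program, max_len))
    # Score = occurrences * length (a hypothetical proxy for space saved).
    ranked = sorted(total, key=lambda seq: total[seq] * len(seq), reverse=True)
    return ranked[:slots]
```

For example, `pick_shortcuts([["dup", "add", "dup", "add"], ["dup", "add", "swap"]], slots=1, max_len=2)` selects the pair `("dup", "add")`. Overlapping candidates are counted independently here; a real dictionary-compression pass would need to account for that.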

---

consider adding bit-set, bit-clear, bit-test, bit-compare-test, bit-toggle, MAC (multiply accumulate)

---

golang's calling convention?

" Go has it's own ABI calling convention:

All registers are caller-saved
All parameters are passed on the stack
Return values are also returned on the stack, in space reserved below (stack-wise; higher addresses on amd64) the arguments. " -- [6]

---

mb 'dropc: c sio (drop c items from stack)'? but that's only useful on SMALLSTACK, b/c o/w just use ADD on the stack pointer with an immediate-mode operand

---

consider adding in more comparison types (less-than, to start with)

---

consider adding some 2-operand (non-destructive except for CCPY) conditionals: CCPY (conditional CPY), CINC (conditional increment), CINV (conditional bitnot), CNEG (conditional arithmetic negation). But what would the condition be based on, since we don't have a status register for CMP operations? Pop a boolean off of SMALLSTACK (or just look at the contents of T)?

Or we could have 3-operand destructive compare-and-conditionals, eg CINC-LE x a b: if a <= b then x=x+1

and/or a 3-operand CMOVnz: CMOVnz dest src thing_that_might_be_nonzero: if thing_that_might_be_nonzero != 0 then dest = src

maybe also CSEL, CSINC, CSINV, CSNEG, which are 3-operand (like ARM64) and also pop a boolean off of SMALLSTACK.

maybe also 1-operand CINC, CINV, CNEG which pop a boolean off of SMALLSTACK, and then conditionally mutate the operand-indicated register (or effective address) in the respective way (increment, etc). And a 2-operand CCPY as above. And a 2-operand CSEL which pushes the selected result to SMALLSTACK.

if we did any of that 'bool on SMALLSTACK' stuff we'd want to add a CMP operation.

seems like many of these would eat up a lot of 3-operand instructions. So probably save it for OVM. But consider the non-3-operand ones a little longer. Like bool-on-SMALLSTACK-based 2-operand CSEL-push-to-smallstack, 2-operand CCPY (i tentatively added those two).
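
To make the candidates above concrete, here is a sketch of their semantics over a register file (a dict) and SMALLSTACK (a list, top at the end). Operand order and the choice of which value CSEL picks on true are guesses for illustration only.

```python
def cinc_le(regs, x, a, b):
    """3-operand CINC-LE x a b: if a <= b then x = x + 1."""
    if regs[a] <= regs[b]:
        regs[x] += 1

def cmovnz(regs, dest, src, cond):
    """3-operand CMOVnz: if cond != 0 then dest = src."""
    if regs[cond] != 0:
        regs[dest] = regs[src]

def ccpy(regs, smallstack, dest, src):
    """2-operand CCPY: pop a boolean off SMALLSTACK; if true, dest = src."""
    if smallstack.pop():
        regs[dest] = regs[src]

def csel_push(regs, smallstack, a, b):
    """2-operand CSEL: pop a boolean; push a if true, else b."""
    flag = smallstack.pop()
    smallstack.append(regs[a] if flag else regs[b])
```

The two SMALLSTACK-based forms need something to produce the boolean first, which is why a CMP operation would have to come along with them.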

---

should we have FSUB?
when an output operand is given in immediate mode, what should happen? I think it should be an absolute address in the data segment, eg like ldabs and stabs. constant mode can be an absolute address item in the constant table (a label for a data storage location). i guess where the constant table is stored, how items are stored into it, and how it is looked up would have to be implementation-dependent?
CALL3 can pass in the values in the effective addresses of op1 and op0, but pass the effective address itself for op2
i think maybe CALL3 should just use the smallstack; then instructions can use ordinary addressing modes. immediate mode would just be ordinary opcodes, though

---

what about instructions to (atomically?) load/store multiple words to/from memory at once?
since SMALLSTACK addr mode can reference all 16 locations on smallstack as if they were registers, i guess we should be able to access them regardless of the current depth of SMALLSTACK. Which suggests that there is no need for SMALLSTACK to always have at least one item just so that T can be defined; SMALLSTACK can be completely 'empty' yet T and all of the other locations allocated to SMALLSTACK can still be used.
mb we should define LIBRARY to translate our ABI calling convention into the platform's?
i think we should just use the 128 instruction to put a substantial portion of the RISC-V and WASM ISAs in here. Fancy stuff like function composition, map, reduce should be saved for OVM
should we add Pop_jump_if_false?
mb add some sort of "load with PC-relative addressing"
mb generalize PUSHPC to RISC-V-style special register GET and SWAPs
branch-with-link
instructions: loadlow-signextend, loadhigh-signextend, loadlow-zeroextend, loadhigh-zeroextend, storelow, storehigh. This would allow eg. a machine with 16-bit registers to address 64k of 16-bit words and yet also treat every byte separately.
really should have an UNDEF (or FREE) instruction/annotation to annotate when a register/stack loc/memory loc is no longer in use.
- this can be used to help specify e.g. how 64-bit arithmetic undefines multiple registers at once; you could say that it is equivalent to a bunch of UNDEFs combined with the actual operation
consider having 8 integer regs and 8 ptr regs, all 32-bit. This way we fit into amd64 (which only has 16 GPRs) or arm64 (which only has 31) rather than the idea of having 16 of each kind. OTOH this is a waste of encoding space... if we do that then mb only have 3 bits for operand encoding.. which frees up 4 bits... allowing tons of instructions in Boot without BootX... hmmm...
- i like this idea a lot. Now we can fit all the main instructions in 3-operand form in 16-bits. In 32 bits we can also have more instructions. We'll have more register pressure but the situation is closer to amd64. We'll have to make SMALLSTACK slots the larger of ptrs and 32-bit ints (and 32-bit floating points), and we'll have to have separate push/pop operations to move between the smallstack and integer registers or between the smallstack and ptr regs (and also floating point regs i guess?!). We'll have 8 32-bit floating-point registers (which can be used as 4 64-bit ones). My only big issue here is that it'll probably make it hard to use Boot as a compile target for other platforms, like LLVM, WASM, C; because compilers will expect GPRs that can hold either integers or pointers. We can fake it by having any instruction from those platforms be lowered to a series of writes to both kinds of registers, and by exposing only 4 GPRs, but this will be tremendously inefficient. By doing this, what we are saying is that no, we will not in any way support treating pointers as integers within registers (a program that really wanted to could just write the pointer to memory and then read it in as an integer). I think it might be worth it though; something along those lines is necessary anyways if we want to support target architectures with opaque or weird pointers (which is probably going to come up with interop with HLLs in any case).
- another nice thing is that this makes the semantics cleaner for those special-purpose registers that change meaning when accessed by certain instructions, e.g. that reg which was ERR/SMALLSTACK
- also i think we can have a constant zero reg, and a PC reg in the same place?
- one downside is that we have to burn multiple regs, one in each register bank, if we want a TOS
- so then we'd have: ZERO/PC; ERR/SMALLSTACK; TOS; so we'd burn 3/8 regs in each bank, and have 5 GPRs left of each type. Not much! Otoh since in practice a Boot interpreter will often be running on a CPU with 16 GPRs and needs some regs for its own use too, in those cases we could probably fit all of the interpreter regs in real regs (although not on the ARM Cortex, where we only have 12 real GPR regs; because in Boot we need to store at least the PC, ERR, SMALLSTACK, and 10 other regs)
- also in boot_extended, there's no longer an obvious way to split the 3 register bits into two halves for the indexed addr mode; and displacement mode only has a range of 2, which is less useful. We might want to do something funky and combine the operand and addr mode bits in a way that depends on addr mode.
- also in boot_extended, our immediate constant range is smaller, which affects branch distances
- i think the solution may be to go back to 16 regs in boot_extended. This lets Boot serve as a smaller code form that can only access the most commonly-used registers. But it gives up the property that lets BootX be compiled into Boot programs without using any memory outside of the registers. Hmm...
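
Going back to the loadlow/loadhigh/storelow/storehigh idea in the list above, a sketch of what those six instructions would compute on a single 16-bit memory word. The word is passed and returned as a plain integer; the function names mirror the proposed mnemonics but are otherwise hypothetical.

```python
def _sign_extend8(b: int) -> int:
    """Interpret the low 8 bits as a 2s complement byte."""
    b &= 0xFF
    return b - 0x100 if b & 0x80 else b

def loadlow_zeroextend(word: int) -> int:
    return word & 0xFF

def loadhigh_zeroextend(word: int) -> int:
    return (word >> 8) & 0xFF

def loadlow_signextend(word: int) -> int:
    return _sign_extend8(word)

def loadhigh_signextend(word: int) -> int:
    return _sign_extend8(word >> 8)

def storelow(word: int, byte: int) -> int:
    """Return word with its low byte replaced; the high byte is untouched."""
    return (word & 0xFF00) | (byte & 0xFF)

def storehigh(word: int, byte: int) -> int:
    """Return word with its high byte replaced; the low byte is untouched."""
    return (word & 0x00FF) | ((byte & 0xFF) << 8)
```

With these, a machine with 16-bit registers could still address 64k words while reading and writing each byte within a word separately.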

proj-oot-bootExtendedReferenceOld200622