Boot (Oot Boot) reference

:hardbreaks: :toc: macro :toclevels: 1

Version: unreleased/in-progress/under development (0.0.0-0)

THIS DOCUMENT IS OUT OF DATE. The markdown version, boot_reference.md, has replaced it (although that's not finished yet).

Boot is a low-level 'assembly language' virtual machine (VM) that is easy to implement.

toc::[]

TODO rewrite to current design -- the pyboot implementation, particularly the instruction listings in constants.py, is (mostly) more up to date than this document. Note also that a simpler 32-bit encoding has been introduced; this 16-bit encoding has been made an optional 'compressed' encoding (mb since it's optional, it should be moved to BootX?) later: actually now i'm thinking of simplifying even further and removing stacks, and adding floating point, see ootAssemblyNotes26.txt

TODO actually let's go back to defining int32 as '32-bits OR MORE' and make integer overflow fully undefined behavior. This is because signed integer overflow is undefined behavior in C and we want to allow porters to compile arithmetic straightforwardly to their unchecked C equivalents; and also because [1] says that in many cases it's much more efficient for variables to be register width. I also made a note about this at the end.

Introduction

Boot's purpose is to promote the portability of OVM (Oot Virtual Machine) by serving as a portable target language underneath it.

The reason for writing a new assembly language, rather than adopting LLVM or Webassembly or similar, is that we want the entire Oot toolchain to run on top of many platforms, including 'on top of'/within existing high-level languages such as Python. This means that the entire Boot implementation needs to be quick and easy to port (not just retarget to a new backend, but entirely port), and it needs to support opaque platform-native pointers of unknown size.

Highlights:

3-operand fixed-length register machine
about 40 TODO instructions
instructions include TODO signed 32-bit integer arithmetic, stack operations, I/O and memory allocation
RISC-like; no addressing modes (or rather, each instruction has one fixed addressing mode), and only a few instructions access (non-register) memory
16 registers, including 10 GPRs and 6 special-purpose registers
two fixed-capacity 8 element stacks (one for ints, one for ptrs) with opaque representation
- todo: remember we need 6 items on ea. stack for implementation usage. If each stack is o.w. 16 items, that's 22 items total. These are not for user use, only for the toolchain (e.g. compilation of BootX? into Boot)! The 'user' program should only assume 16 items are available.
opaque pointer representation
instructions for compare-and-swap and memory fence
instructions for I/O and memory allocation

Cheatsheet

Instruction encoding: 16 bits, little endian. LSB bit (the 'instruction encoding length format bit') is always 0.

Four formats:

I11: 2 I11-opcodes, one 11-bit immediate operand
I8: 2 I8-opcodes, one register operand which can address 8 registers, and one 8-bit immediate operand
R-format: 32 opcodes, two register operands which can each address 16 registers
P-format: 32 opcodes, three register operands which can each address 8 registers

For each format, from LSB to MSB, the fields are:

I11-format (11-bit Immediates): 0000, I11-opcode (1 bit), signed immediate operand (11 bits)
I8-format (8-bit Immediates): 0001, I8-opcode (1 bit), signed immediate operand op1 (8 bits), operand 0 (3 bits)
R-format (many Registers): 001, R-opcode (5 bits), operand 0/1 (4 bits), operand 0 (4 bits)
P-format (many oPerands): 01, O-opcode (5 bits), operand 2 (3 bits), operand 1 (3 bits), operand 0 (3 bits).

In graphical form (from MSB to LSB) (0,1 = constants, o = opcode, x = operand 2, y = operand 1, z = operand 0:

I-formats:

I11-format: iiiiiiiiiiio0000
I8-format: xxxiiiiiiiio1000

R-formats:

R2-format: xxxxzzzzooooo100 (operand 1, yyyy, is not encoded because it is assumed to be the same as operand 0)
R1-format: xxxxoooo11111100
R0-format: oooo111111111100

P-formats:

P3-format: xxxyyyzzzooooo10
P2-format: xxxyyyooo1111110
P1-format: xxxooo1111111110
P0-format: ooo1111111111110

TODO out of data; this is now the 'compact' format, which should go in a different document

Note that some implementation could choose to consider the subinstructions of otwo, oone, ozero as separate encoding formats.

Note that some implementation could choose to consider the I-format as consisting of two subformats, depending on how many bits the immediate constant is.

The I11 and I8 instructions are:

I11: jpc and jk (jumping)
I8: ldi sai (loading constants)

Datatypes: int32, ptr

Registers: two banks of 16 registers each; one for int, one for ptr. The first 5 registers in each bank are special. The next 3 are GPRs. Some operations are only available on the first 8 registers.

Instructions:

TODO: add the other new instructions

==
annotation	ann
constants	ldi sai
loads and stores and moves	ld ldh ldhu ldb ldbu ldp st sth stb stp cp cpp
arithmetic of ints	add addi sub mul
bitwise arithmetic	sl sru srs and or xor
arithmetic of pointers	aip
comparision control flow	beq beqp bne blt
control flow	jpc jy
concurrency	CAS fence acquire release
misc	more
sys	sysinfo
RESERVED	3 R instructions, 1 I instruction
RESERVED	6 two-argument instructions
RESERVED	7 one-argument instructions
RESERVED	8 zero-argument instructions
RESERVED	1 instruction
==

other ideas: in out push pop dup drop swap over read write halt impdep poll sysinfo xcall devop, CAP register/stack cmov select call ret cmov malloc dealloc realloc

In the following, #X is an immediate constant, $X is an integer register, &X is a pointer register, and x is an untyped argument. All immediate constants are signed two's-complement ints.

From left to right, the arguments go into operands op0, op1, op2. Immediate operands are always on the right (the highest-numbered operand). When two immediate operands are combined into an #imm6 (as with instruction jd), op1 is the high-order bits and op2 is the low order bits (imm6 = (op1 << 3) + op2).

The branch immediates have a special coding, because -2, -1, 0 are not needed; at the time of branch instruction execution the PC is pointing to the beginning of the instruction after the branch instruction. Since instructions are 16-bits, -2 would branch right back to the branch instruction itself, causing an infinite loop; -1 would branch into the middle of the branch instruction; and 0 would cause the branch to end up at the same place as if there were no branch. Therefor the branch instructions interpret the numbers 0..7 in op2's binary representation to code for: [-6, -5, -4, -3, 1, 2, 3, 4]. The program counter is in units of bytes, and instructions are 16-bits (2 bytes), so a branch immediate of +2 skips over one instruction, and a branch of +4 skips over two instructions, and a branch of -4 skips backwards two instructions, and a branch of -6 skips backwards three instructions. The odd-numbered branch immediates (-5, -3, 1, 3) are illegal in Boot but are present for compatibility with BootX?, which is a variable-length instruction set including some 1-byte instructions.

Jump and branch offsets are in units such that an offset of 2 represents one Boot instruction. Platforms which represent Boot code in memory in ways such that one instruction spans more or less than 2 memory locations should adjust the jump and branch offsets accordingly when executing instructions jpc beq beqp bne blt, and also should emit the correct INSTRUCTION_SIZE in SYSINFO. Program code which dynamically computes code pointer by adding integers to the PC or to other code pointers should be sure to take account of INSTRUCTION_SIZE.

meta:

ann #imm9: ANNotation; no effect on execution

constants:

li $reg #imm8: LoaD? 8-bit Immediate int constant
sai $reg #imm8: Shift left 8-bits then Add 8-bit Immediate int constant
lk $k: LoaD? K-th ptr constant and push to &stack. The pointer constants can range from -32768 to 32767 (some Boot implementations may support a larger range). The positive pointer constants are pointers to the program locations corresponding to the jump table (these are suitable to be fed to JY instructions). The negative pointer constants are defined in an implementation-dependent manner (these could be used for external function entry points, constant data pools, special platform-specific memory locations, etc).

lw $dest &addr: LoaD? 32-bit int from memory (or stack index) to register
lh $dest &addr: LoaD? 16-bit int from memory (or stack index) to register
lhu $dest &addr: LoaD? 16-bit unsigned int from memory (or stack index) to register
lb $dest &addr: LoaD? 8-bit int from memory (or stack index) to register
lbu $dest &addr #imm3: LoaD? 8-bit unsigned int from memory (or stack index) to register
lp &dest &addr #imm5: LoaD? Pointer from memory (or stack index) to register
sw &addr $src #imm3: STore 32-bit int from register (or stack index) to memory
sh &addr $src #imm3: STore 16-bit int from register (or stack index) to memory
sb &addr $src #imm3: STore 8-bit int from register (or stack index) to memory
sp &addr &src #imm3: STore pointer from register (or stack index) to memory
cp $dest $src: CoPY? int from register to register
cpp &dest &src: CoPY? Pointer from register to register

stack and register/stack loads and stores and moves:

lsi $dest #imm3: Load from int Stack, Integer (copy a value from deep in int_stack without popping)
lsp &dest #imm3: Load from ptr Stack, Pointer (copy a value from deep in ptr_stack without popping)
ssi $src #imm3: Store to int Stack, Integer (overwrite a value deep in int_stack without pushing)
ssp &src #imm3: Store to ptr Stack, Integer (overwrite a value deep in ptr_stack without pushing)
swapi $reg: swap two items (note: can be used as swapi $s $s to swap top two items on intstack)
swapp &reg: swap two items (note: can be used as swapp &s &s to swap top two items on ptrstack)

(note: you can use 'cp' and 'cpp' to achieve DUP, and by also using the zero register, DROP)

arithmetic of ints:

add $dest $src1 $src2: $dest = $src1 + $src2 mod 2^32
addi $dest $src1 #imm3: $dest = $src1 + #imm3 mod 2^32
sub $dest $src1 $src2: $dest = $src1 - $src2 mod 2^32
mul $dest $src1 $src2: $dest = $src1 * $src2 mod 2^32

bitwise arithmetic:

sll $dest $shiftamount $src: shift left
sru $dest $shiftamount $src: shift right unsigned
srs $dest $shiftamount $src: shift right signed
and $dest $src1 $src2
or $dest $src1 $src2
xor $dest $src1 $src2

arithmetic of pointers:

ap &dest $src1 &src2: &dest = $src1 + &src2 (Add to Pointer) (note: some pointers may not be able to be added to)

(?: i'm not sure if these are worth including; the idea is to make up for the fact that the platform word size and pointer size is not known at compile time. Note also that a fancy VM could easily search for these at loadtime and substitute in the appropriate platform-specific immediate constants. These also serve to provide an ADDI on pointers, which is probably very useful)

appi &dest &src #imm3: &dest = &src + #imm3*PTR_SIZE (Add to Pointer, units of Pointer, Immediate)
apwi &dest &src #imm3: &dest = &src + #imm3*I32_SIZE (Add to Pointer, units of Word, Immediate)

conditional branches:

beq $src0 $src1 #imm5: branch-if-equal
beqp &src0 &src1 #imm5: branch-if-equal on ptrs
bne $src0 $src1 #imm5: branch-if-not-equal
blt $src0 $src1 #imm3: branch-if-less-than

unconditional jumps:

jpc #imm11: jump (PC-relative)
jy &target: dYnamic (indirect) jump. &target must be a code pointer provided at runtime (either by PUSHPC or by platform-specific or foreign code or lk) and which did not have further pointer arithmetic performed on it; otherwise, results are implementation-dependent (note that dynamic pointer arithmetic on code locations is non-portable because a compiler may translate Boot instructions into platform instructions of unknown and differing size).
jk $reg #imm11: Jump to the pointer that would be loaded by "lk $imm11". $imm11 is signed (so, it can reach the first 1KiB? entries in the global jump table, and the first 1KiB? implementation-dependently-defined negative pointer constants).
jt $reg: Jump to index within local jump table (jump table follows JT instruction, in the form: 16-bit integer with table length N, in units of number of 16-bit entries; then a series of N 16-bit integer offsets, in bytes, from the program location following the end of this jump table. N must be between 0 and 32767, inclusive. If the index provided in $reg is less than N, a jump is performed to the program location specified by the offset in the table entry at the given index; if the index provided is greater than or equal to N, then execution continues from the program location following the end of this jump table (equivalent to a jump to a table entry of offset 0))

concurrency:

casat &dest $new $old: compare-and-swap atomic. Pushes intstack with $new upon success and with the value of &dest otherwise.
casat: like casat, but without ordering (casat has sequentially consistent ordering) todo
fence $memory_domain: instruction/memory access reordering barrier; prevents any memory operations on the given memory_domain from appearing to be reordered across the FENCE instruction.
malloc_local &dest $size: Like malloc but the allocated memory must only be used for thread-local storage. The semantic guarantees of FENCE do not apply to accesses to memory allocated by malloc_local. All other operations also lose their atomicity and ordering guarantees (e.g. cas, lpa, lparc, lwa, lwarc, spa, sparc, swa, swarc) (e.g. so lpa and lparc become equivalent to lp, etc). The motivation for malloc_local is that, in order to be able to provide the concurrency guarantees required by this spec, some Boot implementations may create and use locks to control access to blocks of shared memory returned by malloc; in some cases even non-atomic, unordered load or store instructions could cause the Boot implementation to acquire a lock. malloc_local lets such an implementation know that it does not have to setup and use locks for this memory segment, and represents an assurance by the programmer that this memory segment will only ever be accessed by the same thread that called malloc_local.
{lp,lw,sp,sw}a are like {lp,lw,sp,sw} but atomic.
{lp,lw,sp,sw}rc are like {lp,lw,sp,sw} but have Release Consistency semantics (specifically, RCpc; that is, the ordering operations themselves are ordered with Processor Consistency semantics), that is to say, the loads are load-acquires and the stores are store-releases. These operations are also atomic.
{lp,lw,sp,sw}sc are like {lp,lw,sp,sw}rc but also have Sequential Consistency semantics, that is to say, they have Release Consistency as above and also appear in program order in a single total order over this memory_domain observed by all threads along with all other sequentially consistent instructions. These operations are also atomic.
addat $dest &addr $src2: Atomically, the integer value at memory location &addr has $src2 added to it, and $dest = the old value that was at memory location &addr.
apat: Like add_amo, except that the value at memory location &addr is a pointer.
andat: like add_amo, except that the operation is bitwise AND rather than addition
orat: like add_amo, except that the operation is bitwise OR rather than addition
xorat: like add_amo, except that the operation is bitwise XOR rather than addition

placeholders for other formats:

ptwo x0 x1 #subopcode
pone x0 #subopcode
pzero #subopcode
rone x0 #subopcode
rzero #subopcode

meta:

sysinfo $dest #query: $dest = sysinfo(#query) (query 0 = PTR_SIZE, 1 = I32_SIZE, 2 = I16_SIZE, 3 = INSTRUCTION_SIZE)
lpc $dest: $dest = PC (program counter)
halt: end program, returning result code $3

I/O:

in $dest $device: read one integer of input from a device
out $src $device: write one integer of output to a device
inp &dest $device: read one pointer of input from a device
outp &src $device: write one pointer of output to a device
devop $value $device: device-specific control operation on a device

memory allocation:

malloc &dest $size $memory_domain: &dest = pointer to a memory allocation of $size memory locations; also pushes a result code to $s (0 is success). memory_domain is RESERVED for future use; always use $0 for now.
mfree &src: deallocates &src

interop:

xentry: enter Boot function from external platform
xret &return_address: return from Boot function to external platform
xcall &target_address: call external subroutine
xpostcall: enter Boot function after returning from external subroutine call

General architecture

Memory locations are of platform-specific size. The SYSINFO instruction can be used to query constants which specify how many memory location are needed to store a PTR (PTR_SIZE), a 32-bit quantity (I32_SIZE), or a 16-bit quantity (I16_SIZE). A memory location must fit at least an 8-bit quantity, so there is no I8_SIZE (it would always be 1).

If a memory location holds more than 8-bits, the instructions lw, lh, lhu, lb, lbu, will load the quantity in the memory location mod 232, 216, 216, 28, 28 respectively.

The integer registers hold 32-bits and are treated as signed integers (twos complement).

Data that needs to be accessed concurrently must be aligned; other data has no alignment restrictions. Instructions must be 16-bit aligned.

Loads and stores of a single 32-bit word or of a single pointer are guaranteed to be atomic (or, if not, the Boot implementation must not be considered thread-safe and its documentation must state this prominently).

The only pointer manipulation instructions provided by Boot is the 'aip' instruction (todo: rename to ap), which adds a (signed) integer to a pointer. However, on some implementations it may be possible for Boot programs to manipulate pointers by reading in a memory location that contains a pointer as if it were an integer. The result of this is undefined/implementation-defined behavior.

Instruction encoding

The instruction encoding is fixed-length 16-bits.

Encodings are little-endian on disk, over the network, or when presented in serialized form, and are typically stored in host byte order in memory.

The instruction with an all-zero bitpattern (jpc 0) is illegal, and programs which might execute this instruction are illegal. The mnemonic alias 'BAD' is assigned to this bitpattern. The execution of this instruction is a critical error, and implementations should, but are not required to, terminate with an error if such an instruction is executed. This may assist in detecting situations in which uninitialized or non-coding memory is accidentally executed (because such memory may be more likely to contain zero than any other value), or in which some types of hardware errors are present.

Reserved instructions are treated similarly to the all-zero bitpattern.

Note that program code IS allowed to contain bit patterns of 16 0 bits, provided that these are not executed as Boot instructions; this provides for platforms which store data in the instruction stream, and also provides for forward compatibility with mixed-length encoding extensions which use Boot for 16-bit instructions but also include other longer instructions that sometimes contain 16 0 bits within them.

The beginning of the program code is a table of offsets into program code. The table starts with a 32-bit unsigned integer specifying the length of the table (in units of 32-bit integers). Each entry in the table is a 32-bit unsigned integer specifying an offset from the first location after this jump table, in units of bytes. There is at least one entry in the table and the first entry in the table is a pointer to the program entry point.

The jk instruction can be used to jump to the locations indicated in this table.

This table must remain constant and should not be read or modified during program execution; the Boot toolchain is permitted to read the table once at compile time or load time and then discard it, and use internal mechanisms to implement the jk command (for example, a compiler could get rid of the table and compile each jk instruction into a platform-native absolute jumps).

Although the length of the constant table is stored as a 32-bit integer, the official Boot implementations only support constant tables of length up to and including 32767 (whichis equal to 2^15 - 1, also the largest integer that can be represented as a signed 16-bit integer).

Jumps into the middle of instructions, or into a table of program code offsets, are not allowed (jumps to the addresses indicated by the entries of such tables are allowed, of course).

Note that self-modifying code, or jumps to Boot code which was written at runtime, may not be permitted on some platforms (for example, a compiler which compiles Boot code to native code may not be able to support interpreting additional Boot code at runtime).

Data types

There are two defined data types:

INT32 (signed 32-bit integer)
PTR (pointer)

INT32 can hold at least the range -32768 thru 32767, inclusive. Arithmetic whose mathematical result would be a value outside of that range has undefined behavior (except for 'mul', which returns the 32 low-order bits of the result; even there, however, if the mathematical result would be outside of the range of an int32, the sign of the result may be incorrect).

Registers

There are two banks of 8 registers each. One bank can only hold ints and the other can only hold ptrs. We denote the integer register bank with a '$' prefix and the pointer register bank with a & prefix.

In each bank, the first 3 registers are special:

$0: zero register. This register always holds the integer zero. Writes to this register have no effect.
$1 (also called $S): INTSTACK pseudoregister. Reads from this register pop from INTSTACK, and writes to it push to INTSTACK.
$2 (also called $T0): top-of-INTSTACK pseudoregister. This register holds the item on top of INTSTACK.
$3 (also called $T1): INTSTACK-2 pseudoregister. This register holds the second item of INTSTACK.
$4 (also called $T2): INTSTACK-3 pseudoregister. This register holds the third item of INTSTACK.
&0 (also called &G): A pointer to a small pre-allocated block of memory called the zero segment.
&1 (also called &S): PTRSTACK pseudoregister. Reads from this register pop from INTSTACK, and writes to it push to INTSTACK
&2 (also called &T0): top-of-PTRSTACK pseudoregister. This register holds the item on top of PTRSTACK.
&3 (also called $T1): PTRSTACK-2 pseudoregister. This register holds the second item of PTRSTACK.
&4 (also called $T2): PTRSTACK-3 pseudoregister. This register holds the third item of PTRSTACK.

The other 6 registers, $5 thru $7 and &5 thru &7, are GPRs (general purpose registers). In addition there are two more secondary banks of 8 registers each, $8 thru $15 and &8 thru 15, that can only be used as operands to some instructions (2-operand instructions, except for SYSINFO and 2-operand I/O instructions and RESERVED_P1_6).

When the Boot program terminates with a HALT instruction, &3 should hold the same value that it held upon the start of the program, and $3 should hold a result code (the convention is that 0 indicates success, unless the purpose of the program is to compute and return an integer, in which case the result code should be that integer).

If an instruction interacts with INTSTACK or PTRSTACK multiple times, the convention for the ordering of the mutations is: 1) first, all input operands are read, in the order operand 0, operand 1, operand 2 (or whichever subset of these are input operands), which is the same as reading the assembly language operands from right to left -- if any of these are $S or &S, they pop the relevant stack in turn. There is no 'short-circuit evaluation'; all input operands are read (and popped from smallstack if needed) 2) then, the instruction executes, perhaps interacting with a stack as an effect of its execution 3) finally, all output operands are written to, in the same order given by (1) (that is, right to left in Boot assembly). If an instruction only conditionally writes to some of its outputs, then giving it a destination operand of $S or &S is illegal for that instruction. If the instruction can trap, that is legal; if it does trap, then the stack is not pushed for outputs.

INTSTACK and PTRSTACK

These are fixed-length stacks. They each start the program with a depth of 1. They each have a maximum capacity of at least 16 items. There is no way to check the current depth of these stacks.

A stack underflow or overflow, that is, an operation that would make one of these stacks' depths, respectively, less than 1 (one stack element is always present so that the $T and &T registers are always meaningful) or larger than its fixed capacity, is illegal and the result is undefined behavior.

Each slot in INTSTACK can contain a 32-bit integer and each slot in PTRSTACK can contain a pointer.

The top 16 slots in the stacks can be directly referenced through the lsi/ssi/lsp/ssp instructions. An immediate operand of 0 to these instructions indicates the top of the stack (for example, in int_stack, lsi #0 recalls the same value as $t). Higher numbers refer to locations further down in the stack; these locations are expressed as depth offsets relative to the top of the stack. Depths past the bottom of the stack are legal, and you can store values to these locations, and these values are accessible. For example, if the stack is currently depth 1, the following sequence of code is guaranteed to end up with 111 in register 4:

li $s #22 ; push 22 onto the stack li 3 #111 ; load 111 into register 3 ssi 3 #14 ; store 111 into the stack location at depth 14 li $s #33 ; push 33 onto the stack lsi 4 #15 ; load 111 back from the same stack location that it was stored into

Note that even though 111 was stored into stack depth 14, and then loaded back from depth 15, because these depth offsets are relative to the top of the stack, and because the stack was pushed in between those two instructions, the location that depth 14 used to refer to is now referred to by depth 15.

However, as the stack gets larger, any value at a location deeper than #15 may (or may not) be lost; for example, if after executing the previous sequence of code, we were to do:

li $s #44 ; push 44 to the stack cp $0 $s ; pop the stack lsi 5 #15 ; load whatever is at stack depth #15 into register 5

Now an arbitrary value is loaded into register 5, because after we pushed 44 to the stack, the value 111 was at a stack depth greater than 15, and there is no guarantee that any values stored at stack depths greater than 15 remain stored. Perhaps register 5 will hold 111; perhaps it will hold 44; perhaps it will hold something else. There is no guarantee. This is done so that implementations can pre-allocate a fixed number of locations to hold the stacks.

Note: "user code" should act as if the capacity of the stacks is only 16; the other 16 locations are intended to be used by virtual machine compilation toolchains to implement macroinstructions.

Note: some implementations may implement these stacks as circular buffers of size >=32.

I/O

The in, inp, out, outp instructions read input and write output from/to 'devices'.

Some devices might only accept/emit ints, and some might only accept/emit pointers, and some might accept/emit both.

The devop instruction executes a device-specific control operation or metaoperation on a device.

The interpretation of devices and devops is platform-specific. It is suggested than device 0 be the console (stdin/stdout) on platforms that support that.

Note that official Boot tooling only supports device numbers up to 32767.

Memory allocation

The malloc, mdealloc instructions support allocation and deallocation of memory. Malloc pushes a result code to intstack.

Zero segment

When a Boot program starts, &0 points to a small block of pre-allocated memory called the zero segment.

The zero segment is guaranteed to be big enough to hold the larger of 16 PTRs or 16 INT32s.

Despite its name, the zero segment is not necessarily initialized and may hold arbitrary data at the start of the program.

The zero segment is shared memory. If a new Boot thread is spawned, it will inherit the same zero segment. If a new Boot process is spawned, it will get a new zero segment.

Calling conventions and returning conventions

Boot has 4 calling conventions.

TODO: too many... just look at how many r1_instructions and r0_instructions we need for this.

TODO: tail calls make callee-saved registers hard to use or useless, so need to change these to have more caller-saved.

Calling convention r12

Calling convention r12 passes up to 12 integer arguments and 12 pointer arguments in registers.

All registers aside from r0, r1, r2, r3 are used to potentially pass arguments (some might also think of these as caller-saved). Only pointer register r3 (which is suggested to be used for a memory stack) is callee-saved.

Upon returning from a call, one integer return value, found in integer register r3, will be passed.

This can be thought of as the calling convention used to initially call/start a Boot program.

Calling convention r5

Calling convention r5 passes up to 5 integer arguments and 5 pointer arguments in registers.

Register 3 (both banks) is callee-saved. Pointer register 3 is used as a memory stack pointer when applicable, otherwise it is callee-saved.

Integer register 4 is caller-saved. Pointer register 4 is used as a return address pointer/link register.

Registers 5, 6, 7, 8, 9 (both banks) are used to pass arguments and return values (see below) (from lower to higher number get arguments from left to right).

Registers 4, 5, 6, 7, 8, 9 (both banks) are caller-saved. Registers 3, 10, 11, 12, 13, 14, 15 (both banks) are callee-saved. The integer and pointer stacks are also callee-saved.

Upon making a call, up to 5 integer arguments are in integer registers 5, 6, 7, 8, 9, and up to 5 pointer arguments are in pointer registers 5, 6, 7, 8, 9.

Registers 4-9 are caller-saved registers and may be overwritten and used for any purpose by the callee.

Upon returning from a call, up to five integer and up to five pointer return values will be found in registers 5-9 using the same convention as for calling.

Calling convention rv

Calling convention rv passes a variable number of arguments in registers and in memory.

Calling convention rv is like calling convention r5, except that registers 10 and 11 (both banks) are used to pass pointers to additional arguments. The caller/callee status and function of all registers is the same as in calling convention r5 except for registers 10 and 11 (both banks).

Upon making a call, the first 5 integer arguments are in integer registers 5, 6, 7, 8, 9, and the first 5 pointer arguments are in pointer registers 5, 6, 7, 8, 9. The total number of integer arguments is in integer register 10 and the total number of pointer arguments is in integer register 11. If there are more than 5 total integer arguments then pointer register 10 contains a pointer to the remaining integer arguments, and if there are more than 5 total pointer arguments then pointer register 11 contains a pointer to the remaining pointer arguments.

Unconventionally, although registers 10 and 11 are involved in argument-passing, they are callee-saved.

If there are more than 5 integer or pointer arguments, respectively, then the space for the remaining arguments pointed to by pointers in pointer registers 10 and 11, respectively, are available to the callee for overwriting, similar to caller-saved registers (this allows for constant-space recursive 'tail calls'). The callee must not deallocate the memory pointed to by pointer registers 10 or 11.

Upon returning from a call, the convention for return arguments is the same as in calling convention r5. Up to 5 integer and up to 5 pointer arguments may be returned.

Calling convention s

Calling convention s passes up to 2 arguments on the stacks, and returns up to one return value on the stacks. Which stacks are affected dependends on the types of arguments being passed. The callee pops/consumes the arguments, if any, and pushes the return value, if any.

All registers are callee-save, and both stacks are callee-save, except for the arguments and return values.

Mnemonic shorthands for the possibilities are:

Calling:

s0: call with no arguments
si: call with one integer argument (on intstack)
sp: call with one pointer argument (on ptrstack)
sii: call with two integer arguments (on intstack)
spp: call with two pointer arguments (on ptrstack)
sip, spi: call with one integer and one pointer arguments (one on each of intstack and ptrstack) (sip is for calling external functions where the first positional argument is an integer and the second is a pointer, and spi is the reverse)

Returning:

s0: return with no return value
si: return with one integer return value (on intstack)
sp: return with one pointer return value (on ptrstack)

Interoperation

The xentry* instructions are a family of instructions which vary by calling convention. The instruction should be the first instruction in Boot subroutines intended to be called directly from an external source. It implements a platform-specific calling convention, and translates it to a Boot calling convention. The instruction takes no operands. The members of the family are xentrys0, xentrysi, xentrysp, xentrysii, xentryspp, xentrysip, xentryspi, xentryr5, xentryr12, xentryrv.

The xret* instructions are a family of instructions which vary by calling convention. The instruction returns to an external caller using a platform-specific calling convention. The program should be ready to return using a Boot calling convention, and this will be translated by xret into the platform's calling convention. xret takes one pointer operand, which is the return address. The members of the family are xrets0, xretsi, xretsp, xretri, xretr5. xretr5 should be used if xentryr5 or xentryrv was used, xretr12 should be used in xentryr12 was used, and otherwise one of the xrets* members should be used.

The xcall* instructions are a family of instructions. The instruction calls an external subroutine, translating a Boot calling convention into the platform calling convention. xcall* takes one pointer operand, which is a pointer to the subroutine being called. The members of the family are xcalls0, xcallsi, xcallsp, xcallsii, xcallspp, xcallsip, xcallspi, xcallr5, xcallr12, xcallrv.

The xpostcall* instructions are a family of instructions. The xpostcall* instruction should be placed immediately after the xcall* instruction if the Boot program expects that an external subroutine may return after an xcall. It translates the platform-specific after-call-return calling convention into the Boot return convention. The instruction takes no operands. After xpostcall, return values will be found in locations specified by the given Boot calling convention, and callee-saved registers and stacks will be restored to their original condition. The members of the family are xpostcalls0, xpostcallsi, xpostcallsp, xpostcallr12, xpostcallr5. xpostcallr5 should be used if xcallr5 or xcallrv was used, xpostcallr12 should be used if xcallr12 was used, otherwise one of the xpostcalls* members should be used.

Neither xcallrv nor xpostcallr5 allocate nor deallocate any memory for extra arguments which may be pointed to by pointer registers 10 and/or 11.

It is legal for a Boot program to call itself or other Boot code using xcall, provided that the called subroutine begins with a matching xentry and (if it returns to the caller) the subroutine ends with an xret and returns to a matching xpostcall instruction in the caller.

Concurrency

Unfortunately the memory consistency design of Boot was kind of thrown together and may be incorrect.

For Boot programmers: the simplest way

If you don't want to think about this stuff at all, only use memory operations with "sequentially consistent ordering". The memory operations with sequentially consistent ordering are:

lpa, lwa, spa, swa, casamo, addamo, apamo, andamo, oramo, xoramo, fence

(and you won't have any need for the 'fence' instruction if you restrict yourself to the memory operations are from this list)

These memory operations must only be used on aligned addresses. All addresses returned by malloc are guaranteed to be aligned. &0 is guaranteed to be aligned. Any address generated from an aligned address of the form new_addr = old_addr + a*ALIGN_SZ (todo: add ALIGN_SZ to SYSINFO), where a is an integer, is guaranteed to be aligned. TODO: no, align per data type, e.g. align ptrs to ptr_sz, ints to int_sz.

TODO: change this to align to the size of the operand, rather than ALIGN_SZ; e.g. 16-bit data should be 16-bit aligned, 32-bit data should be 32-bit aligned, etc

Also, put a fence instruction after you receive any newly allocated memory that you plan to use as shared memory, and before you free it, or before you transfer control to any external code that might access it.

If you do this (that means, for example, your program should not contain the LW instruction; you would always use LWSC in its place), then the Boot implementation guarantees that every execution of your program will be sequentially consistent, that is, the following property will always hold:

(SC) All memory accesses in all threads will always appear to take place in a single total order, all threads observe these memory accesses in this same total order, and the operations of each thread appear in that total order in program order.

In other words, the Boot implementation makes the following promise to you:

(guarantee: SC execution if only SC instructions) If your program doesn't contain any memory operations except for the sequentially consistent ones, and your program only operates on one memory_domain, then every execution of your program will be SC.

You can imagine that when each thread wants to perform a memory operation, it submits this request to a global queue, and the implementation goes thru and services these requests one by one in the order it received them. This is probably not how your platform actually works, but if you following the rules above, you can reason about your program as if it were.

For Boot programmers: a less simple but still recommended way

On many platforms, the use of sequentially consistent (SC) ordering instructions is significantly less efficient than their non-SC variants. A reasonable way to improve performance is given here. If you don't really care about performance, our advice is to not bother about all this and just do what the previous section suggests, and skip the rest of this section.

Still here? First, some definitions.

Definition of "sequentially consistent (SC) execution": An SC execution is one in which all memory accesses in all threads appear in a single total order, all threads observe these memory accesses in this same total order, and the operations of each thread appear in that total order in program order.

Definition of "competing accesses": a pair of potentially simultaneous accesses to the same memory location by different threads in which one of the accesses is a write.

Now, what we recommend is that you ensure that your program uses only sequentially consistent instructions for (both accesses in each) any possible competing access in any possible SC execution. Furthermore, in between any possibly competing access and any non-sequentially consistent memory instruction to the same location, place a FENCE instruction. If you do this, then the Boot implementation once again promises that your program will always enjoy sequential consistency, just as above. In other words, the Boot implementation makes the following promise to you:

(guarantee: SC execution if only SC competing accesses and FENCES) If, in every SC execution, every competing access involves only sequentially consistent instructions, and there is a FENCE in between every competing access and every non-sequentially-consistent memory access instruction, and your program only operates on one memory_domain, then every execution of your program will be SC.

So, for example, you can use the faster LW instruction to access a memory location (which doesn't have to be aligned) when you are absolutely sure that no other thread will ever access the same memory location at the same time (for example, for thread-local storage). If two threads might both simultaneously access a location, though, then it must be aligned, and both of them must use the slower LWSC instruction.

Note: this is our variant of what other languages call the "DRF-SC", "DRF0", or "SC for DRF" property.

(todo: is this condition sufficient? this seems like overkill with the FENCEs, is it? also can we make a less strict requirement than 'only one memory domain'?)

For Boot programmers: the most performant way (not recommended)

todo the following (and maybe some of the previous) is out of date

If you really need the highest performance, and you are willing to think really hard to get it, then you can use a wider set of memory access instructions at the cost of losing sequential consistency. We warn that programs without sequential consistency are really hard to reason about, even for experts. We recommend not doing this; instead, use one of the methods given in the previous two sections, and skip the rest of this section.

Still here? If you do do this, we recommend encapsulating the non-sequentially-consistent parts of your program within a library that provides concurrency primitives (for example, mutexes, or for example, parallel map/reduce), and within this library try to use only well-known algorithms proven to work in similar settings.

The word 'synchronization' is used in memory consistency models in a different way than in ordinary English. In ordinary English, to 'synchronization' typically refers to making events happen at the same time. In memory consistency models, 'synchronization' typically refers to PREVENTING events from happening at the same time! For this reason, we'll use the word 'order' where other authors would say 'synchronize' (e.g. 'ordering instructions' where other authors would say 'synchronization instructions' and 'ordered' where they would say 'synchronized'). We'll use the word 'relaxed' for instructions that you might expect ordering from but that don't provide any (nor do they anti-order, they just don't affect it), and we'll still use the word 'asynchronous' as other authors use it.

(an etymological aside: this makes more sense if you imagine that we generalize 'synchronization' to mean merely that there is some constraint on the relative ordering of events (just a guess, but perhaps this peculiar phrasing came about via the the intermediate concept of a synchronous circuit, in which a synchronous (in the ordinary English sense) global clock signal is used to effect various constraints on the ordering of various events). Simultaneity becomes merely one constraint among many that might be imposed upon an ordering of events. Most often in memory models the constraint imposed by 'synchronization' is the negation of simultaneity through a happens-before relation; so two events are said to be ordered by synchronization when we know that they cannot happen at the same time. This is almost the opposite of the word's use in ordinary English! I say almost because in both ordinary English and memory consistency models, 'desynchronized' might be used refer to events that have no constraints on their relative ordering; although even here, in ordinary English, 'desynchronized' might also be used to refer to events that are definitely not simultaneous.)

Definition of "data race": in some SC execution, there exists a pair of competing accesses, and neither of these accesses are by atomic instructions; and neither of these events happens-before the other.

Data races are undefined behavior. (our motivation for specifying it this way is that C specifies that (see http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4296.pdf page 14, 1.10.23), and we want it to be legal for a Boot compiler to blindly compile Boot to C without the Boot compiler having to check for the absence of data races first)

Definition of "data-race-free": A program is data-race-free if there are no data races.

All memory operations are atomic except for the following non-atomic operations: byte and halfword loads/stores; also lp, lw, sp, sw.

The 'rlx' instructions ({casamo, lp, lw, sp, sw}rlx) are atomic but relaxed. They do not affect the happens-before ordering.

The 'rc' instructions ({lp, lw, sp, sw}rc) have RCpc semantics (Release Consistency with Processor Consistency between the timing operations themselves). sprc and swrc have release semantics, meaning that no memory operations can be reordered from before to after the releasing instruction (that is, every memory operation in the same thread that occurs before the execution of the releasing instruction in program order happens-before the releasing instruction), and furthermore, all writes in the current thread before the releasing instruction become visible to other threads that do an acquiring instruction on the same memory location afterwards (that is, if the releasing instruction happens-before any acquiring instruction at the same memory loation, then every memory operation in the same thread as the releasing instruction that happens-before the execution of the releasing instruction also happens-before the acquiring instruction in the thread with the acquiring instruction). lprc and lwrc have acquire semantics, meaning that no memory operations can be reordered from after to before the acquire operation, and furthermore, all writes in any other thread that executed a releasing instruction before the acquire instruction become visible to the thread doing the acquring instruction. Note that the acquiring and releasing instructions themselves have Processor Consistency, meaning that all acquiring and releasing instructions in any one thread are observed by all threads in program order, however different threads may observe different orderings of acquiring and releasing instructions that are executed by distinct threads.

The sequentially consistent instructions TODO.

The FENCE instruction prevents reordering across the fence; that is, for every pair of instructions in the same thread as the fence with some instruction before the fence in program order with some instruction after the fence in program order, the instruction before the fence happens-before the instruction after.

The sequentially consistent ordering instructions (a list of these is given above) have the following properties:

each of these instructions has both acquire and release semantics with respect to the memory location &addr (except for fence, which has no &addr operand (todo does fence have acquire/release semantics with respect to all memory operations in the memory_region (that is, it's as if it had an &addr operand with all &addrs in the memory_region at once)? i think it does, but i'm just wondering about the fact that an acquire orders with a release in other threads; does FENCE imply similar cross-thread order on all of C, x86, ARM, RISC-V? I think so, but should check)

Definition of sequentially consistent ordering: For the instructions with sequentially consistent ordering, (a) and in addition (b) each time the program is executed, for each memory_domain a single total ordering of all executions of any of these instructions in that domain exists, and all threads observe these executions in that same global ordering.

casamorlx is provided for situations in which CAS functionality is desired without ordering; relaxed versions of the other atomics can be built using the casamorlx primitive, see [2].

The platform is not allowed to reorder the effects of instructions within a single thread in any way that would be visible in single-threaded programs.

todo talk about torn writes, reordering of instructions, and different threads observing different reorderings.

Implementing Boot

todo

Informal description of relevant primitive memory operations

TODO: describe each of these in a less formal way (less formal than the semiformal model below)

This section is intended to say the same thing as the parts of the section "semi-formal memory consistency model" that defines the semantics of the instructions, but in a less formal way, and with more examples.

In this section, when we talk about instruction reorderings, we are speaking dynamically, not statically; that is, a single static instruction in a certain line of code may be executed many times over the course of a thread or program, and we are considering each of these dynamic instances of execution as separate.

Non-atomic load/store: lw, sw

lw: non-atomic load
sw: non-atomic store

These are ordinary load/store with no provisions for concurrency. On some platforms they are faster than atomics, however if one thread could access some memory location while another thread is writing to it, and if either of the two threads are using non-atomic load/stores, then the result is undefined behavior. Therefore, these should not be used to access memory which is being shared, except when other concurrency instructions are being used to introduce ordering constraints in such a way so as to ensure that there can be no simultaneous access.

From the perspective of the thread executing them, lw/sw are always observed to execute in program order. However, from the perspective of other threads, an lw or sw could in some cases be reordered, and in some cases different threads could observe different reorderings. Other concurrency instructions can be used to block some or all of the possible reorderings.

In any case, for any given memory location, the sequence of writes to that memory location is globally observed by all threads to occur in the same order (we call this 'coherence').

Atomic load/store: lw.a, sw.a

lw.a: atomic_load(memory_order_seq_cst)
sw.a: atomic_store(memory_order_seq_cst)

These are sequentially consistent load/store. They are the slowest way to load/store, but if you need to access a memory location which is being shared, these are the safest way to do so. Most programmers should always use only these for loads/stores of memory which might be accessed concurrently by multiple threads.

They are atomic, which means that even if one thread accesses a memory location while another thread is writing to it, one of the accesses will execute to completion before the other begins (undefined behavior does not result).

They are sequentially consistent, which means that you can imagine that they, along with all other sequentially consistent operations in the program, execute in some sequential order, that this same sequential ordering is observed by all threads, and that the operations of each thread appear in this sequence in program order.

"Synchronization operations" refers to the subset of instructions lw.a.rc, sw.a.rc, lw.a, sw.a, casrmw, casrmw.rc. Synchronization operations in the same thread may not be reordered with each other.

lw.a is an acquire operation and sw.a is a release operation. Instructions after lw.a may not be reordered before the lw.a. Instructions before sw.a may not be reordered after the sw.a. If lw.a takes its value from sw.a (or from any other release operation), then that release operation is ordered before lw.a; a release operation ensures that previous writes are visible to any other thread that later acquires on the same memory location. These instructions operate in a paradigm called RCsc, short for "Release Consistency with Sequential Consistency between synchronization operations".

Atomic load/store, Release Consistency: lw.a.rc, sw.a.rc

lw.a.rc: atomic_load(memory_order_acquire)
sw.a.rc: atomic_store(memory_order_release)

These are Release Consistency load/store. On some platforms they are slower than lw.a.rlx and lw but faster than lw.a. They are difficult to reason about and hence their use is discouraged.

These are not sequentially consistent, and therefore two threads may observe different orderings of operations.

For example, the IRIW (Independent Reads of Independent Writes) anomaly is permitted with these operations (although it would not be permitted when all operations are lw.a, sw.a):

x := 1

a := x 1		b := y 1		y := 1
c := y 0		d := x 0

In more detail, consider the following IRIW program consisting of the following 4 threads:

Thread 0: sw.a.rc 0, &X ldi $5, #1 sw.a.rc $5, &X

Thread 1: sw.a.rc 0, &Y ldi $5, #1 sw.a.rc $5, &Y

Thread 2: lw.a.rc $5, &X lw.a.rc $6, &Y

Thread 3: lw.a.rc $6, &Y lw.a.rc $5, &X

At the end, it is possible for Thread 2 to have ($5, $6) = (1, 0) and also simultaneously for Thread 3 to have ($5, $6) = (0,1). That is to say, Thread 2 observed Thread 0's write of #1 to X Thread 0 to execute before Thread 1's write of #1 to Y, yet Thread 3 observed the opposite ordering.

These instructions operate in a paradigm called RCpc, short for "Release Consistency with Processor Consistency between synchronization operations". "Release Consistency" means that the operations have acquire/release semantics. "Processor Consistency" means that although each thread's order of synchronization operations within the thread is observed globally by all other threads, different threads may perceive different interleavings of synchronization operations between threads. "Synchronization operations" refers to the subset of instructions lw.a.rc, sw.a.rc, lw.a, sw.a, casrmw, casrmw.rc.

Note that, although writes to different memory locations may be reordered with respect to each other, writes to the same memory location may not; that is, for each memory location, all threads globally observe all writes to that one memory location in the same order (we call this 'coherence').

"Synchronization operations" refers to the subset of instructions lw.a.rc, sw.a.rc, lw.a, sw.a, casrmw, casrmw.rc. Synchronization operations in the same thread may not be reordered with each other.

lw.a.rc is an acquire operation and sw.a.rc is a release operation. Instructions after lw.a.rc may not be reordered before the lw.a.rc. Instructions before sw.a.rc may not be reordered after the sw.a.rc. If an execution of lw.a.rc takes its value from sw.a.rc (or from any other release operation), then that release operation is globally observed by all threads to occur before that lw.a.rc; a release operation ensures that previous writes are visible to any other thread that later acquires on the same memory location.

Atomic load/store, Relaxed: lw.a.rlx, sw.a.rlx

lw.a.rlx: atomic_load(memory_order_relaxed)
sw.a.rlx: atomic_store(memory_order_relaxed)

These operations are atomic, but do not impose any ordering constraints. They are neither acquires nor releases. They are just like lw and sw, except atomic (and so they can be used to prevent data races. Recall that a data race requires two accesses to the same memory location which are potentially simultaneous, and at least one of these accesses is non-atomic).

CAS: casrmw

casrmw: atomic_compare_exchange_weak(memory_order_seq_cst)
casprmw: atomic_compare_exchange_weak(memory_order_seq_cst)

casrmw Reads the memory location &dest, and compares it to the argument $expected. If the value read from memory matches the $expected, then it may write the $new argument to memory location &dest; we say 'may' because it's also possible that it fail to do this. If it wrote the $new argument to memory, it returns $new, otherwise it returns the value read from &dest. All of these things (the read from memory, the comparison, and potentially the write to memory) are done atomically, which is why we call this an atomic RMW operation.

It is guaranteed that this operation will not consistenly fail to write the other argument to memory if the value at the memory location is indeed equal to $expected, and there are not concurrent modifications to that memory location.

The reads and writes to memory are sequentially consistent.

casprmw is the same except that the parameters 'expected' and 'new', and the return value, are pointers instead of integers.

CAS, release consistency: casrmw.rc

casrmw.rc atomic_compare_exchange_weak(memory_order_acq_rel)
casprmw.rc atomic_compare_exchange_weak(memory_order_acq_rel)

Same as casrmw and casprmw except that the reads and writes to memory have release consistency instead of sequential consistency.

CAS, relaxed: casrmw.rlx

casrmw.rlx: atomic_compare_exchange_weak(memory_order_relaxed)
casprmw.rlx: atomic_compare_exchange_weak(memory_order_relaxed)

Same as casrmw and casprmw except that the reads and writes to memory have relaxed consistency instead of sequential consistency.

fence

fence(memory_domain): atomic_thread_fence(memory_order_seq_cst), with an extra memory_domain argument indicating which shared memory domain to sync

Fences prevent reordering across the fence within the given memory_domain; that is, no operation that occurs before the fence can be reordered after another operation after the fence, if both of these operations are in the same memory_domain. Fences are sequentially consistent, which means that they stand in a total order with all other sequentially consistent operations in each memory domain.

Semi-formal memory consistency model

Unfortunately, this was just sort of thrown together, and is likely to be inconsistent. Feedback welcome.

The 12 primitive memory operations in Boot, and the C/C++ operations that they are most similar to, are:

lw: non-atomic load
lw.a: atomic_load(memory_order_seq_cst)
lw.a.rc: atomic_load(memory_order_acquire)
lw.a.rlx: atomic_load(memory_order_relaxed)
sw: non-atomic store
sw.a: atomic_store(memory_order_seq_cst)
sw.a.rc: atomic_store(memory_order_release)
sw.a.rlx: atomic_store(memory_order_relaxed)
casrmw: atomic_compare_exchange_weak(memory_order_seq_cst)
casrmw.rc atomic_compare_exchange_weak(memory_order_acq_rel)
casrmw.rlx: atomic_compare_exchange_weak(memory_order_relaxed)
fence(memory_domain): atomic_thread_fence(memory_order_seq_cst), with an extra memory_domain argument indicating which shared memory domain to sync

program-order, single-threaded semantics, memory_domain, happens-before

We define "program-order" to be the per-execution, per-thread, dynamic (rather than static) partial order in which instructions are specified to be executed by the program. For example, in the infinite loop program 'cpy $5 $6; jpc -4', the program-order is "cpy $5 $6; jpc -4; cpy $5 $6; jpc -4; ...".

For any Boot program without undefined behavior, the semantics of any thread of execution is given by its semantics if considered as a single-threaded program, except that which values are returned by reads from shared memory are defined here in this memory consistency model. (we call this the single-threaded-program-order property).

Each location in memory belongs to exactly one memory_domain.

For each execution, for each memory_domain, we introduce a partial order 'happens-before'. Informally, happens-before represents the ordering between events that must be respected by every execution of a program and observed by every thread. In the following, to be concise, we omit the 'for each memory_domain' quantifier before each statement regarding happens-before, and we omit the memory_domain subscripts distinguishing the happens-befores.

program-order

happens-before is transitively closed; that is to say, if x happens-before y and y happens-before z, then x happens-before z.

Which values can be read by reads

For any read r, if there is some write w to the same location, then r must take its value from w if both of the following are satisfied by w:

w happens-before r
there does not exist any other write w' to the same location s.t. (w happens-before w') and (not r happens-before w')

If for any read r, if there is no write w meeting the above conditions, then r must take the value from some write w to the same location such that all of the following are satisfied by w:

not (r happens-before w)
there does not exist any other write w' to the same location s.t. (w happens-before w') and (w' happens-before r)
if r and w are from the same thread, then, in program-order, w is the most recent write to that memory location preceding r
if earlier in the thread of r (earlier in program order) there is a prior read r' of the same location which read some write w', then w is either the same as w' or later than w' in the coherence order of the location

In case the program contains some reads with no corresponding writes for them to read, we introduce dummy writes of random values at the beginning of the program to every memory location. Each of these dummy writes happen-before every other read and write in the program.

Coherence

In any given execution, for every memory location, all writes to that memory location are totally ordered, and this total order appears within the happens-before partial order (this property is called coherence).

Informally, note that all operations within any single thread are, from the perspective of that thread, totally ordered (by the single-threaded-program-order property), and also that all operations to any single memory location are totally ordered (by the coherence property). So, all that remains to be done is to specify the relationships between events that span multiple threads and multiple memory locations.

Classification of memory operations

The 'atomic' primitive operations are lw.a, lw.a.rc, lw.a.rlx, sw.a, sw.a.rc, sw.a.rlx, casrmw, casrmw.rc, casrmw.rlx, fence. 'acquire' primitive operations are lw.a, lw.a.rc, casrmw, casrmw.rc, fence. The 'release' primitive operations are sw.a, sw.a.rc, casrmw, casrmw.rc, fence. The 'RMW' primitive operations are casrmw, casrmw.rc, casrmw.rlx, fence. The sequentially consistent primitive operations are lw.a, sw.a, casrmw, fence.

Acquire and release operations

For any two instances of operations in the same thread r1, r2, if r1 is either a 'require' or a 'release', and r2 is either a 'require' or a 'release', and r1 is before r2 in program-order, and r1 and r2 operate on the same memory domain, then r1 happens-before r2.

For every instance 'a' of an acquire instruction in an execution, and for every other instruction i in that same thread, and that is a memory operation on the same memory domain, and that is after the 'a' instruction in program-order, a happens-before i.

For every instance 'r' of a release instruction in an execution, and for every other instruction i in that same thread, and that is a memory operation on the same memory domain, and that is before the 'r' instruction in program-order, i happens-before r.

For every acquire 'a' that takes its value from a release 'r', r happens-before a.

RMW operations

The read r and write w generated by a single RMW operation have the properties (this is called the RMW-atomic property):

r happens-before w
for every other write w' to the same location, either (w' happens-before r), or (w happens-before w')

Sequentially consistent operations

In any given execution, within each memory domain, all instances of the sequentially consistent primitive operations on that memory domain are totally ordered, and whenever one sequentially consistent primitive operation is before another one in the same thread on the same memory domain in program-order, it is also before the other one in this total order. For each memory domain, this total order is contained within happens-before.

fence

A fence has the semantics of "lw.a $0 &dummy; sw.a $0 &dummy" (todo doublecheck syntax) where &dummy is a distinguished (dummy) memory location within the given memory_domain that cannot be accessed in any other way. Recall that fence also is all of an acquire, a release, a RMW, and a sequentially consistent primitive.

thread-local memory_domains

Some memory_domains have the 'thread-local' property. The thread that allocates a thread-local memory domain owns it. If any thread-local memory_domain is ever accessed by any thread other than the thread that owns that memory_domain, this is undefined behavior.

Data races; programs with data races are undefined

A data race is when there are two instructions in different threads, at least one of which is a write or RMW, at least one of which is non-atomic, which are not ordered by happens-before. A data race is undefined behavior.

The behavior of (any program whose possible executions contain any execution that contains undefined behavior) is undefined.

Programs with OOTA are invalid

An OOTA (out-of-thin-air) cycle is a read instruction r that takes its value from a write instruction w whose value syntactically depends on r. By 'syntactic' is meant that syntactic dependencies are determined by static form of the program, regardless of whether it is possible to realize the dependency in any possible execution. We include control dependencies in 'syntactic'. Programs with possible executions with OOTA cycles are invalid (an implementation may refuse to compile them, or may halt while executing the program with a runtime error) but not undefined (if an implementation does not halt with an error, it must execute the program according to the specified semantics).

Progress

No memory operation may be preceded in happens-before by an infinite sequence of other memory operations.

For each memory location, the latest value (in the total order guaranteed by coherence) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.

SC for DRF

A 'non-SC race' is a generalization of 'data race' where 'non-atomic' is replaced by 'non-sequentially consistent'.

A sequentially consistent execution is defined as an execution for which all memory operation execute in a total order, and for each thread, that thread's program order is contained within this total order.

If no sequentially consistent executions contain any non-SC races, then only sequentially consistent executions are permitted.

Memory model discussion and comparison to other languages' models

The memory operations follow C/C++20 and consist of non-atomic loads/stores, loads/stores/RMWs in each of {relaxed, RCpc, and RCsc} memory orders, and a sequentially-consistent/strong/cumulative 'fence'.

(in the previous sentence, where I said RCsc, I meant operations with C/C++ memory_order_seq_cst, and where I said RCpc, I meant loads/stores/RMWs with memory_order_acquire, memory_order_release, and memory_order_acq_rel).

Another way to say this:

The memory operations follow C/C++20 and consist of:

non-atomic loads/stores
atomic_load(memory_order_acquire)
atomic_store(memory_order_release)
CAS_weak(memory_order_acq_rel)
atomic loads/stores/weak-CAS each in each of the {relaxed, seq_cst} memory orders
fence(memory_order_seq_cst)

We provide all of the memory_orders that C/C++20 does, except for memory_order_consume. That is, we provide relaxed, acquire, release, acq_rel, seq_cst; although we don't provide the same amount of flexibility that C/C++ does to mix-and-match operations and memory orders.

'lw.a', 'sw.a', 'casrmw' are seq_cst (sequential consistency memory order); like C/C++'s memory_order_seq_cst, these also imply acquire, release, and both acquire and release, and so they represent the RCsc paradigm (Release Consistency, with Sequential Consistency between the synchronization operations).
'lw.a.rc', 'sw.a.rc', 'casrmw.rc' are C/C++'s acquire, release, acq_rel memory order, respectively; these represent the RCpc paradigm (Release Consistency, with Processor Consistency between the synchronization operations).
lw.a.rlx, sw.a.rlx, casrmw.rlx are relaxed_memory_order; these are atomic, but impose no ordering; lw.a.rlx and sw.a.rlx can be treated the same as non-atomic lw, sw, except that since they are atomic, they don't cause data races.
The only 'fence' provided is a sequentially consistent one; it is full/strong and cumulative.

For lw and sw, the 'default' (shortest mnemonic) is non-atomic, the weakest one; this is a concession to readability because most loads/stores will be non-atomic; however, once you get into the atomics, the 'default' (shortest mnemonic) is the strongest, and the longer the mnemonic gets, the weaker the memory_order gets. This is intended to serve as a guide or memory aid for the programmer; stick with the atomics with the shorter mnemonics unless you are sure you want the other ones.

I defined 'fence' to be a C atomic_thread_fence with memory_order_seq_cst, but then I also defined it as equivalent to a dummy acquire/release pair. I'm not sure if C fences are actually equivalent to that; 'Strengthen SC fences' in [3] suggests that they are defined independently because it suggests changing the definition of fences without touching the definition of acquire/releases, and also the RC11 paper says "Among these, Lahavet al.[17] proposed strengthening the semantics of SC fences in a different way from the way we do here, by treating them as read-modify-writes to a distinguished location. That strengthening, however, was considered in the restricted setting ofonly release/acquire accesses, and does not directly scale to the full set of C11 access modes. In fact, for the fragment containing only SC fences and release/acquire accesses, RC11-consistency is equivalent to RA-consistency that treats SC fences as RMWs to a distinguished location". I don't fully understand that, I don't know what the difficulty about scaling "to the full set of C11 access modes" was, or if that difficulty applies here, so unless/until I learn more, I'll leave it like this.

I took the definition of guaranteed and allowed reads from Golang, even though the memory operations are adapted from C/C++. Those are probably incompatible in some way that I didn't notice. Also, compared to Golang, i weakened guaranteed reads so that there is no guarantee if there is some other w' which is happens-before-incomparable to r (w' cannot be incomparable to w because, by coherence, all writes to one memory location are totally ordered), and then i changed the RMW axiom to match. Golang's guaranteed reads would have had "there does not exist any other write w' to the same location s.t. (w happens-before w') and (w' happens-before r)", and i changed it to "there does not exist any other write w' to the same location s.t. (w happens-before w') and (not r happens-before w')".

Another model i saw somewhere apparently had something to the effect of "SC read operations cannot read from overwritten writes" [4]. I think this is provided (not just for SC) by my 'allowed reads' axiom.

Regarding the out-of-thin-air problem (OOTA) my thoughts are (a) it's an open research problem, (b) C/C++20 doesn't say much about it, and we want to be able to compile to C/C++. My guess is that a program that admits OOTA executions should be considered invalid; usually in Boot we have to say that invalid programs are undefined behavior, because we want to be able to compile to C and C says that; but here we have the freedom to not say that because C/C++ doesn't say that OOTA programs are invalid. Note that in order to fully formalize the semi-formal definition that i've given, you have to get into dependencies, which i think are probably not worth the effort to formalize.

We provide a weak CAS but not a strong CAS so as to make implementation easy/possible on platforms without strong CAS.

I'm not sure if i specified 'program order' or the single-threaded-program-order property very well. I mean to say that no reorderings are allowed unless they would be correct in a single-threaded program. Or maybe that each thread sees its own reads and writes in program order.

Note that, outside of the single-threaded-program-order property and OOTA, there is no dependency ordering here.

I've included an 'SC for DRF' guarantee. Rather than requiring only SC synchronization operations, i guaranteed SC as long as all conflicting accesses including one non-SC synch operation are ordered by happens-before. I'm not sure if a 'SC for DRF' theorem holds here (my intuition is that it does), but in any case rather than leave that for a theorem I just included it in the specification.

Some 'litmus tests' and how (I think) they play out under this spec: IRIW (independent reads of independent writes) will be permitted whenever the sequentially-consistent operations aren't used to rule it out. Roach motel movement of stuff outside of any acquire..release critical section into an adjacent critical section will be permitted. RCpc (lw.a.rc) acquire..releases to different variables can overlap, RCsc (lw.a) ones cannot.

I'm not sure if FENCE is really needed but i'm puzzled about how else to efficiently deal with the situation in which your code calls other concurrent code that makes relaxed accesses to some various memory location within a block of memory, and then later you want to deallocate that memory. You need to have some way to make sure that you order all of those earlier relaxed accesses before you free the memory, and you don't really know which relaxed accesses were made because that wasn't your code making them, and you don't want to have to iterate over the whole block and do an acquire/release on each location within it.

I'm disappointed that i ended up including so many memory_orders because i think it makes the model more complicated, but i think they each have a reason to be there. Non-atomic loads/stores are for ordinary, non-concurrent loads/stores. RCsc (lw.a, casrmw) is for when the programmer is doing concurrent programming but doesn't want to reason about weak memory models (which should be the vast majority of the time). RCpc (lw.a.rc) is for when the programmer needs to order stuff between two threads but doesn't need to incur the cost of forcing all threads to stand in a global ordering. Relaxed is for when the programmer doesn't need any ordering at all; for when they just want something like a non-atomic load/store but atomic.

Note: one difference between us and C/C++11 is that we don't support C/C++11's notion of 'release sequences'.

TODO: acquire-release chains are local (but cumulative) TODO: acquire/release can be thought of as delineating critical sections. Stuff outside of the critical sections can leak in (be reordered in), but not vice versa.

RISC-V and ARMv8 have something a type of consistency that not well represented in this model, called "multicopy-atomic" or "multicopy-atomic-other" in which each write may be observed by the writing thread sooner than other threads observe it, but whenever any other thread observes a write, all other threads must then also observe that write (the intuition is that each thread can have its own thread-local writeback cache, but there can't be any writeback caches shared between multiple threads; all writes have to hit main memory before any thread besides the writer can see them, which induces a total ordering between all writes that hit main memory). These architectures get a much simpler memory model by disallowing the need to represent the possiblity of IRIW-ish scenarios in which various subsets of threads can each observe different orderings of writes; for example, most of RISC-V's model simply has to specify which orderings from Program Order are 'preserved' into a global ordering. To blindly and safely compile code written for those memory models to this one, i think you'd have to use the SC operations in this model, which are less performant than the non-SC-but-multicopy-atomic operations of RISC-V and ARMv8. I would have liked to extend this memory model to include operations that are non-SC-but-multicopy-atomic; I believe this would require the introduction of another total order to represent the total ordering of write operations at the time they hit main memory, and then modification of the which-reads-are-guaranteed-and-allowed axioms to respect that order for inter-thread reads while also allowing each thread to observe its own writes earlier. I didn't add that in because (a) i'm running out of opcodes, (b) i doubt anyone will want to blindly compile RISC-V and ARMv8 code to this, and (c) we already have what i regard as the bare essentials; local memory operations, sequential consistency, shared memory operations that don't force any ordering, and some way to synchronize between threads without forcing full sequential consistency (d) compared to RCpc, which is the mode in my model that provides-some-ordering-but-not-SC, multicopy-atomic-other seems to me to be less appropriate for massive parallelism, because there is still something global that has to be synchronized over all threads and all memory locations; so i don't want to get rid of RCpc to make room for multicopy-atomic-other, because i want to focus on massive parallelism (e) one of the main benefits of multicopy-atomic-other is much simpler models, but by including weaker modes our model would remain complicated anyways (f) disallowing a multicopy-atomic-other mode will force some algorithms to use our SC primitives where multicopy-atomic-other would have done, which will mean inefficiencies when Boot code is run on multicopy-atomic-other platforms, but higher-level concurrency libraries that really want to squeeze out the last drop of performance could always just bypass the Boot layer and use platform-specific code if they want the highest performance.

The first Progress axiom (about a finite number of memory actions) is taken almost verbatim from RISC-V, and the second (about a finite amount of time before visibility) is taken almost verbatim from C++.

I didn't include program-order in happens-before. I think other memory models include program-order in happens-before (e.g. https://www.hboehm.info/misc_slides/10-pldi-adve-boehm-tutorial.pdf ). But then how do you say that you can't reorder SC operations but you can reorder some other operations? What i did above is create happens-before edges for operations in the same thread that can't be reordered, and make another axiom that says that single threads must observe the same semantics as usual w/r/t reading their own writes. So i'm probably missing something here.

To the allowed-reads axiom, i added the conjunct:

if earlier in the thread of r (earlier in program order) there is a prior read r' of the same location which read some write w', then w is either the same as w' or later than w' in the coherence order of the location

(i'm not sure if this is required b/c i didn't see it elsewhere yet; many other models have the program order be included in happens-before, maybe that subsitutes for this? Not sure)

'atomic' operations are these which (appear to) execute atomically, that is, they never (appear to) overlap in time with any other operation.

todo:

seqcst lets you see your program as a simple interleaving of threads

acquire/release is good for acquiring and releasing locks and in this way making sure critical sections dont overlap

TODO

the following is out of

casamo is intended to be like C++14's std::atomic<>::compare_exchange_strong() (with memory_order_seq_cst)
casamousync is intended to be like C++14's std::atomic<>::compare_exchange_strong() with memory_order_relaxed
fence is intended to be like:
- RISC-V's FENCE rw,rw instruction
- C++14's atomic_thread_fence with memory_order_seq_cst
- x86's MFENCE instruction
- ARMv7's DMB instruction
- Power's SYNC instruction
lwrc and lprc are intended to be like C++14's std::atomic<>::atomic_load with with memory_order_acquire
swrc and sprc are intended to be like C++14's std::atomic<>::atomic_store with with memory_order_release
adda, apa, anda, ora, xora are intended to be like C++14's atomic_fetch_{add, and, or, xor} (with memory_order_seq_cst), or like RISC-V's amoadd, amoand, amoor, amoxor with both aq and rl flags set

An implementation is of course perfectly free to implement stronger concurrency guarantees than required. For example, an implementation is free to have a global interpreter lock (todo is that enough?), or to surround each and every load and store (and the loads/stores within the implementation of CAS) in the program with full sequentially-consistent rw,rw memory barriers (todo is that enough?). Such techniques allow a Boot implementor to provide correctness without having to learn about and understand the details of memory consistency models/memory orders, at the cost of removing some or all of the efficiency gains presented by opportunities for concurrency.

(todo, mb insert the RVWMO axioms here? except that for us, do we want multi copy atomicity to apply only to ordered things? also, what about atomic, relaxed load/stores, are they "ordering instructions" that can't participate in data races?)

No thin-air reads for data-race-free programs: Boot implementations promise that for data-race-free programs, there are no thin air reads or self-satisfying conditionals. Apparently there's not yet an academic consensus on how to best formally define these in such a way that the undesirable paradoxical behavior is ruled out without ruling out acceptable optimizations, although in practice these are not observed on contemporary platforms [5]. For these reasons we forbid these things without defining them. If i had to take a guess at defining what we are excluding, however, i would say that we require that the causal chain determining the value returned by any memory read operation is acyclic. (i think this is implied by DRF-SC? but just being clear here. todo)

Even without any atomicity or ordering, within any shared memory_domain:

a sequence of writes to any single memory location from a single thread are globally observed in program order. TODO do we really want to make this guarantee? Probably; i think every mainstream kind of hardware shared memory has this. But one could imagine a Boot memory abstraction over a distributed system with an unreliable network or with network delays that does not support this (but which does support atomicity).
a reads or writes in a single thread are ordered if there is a dependency between the reads or writes (see the RISC-V RVWMO spec for definitions of dependencies, although of course the table of which RISC-V instructions have which dependencies does not directly apply here). TODO do we really want to make this guarantee? It imports a lot of complexity to our specification. Otoh "Exploring the Trisection of Software, Hardware, and ISA in Memory Model Design" took issue with an earlier RISC-V memory model that apparently didn't include dependencies, and C14 added a prohibition on out-of-thin-air executions that were permitted in C11, which suggests that others find dependency-less memory models unworkably weak; but see also the following note to C's Release-Consume ordering: "Note that currently (2/2015) no known production compilers track dependency chains: consume operations are lifted to acquire operations. " [6].

We generally follow the definitions and semantics in the RISC-V RVWMO specification, adapted in the obvious way for our instructions (e.g. that specification speaks of specific RISC-V operand fields, which are different from ours). todo: well maybe not quite; do we guarantee multi copy atomicity even on relaxed accesses? Probably not

Blocks of memory are said to be either thread-local or shared. malloc_local returns thread_local memory and malloc returns shared memory.

TODO: we need to guide implementors how to implement these on (a) a system with only FENCE rw,rw and similar and no release/acquire loads/stores, and (b) a system with only release/acquire loads/stores and no FENCE rw,rw, and (c) a system with only (SC?) locks, and no release/acquire loads/stores and no FENCE, and (d) on x86_64, and (e) on ARM, and (g) on Java, and (h) on CIL. mb see also https://www.cs.tau.ac.il/~orilahav/papers/popl16.pdf

TODO: we need to guide programmers how to implement: (a) something like RISC-V FENCE rw,rw using Boot; (b) something like C atomic_fence SC using Boot; (c) SC locks using Boot. mb see also https://preshing.com/20131125/acquire-and-release-fences-dont-work-the-way-youd-expect/

To blindly/conservatively implement Boot on most platforms:

find the instruction corresponding to FENCE (see table above for RISC-V, C, x86, ARM).
implement casamorlx (most platforms either have a CAS (eg. LOCK CMPXCHG on x86) or they have ll/sc (eg. LR and SC on RISC-V); with ll/sc you can do something like loop: LL; compare or CONTINUE; swap; SC; BNEZ loop; if a platform has neither but has locks, then implement it using a lock)
implement casamo by surrounding casamorlx with a FENCE on either side
implement addamo, apamo, andamo, oramo, xoramo from casamo using the construction [7], and then surrounding with a FENCE on either side
implement all of lwa lpa lwrc lprc lwrlx lprlx as lw surrounded with a FENCE on either side, and similarly for stores (TODO is this correct?)
implement malloc_local as just malloc

To blindly/conservatively write a Boot program without worrying about this stuff, only use atomic loads and stores on memory that might be shared, and insert FENCE instructions before and after every memory operation (load, store, casamo/casamorlx, addamo, apamo, andamo, oramo, xoramo, plus malloc etc). (TODO is this correct?)

todo: say that we have coherence (and dependencies?)

Undefined behavior

When a program contains undefined behavior, no guarantees are provided on what the implementation will do with that program. The implementation might refuse to compile the program, or the program could crash at runtime, or it could run but return weird results, or corrupt memory, or even execute undesired commands such as reformatting a disk (this could happen, for example, if a function pointer was corrupted and then jumped to, and the corrupted value happened to point to a disk I/O subroutine).

Note that the part of the program containing undefined behavior doesn't even have to execute for the undesired results to occur; the mere fact that some potential execution of the program could execute undefined behavior makes the behavior of the entire program undefined.

Our motivation for this sort of specification is that we want it to be legal for a Boot implementation to blindly compile a Boot program to a lower-level platform such as C or assembly without checking the Boot program for bugs first, and the most straightforward translation of buggy code could have unpredictable results on some platforms. For example, in C, the C standard defines signed arithmetic overflow to have undefined behavior, so a Boot program with signed arithmetic overflow could legally be compiled to a C program with signed arithmetic overflow, and that C program would have undefined behavior. For another example, a data race could cause memory corruption, and memory corruption has unpredictable effects on various platforms. The reason that we specify that the behavior of the whole program becomes undefined if any part of it contains undefined behavior is that some target platforms (for example, C compilers) may assume the correctness of the program given to it and perform global optimizations that are incorrect otherwise. We can't do anything about how the target platforms work and we don't wish to force Boot implementations to check for bugs in Boot code at compile time or protect against bugs at runtime, so we must inherit the scary semantics of the target platforms here.

The phase "undefined behavior" is sometimes abbrevated "UB".

Instruction listing

In the following, op0, op1, op2 are short for operand 0, operand 1, operand 2.

TODO: the following is almost all out of date

Three-operand instructions

todos

we should note which variant of ANNOTATE is NOP (maybe 0xFFFF? or maybe 0x0001, b/c we've already reserved 0x0000 so maybe 'whenever the (1st? 3rd?) operand is 0, the meaning is reserved for future use by Boot'). Remember, the assembler will be inserting NOPs as needed to deal with the leading bit of a quad needing to be 0.
should we have DEP (data execution prevention) in here? Or is that an Oot-level thing? Maybe at least have a code vs data flag in malloc? ASLR is up to the implementation's malloc, so the spec doesn't need to cover it.
consider adding back in bitor-int, bitand-int, bitxor-int from extended
i found this note to myself: "in Boot, Note that adding one to a pointer advances it by one native pointer increment"
i found this note to myself (i thought we took care of this already, but take a look...): "BNE may not fit in ovm unless we limit Custom instructions to length 16. No wait limit Branch length to 16 works, if we also use TOS for one operand. JREL in boot also must be limited to 16 ( i guess that's +-8?)"
i found this note to myself (i thought we took care of this already, but take a look...): cap semantics reserved for BootX?. Move PC to Sysinfo
program starts with one element on small stack (so that you can read and write the TOS); popping this last element is an illegal stack underflow
need a call instruction that protects the stacks of the caller, eg. a subvm (also protect memory and files, eg all side effects)
the opcodes in the instruction listing and in the definitions are mismatched. Also, i added GETSTATE and SETSTATE but mb there isn't space for them due to BootX?
the purpose of GETSTATE and SETSTATE is to maybe replace the special registers, especially the really special ones like 'PC' (we might want to save the stack pointer registers as registers). Should we have special registers, or should we have GETSTATE and SETSTATE?
- "On the other hand, which is more annoying; having to waste opcodes on pseudoregister instructions, or having some of the registers be 'special', necessitating extra checks in every addressing mode effective address calculation? hmm yeah maybe i'll take the special instructions, please. Otoh this would mean that using the PC as an offset to pointers because awkward. And we do want the stack pointers to be normal registers, right? But what about those addressing modes in Boot that don't accept stack pointers? Arg..."
should MSIZE be optional?
consider adding IN and OUT instructions for I/O to 64k of additional memory space for "volatile implementation-dependent device memory". But... aren't READ and WRITE already good enough for this? I was imagining that IN and OUT would read and write one word at a time, but we can always just READ and WRITE one word, right? But that takes longer, which i normally don't care about, but this is the sort of stuff that low-level code might be doing in a time-sensitive place (eg peripheral control on an MCU).

todo consider the following syscalls/lists of stuff: comparing the PDClib's platform-specific layer and newlib, the following are in the intersection:

fork execve wait _exit unlink link environ open read write lseek close raise ('kill' is used in newlib to send a signal) memory allocation (sbrk in newlib, mmap in PDClib?)

additional functionality only in newlib: fstat getpid isatty stat times

additional functionality only in PDClib: memory deallocation (munmap) signal handler setup (signal) /dev/urandom

comparing the intersection above to the union of current boot and bootx:

not in boot/bootx:

execve link environ raise ('kill' is used in newlib to send a signal)

only in boot/bootx: poll break loadcodeptr flush walk

only in boot/bootx but in one of newlib, PDClib: time (but in newlib, although not PDclib) memory deallocation (but in PDClib, but not newlib) /dev/urandom (but in PDClib, but not newlib)

so consider adding 'not in boot/bootx' and subtracting 'only in boot/bootx'

---

there's some additional error types in the following, to consider adding:

https://bitbucket.org/pdclib/pdclib/src/c8dc861df697a6c8bddbcbf331d9b6fcae6e2f4d/platform/posix/functions/_PDCLIB/_PDCLIB_stdinit.c?at=default&fileviewer=file-view-default

https://bitbucket.org/pdclib/pdclib/src/c8dc861df697a6c8bddbcbf331d9b6fcae6e2f4d/platform/posix/includes/signal.h?at=default&fileviewer=file-view-default

https://bitbucket.org/pdclib/pdclib/src/c8dc861df697a6c8bddbcbf331d9b6fcae6e2f4d/platform/posix/internals/_PDCLIB_config.h?at=default&fileviewer=file-view-default

---

here are some non-portable functions in various alternate versions of Windows libc:

https://docs.microsoft.com/en-us/cpp/cppcx/crt-functions-not-supported-in-universal-windows-platform-apps

do we really want to follow their lead there? There is no stdin,stdout,stderr; there is no 'system' or 'execve' to execute other applications ("A UWP app cannot invoke another UWP app or a desktop app."); there are no environment variables; there are no pipes.

hmm yeah maybe we should follow their lead here, actually, for Boot if not for Oot; Boot is supposed to be minimalistic.

so:

take out stdin, stdout, stderr
dont add execve
(is 'link' for symlinks? if so, dont add that either)
dont add environ
(probably do add signals and signal handlers)
- remember these should be VECTORED interrupts, which means that you should be able to register (or, statically specify) the entry point for various signals; b/c if you just have one handler for all signals, then that handler must do a CASE statement to branch on the signal it receives, which takes extra time inside of a part of the program which might be time-sensitive or might be an inner-loop.
- leave the semantics of specific well-known signal IDs as implementation-dependent? Do we reserve any range of signals for custom use?

---

8 GPRs is too few. We should move PC and CAP into an additional space of 'special' registers (but then that makes PC-relative addressing harder in OVM, doesn't it? hmm... mb we could say that the first 16 special registers are in normal register space as R16 thru R31?). Perhaps this is the same as the space of I/O ids that READ and WRITE use. Perhaps we should add GETSTATE and SETSTATE for this. That's still only 10 GPRs, maybe forget about MEMSTACK and CALLSTACK? Maybe they can be BootX? or even OVM concepts instead of Boot concepts. 12 GPRs is (barely) acceptable; then we'd have R0, TOS, ERR, SMALLSTACK as the first four (special) registers.
could even get rid of SMALLSTACK register and say that stack ops targeting R0 as their stack are targeting SMALLSTACK. Maybe we could add back the PC instead. Or maybe MEMSTACK or CALLSTACK. Or add back the PC and say that targeting the PC with stack ops is really targeting one of MEMSTACK or CALLSTACK. No, wait: then you couldnt't do the LOAD/STOREs with offsets on SMALLSTACK, which is key.
OK, wait a moment, we had good reasons for most of those registers. The PC is there because if we move the PC into special register space, then that's one more instruction to do PC-relative code. The reason we have stack registers is because we want to allow PUSH and POP and LOAD and STORE to target them, while at the same time denying a MOV from the stack register to anything else (because the 'hardware stacks' might be implemented in a fancy way, eg in an inaccessible memory space, or eg as a segmented stack; in fact since we have a size-limited one, and then not one but two large ones, it's pretty likely that they will be implemented in some fancy way). The reason we have stack registers separate from other special registers is because not only PUSH and POP, but also LOAD and STORE, can target them. LOAD and STORE can also work with the PC as the base address so that's not redundant. And if we have all of those others, then removing CAP would be an odd number, so may as well save it for BootX?. Also, only having 8 GPRs doesn't sound so bad when you realize that you also have a stack. Spending 8 registers on special stuff is an obvious consequence of our luxurious environment with 3 stacks and a TOS and a zero register. The other 3 are ERR (which also holds carry and upper multiply results, so many architectures have something like this), PC (which is not that unusual, i bet), and CAP (without which it would be an odd number anyways).

---

we should add some immediate-mode variants of instructions. Some candidates for this include ADD, SUB, GETSTATE and SETSTATE (if GETSTATE etc work on the same space of 'I/O addresses' as READ, WRITE); look at other ISAs to see what usually provides an immediate-mode variant.
also consider adding some PC-relative and indirect indexed address modes for LOAD. I don't think we have the 3-operand opcode space for that, though.
also consider some 2-operand instructions that implicitly push their output to SMALLSTACK
it would be real nice to have, in the 3-operand instructions:
- ADDI, which is one of the most common (dynamic frequency) instructions in some other ISAs (see [8] table 5.1) (but is it common for adding ints to ints, or adding ints to pointers?) (also note that that paper says: "ADDI’s outsized popularity is due not only to its frequent use in updating induction variables but also to its two idiomatic uses: synthesizing constants and copying registers.")
- LOAD and STORE plus immediate displacement (we have this now)
- LOAD and STORE plus register index (we don't have this now)
i'm holding off for now because we have so few spots free
allow the implementation to not do SMALLSTACK overflow or underflow checks, and implement it as a ring buffer -- note that this means that the bottom elements of the 'stack' get overwritten upon 'stack overflow', and old seemingly deleted elements reappear upon 'stack underflow' (or upon accessing past the bottom of the stack with displacement addressing). In this case, for security, you'd probably want to guarantee that if you fill the stack with 16 0s that old seemingly deleted elements would be permanently overwritten, or alternately, provide an instruction to securely clear the stack.
in RISC-V, the assembly notation for stores and loads is: sw t1,0(s0) and lw a0,0(s0); note that their ordering is always register-with-data, memory location from/to which the load/store is occurring, rather than using op0 as 'left side of the equals sign' like we do. We might want to change to match them.
note: i dont think we can do anything about this, but i would have liked to have bitxor, pickk, rollk, and also have srl in the 2-operands
actually, merge STDIN and STDOUT to file/channel/device 0, so that we can read and write just using the zero register
note that to get the base addr of MEMSTACK, you can just start with the stack pointer (in MEMSTACK) and then add the stack depth, which you can get from a SYSINFO call
might want to have a query to query the depth of SMALLSTACK after all, to allow context switching? alternately, make it cyclic after all? also, mb MEMCPY copy it (this makes me think maybe we do want a REGREG after all, which would provide access to the GPRs as if they were in memory (SMALLSTACK cant be accessed this way). alternately, provide SAVESTACK and RESTORESTACK instructions.
- nah, this stuff is getting too complicated; also there are probably big optimization advantages to NOT having a REGREG. Programs can still copy the registers, and MEMSTACK, and SMALLSTACK, one at a time. Maybe put SAVESTACK, RESTORESTACK, SAVEREGS, RESTOREREGS in the 32-bit BootX? ISA, and mb move MEMCPY there too.
note that the PC is in terms of bytes, not 16-bit words; this means that the 'k' in the branching instructions refer to bytes, not words, and so even numbers of them are required to jump back and forth; eg '2' means skip 1 16-bit instruction. The odd numbers are useless in Boot because it's not until BootX? that we define an 8-bit encoding. Memory access, however, is still in words. File access is in words, although on platforms (almost all of them) where the files refer to strings of bytes, those files are represented as streams of words whose values never go above 255.
PEEK and POKE replace IN and OUT. Note that PEEK and POKEs are atomic and totally ordered with other PEEKs and POKEs to the same device, if supported by the platform. One device may have many subaddresses to PEEK and POKE.
todo: replace 'stack-aware' with the rule that any instruction that expects to read or write indirectly (that is, through dereferencing a pointer), when applied to register R2, will read or write from SMALLSTACK (rather than whatever value is in ERR, even if that is a pointer), computing any displacements as if SMALLSTACK were any other stack (eg no displacement means that the effective address is T; displacement of 1 means it is the second item on SMALLSTACK; etc)
maybe 'OPEN' also takes an I/O descriptor for 'volume'? Descriptor 0 is the default filesystem? This allows you to 'mount' other 'volumes' later on via 'open'
maybe READ and WRITE also take a 'key' argument? This allows them to contain a SEEK and also to function as a dict/hashmap/keyvalue-store
maybe add DELETE, which can be applied both to filesystems (to delete a file) and within keyvalue-stores (to delete a key). Fundamental operations in key-value datastores are OPEN, CLOSE, GET, PUT, DELETE [9].
shall we guarantee that each operation is atomic/tear-free? Or leave open the possibility for systems that don't support that, and put it as a feature flag in SYSINFO 1?

---

more todos:

need to add CODEPTR back to the set of types. I guess its implementation-dependent whether a CODEPTR is a subtype of PTR or not.
what about the carry in addition? i guess to define the carry we may need to define what happens when there is overflow? Now that i'm implementing it it doesn't seem crazy to add an overflow check and specify a bitwidth; the bitwidth could be some undetermined 'native' bitwidth which is one of 16,32,64,128,256, and then we can specify modular arithmetic and carry etc w/r/t that bitwidth. This would mean that we would no longer abstract away from the 'type' of arithmetic (modular vs overflow checking vs floating point, etc), but we would still abstract away from the bitwidth -- and i guess that's what really matters, because we can emulate modular arithmetic with a few extra instructions (in fact we don't even need an 'if', just a 'mod'), but we can't do anything about the native pointer size being larger than we might want.
- note that we don't even have to branch to emulate modular arithmetic, just use the 'mod' operation. In x86, there is no mod, only DIV, which is slow. However, since we are MODing by a power of two, we can just use BITAND. So on lower-level targets it may be possible to emulate modular arithmetic quickly (but you probably wouldn't have to, because they'd probably offer modular arithmetic).
todo our bitwise operations are incorrect on negative numbers (because Python guesses the bitwidth, converts them to unsigned, and then leaves them that way, eg 65535 & -32767 = 32769)
we should probably remove the LOG instruction -- it's nice but not essential enough to be included at this level -- we already have BREAK and also console output
we probably need some way to distinguish CREATEing a new file from OPENing an existing one. But if we add a CREATE instruction then we probably want DELETE too, and we don't have room for that. So maybe just add CREATE to the mode bits of OPEN.
in the textual assembly syntax, it's annoying to not be able to have multiple spaces in between the mnemonic and the operands, or between the operands, because then you can't line up the operands into columns. So should add the freedom to put whitespace there.
specify MUL_INT correctly. Make a note that the least_significant_word can be taken as the entire result (expressed as a signed word of the native wordsize) when -(MAX_INT+1) <= result <= MAX_INT. Note that if the absolute magnitude of both operands is known to be less than or equal to floor(sqrt(MAX_INT)), then the result is guaranteed to be in that range; for 16-bit native ints this corresponds to both operands <= 181; so, multiplication of 7-bit quantities can be accomplished even while ignoring the most significant result word.
mb just say "it is implementation-dependent whether CODEPTRs are a subset of, interchangable with, or distinct from PTRs"
mb the two absolute ptr CVTs at the bottom should be 'absolute ptr, in units of slots from implementation-dependent starting point' (or 'from beginning of globals' or something like that? or is that getting too far into ELF-like packaged binaries?) and 'absolute ptr, in units of bytes from beginning of program'
annotate type zero can be giving the type of registers. Register 0 (the zero register in Boot, the PC in BootX?)'s type is fixed so that makes annotate 0 0 0 useless, allowing us to keep the all-zero bit pattern illegal.
replace 'undefined value' with 'unspecified value', as in eg [10]
the phrase 'trap representation' may be useful; here's an example from elsewhere: "a trap representation: A value that may trap if used and whose use makes any subsequent behavior undefined."
TODO: note that malloc, mdealloc support is optional, and an implementation may choose to return 'memory not available' errors for all malloc calls. similarly, READ and WRITE and POLL are optional and may always return 'invalid file id', and LOG and BREAK are optional and may be treated as no-ops.
"To keep implementation easy, Boot is designed for non-concurrent pure computation and interactive I/O on already opened files/streams passed as parameters. There is an extension of Boot called BootX? which contains additional opcodes for efficiency as well as functionality for filesystem operations, process control and inter-process communication, concurrency, shared memory ordering, and asynchronous signals/interrupts."
specify that transferring control into the middle of a quad whose leading bit is 1 (ie a quad with a different encoding format) is illegal; however, implementations are allowed to not bother checking for this, so the result may be undefined behavior, not just a crash.
do something similar to the old LOADLO: "This instruction is illegal except immediately following either a LOADI to the same op0, or another LOADLO to the same op0. In addition, there is a platform-specific limit to the maximum allowed length of a sequence of LOADI/LOADLO instructions, such that sequences that would be capable of loading constants larger than the supported word bitwidth are forbidden (a sequence of length 2, that is one LOADI followed by one LOADLO, is always permitted; this corresponds to 16-bit constants). A sequence of LOADI/LOADLOs longer than the platform-specific limit may cause a crash or may cause an undefined value to be loaded into the register indicated by op0."
"If all bits are '1's , this is an illegal instruction, which at the implementation's discretion may either be skipped over, or cause a crash, if executed. This is to help halt execution if non-program memory is accidentally executed or if there is an error in the memory hardware.
consider changing the encoding to sacrifice half of the R2 opcodes in exchange for li into 4-bit register operand?
consider adding load/stores in R2? but this is less RISCy...
todo implement lsi etc, also IN, OUT
todo think about harvard architectures, are we doing stuff that would make compilation to harvard architectures difficult?: "WebAssembly? is a Harvard architecture; this means that code and data are not in the same memory space. No code or code addresses are visible in the wasm linear memory space, the only "address" that a function has is its index in the wasm function index space. Additionally the wasm implementation's runtime call stack (including the return address and function arguments) is not visible in the linear memory either. This means that address-taken local C variables need to be on a separate stack in the linear memory (hereafter called the "linear stack"). It also means that some functions may not need a frame in the linear stack at all."
todo note in docs some techniques Boot uses to get higher code density (in the 'compact' encoding): (1) both a register machine and a stack machine (with TOS) (2) two banks of registers (double the registers without increasing operand bits) (3) most instructions are P-type or R-type (3-operand and half the registers, or 2-operand and all the registers) (4) zero register (saves the need for a few more opcodes)
should note that, sure, you could store pointers in memory and read them back as integers, but you shouldn't. Also, should decide and then note if you are allowed to add an integer to a pointer that exceeds its bounds if you don't deref it.
gotta remember to consider doing something to make Boot easily compilable (without a dense table mapping each Boot location to the target platform code location) via restricting indirect jump targets -- or decide that only BootX? is easily compilable
consider renaming the int and ptr regs from $x and &x to ix and px.
TODO actually let's go back to defining int32 as '32-bits OR MORE' and make integer overflow fully undefined behavior. This is because signed integer overflow is undefined behavior in C and we want to allow porters to compile arithmetic straightforwardly to their unchecked C equivalents; and also because [11] says that in many cases it's much more efficient for variables to be register width. I also made a note about this at the beginning.
C18 std says "A weak compare-and-exchange operation may fail spuriously. That is, even when the contentsof memory referred to byexpectedandobjectare equal, it may return zero and store back toexpectedthe same memory contents that were originally there.6NOTE 2This spurious failure enables implementation of compare-and-exchange on a broader class of machines, e.g.load-locked store-conditional machines" [12] page 207/pdf page 226. mb we should say that too (but should we even bother to have a weak CAS? probably not).
gotta specify how to get argc, argv, environ

proj-oot-bootReferenceAdocOld200622

Boot (Oot Boot) reference

Introduction

Cheatsheet

General architecture

Instruction encoding

Data types

Registers

INTSTACK and PTRSTACK

I/O

Memory allocation

Zero segment

Calling conventions and returning conventions

Calling convention r12

Calling convention r5

Calling convention rv

Calling convention s

Interoperation

Concurrency

For Boot programmers: the simplest way

For Boot programmers: a less simple but still recommended way

For Boot programmers: the most performant way (not recommended)

Implementing Boot

Informal description of relevant primitive memory operations

Non-atomic load/store: lw, sw

Atomic load/store: lw.a, sw.a

Atomic load/store, Release Consistency: lw.a.rc, sw.a.rc

Atomic load/store, Relaxed: lw.a.rlx, sw.a.rlx

CAS: casrmw

CAS, release consistency: casrmw.rc

CAS, relaxed: casrmw.rlx

fence

Semi-formal memory consistency model

program-order, single-threaded semantics, memory_domain, happens-before

Which values can be read by reads

Coherence

Classification of memory operations

Acquire and release operations

RMW operations

Sequentially consistent operations

fence

thread-local memory_domains

Data races; programs with data races are undefined

Programs with OOTA are invalid

Progress

SC for DRF

Memory model discussion and comparison to other languages' models

TODO

Undefined behavior

Instruction listing

Three-operand instructions

todos