
Boot (Oot Boot) reference

:hardbreaks:
:toc: macro
:toclevels: 1

Version: unreleased/in-progress/under development (0.0.0-0)

THIS DOCUMENT IS OUT OF DATE. The markdown version, boot_reference.md, has replaced it (although that's not finished yet).

Boot is a low-level 'assembly language' virtual machine (VM) that is easy to implement.

toc::[]

TODO rewrite to current design -- the pyboot implementation, particularly the instruction listings in constants.py, is (mostly) more up to date than this document. Note also that a simpler 32-bit encoding has been introduced; this 16-bit encoding has been made an optional 'compressed' encoding (maybe, since it's optional, it should be moved to BootX?). Later: actually now i'm thinking of simplifying even further and removing stacks, and adding floating point; see ootAssemblyNotes26.txt

TODO actually let's go back to defining int32 as '32-bits OR MORE' and make integer overflow fully undefined behavior. This is because signed integer overflow is undefined behavior in C and we want to allow porters to compile arithmetic straightforwardly to their unchecked C equivalents; and also because [1] says that in many cases it's much more efficient for variables to be register width. I also made a note about this at the end.

Introduction

Boot's purpose is to promote the portability of OVM (Oot Virtual Machine) by serving as a portable target language underneath it.

The reason for writing a new assembly language, rather than adopting LLVM or Webassembly or similar, is that we want the entire Oot toolchain to run on top of many platforms, including 'on top of'/within existing high-level languages such as Python. This means that the entire Boot implementation needs to be quick and easy to port (not just retarget to a new backend, but entirely port), and it needs to support opaque platform-native pointers of unknown size.

Highlights:

Cheatsheet

Instruction encoding: 16 bits, little endian. The LSB (the 'instruction encoding length format bit') is always 0.

Four formats:

For each format, from LSB to MSB, the fields are:

In graphical form (from MSB to LSB) (0,1 = constants, o = opcode, x = operand 2, y = operand 1, z = operand 0):

I-formats:

R-formats:

P-formats:

TODO out of date; this is now the 'compact' format, which should go in a different document

Note that some implementations could choose to consider the subinstructions of otwo, oone, ozero as separate encoding formats.

Note that some implementations could choose to consider the I-format as consisting of two subformats, depending on how many bits the immediate constant is.

The I11 and I8 instructions are:

Datatypes: int32, ptr

Registers: two banks of 16 registers each; one for int, one for ptr. The first 5 registers in each bank are special. The next 3 are GPRs. Some operations are only available on the first 8 registers.

Instructions:

TODO: add the other new instructions

==
annotation ann
constants ldi sai
loads and stores and moves ld ldh ldhu ldb ldbu ldp st sth stb stp cp cpp
arithmetic of ints add addi sub mul
bitwise arithmetic sl sru srs and or xor
arithmetic of pointers aip
comparison control flow beq beqp bne blt
control flow jpc jy
concurrency CAS fence acquire release
misc more
sys sysinfo
RESERVED 3 R instructions, 1 I instruction
RESERVED 6 two-argument instructions
RESERVED 7 one-argument instructions
RESERVED 8 zero-argument instructions
RESERVED 1 instruction
==

other ideas: in out push pop dup drop swap over read write halt impdep poll sysinfo xcall devop, CAP register/stack cmov select call ret cmov malloc dealloc realloc

In the following, #X is an immediate constant, $X is an integer register, &X is a pointer register, and x is an untyped argument. All immediate constants are signed two's-complement ints.

From left to right, the arguments go into operands op0, op1, op2. Immediate operands are always on the right (the highest-numbered operand). When two immediate operands are combined into an #imm6 (as with instruction jd), op1 is the high-order bits and op2 is the low order bits (imm6 = (op1 << 3) + op2).
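The imm6 split described above can be sketched in Python (a sketch for illustration only, not part of the official pyboot toolchain):

```python
# Sketch: packing and unpacking the combined #imm6 immediate described
# above. op1 holds the high-order 3 bits and op2 the low-order 3 bits,
# so imm6 = (op1 << 3) + op2.

def pack_imm6(imm6):
    """Split a 6-bit unsigned immediate into (op1, op2) 3-bit fields."""
    assert 0 <= imm6 < 64
    return (imm6 >> 3) & 0b111, imm6 & 0b111

def unpack_imm6(op1, op2):
    """Recombine the two 3-bit operand fields into the 6-bit immediate."""
    return (op1 << 3) + op2
```

For example, imm6 = 45 (binary 101101) splits into op1 = 5 and op2 = 5, and recombining them yields 45 again.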

The branch immediates have a special coding, because -2, -1, 0 are not needed; at the time of branch instruction execution the PC is pointing to the beginning of the instruction after the branch instruction. Since instructions are 16-bits, -2 would branch right back to the branch instruction itself, causing an infinite loop; -1 would branch into the middle of the branch instruction; and 0 would cause the branch to end up at the same place as if there were no branch. Therefore the branch instructions interpret the numbers 0..7 in op2's binary representation to code for: [-6, -5, -4, -3, 1, 2, 3, 4]. The program counter is in units of bytes, and instructions are 16-bits (2 bytes), so a branch immediate of +2 skips over one instruction, a branch of +4 skips over two instructions, a branch of -4 skips backwards two instructions, and a branch of -6 skips backwards three instructions. The odd-numbered branch immediates (-5, -3, 1, 3) are illegal in Boot but are present for compatibility with BootX?, which is a variable-length instruction set including some 1-byte instructions.
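The table above can be captured in a small decoding sketch (illustrative only; not from the official toolchain):

```python
# Sketch of the branch-immediate coding described above: the 3-bit op2
# value 0..7 codes for a byte offset from the instruction after the
# branch. Odd offsets are illegal in Boot itself; they exist only for
# compatibility with BootX's variable-length encoding.

BRANCH_OFFSETS = [-6, -5, -4, -3, 1, 2, 3, 4]

def decode_branch_offset(op2):
    assert 0 <= op2 <= 7
    offset = BRANCH_OFFSETS[op2]
    if offset % 2 != 0:
        raise ValueError("odd branch offsets are illegal in Boot")
    return offset
```

For example, op2 = 7 codes for +4, which skips over the next two 16-bit instructions; op2 = 0 codes for -6, which branches back three instructions.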

Jump and branch offsets are in units such that an offset of 2 represents one Boot instruction. Platforms which represent Boot code in memory in ways such that one instruction spans more or less than 2 memory locations should adjust the jump and branch offsets accordingly when executing instructions jpc beq beqp bne blt, and also should emit the correct INSTRUCTION_SIZE in SYSINFO. Program code which dynamically computes code pointers by adding integers to the PC or to other code pointers should be sure to take account of INSTRUCTION_SIZE.

meta:

constants:

register loads and stores and moves:

stack and register/stack loads and stores and moves:

(note: you can use 'cp' and 'cpp' to achieve DUP, and by also using the zero register, DROP)

arithmetic of ints:

bitwise arithmetic:

arithmetic of pointers:

(?: i'm not sure if these are worth including; the idea is to make up for the fact that the platform word size and pointer size is not known at compile time. Note also that a fancy VM could easily search for these at loadtime and substitute in the appropriate platform-specific immediate constants. These also serve to provide an ADDI on pointers, which is probably very useful)

conditional branches:

unconditional jumps:

concurrency:

placeholders for other formats:

meta:

I/O:

memory allocation:

interop:

General architecture

Memory locations are of platform-specific size. The SYSINFO instruction can be used to query constants which specify how many memory locations are needed to store a PTR (PTR_SIZE), a 32-bit quantity (I32_SIZE), or a 16-bit quantity (I16_SIZE). A memory location must fit at least an 8-bit quantity, so there is no I8_SIZE (it would always be 1).

If a memory location holds more than 8 bits, the instructions lw, lh, lhu, lb, lbu will load the quantity in the memory location mod 2^32, 2^16, 2^16, 2^8, 2^8 respectively.
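A sketch of this masking behavior (the sign extension of the signed variants lh and lb is our assumption here, as their mnemonics suggest, rather than something stated in the paragraph above):

```python
# Sketch: loading from a memory location that is wider than the loaded
# quantity. The raw contents are reduced mod 2^bits; signed variants
# (lh, lb) are assumed to sign-extend the masked value.

def mask_load(raw, bits, signed):
    value = raw % (1 << bits)            # load modulo 2^bits
    if signed and value >= (1 << (bits - 1)):
        value -= (1 << bits)             # two's-complement sign extension
    return value
```

For example, an lhu from a location holding 0x1FFFF yields 0xFFFF, while an lh from a location holding 0xFFFF yields -1.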

The integer registers hold 32 bits and are treated as signed integers (two's complement).

Data that needs to be accessed concurrently must be aligned; other data has no alignment restrictions. Instructions must be 16-bit aligned.

Loads and stores of a single 32-bit word or of a single pointer are guaranteed to be atomic (or, if not, the Boot implementation must not be considered thread-safe and its documentation must state this prominently).

The only pointer manipulation instruction provided by Boot is the 'aip' instruction (todo: rename to ap), which adds a (signed) integer to a pointer. However, on some implementations it may be possible for Boot programs to manipulate pointers by reading in a memory location that contains a pointer as if it were an integer. The result of this is undefined/implementation-defined behavior.

Instruction encoding

The instruction encoding is fixed-length 16-bits.

Encodings are little-endian on disk, over the network, or when presented in serialized form, and are typically stored in host byte order in memory.

The instruction with an all-zero bitpattern (jpc 0) is illegal, and programs which might execute this instruction are illegal. The mnemonic alias 'BAD' is assigned to this bitpattern. The execution of this instruction is a critical error, and implementations should, but are not required to, terminate with an error if such an instruction is executed. This may assist in detecting situations in which uninitialized or non-coding memory is accidentally executed (because such memory may be more likely to contain zero than any other value), or in which some types of hardware errors are present.

Reserved instructions are treated similarly to the all-zero bitpattern.

Note that program code IS allowed to contain bit patterns of 16 0 bits, provided that these are not executed as Boot instructions; this provides for platforms which store data in the instruction stream, and also provides for forward compatibility with mixed-length encoding extensions which use Boot for 16-bit instructions but also include other longer instructions that sometimes contain 16 0 bits within them.

The beginning of the program code is a table of offsets into program code. The table starts with a 32-bit unsigned integer specifying the length of the table (in units of 32-bit integers). Each entry in the table is a 32-bit unsigned integer specifying an offset from the first location after this jump table, in units of bytes. There is at least one entry in the table and the first entry in the table is a pointer to the program entry point.
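The table layout above can be sketched as follows (assumptions: the length word counts the entries that follow it, and the table words use the same little-endian byte order as the serialized instruction encoding; neither is stated explicitly above):

```python
import struct

# Sketch: parse the offset table at the start of Boot program code.
# The table is a 32-bit unsigned length word followed by that many
# 32-bit unsigned entries; each entry is a byte offset from the first
# location after the table.

def parse_offset_table(code):
    (n,) = struct.unpack_from("<I", code, 0)
    entries = struct.unpack_from("<%dI" % n, code, 4)
    base = 4 + 4 * n                     # first location after the table
    return [base + off for off in entries]
```

The first element of the returned list is the absolute byte position of the program entry point.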

The jk instruction can be used to jump to the locations indicated in this table.

This table must remain constant and should not be read or modified by program code during execution; the Boot toolchain is permitted to read the table once at compile time or load time and then discard it, and use internal mechanisms to implement the jk command (for example, a compiler could get rid of the table and compile each jk instruction into a platform-native absolute jump).

Although the length of the constant table is stored as a 32-bit integer, the official Boot implementations only support constant tables of length up to and including 32767 (which is equal to 2^15 - 1, also the largest integer that can be represented as a signed 16-bit integer).

Jumps into the middle of instructions, or into a table of program code offsets, are not allowed (jumps to the addresses indicated by the entries of such tables are allowed, of course).

Note that self-modifying code, or jumps to Boot code which was written at runtime, may not be permitted on some platforms (for example, a compiler which compiles Boot code to native code may not be able to support interpreting additional Boot code at runtime).

Data types

There are two defined data types:

INT32 can hold at least the range -2147483648 thru 2147483647, inclusive. Arithmetic whose mathematical result would be a value outside of that range has undefined behavior (except for 'mul', which returns the 32 low-order bits of the result; even there, however, if the mathematical result would be outside of the range of an int32, the sign of the result may be incorrect).
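The 'mul' carve-out above can be sketched like this (illustrative only; not from the official toolchain):

```python
# Sketch of the 'mul' semantics described above: the result is the low
# 32 bits of the mathematical product, reinterpreted as a signed int32.
# On overflow the low bits are still well-defined, but the sign of the
# result may differ from the sign of the true mathematical product.

def boot_mul(a, b):
    low = (a * b) & 0xFFFFFFFF           # keep the 32 low-order bits
    return low - (1 << 32) if low >= (1 << 31) else low
```

For example, 65536 * 65536 overflows int32 and the low 32 bits are all zero, so boot_mul returns 0; 2^30 * 3 has a positive mathematical product but boot_mul returns a negative number.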

Registers

There are two banks of 8 registers each. One bank can only hold ints and the other can only hold ptrs. We denote the integer register bank with a '$' prefix and the pointer register bank with a '&' prefix.

In each bank, the first 5 registers are special:

The other 6 registers, $5 thru $7 and &5 thru &7, are GPRs (general purpose registers). In addition there are two more secondary banks of 8 registers each, $8 thru $15 and &8 thru &15, that can only be used as operands to some instructions (2-operand instructions, except for SYSINFO and 2-operand I/O instructions and RESERVED_P1_6).

When the Boot program terminates with a HALT instruction, &3 should hold the same value that it held upon the start of the program, and $3 should hold a result code (the convention is that 0 indicates success, unless the purpose of the program is to compute and return an integer, in which case the result code should be that integer).

If an instruction interacts with INTSTACK or PTRSTACK multiple times, the convention for the ordering of the mutations is:

1. First, all input operands are read, in the order operand 0, operand 1, operand 2 (or whichever subset of these are input operands), which is the same as reading the assembly language operands from right to left; if any of these are $S or &S, they pop the relevant stack in turn. There is no 'short-circuit evaluation'; all input operands are read (and popped from smallstack if needed).
2. Then, the instruction executes, perhaps interacting with a stack as an effect of its execution.
3. Finally, all output operands are written to, in the same order given by (1) (that is, right to left in Boot assembly).

If an instruction only conditionally writes to some of its outputs, then giving it a destination operand of $S or &S is illegal for that instruction. If the instruction can trap, that is legal; if it does trap, then the stack is not pushed for outputs.

INTSTACK and PTRSTACK

These are fixed-length stacks. They each start the program with a depth of 1. They each have a maximum capacity of at least 16 items. There is no way to check the current depth of these stacks.

A stack underflow or overflow, that is, an operation that would make one of these stacks' depths, respectively, less than 1 (one stack element is always present so that the $T and &T registers are always meaningful) or larger than its fixed capacity, is illegal and the result is undefined behavior.

Each slot in INTSTACK can contain a 32-bit integer and each slot in PTRSTACK can contain a pointer.

The top 16 slots in the stacks can be directly referenced through the lsi/ssi/lsp/ssp instructions. An immediate operand of 0 to these instructions indicates the top of the stack (for example, in int_stack, lsi #0 recalls the same value as $t). Higher numbers refer to locations further down in the stack; these locations are expressed as depth offsets relative to the top of the stack. Depths past the bottom of the stack are legal, and you can store values to these locations, and these values are accessible. For example, if the stack is currently depth 1, the following sequence of code is guaranteed to end up with 111 in register 4:

----
li $s #22  ; push 22 onto the stack
li 3 #111  ; load 111 into register 3
ssi 3 #14  ; store 111 into the stack location at depth 14
li $s #33  ; push 33 onto the stack
lsi 4 #15  ; load 111 back from the same stack location that it was stored into
----

Note that even though 111 was stored into stack depth 14, and then loaded back from depth 15, because these depth offsets are relative to the top of the stack, and because the stack was pushed in between those two instructions, the location that depth 14 used to refer to is now referred to by depth 15.

However, as the stack gets larger, any value at a location deeper than #15 may (or may not) be lost; for example, if after executing the previous sequence of code, we were to do:

----
li $s #44  ; push 44 to the stack
cp $0 $s   ; pop the stack
lsi 5 #15  ; load whatever is at stack depth #15 into register 5
----

Now an arbitrary value is loaded into register 5, because after we pushed 44 to the stack, the value 111 was at a stack depth greater than 15, and there is no guarantee that any values stored at stack depths greater than 15 remain stored. Perhaps register 5 will hold 111; perhaps it will hold 44; perhaps it will hold something else. There is no guarantee. This is done so that implementations can pre-allocate a fixed number of locations to hold the stacks.

Note: "user code" should act as if the capacity of the stacks is only 16; the other 16 locations are intended to be used by virtual machine compilation toolchains to implement macroinstructions.

Note: some implementations may implement these stacks as circular buffers of size >=32.
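Such a circular-buffer implementation could look like this (a sketch under the assumption of exactly size 32, the minimum the note above allows; not the official pyboot code):

```python
# Sketch: a smallstack implemented as a circular buffer of 32 slots.
# Pushes advance the top index and overwrite the oldest slot, which is
# why values at depths greater than 15 are not guaranteed to survive
# further pushes. Depth offsets are relative to the current top.

class SmallStack:
    SIZE = 32

    def __init__(self):
        self.slots = [0] * self.SIZE
        self.top = 0                 # index of the current top-of-stack slot
        # (the stack starts the program with a depth of 1)

    def push(self, value):
        self.top = (self.top + 1) % self.SIZE
        self.slots[self.top] = value

    def pop(self):
        value = self.slots[self.top]
        self.top = (self.top - 1) % self.SIZE
        return value

    def load(self, depth):           # lsi/lsp: depth 0 is the top of stack
        return self.slots[(self.top - depth) % self.SIZE]

    def store(self, depth, value):   # ssi/ssp
        self.slots[(self.top - depth) % self.SIZE] = value
```

Running the document's example against this model: push 22, store 111 at depth 14, push 33; loading depth 15 then recovers the 111, because the intervening push shifted every depth offset by one.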

I/O

The in, inp, out, outp instructions read input and write output from/to 'devices'.

Some devices might only accept/emit ints, and some might only accept/emit pointers, and some might accept/emit both.

The devop instruction executes a device-specific control operation or metaoperation on a device.

The interpretation of devices and devops is platform-specific. It is suggested that device 0 be the console (stdin/stdout) on platforms that support that.

Note that official Boot tooling only supports device numbers up to 32767.

Memory allocation

The malloc, mdealloc instructions support allocation and deallocation of memory. Malloc pushes a result code to intstack.

Zero segment

When a Boot program starts, &0 points to a small block of pre-allocated memory called the zero segment.

The zero segment is guaranteed to be big enough to hold the larger of 16 PTRs or 16 INT32s.

Despite its name, the zero segment is not necessarily initialized and may hold arbitrary data at the start of the program.

The zero segment is shared memory. If a new Boot thread is spawned, it will inherit the same zero segment. If a new Boot process is spawned, it will get a new zero segment.

Calling conventions and returning conventions

Boot has 4 calling conventions.

TODO: too many... just look at how many r1_instructions and r0_instructions we need for this.

TODO: tail calls make callee-saved registers hard to use or useless, so need to change these to have more caller-saved.

Calling convention r12

Calling convention r12 passes up to 12 integer arguments and 12 pointer arguments in registers.

All registers aside from r0, r1, r2, r3 are used to potentially pass arguments (some might also think of these as caller-saved). Only pointer register r3 (which is suggested to be used for a memory stack) is callee-saved.

Upon returning from a call, one integer return value, found in integer register r3, will be passed.

This can be thought of as the calling convention used to initially call/start a Boot program.

Calling convention r5

Calling convention r5 passes up to 5 integer arguments and 5 pointer arguments in registers.

Register 3 (both banks) is callee-saved. Pointer register 3 is used as a memory stack pointer when applicable, otherwise it is callee-saved.

Integer register 4 is caller-saved. Pointer register 4 is used as a return address pointer/link register.

Registers 5, 6, 7, 8, 9 (both banks) are used to pass arguments and return values (see below); lower-numbered registers take arguments from left to right.

Registers 4, 5, 6, 7, 8, 9 (both banks) are caller-saved. Registers 3, 10, 11, 12, 13, 14, 15 (both banks) are callee-saved. The integer and pointer stacks are also callee-saved.

Upon making a call, up to 5 integer arguments are in integer registers 5, 6, 7, 8, 9, and up to 5 pointer arguments are in pointer registers 5, 6, 7, 8, 9.

Registers 4-9 are caller-saved registers and may be overwritten and used for any purpose by the callee.

Upon returning from a call, up to five integer and up to five pointer return values will be found in registers 5-9 using the same convention as for calling.

Calling convention rv

Calling convention rv passes a variable number of arguments in registers and in memory.

Calling convention rv is like calling convention r5, except that registers 10 and 11 (both banks) are used to pass pointers to additional arguments. The caller/callee status and function of all registers is the same as in calling convention r5 except for registers 10 and 11 (both banks).

Upon making a call, the first 5 integer arguments are in integer registers 5, 6, 7, 8, 9, and the first 5 pointer arguments are in pointer registers 5, 6, 7, 8, 9. The total number of integer arguments is in integer register 10 and the total number of pointer arguments is in integer register 11. If there are more than 5 total integer arguments then pointer register 10 contains a pointer to the remaining integer arguments, and if there are more than 5 total pointer arguments then pointer register 11 contains a pointer to the remaining pointer arguments.

Unconventionally, although registers 10 and 11 are involved in argument-passing, they are callee-saved.

If there are more than 5 integer or pointer arguments, respectively, then the space for the remaining arguments pointed to by pointers in pointer registers 10 and 11, respectively, are available to the callee for overwriting, similar to caller-saved registers (this allows for constant-space recursive 'tail calls'). The callee must not deallocate the memory pointed to by pointer registers 10 or 11.

Upon returning from a call, the convention for return arguments is the same as in calling convention r5. Up to 5 integer and up to 5 pointer arguments may be returned.
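A caller's side of the rv convention for integer arguments can be sketched as follows (names like 'iregs' and 'marshal_rv_int_args' are illustrative, not part of the spec; pointer arguments would be handled symmetrically via registers &5-&9, $11, and &11):

```python
# Sketch: laying out integer arguments for a calling-convention-rv call.
# 'iregs' models the integer register bank, 'pregs' the pointer bank;
# overflow arguments live in caller-provided memory pointed to by
# pointer register 10, which the callee may overwrite but not free.

def marshal_rv_int_args(iregs, pregs, int_args, overflow_block):
    iregs[10] = len(int_args)               # total number of int args
    for i, arg in enumerate(int_args[:5]):
        iregs[5 + i] = arg                  # first 5 args go in $5..$9
    if len(int_args) > 5:
        overflow_block[:] = int_args[5:]    # remaining args in memory
        pregs[10] = overflow_block          # &10 points at them
```

For example, passing seven integer arguments puts the first five in $5-$9, sets $10 to 7, and leaves the last two in the overflow block pointed to by &10.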

Calling convention s

Calling convention s passes up to 2 arguments on the stacks, and returns up to one return value on the stacks. Which stacks are affected depends on the types of arguments being passed. The callee pops/consumes the arguments, if any, and pushes the return value, if any.

All registers are callee-saved, and both stacks are callee-saved, except for the arguments and return values.

Mnemonic shorthands for the possibilities are:

Calling:

Returning:

Interoperation

The xentry* instructions are a family of instructions which vary by calling convention. The instruction should be the first instruction in Boot subroutines intended to be called directly from an external source. It implements a platform-specific calling convention, and translates it to a Boot calling convention. The instruction takes no operands. The members of the family are xentrys0, xentrysi, xentrysp, xentrysii, xentryspp, xentrysip, xentryspi, xentryr5, xentryr12, xentryrv.

The xret* instructions are a family of instructions which vary by calling convention. The instruction returns to an external caller using a platform-specific calling convention. The program should be ready to return using a Boot calling convention, and this will be translated by xret into the platform's calling convention. xret takes one pointer operand, which is the return address. The members of the family are xrets0, xretsi, xretsp, xretr12, xretr5. xretr5 should be used if xentryr5 or xentryrv was used, xretr12 should be used if xentryr12 was used, and otherwise one of the xrets* members should be used.

The xcall* instructions are a family of instructions. The instruction calls an external subroutine, translating a Boot calling convention into the platform calling convention. xcall* takes one pointer operand, which is a pointer to the subroutine being called. The members of the family are xcalls0, xcallsi, xcallsp, xcallsii, xcallspp, xcallsip, xcallspi, xcallr5, xcallr12, xcallrv.

The xpostcall* instructions are a family of instructions. The xpostcall* instruction should be placed immediately after the xcall* instruction if the Boot program expects that an external subroutine may return after an xcall. It translates the platform-specific after-call-return calling convention into the Boot return convention. The instruction takes no operands. After xpostcall, return values will be found in locations specified by the given Boot calling convention, and callee-saved registers and stacks will be restored to their original condition. The members of the family are xpostcalls0, xpostcallsi, xpostcallsp, xpostcallr12, xpostcallr5. xpostcallr5 should be used if xcallr5 or xcallrv was used, xpostcallr12 should be used if xcallr12 was used, otherwise one of the xpostcalls* members should be used.

Neither xcallrv nor xpostcallr5 allocates or deallocates any memory for extra arguments which may be pointed to by pointer registers 10 and/or 11.

It is legal for a Boot program to call itself or other Boot code using xcall, provided that the called subroutine begins with a matching xentry and (if it returns to the caller) the subroutine ends with an xret and returns to a matching xpostcall instruction in the caller.

Concurrency

Unfortunately the memory consistency design of Boot was kind of thrown together and may be incorrect.

For Boot programmers: the simplest way

If you don't want to think about this stuff at all, only use memory operations with "sequentially consistent ordering". The memory operations with sequentially consistent ordering are:

lpa, lwa, spa, swa, casamo, addamo, apamo, andamo, oramo, xoramo, fence

(and you won't have any need for the 'fence' instruction if you restrict yourself to the memory operations from this list)

These memory operations must only be used on aligned addresses. All addresses returned by malloc are guaranteed to be aligned. &0 is guaranteed to be aligned. Any address generated from an aligned address of the form new_addr = old_addr + a*ALIGN_SZ (todo: add ALIGN_SZ to SYSINFO), where a is an integer, is guaranteed to be aligned. TODO: no, align per data type, e.g. align ptrs to ptr_sz, ints to int_sz.

TODO: change this to align to the size of the operand, rather than ALIGN_SZ; e.g. 16-bit data should be 16-bit aligned, 32-bit data should be 32-bit aligned, etc

Also, put a fence instruction after you receive any newly allocated memory that you plan to use as shared memory, and before you free it, or before you transfer control to any external code that might access it.

If you do this (that means, for example, your program should not contain the LW instruction; you would always use LWSC in its place), then the Boot implementation guarantees that every execution of your program will be sequentially consistent, that is, the following property will always hold:

(SC) All memory accesses in all threads will always appear to take place in a single total order, all threads observe these memory accesses in this same total order, and the operations of each thread appear in that total order in program order.

In other words, the Boot implementation makes the following promise to you:

(guarantee: SC execution if only SC instructions) If your program doesn't contain any memory operations except for the sequentially consistent ones, and your program only operates on one memory_domain, then every execution of your program will be SC.

You can imagine that when each thread wants to perform a memory operation, it submits this request to a global queue, and the implementation goes thru and services these requests one by one in the order it received them. This is probably not how your platform actually works, but if you follow the rules above, you can reason about your program as if it were.
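That mental model can be made concrete in a toy simulation (a sketch of the model only; no real platform works this way, which is the point of the paragraph above):

```python
import threading

# Toy model of the 'global queue' picture: every memory operation from
# every thread is serialized through one global lock, so all accesses
# occur in a single total order and every execution is trivially
# sequentially consistent.

class SCMemory:
    def __init__(self, size):
        self._mem = [0] * size
        self._lock = threading.Lock()    # plays the role of the global queue

    def load(self, addr):
        with self._lock:
            return self._mem[addr]

    def store(self, addr, value):
        with self._lock:
            self._mem[addr] = value

    def cas(self, addr, expected, new):
        with self._lock:                 # compare-and-swap, atomically
            old = self._mem[addr]
            if old == expected:
                self._mem[addr] = new
            return old
```

Because every operation holds the same lock, no two accesses ever overlap, and all threads observe the same total order of operations.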

For Boot programmers: a less simple but still recommended way

On many platforms, the use of sequentially consistent (SC) ordering instructions is significantly less efficient than their non-SC variants. A reasonable way to improve performance is given here. If you don't really care about performance, our advice is to not bother about all this and just do what the previous section suggests, and skip the rest of this section.

Still here? First, some definitions.

Definition of "sequentially consistent (SC) execution": An SC execution is one in which all memory accesses in all threads appear in a single total order, all threads observe these memory accesses in this same total order, and the operations of each thread appear in that total order in program order.

Definition of "competing accesses": a pair of potentially simultaneous accesses to the same memory location by different threads in which one of the accesses is a write.

Now, what we recommend is that you ensure that your program uses only sequentially consistent instructions for both accesses in any possible competing access in any possible SC execution. Furthermore, in between any possibly competing access and any non-sequentially-consistent memory instruction to the same location, place a FENCE instruction. If you do this, then the Boot implementation once again promises that your program will always enjoy sequential consistency, just as above. In other words, the Boot implementation makes the following promise to you:

(guarantee: SC execution if only SC competing accesses and FENCES) If, in every SC execution, every competing access involves only sequentially consistent instructions, and there is a FENCE in between every competing access and every non-sequentially-consistent memory access instruction, and your program only operates on one memory_domain, then every execution of your program will be SC.

So, for example, you can use the faster LW instruction to access a memory location (which doesn't have to be aligned) when you are absolutely sure that no other thread will ever access the same memory location at the same time (for example, for thread-local storage). If two threads might both simultaneously access a location, though, then it must be aligned, and both of them must use the slower LWSC instruction.

Note: this is our variant of what other languages call the "DRF-SC", "DRF0", or "SC for DRF" property.

(todo: is this condition sufficient? this seems like overkill with the FENCEs, is it? also can we make a less strict requirement than 'only one memory domain'?)

For Boot programmers: the most performant way (not recommended)

todo the following (and maybe some of the previous) is out of date

If you really need the highest performance, and you are willing to think really hard to get it, then you can use a wider set of memory access instructions at the cost of losing sequential consistency. We warn that programs without sequential consistency are really hard to reason about, even for experts. We recommend not doing this; instead, use one of the methods given in the previous two sections, and skip the rest of this section.

Still here? If you do do this, we recommend encapsulating the non-sequentially-consistent parts of your program within a library that provides concurrency primitives (for example, mutexes, or for example, parallel map/reduce), and within this library try to use only well-known algorithms proven to work in similar settings.

The word 'synchronization' is used in memory consistency models in a different way than in ordinary English. In ordinary English, 'synchronization' typically refers to making events happen at the same time. In memory consistency models, 'synchronization' typically refers to PREVENTING events from happening at the same time! For this reason, we'll use the word 'order' where other authors would say 'synchronize' (e.g. 'ordering instructions' where other authors would say 'synchronization instructions' and 'ordered' where they would say 'synchronized'). We'll use the word 'relaxed' for instructions that you might expect ordering from but that don't provide any (nor do they anti-order; they just don't affect it), and we'll still use the word 'asynchronous' as other authors use it.

(an etymological aside: this makes more sense if you imagine that we generalize 'synchronization' to mean merely that there is some constraint on the relative ordering of events (just a guess, but perhaps this peculiar phrasing came about via the intermediate concept of a synchronous circuit, in which a synchronous (in the ordinary English sense) global clock signal is used to effect various constraints on the ordering of various events). Simultaneity becomes merely one constraint among many that might be imposed upon an ordering of events. Most often in memory models the constraint imposed by 'synchronization' is the negation of simultaneity through a happens-before relation; so two events are said to be ordered by synchronization when we know that they cannot happen at the same time. This is almost the opposite of the word's use in ordinary English! I say almost because in both ordinary English and memory consistency models, 'desynchronized' might be used to refer to events that have no constraints on their relative ordering; although even here, in ordinary English, 'desynchronized' might also be used to refer to events that are definitely not simultaneous.)

Definition of "data race": in some SC execution, there exists a pair of competing accesses, neither of which is by an atomic instruction, and neither of which happens-before the other.

Data races are undefined behavior. (our motivation for specifying it this way is that C specifies that (see http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4296.pdf page 14, 1.10.23), and we want it to be legal for a Boot compiler to blindly compile Boot to C without the Boot compiler having to check for the absence of data races first)

Definition of "data-race-free": A program is data-race-free if there are no data races.

All memory operations are atomic except for the following non-atomic operations: byte and halfword loads/stores; also lp, lw, sp, sw.

The 'rlx' instructions ({casamo, lp, lw, sp, sw}rlx) are atomic but relaxed. They do not affect the happens-before ordering.

The 'rc' instructions ({casamo, lp, lw, sp, sw}rc) have RCpc semantics (Release Consistency with Processor Consistency between the ordering operations themselves). sprc and swrc have release semantics: no memory operation can be reordered from before to after the releasing instruction (that is, every memory operation in the same thread that occurs before the releasing instruction in program order happens-before the releasing instruction), and furthermore all writes in the current thread before the releasing instruction become visible to any other thread that later does an acquiring instruction on the same memory location (that is, if the releasing instruction happens-before an acquiring instruction at the same memory location, then every memory operation in the releasing thread that happens-before the releasing instruction also happens-before the acquiring instruction in the acquiring thread). lprc and lwrc have acquire semantics: no memory operation can be reordered from after to before the acquiring instruction, and furthermore all writes in any other thread that executed a releasing instruction before the acquiring instruction become visible to the acquiring thread. Note that the acquiring and releasing instructions themselves have Processor Consistency: all acquiring and releasing instructions in any one thread are observed by all threads in program order, but different threads may observe different interleavings of acquiring and releasing instructions executed by distinct threads.

The sequentially consistent instructions TODO.

The FENCE instruction prevents reordering across the fence; that is, for any instruction before the fence in program order and any instruction after the fence in program order, both in the same thread as the fence, the instruction before the fence happens-before the instruction after it.

The sequentially consistent ordering instructions (a list of these is given above) have the following properties:

Definition of sequentially consistent ordering: For the instructions with sequentially consistent ordering, (a) and in addition (b) each time the program is executed, for each memory_domain a single total ordering of all executions of any of these instructions in that domain exists, and all threads observe these executions in that same global ordering.

casamorlx is provided for situations in which CAS functionality is desired without ordering; relaxed versions of the other atomics can be built using the casamorlx primitive, see [2].

The platform is not allowed to reorder the effects of instructions within a single thread in any way that would be visible in single-threaded programs.

todo talk about torn writes, reordering of instructions, and different threads observing different reorderings.

Implementing Boot

todo

Informal description of relevant primitive memory operations

TODO: describe each of these in a less formal way (less formal than the semiformal model below)

This section is intended to say the same thing as the parts of the section "semi-formal memory consistency model" that defines the semantics of the instructions, but in a less formal way, and with more examples.

In this section, when we talk about instruction reorderings, we are speaking dynamically, not statically; that is, a single static instruction in a certain line of code may be executed many times over the course of a thread or program, and we are considering each of these dynamic instances of execution as separate.

Non-atomic load/store: lw, sw

These are ordinary load/store with no provisions for concurrency. On some platforms they are faster than atomics; however, if one thread could access a memory location while another thread is writing to it, and either of the two threads is using non-atomic loads/stores, then the result is undefined behavior. Therefore, these should not be used to access shared memory, except when other concurrency instructions introduce ordering constraints that rule out simultaneous access.

From the perspective of the thread executing them, lw/sw are always observed to execute in program order. However, from the perspective of other threads, an lw or sw could in some cases be reordered, and in some cases different threads could observe different reorderings. Other concurrency instructions can be used to block some or all of the possible reorderings.

In any case, for any given memory location, the sequence of writes to that memory location is globally observed by all threads to occur in the same order (we call this 'coherence').

Atomic load/store: lw.a, sw.a

These are sequentially consistent load/store. They are the slowest way to load/store, but if you need to access a memory location which is being shared, these are the safest way to do so. Most programmers should always use only these for loads/stores of memory which might be accessed concurrently by multiple threads.

They are atomic, which means that even if one thread accesses a memory location while another thread is writing to it, one of the accesses will execute to completion before the other begins (undefined behavior does not result).

They are sequentially consistent, which means that you can imagine that they, along with all other sequentially consistent operations in the program, execute in some sequential order, that this same sequential ordering is observed by all threads, and that the operations of each thread appear in this sequence in program order.

"Synchronization operations" refers to the subset of instructions lw.a.rc, sw.a.rc, lw.a, sw.a, casrmw, casrmw.rc. Synchronization operations in the same thread may not be reordered with each other.

lw.a is an acquire operation and sw.a is a release operation. Instructions after lw.a may not be reordered before the lw.a. Instructions before sw.a may not be reordered after the sw.a. If lw.a takes its value from sw.a (or from any other release operation), then that release operation is ordered before lw.a; a release operation ensures that previous writes are visible to any other thread that later acquires on the same memory location. These instructions operate in a paradigm called RCsc, short for "Release Consistency with Sequential Consistency between synchronization operations".

Atomic load/store, Release Consistency: lw.a.rc, sw.a.rc

These are Release Consistency load/store. On some platforms they are slower than lw.a.rlx and lw but faster than lw.a. They are difficult to reason about and hence their use is discouraged.

These are not sequentially consistent, and therefore two threads may observe different orderings of operations.

For example, the IRIW (Independent Reads of Independent Writes) anomaly is permitted with these operations (although it would not be permitted when all operations are lw.a, sw.a):

Thread 0: x := 1
Thread 1: y := 1
Thread 2: a := x // reads 1; b := y // reads 0
Thread 3: c := y // reads 1; d := x // reads 0

In more detail, consider the following IRIW program consisting of the following 4 threads:

Thread 0: sw.a.rc 0, &X; ldi $5, #1; sw.a.rc $5, &X

Thread 1: sw.a.rc 0, &Y; ldi $5, #1; sw.a.rc $5, &Y

Thread 2: lw.a.rc $5, &X; lw.a.rc $6, &Y

Thread 3: lw.a.rc $6, &Y; lw.a.rc $5, &X

At the end, it is possible for Thread 2 to have ($5, $6) = (1, 0) and also simultaneously for Thread 3 to have ($5, $6) = (0, 1). That is to say, Thread 2 observed Thread 0's write of #1 to X to execute before Thread 1's write of #1 to Y, yet Thread 3 observed the opposite ordering.

These instructions operate in a paradigm called RCpc, short for "Release Consistency with Processor Consistency between synchronization operations". "Release Consistency" means that the operations have acquire/release semantics. "Processor Consistency" means that although each thread's order of synchronization operations within the thread is observed globally by all other threads, different threads may perceive different interleavings of synchronization operations between threads. "Synchronization operations" refers to the subset of instructions lw.a.rc, sw.a.rc, lw.a, sw.a, casrmw, casrmw.rc.

Note that, although writes to different memory locations may be reordered with respect to each other, writes to the same memory location may not; that is, for each memory location, all threads globally observe all writes to that one memory location in the same order (we call this 'coherence').

Synchronization operations in the same thread may not be reordered with each other.

lw.a.rc is an acquire operation and sw.a.rc is a release operation. Instructions after lw.a.rc may not be reordered before the lw.a.rc. Instructions before sw.a.rc may not be reordered after the sw.a.rc. If an execution of lw.a.rc takes its value from sw.a.rc (or from any other release operation), then that release operation is globally observed by all threads to occur before that lw.a.rc; a release operation ensures that previous writes are visible to any other thread that later acquires on the same memory location.

Atomic load/store, Relaxed: lw.a.rlx, sw.a.rlx

These operations are atomic, but do not impose any ordering constraints. They are neither acquires nor releases. They are just like lw and sw, except atomic, and so they can be used to prevent data races (recall that a data race requires two potentially simultaneous accesses to the same memory location, at least one of which is a write and at least one of which is non-atomic).

CAS: casrmw

casrmw reads the memory location &dest and compares the value read to the argument $expected. If they match, it may write the $new argument to &dest; we say 'may' because it is also permitted to fail to do this (a spurious failure). If it wrote $new to memory, it returns $new; otherwise it returns the value read from &dest. All of this (the read from memory, the comparison, and possibly the write to memory) is done atomically, which is why we call this an atomic RMW operation.

It is guaranteed that this operation will not consistently fail to write $new to memory if the value at the memory location is indeed equal to $expected and there are no concurrent modifications to that memory location.

The reads and writes to memory are sequentially consistent.

casprmw is the same except that the parameters 'expected' and 'new', and the return value, are pointers instead of integers.

CAS, release consistency: casrmw.rc

Same as casrmw and casprmw except that the reads and writes to memory have release consistency instead of sequential consistency.

CAS, relaxed: casrmw.rlx

Same as casrmw and casprmw except that the reads and writes to memory have relaxed consistency instead of sequential consistency.

fence

Fences prevent reordering across the fence within the given memory_domain; that is, no operation that occurs before the fence can be reordered after any operation that occurs after the fence, if both of these operations are in the same memory_domain. Fences are sequentially consistent, which means that they stand in a total order with all other sequentially consistent operations in each memory domain.

Semi-formal memory consistency model

Unfortunately, this was just sort of thrown together, and is likely to be inconsistent. Feedback welcome.

The 12 primitive memory operations in Boot, and the C/C++ operations that they are most similar to, are:

program-order, single-threaded semantics, memory_domain, happens-before

We define "program-order" to be the per-execution, per-thread, dynamic (rather than static) partial order in which instructions are specified to be executed by the program. For example, in the infinite loop program 'cpy $5 $6; jpc -4', the program-order is "cpy $5 $6; jpc -4; cpy $5 $6; jpc -4; ...".

For any Boot program without undefined behavior, the semantics of any thread of execution is given by its semantics if considered as a single-threaded program, except that which values are returned by reads from shared memory are defined here in this memory consistency model. (we call this the single-threaded-program-order property).

Each location in memory belongs to exactly one memory_domain.

For each execution, for each memory_domain, we introduce a partial order 'happens-before'. Informally, happens-before represents the ordering between events that must be respected by every execution of a program and observed by every thread. In the following, to be concise, we omit the 'for each memory_domain' quantifier before each statement regarding happens-before, and we omit the memory_domain subscripts distinguishing the happens-befores.

program-order

happens-before is transitively closed; that is to say, if x happens-before y and y happens-before z, then x happens-before z.

Which values can be read by reads

For any read r, if there is some write w to the same location, then r must take its value from w if both of the following are satisfied by w:

For any read r, if there is no write w meeting the above conditions, then r must take its value from some write w to the same location such that all of the following are satisfied by w:

In case the program contains some reads with no corresponding writes for them to read, we introduce dummy writes of random values at the beginning of the program to every memory location. Each of these dummy writes happen-before every other read and write in the program.

Coherence

In any given execution, for every memory location, all writes to that memory location are totally ordered, and this total order appears within the happens-before partial order (this property is called coherence).

Informally, note that all operations within any single thread are, from the perspective of that thread, totally ordered (by the single-threaded-program-order property), and also that all operations to any single memory location are totally ordered (by the coherence property). So, all that remains to be done is to specify the relationships between events that span multiple threads and multiple memory locations.

Classification of memory operations

The 'atomic' primitive operations are lw.a, lw.a.rc, lw.a.rlx, sw.a, sw.a.rc, sw.a.rlx, casrmw, casrmw.rc, casrmw.rlx, fence. The 'acquire' primitive operations are lw.a, lw.a.rc, casrmw, casrmw.rc, fence. The 'release' primitive operations are sw.a, sw.a.rc, casrmw, casrmw.rc, fence. The 'RMW' primitive operations are casrmw, casrmw.rc, casrmw.rlx, fence. The sequentially consistent primitive operations are lw.a, sw.a, casrmw, fence.

Acquire and release operations

For any two instances of operations in the same thread r1, r2, if r1 is either an 'acquire' or a 'release', and r2 is either an 'acquire' or a 'release', and r1 is before r2 in program-order, and r1 and r2 operate on the same memory domain, then r1 happens-before r2.

For every instance 'a' of an acquire instruction in an execution, and for every other instruction i in the same thread that is a memory operation on the same memory domain and that is after 'a' in program-order, a happens-before i.

For every instance 'r' of a release instruction in an execution, and for every other instruction i in the same thread that is a memory operation on the same memory domain and that is before 'r' in program-order, i happens-before r.

For every acquire 'a' that takes its value from a release 'r', r happens-before a.

RMW operations

The read r and write w generated by a single RMW operation have the properties (this is called the RMW-atomic property):

Sequentially consistent operations

In any given execution, within each memory domain, all instances of the sequentially consistent primitive operations on that memory domain are totally ordered, and whenever one sequentially consistent primitive operation is before another one in the same thread on the same memory domain in program-order, it is also before the other one in this total order. For each memory domain, this total order is contained within happens-before.

fence

A fence has the semantics of "lw.a $0 &dummy; sw.a $0 &dummy" (todo doublecheck syntax) where &dummy is a distinguished (dummy) memory location within the given memory_domain that cannot be accessed in any other way. Recall that fence also is all of an acquire, a release, a RMW, and a sequentially consistent primitive.

thread-local memory_domains

Some memory_domains have the 'thread-local' property. The thread that allocates a thread-local memory domain owns it. If any thread-local memory_domain is ever accessed by any thread other than the thread that owns that memory_domain, this is undefined behavior.

Data races; programs with data races are undefined

A data race is when there are two instructions in different threads, at least one of which is a write or RMW, at least one of which is non-atomic, which are not ordered by happens-before. A data race is undefined behavior.

The behavior of (any program whose possible executions contain any execution that contains undefined behavior) is undefined.

Programs with OOTA are invalid

An OOTA (out-of-thin-air) cycle is a read instruction r that takes its value from a write instruction w whose value syntactically depends on r. By 'syntactic' is meant that syntactic dependencies are determined by the static form of the program, regardless of whether it is possible to realize the dependency in any possible execution. We include control dependencies in 'syntactic'. Programs with possible executions with OOTA cycles are invalid (an implementation may refuse to compile them, or may halt while executing the program with a runtime error) but not undefined (if an implementation does not halt with an error, it must execute the program according to the specified semantics).

Progress

No memory operation may be preceded in happens-before by an infinite sequence of other memory operations.

For each memory location, the latest value (in the total order guaranteed by coherence) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.

SC for DRF

A 'non-SC race' is a generalization of 'data race' where 'non-atomic' is replaced by 'non-sequentially consistent'.

A sequentially consistent execution is defined as an execution for which all memory operations execute in a total order, and for each thread, that thread's program order is contained within this total order.

If no sequentially consistent executions contain any non-SC races, then only sequentially consistent executions are permitted.

Memory model discussion and comparison to other languages' models

The memory operations follow C/C++20 and consist of non-atomic loads/stores, loads/stores/RMWs in each of {relaxed, RCpc, and RCsc} memory orders, and a sequentially-consistent/strong/cumulative 'fence'.

(in the previous sentence, where I said RCsc, I meant operations with C/C++ memory_order_seq_cst, and where I said RCpc, I meant loads/stores/RMWs with memory_order_acquire, memory_order_release, and memory_order_acq_rel).

Another way to say this:

The memory operations follow C/C++20 and consist of:

We provide all of the memory_orders that C/C++20 does, except for memory_order_consume. That is, we provide relaxed, acquire, release, acq_rel, seq_cst; although we don't provide the same amount of flexibility that C/C++ does to mix-and-match operations and memory orders.

For lw and sw, the 'default' (shortest mnemonic) is non-atomic, the weakest one; this is a concession to readability because most loads/stores will be non-atomic; however, once you get into the atomics, the 'default' (shortest mnemonic) is the strongest, and the longer the mnemonic gets, the weaker the memory_order gets. This is intended to serve as a guide or memory aid for the programmer; stick with the atomics with the shorter mnemonics unless you are sure you want the other ones.

I defined 'fence' to be a C atomic_thread_fence with memory_order_seq_cst, but then I also defined it as equivalent to a dummy acquire/release pair. I'm not sure if C fences are actually equivalent to that; 'Strengthen SC fences' in [3] suggests that they are defined independently because it suggests changing the definition of fences without touching the definition of acquire/releases, and also the RC11 paper says "Among these, Lahav et al. [17] proposed strengthening the semantics of SC fences in a different way from the way we do here, by treating them as read-modify-writes to a distinguished location. That strengthening, however, was considered in the restricted setting of only release/acquire accesses, and does not directly scale to the full set of C11 access modes. In fact, for the fragment containing only SC fences and release/acquire accesses, RC11-consistency is equivalent to RA-consistency that treats SC fences as RMWs to a distinguished location". I don't fully understand that, I don't know what the difficulty about scaling "to the full set of C11 access modes" was, or if that difficulty applies here, so unless/until I learn more, I'll leave it like this.

I took the definition of guaranteed and allowed reads from Golang, even though the memory operations are adapted from C/C++. Those are probably incompatible in some way that I didn't notice. Also, compared to Golang, i weakened guaranteed reads so that there is no guarantee if there is some other w' which is happens-before-incomparable to r (w' cannot be incomparable to w because, by coherence, all writes to one memory location are totally ordered), and then i changed the RMW axiom to match. Golang's guaranteed reads would have had "there does not exist any other write w' to the same location s.t. (w happens-before w') and (w' happens-before r)", and i changed it to "there does not exist any other write w' to the same location s.t. (w happens-before w') and (not r happens-before w')".

Another model i saw somewhere apparently had something to the effect of "SC read operations cannot read from overwritten writes" [4]. I think this is provided (not just for SC) by my 'allowed reads' axiom.

Regarding the out-of-thin-air problem (OOTA) my thoughts are (a) it's an open research problem, (b) C/C++20 doesn't say much about it, and we want to be able to compile to C/C++. My guess is that a program that admits OOTA executions should be considered invalid; usually in Boot we have to say that invalid programs are undefined behavior, because we want to be able to compile to C and C says that; but here we have the freedom to not say that because C/C++ doesn't say that OOTA programs are invalid. Note that in order to fully formalize the semi-formal definition that i've given, you have to get into dependencies, which i think are probably not worth the effort to formalize.

We provide a weak CAS but not a strong CAS so as to make implementation easy/possible on platforms without strong CAS.

I'm not sure if i specified 'program order' or the single-threaded-program-order property very well. I mean to say that no reorderings are allowed unless they would be correct in a single-threaded program. Or maybe that each thread sees its own reads and writes in program order.

Note that, outside of the single-threaded-program-order property and OOTA, there is no dependency ordering here.

I've included an 'SC for DRF' guarantee. Rather than requiring only SC synchronization operations, i guaranteed SC as long as all conflicting accesses including one non-SC synch operation are ordered by happens-before. I'm not sure if a 'SC for DRF' theorem holds here (my intuition is that it does), but in any case rather than leave that for a theorem I just included it in the specification.

Some 'litmus tests' and how (I think) they play out under this spec: IRIW (independent reads of independent writes) will be permitted whenever the sequentially-consistent operations aren't used to rule it out. Roach motel movement of stuff outside of any acquire..release critical section into an adjacent critical section will be permitted. RCpc (lw.a.rc) acquire..releases to different variables can overlap, RCsc (lw.a) ones cannot.

I'm not sure if FENCE is really needed, but i'm puzzled about how else to efficiently deal with the situation in which your code calls other concurrent code that makes relaxed accesses to various memory locations within a block of memory, and then later you want to deallocate that memory. You need some way to order all of those earlier relaxed accesses before you free the memory, and you don't really know which relaxed accesses were made because that wasn't your code making them, and you don't want to have to iterate over the whole block and do an acquire/release on each location within it.

I'm disappointed that i ended up including so many memory_orders because i think it makes the model more complicated, but i think they each have a reason to be there. Non-atomic loads/stores are for ordinary, non-concurrent loads/stores. RCsc (lw.a, casrmw) is for when the programmer is doing concurrent programming but doesn't want to reason about weak memory models (which should be the vast majority of the time). RCpc (lw.a.rc) is for when the programmer needs to order stuff between two threads but doesn't need to incur the cost of forcing all threads to stand in a global ordering. Relaxed is for when the programmer doesn't need any ordering at all; for when they just want something like a non-atomic load/store but atomic.

Note: one difference between us and C/C++11 is that we don't support C/C++11's notion of 'release sequences'.

TODO: acquire-release chains are local (but cumulative) TODO: acquire/release can be thought of as delineating critical sections. Stuff outside of the critical sections can leak in (be reordered in), but not vice versa.

RISC-V and ARMv8 have a type of consistency that is not well represented in this model, called "multicopy-atomic" or "multicopy-atomic-other", in which each write may be observed by the writing thread sooner than other threads observe it, but whenever any other thread observes a write, all other threads must then also observe that write (the intuition is that each thread can have its own thread-local writeback cache, but there can't be any writeback caches shared between multiple threads; all writes have to hit main memory before any thread besides the writer can see them, which induces a total ordering between all writes that hit main memory). These architectures get a much simpler memory model by avoiding the need to represent the possibility of IRIW-ish scenarios in which various subsets of threads can each observe different orderings of writes; for example, most of RISC-V's model simply has to specify which orderings from Program Order are 'preserved' into a global ordering. To blindly and safely compile code written for those memory models to this one, i think you'd have to use the SC operations in this model, which are less performant than the non-SC-but-multicopy-atomic operations of RISC-V and ARMv8. I would have liked to extend this memory model to include operations that are non-SC-but-multicopy-atomic; I believe this would require the introduction of another total order to represent the total ordering of write operations at the time they hit main memory, and then modification of the which-reads-are-guaranteed-and-allowed axioms to respect that order for inter-thread reads while also allowing each thread to observe its own writes earlier.
I didn't add that in because (a) i'm running out of opcodes; (b) i doubt anyone will want to blindly compile RISC-V and ARMv8 code to this; (c) we already have what i regard as the bare essentials: local memory operations, sequential consistency, shared memory operations that don't force any ordering, and some way to synchronize between threads without forcing full sequential consistency; (d) compared to RCpc, which is the mode in my model that provides-some-ordering-but-not-SC, multicopy-atomic-other seems to me to be less appropriate for massive parallelism, because there is still something global that has to be synchronized over all threads and all memory locations; so i don't want to get rid of RCpc to make room for multicopy-atomic-other, because i want to focus on massive parallelism; (e) one of the main benefits of multicopy-atomic-other is much simpler models, but by including weaker modes our model would remain complicated anyways; (f) disallowing a multicopy-atomic-other mode will force some algorithms to use our SC primitives where multicopy-atomic-other would have done, which will mean inefficiencies when Boot code is run on multicopy-atomic-other platforms, but higher-level concurrency libraries that really want to squeeze out the last drop of performance could always just bypass the Boot layer and use platform-specific code if they want the highest performance.

The first Progress axiom (about a finite number of memory actions) is taken almost verbatim from RISC-V, and the second (about a finite amount of time before visibility) is taken almost verbatim from C++.

I didn't include program-order in happens-before. I think other memory models include program-order in happens-before (e.g. https://www.hboehm.info/misc_slides/10-pldi-adve-boehm-tutorial.pdf ). But then how do you say that you can't reorder SC operations but you can reorder some other operations? What i did above is create happens-before edges for operations in the same thread that can't be reordered, and make another axiom that says that single threads must observe the same semantics as usual w/r/t reading their own writes. So i'm probably missing something here.

To the allowed-reads axiom, i added the conjunct:

(i'm not sure if this is required b/c i didn't see it elsewhere yet; many other models have the program order be included in happens-before, maybe that substitutes for this? Not sure)

'atomic' operations are those which (appear to) execute atomically, that is, they never (appear to) overlap in time with any other operation.

todo:

seqcst lets you see your program as a simple interleaving of threads

acquire/release is good for acquiring and releasing locks and in this way making sure critical sections dont overlap

TODO

the following is out of

An implementation is of course perfectly free to implement stronger concurrency guarantees than required. For example, an implementation is free to have a global interpreter lock (todo is that enough?), or to surround each and every load and store (and the loads/stores within the implementation of CAS) in the program with full sequentially-consistent rw,rw memory barriers (todo is that enough?). Such techniques allow a Boot implementor to provide correctness without having to learn about and understand the details of memory consistency models/memory orders, at the cost of removing some or all of the efficiency gains presented by opportunities for concurrency.

(todo, mb insert the RVWMO axioms here? except that for us, do we want multi copy atomicity to apply only to ordered things? also, what about atomic, relaxed load/stores, are they "ordering instructions" that can't participate in data races?)

No thin-air reads for data-race-free programs: Boot implementations promise that for data-race-free programs, there are no thin-air reads or self-satisfying conditionals. Apparently there's not yet an academic consensus on how best to formally define these in such a way that the undesirable paradoxical behavior is ruled out without ruling out acceptable optimizations, although in practice these behaviors are not observed on contemporary platforms [5]. For these reasons we forbid these things without defining them. If I had to take a guess at defining what we are excluding, however, I would say that we require that the causal chain determining the value returned by any memory read operation is acyclic. (I think this is implied by DRF-SC? But just being clear here. todo)

Even without any atomicity or ordering, within any shared memory_domain:

We generally follow the definitions and semantics in the RISC-V RVWMO specification, adapted in the obvious way to our instructions (e.g. that specification speaks of specific RISC-V operand fields, which differ from ours). todo: well, maybe not quite; do we guarantee multi-copy atomicity even on relaxed accesses? Probably not.

Blocks of memory are said to be either thread-local or shared: malloc_local returns thread-local memory and malloc returns shared memory.

TODO: we need to guide implementors how to implement these on (a) a system with only FENCE rw,rw and similar and no release/acquire loads/stores, and (b) a system with only release/acquire loads/stores and no FENCE rw,rw, and (c) a system with only (SC?) locks, and no release/acquire loads/stores and no FENCE, and (d) on x86_64, and (e) on ARM, and (g) on Java, and (h) on CIL. mb see also https://www.cs.tau.ac.il/~orilahav/papers/popl16.pdf

TODO: we need to guide programmers how to implement: (a) something like RISC-V FENCE rw,rw using Boot; (b) something like C atomic_fence SC using Boot; (c) SC locks using Boot. mb see also https://preshing.com/20131125/acquire-and-release-fences-dont-work-the-way-youd-expect/

To blindly/conservatively implement Boot on most platforms:

To blindly/conservatively write a Boot program without worrying about this stuff, only use atomic loads and stores on memory that might be shared, and insert FENCE instructions before and after every memory operation (load, store, casamo/casamorlx, addamo, apamo, andamo, oramo, xoramo, plus malloc etc). (TODO is this correct?)

todo: say that we have coherence (and dependencies?)

Undefined behavior

When a program contains undefined behavior, no guarantees are provided on what the implementation will do with that program. The implementation might refuse to compile the program, or the program could crash at runtime, or it could run but return weird results, or corrupt memory, or even execute undesired commands such as reformatting a disk (this could happen, for example, if a function pointer was corrupted and then jumped to, and the corrupted value happened to point to a disk I/O subroutine).

Note that the part of the program containing undefined behavior doesn't even have to execute for the undesired results to occur; the mere fact that some potential execution of the program could execute undefined behavior makes the behavior of the entire program undefined.

Our motivation for this sort of specification is that we want it to be legal for a Boot implementation to blindly compile a Boot program to a lower-level platform such as C or assembly without checking the Boot program for bugs first, and the most straightforward translation of buggy code could have unpredictable results on some platforms. For example, the C standard defines signed arithmetic overflow to have undefined behavior, so a Boot program with signed arithmetic overflow could legally be compiled to a C program with signed arithmetic overflow, and that C program would have undefined behavior. For another example, a data race could cause memory corruption, and memory corruption has unpredictable effects on various platforms. The reason that we specify that the behavior of the whole program becomes undefined if any part of it contains undefined behavior is that some target platforms (for example, C compilers) may assume the correctness of the program given to them and perform global optimizations that are incorrect otherwise. We can't do anything about how the target platforms work, and we don't wish to force Boot implementations to check for bugs in Boot code at compile time or protect against bugs at runtime, so we must inherit the scary semantics of the target platforms here.

The phrase "undefined behavior" is sometimes abbreviated "UB".

Instruction listing

In the following, op0, op1, op2 are short for operand 0, operand 1, operand 2.

TODO: the following is almost all out of date

Three-operand instructions

todos

todo consider the following syscalls/lists of stuff: comparing the PDClib's platform-specific layer and newlib, the following are in the intersection:

fork execve wait _exit unlink link environ open read write lseek close raise ('kill' is used in newlib to send a signal) memory allocation (sbrk in newlib, mmap in PDClib?)

additional functionality only in newlib: fstat getpid isatty stat times

additional functionality only in PDClib: memory deallocation (munmap) signal handler setup (signal) /dev/urandom

comparing the intersection above to the union of current boot and bootx:

not in boot/bootx:

execve link environ raise ('kill' is used in newlib to send a signal)

only in boot/bootx: poll break loadcodeptr flush walk

only in boot/bootx but also in one of newlib, PDClib: time (in newlib but not PDClib), memory deallocation (in PDClib but not newlib), /dev/urandom (in PDClib but not newlib)

so consider adding 'not in boot/bootx' and subtracting 'only in boot/bootx'

---

https://bitbucket.org/pdclib/pdclib/src/c8dc861df697a6c8bddbcbf331d9b6fcae6e2f4d/platform/posix/functions/_PDCLIB/_PDCLIB_stdinit.c?at=default&fileviewer=file-view-default

https://bitbucket.org/pdclib/pdclib/src/c8dc861df697a6c8bddbcbf331d9b6fcae6e2f4d/platform/posix/includes/signal.h?at=default&fileviewer=file-view-default

https://bitbucket.org/pdclib/pdclib/src/c8dc861df697a6c8bddbcbf331d9b6fcae6e2f4d/platform/posix/internals/_PDCLIB_config.h?at=default&fileviewer=file-view-default

---

https://docs.microsoft.com/en-us/cpp/cppcx/crt-functions-not-supported-in-universal-windows-platform-apps

do we really want to follow their lead there? There is no stdin,stdout,stderr; there is no 'system' or 'execve' to execute other applications ("A UWP app cannot invoke another UWP app or a desktop app."); there are no environment variables; there are no pipes.

hmm yeah maybe we should follow their lead here, actually, for Boot if not for Oot; Boot is supposed to be minimalistic.

so:

---


more todos: