The contents of boot_reference as of 2017/03/27, just before a cleanup/removal of old-now-incorrect stuff/decision to remove 'optional' instructions:

Title

:hardbreaks:

Version: unreleased/in-progress/under development. Working towards development version 0.0.0-0.

todo: i said this was RISC-y, but: + +

Cheatsheet

Instruction encoding: 16 bits, little endian, from LSB to MSB, for fields each 4 bits long: opcode, operands 0, 1, 2. Instructions are packed into 64-bit aligned quads. The least-significant-bit of each quad must be 0. + + Datatypes: u16 (unsigned 16-bit integer), ptr (pointer). + + Registers: 16 registers. First 8 are special: + + (todo see below this is out of date)

0. 0 register 1. PC 2. ERR 3. SCRATCH 4. HIDDENSTACK 5. DATASTACK 6. CALLSTACK 7. reserved for future use + + Instructions: todo: this is slightly out of date, copy updated version from below:

0. annotate (also nop, when operands are zero) 1. loadi (load immediate) 2. load (load from memory to register) 3. store (store from register to memory) 4. cpy (copy from register to register (in other architectures this is usually called "MOV")) 5. skipz (skip if zero) 6. skipnz (skip if non-zero) 7. calli (call indirect) 8. pop 9. push 10. add-u16 (addition of u16s) 11. sub-u16 (subtraction of u16s) 12. leq-u16 (less-than-or-equal of u16s) 13. add-ptr (add a u16 to a pointer) 14. sub-ptr (subtract a u16 from a pointer) 15. vmcall (other vm functionality)

Instruction encoding

The instruction encoding is fixed-length 16-bits, and instructions are grouped into 64-bit 'quads'. + Encodings are little-endian, and bits are given from LSB to MSB. + + The encoding of each instruction is:

4 bits: opcode
4 bits: operand 0
4 bits: operand 1
4 bits: operand 2 + + Four instructions are grouped into a 'quad'. The quads must be 64-bit aligned in memory. The least significant bit of each quad must always be 0. This implies that every fourth instruction only has half of the opcodes available to it. The motivation for that is so that implementations may extend Boot by defining additional syntax and semantics for 64-bit chunks whose least significant bit is 1. + +

Data types

There are five defined data types:

U16 (unsigned 16-bit integer)
PTR (pointer)
CODEPTR (pointer to executable code)
OTHER (opaque implementation-dependent type)
UNDEFINED (the result of an invalid operation) + + Pointers (PTRs) are semi-opaque; although there are instructions to add integers to pointers, pointers are not themselves integers. Pointers cannot be added or compared to other pointers. The underlying representation of pointers is unspecified; pointers may be, for example, 64-bit native pointers, or they may be 16-bit integers. + + Each register, stack, and memory location must be able to contain at least one "word" which can be either of a 16-bit integer, or a pointer, or a codepointer. Registers, stack, and memory locations may or may not be capable of containing other opaque values that are neither U16s nor PTRs nor CODEPTRs ('OTHER'). + + Codepointers (CODEPTRs) are opaque; platforms may fail to support arithmetic on them or dereferencing them (for writing or even for reading). Similarly, no standard Boot instructions support operating on OTHER types; these may only be moved from place to place (if an implementation uses OTHER values, it is anticipated that implementation would provide library functions or other implementation-dependent extensions that can produce, manipulate, and consume them). + + Platforms are not required to detect attempts to apply unsupported operations on data. + + If an attempt is made to dereference and write to the location 'pointed to' by a non-PTR, or if a dynamic jump is made to a non-CODEPTR, then the result is undefined behavior. Similarly, an attempt to dereference and read from the location 'pointed to' by a non-PTR should result in a crash or in an UNDEFINED value (however, implementations are permitted to behave in an undefined manner for invalid pointer reads, although this is discouraged; this is only allowed because in some cases the platform that the implementation is running on may already do something weird when certain memory locations are read, and we don't want to make the implementation check each memory access to avoid those locations). Otherwise, application of unsupported/incorrectly typed operations result either in a crash (ideally with an INVALID_TYPE error code, but this is not required), or continued program execution with the result of the operation being a value of type UNDEFINED (but undefined BEHAVIOR, that is, a behavior other than these two alternatives, is not allowed, however). For example, application of an ADD-U16 instruction to a PTR may result in a crash, or an UNDEFINED value that behaves like some arbitrary PTR, or an UNDEFINED value that behaves like some other type, but will not result in arbitrary memory corruption. + + Since we don't define the semantics of either OTHER or UNDEFINED, these would seem to be redundant. The intent is that OTHERs may be created by correctly-behaving libraries and implementation extensions, whereas UNDEFINEDs are only created by erroneous programs. + +

Registers

There are 16 registers. The first 8 registers are special.

0. 0 register 1. PC 2. ERR (or STATUS?) 3. TOS (alias to top of datastack) 4. CAP (or MODE?) 5. HIDDENSTACK 6. DATASTACK 7. CALLSTACK + + There are three stacks (hiddenstack, datastack, callstack). The stack pointers are in scare quotes because Boot programs are not allowed to directly manipulate these; all stack manipulation must be accomplished via the PUSH and POP and LOAD and STORE instructions. The stacks may or may not be stored in ordinary memory, and the "pointers" may or may not be ordinary pointers. + + In more detail: + +

0. 0 register

Reading from the zero register always returns 0. Writing to the zero register has no effect. + +

1. PC

The program counter. This is a pointer to the next instruction to be executed. It is read-only. A Boot program is not allowed to directly read or write program memory, or dereference pointers into program memory. Program memory may or may not be stored in ordinary memory.

2. ERR

This is used to return error codes. An error code of "0" means no error (success).

It is also used to return carry and overflow results from addition and subtraction.

# 3. CAP This stores a stack "pointer" to the capability stack.
1. 4. HIDDENSTACK

A stack "pointer".

The capacity of the hiddenstack is 15 items.

The compiler must be able to resolve at compile time the stack offset of each access of the hidden stack, relative to the hiddenstack pointer at the beginning of function/code segment.

# 5. DATASTACK

A stack "pointer".

# 6. CALLSTACK

A stack "pointer".

# 7. TOS Alias to the top item on DATASTACK.

Instructions

The 16 three-operand instruction opcodes and mnemonics and operand signatures are: 0. annotate: ? ? ? (this instruction is called BAD when all operands are zeros) 1. loadi: oi k k (load 8-bit u16 immediate) 2. loadlo: oi k k (load 8-bit immediate with multiply; multiply the destination register by 256 and then add the 8-bit immediate value. NOTE: this instruction is illegal except immediately following either a LOADI to the same destination register, or another LOADIM to the same destination register. NOTE: these allow you to load constant greater than 8 bits. The maximum constant size is platform-specific; a sequence of LOADI/LOADIM instructions that would create a larger constant than the platform supports is unsupported on that platform) 3. load: o k si (load from memory (plus constant offset) to register; atomic/tear-free) 4. store: so k i (store from register to memory (plus constant offset); atomic/tear-free) 5. TWO: ? ? k (two-operand instructions)) 6. jrel: 0 0 k (relative jump) 7. skipz: 0 0 ii (skip if zero) 8. skipnz: 0 0 ii (skip if non-zero) 9. add-u16: oi ii ii (addition of u16s) 10. addi-u16: oi ii k (add a constant to a u16) 11. sub-u16: oi ii ii (subtraction of u16s) 12. leq-u16: o ii ii (less-than-or-equal of u16s) 13. add-ptr: op ip ii (add a u16 to a pointer) 14. addi-ptr: op ip k (add a constant to a pointer) 15. cas-atomic: pio i i (compare-and-swap)

The 16 two-operand instruction opcodes and mnemonics and operand signatures are: 0. cpy: o i (copy from register to register (in other architectures this is usually called "MOV")) 1. or-u16: io i (bitwise OR: op0 = op0 OR op1) 2. and-u16: io i (bitwise AND: op0 = op0 AND op1) 3. xor-u16: io i (bitwise XOR: op0 = op0 XOR op1) 4. add-atomic-u16 pio i (value at memory location is atomically replaced with the sum of its previous value and a register) 5. add-atomic-ptr pio i 6. mul-u16: io i (multiplication: op0 = op0 MUL op1) 7. xchng-atomic: pio io (value at memory location is atomically exchanged with value of register) 8. picki: sm k (push a copy of the n-th item in a stack (zero-indexed) onto the top of the stack) (note: this can also do dup (with r0), over) 9. rolli: sm k (rotate the top n+2 elements on a stack) (note: this can also do swap (with k=0), rot) 10. slli-u16: io k (op0 = logical_shift_left(op0 by k bits)) 11. srli-u16: io k (op0 = logical_shift_right(op0 by k bits)) 12. library: k k (call to platform-specific library function identified by 8-bit constant (remember, each operand is only 4 bits); calling conventions are as with syscalls) 13. pop: o sim 14. push: som i 15. ONE: ? k (one-operand instructions)

The 16 one-operand instruction opcodes and mnemonics and operand signatures are: 0. jd: pc (dynamic jump) 1. stackload: sm (pop saved stack from DATASTACK, and load it) 2. stacksave: si (save stack, and push it to DATASTACK) 3. fence-acquire: p (the memory location p is ordered by an acquire memory barrier) 4. fence-release: p (the memory location p is ordered by a release memory barrier) 5. loadmulti-atomic: k (pop pointer (pi) from DATASTACK, atomically load k items onto DATASTACK) 6. storemulti-atomic: k (pop pointer (pi) from DATASTACK, atomically store k items from DATASTACK) 8. not-u16: io (bitwise NOT: op0 = NOT op0) 8. addstk-u16: sm (stack version of add-u16) 9. substk-u16 10. leqstk-u16 11. addstk-ptr: sm (stack version of add-ptr) 12. fence-seq: p (the memory location p is ordered by a sequential consistency memory barrier) 13. stackif: sm (pop top three items in stack, if top item was 0, push second item, otherwise push third item; C's ternary conditional operator) 14. ZERO: k (zero-operand misc operations) 15. syscall: k (zero-operand I/O and IPC)

The 16 zero-operand misc operations are: 1. 2. casstk: (stack version of CAS, on DATASTACK) 3. log: (log a message; loglevel k and then pointer to message is on top of DATASTACK) 4. mdealloc (deallocate memory (like C's free); ptr is on TOS) 5. 6. 7. 8. sra1stk-u16: (TOS = arithmetic_shift_right(TOS by 1 bit)) 9. modeset: (change the processor state in some way (eg set error handler); 2 arguments passed on stack) 10. malloc-shared: (allocate shared memory supporting atomic operations; amount of memory requested is popped from DATASTACK, result is pushed to DATASTACK) 11. mrealloc: (resize memory allocation (like C's realloc); pointer and new size is popped from top of DATASTACK) 12. malloc-local: (allocate local memory; amount of memory requested is popped from DATASTACK, result is pushed to DATASTACK) 13. regload: (pop k from datastack?, pop saved k registers from CALLSTACK and restore k registers starting with 8) 14. regsave: (pop k from datastack?, save registers greater than 7 and less than 7+k, and push to CALLSTACK; stored register format is opaque (but is it one or many entries on CALLSTACK?!? b/c we want to be able to manually manipulate CALLSTACK and even move saved registers elsewhere in memory, right?)) 15. break

The 16 zero-operand syscalls are:

0. sysinfo (does this take a 16-bit query argument on the stack, or just return all of the system info every time? probably the former) 1. delete 2. walk (like plan9's 9p's walk; todo is this actually different from ls/query, i forgot) 3. rand 4. sync/flush 5. seek 6. fork 7. forkdone (fork's join/wait) (todo can't we do this some other way, using channels, without burning an opcode?) 8. open 9. close 10. read 11. write 12. poll (todo: how exactly poll works) 13. time (takes a clock_spec on the stack, and returns one or more words on the stack according to the clock_spec; returns 0(s) and sets ERR condition if requested clock is not supported) 14. 15. halt (pop DATASTACK and terminate program execution, returning the popped value)

Many of the zero-operand instructions (esp. the syscall-y ones) also use DATASTACK to pass/return parameters.

todo:

mb we do want loadd and stored in the three-operand instructions: b/c the reason for leaving them out is that you could just do pointer arithmetic to get the same thing in two instructions, but can you do pointer arithmetic on stack "pointers"? If not, then this doesn't work.
is it better to have in-place and/or, or accumulator (ie outputs to scratch) and/or?
does MUL deserve to be in the 3-operand instructions?
REGSAVE and REGLOAD could just always push/pop from CODESTACK, in which case they could both be one-operand and could take a k argument for number of registers)
could have some zero-operand instructions (syscalls, i guess?)
yknow, we don't really want mlen to return u16, and we dont want to limit malloc to accept a u16. So mb replace u16 with a 'native' 'number' type that is guaranteed to be at least u16, but not guaranteed to be u16, so eg if you add 1 to 65535 it may or may not overflow or wrap to 0
can opaque values be in constant table, or are they limited to u16s? if u16s, then limited to 64k words of code because that's the biggest jump constant you can encode. That seems too limiting...
if limited to u16, should we even have a constant table? without a constant table, you could still have a jump table, but it would limit code size to 4k words, and also would be less space-efficient to store

maybe:

stack size and stack capacity queries?
REST-like CRUD and collection operations?
capability security
cpuid
reset
mark register as unused
sync/flush
kill
array, string, map creation?
CSWAP (fredkin) and CCNOT (Toffoli)? Note that CCNOT can do AND in two instructions (load 0 into a register x, then do CCNOT(A,B,x), where x is the output), and, given a zero register, if you make 0 swap instead of 1, CSWAP generalizes SWAP, and also can do OR in two instructions (LOADI 1 into a register x then do CSWAP(A, B, x), where x is the output). Of course, CSWAP and CCNOT can also be defined by and+or in a similar fashion
getinfo/mode, setinfo/mode? mb to set error handing strategy, eg when we encounter an err, do we put the err code into ERR or do we jump to an error handler?
stack addition/subtract/mul?
a way to check capability access beforehand without causing an ACCESS DENIED error?

maybe (copied from ootLibrariesNotes6):

readinfo remove
writeinfo remove dup mount
exec wait
getpid(getmypid) kill
pipe signal stuff_to_setup_signal_handlers
cpuid/uname/myinfo/version
mv mkdir 12. info: pi (finfo/stat) 13. query: pi (ls/query) 6. mcalloc: po i (allocate memory and zero it (like C's calloc))
don't need 'sleep' b/c we have 'poll' which has a timeout
done?: create new code segment, and/or transform data memory to code memory, and/or transform code memory to data memory (and possibly, create new code segment with constant table)
done?: loadk, loadj (load from constant table/jump table; if we have these, we need to have a way to set these tables, right? b/c we don't have a LOADMODULE in Boot)
calli: o i pc (call indirect) (should calli's arguments be numbers of parameters and return values passed on the stack, instead of actually being parameters and return argument values?) mkcode pi ?or mb more generally, ?chmod_pointer? for capability stuff? but mb we need mkcode so that we have a way to specify the constant table with the code? rroll: sm i (rotate the top n elements on a stack in the other direction) cnot: io io (CNOT gate; flip the contents of register op0 iff the contents of register op1 is 1) not: oi ii (logical not) (we don't need: just do SUB 1 - x) (do we need NEG? no, just do SUB 0 - x) loadd: o i si (load from memory to register, dynamic offset) (this would probably be frequently used, but is not essential, as you could just do an ADD-PTR followed by a LOAD) stored: so i i (store from register to memory, dynamic offset) mlen: si (current depth of stack or length of allocated memory (should those be combined?)) atomic add, and, or, xor, max, min exec addstk-u16 sm substk-u16 sm leqstk-u16 sm addstk-ptr sm stackisempty: o (is the stack empty? should probably be 0-operand. Useful if we want to enumerate the stack for debugging without doing a stacksave. Not sure if we should keep this one; b/c it will be rare and you can always just do a stacksave, if we make the stacksave format non-opaque) loadk: k k k (load from 12-bit-indexed constant table, and push onto DATASTACK) switch (jrel with a register input) swap: io io (swap the contents of two registers) (note: 'roll STACK 0' can be used to swap the top 2 elements in a stack) andi-u16: io k (bitwise AND: op0 = op0 AND constant) (note: is this worth it? remember that constant can only be 0-15. would stackandi-u16 k k be better?) 5. capunpack: (from the capability on top of CAP, generate a struct describing it. there is no 'cappack' because you can't edit capabilities) 6. capalloc: (allocate new capability stack (input on TOS: size to allocate); returns the new stack pointer on data stack. Deallocate with ordinary mdealloc)

note: over and dup can be done with 'pick 1' and 'pick 0'

TODO see also those other lists of popular linux syscalls that i made, and my ideas regarding instructions for epoll, completion ports, etc

note: the reason for (malloc, mdealloc, open, close, read, write), is that they are fundamental and common. (walk) because it is fundamental. (poll sleep) because they are called a lot in tight loops and so you want them to be efficient.

Where the signature characters mean:

k: compile-time constant (integer from 0-15, inclusive)
i: input register (input is any type except for stack "pointers") (register direct)
o: output register (output is any type except for stack "pointers") (register direct)
io: input and output register (any type except for stack "pointers") (register direct)
ii: input register (input is integer type) (register direct)
oi: output register (output is integer type) (register direct)
ip: input register (input is pointer type)
op: output register (input is pointer type)
pi: input pointer (register indirect) (stack "pointers" not allowed)
po: output pointer (register indirect) (stack "pointers" not allowed)
pm: pointer which may be read and/or written and/or mutated (stack "pointers" not allowed)
si: input pointer (register indirect) (both ordinary pointers and stack "pointers" are allowed)
so: output pointer (register indirect) (both ordinary pointers and stack "pointers" are allowed)
sm: pointer which will be mutated (both ordinary pointers and stack "pointers" are allowed)
sim: input pointer which will be mutated (both ordinary pointers and stack "pointers" are allowed)
som: output pointer which will be mutated (both ordinary pointers and stack "pointers" are allowed)
pc: input code pointer
?: variant, or reserved for future use
0: must be 0

In addition, the mnemonic 'nop' refers to 'annotate' with all zero operands.

In the following, op0, op1, op2 are short for operand 0, operand 1, operand 2.

Three-operand instructions:

## 0. annotate: ? ? ? If all bits are zero, this is not ANNOTATE but rather BAD, an illegal instruction. This is to help halt execution if memory is accidentally executed, since some memory may contain zeros.

Used to embed metadata into code. Has no effect on execution; can be skipped by an interpreter or compiler.

## 1. LOADI (load immediate): oi k k

A number is formed by bitwise concatenating op1 and op2, with op1 being MSB (that is, by multiplying the value of op1 by 16, and then adding this to op2). This number is then treated as a u16 and stored in the register indicated by op0.

## 2. LOAD (load from memory to register): o k si

The contents of the register indicated by op2 is taken to be a pointer. The memory location indicated by this pointer, plus op1, is looked at, and the value found there is stored in the register indicated by op0.

Note: by giving a stack pointer in op2, LOAD can be used to peek at stack items at indices from 0 to 15 without POPing. Stack "pointers" are allowed in op2.

## 3. STORE (store from register to memory): so k i

The value in the register indicated by op2 is stored in (the memory location indicated by the contents of the register indicated by op0, plus op1).

Note: by giving a stack pointer in op2, LOAD can be used to overwrite stack items at indices from 0 to 15 without PUSHing. Stack "pointers" are allowed in op2.

## 4. SKIPZ (skip if zero): 0 0 ii

If the value in the register indicated by op1 is zero, then the PC is incremented by 1.

Operands 0 and 1 must be zero; skipz instructions with non-zero operands 0 or 1 are reserved for the internal use of the implementation, and are syntax errors if found in Boot program code.

## 5. SKIPNZ (skip if non-zero): 0 0 ii

If the value in the register indicated by op1 is not zero, then the PC is incremented by 1.

Operands 0 and 1 must be zero; skipnz instructions with non-zero operands 0 or 1 are reserved for the internal use of the implementation, and are syntax errors if found in Boot program code.

## 6. JREL (relative jump): 0 0 k

If k > 7, increment the PC by k-7. Otherwise, increment the PC by k - 9.

## 7. two (two-operand instructions): ? ? k

The two-operand instruction identified by op2 is called with operands op0 and op1. See section below for a list of two-operand instructions.

## 8. POP: o k sim

The n-th item on the stack specified by op2 is popped (reading the value, then reducing the stack size by 1 and moving each item above the n-th down by 1), where n is op1. The value that was popped is placed into the register specified by op0.

## 9. PUSH: som k i

The value in the register specified by op2 is pushed into the stack at the n-th index, where n is given by op1 (increasing the stack size by 1 and moving each item above the n-th up by 1, then writing the value).

## 10. ADD-U16 (addition of u16s): oi ii ii op0 := op1 + op2 + ERR, where ERR is either 0 or 1

if the result is larger than op0 can hold, op0 will be set to the result mod its op0's maximum value, and ERR will be set to 1; otherwise ERR will be set to 0

## 11. SUB-U16 (subtraction of u16s): oi ii ii op0 := op1 - op2

todo: this instruction was deleted, i think

The current PC is PUSHed onto CALLSTACK. The contents of the register specified by op1 is PUSHed onto DATASTACK. The PC is set to the contents of the register specified by op2, and execution continues until a RET is encountered when the TOS of CALLSTACK is equal to the next instruction. At that time, the register specified by op0 will be set to the return value of the RET, and execution resumes.

Two-operand instructions:

todo: some of the numbers below are wrong

## 0. CPY (copy from register to register (in other architectures this is usually called "MOV")): o i

The value in the register indicated by op2 is copied into in the register indicated by op0.

## 1. NOT (logical not): o i

The values in the registers specified by op0 and op1 are swapped.

## 3. MALLOC (allocate memory): po i

Memory whose size is the value in the register indicated by op1 is allocated. Note: each location in memory must be able to hold at least a 16-bit word (but could hold more, depending on the implementation). So the units of the amount of memory requested is not bytes. If the allocation is successful, a pointer to the new memory is placed into the register indicated by op0. If the requested amount of memory is not available, ERR will be set to the code for OOM (out-of-memory) and the value in the register specified by op0 is undefined.

One-operand instructions:

## 0. RET (return from call): o

CALLSTACK is POPped and the PC is set the value popped, and the value in the register indicated by op0 is returned.

Calling convention

All registers are volatile/caller-save, that is to say, the callee is permitted to modify any register and does not need to preserve any of them.

Parameters and/or return values are passed on DATASTACK. Caller-cleanup, that is, the callee does not pop parameters passed to it on the stack even if it doesn't need them anymore; the caller does that.

If success or failure is to be indicated upon return, then upon return, ERR has been set to a u16: 0 for success, or an error code otherwise.

Error codes

... 0. success (no error) 1. carry (not an error) 2. resource unavailable (used by MALLOC and REALLOC for OOM) 3. access denied 4. unimplemented instruction/operation (eg, this is returned if TIME or CAS is encountered but not available on this platform) 5. invalid type 6. invalid argument

(do we have a separate error code for UNSUPPORTED CLOCK? UNSUPPORTED SEEK?)

30. BAD/illegal instruction encountered 31. internal error (unexpected internal error in the Boot implementation; this should cause a crash, so it should never be seen by Boot code) 32-63. RESERVED 64-127. implementation dependent 128-255. custom (reserved for user code)

?? todo: signal interrupts? ...

Note: 0 and 1 are not treated as "errors"; when ERR is occupied by either 0 or 1 there is no 'error' (so that these can be used for 'carry' and 'overflow' in arithmetic), otherwise there is.

RESERVED

Items marked RESERVED are not available for implementation-dependent use; they are reserved for future versions of Boot.

Semantics

todo, anything that we need to say here?

Concurrency semantics

In shared memory (memory allocated with malloc-shared), LOAD, STORE, and atomics (the operations labeled -atomic) (where supported by a particular implementation) are atomic. All implicit loads and stores of single Boot words (16-bit uints, or pointers, plus whatever 'OTHER' types the implementation provides) are tear-free/atomic. Nothing else in shared memory is guaranteed to be atomic. Some or all of the -atomic operations may not be available on all implementations. Some implementation may implement some atomics, or even all memory accesses to shared memory, via a lock at the granularity of the entire region allocated by one call to malloc-shared. Valid Boot programs must not share local memory (memory allocated with malloc-local), that is, they must not dereference a pointer to local memory which is owned by another process (local memory is owned by the process that allocated it).

Registers are never shared, hence single instructions that only do computations between registers (ie. that don't access main memory or I/O or IPC) are atomic.

Our memory consistency model is release consistency. In the absence of any FENCE instructions, atomic operations have only a relaxed memory ordering (see memory_order_relaxed in http://wayback.archive.org/web/20170129132623/http://en.cppreference.com/w/cpp/atomic/memory_order#Constants ). Synchronization is only effective within a memory domain (which is provided as an argument to malloc-shared); this is to accomodate NUMA and distributed architectures in which one process may have access to multiple distinct memories. To do an acquire, use FENCE-ACQUIRE (see memory_order_acquire in http://wayback.archive.org/web/20170129132623/http://en.cppreference.com/w/cpp/atomic/memory_order#Constants ); to do a release, use FENCE-RELEASE (see memory_order_release in http://wayback.archive.org/web/20170129132623/http://en.cppreference.com/w/cpp/atomic/memory_order#Constants ); use FENCE-SEQ to have the effect of both a FENCE-ACQUIRE and a FENCE-RELEASE and in addition to ensure sequential consistency amongst all FENCE-SEQs in the same memory domain (see memory_order_seq_cst in http://wayback.archive.org/web/20170129132623/http://en.cppreference.com/w/cpp/atomic/memory_order#Constants ). If you don't know what any of this means, then please always follow these guidelines:

only use atomic operations (including LOAD and STORE, which are atomic) to access shared memory
always put a FENCE-SEQ immediately before and after these shared memory accesses
be aware that other threads may be accessing shared memory interleaved with yours. For example, if you read x from shared location X, and then read y from shared location Y, and then compute f(x,y) and then store the result into shared location Z, be aware that another thread may have changed the value of X or Y or both in the meantime, so you cannot assume that Z = f(X,Y) after this computation; for example, if f() is multiplication, then if you read x=3 and y=4 and wrote z=12, perhaps another thread changed x to 10 while you were multiplying, causing the end result to be x=10,y=4,z=12 (so z != x*y).

CAP register

The semantics of the CAP register (including its access privileges) are implementation-dependent.

Access privileges

A process may freely read/write to registers (except for PC, which is read-only, and IMPREG, which may or may not be readable and/or writable, possibly conditionally). A process may freely read/write to memory that itself has MALLOC'd unless it dropped access privileges in an implementation-dependent fashion. Otherwise, access privileges are implementation-dependent. The implementation should use the ACCESS_DENIED error code to express when an operation was illegal due to insufficient privileges.

16-bit arithmetic

If a Boot program computes an arithmetic operation which, mathematically, should produce an integer value greater than 2^b - 1, where b is bitwidth indicated by the output of the SYSINFO call for the supported bitwidth, then the output value is an undefined value. Since b is guaranteed to be at least 16-bits, values up to 65535 are always safe.

Appendix: the .ob0 file format

A Boot implementation is not required to support the .ob0 file format; it is an optional, related technology.

All integers are little-endian.

header:
- 4 bytes: magic number: 1B 6F 62 30
- 2 bytes: version: 0
- 2 bytes: header length (without padding)
- 2 bytes: max stack depth required, or 0 for a request to indicate unknown or a request for more than 65535
- 2 bytes: function count
- for each function:
  - 2 bytes: function ID
  - 2 bytes: function offset
  - 2 bytes: function code length
- implementation-dependent bytes
- padding (to 16-bit alginment)
for each function:
- 2 bytes: function header length
- 2 bytes: input count (must be <= 255)
- 2 bytes: output count (must be <= 255)
- for each input:
  - 2 bytes: input type
- for each output:
  - 2 bytes: output type
- implementation-dependent bytes and padding (to 64 bits)

TODO: remove 32k/64k file length limit in implementations TODO: increase # of bytes for function offset to 4 or 8 TODO: increase function length limit to 2^32 or 2^64, and increase # of bytes for function code length (or simply remove this) TODO: harmonize #function limit to 65535 in implementations TODO: add max_constant_length to implementations TODO: move variable-length function headers to the beginning of file, and list their offsets in the header at top (or just concatenate them at top and expect the implementation to read through all of those input and output types to get to the offsets). The end of the file is just a chunk of code (and other implementation-specific data).

TODO: above, we went to all this trouble (the LOADIM instruction) mostly just so that we wouldn't have to set a platform-independent limit the length of code, but here the maximum length of function code is very small, in fact in current implementation an ob0 file is limited to just 32767 bytes, and in asmdis to just 65536 bytes (todo: harmonize those). Shouldn't we make this larger, if not unbounded? At least 32- or 64-bit? otoh recall that the 64k limit makes sense if we are using ob0 files for custom instructions, which have a <64k limit (64/+-8 = 8k limit, i think, b/c of JREL's ability to skip up to 8 instructions)

NOTE: since the ob0 file format already has separate 'functions', do any instructions refer to these? eg is there a JUMP instruction to go to one of these functions? Can you JUMP into the middle of another function? Bring back 'custom instructions' in the form of a CALL2? NOTE: are the jump targets in code with reference to the beginning of the ob0 file, or with reference to the beginning of the current function? i'm thinking that the jump targets are relative to the whole file, you can jump wherever you like, there is no CALL2, and there is no special support for functions; the functions are just entry points for external use. Of course, certain types of use (eg custom instructions) can impose additional restrictions. In this case, why partition the file into code for functions? Well, (a) we have a fixed length file header but a variable-length per-function header (actually not though b/c the # of functions per-file is already variable), and (b) some usages may find it useful to partition the code into functions. However, here, the partitioning adds no extra information; a use-case which wants to partition memory can do so even if all of the functions are concatenated in the file, since it knows the entry points; and some users may want to read in all of the functional headers without the function bodies. So, actually, i added a TODO above to move everything to the beginning of the file and just leave a chunk of code at the end

TODO: if we DO want to stick with length limits, then should we change LOADIM to just LOADHIGH, permanently limiting Boot to 16-bit constants?

Appendix: design choices discussion

CAP register

Although its behavior is left open, the CAP register is intended (but not required) to be used for access control/security/privilages/capabilities (hence the name CAP).

Lack of safety and security

If the purpose of Boot is to serve as a substrate for and a subset of Ovm, and if Ovm is to be a safe language, then why don't we require bounds-checking at the Boot level? Why do we allow invalid pointer operations to have undefined effects? Because Boot will be used to execute the Ovm implementation, and for efficiency we will trust the Ovm runtime to be safe and correct. Ovm will be a safe language; Boot is not.

Note that this means that an Ovm implementation running on a Boot platform will not be able to simply identify parts of an untrusted program that it is running that contain only instructions in the Boot subset of Ovm, and blindly execute them directly on the Boot platform; it must prove safety or insert bounds-checks.

Undefined values for arithmetic over 16-bits

We allow the implementation freedom to produce arbitrary output values when the outputs 'should' overflow 16-bits because we want to allow Boot implementations to be able to simply use the native arithmetic operations of the underlying platform without inserting checks for overflow.

todo

check vs. webassembly, Lua, JVM, CLR, LLVM to see what we are missing.
- Already checked vs. TinyRISC?-V. Missing: JALR (i don't think we need that)
- Vs. RISC-V compressed, missing:
  - scaled variants, sign- and zero- extending variants (i don't think we need these)
- already checked vs. atomics on risc-v, llvm, webassembly, clr (i think). Missing and or xor dec(or sub). i think we'll live. also, the is_lockfree stuff should be covered by our cpuinfo analog.

some things our CPUINFO will probably need:

our version
which instructions are supported?
how many items can load/store multi handle (max k. Note that our encoding yields a max of 16, eg 256-bit uints)?
which atomics are lockfree?
how many CPU cores (or, VM OS processes) are available? can be bind our threads to CPU cores? which CPU core are we running on right now?

our MALLOC-SHARED needs to specify a 'memory domain' -- fences only synchronize within a memory domain

i dunno, maybe bring back CAP. Just make it read-only, so that the implementation can control writing to it, or something like that ('something like that' b/c you do want to write from it when copying from a CAPMALLOC'd memory location, and you want to do stack ops like ROT on it). It would be nice to know that running arbitrary Boot code won't subvert the capabilities register itself. Although i guess OotB? isn't exactly a strict extension of Boot if OotB? forbids some memory accesses due to capabilities?

i'm conflicted about whether or not we should just eliminate all of the non-essential instructions and leave them open for the Ovm level (make them implementation-dependent, i guess). I'm leaning towards eliminating them. Pros:

we get closer to our original 'dead-simple' goal
if i end up implementing an Ovm implementation in Boot by hand, then whenever i change the Boot instruction set i'll have to rewrite that code. Which means that i might try to restrict myself to the core of Boot instructions most likely to survive changes. Which suggests that that core should be the real Boot instruction set -- because presumably eventually other people will have the same issue
since Ovm SHORT == Boot, after we have real Ovm code, we surely will want to profile it and see which instructions are most needed in SHORT -- we probably can't correctly guess that at this stage

Cons:

most of the additional instructions either aren't that hard to implement, and would save significant code length (eg addi)
some of the additional instructions are things that need to be at a low level, such as the acquire-release fences
as long as we leave in the TWO/ONE/ZEROs, we've already added a lot of structure
without seeing the full picture and seeing how much opcode space we have at various places, it's hard to know whether we want to eg. have MUL in TWO or at the top level (three-operand)
some of these are pretty obvious and probably aren't going away (eg addi)
leaving so many slots as 'IMPLEMENTATION DEPENDENT' hurts interoperability, making the idea of Boot kind of useless as a language

these Cons aren't very convincing to me at all; they have obvious rebuttals: (a) the Ovm implementation can add these guys back and so we get the code length savings in Ovm code (b) again, as long as these instructions aren't REQUIRED, we can add them back in Ovm code (c) so what (d) yes but we've already done that now in this design, so for the actual Boot spec we can eliminate them (e) well we could leave a few in; but i'm not very confident that what i think is 'obvious' is actually what profiling will show is most needed for SHORT (f) well we could mark them as RESERVED, subject to being filled later when we decide at the Ovm later; or 'suggested but not required' eg so that if anything goes there, it can only be the suggested thing. Also, this probably isn't a near-term concern as i really doubt anyone but me will be looking at Boot until after i make Ovm, if ever. So i guess we should do it (it = eliminate all but the core instructions).

how does load/store multi work? do we load/save to/from consecutive registers (if so, we CAN'T loadmulti 256-bits, because we only have 16 regs total, and 8 of them are special)? or does a greater bitwidth fit into one register? if the latter, then does a greater bitwidth also fit into every memory location? but if so, then doesn't ordinary load also do tearfree loadmulti? does the extended arithmetic bitwidth also imply that this extended amount fits in every memory location (i think yes)?

should we make the third operand of load/store loadmulti, or leave it as displacement like it is now?

i guess load/store multi isnt really needed, since with CAS we can do that with locking?

load/savestk appears to not be needed, but i guess it is b/c we dont have a STACKLEN or STACKEMPTY.

i guess LIBRARY isnt really needed

should we make the third operand of load/store loadmulti, or leave it as displacement like it is now? leave it as displacement, because we want to peek/poke the first 16 items in any stack

Already checked vs. TinyRISC?-V. Missing: JALR (i don't think we need that) Vs. RISC-V compressed, missing: * scaled variants, sign- and zero- extending variants (i don't think we need these) already checked vs. atomics on risc-v, llvm, webassembly, clr (i think). Missing and or xor dec(or sub). i think we'll live. also, the is_lockfree stuff should be covered by our cpuinfo analog.

when doing absolute jumps (which we can do via a series of LOADI/LOADLOs and then JD, i think? or does JD take a native pointer? hmm, it probably does... so we probably need a separate JMP), is the jump address number of instructions from the beginning of the file, or to entry point n in the ob0 file format?

do we use .ob0 file format, or just RAW? simplicity suggests forgetting about .ob0 at this level; but i think we need the entry points for absolute jumps.

The alternatives to using a table of entry points for absolute jumps are:

allowing only absolute jumps to the first 4k (3 operands at 4-bits each) of program code -- since there is no other way to jump to the rest of the program, this effectively limits program size to 4k
having absolute jump destination operands be from 0-4k, but then multiplying this by 16 or 4096, so that program code can actually be up to 256k (if 16) or 224 (since in MEDIUM mode JMPi will have 3 operands of 8 bits each). Multiplying by 256 seems more restrictive and/or bad for code density than having just 4096 entry points (since you'll often have a lot of unused space/zeros in program code inbetween the 256-item-aligned instructions). Multiplying by 16 doesn't seem so bad, this might be viable; but having a 256k program size limit seems like too much to bear (although really it's just a MODULE size limit, right?).
a MK-PROGRAM-PTR instruction that takes an integer, which is an offset into the program in units of number-of-instructions, and converts it to a native pointer which can then be used with JD -- the problem is that since this offset is loaded from a register, so even if we specify that that register must have been immediately LOADI/LOADLO'd, now we have a dynamically computed offset into program code, so the compiler has to either prove that it was not actually dynamically computed but actually just now loaded, or it has to be able to resolve such offsets at runtime -- the latter is much worse, but even the former is annoying, possibly more annoying than dealing with the .ob0 file format

in any case, for programs that need more than 4k jump destinations, you'll have to split them into modules, and then how do you do INTER-MODULE jumps? I guess you could have a SWITCHMODULE instruction which loads another jumptable? hmm... yknow, i don't see anyone else doing this 'entry point' stuff. Maybe it's the wrong thing to do. Mb MK-PROGRAM-PTR (MKPC) is the way to go.

is it better to have in-place and/or, or accumulator (ie outputs to scratch) and/or? other ISAs seem to be doing in-place, so i guess that's better

yknow, we don't really want mlen to return u16, and we dont want to limit malloc to accept a u16. So mb replace u16 with a 'native' 'number' type that is guaranteed to be at least u16, but not guaranteed to be u16, so eg if you add 1 to 65535 it may or may not overflow or wrap to 0 done

can opaque values be in constant table, or are they limited to u16s? if u16s, then limited to 64k words of code because that's the biggest jump constant you can encode. That seems too limiting... done; they are loaded by implementation-dependent-limited-length sequences of LOADI/LOADLO

if limited to u16, should we even have a constant table? without a constant table, you could still have a jump table, but it would limit code size to 4k words, and also would be less space-efficient to store no, got rid of the constant table in favor of LOADI/LOADLOs

---

tomove to ovm level (already deleted from Boot):

.ob0 format

capability security
signal interrupts. in which case we should also have error handlers, and maybe MODESET...

SWITCH -- this would be like JREL but in register, instead of immediate, mode -- although in that case, the two scratch operands (8-bits) wouldn't be enough to hold the true offset, unless we limited the size of the value being switched upon (which seems fine..). Not 'essential', as i guess we can still build switch tables, much more inefficiently, with many compares and jumps (mb even JREL jumps?).

in sysinfo:

loadmulti/savemulti max. What is the max bitwidth that we can load/save multi? (0=16, 1=32, 2=64, 3=128, 4=256? or 15 = 256-bit (16*16-bit = 256 but 0 = 16-bit, not 0, right?))
which atomic ops are lockfree
how many CPU cores (or, VM OS processes) are available? can be bind our threads to CPU cores? which CPU core are we running on right now?

Our memory consistency model is release consistency. ... To do an acquire, use FENCE-ACQUIRE (see memory_order_acquire in http://wayback.archive.org/web/20170129132623/http://en.cppreference.com/w/cpp/atomic/memory_order#Constants ); to do a release, use FENCE-RELEASE (see memory_order_release in http://wayback.archive.org/web/20170129132623/http://en.cppreference.com/w/cpp/atomic/memory_order#Constants ); use FENCE-SEQ to have the effect of both a FENCE-ACQUIRE and a FENCE-RELEASE and in addition to ensure sequential consistency amongst all FENCE-SEQs in the same memory domain (see memory_order_seq_cst in http://wayback.archive.org/web/20170129132623/http://en.cppreference.com/w/cpp/atomic/memory_order#Constants )

3-operand instructions which were removed (this would fill up the 3-operand instruction opcodes): addi-u16: oi ii k (add a constant to a u16) addi-ptr: op ip k (add a constant to a pointer)

2-operand instructions which were removed (this would fill up the 2-operand instruction opcodes): xchng-atomic: pio io (value at memory location is atomically exchanged with value of register) library: k k (call to platform-specific library function identified by 8-bit constant (remember, each operand is only 4 bits); calling conventions are as with syscalls) add-atomic-u16 pio i (value at memory location is atomically replaced with the sum of its previous value and a register) add-atomic-ptr pio i

1-operand instructions which were removed (this would fill up the 1-operand instruction opcodes): fence-acquire: p (the memory location p is ordered by an acquire memory barrier) fence-release: p (the memory location p is ordered by a release memory barrier) loadmulti-atomic: k (pop pointer (pi) from DATASTACK, atomically load k items onto DATASTACK) storemulti-atomic: k (pop pointer (pi) from DATASTACK, atomically store k items from DATASTACK) addstk-u16: sm (stack version of add-u16) substk-u16 leqstk-u16 addstk-ptr: sm (stack version of add-ptr) stackif: sm (pop top three items in stack, if top item was 0, push second item, otherwise push third item; C's ternary conditional operator)

0-operand instructions which were removed (this would leave 4 spots in the 0-operand instructions): casstk: (stack version of CAS, on DATASTACK) sra1stk-u16: (TOS = arithmetic_shift_right(TOS by 1 bit)) modeset: (change the processor state in some way (eg set error handler); 2 arguments passed on stack) regload: (pop k from datastack?, pop saved k registers from CALLSTACK and restore k registers starting with 8) regsave: (pop k from datastack?, save registers greater than 7 and less than 7+k, and push to CALLSTACK; stored register format is opaque (but is it one or many entries on CALLSTACK?!? b/c we want to be able to manually manipulate CALLSTACK and even move saved registers elsewhere in memory, right?))

other ideas for instructions:

stack size and stack capacity queries?
REST-like CRUD and collection operations?
reset
mark register as unused
kill
array, string, map creation?
CSWAP (fredkin) and CCNOT (Toffoli)? Note that CCNOT can do AND in two instructions (load 0 into a register x, then do CCNOT(A,B,x), where x is the output), and, given a zero register, if you make 0 swap instead of 1, CSWAP generalizes SWAP, and also can do OR in two instructions (LOADI 1 into a register x then do CSWAP(A, B, x), where x is the output). Of course, CSWAP and CCNOT can also be defined by and+or in a similar fashion
stack addition/subtract/mul?
a way to check capability access beforehand without causing an ACCESS DENIED error?
readinfo remove
writeinfo remove dup mount
exec wait
getpid(getmypid) kill
pipe signal stuff_to_setup_signal_handlers
cpuid/uname/myinfo/version
mv mkdir 12. info: pi (finfo/stat) 13. query: pi (ls/query) 6. mcalloc: po i (allocate memory and zero it (like C's calloc))
don't need 'sleep' b/c we have 'poll' which has a timeout
done?: create new code segment, and/or transform data memory to code memory, and/or transform code memory to data memory (and possibly, create new code segment with constant table)
done?: loadk, loadj (load from constant table/jump table; if we have these, we need to have a way to set these tables, right? b/c we don't have a LOADMODULE in Boot)
calli: o i pc (call indirect) (should calli's arguments be numbers of parameters and return values passed on the stack, instead of actually being parameters and return argument values?) mkcode pi ?or mb more generally, ?chmod_pointer? for capability stuff? but mb we need mkcode so that we have a way to specify the constant table with the code? rroll: sm i (rotate the top n elements on a stack in the other direction) cnot: io io (CNOT gate; flip the contents of register op0 iff the contents of register op1 is 1) not: oi ii (logical not) (we don't need: just do SUB 1 - x) (do we need NEG? no, just do SUB 0 - x) loadd: o i si (load from memory to register, dynamic offset) (this would probably be frequently used, but is not essential, as you could just do an ADD-PTR followed by a LOAD) stored: so i i (store from register to memory, dynamic offset) mlen: si (current depth of stack or length of allocated memory (should those be combined?)) atomic add, and, or, xor, max, min exec addstk-u16 sm substk-u16 sm leqstk-u16 sm addstk-ptr sm stackisempty: o (is the stack empty? should probably be 0-operand. Useful if we want to enumerate the stack for debugging without doing a stacksave. Not sure if we should keep this one; b/c it will be rare and you can always just do a stacksave, if we make the stacksave format non-opaque) loadk: k k k (load from 12-bit-indexed constant table, and push onto DATASTACK) switch (jrel with a register input) swap: io io (swap the contents of two registers) (note: 'roll STACK 0' can be used to swap the top 2 elements in a stack) andi-u16: io k (bitwise AND: op0 = op0 AND constant) (note: is this worth it? remember that constant can only be 0-15. would stackandi-u16 k k be better?) 5. capunpack: (from the capability on top of CAP, generate a struct describing it. there is no 'cappack' because you can't edit capabilities) 6. capalloc: (allocate new capability stack (input on TOS: size to allocate); returns the new stack pointer on data stack. Deallocate with ordinary mdealloc)
other stuff from RISC-V, webassembly, Lua, JVM, CLR, LLVM, LuaJit?2 ( http://wiki.luajit.org/Bytecode-2.0 and http://wiki.luajit.org/SSA-IR-2.0 ), WebDav? to see what we are missing. I guess i sorta checked vs. webm, llvm, lua, jvm already. mmap a file

note: REGSAVE and REGLOAD could just always push/pop from CODESTACK, in which case they could both be one-operand and could take a k argument for number of registers)

also add a calling convention. see next section.

calling convention thoughts

idea for HLL calls:

Suggested calling convention

All registers are volatile/caller-save, that is to say, the callee is permitted to modify any register and does not need to preserve any of them.

If success or failure is to be indicated upon return, then upon return, ERR has been set to a uint: 0 for success, or an error code otherwise. {zwsp}

note: Golang passes arguments on the stack, and they are pretty efficient, so if it's good enough for them..:

All registers are caller-saved
All parameters are passed on the stack
Return values are also returned on the stack, in space reserved below (stack-wise; higher addresses on amd64) the arguments. " -- [1]

[2] [3] [4]

this is 5%-10% less efficient than using some registers [5], and even less efficient when calling external functions, because it doesn't match the (x86 and most others) platform ABI [6]

https://gist.github.com/dr2chase/5a1107998024c76de22e122ed836562d https://github.com/golang/go/issues/18597 is a Go proposal to use some registers. Some arguments against are:

makes tracebacks harder
makes GC harder
something that i don't understand about 'improved inlining'

in Boot's case, saying 'use platform ABI' probably isn't a good option, because we are pretty divorced from the underlying platform at its ABI level.

Using some registers instead of stack may be a good idea in some circumstances. Although, SMALLSTACK is probably in cache and so may not be as costly as using DATASTACK. Seems to me that the best thing would be to use some registers for 'low-level' function calls (like the way that a 'link register' can be used for single-level calls, but a callstack is better for many-leveled calls), where tracebacks and GC aren't happening anyways (eg inside the OVM implementation and maybe even somewhat higher up), and use 'all stack' at higher levels. at this moment i can't come up with a hard rule for defining what is sufficently 'high-level' so as to mandate 'all stack', and such a definition would be needed to give a well defined 'ABI'. i guess one definition is 'calls which don't provide either unified GC nor unified tracebacks across them'

for the low-level calls, we have 8 GPRs to work with, plus possibly the TOS of smallstack, which is aliased to a register. Calling conventions often have some mixture of link register, return register, caller-saves, callee-saves, and argument passing. Seems to me that it would not be crazy to do:

2 GPRs for caller-saves
2 GPRs for argument-passing, and return value passing (note that these are a special case of 'caller-saves', in that the callee is free to overwrite them)
4 GPRs for callee-saves

actually let's combine caller-saves and argument-passing (ForwardCom? does that)

btw earlier i thought we'd have a REGSAVE command which, since this is usually a VM running on a runtime, just swaps the pointer to the registers, making it not matter how many caller-saves there are. But i realized that this doesn't work, because we don't want to incur the overhead of allocating space for the new register file instance, unless we are allocating it on the stack, in which case why not just have the calling convention be to pass everything on the stack? So, the overhead does matter (but a REGSAVE would still be helpful because the VM implementation can probably do it faster).

what is optimal? Agner Fog says it hasn't been studied, and opines that 50%-66% caller-saved might be good for systems with few registers, and somewhat less for systems with more registers (in Fog's own ISA proposal, ForwardCom?, which has 32 registers and 32 vector registers, half of the registers are caller-saved, and half are callee [7]; arguments are passed in the caller-saved registers, and return values are returned in those also; note i think it is typical to consider argument-passing regs as a subtype of caller-saved [8] [9]; ForwardCom? does not use a link register, for reasons well explained in the manual, section 6.2 Implementation of call stack subsection Link Register page 88 PDF page 89; a cursory look at the wikipedia Calling Conventions page shows that some other architectures go approx. 50/50 as well):

" How many registers should be callee - save? I have never seen a study of the optimal ratio of caller - save to callee - save registe rs. Scratch registers are preferred for temporary values that do not have to be saved across a function call. Functions that do not call any other functions (leaf functions) and functions that have a low probability of calling other functions (effective le af functions) will prefer to use scratch registers. If a function has more than one call to other functions or calls another function inside a loop, and if it needs to store values of local variables across these function calls, then the function becomes s impler by using callee - save registers. If the called functions need to use the same registers, then there is no advantage in speed, but possibly in size. If the called functions can use other registers, then there is an advantage in speed as well. Since le af functions are the most likely ones to be speed - critical, it is reasonable to have as many scratch registers as are typically needed in a leaf function. Functions that call other functions, on the other hand, are likely to have more variables and thus ne ed more registers. Balancing these considerations, I would expect the optimal fraction of scratch registers to be between a half and two thirds for architectures that have few registers, and somewhat lower if there are plenty of registers. Some compilers have capabilities for whole - program - optimization, and we can expect such features to become more common in the future. If the compiler has information about the register needs of both caller and callee at the same time, then it can allocate different regis ters to the two functions so that no registers need to be saved. In this case, the optimal solution is to define callee - save registers only for system functions, device drivers and library functions. " -- http://www.agner.org/optimize/calling_conventions.pdf

(and what about who cleans up the stack when stuff is passed on the stack? [10] suggests that both caller- and callee-cleanup happen; ForwardCom? dodges the question by passing a pointer to an argument list in memory (which, i guess, is sort of like caller-cleanup, however); and the table 5 in chapter 7 (function calling conventions) of http://www.agner.org/optimize/calling_conventions.pdf suggests that both were used in the past but all of C and the 64-bit ABIs use caller-. so we should probably use caller-).

so in summary what i'm proposing is two conventions:

low-level, ie when there is not either of (a unified GC, or a unified traceback) unified across both sides of the call, and high-level (where there is either a unified GC or a unified traceback):

low-level:

r8-r11 callee-saves
r12-r15 for argument-passing (and the callee is free to overwrite these; so these are caller-saves, also called 'volatile', or 'scratch'), and return value passing
if more than 4, other parameters and/or return values are passed on DATASTACK.

high-level:

All registers are volatile/caller-save, that is to say, the callee is permitted to modify any register and does not need to preserve any of them.
Parameters and/or return values are passed on DATASTACK.
Return values are returned on the stack in space reserved below (stack-wise; actually higher addresses if stack grows downwards) the arguments.

both:

If success or failure is to be indicated upon return, then upon return, ERR has been set to a uint: 0 for success, or an error code otherwise.
for any parameters passed on stack, Caller-cleanup, that is, the callee does not pop parameters passed to it on the stack even if it doesn't need them anymore; the caller does that.

---

i took this out because we don't have entry points anymore (also, would a debugger written in Boot that allows the user to manipulate the hidden stack really be illegal?):

The compiler must be able to resolve at compile time the stack offset of each access of the hidden stack, relative to the smallstack pointer at the beginning of function/code segment.

---

interesting idea from forwardcom:

"One operand of each instruction can be a register, a memory operand with different addressing modes, or an immediate constant. The other operands must be registers."

---

http://www.latticesemi.com/~/media/LatticeSemi/Documents/DataSheets/iCE/iCE40LPHXFamilyDataSheet.pdf says that this series has "16 RAM4K Memory Blocks", the 4k is 4 kilobits. It further appears to say that these 4k bits are organized into 256 16-bit words. So it has 4k 16-bit words of memory, in 16 256-word chunks.

---

two fundamental composite data types:

structs
arrays

---

the 4th addr mode bit can be 'displaced' addressing mode; when set, the BOXED bit still does its thing, but the other 2 bits specify a displacement from +1 to +4 (no +0 because ordinary indirect mode already does that)

the other empty addr mode (11) could be 'dynamic displace', which uses a special register for displacement (7? right now that is TOS. 8?)

---

so we have 1 addr mode bit left (2 of them specify 1 of 4 basic addr modes, immediate, register, register indirect, and RESERVED; the other one specified BOXED or unboxed).

when this bit is set, it could force us into REGISTER INDIRECT mode regardless of the other bits. The BOXED bit still does its thing, but the other 2 bits are now used to specify a displacement from +1 to +4 (no +0 because ordinary indirect mode already does that)

or, it could force us into IMMEDIATE mode (or boxed immediate, which is constant table), but give us two more data bits, so 1024 immediates (and a constant table of size 1024) instead of just 256.

i guess i think that displacement in register indirect mode is more useful.

alternately, this bit could be used for custom addressing modes.

---

alternately, this bit could be used to load/store >16-bit words into/from registers on implementations that support larger words in registers, yet have 16-bit memory locations (32, 64, 128, 256 bit). This doesn't seem very consistent with our architecture, though; in our architecture, if registers are larger, than so are words in memory.

---

i guess the 4th addr mode could still be STACK?

i can see two ways to do it:

1) if op0 is STACK, then the effective address contains a constant if read, and at the end of the instruction the effective value is PUSHed onto the indicated stack. if op1 or op2 are STACK, then at the beginnin of the instruction they are POPed from the indicated stack, in order of op2 and then op1. At the end of the instruction any updates to them are lost. The motivation for reading a constant in op0 is that this is really meant for PUSHing, so we don't want to engage memory, and we don't want to do different things for different instructions

2) different instruction classes determined by opcode. The primary class is like (1) (out, in, in signature). But there is another class in which each of op2, op1, op0 are POPed before the instruction, then PUSHED afterwards (inout, inout, inout signature).

recall that in our previous OVM attempt, the six signatures were:

() (ANNOTATE only) (out, imm2) (LOADI) (in, imm2) (JZREL, etc) (out, in, in) (typical) (in, in, in) (JI, although it doesn't use the third argument, so i guess we could have used out, in, in, and set out=r0) (inout, inout, inout) (SWAP, although it doesn't use the third argument)

so if we do this, we should try to partition opcode space into these sigs (or even fewer classes). Which may impact Boot, because we (maybe?) want TWO, ONE, ZERO to fit into this (although it wouldn't be too bad to just have TWO, ONE, ZERO be exceptions). The () and imm2s could also be exceptions (just don't use STACK mode on these, or rather, treat them as out,in,in if STACK mode is used on op0); also, ordinary instructions could combine the two 'in's if they want). And we could probably get away without making STACK mode work well on the op0 of the (in,in,in)s.

i guess STACKIF would have to be:

(inout, in, in)

although we could also just have a separate STACKIF instruction rather than a TRINARYIF using stack mode.

so i think the way to go is probably to have two instruction classes, as far as STACK addressing mode is concerned:

(out, in, in) (inout, inout, inout)

---

in MEDIUM, do we really need to distinguish polymorphic and non-polymorphic opcodes? the only two places we used that in our previous attempts were:

if not a polymorphic opcode, we don't even process the type operand field
if is a polymorphic opcode, and if the type is not PRIMITIVE_TYPE_DYN, then we lookup the implementation of the instruction by type (via opcode = self.polymorphic_instruction_registration[opcode, typearg])

so, in that system, each polymorphic instruction either resolves dynamically (and it's up to the instruction itself to deal with that), or it is dynamically replaced by another opcode that is not polymorphic.

eh, i guess that's pretty good actually. But then how many of our 256 opcodes do we reserve for polymorphism? we already burned 64 for the MEDIUM opcodes corresponding to Boot/SHORT opcodes, and surely now that we have signed and floating point arithmetic of different lengths, and division, and more comparisons, we'll have a bunch more (add, sub, mul, div, mod, neg, leq, eq) * (signed, unsigned, float) * (16, 32, 64) = 8 * 3 * 3 = 72; so that's already over 128 opcodes in total.

now, what we could do is say, we don't care if code can specify a non-polymorphic variant in opcode alone -- so we can give these non-polymorphic functions opcodes above 255 (between 256 and 65535). The problem with that is that now we need 64k of words of memory just to hold the table mapping opcodes to implementations. The compromise here would probably be to limit it to opcodes between 256 and 4k (12-bit opcodes).

so far i think maybe i'll stick with 0-127 is non-polymorphic, 128-255 is polymorphic (but what about 256-65535?)

---

so if the opcode identifies both polymorphicity and signature-for-stack-addressing, how do we do that?

Maybe:

0-127 is non-poly 128-255 is poly

0-55 is out, in, in 56-63 is inout, inout, inout 128-247 is out,in,in 248-255 is inout, inout, inout

---

for a few minutes i felt we should add back in loadmulti and storemulti right now to Boot. But actually the thing is, some platforms only offer atomic load/store up to their pointer bitwidth. And on naive implementations, each Boot memory location is already the full pointer bitwidth. So it would be a pain for the implementation to implement loadmulti and storemulti for any larger bitwidth.

---

i woke up from a dream one day feeling like some sort of 16bit instruction is missing. debug-like, a cursor.

---

i've been feeling that the FORKDONE (fork's join) is too special-case to occupy a whole Boot opcode.

FORKDONE could be replaced by channels that communicate changes in process state, but I don't want to create an operating system with regards to the special file path structure for process channels. So perhaps just generalize Fork done

Perhaps we can generalize it somehow to binary operations over processes, of which 'join' is just one of many?

And/or to all sorts of synchronization, eg synchronization barriers, of which 'join' seems to be one type?

And/or, start and ending of transactions

And/or, change-mode on some sort of control graph

---

" System programming languages differ from assembly languages in two ways: they are higher level and they are strongly typed. The term "higher level" means that many details are handled automatically so that programmers can write less code to get the same job done. For example:

    Register allocation is handled by the compiler so that programmers need not write code to move information between registers and memory.
    Procedure calling sequences are generated automatically: programmers need not worry about moving arguments to and from the call stack.
    Programmers can use simple keywords such as while and if for control structures; the compiler generates all the detailed instructions to implement the control structures."

---

Pillar

In some ways similar to C--. Used by the Functional Language Research Compiler (FLRC) / Haskell Research Compiler (HRC).

Links:

FLRC / HRC's MIL

The Functional Language Research Compiler / Haskell Research Compiler has an IL called 'MIL'.

Links:

---

proj-oot-ootAssemblyNotes16

Title

Cheatsheet

Instruction encoding

Data types

Registers

0. 0 register

1. PC

2. ERR

Instructions

Three-operand instructions:

Two-operand instructions:

One-operand instructions:

Calling convention

Error codes

RESERVED

Semantics

Concurrency semantics

CAP register

Access privileges

16-bit arithmetic

Appendix: the .ob0 file format

Appendix: design choices discussion

CAP register

Lack of safety and security

Undefined values for arithmetic over 16-bits

todo

Suggested calling convention

Pillar

FLRC / HRC's MIL