proj-oot-ootAssemblyNotes16




The contents of boot_reference as of 2017/03/27, just before a cleanup/removal of old-now-incorrect stuff/decision to remove 'optional' instructions:

Title

:hardbreaks:

Version: unreleased/in-progress/under development. Working towards development version 0.0.0-0.

todo: i said this was RISC-y, but: + +

Cheatsheet

Instruction encoding: 16 bits, little endian, from LSB to MSB, for fields each 4 bits long: opcode, operands 0, 1, 2. Instructions are packed into 64-bit aligned quads. The least-significant-bit of each quad must be 0. + + Datatypes: u16 (unsigned 16-bit integer), ptr (pointer). + + Registers: 16 registers. First 8 are special: + + (todo see below this is out of date)

0. 0 register 1. PC 2. ERR 3. SCRATCH 4. HIDDENSTACK 5. DATASTACK 6. CALLSTACK 7. reserved for future use + + Instructions: todo: this is slightly out of date, copy updated version from below:

0. annotate (also nop, when operands are zero) 1. loadi (load immediate) 2. load (load from memory to register) 3. store (store from register to memory) 4. cpy (copy from register to register (in other architectures this is usually called "MOV")) 5. skipz (skip if zero) 6. skipnz (skip if non-zero) 7. calli (call indirect) 8. pop 9. push 10. add-u16 (addition of u16s) 11. sub-u16 (subtraction of u16s) 12. leq-u16 (less-than-or-equal of u16s) 13. add-ptr (add a u16 to a pointer) 14. sub-ptr (subtract a u16 from a pointer) 15. vmcall (other vm functionality)

Instruction encoding

The instruction encoding is fixed-length 16-bits, and instructions are grouped into 64-bit 'quads'. + Encodings are little-endian, and bits are given from LSB to MSB. + + The encoding of each instruction is:

Data types

There are five defined data types:

Registers

There are 16 registers. The first 8 registers are special.

0. 0 register 1. PC 2. ERR (or STATUS?) 3. TOS (alias to top of datastack) 4. CAP (or MODE?) 5. HIDDENSTACK 6. DATASTACK 7. CALLSTACK + + There are three stacks (hiddenstack, datastack, callstack). The stack pointers are in scare quotes because Boot programs are not allowed to directly manipulate these; all stack manipulation must be accomplished via the PUSH and POP and LOAD and STORE instructions. The stacks may or may not be stored in ordinary memory, and the "pointers" may or may not be ordinary pointers. + + In more detail: + +

0. 0 register

Reading from the zero register always returns 0. Writing to the zero register has no effect. + +

1. PC

The program counter. This is a pointer to the next instruction to be executed. It is read-only. A Boot program is not allowed to directly read or write program memory, or dereference pointers into program memory. Program memory may or may not be stored in ordinary memory.

2. ERR

This is used to return error codes. An error code of "0" means no error (success).

It is also used to return carry and overflow results from addition and subtraction.

  1. # 3. CAP This stores a stack "pointer" to the capability stack.
    1. 4. HIDDENSTACK

A stack "pointer".

The capacity of the hiddenstack is 15 items.

The compiler must be able to resolve at compile time the stack offset of each access of the hidden stack, relative to the hiddenstack pointer at the beginning of function/code segment.

  1. # 5. DATASTACK

A stack "pointer".

  1. # 6. CALLSTACK

A stack "pointer".

  1. # 7. TOS Alias to the top item on DATASTACK.

Instructions

The 16 three-operand instruction opcodes and mnemonics and operand signatures are: 0. annotate: ? ? ? (this instruction is called BAD when all operands are zeros) 1. loadi: oi k k (load 8-bit u16 immediate) 2. loadlo: oi k k (load 8-bit immediate with multiply; multiply the destination register by 256 and then add the 8-bit immediate value. NOTE: this instruction is illegal except immediately following either a LOADI to the same destination register, or another LOADIM to the same destination register. NOTE: these allow you to load constant greater than 8 bits. The maximum constant size is platform-specific; a sequence of LOADI/LOADIM instructions that would create a larger constant than the platform supports is unsupported on that platform) 3. load: o k si (load from memory (plus constant offset) to register; atomic/tear-free) 4. store: so k i (store from register to memory (plus constant offset); atomic/tear-free) 5. TWO: ? ? k (two-operand instructions)) 6. jrel: 0 0 k (relative jump) 7. skipz: 0 0 ii (skip if zero) 8. skipnz: 0 0 ii (skip if non-zero) 9. add-u16: oi ii ii (addition of u16s) 10. addi-u16: oi ii k (add a constant to a u16) 11. sub-u16: oi ii ii (subtraction of u16s) 12. leq-u16: o ii ii (less-than-or-equal of u16s) 13. add-ptr: op ip ii (add a u16 to a pointer) 14. addi-ptr: op ip k (add a constant to a pointer) 15. cas-atomic: pio i i (compare-and-swap)

The 16 two-operand instruction opcodes and mnemonics and operand signatures are: 0. cpy: o i (copy from register to register (in other architectures this is usually called "MOV")) 1. or-u16: io i (bitwise OR: op0 = op0 OR op1) 2. and-u16: io i (bitwise AND: op0 = op0 AND op1) 3. xor-u16: io i (bitwise XOR: op0 = op0 XOR op1) 4. add-atomic-u16 pio i (value at memory location is atomically replaced with the sum of its previous value and a register) 5. add-atomic-ptr pio i 6. mul-u16: io i (multiplication: op0 = op0 MUL op1) 7. xchng-atomic: pio io (value at memory location is atomically exchanged with value of register) 8. picki: sm k (push a copy of the n-th item in a stack (zero-indexed) onto the top of the stack) (note: this can also do dup (with r0), over) 9. rolli: sm k (rotate the top n+2 elements on a stack) (note: this can also do swap (with k=0), rot) 10. slli-u16: io k (op0 = logical_shift_left(op0 by k bits)) 11. srli-u16: io k (op0 = logical_shift_right(op0 by k bits)) 12. library: k k (call to platform-specific library function identified by 8-bit constant (remember, each operand is only 4 bits); calling conventions are as with syscalls) 13. pop: o sim 14. push: som i 15. ONE: ? k (one-operand instructions)

The 16 one-operand instruction opcodes and mnemonics and operand signatures are: 0. jd: pc (dynamic jump) 1. stackload: sm (pop saved stack from DATASTACK, and load it) 2. stacksave: si (save stack, and push it to DATASTACK) 3. fence-acquire: p (the memory location p is ordered by an acquire memory barrier) 4. fence-release: p (the memory location p is ordered by a release memory barrier) 5. loadmulti-atomic: k (pop pointer (pi) from DATASTACK, atomically load k items onto DATASTACK) 6. storemulti-atomic: k (pop pointer (pi) from DATASTACK, atomically store k items from DATASTACK) 8. not-u16: io (bitwise NOT: op0 = NOT op0) 8. addstk-u16: sm (stack version of add-u16) 9. substk-u16 10. leqstk-u16 11. addstk-ptr: sm (stack version of add-ptr) 12. fence-seq: p (the memory location p is ordered by a sequential consistency memory barrier) 13. stackif: sm (pop top three items in stack, if top item was 0, push second item, otherwise push third item; C's ternary conditional operator) 14. ZERO: k (zero-operand misc operations) 15. syscall: k (zero-operand I/O and IPC)

The 16 zero-operand misc operations are: 1. 2. casstk: (stack version of CAS, on DATASTACK) 3. log: (log a message; loglevel k and then pointer to message is on top of DATASTACK) 4. mdealloc (deallocate memory (like C's free); ptr is on TOS) 5. 6. 7. 8. sra1stk-u16: (TOS = arithmetic_shift_right(TOS by 1 bit)) 9. modeset: (change the processor state in some way (eg set error handler); 2 arguments passed on stack) 10. malloc-shared: (allocate shared memory supporting atomic operations; amount of memory requested is popped from DATASTACK, result is pushed to DATASTACK) 11. mrealloc: (resize memory allocation (like C's realloc); pointer and new size is popped from top of DATASTACK) 12. malloc-local: (allocate local memory; amount of memory requested is popped from DATASTACK, result is pushed to DATASTACK) 13. regload: (pop k from datastack?, pop saved k registers from CALLSTACK and restore k registers starting with 8) 14. regsave: (pop k from datastack?, save registers greater than 7 and less than 7+k, and push to CALLSTACK; stored register format is opaque (but is it one or many entries on CALLSTACK?!? b/c we want to be able to manually manipulate CALLSTACK and even move saved registers elsewhere in memory, right?)) 15. break

The 16 zero-operand syscalls are:

0. sysinfo (does this take a 16-bit query argument on the stack, or just return all of the system info every time? probably the former) 1. delete 2. walk (like plan9's 9p's walk; todo is this actually different from ls/query, i forgot) 3. rand 4. sync/flush 5. seek 6. fork 7. forkdone (fork's join/wait) (todo can't we do this some other way, using channels, without burning an opcode?) 8. open 9. close 10. read 11. write 12. poll (todo: how exactly poll works) 13. time (takes a clock_spec on the stack, and returns one or more words on the stack according to the clock_spec; returns 0(s) and sets ERR condition if requested clock is not supported) 14. 15. halt (pop DATASTACK and terminate program execution, returning the popped value)

Many of the zero-operand instructions (esp. the syscall-y ones) also use DATASTACK to pass/return parameters.

todo:

maybe:

maybe (copied from ootLibrariesNotes6):

note: over and dup can be done with 'pick 1' and 'pick 0'

TODO see also those other lists of popular linux syscalls that i made, and my ideas regarding instructions for epoll, completion ports, etc

note: the reason for (malloc, mdealloc, open, close, read, write), is that they are fundamental and common. (walk) because it is fundamental. (poll sleep) because they are called a lot in tight loops and so you want them to be efficient.

Where the signature characters mean:

In addition, the mnemonic 'nop' refers to 'annotate' with all zero operands.

In the following, op0, op1, op2 are short for operand 0, operand 1, operand 2.

Three-operand instructions:

  1. ## 0. annotate: ? ? ? If all bits are zero, this is not ANNOTATE but rather BAD, an illegal instruction. This is to help halt execution if memory is accidentally executed, since some memory may contain zeros.

Used to embed metadata into code. Has no effect on execution; can be skipped by an interpreter or compiler.

  1. ## 1. LOADI (load immediate): oi k k

A number is formed by bitwise concatenating op1 and op2, with op1 being MSB (that is, by multiplying the value of op1 by 16, and then adding this to op2). This number is then treated as a u16 and stored in the register indicated by op0.

  1. ## 2. LOAD (load from memory to register): o k si

The contents of the register indicated by op2 is taken to be a pointer. The memory location indicated by this pointer, plus op1, is looked at, and the value found there is stored in the register indicated by op0.

Note: by giving a stack pointer in op2, LOAD can be used to peek at stack items at indices from 0 to 15 without POPing. Stack "pointers" are allowed in op2.

  1. ## 3. STORE (store from register to memory): so k i

The value in the register indicated by op2 is stored in (the memory location indicated by the contents of the register indicated by op0, plus op1).

Note: by giving a stack pointer in op2, LOAD can be used to overwrite stack items at indices from 0 to 15 without PUSHing. Stack "pointers" are allowed in op2.

  1. ## 4. SKIPZ (skip if zero): 0 0 ii

If the value in the register indicated by op1 is zero, then the PC is incremented by 1.

Operands 0 and 1 must be zero; skipz instructions with non-zero operands 0 or 1 are reserved for the internal use of the implementation, and are syntax errors if found in Boot program code.

  1. ## 5. SKIPNZ (skip if non-zero): 0 0 ii

If the value in the register indicated by op1 is not zero, then the PC is incremented by 1.

Operands 0 and 1 must be zero; skipnz instructions with non-zero operands 0 or 1 are reserved for the internal use of the implementation, and are syntax errors if found in Boot program code.

  1. ## 6. JREL (relative jump): 0 0 k

If k > 7, increment the PC by k-7. Otherwise, increment the PC by k - 9.

  1. ## 7. two (two-operand instructions): ? ? k

The two-operand instruction identified by op2 is called with operands op0 and op1. See section below for a list of two-operand instructions.

  1. ## 8. POP: o k sim

The n-th item on the stack specified by op2 is popped (reading the value, then reducing the stack size by 1 and moving each item above the n-th down by 1), where n is op1. The value that was popped is placed into the register specified by op0.

  1. ## 9. PUSH: som k i

The value in the register specified by op2 is pushed into the stack at the n-th index, where n is given by op1 (increasing the stack size by 1 and moving each item above the n-th up by 1, then writing the value).

  1. ## 10. ADD-U16 (addition of u16s): oi ii ii op0 := op1 + op2 + ERR, where ERR is either 0 or 1

if the result is larger than op0 can hold, op0 will be set to the result mod its op0's maximum value, and ERR will be set to 1; otherwise ERR will be set to 0

  1. ## 11. SUB-U16 (subtraction of u16s): oi ii ii op0 := op1 - op2
      1. 12. LEQ-U16 (less-than-or-equal of u16s): oi ii ii op0 := 1 if op1 <= op2; otherwise op0 := 0
      2. 13. ADD-PTR (add a u16 to a pointer): op ip ii op0 := op1 + op2
      3. 14. SUB-PTR (subtract a u16 from a pointer): op ip ii op0 := op1 - op2
      4. 15. CALL (call indirect): o i pc

todo: this instruction was deleted, i think

The current PC is PUSHed onto CALLSTACK. The contents of the register specified by op1 is PUSHed onto DATASTACK. The PC is set to the contents of the register specified by op2, and execution continues until a RET is encountered when the TOS of CALLSTACK is equal to the next instruction. At that time, the register specified by op0 will be set to the return value of the RET, and execution resumes.

See also RET.

Two-operand instructions:

todo: some of the numbers below are wrong

  1. ## 0. CPY (copy from register to register (in other architectures this is usually called "MOV")): o i

The value in the register indicated by op2 is copied into in the register indicated by op0.

  1. ## 1. NOT (logical not): o i
      1. 2. SWAP (swap the contents of two registers): io io

The values in the registers specified by op0 and op1 are swapped.

  1. ## 3. MALLOC (allocate memory): po i

Memory whose size is the value in the register indicated by op1 is allocated. Note: each location in memory must be able to hold at least a 16-bit word (but could hold more, depending on the implementation). So the units of the amount of memory requested is not bytes. If the allocation is successful, a pointer to the new memory is placed into the register indicated by op0. If the requested amount of memory is not available, ERR will be set to the code for OOM (out-of-memory) and the value in the register specified by op0 is undefined.

See also MDEALLOC.

  1. ## 14. syscall (platform-specific instructions): k

The k-th platform-specific subroutine is called. See section 'calling convention'.

  1. ## 15. ONE (one-operand instructions): ? k

The one-operand instruction identified by op1 is called with operand op0. See section below for a list of one-operand instructions.

One-operand instructions:

  1. ## 0. RET (return from call): o

CALLSTACK is POPped and the PC is set the value popped, and the value in the register indicated by op0 is returned.

See also CALLI.

  1. ## 1. MDEALLOC (free memory): pi The memory region found at the pointer in the register specified by op0 is deallocated.

If this pointer was not returned by a previous MALLOC, or if this memory region has already been deallocated (and has not been subsequently returned by another MALLOC), the result is undefined.

See also MALLOC.

  1. ## 15. HALT (terminate program execution): i

Program execution is terminated, and the value in the register indicated by op0 is returned.

  1. ## ?. CAS (compare-and-swap): pio i i Atomically, if the value of the memory location in the register op0 is equal to the contents of register op1, then set that memory location to the contents of register op2, and set register op1 to op2.
      1. ?. SYSINFO (query system information): one input, one output on stack The input determines which information is being queried. Currently defined:

0. Boot SYSINFO version. Returns 0. Note: this version number is distinct from the Boot specification version identifier (SdVer? version). 1. Capability overview. RESERVED. 2. SYSINFO max. The max SYSINFO query number supported by this implementation. 3. Word bitwidth supported for integer arithmetic. Returns 0 for 16-bits, 1 for 32-bits, 2 for 64-bits, 3 for 128-bits, 4 for 256-bits. Other values are RESERVED. 4. Effective number of CPUs. This does not have to be physical CPUs, but rather is used as just a hint as to how many threads are optimal for parallel processing in eg a workstealing scheduler. 5. loadmulti/savemulti max. What is the max bitwidth that we can load/save multi? 0=16, 1=32, 2=64, 3=128, 4=256

Other values for the input are RESERVED.

Calling convention

All registers are volatile/caller-save, that is to say, the callee is permitted to modify any register and does not need to preserve any of them.

Parameters and/or return values are passed on DATASTACK. Caller-cleanup, that is, the callee does not pop parameters passed to it on the stack even if it doesn't need them anymore; the caller does that.

If success or failure is to be indicated upon return, then upon return, ERR has been set to a u16: 0 for success, or an error code otherwise.

Error codes

... 0. success (no error) 1. carry (not an error) 2. resource unavailable (used by MALLOC and REALLOC for OOM) 3. access denied 4. unimplemented instruction/operation (eg, this is returned if TIME or CAS is encountered but not available on this platform) 5. invalid type 6. invalid argument

(do we have a separate error code for UNSUPPORTED CLOCK? UNSUPPORTED SEEK?)

30. BAD/illegal instruction encountered 31. internal error (unexpected internal error in the Boot implementation; this should cause a crash, so it should never be seen by Boot code) 32-63. RESERVED 64-127. implementation dependent 128-255. custom (reserved for user code)

?? todo: signal interrupts? ...

Note: 0 and 1 are not treated as "errors"; when ERR is occupied by either 0 or 1 there is no 'error' (so that these can be used for 'carry' and 'overflow' in arithmetic), otherwise there is.

RESERVED

Items marked RESERVED are not available for implementation-dependent use; they are reserved for future versions of Boot.

Semantics

todo, anything that we need to say here?

Concurrency semantics

In shared memory (memory allocated with malloc-shared), LOAD, STORE, and atomics (the operations labeled -atomic) (where supported by a particular implementation) are atomic. All implicit loads and stores of single Boot words (16-bit uints, or pointers, plus whatever 'OTHER' types the implementation provides) are tear-free/atomic. Nothing else in shared memory is guaranteed to be atomic. Some or all of the -atomic operations may not be available on all implementations. Some implementation may implement some atomics, or even all memory accesses to shared memory, via a lock at the granularity of the entire region allocated by one call to malloc-shared. Valid Boot programs must not share local memory (memory allocated with malloc-local), that is, they must not dereference a pointer to local memory which is owned by another process (local memory is owned by the process that allocated it).

Registers are never shared, hence single instructions that only do computations between registers (ie. that don't access main memory or I/O or IPC) are atomic.

Our memory consistency model is release consistency. In the absence of any FENCE instructions, atomic operations have only a relaxed memory ordering (see memory_order_relaxed in http://wayback.archive.org/web/20170129132623/http://en.cppreference.com/w/cpp/atomic/memory_order#Constants ). Synchronization is only effective within a memory domain (which is provided as an argument to malloc-shared); this is to accomodate NUMA and distributed architectures in which one process may have access to multiple distinct memories. To do an acquire, use FENCE-ACQUIRE (see memory_order_acquire in http://wayback.archive.org/web/20170129132623/http://en.cppreference.com/w/cpp/atomic/memory_order#Constants ); to do a release, use FENCE-RELEASE (see memory_order_release in http://wayback.archive.org/web/20170129132623/http://en.cppreference.com/w/cpp/atomic/memory_order#Constants ); use FENCE-SEQ to have the effect of both a FENCE-ACQUIRE and a FENCE-RELEASE and in addition to ensure sequential consistency amongst all FENCE-SEQs in the same memory domain (see memory_order_seq_cst in http://wayback.archive.org/web/20170129132623/http://en.cppreference.com/w/cpp/atomic/memory_order#Constants ). If you don't know what any of this means, then please always follow these guidelines:

CAP register

The semantics of the CAP register (including its access privileges) are implementation-dependent.

Access privileges

A process may freely read/write to registers (except for PC, which is read-only, and IMPREG, which may or may not be readable and/or writable, possibly conditionally). A process may freely read/write to memory that itself has MALLOC'd unless it dropped access privileges in an implementation-dependent fashion. Otherwise, access privileges are implementation-dependent. The implementation should use the ACCESS_DENIED error code to express when an operation was illegal due to insufficient privileges.

16-bit arithmetic

If a Boot program computes an arithmetic operation which, mathematically, should produce an integer value greater than 2^b - 1, where b is bitwidth indicated by the output of the SYSINFO call for the supported bitwidth, then the output value is an undefined value. Since b is guaranteed to be at least 16-bits, values up to 65535 are always safe.

Appendix: the .ob0 file format

A Boot implementation is not required to support the .ob0 file format; it is an optional, related technology.

All integers are little-endian.

TODO: remove 32k/64k file length limit in implementations TODO: increase # of bytes for function offset to 4 or 8 TODO: increase function length limit to 2^32 or 2^64, and increase # of bytes for function code length (or simply remove this) TODO: harmonize #function limit to 65535 in implementations TODO: add max_constant_length to implementations TODO: move variable-length function headers to the beginning of file, and list their offsets in the header at top (or just concatenate them at top and expect the implementation to read through all of those input and output types to get to the offsets). The end of the file is just a chunk of code (and other implementation-specific data).

TODO: above, we went to all this trouble (the LOADIM instruction) mostly just so that we wouldn't have to set a platform-independent limit the length of code, but here the maximum length of function code is very small, in fact in current implementation an ob0 file is limited to just 32767 bytes, and in asmdis to just 65536 bytes (todo: harmonize those). Shouldn't we make this larger, if not unbounded? At least 32- or 64-bit? otoh recall that the 64k limit makes sense if we are using ob0 files for custom instructions, which have a <64k limit (64/+-8 = 8k limit, i think, b/c of JREL's ability to skip up to 8 instructions)

NOTE: since the ob0 file format already has separate 'functions', do any instructions refer to these? eg is there a JUMP instruction to go to one of these functions? Can you JUMP into the middle of another function? Bring back 'custom instructions' in the form of a CALL2? NOTE: are the jump targets in code with reference to the beginning of the ob0 file, or with reference to the beginning of the current function? i'm thinking that the jump targets are relative to the whole file, you can jump wherever you like, there is no CALL2, and there is no special support for functions; the functions are just entry points for external use. Of course, certain types of use (eg custom instructions) can impose additional restrictions. In this case, why partition the file into code for functions? Well, (a) we have a fixed length file header but a variable-length per-function header (actually not though b/c the # of functions per-file is already variable), and (b) some usages may find it useful to partition the code into functions. However, here, the partitioning adds no extra information; a use-case which wants to partition memory can do so even if all of the functions are concatenated in the file, since it knows the entry points; and some users may want to read in all of the functional headers without the function bodies. So, actually, i added a TODO above to move everything to the beginning of the file and just leave a chunk of code at the end

TODO: if we DO want to stick with length limits, then should we change LOADIM to just LOADHIGH, permanently limiting Boot to 16-bit constants?

Appendix: design choices discussion

CAP register

Although its behavior is left open, the CAP register is intended (but not required) to be used for access control/security/privilages/capabilities (hence the name CAP).

Lack of safety and security

If the purpose of Boot is to serve as a substrate for and a subset of Ovm, and if Ovm is to be a safe language, then why don't we require bounds-checking at the Boot level? Why do we allow invalid pointer operations to have undefined effects? Because Boot will be used to execute the Ovm implementation, and for efficiency we will trust the Ovm runtime to be safe and correct. Ovm will be a safe language; Boot is not.

Note that this means that an Ovm implementation running on a Boot platform will not be able to simply identify parts of an untrusted program that it is running that contain only instructions in the Boot subset of Ovm, and blindly execute them directly on the Boot platform; it must prove safety or insert bounds-checks.

Undefined values for arithmetic over 16-bits

We allow the implementation freedom to produce arbitrary output values when the outputs 'should' overflow 16-bits because we want to allow Boot implementations to be able to simply use the native arithmetic operations of the underlying platform without inserting checks for overflow.

todo

some things our CPUINFO will probably need:

our MALLOC-SHARED needs to specify a 'memory domain' -- fences only synchronize within a memory domain

i dunno, maybe bring back CAP. Just make it read-only, so that the implementation can control writing to it, or something like that ('something like that' b/c you do want to write from it when copying from a CAPMALLOC'd memory location, and you want to do stack ops like ROT on it). It would be nice to know that running arbitrary Boot code won't subvert the capabilities register itself. Although i guess OotB? isn't exactly a strict extension of Boot if OotB? forbids some memory accesses due to capabilities?

i'm conflicted about whether or not we should just eliminate all of the non-essential instructions and leave them open for the Ovm level (make them implementation-dependent, i guess). I'm leaning towards eliminating them. Pros:

Cons:

these Cons aren't very convincing to me at all; they have obvious rebuttals: (a) the Ovm implementation can add these guys back and so we get the code length savings in Ovm code (b) again, as long as these instructions aren't REQUIRED, we can add them back in Ovm code (c) so what (d) yes but we've already done that now in this design, so for the actual Boot spec we can eliminate them (e) well we could leave a few in; but i'm not very confident that what i think is 'obvious' is actually what profiling will show is most needed for SHORT (f) well we could mark them as RESERVED, subject to being filled later when we decide at the Ovm later; or 'suggested but not required' eg so that if anything goes there, it can only be the suggested thing. Also, this probably isn't a near-term concern as i really doubt anyone but me will be looking at Boot until after i make Ovm, if ever. So i guess we should do it (it = eliminate all but the core instructions).

how does load/store multi work? do we load/save to/from consecutive registers (if so, we CAN'T loadmulti 256-bits, because we only have 16 regs total, and 8 of them are special)? or does a greater bitwidth fit into one register? if the latter, then does a greater bitwidth also fit into every memory location? but if so, then doesn't ordinary load also do tearfree loadmulti? does the extended arithmetic bitwidth also imply that this extended amount fits in every memory location (i think yes)?

should we make the third operand of load/store loadmulti, or leave it as displacement like it is now?

i guess load/store multi isnt really needed, since with CAS we can do that with locking?

load/savestk appears to not be needed, but i guess it is b/c we dont have a STACKLEN or STACKEMPTY.

i guess LIBRARY isnt really needed





should we make the third operand of load/store loadmulti, or leave it as displacement like it is now? leave it as displacement, because we want to peek/poke the first 16 items in any stack

Already checked vs. TinyRISC?-V. Missing: JALR (i don't think we need that) Vs. RISC-V compressed, missing: * scaled variants, sign- and zero- extending variants (i don't think we need these) already checked vs. atomics on risc-v, llvm, webassembly, clr (i think). Missing and or xor dec(or sub). i think we'll live. also, the is_lockfree stuff should be covered by our cpuinfo analog.

when doing absolute jumps (which we can do via a series of LOADI/LOADLOs and then JD, i think? or does JD take a native pointer? hmm, it probably does... so we probably need a separate JMP), is the jump address number of instructions from the beginning of the file, or to entry point n in the ob0 file format?

do we use .ob0 file format, or just RAW? simplicity suggests forgetting about .ob0 at this level; but i think we need the entry points for absolute jumps.

The alternatives to using a table of entry points for absolute jumps are:

in any case, for programs that need more than 4k jump destinations, you'll have to split them into modules, and then how do you do INTER-MODULE jumps? I guess you could have a SWITCHMODULE instruction which loads another jumptable? hmm... yknow, i don't see anyone else doing this 'entry point' stuff. Maybe it's the wrong thing to do. Mb MK-PROGRAM-PTR (MKPC) is the way to go.

is it better to have in-place and/or, or accumulator (ie outputs to scratch) and/or? other ISAs seem to be doing in-place, so i guess that's better

yknow, we don't really want mlen to return u16, and we dont want to limit malloc to accept a u16. So mb replace u16 with a 'native' 'number' type that is guaranteed to be at least u16, but not guaranteed to be u16, so eg if you add 1 to 65535 it may or may not overflow or wrap to 0 done

can opaque values be in constant table, or are they limited to u16s? if u16s, then limited to 64k words of code because that's the biggest jump constant you can encode. That seems too limiting... done; they are loaded by implementation-dependent-limited-length sequences of LOADI/LOADLO

if limited to u16, should we even have a constant table? without a constant table, you could still have a jump table, but it would limit code size to 4k words, and also would be less space-efficient to store no, got rid of the constant table in favor of LOADI/LOADLOs

---

---

tomove to ovm level (already deleted from Boot):

.ob0 format

SWITCH -- this would be like JREL but in register, instead of immediate, mode -- although in that case, the two scratch operands (8-bits) wouldn't be enough to hold the true offset, unless we limited the size of the value being switched upon (which seems fine..). Not 'essential', as i guess we can still build switch tables, much more inefficiently, with many compares and jumps (mb even JREL jumps?).

in sysinfo:

how does load/store multi work? do we load/save to/from consecutive registers (if so, we CAN'T loadmulti 256-bits, because we only have 16 regs total, and 8 of them are special)? or does a greater bitwidth fit into one register? if the latter, then does a greater bitwidth also fit into every memory location? but if so, then doesn't ordinary load also do tearfree loadmulti? does the extended arithmetic bitwidth also imply that this extended amount fits in every memory location (i think yes)?

Our memory consistency model is release consistency. ... To do an acquire, use FENCE-ACQUIRE (see memory_order_acquire in http://wayback.archive.org/web/20170129132623/http://en.cppreference.com/w/cpp/atomic/memory_order#Constants ); to do a release, use FENCE-RELEASE (see memory_order_release in http://wayback.archive.org/web/20170129132623/http://en.cppreference.com/w/cpp/atomic/memory_order#Constants ); use FENCE-SEQ to have the effect of both a FENCE-ACQUIRE and a FENCE-RELEASE and in addition to ensure sequential consistency amongst all FENCE-SEQs in the same memory domain (see memory_order_seq_cst in http://wayback.archive.org/web/20170129132623/http://en.cppreference.com/w/cpp/atomic/memory_order#Constants )

3-operand instructions which were removed (this would fill up the 3-operand instruction opcodes): addi-u16: oi ii k (add a constant to a u16) addi-ptr: op ip k (add a constant to a pointer)

2-operand instructions which were removed (this would fill up the 2-operand instruction opcodes): xchng-atomic: pio io (value at memory location is atomically exchanged with value of register) library: k k (call to platform-specific library function identified by 8-bit constant (remember, each operand is only 4 bits); calling conventions are as with syscalls) add-atomic-u16 pio i (value at memory location is atomically replaced with the sum of its previous value and a register) add-atomic-ptr pio i

1-operand instructions which were removed (this would fill up the 1-operand instruction opcodes): fence-acquire: p (the memory location p is ordered by an acquire memory barrier) fence-release: p (the memory location p is ordered by a release memory barrier) loadmulti-atomic: k (pop pointer (pi) from DATASTACK, atomically load k items onto DATASTACK) storemulti-atomic: k (pop pointer (pi) from DATASTACK, atomically store k items from DATASTACK) addstk-u16: sm (stack version of add-u16) substk-u16 leqstk-u16 addstk-ptr: sm (stack version of add-ptr) stackif: sm (pop top three items in stack, if top item was 0, push second item, otherwise push third item; C's ternary conditional operator)

0-operand instructions which were removed (this would leave 4 spots in the 0-operand instructions): casstk: (stack version of CAS, on DATASTACK) sra1stk-u16: (TOS = arithmetic_shift_right(TOS by 1 bit)) modeset: (change the processor state in some way (eg set error handler); 2 arguments passed on stack) regload: (pop k from datastack?, pop saved k registers from CALLSTACK and restore k registers starting with 8) regsave: (pop k from datastack?, save registers greater than 7 and less than 7+k, and push to CALLSTACK; stored register format is opaque (but is it one or many entries on CALLSTACK?!? b/c we want to be able to manually manipulate CALLSTACK and even move saved registers elsewhere in memory, right?))

other ideas for instructions:

note: REGSAVE and REGLOAD could just always push/pop from CODESTACK, in which case they could both be one-operand and could take a k argument for number of registers)

also add a calling convention. see next section.


calling convention thoughts

idea for HLL calls:

--

Suggested calling convention

All registers are volatile/caller-save, that is to say, the callee is permitted to modify any register and does not need to preserve any of them.

Parameters and/or return values are passed on DATASTACK. Caller-cleanup, that is, the callee does not pop parameters passed to it on the stack even if it doesn't need them anymore; the caller does that.

If success or failure is to be indicated upon return, then upon return, ERR has been set to a uint: 0 for success, or an error code otherwise. {zwsp}

--

note: Golang passes arguments on the stack, and they are pretty efficient, so if it's good enough for them..:

"

[2] [3] [4]

this is 5%-10% less efficient than using some registers [5], and even less efficient when calling external functions, because it doesn't match the (x86 and most others) platform ABI [6]

https://gist.github.com/dr2chase/5a1107998024c76de22e122ed836562d https://github.com/golang/go/issues/18597 is a Go proposal to use some registers. Some arguments against are:

in Boot's case, saying 'use platform ABI' probably isn't a good option, because we are pretty divorced from the underlying platform at its ABI level.

Using some registers instead of stack may be a good idea in some circumstances. Although, SMALLSTACK is probably in cache and so may not be as costly as using DATASTACK. Seems to me that the best thing would be to use some registers for 'low-level' function calls (like the way that a 'link register' can be used for single-level calls, but a callstack is better for many-leveled calls), where tracebacks and GC aren't happening anyways (eg inside the OVM implementation and maybe even somewhat higher up), and use 'all stack' at higher levels. at this moment i can't come up with a hard rule for defining what is sufficently 'high-level' so as to mandate 'all stack', and such a definition would be needed to give a well defined 'ABI'. i guess one definition is 'calls which don't provide either unified GC nor unified tracebacks across them'

for the low-level calls, we have 8 GPRs to work with, plus possibly the TOS of smallstack, which is aliased to a register. Calling conventions often have some mixture of link register, return register, caller-saves, callee-saves, and argument passing. Seems to me that it would not be crazy to do:

actually let's combine caller-saves and argument-passing (ForwardCom? does that)

btw earlier i thought we'd have a REGSAVE command which, since this is usually a VM running on a runtime, just swaps the pointer to the registers, making it not matter how many caller-saves there are. But i realized that this doesn't work, because we don't want to incur the overhead of allocating space for the new register file instance, unless we are allocating it on the stack, in which case why not just have the calling convention be to pass everything on the stack? So, the overhead does matter (but a REGSAVE would still be helpful because the VM implementation can probably do it faster).

what is optimal? Agner Fog says it hasn't been studied, and opines that 50%-66% caller-saved might be good for systems with few registers, and somewhat less for systems with more registers (in Fog's own ISA proposal, ForwardCom?, which has 32 registers and 32 vector registers, half of the registers are caller-saved, and half are callee [7]; arguments are passed in the caller-saved registers, and return values are returned in those also; note i think it is typical to consider argument-passing regs as a subtype of caller-saved [8] [9]; ForwardCom? does not use a link register, for reasons well explained in the manual, section 6.2 Implementation of call stack subsection Link Register page 88 PDF page 89; a cursory look at the wikipedia Calling Conventions page shows that some other architectures go approx. 50/50 as well):

" How many registers should be callee - save? I have never seen a study of the optimal ratio of caller - save to callee - save registe rs. Scratch registers are preferred for temporary values that do not have to be saved across a function call. Functions that do not call any other functions (leaf functions) and functions that have a low probability of calling other functions (effective le af functions) will prefer to use scratch registers. If a function has more than one call to other functions or calls another function inside a loop, and if it needs to store values of local variables across these function calls, then the function becomes s impler by using callee - save registers. If the called functions need to use the same registers, then there is no advantage in speed, but possibly in size. If the called functions can use other registers, then there is an advantage in speed as well. Since le af functions are the most likely ones to be speed - critical, it is reasonable to have as many scratch registers as are typically needed in a leaf function. Functions that call other functions, on the other hand, are likely to have more variables and thus ne ed more registers. Balancing these considerations, I would expect the optimal fraction of scratch registers to be between a half and two thirds for architectures that have few registers, and somewhat lower if there are plenty of registers. Some compilers have capabilities for whole - program - optimization, and we can expect such features to become more common in the future. If the compiler has information about the register needs of both caller and callee at the same time, then it can allocate different regis ters to the two functions so that no registers need to be saved. In this case, the optimal solution is to define callee - save registers only for system functions, device drivers and library functions. " -- http://www.agner.org/optimize/calling_conventions.pdf

(and what about who cleans up the stack when stuff is passed on the stack? [10] suggests that both caller- and callee-cleanup happen; ForwardCom? dodges the question by passing a pointer to an argument list in memory (which, i guess, is sort of like caller-cleanup, however); and the table 5 in chapter 7 (function calling conventions) of http://www.agner.org/optimize/calling_conventions.pdf suggests that both were used in the past but all of C and the 64-bit ABIs use caller-. so we should probably use caller-).

so in summary what i'm proposing is two conventions:

low-level, ie when there is not either of (a unified GC, or a unified traceback) unified across both sides of the call, and high-level (where there is either a unified GC or a unified traceback):

low-level:

high-level:

both:

---

i took this out because we don't have entry points anymore (also, would a debugger written in Boot that allows the user to manipulate the hidden stack really be illegal?):

The compiler must be able to resolve at compile time the stack offset of each access of the hidden stack, relative to the smallstack pointer at the beginning of function/code segment.

---

interesting idea from forwardcom:

"One operand of each instruction can be a register, a memory operand with different addressing modes, or an immediate constant. The other operands must be registers."

---

http://www.latticesemi.com/~/media/LatticeSemi/Documents/DataSheets/iCE/iCE40LPHXFamilyDataSheet.pdf says that this series has "16 RAM4K Memory Blocks", the 4k is 4 kilobits. It further appears to say that these 4k bits are organized into 256 16-bit words. So it has 4k 16-bit words of memory, in 16 256-word chunks.

---

two fundamental composite data types:

---

the 4th addr mode bit can be 'displaced' addressing mode; when set, the BOXED bit still does its thing, but the other 2 bits specify a displacement from +1 to +4 (no +0 because ordinary indirect mode already does that)

the other empty addr mode (11) could be 'dynamic displace', which uses a special register for displacement (7? right now that is TOS. 8?)

---

so we have 1 addr mode bit left (2 of them specify 1 of 4 basic addr modes, immediate, register, register indirect, and RESERVED; the other one specified BOXED or unboxed).

when this bit is set, it could force us into REGISTER INDIRECT mode regardless of the other bits. The BOXED bit still does its thing, but the other 2 bits are now used to specify a displacement from +1 to +4 (no +0 because ordinary indirect mode already does that)

or, it could force us into IMMEDIATE mode (or boxed immediate, which is constant table), but give us two more data bits, so 1024 immediates (and a constant table of size 1024) instead of just 256.

i guess i think that displacement in register indirect mode is more useful.

alternately, this bit could be used for custom addressing modes.

---

alternately, this bit could be used to load/store >16-bit words into/from registers on implementations that support larger words in registers, yet have 16-bit memory locations (32, 64, 128, 256 bit). This doesn't seem very consistent with our architecture, though; in our architecture, if registers are larger, than so are words in memory.

---

i guess the 4th addr mode could still be STACK?

i can see two ways to do it:

1) if op0 is STACK, then the effective address contains a constant if read, and at the end of the instruction the effective value is PUSHed onto the indicated stack. if op1 or op2 are STACK, then at the beginnin of the instruction they are POPed from the indicated stack, in order of op2 and then op1. At the end of the instruction any updates to them are lost. The motivation for reading a constant in op0 is that this is really meant for PUSHing, so we don't want to engage memory, and we don't want to do different things for different instructions

2) different instruction classes determined by opcode. The primary class is like (1) (out, in, in signature). But there is another class in which each of op2, op1, op0 are POPed before the instruction, then PUSHED afterwards (inout, inout, inout signature).

recall that in our previous OVM attempt, the six signatures were:

() (ANNOTATE only) (out, imm2) (LOADI) (in, imm2) (JZREL, etc) (out, in, in) (typical) (in, in, in) (JI, although it doesn't use the third argument, so i guess we could have used out, in, in, and set out=r0) (inout, inout, inout) (SWAP, although it doesn't use the third argument)

so if we do this, we should try to partition opcode space into these sigs (or even fewer classes). Which may impact Boot, because we (maybe?) want TWO, ONE, ZERO to fit into this (although it wouldn't be too bad to just have TWO, ONE, ZERO be exceptions). The () and imm2s could also be exceptions (just don't use STACK mode on these, or rather, treat them as out,in,in if STACK mode is used on op0); also, ordinary instructions could combine the two 'in's if they want). And we could probably get away without making STACK mode work well on the op0 of the (in,in,in)s.

i guess STACKIF would have to be:

(inout, in, in)

although we could also just have a separate STACKIF instruction rather than a TRINARYIF using stack mode.

so i think the way to go is probably to have two instruction classes, as far as STACK addressing mode is concerned:

(out, in, in) (inout, inout, inout)

---

in MEDIUM, do we really need to distinguish polymorphic and non-polymorphic opcodes? the only two places we used that in our previous attempts were:

so, in that system, each polymorphic instruction either resolves dynamically (and it's up to the instruction itself to deal with that), or it is dynamically replaced by another opcode that is not polymorphic.

eh, i guess that's pretty good actually. But then how many of our 256 opcodes do we reserve for polymorphism? we already burned 64 for the MEDIUM opcodes corresponding to Boot/SHORT opcodes, and surely now that we have signed and floating point arithmetic of different lengths, and division, and more comparisons, we'll have a bunch more (add, sub, mul, div, mod, neg, leq, eq) * (signed, unsigned, float) * (16, 32, 64) = 8 * 3 * 3 = 72; so that's already over 128 opcodes in total.

now, what we could do is say, we don't care if code can specify a non-polymorphic variant in opcode alone -- so we can give these non-polymorphic functions opcodes above 255 (between 256 and 65535). The problem with that is that now we need 64k of words of memory just to hold the table mapping opcodes to implementations. The compromise here would probably be to limit it to opcodes between 256 and 4k (12-bit opcodes).

so far i think maybe i'll stick with 0-127 is non-polymorphic, 128-255 is polymorphic (but what about 256-65535?)

---

so if the opcode identifies both polymorphicity and signature-for-stack-addressing, how do we do that?

Maybe:

0-127 is non-poly 128-255 is poly

0-55 is out, in, in 56-63 is inout, inout, inout 128-247 is out,in,in 248-255 is inout, inout, inout

---

for a few minutes i felt we should add back in loadmulti and storemulti right now to Boot. But actually the thing is, some platforms only offer atomic load/store up to their pointer bitwidth. And on naive implementations, each Boot memory location is already the full pointer bitwidth. So it would be a pain for the implementation to implement loadmulti and storemulti for any larger bitwidth.

---

i woke up from a dream one day feeling like some sort of 16bit instruction is missing. debug-like, a cursor.

---

i've been feeling that the FORKDONE (fork's join) is too special-case to occupy a whole Boot opcode.

FORKDONE could be replaced by channels that communicate changes in process state, but I don't want to create an operating system with regards to the special file path structure for process channels. So perhaps just generalize Fork done

Perhaps we can generalize it somehow to binary operations over processes, of which 'join' is just one of many?

And/or to all sorts of synchronization, eg synchronization barriers, of which 'join' seems to be one type?

And/or, start and ending of transactions

And/or, change-mode on some sort of control graph

---

" System programming languages differ from assembly languages in two ways: they are higher level and they are strongly typed. The term "higher level" means that many details are handled automatically so that programmers can write less code to get the same job done. For example:

    Register allocation is handled by the compiler so that programmers need not write code to move information between registers and memory.
    Procedure calling sequences are generated automatically: programmers need not worry about moving arguments to and from the call stack.
    Programmers can use simple keywords such as while and if for control structures; the compiler generates all the detailed instructions to implement the control structures."

---

Pillar

In some ways similar to C--. Used by the Functional Language Research Compiler (FLRC) / Haskell Research Compiler (HRC).

Links:

FLRC / HRC's MIL

The Functional Language Research Compiler / Haskell Research Compiler has an IL called 'MIL'.

Links:

---