proj-oot-old-ootAssemblyNotes21Old

should we specify a fixed calling convention for LIBRARY? Perhaps just have all arguments and all return values on MEMSTACK? But then LIBRARY is essentially variadic, and this means that, unless the implementation is told the signatures for all LIBRARY calls, it can't statically know the stack map for DATASTACK. Is this a problem? Does the JVM have two stacks (call stack, operand stack) or one? If two, do JVM stack maps cover the call stack, or just the operand stack? i think what the JVM does is have the 'operand stack' within each single stack frame in the call stack, so there is just one stack. And stack maps are for each stack frame. So, yes, the JVM stack maps do map the call stack frames. So, yes, a variadic LIBRARY would be a problem, or at least require there to be a way to tell any program verification toolchain what the signatures of various LIBRARY calls are. so what does the JVM do with JNI? i think it does need the signatures; see [1]

decision: no, leave that for further up the toolchain -- the spirit of Boot is not to make all sorts of restrictions for the sake of program analysis.

---

what i'm calling 'indirect indexed' is also called 'index-base' by some [2]

---

also we may need to add an 'absolute' addr mode -- one paper notes that this is essential for accessing global variables. BUT i think the 'constant' addr mode may work for this; or at the least, the constant addr mode can provide the base address for the global variable table, and we can add offsets.

also many other architectures allow displacements of 8 or 16 bits, which at first blush is much more than the 2 bits that we have. But, in LOADs and STOREs, which is where it probably matters most, we actually get 7 bits (4 + 3 addr mode bits, since this is a constant) in 32-bit mode, and 12 bits (8 + 4 addr mode bits) in 64-bit mode, so we're probably good.

---

mb we want predecrement instead of postincrement. If you think about it, in a performance-sensitive for loop, you want to count down from n to 0, because then within the loop you only have to compare against 0, and you can forget about n.

in the M6809 addr mode dynamic frequency ("Table IX-Dynamic indexed addressing statistics "), increment was much more common than decrement in some benchmarks.

"Table 16 - 68020 Addressing Mode Use" decrement was more common.

eh, i think we'll stick with increment. There are many times when you know the base address but not (without further computation) the end address of some data structure.

---

idea for 32-bit BootX? before taking out the -i16 instructions:

The 32 three-operand instruction opcodes and mnemonics and operand signatures are (note: in 32-bit instructions, constants ('c') are 7 bits, not 3 or 4 bits, because the addressing mode is treated as part of the constant):

0. bne-int: c ii ii (branch-if-not-equal on ints)
1. bne-ptr: c ip ip (branch-if-not-equal on pointers)
2. jrel: c c c (unconditional relative jump by a constant signed amount)
3. ldi: oi c c (load immediate 8-bit int)
4. ld: c o si (load from memory address plus unsigned constant)
5. st: c so i (store from register to memory address plus unsigned constant)
6. addi-int: c io ii (in-place addition of ints and immediate constant)
7. addi-ptr-int: c iop ii (in-place addition of ints and immediate constant to ptr)
8. ble-int: c ii ii (branch if less-than-or-equal on ints)
9. ble-ptr: c ip ip (branch if less-than-or-equal on pointers)
10. add-int: oi ii ii (addition of ints)
11. add-ptr-int: op ip ii (add a int to a pointer)
12. CAS
13. bne-i16
14. annotate: c c c (can be ignored)
15. bitor: io ii ii (bitwise OR)
16. bitand: io ii ii (bitwise AND)
17. bitxor: io ii ii (bitwise XOR)
18. sub-ptr: op ip ip (subtraction of pointers)
19. mul-int: oi ii ii (integer multiply)
20. sll: io c ii (shift left logical (multiplication by 2^c (mod MAX_INT+1)))
21. srl: io c ii (shift right logical (division by 2^c, rounding towards zero))
22. sra: io c ii (shift right arithmetic (division by 2^c, rounding towards negative infinity))
23. ble-i16
24. add-i16: io16 ii16 ii16 (integer addition of 16-bits)
25. mul-i16: io16 ii16 ii16 (integer multiplication of 16-bits)
26. add-f: iof if if (float addition)
27. mul-f: iof if if (float multiplication)
28. div-f: iof if if (float division)
29. bne-f: c if if (branch-if-not-equal on float)
30. ble-f: c if if (branch-if-less-than-or-equal on float)
31. instr-two: c ? ? (used to encode two-operand instructions)

The 16 two-operand instruction opcodes and mnemonics and operand signatures are:

0. cpy: o i (copy from register to register)
1. pop: o sim
2. push: som i
3. sysinfo (query system metadata): o c
4. neg: io ii (arithmetic negation)
5. bitnot: io ii (bitwise negation)
6. pickk: c sio (pick c on stack)
7. rollk: c sio (roll c on stack)
8. neg-f: iof if (arithmetic negation of float)
9.
10.
11. cvt-i16-i: oi ii16 (convert int16 to int (sign-extend?))
12. cvt-i-i16: oi16 ii (convert int to int16 (how do we deal with ints bigger than 2^15-1 or smaller than -2^15? is that a critical error or do we mandate saturation or something like that?))
13. neg-i16: oi16 ii16
14. fclass-f: oi if (classify a floating point number; see eg FCLASS in RISC-V, eg FCLASS.S in section 8.9)
15. instr-one: c ? (used to encode one-operand instructions)

The 16 one-operand instruction opcodes and mnemonics and operand signatures are:

0. jd: pc (dynamic jump)
1.
2.
3. cvt-f-i16: if (convert float to int16, pushing 4 entries onto SMALLSTACK)
4. cvt-i16-f: iof (convert int16 to float, popping 4 entries from SMALLSTACK) (is this round? floor? what do we do when the f is larger or smaller than the largest or smallest int16 -- is this a critical error or do we saturate or something else?)
5. coerce-f-i16: if (coerce float to int16, pushing 4 entries onto SMALLSTACK)
6. coerce-i16-f: iof (coerce int16 to float, popping 4 entries from SMALLSTACK)
7. malloc: op
8. mdealloc: ip
9. mrealloc: iop
10. peek-i16: ip (load i16 from external memory address, pushing onto SMALLSTACK)
11. poke-i16: iop (store i16 to external memory address, popping from SMALLSTACK)
12.
13.
14. syscall2
15. syscall: c (used to encode zero-operand misc instructions)

The 16 SYSCALL2 zero-operand instructions are:

0. exec
1.
2. get
3. put
4. spawn
5. pctrl (process control, eg join/wait, kill, etc -- or should these each be separate?)
6. time
7. rand
8. environ
9. getpid
10. signal (?? not sure if we want to do it this way -- signal handler setup)
11. create
12. delete
13.
14. divmod-i16: (on SMALLSTACK; consume 2 items and push quotient, then push remainder)
15. divmod-int: (on SMALLSTACK; consume 2 items and push quotient, then push remainder)

The 16 INSTR-ZERO zero-operand instructions are:

0. halt (terminate program execution)
1. break (mark breakpoint for a debugger)
2. fence-seq
3. peek-i8
4. poke-i8
5. devop
6. read
7. write
8. open
9. close
10. seek
11. flush
12. poll
13. cvt-int-codeptr: coerce int in T to ptr
14. log
15. library

---

in 32-bit format:

i'm thinking of breaking regularity and adding more instruction formats, with long immediates (or maybe just more LOADI-in-Boot-like instructions with concatenated arguments): a JMP with a large immediate jump target (PC relative or relative to program location zero?), and a LD/ST with a large immediate (absolute) address target (relative to start of global memory, or implementation-dependent?). Ppl say [3] that even 12-bit immediates are considered too narrow; in 32-bit format we'd have 14 bits this way. We can get rid of BNE-F and BLE-F and replace them with an FCMP (like RISC-V has). We'd need to get rid of one more, maybe MUL-F? that kinda hurts, but is probably worth it... also i'm thinking that LOADI should be a different format, not just concatenated, because it should use its addr mode bits as part of its value. I'm thinking that in code space (which is in bytes), 0 is the start of code, and in data space (which is in units of slots/native ptr sizes), 0 is the start of the data segment. The interpretation of negative addresses is platform-dependent (idea: some platforms could use each negative address in code space as a separate entry point to a library, allowing for 8k entry points); that means that, even with a concatenated format, we only have 14 bit immediates for LD/ST/LOADI, so without the sign bit that's 13 bits, or 8k. I guess we need another format for the JMP, which has 21 bits, or 20 bits without the sign (so 1 MiB?); that's pretty good.

done
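
a quick sanity check of the immediate widths discussed in the note above (a Python sketch, not part of any spec; the helper names are made up for illustration, and it assumes the immediates are two's-complement):

```python
def signed_range(bits):
    """Inclusive (min, max) of a two's-complement field of the given width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1

def nonnegative_count(bits):
    """How many non-negative values a signed field of this width can hold."""
    return 1 << (bits - 1)

# Widths taken from the note: 14 bits for LD/ST/LOADI, 21 bits for JMP.
for name, bits in [("LD/ST/LOADI immediate", 14), ("JMP immediate", 21)]:
    lo, hi = signed_range(bits)
    print(f"{name}: {bits} bits, range {lo}..{hi}, "
          f"{nonnegative_count(bits)} non-negative values")
```

so 14 bits leaves 8192 non-negative addresses (the '8k'), and 21 bits leaves 1048576 (1 Mi) non-negative byte addresses in code space.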

---

in 32-bit format:

also CALL, RET, and conditional CCALL, CRET. CSEL generalizes conditional load, b/c we have register indirect addr mode.

done

---

in 32-bit format:

actually the conditionals are probably pretty useful, mb we should make them a high priority. 3-operand CSEL subsumes 2-operand CCPY, so mb make room for 3-operand CSEL. mb mv the rarely used mul-f and div-f down to zero-operand. Also remember that we need a 2-operand CMP, and we probably also want to add in one- or two-operand CINC, and mb CNEG, CINV.

done

---

in 32-bit format:

if we want CSEL and CCPY, we need either 3-operand CMP, or a 2-operand CMP-to-SMALLSTACK (actually more than one; CMP-to-SMALLSTACK-LE, CMP-to-SMALLSTACK-NE), or a zero-operand CMP-SMALLSTACK, or something similar. Mb move the floating point ops here down to one-operand or zero-operand instructions, and use SMALLSTACK for the other operand(s). Since we only have -SMALLSTACK versions of CSEL and CCPY anyways, 2-operand CMP-to-SMALLSTACK is sufficient; we don't need 3-operand CMP.

done
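
a rough Python model of how a 2-operand CMP-to-SMALLSTACK could feed a SMALLSTACK-consuming CSEL. The exact semantics here (push 1/0 as the flag, CSEL pops it) are an assumption for illustration; the notes above don't pin them down, and the function names are hypothetical:

```python
smallstack = []  # models SMALLSTACK; top of stack is the end of the list

def cmp_le_to_smallstack(a, b):
    """Assumed CMP-to-SMALLSTACK-LE: push 1 if a <= b, else 0."""
    smallstack.append(1 if a <= b else 0)

def csel_smallstack(if_true, if_false):
    """Assumed CSEL: pop the flag and pick one of two values, no branch in the program."""
    flag = smallstack.pop()
    return if_true if flag else if_false

def select_example(b):
    # a = (b > 0 ? 1 : 2), rewritten as (b <= 0 ? 2 : 1) so that only an LE compare is needed
    cmp_le_to_smallstack(b, 0)
    return csel_smallstack(2, 1)
```

the point being that the compare only needs two operands because its result goes to SMALLSTACK rather than to a named destination.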

---

https://stackoverflow.com/questions/22168992/why-are-conditionally-executed-instructions-not-present-in-later-arm-instruction

" ...modern systems have better branch predictors and compilers are much more advanced so their cost on instruction encoding space is not justified...

    The A64 instruction set does not include the concept of predicated or conditional execution. Benchmarking shows that modern branch predictors work well enough that predicated execution of instructions does not offer sufficient benefit to justify its significant use of opcode space, and its implementation cost in advanced implementations.

And it continues

    A very small set of “conditional data processing” instructions are provided. These instructions are unconditionally executed but use the condition flags as an extra input to the instruction. This set has been shown to be beneficial in situations where conditional branches predict poorly, or are otherwise inefficient.

Another paper titled Trading Conditional Execution for More Registers on ARM Processors claims:

    ... conditional execution takes up precious instruction space as conditions are encoded into a 4-bit condition code selector on every 32-bit ARM instruction. Besides, only small percentages of instructions are actually conditionalized in modern embedded applications, and conditional execution might not even lead to performance improvement on modern embedded processors.

...

In addition, predication does not play well with out-of-order execution: it can require 4 data flow source operands (predicate, current value of destination register [needed if predicate is false], and two source register values) which must be checked for availability. AArch64's predicated instructions only require three sources (which is more likely to be supported by the OoO machinery [e.g., to support FMA] and is more friendly to a cracking into 2-source µops [like Alpha 21264 did for CMOV]). -- paul-a-clayton

...

I couldn't even find a conditional branch to register or conditional return, and no conditional loads. -- ant6n

...

For AArch64 the number of registers has been doubled compared to 32-bit ARM, but again you don't have any remaining bits for the new 3 high bits of the registers. If you want to use the old encoding then you must "borrow" either from the narrow 12-bit immediate or the 4-bit condition. 12-bit is too small compared to other RISC architectures such as MIPS and reducing it making everything worse, so removing the condition is a better choice

...

Conditional execution is a good choice in implementation of many auxiliary or bit-twiddling routines, such as sorting, list or tree manipulation, number to string conversion, sqrt or long division. We could add UART drivers and extracting bit fields in routers. Those have a high branch to non-branch ratio with somewhat high unpredictability too.

However, once you get beyond the lowest level of services (or increase the abstraction level by using a higher level language), the code looks completely different: code blocks inside different branches of conditions consists more of moving data and calling sub-routines. Here the benefits of those extra 4 bits rapidly fade away. It's not only personal development but cultural: Culturally programming has grown from unstructured (Basic, Fortran, Assembler) towards structural. Different programming paradigms are supported better also in different instruction set architectures.

A technological compromise could have been the possibility to compress the five bit 'cond.S' field to four or three most frequently used combinations.

...

It's somewhat misleading to say that conditional execution is not present in ARMv8. The issue is to understand why you don't want to execute some instructions. Perhaps in the early ARM days, the actual non-execution of instructions mattered (for power or whatever) but today the significance of this feature is that it allows you to avoid branches for small dumb jumps, for example code like a=(b>0? 1: 2). This sort of thing is more common than you might imagine --- conceptually it's things like MAX/MIN or ABS (though for some CPUs there may be instructions to do these particular tasks).

In ARMv8, while there are not general conditionally executed instructions there are a few instructions that perform the specific task I am describing, namely allowing you to avoid branching for short dumb jumps; CSEL is the most obvious example, though there are other cases (e.g. conditional setting of conditions) to handle other common patterns (in that case the pattern of C short-circuited expression evaluation).

IMHO what ARM has done here is what makes the most sense. They've extracted the feature of conditional execution that remains valuable on modern CPUs (avoid many branches) while changing the details of the implementation to match the micro-architecture of modern CPUs.

"

---

[4]

---

180328

old proposals from boot_extended_reference.adoc:

==
meta stuff: sysinfo (query system metadata; extended from Boot)
stack ops: picki (push a copy of the n-th item in a stack onto the top of the stack), rolli (rotate the top n+1 elements on a stack), stackload (replace a stack with a saved stack), stacksave (save a copy of a stack)
bitwise operations: bitor-uint, bitand-uint, bitxor-uint, bitnot-uint, slli-uint (shift left logical), srli-uint (shift right logical)
memory allocation: mrealloc
shared memory synchronization: cas-atomic (atomic compare-and-swap), fence
I/O and IPC (files and channels) and misc: open, close, seek (seek to a location within channel/file), flush (flush or commit writes to channel/file), walk (navigate to subdirectory), delete (delete a channel/file), fork (create a new process), forkdone (wait for a process to terminate), time (get the current time)
==

TODO: this stuff was recently just copied from boot_reference and many of the instructions now have the wrong opcode

TODO OLD: , loadcodeptr (load constant pointer to a program location)

todo: add poll to this list and the one above, msize, mrealloc

todo: this list is way out of date, see the end of this file instead

The 16 three-operand instruction opcodes and mnemonics and operand signatures are augmented by:

6. ldx: o i si (load from memory address plus index)
7. stx: o i si (store from register to memory address plus index)
13. cas-atomic: pio i i (atomic compare-and-swap)

The 16 two-operand instruction opcodes and mnemonics and operand signatures are augmented by:

4. bitor-uint: io ii (bitwise OR) 5. bitand-uint: io ii (bitwise AND) 6. bitxor-uint: io ii (bitwise XOR)

10. picki: sm k (push a copy of the n-th item in a stack onto the top of the stack) 11. rolli: sm k (rotate the top n+1 elements on a stack)

The 16 one-operand instruction opcodes and mnemonics and operand signatures are augmented by:

2. bitnot-uint: io (bitwise NOT, in-place)
3. IN todo
4. OUT todo
5. DEVOP todo
6. malloc: (allocate memory)
7. mdealloc: (deallocate memory)
8. msize: (what is the size of this allocated block of memory) todo or should we have mrealloc or nothing?
9. fence: p (the memory location p is ordered by a sequential consistency memory barrier)
10. cast-uint-ptr: (cast a uint to a ptr)
11. mrealloc: (resize memory allocation)
12. stackload: sm (replace a stack with a previously saved stack)
13. stacksave: si (save a copy of a stack)

The 16 zero-operand instructions are augmented by:

0. read (read from a channel/file)
1. write (write to a channel/file)
2.
3. time (get the current time)
4. open (open a channel/file)
5. close (close a channel/file)
6. loadcodeptr: (load constant CODEPTR) ?? todo what about just using annotations for this? what about generalizing to COERCE?
7. seek (seek within channel/file)
8. flush (flush/commit channel/file)
9.
10. delete (delete a channel/file)
11. fork (create a new process)
12. forkdone (wait for a process to terminate)
13. exec (execute another program, not returning to this one)

Three-operand instructions:

10. CAS-ATOMIC (compare-and-swap atomically): pio i i

Atomically, if the value at the memory location pointed to by the contents of the register op0 is equal to the contents of register op1, then set that memory location to the contents of register op2, and set register op1 to op2; otherwise, set register op1 to the value at that memory location.
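
The CAS-ATOMIC rule above can be sketched in Python as follows. This is only an illustrative model: the lock stands in for hardware atomicity, `memory` and `regs` are hypothetical containers, and the boolean return value is an addition for convenience that is not part of the instruction as described.

```python
import threading

_cas_lock = threading.Lock()  # stands in for hardware atomicity in this model

def cas_atomic(memory, regs, op0, op1, op2):
    """Model of CAS-ATOMIC; op0/op1/op2 are register numbers.

    regs[op0] holds the address, regs[op1] the expected value,
    regs[op2] the new value.
    """
    with _cas_lock:
        addr = regs[op0]
        if memory[addr] == regs[op1]:
            memory[addr] = regs[op2]   # swap succeeded
            regs[op1] = regs[op2]
            return True
        regs[op1] = memory[addr]       # swap failed: report what was actually there
        return False
```

After the instruction, a program can tell the two outcomes apart by comparing op1 with op2: they are equal on success, and on failure op1 holds the value found in memory.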

Two-operand instructions

These are encoded via instruction TWO (opcode 15), which uses field op2 as a secondary opcode. The values of this secondary opcode are listed below.

3. BITOR-UINT (bitwise OR, in-place): io ii

op0 = op0 OR op1

4. BITAND-UINT (bitwise AND, in-place): io ii

op0 = op0 AND op1

5. BITXOR-UINT (bitwise XOR, in-place): io ii

op0 = op0 XOR op1

Note: although there is no instruction named EQ-UINT in Boot, you can do a BITXOR between two values, which will result in 0 if and only if the values are equal.

6. SLLI-UINT (shift left logical by constant number of bits, in-place): io k

op0 = shift_logical_left(op0 by k bits)

Note: because the bitwidth of uints is implementation-specific, the result of shifting a bit left past position 15 is also implementation-specific; immediately after executing SLLI-UINT, a program should consider executing BITAND-UINT on the result of SLLI-UINT against the constant 65535, in order to guarantee the same behavior on all platforms.

7. SRLI-UINT (shift right logical by constant number of bits, in-place): io k

op0 = shift_right_logical(op0 by k bits)
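
The two notes above (equality via BITXOR, and masking after SLLI-UINT) can be illustrated with a small Python model. The `native_bits` parameter is an assumption used to imitate platforms with different uint widths; the function names are hypothetical.

```python
MASK16 = 65535  # the constant suggested in the SLLI-UINT note

def slli_uint(value, k, native_bits):
    """Model SLLI-UINT on a platform whose uints are native_bits wide."""
    return (value << k) & ((1 << native_bits) - 1)

def portable_slli(value, k, native_bits):
    """SLLI-UINT followed by BITAND-UINT with 65535, as the note recommends."""
    return slli_uint(value, k, native_bits) & MASK16

def eq_via_xor(a, b):
    """There is no EQ-UINT, but a BITXOR result of 0 means the inputs were equal."""
    return (a ^ b) == 0
```

For example, shifting 0x8001 left by one bit gives 0x0002 on a 16-bit platform but 0x10002 on a 32-bit one; after the mask both give 0x0002.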

9.

10. PICKI: sm k (push a copy of the n-th item in a stack onto the top)

Push a copy of the n-th item (zero-indexed) in the stack given by op0 onto the top of that stack, where n is the immediate constant given by op1.

The stack instruction found in some other systems as "DUP" can be imitated by giving 0 for op1; and "OVER" can be imitated by giving 1 for op1.

11. ROLLI: sm k (rotate the top n+1 elements on a stack) (note: this can also do swap (with k=1), rot (with k=2))

Rotate the top n+1 elements of the stack indicated by op0, where n is the immediate constant given by op1.

The stack instruction found in some other systems as "SWAP" can be imitated by giving 1 for op1; and "ROT" (rotate the top three elements) can be imitated by giving 2 for op1.
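
A minimal Python sketch of PICKI and ROLLI as described above, using a list whose last element is the top of the stack. The rotation direction (the item n places below the top moves to the top, as with Forth's ROLL) is an assumption chosen so that n=1 gives SWAP and n=2 gives ROT.

```python
def picki(stack, n):
    """Push a copy of the n-th item (zero-indexed from the top) onto the top."""
    stack.append(stack[-1 - n])

def rolli(stack, n):
    """Rotate the top n+1 items: the item n places below the top moves to the top."""
    stack.append(stack.pop(-1 - n))

s = [1, 2]
picki(s, 0)      # DUP  -> [1, 2, 2]
t = [1, 2]
picki(t, 1)      # OVER -> [1, 2, 1]
u = [1, 2]
rolli(u, 1)      # SWAP -> [2, 1]
v = [1, 2, 3]
rolli(v, 2)      # ROT  -> [2, 3, 1]
```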

12.

13.

One-operand instructions:

These are encoded via instruction ONE (opcode=15, op2=15), which uses field op1 as a secondary opcode. The values of this secondary opcode are listed below.
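
The nested escape encoding described for TWO and ONE can be sketched as a small classifier. This is illustrative only: it takes already-extracted 4-bit fields (the bit layout of the fields is not given here), and it does not model any further escape to zero-operand instructions.

```python
ESCAPE = 15  # opcode 15 escapes to TWO; op2 == 15 within TWO escapes to ONE

def classify(opcode, op0, op1, op2):
    """Return (instruction class, effective opcode) for one decoded instruction."""
    if opcode != ESCAPE:
        return ("three-operand", opcode)
    if op2 != ESCAPE:
        return ("two-operand", op2)   # op2 is the secondary opcode
    return ("one-operand", op1)       # op1 is the secondary opcode
```

For instance, `classify(15, 3, 1, 15)` would select one-operand instruction 1 (BITNOT-UINT in the list below), with op0 left over as its single operand.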

1. BITNOT-UINT (bitwise NOT, in-place): io

op0 = NOT op0

Note: because the bitwidth of uints is implementation-specific, the result of negating all bits is also implementation-specific; a program should consider BITANDing the result of bitnot with 65535 in order to guarantee the same behavior on all platforms.

2.

3.

4.

5.

6.

TODO CHANGE TO 1-OPERAND of IN OUT DEVOP MALLOC MDEALLOC MSIZE TODO FIX NUMBERING

10. IN (read device): o ii

The integer value in the register specified by op2 is a device index (see section 'Devices'). The mode of the indicated device is read from, and the result is placed in the register specified by op0. The result code is placed into register ERR.

IN is guaranteed not to mutate system state (other than the affected registers and PC and ERR).

If no device with that ID exists, a 'DEVICE DOES NOT EXIST' result code will be returned in ERR upon an attempt to execute the IN instruction.

11. OUT (write device): i ii

The integer value in the register specified by op2 is a device index (see section 'Devices'). The value of the register specified by op0 is written to the device. The result code is placed into register ERR.

If no device with that ID exists, a 'DEVICE DOES NOT EXIST' result code will be returned in ERR upon an attempt to execute the OUT instruction.

12. DEVOP (execute operation on system device): k ii

The integer value in the register specified by op1 is a device ID (see section 'Devices'). The immediate constant given in op0 is a device operation id. The operation indicated is executed on that device, with the value of register T used as the data argument, and the result is placed in register T. The result code is placed into register ERR.

If no device with that ID exists, a 'DEVICE DOES NOT EXIST' result code will be returned in ERR upon an attempt to execute the DEVOP instruction.

2. GETSTATE: k (get special register)

todo

3. SETSTATE: k (set special register)

todo

8. MALLOC (allocate memory): pio

TOS must hold the amount of memory to be allocated, in units of words. The second item on SMALLSTACK must be the memory domain in which to allocate it. Both of these items will be POP'd off of SMALLSTACK. Memory domain 0 is LOCAL (non-shared memory). Other memory domains are shared memory.

Note: each location in memory must be able to hold at least a 16-bit word (but could hold more, depending on the implementation); so the unit of memory allocation is not equivalent to a byte.

If the allocation is successful, the output, a pointer to the new memory, is placed in the register specified by op0. Or, if the requested amount of memory is not available, then ERR will be set to the code for OOM (out-of-memory) and an undefined value may be placed into the register specified by op0.

See also MDEALLOC.

10. MDEALLOC (free memory): pio

The memory region pointed to by the pointer in the register specified by op0 is deallocated.

If this pointer was not returned by a previous MALLOC, or if this memory region has already been deallocated (and has not been subsequently returned by another MALLOC), then undefined behavior may result.

See also MALLOC.

10. MSIZE (what size is this allocated memory region): pio

Place into TOS the size of the memory region pointed to by op0.

If this pointer was not returned by a previous MALLOC, or if this memory region has already been deallocated (and has not been subsequently returned by another MALLOC), then undefined behavior may result.

See also MALLOC.

todo should this be optional?

7. FENCE: p (the memory location p is ordered by a sequential consistency memory barrier)

The memory location p is ordered by a sequential consistency memory barrier (see memory_order_seq_cst in http://wayback.archive.org/web/20170129132623/http://en.cppreference.com/w/cpp/atomic/memory_order#Constants ; note that FENCE also implies an acquire and a release).

9. MREALLOC (resize memory allocation, like C's realloc): pio

The register specified by op0 is a pointer to the block of memory to be reallocated (this pointer must have been returned from a previous call to MALLOC, and must not have been provided to a subsequent call to MDEALLOC; otherwise undefined behavior may result). TOS holds a UINT giving the amount of memory to be reallocated, in words; this value will be POP'd off of SMALLSTACK.

If the reallocation is successful, the output (placed in the register specified by op0) is a pointer to the new memory. Or, if the requested amount of memory is not available, then ERR will be set to the code for OOM (out-of-memory) and the register specified by op0 continues to hold a pointer to the old memory region (although this region may have been moved).

See also MALLOC, MDEALLOC.
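
A toy Python model of the MALLOC / MDEALLOC / MREALLOC contract described above, mainly to show the OOM behavior. Everything concrete here is an assumption for illustration: the `ToyHeap` class, the numeric value of the OOM code, pointers as small integers, the omission of memory domains, and returning `None` as the 'undefined value' after a failed MALLOC.

```python
OOM = 1  # hypothetical ERR code for out-of-memory

class ToyHeap:
    def __init__(self, capacity_words):
        self.capacity = capacity_words  # sizes are in words, not bytes
        self.blocks = {}                # pointer -> list of words
        self.next_ptr = 1
        self.err = 0                    # models the ERR register

    def used(self):
        return sum(len(block) for block in self.blocks.values())

    def malloc(self, words):
        if self.used() + words > self.capacity:
            self.err = OOM
            return None                 # the result is undefined on failure
        ptr = self.next_ptr
        self.next_ptr += 1
        self.blocks[ptr] = [0] * words
        self.err = 0
        return ptr

    def mdealloc(self, ptr):
        del self.blocks[ptr]

    def mrealloc(self, ptr, words):
        old = self.blocks[ptr]
        if self.used() - len(old) + words > self.capacity:
            self.err = OOM
            return ptr                  # op0 keeps pointing at the old region
        self.blocks[ptr] = (old + [0] * words)[:words]
        self.err = 0
        return ptr
```

The design point this shows is that a failed MREALLOC leaves the old block usable, whereas a failed MALLOC leaves nothing to use, so programs should check ERR before touching op0.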

11. STACKLOAD: sm (replace a stack with a previously saved stack)

op0 is set to point to the saved stack pointed to by TOS (the stack is in the same format generated by STACKSAVE).

12. STACKSAVE: si (save a copy of a stack)

A copy of the stack indicated by op0 is saved. TOS must hold a pointer which must point to a memory area at least large enough to hold the size of the target stack, plus one (the sizes of DATASTACK, CALLSTACK, and SMALLSTACK are available through SYSINFO). The size of the stack is copied to the memory location pointed to by the pointer, and the entire stack is copied to the succeeding memory locations, in downward-growing format (that is, the top of the stack appears at the next memory location after the size, then the second-to-top item, etc, with the bottommost element on the stack appearing last, at the highest memory location).

The program is responsible for MALLOCing the memory needed to hold the stack before calling STACKSAVE, and for MDEALLOCing this memory later, after it is done with it.
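
The saved-stack layout described for STACKSAVE (size first, then the items from top to bottom) can be sketched in Python as below. Memory is modeled as a flat list of words, and `stackload` here rebuilds a list rather than repointing a stack register, which is a simplification of what STACKLOAD is described as doing.

```python
def stacksave(stack, memory, ptr):
    """Write len(stack), then the items from top to bottom, starting at memory[ptr].

    `stack` is a list whose last element is the top of the stack.
    The caller must have reserved len(stack) + 1 words at ptr.
    """
    memory[ptr] = len(stack)
    for i, item in enumerate(reversed(stack)):
        memory[ptr + 1 + i] = item

def stackload(memory, ptr):
    """Rebuild a stack (top = last element) from the saved format."""
    size = memory[ptr]
    return list(reversed(memory[ptr + 1 : ptr + 1 + size]))

memory = [0] * 8
stacksave([10, 20, 30], memory, 2)   # 30 is the top of the stack
```

After this call the words at addresses 2..5 hold 3, 30, 20, 10: the size, then the top of the stack, with the bottommost element at the highest address.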

13. SYSINFO (query system information): k

SYSINFO is already in Boot, but in BootX its functionality is extended.

k determines which information is being queried. The output is placed in TOS. Currently defined:

TODO: move most of this to BOOTX

0. BootX SYSINFO version. Returns 1. Note: this version number is distinct from the Boot specification version identifier (SdVer version).
1. Capability overview. RESERVED. The idea is that this might be a bitmask showing which instructions are supported.
2. Size of base memory (the memory starting at 0x0)
3. Word bitwidth of UINTs. Returns 0 for 16-bits, 1 for 32-bits, 2 for 64-bits, 3 for 128-bits, 4 for 256-bits. Other values are RESERVED. (todo or should 2 denote 48 bits, etc? in which case, should 15 or 16 denote 256-bits?). This must be a constant that does not change for the duration of the program.
4. MAX_UINT. A UINT with the maximum value supported by this platform. This must be a constant that does not change for the duration of the program.
5. Current size of SMALLSTACK
6. Current size of DATASTACK
7. Current size of CALLSTACK
8. Remaining capacity of DATASTACK
9. Remaining capacity of CALLSTACK
10. Remaining capacity of MALLOC
11.
12. A pointer to a length-prefixed bytestring (wordstring) containing the current process's Process Address
13. Effective number of CPUs. This does not have to be physical CPUs, but rather is used as just a hint as to how many threads are optimal for parallel processing in eg a workstealing scheduler.
14.
15.

Other values for the input are RESERVED.
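
As a small illustration of query 3 (word bitwidth of UINTs), the currently defined return codes map to bitwidths by doubling. The helper names below are hypothetical, and `max_uint_for` assumes MAX_UINT is the all-ones value of that width, which the text above does not strictly require.

```python
def uint_bitwidth(code):
    """Map the SYSINFO word-bitwidth code to a number of bits: 0 -> 16, ..., 4 -> 256."""
    if not 0 <= code <= 4:
        raise ValueError("RESERVED bitwidth code")
    return 16 << code

def max_uint_for(code):
    """Assumed MAX_UINT for a platform reporting this bitwidth code."""
    return (1 << uint_bitwidth(code)) - 1
```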

todo: do we really need 'effective number of CPUs' here? It will not be called very often and it will not be known statically. Mb put that in the 'special' pseudofilesystem.

Zero-operand instructions

0. LOADCODEPTR (load constant CODEPTR): one input and one output

Load constant pointer to a program location, by means of loading an integer constant and then converting that integer constant to a CODEPTR.

Must be immediately preceded by a sequence of LOADI, followed by zero or more LOADLOs, into the TOS register.

Replaces that constant in TOS (which is the input) with a CODEPTR (which is the output) which points to the n-th quad in this program, where n is the constant that was placed into TOS.

Note that because the integer input is interpreted as a relative offset from the beginning of the current program's code, programs that use this operation are still position-independent.

The maximum magnitude of the integer constants that are converted into CODEPTRs is platform-dependent. On some platforms, this may be larger than the maximum magnitude of UINTs; so, the sequence of LOADI/LOADLOs preceding LOADCODEPTR might, on some platforms, be longer than otherwise allowed.

1. OPEN (open a channel or file): 2 inputs, 1 output on SMALLSTACK

The inputs are a file/channel path/address, and a mode, and the output is an open file descriptor. We use the same API to open a file, open a channel, or subscribe to an event stream. Processes can publish event streams as subpaths under their Process Address. todo: do we define '/' as a path separator? note: Plan9's 9P prohibits slashes in names, so i guess this is special treatment (this argues FOR using / as a path separator for us) todo: we could use '/' as the platform filesystem root, and '' as our own internal 'special' filesystem root (for things like Process Addresses, URNs, etc; we dont actually want to handle URNs at the Boot level but this could be a later extension at the Ovm level, as well as 9P's mount/bind stuff to allow custom services to serve stuff like that)

2. CLOSE (close a channel/file): 1 input on SMALLSTACK

The input is an open file descriptor.

5. POLL (wait until a message or event is pending, then receive it):

todo: how exactly does our poll work? todo: timeout in milliseconds?

If the list of channels to be polled is empty, then poll sleeps for an indeterminate but finite amount of time. This can be used to yield control for cooperative multitasking.

6. SEEK (seek within channel/file): 3 inputs on SMALLSTACK

The inputs are an open file descriptor, a UINT, and a seek mode.

The second input is interpreted according to the seek mode. The seek mode can be:

7. FLUSH (flush or commit channel or file): 1 input on SMALLSTACK

The input is an open file descriptor. Writes/changes to the file or channel are flushed or committed, if necessary.

8. WALK (navigate to file/subdirectory within directory): 2 inputs, 1 output on SMALLSTACK

The inputs are an open file descriptor, and a subpath. The output is a file/channel path/address.

todo: consider replacing this with 'list' or 'query', and just using slash to descend into directories (this would mean the implementation must replace '/' with '\' on Windows systems)

9. DELETE (delete a channel/file): 1 input on SMALLSTACK

The input is a file/channel path/address. The specified file or channel is deleted.

10. FORK (create a new process): 2 outputs on SMALLSTACK

The current process is forked (todo explain). The outputs are an integer which is either zero (in the child) or one (in the parent), and a pointer to the Process Address (this is like a file/channel path/address; todo explain) of the new process.

todo: really? windows doesn't support fork, right, so is this too powerful to be easy to implement efficiently?

11. FORKDONE (wait for a process to terminate; fork's join/wait): 2 inputs on SMALLSTACK

Blocks until the indicated process terminates. The inputs are the target Process Address and a timeout, in milliseconds.

todo: i think we can get rid of this and just do this with POLL on some sort of channel whose path is a subpath of the Process Address which gives a stream of process state change events

12.

13. TIME (get the current time): 1 input and 1 output

The input is a clock_spec, and the form of the output depends on the clock_spec. If the requested clock_spec is not supported by the implementation, sets ERR.

todo: i think we can get rid of this and just read from a file. The clock_spec can be part of the file path? or should it be a pre-opened Open File Descriptor? Probably have to open it, because there are too many potential clock_specs to pre-open them all. But wait -- a clock wants to return more than one thing specify the clock_spec (see my notes:

So, i guess what we should actually do, rather than mandating 64-bit microseconds as the one true time format, is to provide a way to ask for a clock with certain properties, and let the implementation decide whether it will satisfy the request. Like the way that Linux clock_gettime passes a clock_id integer with the request. So we don't need to specify that 'microseconds must be available'; perhaps ALL resolutions are 'available', but if the implementation can't actually provide it then it just adds zeros to what it can provide. And, since Boot (and Ovm) allow (but do not require) opaque words with greater-than-16-bit semantics, we can let the implementation decide what time bitwidth it wants to offer, too. We may want to specify a way for a 16-bit clock_id integer to encode all possible clock requests, however. This seems do-able.

With just 4 bits, we can choose any resolution from second to yoctosecond (10^-24; three metric prefixes below femtoseconds) (and more). With just 3 more bits, we can choose any bitwidth from 16 to 256 (and more). With just 1 more bit, we can ask for monotonicity or not; with 2 more bits, whether the time is absolute, per-cpu, per-process, or undefined relative time (maybe cycles since OS/platform/Ovm-implementation start). And that's only 10 bits so far (and we might want to leave the other 6 bits or so for implementation-specific stuff).

How to query the system's clock capabilities? It's tempting to have a request for the clock with id 0 represent such a query. But that's a special case, and we want to avoid special cases (they make typechecking etc harder). So we should have some sort of meta 'get system capabilities' thingee that includes clock capabilities in its API. Perhaps clock_id 0 would correspond to asking for 16-bit second-resolution not-necessarily-monotonic relative time, and "SHOULD" be supported whenever possible.)
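A minimal sketch of how the 10 bits tallied above could be packed into a 16-bit clock_id. The field order, the names encode_clock_id and decode_clock_id, the steps-of-1000 resolution scale, and the power-of-two bitwidth scale are all assumptions made for illustration; nothing here is settled by these notes. The layout is chosen so that clock_id 0 comes out as 16-bit, second-resolution, not-necessarily-monotonic relative time:

```python
# Hypothetical layout (an assumption, low bits first):
#   bits 0-3  resolution step: resolution is 10**(-3*step) seconds (0 = s, 8 = ys)
#   bits 4-6  width code: bitwidth is 16 << code (0 = 16 bits, 4 = 256 bits)
#   bit  7    monotonic flag
#   bits 8-9  kind of time
#   bits 10-15 left for implementation-specific use
KINDS = ("relative", "absolute", "per-cpu", "per-process")


def encode_clock_id(res_step, width_code, monotonic, kind):
    """Pack a clock request into the low 10 bits of a 16-bit clock_id."""
    if not 0 <= res_step < 16:
        raise ValueError("res_step must fit in 4 bits")
    if not 0 <= width_code < 8:
        raise ValueError("width_code must fit in 3 bits")
    kind_index = KINDS.index(kind)  # raises ValueError for an unknown kind
    return res_step | (width_code << 4) | (int(bool(monotonic)) << 7) | (kind_index << 8)


def decode_clock_id(clock_id):
    """Unpack the fields written by encode_clock_id."""
    return {
        "resolution_exp10": -3 * (clock_id & 0xF),
        "bitwidth": 16 << ((clock_id >> 4) & 0x7),
        "monotonic": bool((clock_id >> 7) & 1),
        "kind": KINDS[(clock_id >> 8) & 0x3],
    }
```

For example, encode_clock_id(8, 4, True, "absolute") asks for a monotonic, absolute, 256-bit yoctosecond clock and still leaves the top 6 bits untouched.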

Devices

Standard open file descriptors

todo revise this section for devices

These open file descriptors are already open upon the beginning of a Boot program. They may not be supported by all implementations:

0. Standard input, if available.

1. Standard output, if available.

2. Unused. Some implementations may use this for standard error, but programs are encouraged to use the LOG facility instead of standard error.

3. rand. Each value read is a (pseudo)random number uniformly distributed between 0 and MAX_UINT, inclusive.

Channel paths

todo revise this section for devices

If the first character of a channel path is ':', the rest of the path is either a special path or, if the second character is also ':', a URI (so URIs look like '::http://example.com'). If the first character is '/', then the path is an absolute filepath (interpreted in an implementation-dependent manner; however, it is suggested to interpret this like a URI with the prefix "file:///"). Otherwise, it is a relative filepath.
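The prefix rules above can be sketched as a small classifier. The function name classify_channel_path and the returned tags are hypothetical, and the sketch assumes that a single leading '/' is what marks an absolute filepath:

```python
def classify_channel_path(path):
    """Return (kind, rest) for a channel path, following the prefix rules above."""
    if path.startswith("::"):
        # '::' introduces a URI; the rest is the URI itself
        return ("uri", path[2:])
    if path.startswith(":"):
        # a single ':' introduces a special such as ':sysinfo' or ':proc'
        return ("special", path[1:])
    if path.startswith("/"):
        # absolute filepath (assumption: one leading '/' is enough)
        return ("absolute", path)
    return ("relative", path)
```

The '::' test has to come before the ':' test, since every URI path also starts with a single ':'.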

Specially defined 'directories' include:

:sysinfo

todo revise this section for devices

The space below this is RESERVED. It is intended to be used as an extension to SYSINFO.

:proc

todo revise this section for devices

The space below this is RESERVED. It is intended to be used to map to various processes.

Special system device types

BootX does not currently define any special system devices. These are RESERVED for future use.

Implementation-dependent device types

Devices 256 thru 4095 are reserved for the implementation.

These could be used, for example, for hardware peripherals, or for interacting with channels, signals, and interrupts with a special meaning in the context of the underlying implementation.

Dynamically assigned device types

Dynamically assigned devices are various types of devices which are created at runtime and whose device IDs are not known before then.

Channel control devices

Each channel is associated with a channel control device (whose device number is the same as the channel ID). IN and OUT applied to a channel's control device are NOT the same as reading from or writing to the channel.

IN: returns the number of words waiting to be read from the channel, or zero if there is nothing waiting currently to be read. Note that not all implementations support this -- check SYSINFO first.

OUT: RESERVED

DEVOP: RESERVED

Thread/process control devices

todo

Signal/interrupt handler configuration devices

todo. Signal devices allow the execution of a BootX process to be interrupted by events.

IN: RESERVED

OUT: RESERVED

DEVOP:

1. Invoke(target_process, data): Send this signal to the process whose process ID is target_process, with associated data 'data'. todo: is this really the best way to invoke? also, don't we only get one argument here -- what to do about that?

2. Get pending: returns 0 if this signal is not pending, or 1 if it is. Note: if another signal of the same type comes in while one is already pending, there is no change.

3. unused

4. Get mask/unmask: returns 0 if this signal is masked (on hold), or 1 if this signal is currently unmasked (active).

5. Set mask/unmask(x): x is 0 to mask (place on hold) this signal, or 1 to unmask (resume processing of) this signal.

6. Get entrypoint: return this signal's entrypoint (callback that will be called when this signal is received); 0 if this signal is ignored by this process

7. Set entrypoint(x): set this signal's entrypoint

8. Get priority: return this signal's priority. A signal with a higher priority number is more urgent and, if the implementation supports it, can interrupt a signal with a lower priority number.

9. Set priority(x): set this signal's priority

10. Get subpriority: return this signal's subpriority. A signal with the same priority as another signal but a higher subpriority will not interrupt it, but will execute first if both are pending when one of them gets a chance to execute.

11. Set subpriority(x): set this signal's subpriority

The implementation saves and clears the context (registers, stacks) before invoking a signal handler, and restores them after the handler is done.

Note that not all implementations support signals, masking, individual masking without masking all signals at once, priorities, and/or subpriorities. Todo: how to ask if the current implementation supports each of these things, and to ask how many signals/priorities/subpriorities it supports?
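A rough model of the pending/mask/priority rules described for signal devices. The names Signal and pick_next are hypothetical, and the sketch assumes that a masked signal simply stays pending and that a signal with entrypoint 0 is never dispatched; the notes do not settle either point:

```python
from dataclasses import dataclass


@dataclass
class Signal:
    priority: int = 0
    subpriority: int = 0
    unmasked: bool = True   # True = unmasked (active), False = masked (on hold)
    entrypoint: int = 0     # 0 means this process ignores the signal
    pending: bool = False

    def arrive(self):
        # A second arrival while already pending changes nothing.
        self.pending = True


def pick_next(signals, running_priority=None):
    """Return the name of the signal to dispatch next, or None.

    If a handler is already running, only a strictly higher priority may
    interrupt it; subpriority only breaks ties among waiting signals.
    """
    ready = [
        (sig.priority, sig.subpriority, name)
        for name, sig in signals.items()
        if sig.pending and sig.unmasked and sig.entrypoint != 0
        and (running_priority is None or sig.priority > running_priority)
    ]
    return max(ready)[2] if ready else None
```

With two pending signals of equal priority, the one with the higher subpriority is picked first, but neither is picked while a handler of that same priority is still running.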

Clocks

todo

Timers

todo

todo periodic system 'tick' timer

(Pseudo)random numbers

todo

Lock, event, condition, and semaphore devices

todo

Virtual devices

todo. Virtual devices allow BootX programs to create and present a device API.

Concurrency semantics

In shared memory (memory allocated with malloc-shared), LOAD, STORE, and atomics (the operations labeled -atomic) (where supported by a particular implementation) are atomic. All implicit loads and stores of single Boot words (ints, or pointers, plus whatever 'OTHER' types the implementation provides) are tear-free/atomic. Nothing else in shared memory is guaranteed to be atomic. Some or all of the -atomic operations may not be available on all implementations. Some implementations may implement some atomics, or even all memory accesses to shared memory, via a lock at the granularity of the entire region allocated by one call to malloc-shared. Valid Boot programs must not share local memory (memory allocated with malloc to the local domain, that is, domain 0), that is, they must not dereference or manipulate a pointer to local memory which is owned by another process (local memory is owned by the process that allocated it).
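As an illustration of the 'one lock per malloc-shared region' implementation strategy mentioned above, here is a sketch in Python. The class SharedRegion and its method names are hypothetical, and having cas return the old value is an assumption (the notes do not define the result convention of CAS-ATOMIC):

```python
import threading


class SharedRegion:
    """Models the region returned by one malloc-shared call, guarded by one lock."""

    def __init__(self, n_words):
        self._words = [0] * n_words
        self._lock = threading.Lock()

    def load(self, index):
        with self._lock:
            return self._words[index]

    def store(self, index, value):
        with self._lock:
            self._words[index] = value

    def cas(self, index, expected, new):
        """Compare-and-swap: write 'new' only if the word equals 'expected'.

        Returns the value that was found (assumed convention).
        """
        with self._lock:
            old = self._words[index]
            if old == expected:
                self._words[index] = new
            return old
```

A caller would typically retry in a loop: read the word with load, compute the new value, and repeat until cas returns the value that was read.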

Registers are never shared, hence single instructions that only do computations between registers (ie. that don't access main memory or I/O or IPC) are atomic.

In the absence of any FENCE instructions, atomic operations (LOAD, STORE, and CAS-ATOMIC; in the future more may be added) have only a relaxed memory ordering (see memory_order_relaxed in http://wayback.archive.org/web/20170129132623/http://en.cppreference.com/w/cpp/atomic/memory_order#Constants ). Synchronization is only effective within a memory domain (which is provided as an argument to malloc-shared); this is to accommodate NUMA and distributed architectures in which one process may have access to multiple distinct memories. Use FENCE to ensure sequential consistency amongst all FENCEs in the same memory domain (see memory_order_seq_cst in http://wayback.archive.org/web/20170129132623/http://en.cppreference.com/w/cpp/atomic/memory_order#Constants ; note that FENCE also implies an acquire and a release). If you don't know what any of this means, then please always follow these guidelines:

The 32 three-operand instruction opcodes and mnemonics and operand signatures are (note: in 32-bit instructions, constants ('c') are 7 bits, not 3 or 4 bits, because the addressing mode is treated as part of the constant):
0. annotate: c c c (can be ignored)
1. INSTR-TWO-ARITH2: c ? ? (used to encode two-operand instructions)
2. INSTR-TWO-BOOTX: c ? ? (used to encode two-operand instructions)
3.
4. jrel: c c c (unconditional relative jump by a constant signed amount)
5. ldi: c c oi (load immediate 14-bit int)
6. ldabs: c c o (2-operand absolute instruction format; load from absolute 14-bit memory location; negative locations are implementation-dependent, positive locations are slots from memory data segment start (if any))
7. stabs: c c i (2-operand absolute instruction format; store to absolute 14-bit memory location; negative locations are implementation-dependent, positive locations are slots from memory data segment start (if any))
8. ld: c o si (load from memory address plus unsigned constant)
9. st: c so i (store from register to memory address plus unsigned constant)
10. bne-int: c ii ii (branch-if-not-equal on ints)
11. ble-int: c ii ii (branch if less-than-or-equal on ints)
12. bne-ptr: c ip ip (branch-if-not-equal on pointers)
13. ble-ptr: c ip ip (branch if less-than-or-equal on pointers)
14. addi-int: c io ii (in-place addition of ints and immediate constant)
15. addi-ptr-int: c iop ii (in-place addition of ints and immediate constant to ptr)
16. sll: c io ii (shift left logical (multiplication by 2^c (mod MAX_INT+1)))
17. srl: c io ii (shift right logical (division by 2^c, rounding towards zero))
18. sra: c io ii (shift right arithmetic (division by 2^c, rounding towards negative infinity))
19. add-int: oi ii ii (addition of ints)
20. add-ptr-int: op ip ii (add an int to a pointer)
21. bitor: oi ii ii (bitwise OR)
22. bitand: oi ii ii (bitwise AND)
23. bitxor: oi ii ii (bitwise XOR)
24. sub-ptr: op ip ip (subtraction of pointers)
25. mul-int: oi ii ii (integer multiply)
26. add-f: of if if (float addition)
27. csel: o i i (if top of SMALLSTACK is zero, then CPY op1 into op2, otherwise CPY op0 into op2)
28. pop-from-one-stack-push-to-another-multi: sio sio ii (operands are the two stacks, and the number of items to move between them)
29. push-multiple-registers-onto-stack: sio ii ii (stack, starting register, ending register)
30. pop-multiple-registers-from-stack: sio ii ii (stack, starting register, ending register; note: if a user stack, this is guaranteed not to actually destroy the 'popped' values, so if you only wanted to copy them non-destructively, just move the stack pointer back afterwards)
31. CAS: iop i i
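To make the '14-bit' figure in the ldi, ldabs, and stabs entries concrete: two 7-bit constant operands together give 14 bits. A sketch of how a decoder might combine them follows; the function name is hypothetical, and both the operand order (first constant as the high half) and the two's-complement signed interpretation are assumptions not fixed by these notes:

```python
def combine_14bit_immediate(hi7, lo7):
    """Combine two 7-bit constant operands into one signed 14-bit value."""
    if not (0 <= hi7 < 128 and 0 <= lo7 < 128):
        raise ValueError("each constant operand must fit in 7 bits")
    raw = (hi7 << 7) | lo7          # 14-bit unsigned value
    if raw & (1 << 13):             # top bit set: negative in two's complement
        raw -= 1 << 14
    return raw
```

Under these assumptions the representable range is -8192 to 8191, which would also be the range of ldabs/stabs locations.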

The 16 two-operand INSTR-TWO-BOOTX instruction opcodes and mnemonics and operand signatures are:
0. INSTR-ONE-BOOTX: c ? (used to encode one-operand instructions)
1.
2. pickk: c sio (pick c on stack)
3. rollk: c sio (roll c on stack)
4. loadc: c o (load constant; the implementation of constant storage is up to the implementation, but one simple idea is to put them in the space before the start of the program)
5. sysinfo (query system metadata): c o
6. pop: o sim
7. push: som i
8. cpy: o i (copy from register to register)
9. neg: oi ii (arithmetic negation)
10. bitnot: oi ii (bitwise negation)
11. neg-f: of if (arithmetic negation of float)
12. cvt-int-f: of ii
13. fclass-f: oi if (classify a floating point number; see eg FCLASS in RISC-V, eg FCLASS.S in section 8.9)
14. call: sio icp (push PC to indicated stack, then jmp to codepointer)
15. ccall: sio icp (pop item from SMALLSTACK, if it's nonzero, execute CALL)

The 16 two-operand INSTR-TWO-ARITH2 instructions with results pushed to SMALLSTACK are:
0. INSTR-ONE-ARITH1
1. eq-i: ii ii
2. neq-i: ii ii
3. leq-i: ii ii
4. le-i: ii ii
5. eq-f: if if
6. neq-f: if if
7. leq-f: if if
8. le-f: if if
9. div-f: if if (float division)
10. min-f: if if (min)
11. max-f: if if (max)
12. mul-f: if if (float multiplication)
13. copysign-f: if if (op1's sign is changed to op2's sign)
14. negcopysign-f: if if (op1's sign is changed to the negation of op2's sign)
15. xorsign-f: if if (op1's sign is changed to xor of op1's and op2's signs)

The 16 one-operand INSTR-ONE-ARITH1 instructions with results pushed to SMALLSTACK are:
1.
2.
3.
4.
5.
6.
7.
8.
9. lea (load effective address; takes an addressing mode and an operand_value, and returns an effective address)
10. sqrt-f
11. roundeven-f-int ('round' in wasm)
12. trunc-f-int
13. ceil-f-int
14. roundawayfrom0-f-int ('round' in C)
15. sqrt-f

The 16 INSTR-ONE-BOOTX one-operand instruction opcodes and mnemonics and operand signatures are:
0. instr-syscall: c (used to encode zero-operand misc instructions)
1. instr-syscall2: c
2.
3.
4.
5.
6.
7.
8.
9. malloc: op
10. ret: sio (pop codeptr from indicated stack, then jmp to it)
11. cret: sio (pop item from SMALLSTACK, if it's nonzero, execute RET)
12. jd: ipp (dynamic jump)
13. cinc: io (conditional increment)
14. mdealloc: ip
15. mrealloc: iop

The 16 SYSCALL2 zero-operand instructions are:
0.
1. exec
2. get
3. put
4. spawn
5. pctrl (process control, eg join/wait, kill, etc -- or should these each be separate?)
6. time
7. rand
8. environ
9. getpid
10. signal (?? not sure if we want to do it this way -- signal handler setup)
11. create
12. delete
13. floor-f-int
14. divmod-i16: (on SMALLSTACK; consume 2 items and push quotient, then push remainder)
15. divmod-int: (on SMALLSTACK; consume 2 items and push quotient, then push remainder)

The 16 INSTR-ZERO zero-operand instructions are implementation-dependent. Implementations which use only a fixed-width 32-bit BootX instruction set (as opposed to a mixed-width instruction set mixing 16-bit Boot instructions with 32-bit BootX instructions) should define these to be the same as the 16 INSTR-ZERO zero-operand instructions in Boot.

---

there are a ton of from-t and to-t conditionals here, and all they do is shorten two-instruction sequences like 'leq w a b; cZZZ-nz x t' to 'cZZZ-le x a b'. The runtime can use macro-op fusion here anyways. So i removed these:

18. cadd-from-t-le-int: ioi ii ii: (conditional add T; if op1 <= op0, then op2 = op2 + T)
19. cadd-from-t-lt-int: ioi ii ii: (conditional add T; if op1 < op0, then op2 = op2 + T)

19. cadd-from-t-eq-int: ioi ii ii: (conditional add T; if op1 == op0, then op2 = op2 + T)
20. cadd-from-t-ne-int: ioi ii ii: (conditional add T; if op1 != op0, then op2 = op2 + T)
21. cadd-from-t-eq-ptr: iop ii ii: (conditional add T; if op1 == op0, then op2 = op2 + T)
22. cadd-from-t-ne-ptr: iop ii ii: (conditional add T; if op1 != op0, then op2 = op2 + T)
23. cadd-from-t-le-ptr: iop ii ii: (conditional add T; if op1 <= op0, then op2 = op2 + T)
24. cadd-from-t-lt-ptr: iop ii ii: (conditional add T; if op1 < op0, then op2 = op2 + T)
25. cadd-to-t-le-ptr: ii ii ii: (conditional add T; if op1 <= op0, then T = T + op2)
26. cadd-to-t-lt-ptr: ii ii ii: (conditional add T; if op1 < op0, then T = T + op2)
27. cadd-to-t-eq-ptr: ii ii ii: (conditional add T; if op1 == op0, then T = T + op2)
28. cadd-to-t-ne-ptr: ii ii ii: (conditional add T; if op1 != op0, then T = T + op2)
29. cadd-to-t-le-int: ii ii ii: (conditional add T; if op1 <= op0, then T = T + op2)
30. cadd-to-t-lt-int: ii ii ii: (conditional add T; if op1 < op0, then T = T + op2)
31. cadd-to-t-eq-int: ii ii ii: (conditional add T; if op1 == op0, then T = T + op2)
cadd-to-t-ne-int: ii ii ii: (conditional add T; if op1 != op0, then T = T + op2)

similarly:

30. cbitnot-le: ioi ii ii: (conditional bitnot; if op1 <= op0, then op2 = bitnot(op2))
31. cbitnot-lt: ioi ii ii: (conditional bitnot; if op1 < op0, then op2 = bitnot(op2))

cbitnot-eq: ioi ii ii: (conditional bitnot; if op1 == op0, then op2 = bitnot(op2))
cbitnot-ne: ioi ii ii: (conditional bitnot; if op1 != op0, then op2 = bitnot(op2))
cinc-le: ioi ii ii: (conditional increment; if op1 <= op0, then op2 = op2 + 1)
cinc-lt: ioi ii ii: (conditional increment; if op1 < op0, then op2 = op2 + 1)

22. ccpy-from-t-le: o ii ii: (conditional cpy from T; if op1 <= op0, then op2 = T)
23. ccpy-from-t-lt: o ii ii: (conditional cpy from T; if op1 < op0, then op2 = T)
24. ccpy-from-t-eq: o ii ii: (conditional cpy from T; if op1 == op0, then op2 = T)
25. ccpy-from-t-ne: o ii ii: (conditional cpy from T; if op1 != op0, then op2 = T)
26. ccpy-to-t-le: o ii ii: (conditional cpy to T; if op1 <= op0, then T = op2)
27. ccpy-to-t-lt: o ii ii: (conditional cpy to T; if op1 < op0, then T = op2)
28. ccpy-to-t-eq: o ii ii: (conditional cpy to T; if op1 == op0, then T = op2)
29. ccpy-to-t-ne: o ii ii: (conditional cpy to T; if op1 != op0, then T = op2)

csel-multi-smallstack-eq: o ii ii (pop 2 items from SMALLSTACK, and conditionally select; if op0 == op1, then copy the first item to op2; otherwise copy the second item to op2)
csel-multi-smallstack-ne: o ii ii (pop 2 items from SMALLSTACK, and conditionally select; if op0 != op1, then copy the first item to op2; otherwise copy the second item to op2)
csel-multi-smallstack-le: o ii ii (pop 2 items from SMALLSTACK, and conditionally select; if op0 <= op1, then copy the first item to op2; otherwise copy the second item to op2)
csel-multi-smallstack-lt: o ii ii (pop 2 items from SMALLSTACK, and conditionally select; if op0 < op1, then copy the first item to op2; otherwise copy the second item to op2)