- Boot (Oot BootX) reference
Version: unreleased (0.0.0-0)
BootX is a set of optional extensions to Boot. These extensions variously add instructions, define additional syscalls, define additional SYSINFO behavior, or further specify details that Boot leaves unspecified.
Profiles
For convenient reference, certain subsets of these extensions can be referred to as 'profiles'.
The following profiles are defined:
- small: very common functionality
- standard: small + common operating system-provided functionality
- performance: standard + various common operations which could be computed using the primitives provided by the Standard profile, but which may be faster if implemented natively
Any of these may be prefixed with either 'vanilla 32-bit' or 'vanilla 64-bit' to indicate the combination of the indicated profile with the indicated 'vanilla' restrictions.
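The layering of the three profiles can be sketched as nested sets. This is a minimal Python illustration only: the extension names are copied from the profile sections of this document, while the set representation and the helper `missing_extensions` are hypothetical and not part of BootX.

```python
# Extension names are taken from the profile lists in this document;
# representing profiles as frozensets is only an illustration.
SMALL = frozenset({
    "Floating point 1", "Integer division", "Misc",   # instruction extensions
    "Local memory allocation", "Memcpy",              # syscall extensions
})

STANDARD = SMALL | frozenset({
    "Clocks", "Environment Variables", "Filesys",
    "Process control", "Shared memory allocation", "TUI",
})

PERFORMANCE = STANDARD | frozenset({
    "Atomics rc 2", "Atomics seqc 1", "Atomics seqc 2",
    "Floating point 2", "Floating point Triglog",
    "Non-branching conditionals", "SIMD",
})


def missing_extensions(implemented, profile):
    """Return, sorted, the extensions of `profile` that `implemented` lacks."""
    return sorted(profile - set(implemented))
```

For example, `missing_extensions(SMALL, STANDARD)` lists the six syscall extensions that separate the Small profile from the Standard profile.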
Stubs
Much functionality may be trivially implemented in a way that always returns a null result or an exceptional condition, or in a way that does not take advantage of native platform facilities, even when present. This is compliant provided that the corresponding functionality is indicated as 'stub' or 'partially stubbed'.
For example, if the Small profile is implemented but every attempt to allocate memory returns null, the implementation must not be described as "implementing the BootX Small profile", but may be described as "implementing the BootX Small profile, partially stubbed".
For another example, if the Standard profile is implemented in a way that prevents more than one thread/process from executing simultaneously, yet the target platform natively provides true parallel processing, then the implementation must be described as "Standard profile, partially stubbed".
Small profile
The Small profile consists of the following functionality enhancement extension:
plus the following new instruction extensions:
- Floating point 1
- Integer division
- Misc
plus the following new syscall extensions:
- Local memory allocation
- Memcpy
Standard profile
The Standard profile includes everything in the Small profile, plus the following extensions:
Instruction:
Syscall:
- Clocks
- Environment Variables
- Filesys
- Process control
- Shared memory allocation
- TUI
Performance profile
The Performance profile includes everything in the Standard profile, plus the following extensions:
Instruction:
- Atomics rc 2
- Atomics seqc 1
- Atomics seqc 2
- Floating point 2
- Floating point Triglog
- Non-branching conditionals
- SIMD
Syscall:
Functionality extensions
Syscall functionality extension
TODO
Instruction extensions
These extensions add new instructions.
Integer division instruction extension
Floating point 1 instruction extension
floating point | i2f lf sf ceil flor trunc nearest addf subf mulf divf copysign bnan binf beqf bltf fcmp |
Floating point 2 instruction extension
Includes Floating point 1 and adds:
additional floating point | remf sqrt minf maxf powf bnonfinite bgtf beqtotf blttotf ftotcmp |
Floating point Triglog instruction extension
Includes Floating point 2 and adds:
additional floating point | TODO |
trig, log, exp etc fns from math.h (at this point are we missing any other math from C library or math.h?)
Floating point elusive eight instruction extension
additional floating point | TODO |
TODO
https://www.evanmiller.org/statistical-shortcomings-in-standard-math-libraries.html#functions
Misc instruction extension
TODO should these be separated?
HINTs may be executed as NOPs. They are intended for forward compatibility; later versions of the specification may define semantics for various HINTs with the understanding that some implementations may execute them as NOPs.
Atomics seqc 1 instruction extension
atomics (sequential consistency) | lpsc lwsc spsc swsc casrmwsc casprmwsc fencesc |
Atomics seqc 2 instruction extension
atomic additional rmw ops (sequential consistency) | addrmwa aprmwa andrmwa orrmwa xorrmwa |
Atomics rc 1 instruction extension
atomics (release consistency) | lprc lwrc sprc swrc casrmwrc casprmwrc |
Atomics rc 2 instruction extension
atomics (relaxed consistency) | lprlx lwrlx sprlx swrlx casrmwrlx casprmwrlx |
atomic additional rmw ops (release consistency) | addrmwrc aprmwrc andrmwrc orrmwrc xorrmwrc |
atomic additional rmw ops (relaxed consistency) | addrmwrlx aprmwrlx andrmwrlx orrmwrlx xorrmwrlx |
TODO: aren't the normal Boot operations already relaxed consistency?
Non-branching conditionals instruction extension
SIMD extension
TODO
Syscall extensions
These extensions add new syscalls.
Filesys syscall extension
Syscalls: TODO explain syscalls only; maybe put these sorts of extensions under a separate heading?
filesys | read write open close seek flush poll |
Environment variables extension
IPC 1 syscall extension
TODO
IPC 2 syscall extension
TODO
TUI syscall extension
TODO
Process control syscall extension
TODO
Local memory allocation syscall extension
TODO
- xlib 2: malloc(size: uint32): Allocate a new region of memory of SIZE bytes and return a pointer to the beginning of it.
- xlib 3: mfree(region: ptr): Free a region of memory beginning at pointer REGION.
The REGION argument must have been returned by a previous malloc, and must not have been previously mfree'd.
Memcpy syscall extension
TODO
- xlib 1: memcpy(dst: ptr, src: ptr, size: int32): Copy SIZE bytes starting at memory location SRC to memory starting at memory location DST.
Shared memory allocation syscall extension
memory allocation | malloc_shared malloc_local |
(TODO: which of malloc_shared/malloc_local is ordinary malloc? I think the ordinary malloc is already malloc_local)
Clocks syscall extension
TODO
- enumerate available clocks and/or request clock with capabilities
- default wallclock clock
- default monotonic clock (may reset upon each Boot invocation) (units unknown?)
- get current time of clock
- get info about clock (e.g. precision/units of clock)
- what about asking for 32-bit vs 64-bit precision? What about getting the date? Do we offer seconds since unix epoch? Nanoseconds?
- do timers and alarms go in here, or elsewhere? alarms seem like a process control thing
- is setting clocks allowed (probably not?)
see https://stackoverflow.com/questions/3523442/difference-between-clock-realtime-and-clock-monotonic https://man7.org/linux/man-pages/man2/clock_gettime.2.html
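As a rough illustration of the wallclock/monotonic distinction in the list above, here is how a host written in Python might back two default clocks. The clock names, the choice of nanoseconds as the unit, and the function names `clock_now_ns` and `clock_info` are assumptions for this sketch; none of them are decided by this document.

```python
import time


def clock_now_ns(clock):
    """Current reading of a clock in nanoseconds (the unit is an assumption)."""
    if clock == "wallclock":
        return time.time_ns()        # nanoseconds since the Unix epoch
    if clock == "monotonic":
        return time.monotonic_ns()   # arbitrary origin, never goes backwards
    raise ValueError(f"unknown clock: {clock}")


def clock_info(clock):
    """Report basic properties of a clock, as the host sees them."""
    host_name = {"wallclock": "time", "monotonic": "monotonic"}[clock]
    info = time.get_clock_info(host_name)
    return {
        "resolution_s": info.resolution,  # resolution in seconds
        "monotonic": info.monotonic,      # True if the clock cannot go backwards
        "adjustable": info.adjustable,    # True if the clock can be set
    }
```

The monotonic clock is the one to use for measuring intervals, since the wallclock may be adjusted while the program runs.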
Restrictive extensions
These extensions further specify details which are unspecified in Boot.
Vanilla 32-bit extension
- integers are defined to be 32-bit, represented using little-endian with twos-complement for signed values
- arithmetic is mod 2^32
- pointers are defined to be represented as 32-bit integers
- INT32_SIZE is 4
- INT16_SIZE is 2
- PTRD_SIZE is 4
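The Vanilla 32-bit integer rules above (arithmetic mod 2^32, twos-complement, little-endian) can be modelled in a few lines of Python. The helper names `add32`, `to_signed32`, and `encode_int32` are hypothetical and exist only for this sketch.

```python
import struct

MASK32 = 0xFFFFFFFF  # arithmetic is mod 2^32


def add32(a, b):
    """Add two integers with wraparound mod 2^32."""
    return (a + b) & MASK32


def to_signed32(x):
    """Interpret the low 32 bits of x as a twos-complement signed value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def encode_int32(x):
    """Encode x as 4 little-endian bytes (INT32_SIZE is 4)."""
    return struct.pack("<I", x & MASK32)
```

For instance, `add32(0xFFFFFFFF, 1)` wraps around to 0, and `encode_int32(-2)` yields the bytes `fe ff ff ff`.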
Vanilla 64-bit extension
- integers are defined to be 64-bit, represented using little-endian with twos-complement for signed values
- arithmetic is mod 2^64
- pointers are defined to be represented as 64-bit integers
- INT32_SIZE is 4
- INT16_SIZE is 2
- PTRD_SIZE is 8
TODO
from old boot:
- instructions for compare-and-swap and memory fence
Boot instructions fall into three categories:
- Small profile: These can be easily ported almost anywhere
- Standard profile: This is what OVM requires. This profile adds integer division, reading the program counter, indirect branching to a previously read program counter value, floating point, atomics, constant tables, and 'systems' instructions for allocating memory, I/O and filesystem operations, interoperation, querying metadata about platform capabilities, and logging.
- Optional instructions: These are not required but can be added, either to expose additional facilities to Boot programs, or to provide more efficient native implementations of certain operations.
pushi popi pushp popp
arithmetic of ints (result is undefined if the result is greater than 32 bits):
- add $dest $src1 $src2: $dest = $src1 + $src2
- addi $dest $src1 #imm8: $dest = $src1 + #imm8
- sub $dest $src1 $src2: $dest = $src1 - $src2
- mul $dest $src1 $src2: $dest = $src1 * $src2
standard profile adds (52 instructions, for 92 total; all opcodes are below 128):
== |
---|
constants and constant tables | lkp lkpb jk lkf |
non-branching conditionals | cmovi cmovip cmovpp |
other control flow | lpc |
optional instructions (all opcodes are 128 or greater):
== |
---|
implementation-defined | impl1 thru impl16 |
interop | xentry xcall0 xcalli xcallp xcallii xcallmm xcallim xcallip xcallpm xcallpp xcall xcallv xlibcall0 xlibcalli xlibcallm xlibcallp xlibcall xlibcallv xret0 xreti xretp xpostcall |
64-bit jumps
indirect control flow | lci jy |
general lci, with target that doesn't have to be xentry
lpc, for (intrusive) debuggers?
ldptrd, for data, in addition to ldptri (lci)?
- lkp &dest #imm16: LoaD K-th Ptr constant into &dest
- lkpb #imm24: LoaD K-th Ptr constant into &3
- pcmp &dest &src1 &src2: &dest = 1 if &src1 > &src2, 0 if &src1 == &src2, or -1 if &src1 < &src2 (TODO: if we have opaque refs, what if they are incomparable?)
- lpc $dest _ _: $dest = PC (program counter)
- jk #imm24: Jump to the #imm24-th pointer in the pointer constant table
- jt $index #imm16: Jump to index within local jump table. The jump table is embedded in the instruction stream immediately following the JT instruction, and its length is #imm16 (so there are #imm16 32-bit entries in the table, taking up the same space as #imm16 Boot instructions). Each table entry is a 32-bit signed integer offset, in bytes, from the program location following the end of this jump table. Since Boot instructions are always 32 bits, these offsets should always be a multiple of 4; if the platform stores Boot instructions in some other format, it may need to adjust these offsets before executing the jump. The quantity $index is interpreted as an unsigned index into this table. If $index is less than #imm16, a jump is performed to the program location specified by the offset in the table entry at the given index; if $index is greater than or equal to #imm16, execution continues from the program location following the end of this jump table (equivalent to a jump to a table entry of offset 0).
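The jt target computation can be sketched in Python as follows. This assumes byte addressing, a 4-byte jt instruction, and a 32-bit $index register; the function name `jt_target` and the list representation of the table are illustrative only.

```python
def jt_target(pc, index, table):
    """Return the byte address executed after a jt located at byte address pc.

    table holds the signed 32-bit byte offsets that follow the jt
    instruction; len(table) plays the role of #imm16.
    """
    n = len(table)
    # The jt instruction occupies 4 bytes, followed by n 4-byte entries.
    end_of_table = pc + 4 + 4 * n
    # $index is interpreted as unsigned (32-bit register assumed here).
    index &= 0xFFFFFFFF
    if index < n:
        return end_of_table + table[index]
    # Out of range: fall through to the location after the table.
    return end_of_table
```

With a jt at address 100 and the table `[0, 8, -12]`, the end of the table is at 116, so index 1 jumps to 124, index 2 jumps to 104, and any index of 3 or more continues at 116.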
atomics (sequential consistency):
- casrmw{sc,rc,rlx} &dest $new $old: compare-and-swap atomic (must be within the same memory domain). Upon success, $3 = $new; otherwise, $3 = the contents of &dest. The sc/rc/rlx indicates one of sequential consistency, release consistency (casrmwrc is both an acquire and a release), or relaxed semantics.
- casrmw{sc,rc,rlx}p &dest &new &old is like casrmw{sc,rc,rlx}, but where the values are pointers instead of integers (and &3 is used instead of $3).
- fencesc $memory_domain _ _: instruction/memory access reordering barrier; prevents any memory operations on the given memory_domain from appearing to be reordered across the FENCE instruction. Sequential consistency semantics.
- malloc_shared &dest $size $memory_domain: Requests allocation of a block of $size bytes of memory in memory domain $memory_domain. If successful, a pointer to the new block is stored in &dest; otherwise the null pointer (&0) is stored in &dest. memory_domain is RESERVED for future use; always use $0 for now.
- malloc_local &dest $size: Like malloc_shared but the allocated memory must only be used for thread-local storage. All atomic operations lose their atomicity and ordering guarantees when acting on local memory (e.g. so lpsc becomes equivalent to ordinary lp, etc).
- mrealloc_local &dest $newsize &oldptr: attempts to allocate a new block of local memory of size $newsize, copy the contents of the entire block at &oldptr into it, and then mfree &oldptr. If it succeeds, the new block is assigned to &dest; if it fails, the null pointer (&0) is assigned to &dest; in this case &oldptr is not mfree'd.
- mrealloc_shared &dest $newsize &oldptr: attempts to allocate a new block of memory of size $newsize &oldptr
- mfree &src: deallocates &src
- {lp,lw,sp,sw}{sc,rc,rlx} are like {lp,lw,sp,sw} but atomic, and with {sequential consistency, release consistency, or relaxed} semantics, respectively.
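The mrealloc_local behaviour described above (allocate, copy, then free the old block, leaving the old block untouched on failure) can be sketched with a toy heap in Python. The `ToyHeap` class, its capacity limit, the use of 0 as the null pointer, and copying only as many bytes as fit when the new block is smaller are all assumptions made for this illustration.

```python
class ToyHeap:
    """A toy allocator: pointers are small integers, 0 stands for null."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = {}    # pointer -> bytearray
        self.next_ptr = 1

    def used(self):
        return sum(len(block) for block in self.blocks.values())

    def malloc(self, size):
        if self.used() + size > self.capacity:
            return 0        # allocation failed: null pointer
        ptr = self.next_ptr
        self.next_ptr += 1
        self.blocks[ptr] = bytearray(size)
        return ptr

    def mfree(self, ptr):
        del self.blocks[ptr]

    def mrealloc(self, newsize, oldptr):
        newptr = self.malloc(newsize)
        if newptr == 0:
            return 0        # failure: the old block is NOT freed
        old = self.blocks[oldptr]
        n = min(len(old), newsize)   # assumption: truncate if shrinking
        self.blocks[newptr][:n] = old[:n]
        self.mfree(oldptr)
        return newptr
```

A caller must check for the null result before discarding its copy of the old pointer, since on failure the old block is still live and still owned by the caller.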
(also need an instruction to flush icache? this might belong in some sort of self-modifying extension tho b/c a boot->platform compiler/interpreter might not be available at runtime)
Relaxed semantics operations are atomic but provide no other guarantees beyond their corresponding non-atomic variants. Release Consistency semantics are defined later, but for readers familiar with the term, they are RCpc; that is, the ordering operations themselves are ordered with Processor Consistency semantics. Release Consistency loads are acquires and stores are releases. Release Consistency also implies atomicity. Sequential Consistency operations provide the same guarantees as the corresponding Release Consistency operations, and in addition all Sequential Consistency operations appear in program order in a single total order over this memory_domain, observed by all threads, along with all other sequentially consistent instructions.
- devop $value $device #imm8: implementation-dependent control operation of type #imm8 on a device. If successful, $3 is set to 0; if unsuccessful, a non-zero error code is written to $3.
- xcall0 &target_address _ _: call external subroutine with no arguments
- xcalli &target_address $arg1 #imm8: call external subroutine with one integer argument ($arg1 + #imm8)
- xcallp &target_address &arg1 #imm8: call external subroutine with one pointer argument (&arg1 + #imm8)
- xcallm &target_address #imm16: call external subroutine with one integer argument (#imm16)
- xcallii &target_address $arg1 $arg2: call external subroutine with two integer arguments
- xcallmm &target_address #arg1_imm8 #arg2_imm8: call external subroutine with two immediate integer arguments
- xcallim &target_address $arg1 #arg2_imm8: call external subroutine with one integer argument and one immediate integer argument
- xcallip &target_address $arg1 &arg2: call external subroutine with one integer argument and one pointer argument
- xcallpm &target_address &arg1 #arg2_imm8: call external subroutine with one pointer argument and one immediate integer argument
- xcallpp &target_address &arg1 &arg2: call external subroutine with two pointer arguments
- xlibcall0 #libfn_imm24: call external library function #libfn_imm24 with no arguments. Equivalent to doing an lkp to load a pointer constant #libfn_imm24, then doing an xcall0 to that pointer.
- xlibcalli $arg1 #libfn_imm16: call external library function #libfn_imm16 with one integer argument $arg1. Equivalent to doing an lkp to load a pointer constant #libfn_imm16, then doing an xcalli to that pointer with $arg1 0.
- xlibcallm #arg1_imm8 #libfn_imm16: call external library function #libfn_imm16 with one immediate integer argument #arg1_imm8. Equivalent to doing an lkp to load a pointer constant #libfn_imm16, then doing an xcalli to that pointer with $0 #arg1_imm8.
- xlibcallp &arg1 #libfn_imm16: call external library function #libfn_imm16 with one pointer argument &arg1. Equivalent to doing an lkp to load a pointer constant #libfn_imm16, then doing an xcallp to that pointer with &arg1 0.
- xlibcall{ii,mm,im,ip,pm,pp}: etc (todo document, and add to tables above)
- lentry32 &dest ; #imm32: &dest = Load register with a code pointer to an xentry instruction
- jmp32 ; #imm32: unconditional jump
Some instructions are followed by data.
A semicolon means that the instruction is followed by data; 'instr ; data'.
jump constants only 32 bits
lentry and JMP data is relative to beginning of program
make move instructions non-branching conditionals:
- cp $dest $src $cond: if $cond == 0, then CoPy int from register to register (equivalent to addi $dest $src 0)
- cpp &dest &src $cond: if $cond == 0, then CoPy Pointer from register to register
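A rough Python model of cp as a branch-free select, assuming 32-bit integer registers held in a list. The register-file representation and the masking trick are illustrative; the notes above only fix the condition ($cond == 0 means copy).

```python
MASK32 = 0xFFFFFFFF


def cp(regs, dest, src, cond):
    """Copy regs[src] into regs[dest] only if regs[cond] == 0, without branching."""
    # take is all ones when the condition register is zero, otherwise all zeros.
    take = -(regs[cond] == 0) & MASK32
    regs[dest] = (regs[src] & take) | (regs[dest] & ~take & MASK32)
```

Because the result is computed with masks rather than a jump, an implementation can map this onto a native conditional-move or select instruction where one exists.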
a way to write Boot code into memory and then jump into it?
select (nonbranching conditional)
The motivation for malloc_local is that, in order to be able to provide the concurrency guarantees required by this spec, some Boot implementations may create and use locks to control access to blocks of shared memory returned by malloc; in some cases even non-atomic, unordered load or store instructions could cause the Boot implementation to acquire a lock. malloc_local lets such an implementation know that it does not have to setup and use locks for this memory segment, and represents an assurance by the programmer that this memory segment will only ever be accessed by the same thread that called malloc_local.
Note that an implementation may legally provide sequential consistency when the program requests only release or relaxed consistency; furthermore, the additional RMW ops may be implemented using the CAS primitive; therefore, all of the atomics in the optional instructions may legally be implemented using only the atomics in the small profile as primitives.
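As a sketch of the claim that the additional RMW ops can be built from CAS, the Python below models one shared word, with a lock standing in for hardware atomicity, and builds an atomic add from a CAS retry loop. The `Cell` class, the boolean return value of `cas` (the instruction descriptions report the outcome through $3 instead), and the choice to return the previous value from `add_rmw` are all assumptions made for illustration.

```python
import threading

MASK32 = 0xFFFFFFFF


class Cell:
    """One shared 32-bit memory word; the lock stands in for hardware atomicity."""

    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()

    def cas(self, new, old):
        """Store new if the word currently equals old; return True on success."""
        with self._lock:
            if self.value == old:
                self.value = new
                return True
            return False


def add_rmw(cell, delta):
    """Atomic add built only from CAS; returns the value before the add."""
    while True:
        old = cell.value
        new = (old + delta) & MASK32
        if cell.cas(new, old):
            return old
        # Another thread changed the word in between: reload and retry.
```

The same retry loop works for the and, or, and xor variants by swapping the line that computes `new`.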
instructions to allow alignment?
other forms of in,out which read many bytes at a time to/from preallocated buffers, and identify the device using pointers
- in2 &dest &device $length: read in up to $length memory locations from device whose pointer is &device, to buffer at pointer &dest
- out2 &device &src $length: write out up to $length memory locations from buffer at pointer &src to device whose pointer is &device
also nonblocking
---
floating point and other arith and other instrs from wasm and llvm
syscalls: plan9, klambda, posix, windows, macos, musl libc, android, ios, aws, freertos, python, l4, lists of frequent syscalls, wasm
clocks file rw seek file handle management open/close file management mv attributes networking nonblocking io python event loop
concurrency rw instructions (loads/stores with various memory orders) concurrency rmw instructions (cas, etc) concurrency process management instrs (fork etc)
tui: setcursorabsolute, setcursorrelative, getdimensions, setdimensions, clearscreen, printcharatcursor, getchar
graphics setpixel, getpixel, setpalette, setmode, getmodes, setcustommode? (custom screen size, custom #s of colors)
audio
pico8 https://www.lexaloffle.com/bbs/?tid=28207
---
note in boot spec that bootx will define some syscalls below 128? and some sinfos? and some/all instructions? or maybe just dont mention it much? or maybe say that some things are RESERVED for extension languages?
---
sinfo for:
- FEATURES bitmask
- is instruction N supported? (subquery)
- is syscall N supported? (subquery)
- xcall &target_address #nargsi_imm8u #nargsp_imm8u: eXternal CALL subroutine
- xentr _ #nargsi_imm8u #nargsp_imm8u: eXternal ENTRypoint to Boot function
- xaftr _ #nretsi_imm8u #nretsp_imm8u: place immediately AFTeR xcall or xlib
- xret0 &return_address _ _: RETurn void to external platform
- xreti &return_address $return_val #imm8: return int32 ($return_val + #imm8)
- xretp &return_address &return_val #imm8: return ptr (&return_val + #imm8)
- xcall: call external subroutine with #nargsi_imm8u integer arguments and #nargsp_imm8u pointer arguments. Before executing this instruction, the arguments must be placed as per the Boot Calling Convention.
- xentr: This instruction should be placed at each entry point that may be called from foreign code. #nargsi_imm8u and #nargsp_imm8u indicate the number of integer and pointer arguments expected. Every code path starting in xentr must end in an xret or xtail, with no other 'xentr's in between. The xentr most immediately previous to an instruction, if any, is considered to begin an 'xentr subroutine' containing that instruction. No source instruction shall branch or jump to a target location within any xentr subroutine, unless either both source and target are within the same xentr subroutine, or the jump is by way of one of the instructions: xcall, xtail, xlib, xret0, xreti, or xretp.
- xaftr: reenter Boot function after returning from external subroutine call. This instruction should immediately follow each xcall. #nretsi_imm8u and #nretsp_imm8u indicate the number of int32s and ptrs being returned, respectively.
- xret0, xreti, xretp: return from Boot function to external platform at &return_address. xreti's int32 $return_val is interpreted as signed. &return_address must be the value that was in pointer register 4 upon the corresponding xentr.
- xretp: if &return_val holds a ptrc, then #imm8 must be 0
- xtail: xtail cannot be used to call any function taking more than 3 integer arguments or more than 3 pointer arguments, unless it is within an xentr routine.
If the function being called takes a variable number of arguments, then the total number of integer arguments is passed in register $11 and the total number of pointer arguments is passed in register $12.
If more than 3 integer arguments or more than 3 pointer arguments need to be passed, then a pointer to the remaining integer arguments is passed in &11 and/or a pointer to the remaining pointer arguments is passed in &12. The contents of the memory holding the additional arguments may be overwritten by the callee, just as with registers 5,6,7. However, the registers 11,12 (both banks) themselves are still callee-saved and, if modified, must be restored before return. The callee must not deallocate the memory pointed to by pointer registers 11 or 12 (that is, the memory holding the additional arguments).
On platforms that pass values which are neither 32-bit integers nor pointers, such an argument is passed as an integer if the value is guaranteed to fit within 32 bits; otherwise the value is stored in memory and a pointer to the value is passed.
memory allocation | malloc mfree |
interop | xcall xentr xaftr xret0 xreti xretp xtail |
---
- l8m #imm9u: Load 8-bit iMmediate int constant (the 8 least-significant-bits of the #imm9u) into either $1 or $2, depending on if the most-significant-bit of the #imm9u is 0 or 1, respectively
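The l8m operand split can be sketched in Python: the low 8 bits are the constant and the top bit of the 9-bit immediate selects the destination register. The dict used as a register file and the function name are illustrative only.

```python
def l8m(imm9u, regs):
    """Load the low 8 bits of imm9u into $1 or $2; return the register used."""
    value = imm9u & 0xFF                      # 8 least-significant bits
    target = 2 if (imm9u >> 8) & 1 else 1     # most-significant bit picks $1 or $2
    regs[target] = value
    return target
```

For example, an immediate of 0x1AB loads 0xAB into $2, while 0x07F loads 0x7F into $1.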
---
undef behav:
- trying to return, using the xret functions, to any &return_address other than the one passed in upon xentr
- branching or jumping between distinct xentr subroutines, or into an xentr subroutine from outside of it (without using the interoperation instructions).
- failing to restore callee-saved registers before returning or tail calling with xret0, xreti, xretp, xtail
- mfreeing the memory allocated by a caller for extra arguments
---
- jmp #imm22u: unconditional JuMP
---
split 8-bit immediate offsets into 2 4-bit immediate offsets, and have one of those be ints, and the other be ptrs, so that you can specify an offset into a struct mixing ints and ptrs
---
something like RISC-V's RV32V (see section 'Why RISC-V's RV32V vector extension is better than fixed-width SIMD' in the plBook RISC-V chapter for why this instead of traditional SIMD)
see also ARM SVE, SVE2
---
at least 16-way permutes/shuffles (register) scatter/gather (memory; can be used for longer permutes, but in memory)
e.g. ARM NEON VTBL, VTBX; see https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/coding-for-neon---part-5-rearranging-vectors
e.g. consider also stuff like vpshufb, vpermps, vcompressps, vpscatterdd, vpgatherdd; also see https://branchfree.org/2018/05/30/smh-the-swiss-army-chainsaw-of-shuffle-based-matching-sequences/ ?
if we restrict ourselves to 16-way stuff, then we have 4-bit indices, and we can pack 16 indices into 64 bits.
see also https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/coding-for-neon---part-5-rearranging-vectors https://www.cnx-software.com/2017/08/07/how-arm-nerfed-neon-permute-instructions-in-armv8/
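The arithmetic behind the 16-way restriction (16 indices of 4 bits each fit in one 64-bit word) can be sketched in Python. The lane ordering (lane 0 in the least-significant nibble) and the function names are assumptions for illustration; these notes do not fix an encoding.

```python
def pack_indices(indices):
    """Pack 16 lane indices (each 0..15) into one 64-bit control word."""
    indices = list(indices)
    if len(indices) != 16 or any(not 0 <= i < 16 for i in indices):
        raise ValueError("need exactly 16 indices in the range 0..15")
    word = 0
    for lane, idx in enumerate(indices):
        word |= idx << (4 * lane)   # lane 0 occupies the lowest nibble
    return word


def permute16(word, src):
    """Apply a packed control word: output lane i takes src[index_i]."""
    return [src[(word >> (4 * lane)) & 0xF] for lane in range(16)]
```

The identity permutation packs to 0xFEDCBA9876543210 and a full reversal packs to 0x0123456789ABCDEF, so any 16-way register permute is described by a single 64-bit operand.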
---