proj-oot-bootxReferenceOld201101

  1. Boot (Oot BootX) reference

Version: unreleased (0.0.0-0)

VERY EARLY DRAFT

BootX names and defines various conventions, extensions to, subsets of, and flavors of Boot.

The extensions are grouped into 'profiles'. Each profile contains all of the functionality of all preceding profiles.

To reduce fragmentation, implementors are encouraged to implement entire profiles rather than arbitrary combinations of extensions. Individual extensions are nevertheless defined, because in special circumstances it may make sense to implement some profile plus some other extensions.

  1. # Conventions
      1. The Boot standard calling convention (and register usage convention)

Registers:

The memory stack grows downwards (so the 'top' of the stack is 'lowest' in memory).

All arguments are passed on the stack. Each stack frame is divided into two blocks:

Within each block, arguments are in right-to-left order (the last argument specified in the function documentation is highest in memory).

If the number of arguments is or might be variable, the variable arguments are also on the stack, in the appropriate block by type, below (lower in memory than) the fixed arguments. Below them, in the integer block, is the number of integer variable arguments, and below that, also in the integer block, is the number of pointer variable arguments.

The caller puts the return address into the link register (&4) before the call.

The callee cleans up (bumps up the stack pointer before returning).
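The stack discipline above can be sketched in Python. This is a minimal model, not part of the spec: memory is a dict keyed by word address, only one block is modeled (the split into two blocks is omitted), and the names `Stack`, `caller_push_args` and `callee` are hypothetical.

```python
# Hypothetical model: memory is a dict from word address to value.
class Stack:
    def __init__(self, base=1000):
        self.mem = {}
        self.sp = base  # stack pointer; the stack grows toward lower addresses

    def push(self, value):
        self.sp -= 1          # growing downward: decrement, then store
        self.mem[self.sp] = value

def caller_push_args(stack, args):
    # Right-to-left: push the last argument first, so it ends up highest in memory.
    for a in reversed(args):
        stack.push(a)

def callee(stack, nargs):
    # Read the arguments in left-to-right order, starting at the top of the stack.
    args = [stack.mem[stack.sp + i] for i in range(nargs)]
    # Callee cleans up: bump the stack pointer back up before returning.
    stack.sp += nargs
    return args

s = Stack()
caller_push_args(s, [10, 20, 30])
top_address = s.sp
received = callee(s, 3)
```

After the call, `received` holds the arguments in declaration order and the stack pointer is back at its starting value.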


TODO


These extensions variously add instructions, define additional lib calls, define additional sinfo queries, or further specify details which are unspecified in Boot.

These extensions are grouped into 'profiles'.

A calling convention is defined.

  1. # Profiles

The profiles are arranged in a linear ordering from smallest to largest, and each profile includes all of the smaller profiles.

The following profiles are defined:

Any of these may be suffixed with either 'vanilla 32-bit' or 'vanilla 64-bit' to indicate the combination of the indicated profile with the indicated 'vanilla' restrictions.
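As a rough illustration of the linear ordering and the 'vanilla' suffixes, here is a small Python sketch. The profile order (Tiny, Small, Standard, Performance) is an assumption taken from the order of the sections in this draft, and the helper names `includes` and `full_name`, as well as the exact spelling of the combined name, are hypothetical.

```python
# Assumed ordering, smallest to largest, taken from the section order of this draft.
PROFILES = ["Tiny", "Small", "Standard", "Performance"]

def includes(larger, smaller):
    """True if profile `larger` contains everything in profile `smaller`."""
    return PROFILES.index(larger) >= PROFILES.index(smaller)

def full_name(profile, vanilla_bits=None):
    """Build a name such as 'Small vanilla 32-bit'; the spelling is an assumption."""
    if profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile}")
    if vanilla_bits is None:
        return profile
    if vanilla_bits not in (32, 64):
        raise ValueError("vanilla restrictions are defined only for 32-bit and 64-bit")
    return f"{profile} vanilla {vanilla_bits}-bit"
```

Because the ordering is linear, checking whether one profile includes another reduces to comparing positions in the list.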

  1. # The Boot Calling Convention ##

TODO: looking ahead to lovm, mb we dont want to save any arguments in these registers, so that we put them all in the smallstacks and have our nice register window-like convention to rotate them from output argument registers to input argument registers. OTOH we want some of our syscalls to work without memory access. Maybe the thing to do is to have a separate calling convention for syscalls. Or, maybe just call that the 'Boot calling convention', and have a different one for LOVM.

Registers:

todo: the numbers below are out of date

todo: now that we have the smallstack, probably pass all arguments on smallstack (right-to-left order; the last argument is pushed first). Also consider leaving the last smallstack spot open? otoh that would only be 3 arguments. - so i guess the convention is that the smallstack must be empty except for these arguments? - also consider putting the first argument in a register, or even the first 2, so that (i) fns with few args can have them all in regs, (ii) the first arg is available at any time. Not sure if that's a good idea tho. - if you do stuff like that, mb make sure at least 2 (regs+stack locs) are free even with the max # of args), so that you can at least do a SWAP w/o spilling anything to memory first

todo: I read somewhere that you should have callEE clean up of the stack for tail calls? I don't understand that but anyway shouldn't that apply to our extended argument allocated memory also? because that allocated memory could be on the stack.

todo: Also should we just use the stack pointer register to point to the allocated memory for extra arguments (e.g. most of the time this means putting it on the stack)?

Registers 5 and 6 (both integer bank and pointer bank) are used to pass arguments and return values (from lower to higher number get arguments from left to right). If the function is variadic (that is, if the number of arguments is not fixed), then the number of arguments is passed in $1. If more than 2 integer arguments or more than 2 pointer arguments must be passed, then a pointer to all arguments after the first is passed in &1. The callee may overwrite the memory passed in &1 but must not deallocate it. The conventions for passing return values are the same as the conventions for passing arguments, except that the number of return values must be fixed and must fit in registers 5 and 6 (both banks) plus whatever memory, if any, was allocated for additional arguments and passed in &1.

So, upon being called, the callee expects to find the first two int32 arguments in $5 and $6, the first two pointer arguments in &5 and &6, and the return address in register &7. If the function is variadic, the number of arguments is in $1. If there are more arguments than fit in registers $5, &5, $6, &6, a pointer to the additional arguments is in &1.

There may or may not be a memory stack, but if there is a stack, then the recommended convention is: the memory stack pointer is in &3, and the stack grows downward. Note: when there is a stack, and &1 is being used to point to additional arguments, &1 may or may not point into the stack.

The callee can overwrite and use registers 4 thru 7 (both banks) for any purpose. If additional arguments were passed in memory, the callee can also overwrite those arguments.

Upon returning, the caller expects a fixed number of return values. These are stored first in registers 5 thru 6 (both banks), and then, only if the caller passed a pointer to additional memory in &1, in that memory. The values in registers 1 thru 3 (both banks) are the same as they were at the beginning of the call. If the caller passed a pointer to additional memory in &1, then that memory has not been deallocated by the callee.
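The argument-passing side of this register convention might be modeled as follows. This is a sketch under stated assumptions: the overflow memory is assumed to hold the integers that did not fit followed by the pointers that did not fit (the draft does not fix a layout), the variadic count in $1 is assumed to be the total number of arguments, and `marshal_call` and the `"ptr-to-overflow"` placeholder are hypothetical names.

```python
def marshal_call(int_args, ptr_args, variadic=False):
    """Return (registers, overflow) for a call under the draft convention.

    Register names are strings such as '$5' (integer bank) and '&5' (pointer bank).
    """
    regs = {}
    # First two integer arguments go in $5, $6; first two pointers in &5, &6.
    for name, value in zip(("$5", "$6"), int_args):
        regs[name] = value
    for name, value in zip(("&5", "&6"), ptr_args):
        regs[name] = value
    # Assumption: overflow holds the integers that did not fit, then the pointers.
    overflow = list(int_args[2:]) + list(ptr_args[2:])
    if overflow:
        regs["&1"] = "ptr-to-overflow"  # placeholder standing in for a real address
    if variadic:
        # Assumption: $1 carries the total argument count for variadic functions.
        regs["$1"] = len(int_args) + len(ptr_args)
    return regs, overflow
```

For example, `marshal_call([1, 2, 3], ["a"])` places 1 and 2 in $5 and $6, "a" in &5, and leaves 3 in the overflow area referenced through &1.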

TODO: describe also the convention for spilling the smallstacks to the stack

frame layout of top frame on memory stack: TODO (this is planned to be used to store the counts of integer smallstack items and pointer smallstack items spilled to the memory stack, in conjunction with $3; altho mb not b/c we don't really need that many bits for that)

todo revise if necessary

todo current space left:

3-operand instructions: 18
2-operand instructions: 6
1-operand instructions: 3
0-operand instructions: 5

  1. # Tiny profile

The Tiny profile consists of the following new instruction extensions:

New architectural features:

The following new instructions are defined:

integer division div32
floating point fadd fsub fmul fdiv fcmp i32tof ftoi32 fcp fld fst beqf bltf floor ceil nearest trunc binf bnan
misc break

count of new 3-operand instructions: 8
count of new 2-operand instructions: 11
count of new 1-operand instructions: 0
count of new 0-operand instructions: 1

todo: gonna have to introduce 2 new pseudoinstructions (similar to opcode 63) to house all these 2-operand fp instructions

New opcodes:

opcode, mnemonic, has_i32_data, op0_type, op1_type, op2_type, op0_w, op0_r, PC_w, PC_r, mem_w, mem_r
,div32,0,ri,ri,ri,1,0,0,0,0,0
,fadd,0,rf,rf,rf,1,0,0,0,0,0
,fsub,0,rf,rf,rf,1,0,0,0,0,0
,fmul,0,rf,rf,rf,1,0,0,0,0,0
,fdiv,0,rf,rf,rf,1,0,0,0,0,0
,fcmp,0,ri,rf,rf,1,0,0,0,0,0
,beqf
,bltf

2-operand ones(?):

,i32tof,0,rf,ri,u3,1,0,0,0,0,0
,ftoi32,0,ri,rf,u3,1,0,0,0,0,0
,fcp,0,rf,rf,ri,1,0,0,0,0,0
,fld,0,K,rf,rp,u3,1,0,0,0,0,1
,fst,0,rf,rp,u3,0,1,0,0,1,0
,floor
,ceil
,nearest
,trunc
,binf
,bnan

New undefined behavior:

Semantics notes:

The floating-point width is implementation-dependent.

Most of the details of the semantics of floating point are implementation-dependent.

todo note: fcmp also serves as fclass; it returns a status word
todo note: itof takes a rounding mode as an argument? i think what wasm does is only offer trunc to turn floats into ints; the rounding stuff is float->float
todo define behavior of immediates in i32tof and ftoi32

todo: replace Ks with constants
todo: expand headers to match metadata table in boot_reference.md
todo: make simplified 'decode table' like in boot_reference.md
todo: those all hold for the other tables in here, too

todo: do we really want to have fcp fld fst i32tof ftoi32 without doing MORE16 so as to save encoding space?

New sinfo queries:

These sinfo query results are static.

  1. ## Small profile ###

The Small profile consists of the following new instruction extensions:

plus the following new lib call extensions:

The following new instructions are defined:

I/O in out in1 out1

todo: we could really just do two operands if this is all we need.. in MORE16

New opcodes:

opcode, mnemonic, has_i32_data, op0_type, op1_type, op2_type, op0_w, op0_r, PC_w, PC_r, mem_w, mem_r
,in1,0,rf,rp,u3,1,0,0,0,0,1
,out1,0,rf,rp,u3,0,1,0,0,1,0

The following lib calls are defined:

  1. ## xlib 2: malloc(size: $5)

Allocate a new memory region of SIZE bytes and return a pointer to its beginning.

      1. xlib 3: mfree(region: &5)

Free a region of memory beginning at pointer REGION.

REGION argument must have been returned by a previous malloc, and must not have been previously mfree'd.
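The mfree precondition can be illustrated with a toy checking allocator in Python. This is only a sketch: the class `CheckedHeap` and its bump-pointer address scheme are hypothetical, and nothing in this draft requires an implementation to detect the violation; a debugging build might choose to.

```python
class CheckedHeap:
    """Toy allocator that tracks live regions to detect misuse of mfree."""

    def __init__(self):
        self.next_addr = 16   # arbitrary nonzero start so that 0 can mean null
        self.live = {}        # region start address -> size in bytes

    def malloc(self, size):
        addr = self.next_addr
        self.next_addr += max(size, 1)   # never hand out the same address twice
        self.live[addr] = size
        return addr

    def mfree(self, region):
        if region not in self.live:
            # Either never returned by malloc, or already freed.
            raise ValueError("mfree of a region that is not live")
        del self.live[region]
```

Freeing the same pointer twice, or freeing an address that malloc never returned, raises an error in this model.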

New undefined behavior:

      1. Standard profile ###

The Standard profile includes everything in the small profile, plus the following extensions:

Instruction:

Syscall:

      1. Performance profile ###

The Performance profile includes everything in the Standard profile, plus the following extensions:

Instruction:

Syscall:

Functionality extensions

  1. ## Syscall functionality extension == TODO
      1. Oot assembly functionality extension == TODO (longer mnemonics?)

Instruction extensions

These extensions add new instructions.

  1. ## Floating point 1 instruction extension ###
floating point i2f lf sf ceil flor trunc nearest addf subf mulf divf copysign bnan binf beqf bltf fcmp

could save a few fp opcodes by having ftoi take an immediate operand specifying rounding type: floor ceil round nearest. mb could elide either bltf beqf or fcmp. could compress isinf isnan into fclass like riscv does. so: fadd fsub fmul fdiv fclass fcmp itof ftoi fcp fld fst. thats still a lot tho.
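The idea of folding the rounding variants into one ftoi with an immediate could look like the following Python sketch. The numeric encoding of the modes is hypothetical (the draft assigns none), the four modes are taken from the Tiny profile list (floor, ceil, nearest, trunc), and NaN, infinities and 32-bit overflow are not handled.

```python
import math

# Hypothetical immediate encoding of the rounding mode; the draft assigns no numbers.
ROUND_FLOOR, ROUND_CEIL, ROUND_NEAREST, ROUND_TRUNC = range(4)

def ftoi(x, mode):
    """Convert float x to an integer using the rounding mode given as an immediate."""
    if mode == ROUND_FLOOR:
        return math.floor(x)
    if mode == ROUND_CEIL:
        return math.ceil(x)
    if mode == ROUND_NEAREST:
        return round(x)       # Python rounds ties to the nearest even integer
    if mode == ROUND_TRUNC:
        return math.trunc(x)
    raise ValueError("unknown rounding mode")
```

With this shape, one opcode plus a 2-bit immediate replaces four separate conversion opcodes.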

  1. ## Floating point 2 instruction extension ###

Includes Floating point 1 and adds:

additional floating point remf sqrt minf maxf powf bnonfinite bgtf beqtotf blttotf ftotcmp

  1. ## Floating point Triglog instruction extension ###

Includes Floating point 2 and adds:

additional floating point TODO

trig, log, exp etc fns from math.h (at this point are we missing any other math from C library or math.h?)

  1. ## Floating point elusive eight instruction extension ###
additional floating point TODO

TODO

https://www.evanmiller.org/statistical-shortcomings-in-standard-math-libraries.html#functions

  1. ## Atomics seqc 1 instruction extension ###
atomics (sequential consistency) lpsc lwsc spsc swsc casrmwsc casprmwsc fencesc
  1. ## Atomics seqc 2 instruction extension ###
atomic additional rmw ops (sequential consistency) addrmwa aprmwa andrmwa orrmwa xorrmwa
  1. ## Atomics rc 1 instruction extension ###
atomics (release consistency) lprc lwrc sprc swrc casrmwrc casprmwrc
  1. ## Atomics rc 2 instruction extension ###
atomics (relaxed consistency) lprlx lwrlx sprlx swrlx casrmwrlx crsprmwrlx
atomic additional rmw ops (release consistency) addrmwrc aprmwrc andrmwrc orrmwrc xorrmwrc addrmwrc
atomic additional rmw ops (relaxed consistency) addrmwrlx aprmwrlx andrmwrlx orrmwrlx xorrmwrlx addrmwrlx

TODO: aren't the normal Boot operations already relaxed consistency?

  1. ## Non-branching conditionals instruction extension ###
      1. SIMD extension ### TODO

Syscall extensions

These extensions add new syscalls.

  1. ## Filesys syscall extension ###

Syscalls: TODO explain syscalls only; mb put these sort of extensions under a separate heading?

filesys read write open close seek flush poll
  1. ## Environment variables extension ###
      1. IPC 1 syscall extension ### TODO
      2. IPC 2 syscall extension ### TODO
      3. TUI syscall extension ### TODO
      4. Process control syscall extension ### TODO
      5. Local memory allocation syscall extension ### TODO
      6. Shared memory allocation syscall extension ###
memory allocation malloc_shared malloc_local

(TODO: which of malloc_shared/malloc_local is ordinary malloc? i think the ordinary malloc is already malloc_local)

  1. ## Clocks syscall extension ### TODO

see https://stackoverflow.com/questions/3523442/difference-between-clock-realtime-and-clock-monotonic https://man7.org/linux/man-pages/man2/clock_gettime.2.html

Restrictive extensions

These extensions further specify details which are unspecified in Boot.

  1. ## Vanilla 32-bit extension ###
      1. Vanilla 64-bit extension
    1. Details about semantics
      1. Stubs

Much functionality could be trivially implemented in a way that always returns a null result or an exceptional condition, or in a way that does not take advantage of native platform facilities, even when present. This is compliant provided that the corresponding functionality is indicated as 'stub' or 'partially stubbed'.

For example, if the Small profile is implemented but every attempt to allocate memory returns null, the implementation must not be described as "implementing the BootX Small profile", but may be described as "implementing the BootX Small profile, partially stubbed".

For another example, if the Standard profile is implemented in a way that prevents more than one thread/process from executing simultaneously, yet the target platform natively provides true parallel processing, then the implementation must be described as "Standard profile, partially stubbed".

Note that "stub" need not be stated in cases where the reader already knows that the native target does not support the given functionality.




TODO




from old boot:

Boot instructions fall into three categories:

 pushi popi pushp popp 

arithmetic of ints (result is undefined if the result is greater than 32 bits):

standard profile adds (52 instructions, for 92 total; all opcodes are below 128):

==
constants and constant tables lkp lkpb jk lkf
non-branching conditionals cmovi cmovip cmovpp
other control flow lpc
==

optional instructions (all opcodes are 128 or greater):

==
implementation-defined impl1 thru impl16
interop xentry xcall0 xcalli xcallp xcallii xcallmm xcallim xcallip xcallpm xcallpp xcall xcallv xlibcall0 xlibcalli xlibcallm xlibcallp xlibcall xlibcallv xret0 xreti xretp xpostcall

64-bit jumps

indirect control flow lci jy

general lci, with target that doesnt have to be xentry

lpc, for (intrusive) debuggers?

ldptrd, for data, in addition to ldptri (lci)?

atomics (sequential consistency):

(also need an instruction to flush icache? this might belong in some sort of self-modifying extension tho b/c a boot->platform compiler/interpreter might not be available at runtime)

Relaxed semantics operations are atomic but provide no other guarantees beyond those of their corresponding non-atomic variants. Release Consistency semantics are defined later, but for readers familiar with the term, they are RCpc; that is, the ordering operations themselves are ordered with Processor Consistency semantics. Release Consistency loads are acquires and stores are releases. Release Consistency also implies atomicity. Sequential Consistency operations provide the same guarantees as the corresponding Release Consistency operations, and in addition all Sequential Consistency operations appear in program order in a single total order over this memory_domain, observed by all threads, along with all other sequentially consistent instructions.

Some instructions are followed by data.

A semicolon means that the instruction is followed by data; 'instr ; data'.

jump constants only 32 bits

lentry and JMP data is relative to beginning of program

make move instructions non-branching conditionals:

a way to write Boot code into memory and then jump into it?

select (nonbranching conditional)

The motivation for malloc_local is that, in order to be able to provide the concurrency guarantees required by this spec, some Boot implementations may create and use locks to control access to blocks of shared memory returned by malloc; in some cases even non-atomic, unordered load or store instructions could cause the Boot implementation to acquire a lock. malloc_local lets such an implementation know that it does not have to set up and use locks for this memory segment, and represents an assurance by the programmer that this memory segment will only ever be accessed by the same thread that called malloc_local.

Note that an implementation may legally provide sequential consistency when the program requests only release or relaxed consistency; furthermore, the additional RMW ops may be implemented using the CAS primitive; therefore, all of the atomics in the optional instructions may legally be implemented using only the atomics in the small profile as primitives.
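The claim that the additional RMW ops can be built from CAS can be sketched in Python. This is an illustration only: the `Cell` class uses a lock to stand in for hardware atomicity, `add_rmw` is a hypothetical name loosely modeled on the addrmw mnemonics, and wrapping the sum to 32 bits is an assumption.

```python
import threading

class Cell:
    """A single memory word with a compare-and-swap primitive."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def load(self):
        with self._lock:
            return self._value

    def cas(self, expected, new):
        """Store `new` if the cell holds `expected`; return True on success."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

def add_rmw(cell, delta):
    """Atomic fetch-and-add built from CAS; returns the previous value."""
    while True:
        old = cell.load()
        # Retry if another thread changed the cell between the load and the CAS.
        if cell.cas(old, (old + delta) & 0xFFFFFFFF):  # wrap to 32 bits (assumption)
            return old
```

The same retry loop works for and, or and xor by changing the expression that computes the new value.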

instructions to allow alignment?

other forms of in,out which read many bytes at a time to/from preallocated buffers, and identify the device using pointers

also nonblocking

---

floating point and other arith and other instrs from wasm and llvm

syscalls: plan9, klambda, posix, windows, macos, musl libc, android, ios, aws, freertos, python, l4, lists of frequent syscalls, wasm

clocks file rw seek file handle management open/close file management mv attributes networking nonblocking io python event loop

concurrency rw instructions (loads/stores with various memory orders) concurrency rmw instructions (cas, etc) concurrency process management instrs (fork etc)

tui: setcursorabsolute, setcursorrelative, getdimensions, setdimensions, clearscreen, printcharatcursor, getchar

graphics setpixel, getpixel, setpalette, setmode, getmodes, setcustommode? (custom screen size, custom #s of colors)

audio

pico8 https://www.lexaloffle.com/bbs/?tid=28207

---

note in boot spec that bootx will define some syscalls below 128? and some sinfos? and some/all instructions? or maybe just dont mention it much? or maybe say that some things are RESERVED for extension languages?

---

sinfo for:


If the function being called takes a variable number of arguments, then the total number of integer arguments is passed in register $11 and the total number of pointer arguments is passed in register $12.

If more than 3 integer arguments or more than 3 pointer arguments need to be passed, then a pointer to the remaining integer arguments is passed in &11 and/or a pointer to the remaining pointer arguments is passed in &12. The contents of the memory holding the additional arguments may be overwritten by the callee, just as with registers 5,6,7. However, the registers 11,12 (both banks) themselves are still callee-saved and, if modified, must be restored before return. The callee must not deallocate the memory pointed to by pointer registers 11 or 12 (that is, the memory holding the additional arguments).

On platforms which pass values that are neither integers nor pointers: when such an argument is passed, if the value is guaranteed to fit within 32 bits, it is passed as an integer; otherwise the value is stored in memory and a pointer to the value is passed.

memory allocation mallo mfree
interop xcall xentr xaftr xret0 xreti xretp xtail

---

---

undef behav:

---

---

split 8-bit immediate offsets into 2 4-bit immediate offsets, and have one of those be ints, and the other be ptrs, so that you can specify an offset into a struct mixing ints and ptrs
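One way to read this note is sketched below in Python. The nibble assignment (high nibble counts integers, low nibble counts pointers) and the default sizes are assumptions, and `split_offset` and `byte_offset` are hypothetical names.

```python
def split_offset(imm8):
    """Split an 8-bit immediate into (int_count, ptr_count).

    Assumption: the high nibble counts integers and the low nibble counts pointers.
    """
    if not 0 <= imm8 <= 0xFF:
        raise ValueError("immediate must fit in 8 bits")
    return imm8 >> 4, imm8 & 0xF

def byte_offset(imm8, int_size=4, ptr_size=8):
    """Byte offset of a field that follows the given numbers of ints and pointers."""
    ints, ptrs = split_offset(imm8)
    return ints * int_size + ptrs * ptr_size
```

Under this reading, the same immediate resolves to different byte offsets on platforms with different pointer sizes, which may be the point of keeping the two counts separate.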

---

something like RISC-V's RV32V (see section 'Why RISC-V's RV32V vector extension is better than fixed-width SIMD' in the plBook RISC-V chapter for why this instead of traditional SIMD)

see also ARM SVE, SVE2

---

at least 16-way permutes/shuffles (register) scatter/gather (memory; can be used for longer permutes, but in memory)

e.g. ARM NEON VTBL, VTBX; see https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/coding-for-neon---part-5-rearranging-vectors

e.g. consider also stuff like vpshufb, vpermps, vcompressps, vpscatterdd, vpgatherdd; also see https://branchfree.org/2018/05/30/smh-the-swiss-army-chainsaw-of-shuffle-based-matching-sequences/ ?

if we restrict ourselves to 16-way stuff, then we have 4-bit indices, and we can pack 16 indices into 64 bits.
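The arithmetic here (16 indices times 4 bits equals 64 bits) can be made concrete with a short Python sketch. The lane order (index 0 in the lowest nibble) is an assumption, and `pack_indices` and `shuffle16` are hypothetical names.

```python
def pack_indices(indices):
    """Pack 16 four-bit indices into one 64-bit integer (index 0 in the low nibble)."""
    if len(indices) != 16 or any(not 0 <= i < 16 for i in indices):
        raise ValueError("need exactly 16 indices in the range 0..15")
    packed = 0
    for lane, idx in enumerate(indices):
        packed |= idx << (4 * lane)
    return packed

def shuffle16(values, packed):
    """Apply a 16-way permute: output lane k takes values[index k]."""
    return [values[(packed >> (4 * lane)) & 0xF] for lane in range(16)]
```

Packing the reversed sequence 15..0 and applying it reverses a 16-element vector; the identity permutation packs to 0xFEDCBA9876543210.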

see also https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/coding-for-neon---part-5-rearranging-vectors https://www.cnx-software.com/2017/08/07/how-arm-nerfed-neon-permute-instructions-in-armv8/

---

I/O in inp ?

and out outp?

I/O:

y'know, if we just make device a register, but make both STDIN and STDOUT device &0 (null ptr), then we're probably set (b/c the &0 register works for that). Then we dont need the #imm8u. Although we may still want in1 and out1 so that we can do i/o from registers, w/o main memory. Can i think of a clever way to stuff those into one instruction without changing the 'addressing mode' of operands depending on circumstance?

maybe for in, if &dest is &0 then $len is the dest? and for out, if &src is &0 then $len is the src?

i dunno, that's a little ugly. At least we can stuff in1 and out1 into MORE.

---

  1. # I/O ##

If the standard console streams STDIN and STDOUT exist on the platform and are supported by the implementation, they must be devices #0 and #1, respectively. Device #2 must be STDERR if it exists; otherwise it should be an alias to STDOUT, or may be a null device (one which never emits anything and to which writing has no effect).

An implementation does not have to support INP, OUTP.

(actually later i decided that both stdin and stdout should be device #0)

---

old stuff from that: When using xlib, there is no need to set the link register; these instructions will set it if needed. Upon making a call, up to 3 integer arguments are in integer registers 1, 2, 4, and up to 3 pointer arguments are in pointer registers 1, 2, 4, and the return address is found in pointer register 5. Upon returning from a call, up to 3 integer and up to 3 pointer return values will be found in registers 1,2,4 using the same convention as for calling.

---

undefined behavior for:

---

rem(ainder)

--

misc log break hint impl more16 nop

TODO should these be separated?

HINTs may be executed as NOPs. They are intended for forward compatibility; later versions of the specification may define semantics for various HINTs with the understanding that some implementations may execute them as NOPs.
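A dispatcher that honors this rule might look like the Python sketch below. Everything concrete in it is hypothetical: the hint number 1, its "prefetch" meaning, the `state` dict, and the one-step program counter increment are illustration only and are not defined by this draft.

```python
# Hypothetical table of hint numbers this implementation understands.
KNOWN_HINTS = {
    1: lambda state: state.setdefault("hints_seen", []).append("prefetch"),
}

def execute_hint(state, hint_number):
    """Run a HINT; unrecognized hints are executed as NOPs."""
    handler = KNOWN_HINTS.get(hint_number)
    if handler is not None:
        handler(state)
    # Either way the program counter advances as for any other instruction.
    state["pc"] += 1
    return state
```

An implementation with an empty table is still conforming under this rule, since every HINT then behaves as a NOP.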

---

in, out

57,in,0,ri,rf,u3,1,0,0,0,0,0
58,out,0,rf,rf,ri,1,0,0,0,0,0

---

mb make a 'restriction' (anti-extension) which bans mul, j32 and everything else with immediate data after the instruction in the instruction stream etc, and mb is 16-bit instead of 32-bit

---

i decided to leave them embedded in the instruction stream. Otherwise labels could only be loaded with an 8k range, which would make implementations jump thru hoops. As a bonus, now we can assume that any Boot implementation supports larger programs.

We can have a BootX anti-extension that disallows this.

---

  1. # Boot subsets

This section provides names for various useful subsets of Boot.

The Boot subsets in this section omit various features that are required for an implementation to be compliant with the Boot standard. Therefore an implementation which implements only these cannot be said to be a full Boot implementation, but rather can only be claimed to implement a subset of Boot.

  1. ## Static control flow subset

An implementation or program is said to support/use only the 'static control flow' subset of Boot if it does not provide/use:

    1. Tiny subset

An implementation or program is said to support/use only the 'tiny' subset of Boot if it does not provide/use:

    1. Boot flavors

This section provides names for various useful guarantees on the behavior of Boot implementations or programs.

Note that Boot implementations of these flavors may be full, compliant Boot implementations, since nothing in the flavor definitions requires a Boot implementation to contradict the Boot standard.

  1. ## Pure flavor

A Boot implementation or program is called 'pure' if it does not provide/use:

      1. Vanilla 32-bit flavor

A Boot implementation is called 'vanilla 32-bit' if it guarantees that:

      1. Vanilla 64-bit flavor

A Boot implementation is called 'vanilla 64-bit' if it supports the 64-bit integer extension (see above), and if it also guarantees that: