proj-oot-ootAssemblyThoughts

Oot Bytecode and Oot Assembly

See https://gitlab.com/oot/ovm

Introduction

Oot Bytecode (OotB) is a simple, easy-to-implement virtual machine. Oot Core is implemented in OotB. OotB has only a few 'primitive' instructions; this subset is called 'primitive OotB'. The OotB Stdlib contains implementations of other instructions built out of the primitive instructions. This makes it easy to bootstrap a naive implementation of Oot; all you have to do is implement an interpreter for primitive OotB.

Motivation

Why did we invent yet another bytecode format? Design goals include:

Popular existing bytecodes are not 'very easy' to port to a new platform, and are too 'low level' to directly preserve high-level language constructs (HLL constructs could be preserved via annotation, but then the annotation format effectively forms another programming language anyways).

Oot Bytecode is very easy to initially implement because it has so few primitive instructions. This naive implementation will be very slow but can be incrementally improved by overriding the default interpretation of various stdlib instructions with custom versions using the target platform's native constructs.

Oot Bytecode preserves high-level language constructs via the 'stdlib' instructions that are built out of the primitives. This allows a compiler to an HLL to emit somewhat readable target code that uses the platform's native data structures just by recognizing those stdlib instructions that have an obvious counterpart in the target platform.

Note that we want to be able to efficiently compile the Oot Core implementation from OotB to even a very inexpressive, static language. Therefore we try to limit run-time dynamicity in OotB and in the Oot Core implementation.

Overview

Oot Bytecode is the object language of an imperative virtual machine with very few primitive instructions. We hope that typical Oot Bytecode can be compiled with relatively little effort into relatively readable code in foreign HLLs.

It is similar to the ISA of a CISC 3-operand memory-memory machine, but it is more abstract (perhaps similarly to Lua bytecode). It abstracts away from details such as data representation format, and its 'memory locations' or 'registers' have semantics more like local variable identifiers than like an array of bytes (in this way it is again similar to eg Lua bytecode). It has operations analogous to pointer arithmetic, but these actually correspond to moving across fields in a struct rather than to indexing into an array of bytes. It is relatively agnostic regarding the actual bit width of integers, pointers, etc; in fact, a single item in the constant table can represent an arbitrarily large object. It is a memory-memory (rather than register) machine, but it can directly address (with absolute immediate addressing) only 4k 'memory locations'; however, opaque pointers can point to other 'memory locations' beyond what can be directly addressed.

It has 8 addressing modes, plus a 'boxed' addressing mode bit. The use of a 3-operand format and 8 addressing modes is designed to reduce the need for intermediate instructions that move data between virtual registers; such intermediate instructions make programs in other VM architectures slightly harder to compile into readable HLL code.

It has 3 instruction encodings, SHORT, MEDIUM, and LONG, all three of which consist of a sequence of 64-bit units. SHORT packs 8 fixed-length instructions into each 64-bit unit; MEDIUM is a fixed-length encoding with 64-bits per instruction; and LONG is a variable-length encoding. MEDIUM is the basic encoding; SHORT is a denser encoding that can only represent a subset of instructions; and LONG is an extensible encoding.

The compilation unit for Oot Bytecode is an Oot Bytecode Module. Modules have important limits as a consequence of length limitations in the 64-bit MEDIUM encoding; most notably, the number of functions directly exported by one single module cannot exceed 2048 (a single module can, however, also export other modules, and collectively these can export more than 2048 functions).

Oot Bytecode has polymorphic typing, first-class functions, and a function-calling syntax in which the opcode field of the instruction, instead of containing a number identifying an Oot Bytecode Core instruction, contains a reference to a memory location which itself contains a reference to a particular function. This is accomplished by giving an addressing mode to the opcode (hence, the opcode field is called the 'opcode operand').

In addition, many integer opcodes are assigned to refer not to Oot Bytecode Core primitive instructions, but rather to 'custom instructions' implemented in the Oot Bytecode Standard Library. These opcodes do not need to be implemented by an Oot Bytecode implementer; they need only implement the Oot Bytecode Core instructions, since the Library is ultimately defined in terms of those primitives. Advanced implementations can treat the custom instructions as compile-time macros and expand and inline them at compile-time; or they can replace the provided implementations with platform-native implementations.

Oot Assembly, the textual language, also has a separate compile-time macro facility; these macros are expanded during compilation to Oot Bytecode. Why have 'custom instructions' if you already have compile-time macros? Because Oot Bytecode 'custom instructions' can more easily be replaced with platform-native implementations by an interpreter.

SootB

todo now it's the same as SHORT, plus 64-bit LOADI (16-bit immediates) with restricted addr modes, and maybe a STOREI that can write to 11-bits of registers

SootB is a simple stack-machine, single-threaded subset of OotB. SootB is designed to be really easy to implement an interpreter for. There is an OotB interpreter implemented in SootB, so this allows you to bootstrap OotB (and hence Oot) from just a SootB interpreter. This will be slow and you will probably want to eventually extend your SootB interpreter into a full OotB interpreter, but in the meantime you'll have a working Oot installation. Since SootB is a subset of OotB, you can work on extending it incrementally (however, you won't experience any speed gains until you are all done, at which point you can throw out the provided OotB-on-SootB interpreter and replace it with your own native OotB interpreter).

In the following, TOS stands for 'top-of-stack', the value on top of the stack.

A SootB machine has program code, a program counter, a stack, a linear memory, and an error register. It is untyped. Each location in the stack and the memory can hold a value which can be treated alternately as an unsigned short integer (which can hold at least the values from 0 to 32767, inclusive), or as a pointer. Pointer arithmetic and arithmetic between pointers and integers is permitted.

Program inputs including any file descriptors needed for console I/O are placed on the stack before starting the program. The program terminates when a HALT instruction is encountered.

SootB instructions:

Instructions must be 64-bit aligned. Note that ANNOTATE, J, BRZ, BRNZ, LOAD, STORE, SYSCALL, and LOADI are encoded as 16 bits (4 bit opcode and 12 bit data), and everything else is just 4-bits (just the opcode).

The code of the initial program is not necessarily accessible via memory. However, you can allocate memory and put code in it and jump into that.

Program code does not necessarily have to fit in 4k, but you can only absolute jump to the first 4k of memory. You could have a jump table and LOAD an item from it and then JPOP to that, though. Similarly, by storing a pointer in the 4k you can load/store beyond 4k, but you can only directly load/store to the first 4k.

Annotation types:

some syscalls:

In more detail:

todo: we need some way for GC to query for a list of pointers in any memory subspace, or at least if a given memory location contains a pointer

Instruction encoding:

Each instruction is encoded as a fixed-length 64-bit chunk, given below:

todo

How SootB is a subset of OotB

Each SootB instruction is also an OotB instruction:

OOTB imports and loading

Behavior inherited from Oot

how module loading works

To minimize PATH manipulation, the PATH must be manipulated via the path_insert() and path_remove() functions. To minimize filesystem queries, the contents of filesystem directories in the PATH may be examined and cached at any time (they may or may not be examined lazily rather than at program start); you must call the system path_cache_invalid(subpath) function if you need to discard part of this cache after a possible mid-runtime change in the contents of these directories.

The init() functions of modules are run at some point before the namespace exported by a module is accessed by other modules, but possibly lazily, in parallel, and in any order. A module may import a module that imports it (cyclic imports are allowed). However, the init()s of modules must not cyclically depend on one another.

If a module is imported by multiple other modules, these may or may not be realized as separate instances of the module.

Authentication

A module is considered 'weakly authenticated' or just 'authenticated' if it is signed by a public key trusted by the implementation (typically this will include keys given on the commandline and keys in a persistent keystore), and if all of its imports are authenticated (or, in the case of cyclic imports, if all modules in the cycle are signed by a public key trusted by the implementation). It is considered 'strongly authenticated' if it is authenticated and all of its non-official import statements include an inline public key or public key hash, and all of its imports are strongly authenticated (or, in the case of cyclic imports, if all modules in the cycle meet these conditions).

Oot implementations may include commandline switches to determine:

Instruction encodings

todo: the following is a little out of date. The current formats are:

8 bits (Bootx): 2 form bits; 64 opcodes which are just shortcuts for longer fully specified instructions.

16 bits (Boot): 1 form bit, 4-bit opcode, 3-bit operand, 2x4-bit operands.

32 bits (BootX): 3 form bits, 5-bit opcode, 3x4-bit operands, 4x3-bit addr modes.

64 bits (OVM): 4 form bits, 5x8-bit fields, 5x4-bit addr modes; the fields are opcode, generic polymorphism type specifier, and 3 operands; the addr modes are as in BootX plus another bit for boxed/unboxed.

Oot Bytecode instructions come in (64-bit-aligned) chunks of 64 bits. Each 64-bit chunk is either in SHORT, MEDIUM, or LONG format. The first one or two bits (MSBs) are 'format bits' which indicate whether the format is SHORT, MEDIUM, or LONG. If the first bit is 0, then the format is SHORT. If the first two bits are 11, then the format is MEDIUM. If the first two bits are 10, then the format is LONG.
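
To make the dispatch concrete, here is a minimal Python sketch that classifies a 64-bit chunk by its format bits; the function name and the convention that bit 63 is the first (most significant) bit are assumptions for illustration, not part of the spec.

def chunk_format(chunk):
    # 'chunk' is a 64-bit chunk as an unsigned integer; bit 63 is the first bit
    if (chunk >> 63) == 0:
        return 'SHORT'
    if (chunk >> 62) == 0b11:
        return 'MEDIUM'
    return 'LONG'                    # top two bits are 10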

SHORT encoding format:

4 opcode bits + 3 x (2 addr mode bits + 2 operand bits)

(note: every 64 bits, the first opcode bit must be "0")
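
As a rough sketch of the field layout just described, the following decodes one 16-bit SHORT slot; the MSB-first ordering of the fields within the slot is an assumption for illustration.

def decode_short_slot(word):
    # word: one 16-bit SHORT slot: 4 opcode bits, then 3 x (2 addr-mode bits + 2 operand bits)
    opcode = (word >> 12) & 0xF      # in the first slot of a chunk, the top opcode bit is the '0' format bit
    operands = []
    for shift in (8, 4, 0):          # three operand fields, assumed packed MSB-first
        mode = (word >> (shift + 2)) & 0x3
        data = (word >> shift) & 0x3
        operands.append((mode, data))
    return opcode, operands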

MEDIUM encoding format:

2 form bits, 2 implementation-dependent bits, 4 x (3 addr mode bits + 12 operand-data bits)
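
And a corresponding hedged sketch for MEDIUM; which of the four (addr mode, data) fields plays the role of opcode operand, destination, or input is left to the field-role discussion further below, and the MSB-first ordering here is again an assumption.

def decode_medium(chunk):
    # chunk: a 64-bit MEDIUM instruction (top two form bits are 11)
    impl_bits = (chunk >> 60) & 0x3            # 2 implementation-dependent bits
    fields = []
    for i in range(4):                         # 4 x (3 addr-mode bits + 12 operand-data bits)
        shift = 45 - 15 * i
        mode = (chunk >> (shift + 12)) & 0x7
        data = (chunk >> shift) & 0xFFF
        fields.append((mode, data))
    return impl_bits, fields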

LONG encoding format:

TODO mb add csexp and XML stuff, see ootAssemblyNotes13.

LONG format is a variable-length instruction format. The length of a LONG instruction ranges from 64 bits to 2048 bits (128 16-bit words), and is always a multiple of 64 bits.

LONG consists of a stream of 16-bit words. This is organized into:

The first word is the initial LONG header:

The amount of trailing padding can be inferred by rounding up 'length' to the next multiple of 4, and subtracting the given 'length' from that number.
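
For instance (a one-line sketch; 'length' is counted in 16-bit words, per the description above):

def trailing_padding_words(length):
    # pad the LONG instruction out to the next multiple of 4 16-bit words (a 64-bit boundary)
    return -length % 4        # e.g. length 5 -> 3 padding words; length 8 -> 0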

LONG_TYPE:

In general, LONG_TYPEs below 64 may be skipped.

A phrase header word is of one of two formats, distinguished by the value of their most-significant bit:

Single-word phrase header format:

Multi-word phrase header format:

When the payload-is-phrase bit is 0, this means that the word immediately following the header is a 16-bit literal constituting the 'payload' (role 'operation/payload') of the phrase; if the phrase length is longer than 1, then the next word after that must be a (sub-)phrase header. When the payload-is-phrase bit is 1, this means that the word immediately following the header is a (sub-)phrase header (and the payload, if there is one, will be a subphrase with role 'payload', or with roles 0-15).

To encode a payload literal longer than 16 bits (available choices are 32 bits, 64 bits, 128 bits, and 256 bits), set payload-is-phrase to 'true' and then encode the payload within a subphrase with some role 0-15 (the role here indicates the amount of padding within the subphrase). Note that these subphrases must have leading padding as needed so that the payload literal is aligned. Phrases with roles 0-15 must always have 'payload-is-phrase' set to 0 and must not have any subphrases.

A sub-phrase is just a nested phrase and has the same format as a phrase.

The 'phrase role' defines the semantics of the phrase with respect to its parent phrase (or with respect to the entire instruction, if top-level). (A subset of) 32 roles are currently defined:

roles 0-15:

roles 16-31:

A null single-word phrase header (that is, one whose value as an integer is 0) is typically used for padding. Note that any phrase header whose value as an unsigned integer is less than 2048 has the most-significant-5-bits set to 0, meaning that it is a single-word phrase header of role 0, and can be ignored.

A top-level 'operation' role is like the opcode field of a MEDIUM instruction (field 1). A top-level 'input' role is like the input operands of a MEDIUM instruction; to give multiple operands, have multiple phrases with the 'input' role, with the 'rightmost' one first (eg the first phrase with an 'input' role corresponds to field 4 of a typical MEDIUM instruction, and the next phrase with an 'input' role corresponds to field 3 of a typical MEDIUM instruction). A top-level 'output' role is like the output operands of a MEDIUM instruction, that is, the 'dest' field (field 2).

(todo: what about instructions with atypical numbers of inputs and outputs? in that case, do we follow instruction semantics, or stick with MEDIUM field correspondence?)

A top-level 'context' role is like the 'context' field (field 0) of a MEDIUM instruction.

To encode an immediate constant operand, just place the constant literal as the payload of a phrase with a suitable role (eg to encode an immediate literal opcode, create a phrase with an 'operation' role and place a word with the opcode itself as the payload of this phrase).

To encode an operand with some other addr mode, set the operand as the payload, but include a sibling phrase of role 'addr mode' with the desired addr mode as payload.

To encode a CALL instruction with many arguments, just give many phrases with role 'input'.

The semantics of the other roles (annotation, both input and output, modality, constraint, conjunction, alternate format, grouping) have not yet been defined.

Addressing modes

addr modes:

000: immediate constant
001: register direct (note: in SHORT mode, register direct applied to DATASTACK or CALLSTACK (registers 0 and 1) operates on registers 4 and 5 instead; in other modes, it is simply illegal to apply register direct to DATASTACK or CALLSTACK. Similarly, during execution of a custom instruction in SHORT mode, register direct on register 2 redirects to register 6.)
010: register indirect
011: stack / register indirect, with the value in the register post-incremented on read and pre-decremented on write
100: constant table
101: boxed register direct
110: boxed register indirect
111: boxed stack / register indirect, with the value in the register post-incremented on read and pre-decremented on write

With the side-effecting addr mode (stack mode 011: post-increment on read, pre-decrement on write), if more than one appears in the same instruction, then for the primitive instructions they are processed from right to left.

Note that only addressing modes 000-011 are available for the SHORT opcode operand.

immediate constant: the integer contained in the operand is the value. If an immediate constant is given as an argument to an instruction which expects a writable effective address, then a 'dummy' effective address is generated, but writes to it are discarded. Note: a read from a 'boxed' immediate constant is a constant table read.

constant table: the integer contained in the operand is an index into the module's constant table. If a constant table operand is given as an argument to an instruction which expects a writable effective address, then a 'dummy' effective address is generated, but writes to it are discarded.

register direct: the operand is the effective address

register indirect: the operand is interpreted as an address, and the effective address is the value found at that address. Note: This address mode provides the functionality provided by LOAD/STORE in RISC architectures.

indirect-offset: The operand is split into two fields; let A = the first 6 bits, and let B = the second 6 bits. A is interpreted as an address; let C be the value found at that address, where C is to be interpreted as an address. The effective address is C + B. Note: this addressing mode provides efficient access to fields within a struct, where a pointer to the struct is in one of the first 64 memory locations, and the struct has up to 64 fields.

stack (indirect with read post-increment, write pre-decrement): This is like 'register indirect', except that if the effective address is read from, the value held in the location named by the operand is incremented after the rest of the instruction executes, and if the effective address is written to, the value held in the location named by the operand is first decremented before the effective address is determined. Note: this addressing mode provides the equivalent of POP and PUSH commands when applied to a stack pointer with a downward-growing stack, and can also be used for scanning an array in a loop.

boxed addressing modes: whenever a value is read or written, a 'boxing function' is called; in the case of a read, it is passed the value and returns the value, in the case of a write (to a location other than the DISCARD location, 0), it is passed an effective address and is expected to write to that address. todo: how to determine the boxing function? The two best choices seem to be: per-program, or per-type(class). but even if per-program, doesn't the boxing fn need to be told the type, so that it can add in a type tag? what about preserving boxed state over fn calls, doesn't that imply that the boxer is a wrapper rather than a getter/setter; or are Oot Views implemented some other way?

(note / design choices / motivations: why provide stack addressing if there is also register addressing? (a) a way to segregate 'temporaries' from 'local variables' (b) natural data structure for some language-y things, such as call stack (c) a way to hold more things in memory than there are registers with efficient allocation/deallocation, as long as you only use the ones near the top of the stack)
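
A hedged sketch of operand reads under the unboxed modes described above; 'mem' stands for the directly addressable memory locations and 'ktab' for the module constant table (both names are assumptions), and the stack-mode post-increment is applied immediately here rather than after the whole instruction, which is a simplification.

def read_operand(mode, data, mem, ktab):
    if mode == 0b000:                # immediate constant
        return data
    if mode == 0b001:                # register direct
        return mem[data]
    if mode == 0b010:                # register indirect
        return mem[mem[data]]
    if mode == 0b011:                # stack: indirect, post-increment on read
        value = mem[mem[data]]
        mem[data] += 1
        return value
    if mode == 0b100:                # constant table (also: 'boxed' immediate)
        return ktab[data]
    raise NotImplementedError("boxed modes 101-111 route reads through a boxing function")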

Opcodes

Opcodes 0-15 are 'primitive'. Opcodes 16-127 are either unassigned or custom. Opcodes 128-255 are sugar for first-class-function-style CALLs to stdlib functions.

Custom instructions

The implementation of a custom instruction cannot be recursive; it cannot call other custom instructions that directly or indirectly call it; that is, the static call graph (or dependency graph) between custom instructions must be acyclic.

The implicit 'call stack' of custom instructions has a depth limit of 4; a custom instruction cannot call a custom instruction which calls a custom instruction which calls a custom instruction which calls a custom instruction.

The implementation of a custom instruction, if any contained custom instructions are recursively expanded until only primitive (Oot Bytecode Core) instructions remain, must take up less than 2048 Oot Bytecode Core instructions.
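
A hedged sketch of checking these three restrictions over the static call graph; 'calls' (mapping each custom instruction to the custom instructions it invokes, with repeats) and 'own_size' (its own count of core instructions, excluding expansions) are assumed inputs, and the exact depth-counting convention is a guess.

def check_custom_instruction(op, calls, own_size, depth=1, seen=()):
    assert depth <= 4, "custom-instruction call depth limit exceeded"
    assert op not in seen, "custom-instruction call graph must be acyclic"
    expanded = own_size[op]
    for callee in calls[op]:
        expanded += check_custom_instruction(callee, calls, own_size,
                                             depth + 1, seen + (op,))
    assert expanded < 2048, "expanded implementation must be under 2048 core instructions"
    return expanded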

These restrictions allow an Oot Bytecode implementation to implement custom instructions by:

Note that these restrictions don't apply to all dynamic first-class-functions, just to custom instructions. Custom instructions (like core instructions) are invoked by an address mode of 0000 (immediate constant) on the opcode operand. Ordinary first-class functions are invoked by other address modes on the opcode operand.

(todo: also, can a 'custom instruction' call an ordinary function in the standard library? surely it can CALL an address passed into it, so the question is not whether it can CALL at all, it's just, is it allowed to have prior knowledge of the standard library functions? i'm leaning towards yes; this isn't much different from allowing it to access numerical constants in its module's constant table)

Custom instructions do not necessarily execute atomically; care should be taken to write custom instructions in a threadsafe manner. Stdlib custom instructions are threadsafe unless otherwise noted.

Global constants

Constant 0 is nil. Constant 1 is the current module.

Primitive types

Note: type safety is extremely important for the 'capability' type because an implementation could break capability security if it could forge capabilities by implicit coercion from other types. Therefore, items of type CAPABILITY can only appear in the CAP register or in address spaces created by CAPMALLOC.

Stdlib types

todo: move this to a separate file

these types used in initial loading:

(note: even though typeclasses and types (abstract and concrete types) are both listed here, they are distinct in that a type signature like

(==) :: (Eq a) => a -> a -> Bool

cannot be satisfied if the two inputs are different concrete types. )

(note: in general, actual function signatures use type constraints and type variables, like Haskell; just using abstract and concrete types directly cannot represent the difference between

(==) :: (Eq a) => a -> a -> Bool and (==) :: (Eq a, Eq b) => a -> b -> Bool

eg, whether it is a typechecking error to compare ints and strings for equality)

we could have a shorthand though where these 'types' are used directly in the function signature (would that mean the first or the second of these possibilities?)

also, i guess we could use these 'types' in stack maps. That implies that they should be the second possibility when used in a type signature, eg

(==) :: Eq -> Eq -> Bool

is shorthand for

(==) :: (Eq a, Eq b) => a -> b -> Bool

because there can be no guarantee that two positions in the stack map that are both Eq have the same concrete type, just that each of their concrete types is Eq.

)

Memory

There are 256 directly-accessible memory locations or 'registers'.

The first 8 registers have special meanings:

todo: frame pointer? (%rbp vs %rsp)

todo: really?: It is illegal for user code to access the DATASTACK or CALLSTACK registers using REGISTER addressing mode (register indirect is allowed though); this is because the stacks might not be implemented using memory-mapped stacks. Similarly for R3 during the execution of a 'custom instruction'.

It is illegal for user code to access the 'reserved for implementation' registers.

Threads and processes

OotB threads and processes aren't necessarily OS threads and processes. An OotB thread has its own registers, data stack, and call stack, but shares other memory with other threads in the same process. An OotB process has its own memories, unlike threads; it also (like threads) has its own registers and stacks.

Each thread and each process has an integer ID, called a TID and PID respectively. Distinct threads are guaranteed to have distinct TIDs, and distinct processes are guaranteed to have distinct PIDs, but that's all that is guaranteed about them.

New threads and new processes are created via FORKTHREAD and FORKPROCESS, respectively. These take an argument which is an address to JMP to; the new child thread/process will start out as a clone of the old parent, but then the child will JMP to the given address whereas the parent will continue by executing the next instruction after the fork instruction. They return the TID or PID of the child thread or process.

Different processes and tasks may or may not execute in parallel. An implementation must guarantee, however, that even if one process or task goes into a tight infinite loop, all other processes and tasks still make progress.

When callbacks are present, there are no guarantees on when they may be called, and they may or may not execute in parallel on a separate thread.

Primitive instructions

Primitive opcodes range from 0 to 255 (although right now there are only TODO of them). The first 8 of these are reserved for metaprogramming.

Primitive instructions execute atomically.

0: comment
1: annotation (some kinds of annotation: code type annotation; memory type annotation; variable name annotation, plus how many instructions this name annotation is in scope for)
2: source map
3: assertion
?: are there prefix modifiers? (mb: 4: prefix annotation (?: i don't like this because the implementation must know its semantics in order to execute the code))
5: block begin/end (some kinds of blocks: stereotyped HLL idiom (execute this); data (skip this); code block (skip this))
6: alternative translation block (eg CALL as a single LONG instruction vs CALL as many MEDIUM instructions; but in general every alternative block may span multiple instructions). Perhaps this has the effect of an unconditional JMP to the last alternative block, for those interpreters that can't understand the other alternatives
7: ?

8: J (note: the operands are added together; if constant addressing mode, they refer to the LABEL constant table) (is the dest operand used as a link register, eg put the instruction after the current one into that? probably)
9: BZ (branch if zero)
10: BNZ (branch if not zero)
11: CPY (copy)
12: ADD-U15 (addition) (todo: should this just be polymorphic ADD?). The result of ADD-U15 must satisfy result < 2^15, otherwise it may crash or trap or yield the wrong result.
13: SUB-U15 (subtraction). The result of SUB-U15 must satisfy 0 <= result, otherwise it may crash or trap or yield the wrong result.
14: LEQ (<=)
15: EQ (arithmetic equality)
16: RECV: recv(dest, channel, callback). If 'callback' is 0, then blocking I/O. todo: how about just an address to write '1' to upon success?
17: SEND: send(channel, msg, callback). If 'callback' is 0, then blocking I/O. todo: how about just an address to write '1' to upon success?
18: SYSCALL: syscall(fn). Call external library function. This tells the implementation to use platform-native calling conventions to call 'fn', which is a reference to a native library function. The arguments to fn are found on the data stack, and after the call the result, if any, should be placed on the data stack.

Note that the result of ADD-U15 and SUB-U15 must satisfy 0 <= result < 2^15, otherwise they may crash or trap or yield the wrong result.
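
As a sketch of this contract (whether an out-of-range result crashes, traps, or yields a wrong result is left open above, so this sketch simply raises):

def add_u15(a, b):
    result = a + b
    if not 0 <= result < 2**15:
        raise OverflowError("ADD-U15 result out of range")
    return result

def sub_u15(a, b):
    result = a - b
    if not 0 <= result < 2**15:
        raise OverflowError("SUB-U15 result out of range")
    return result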

??: POLL: poll(dest, channel, timeout)
??: todo: if instead of callbacks, OPEN and RECV etc write a '1' to some memory location upon finishing, then can we use POLL to look for these?

?: OPEN: open(dest (channel output), filepath, callback) (note: 'filepath' may also be an integer file descriptor, may refer to channels as well as files, etc). If callback is 0, then blocking. todo: how to handle file mode? encoded in filepath? filepath is actually a tuple? callback is actually a tuple? todo: how about just an address to write '1' to upon success?
?: CLOSE: close(channel, callback). If callback is 0, then blocking. todo: how about just an address to write '1' to upon success?
?: LOADMOD: loadmodule; loadmod(dest, module_name)
?: UNLOADMOD: unloadmodule; unloadmod(module)

(note: SUB-U15V is not needed because we can use LEQ to test if y <= x when considering doing (x - y))
?: SEQ (structural equality)
?: LOG (todo)
?: NOP (no-op)
?: HALT (end program)
?: BREAK (break to debugger)
?: BAD (illegal instruction)
?: MALLOC and MFREE (MALLOC requires the 'm' capability. MALLOC conveys the capabilities to the new memory by putting them in the CAP register. All capabilities are conveyed unless the process lacks the 'e' capability, in which case 'x' will not be conveyed.)
?: MLEN (how many memory locations are in the given subaddress space?). ADDRSIZE(dest, pointer)

?: CAS. Compare-and-swap. CAS(dest, old, new); compares the contents of 'dest' with 'old' and, if they match, sets 'dest' to 'new'. Sets ERR to 1 if the compare failed, that is, if 'dest' did not hold 'old'; sets ERR to 0 if the compare succeeded. This is all atomic because all primitive instructions are atomic.
?: LOADL. Load-link. LL(dest, src). Like CPY except that in addition, the platform begins 'watching' the src address. Which address is being 'watched' is threadlocal. Only one address may be 'watched' at a time by any given thread; if another LOADL is executed, the old address is no longer watched.
?: STOREC. Store-conditional. STOREC(dest, src). Like CPY if (a) the address 'dest' is being 'watched' because it was 'src' in the most recent LOADL invocation on this thread, and (b) no thread has altered 'dest' (regardless of whether the value was changed back later) since the most recent LOADL on this thread; if these conditions are true, the value from 'src' is written to 'dest', and 0 is written to ERR; otherwise, 'dest' is not written to by this instruction, and 1 is written to ERR.
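
A minimal sketch of the CAS semantics as read here (i.e. the usual compare-and-swap: write 'new' into 'dest' on success); 'mem' is a memory mapping and 'err' a one-element list standing in for the ERR register, both assumptions. In a real implementation the whole operation is atomic, which plain Python does not by itself guarantee.

def cas(mem, err, dest, old, new):
    if mem[dest] == old:
        mem[dest] = new      # compare succeeded: swap in the new value
        err[0] = 0
    else:
        err[0] = 1           # compare failed: 'dest' is left unchanged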

??: FORKTHREAD(tid (dest), addr_to_jump_to_in_new_process). Requires the forking process to have the 'm' permission.
??: FORKPROCESS(pid (dest), addr_to_jump_to_in_new_process). Requires the forking process to have the 'm' permission.
??: GETTID(dest)
??: GETPID(dest)

?: CAPMALLOC, CAPMFREE: create/release a new subaddress space that can hold capabilities. CAPMALLOC requires the process to be holding the 'c' capability and the 'm' capability.
?: CAPMLEN (like MLEN for CAPMALLOC'd memory)
?: CAPUNION: capability union: CAPUNION(dest, cap1, cap2)
?: CAPONLYADDR: from a capability, create a new capability that only contains those permissions relating directly to the address subspace containing the given pointer. CAPONLYADDR(dest, cap, pointer)
?: CAPONLYPERM: from a capability, create a new capability that only contains those permissions in a certain type mask. CAPONLYPERM(dest, cap, permission_type_mask).
?: CAPGETPERM: query the permissions granted by a capability for the subaddress space containing the given pointer (returns a 16-bit permission_type_mask). CAPGETPERM(dest, cap, pointer)
?: CAPGETADDRS: query a list of all subaddress spaces (directly) contained in a capability, and create a (CAPMALLOC'd) address subspace containing them, returning this new subaddress space in a pointer. CAPPUSHADDRS(dest, cap). (note: this is a neat trick to create a low-level list, but it means that accesses to this list won't be boundschecked by higher-level code at typechecking time. I am a little uncomfortable with that. I suppose we can encapsulate this with a function returning a higher-level list later. But in fact, I'm somewhat uncomfortable with the very idea of subaddress spaces with dynamic size. I guess this can't be helped if we have dynamic-size MALLOCs, though?)
?: CAPSEAL: 'seal' a capability to an address, rendering it unusable except by CAPUNSEAL or CAPJMPSEAL. CAPSEAL(dest, cap, address). The purpose of this instruction is to allow an untrusted 'middleman' to transport capabilities without being able to use them.
?: CAPUNSEAL: unseal a sealed capability; requires 'u' permission on the address subspace containing the seal address. CAPUNSEAL(dest, cap)
?: CAPJMPSEAL: JMP to the given address, after first replacing CAP with the union of the contents of the given (potentially sealed) capability CAPEXE and the (potentially sealed) capability CAPDATA, provided that (a) both CAPDATA and CAPEXE are sealed to the same address, and (b) CAPEXE has (potentially sealed) execute permission on the address that is being JMP'd to. CAPJMPSEAL(addr, CAPEXE, CAPDATA). Note that 'u' unseal permission is NOT required. If there is an error preventing the JMP from completing, CAP will not be affected. The purpose of this instruction is to allow certain areas of memory to only be operated upon via calling into certain code.

??: EXEC_INSTR (from stack)

(note: ordinary CPYs can swap capabilities in and out of the CAP register, but a CAP can only be stored in an address subspace created with CAPMALLOC; all of these primitives must check to guarantee that capabilities are only being input from and output to either the CAP register or an address subspace created by CAPMALLOC)

(note: CAPDIFF and CAPISECT (set difference and intersection) and CAPCALLSEAL can be non-primitives)

Bootstrapping and VM plugins

In order to make it easy to write an OotB interpreter, much of the functionality of OotB has been implemented in OotB itself, in the form of custom instructions, syscalls, and a standard library. These components are loaded by the interpreter before program execution begins. This means that the only things that an implementor must implement are (a) primitive OotB, and (b) this bootstrapping system. The code of these components is part of the OotB specification and is not modifiable by conforming OotB programs or modules. The purpose of this is just to make it easier to port OotB to a new system, not to allow users to metaprogram OotB.

Although there will surely be some use-cases for implementations that allow the user to replace the provided components with others, for the most part, advanced interpreters can and should eventually override this 'plugin' system with an efficient native implementation of equivalent functionality, if developer time allows. Note however that changes to the code of these components will not in themselves be considered a backwards-incompatible change to OotB, provided that the change does not necessitate any backwards incompatibility observable by end users (even though any such change may require advanced interpreters, which have replaced the component in question with an optimized native implementation, to make deep changes to those native implementations). As OotB stabilizes, we might someday change the policy and begin considering any change to the code of these components as backwards-incompatible.

There are three files, loaded in order:

custom_instructions.ob0 and syscalls.ob0 share a format, given below. stdlib is in the format of an OotB module file.

The format for custom_instructions.ob0 and syscalls.ob0 is:

All byte ordering is little-endian. The file must be <64KiB?.

Custom instruction signatures must be one of the following:

TODO should we also put some jump tables in here? the SYSCALLS might be able to use that

TODO do we need a signature for each instruction so that stack addr mode knows which operands to decrement and which to increment and which to do both (=neither)?

More about custom_instructions.bin:

Note that 'requested stack depth' should be used to give the precomputed maximum stack depth required for the 'hidden stack' (since the custom instruction CFG is acyclic, a finite maximum depth can be computed if each custom instruction has a bounded maximum local stack usage).

Limitations on code in custom_instructions.bin:

Execution environment of custom_instructions.bin:

More about syscalls.bin:

The following syscalls are typically included:

TODO: can't we at first just put most of these into vmexec_instr for easier bootstrapping?

The following syscalls are considered 'vm plugins': vminit, vmexec_instr, vmlea*, vmsea, vmjmp, vmtimer, vmshort, vmlong, vmloadk, vmsyscall, vmterminate. This means that they are invoked automatically by the VM in the course of processing ordinary code (but not while processing code inside a syscall implementation). In conjunction with the other syscalls in syscalls.bin, these are responsible for implementing 'address subspaces', capabilities, pre-emption (via vmtimer), decoding of SHORT and LONG encodings, module management, etc.

Limitations on code in syscalls.bin:

Execution environment of syscalls:

The syscall global address space: The implementation provides a special global address space for use by syscalls (16 memory locations). The syscall global address space may contain arbitrary random data and should be initialized by vminit. It is not expected that this is the only global memory available; rather, syscalls are expected to MALLOC more blocks of memory as needed and store a root set of pointers to MALLOC'd blocks from these 16 global locations. todo: perhaps to avoid hidden state, we should make this part of the register set, and just forbid r/w from it unless you're in a syscall? sort of like the CAP register, i guess?

Each syscall is called with the following on the hidden stack: (a) the return address, and (b) the arguments, and (c) a pointer to the syscall global address space. The following instructions and syscalls have special behavior:

Note that although ordinary CALLs (including first-class functions) are caller-saved, SYSCALLs are not. SYSCALLs must avoid unintentionally clobbering registers.

At program start, vminit is called. When not inside a syscall, instructions are executed by calling vmexec_instr. All reads from memory go through vmlea, all writes to memory go through vmsea, and all jumps (but not relative branches) go through vmjmp. All reads and writes from/to 'special' registers (todo: which reg #s are those; these are things like PC, CAP, etc) go through lear and sear. SHORT encoded instruction groups are executed by calling vmshort and executing the code it returns. LONG encoded instructions are executed by passing control to vmlong. MALLOC and FREE syscalls call the implementations in syscalls.bin. Constant-table addressing is processed via the vmloadk syscall. vmtimer is called every k instructions (where k is a positive finite implementation- or user-defined constant). vmsyscall is called to dispatch syscalls. vmterminate is called upon abnormal program termination (and is ordinarily called by HALT). These syscalls are all called through the 'vmsyscall' call.

More about stdlib.oob:

stdlib.oob is loaded via the 'loadmodule' syscall, into module 1. When told to load to module 1, loadmodule should check if module 1 has already been loaded (via checking a flag that it maintains in the global syscall address block); if module 1 has been loaded before, it should return with an error. If module 1 has not been loaded, it should load the stdlib, but bypass signature checking, and then if the load is successful, it should store a reference to the stdlib into the global syscall address block, and globally mark module 1 as previously loaded. This procedure allows signature checking to be skipped for the stdlib (since the stdlib will be requested to be loaded by the implementation, which is trusted, before program start).

Capabilities security

Primitive capabilities (must be implemented by any primitive OotB implementation)

There is a 'capability register' CAP, which holds a CAPABILITY (which is a set of 'permissions'). Unboxing is capability-aware in the following sense: upon unboxing a capability input, the current value of CAP is saved on an internal stack, and CAP is replaced by CAP unioned with the new capability; upon the end of this instruction, however, the internal stack is popped and the previous CAP is restored.

The current capability set determines whether, for each subaddress space, memory addresses in that subaddress space may be:

Other bits in the capability are not per-address-space (that is, a process can use these permissions whenever the contents of the CAP register contain them):

permission_type_mask is a 16-bit mask.

The primitive OotB implementation must ensure that all reads check the 'r' capability, all writes check the 'w' capability, and all inter-subaddress-space jumps/branches/executions check the 'x' capability.

There is a TRANSITIVE 'Access' relation on subaddress spaces; if one subaddress space X 'has Access to' or 'Access'es another one Y, then if you have a capability on X it applies to Y too.

After calling MALLOC, the CAP register contains full permissions on the newly allocated memory. Note that none of the other capability instructions provide any way to increase the permissions in a capability unless you already have another capability with the other permissions.

Note that any executing code always has full permissions on the 'register' address space. todo: and beyond the top-of-the-stack for the data stack eg it can push new items and read what it pushed? for both stacks?

todo: a way to create subaddress spaces for the stack(s) 'up to some point'.

Non-primitive (boxed) capability-augmented pointers (implemented by the stdlib on top of primitives)

Boxed pointers may carry capabilities, holding the following data:

Array bounds are represented by the subaddress space, not by the capability; so narrowing such bounds is done by creating a new subaddress space (which is a view onto a subset of an existing one).

Introspection

Constant 0 is an object holding information about the current module and implementation. This information is constant over each invocation of the program. It includes:

ABI, calling convention, stack convention

Oot Assembly doesn't fix an ABI or calling convention in the sense of a specific, standard way for OotB binaries to be called into from outside code at a level below the FFI. For ease of interoperability, it is expected that each Oot implementation will adopt the most popular ABI on its target platform. However, within the Oot Assembly model, we have CALLs and we have a stack, so we need to define a convention for calling within OotB code. This does not mean that implementations 'actually' follow this ABI on the metal, just that the simulated OVM uses this convention within the VM being simulated.

OOTB File format

An Oot Bytecode module file has the following format. It is little-endian.

The filename must match the modulename (except that on platforms that support it, an '.OotB' extension must be added at the end), except on weird platforms where this isn't practical. Modulenames must start with a letter and contain only ASCII lowercase letters, numbers, and underscore. The length of every import path must be less than 256 (so modulenames should be well under 256). Motivated by Windows compatibility [1], the following module names are forbidden: CON, PRN, AUX, NUL, COM1, COM2, ..., COM9, LPT1, LPT2, ..., LPT9.

Header

The fixed-length header contains everything needed to determine whether this OotB module can satisfy some import request given in another module (assuming that this module is valid; that is checked later), except for the name of the module, which the loader already knows since it must match the filename. It contains the module version, custom attributes, and the public key of the signer, if any. It also contains the timestamp of the HLL source code from which this module was built, to allow the implementation to dynamically recompile the OotB module if its source has changed.

Header::TOC

The TOC lists offsets to the other parts of the module. This is to allow the module to be loaded in some way other than proceeding through each section one by one. The ordering of the other file segments is fixed, and each segment (except for the last segment, the code segment) is preceded by some sort of length, so the loader can also skip the TOC and just load everything else in order.

Each entry in the TOC is 4 bytes; it is an offset from beginning of file to beginning of the specified segment, or 0 if the specified segment is empty (even empty segments take up 2 bytes of space to say that they are of 0 length).

todo:

Filename

Signature

Import table

These entries are numbered starting with 1; number 0 refers to this module.

Each entry:

module import:

file import:

do we allow exclusion of specific versions?

Note: like Ruby Bundler or Rust Cargo .lock files, the point of the 'preferred' minor version is to give a version for all of the dependencies that the developer knows is jointly available and works together. No patch version is given because the system still prefers the highest available patch version (so as to support security updates); if the highest available patch version is 'bad' then the user will have to pick a different one on the commandline.

Constant table

Each entry:

todo: some items in the constant table will be imported from other modules; have a field with an index into the import table to specify the source module in that case

Note: the constants in the table will be loaded into constant 1 and up, because constant 0 is reserved for introspection on this module and on the runtime.

todo:

items:
* ?type?
* 2 bytes: 1/2 of length in bytes (note: min value is 1, max value is 2047; so min length is 2 bytes, max is 4094 bytes)
* n bytes: data

note: there should be a way to import an external file (at compile time) into a single constant. This will have no length limit.
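
A hedged sketch of reading one constant-table item under this layout (the still-undecided '?type?' field is omitted; 'buf' and 'pos' are assumed names):

import struct

def read_constant_item(buf, pos):
    (half_len,) = struct.unpack_from('<H', buf, pos)   # little-endian, per the file format
    length = 2 * half_len                              # the stored value is half of the byte length
    data = buf[pos + 2 : pos + 2 + length]
    return data, pos + 2 + length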

all memory segment types:

binary blob:

imported constant (note: after import, the type of this item will be whatever the type of the imported item is):

mmapped file resource:

Function table

todo: need a way for functions to declare that they won't use more than n registers and that they won't use more than m stack space (directly/per invocation; if they recursively call themselves they could use more overall)

Initial memory segment table

each entry is of the form:

Segment 0 is the program (type 'code segment'), segment 1 is the registers (type 'registers'), segment 2 is the data stack ('read/write memory segment'), and segment 3 is the call stack ('read/write memory segment').

The segment content consists of two 16-bit fields for each location in memory. The first 16 bits are interpreted as an 'address mode'. 0 and 1 are immediate or constant table. Other address modes 2-15 are illegal here, for now. The second 16-bit field is either an immediate value or an index into the constant table.

To encode a pointer to the beginning of another memory segment, the index to that memory segment (that is, an index into this table) is put into the constant table, and then the corresponding constant index is encoded here. Note that the constant table memory segment entries can also contain an offset, to construct pointers targeting the middle of a segment.
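
A sketch of materializing one initial-memory location from its two 16-bit fields, assuming mode 0 means immediate and mode 1 means constant-table index (the text above says '0 and 1 are immediate or constant table' without pinning down which is which), and assuming 'ktab' is the already-loaded constant table:

def init_location(mode, value, ktab):
    if mode == 0:
        return value                 # immediate (assumed)
    if mode == 1:
        return ktab[value]           # constant-table index (assumed); may resolve to a segment pointer
    raise ValueError("address modes 2-15 are illegal in initial memory segments")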

The purpose of this table is to minimize startup latency by allowing the module to 'preload' an initial state.

The reason this is separate from the constant table is to avoid cluttering the constant table with these segments, which are probably large and which will probably not be referred to after initialization.

Code segment

This is the last segment in an OotB file and has no length header.

Oot Assembly format (textual format)

Note that if you want to port Oot to a new platform, it is not necessary for you to implement an assembler that can read this format. All you have to do is implement Oot Bytecode. An assembler has already been implemented in Oot Bytecode.

The motivation for a textual assembly format is merely to provide a standard way for implementors to manually read and write Oot Bytecode without using a hex editor. Programmer conveniences are limited, as it is expected that most of the time implementors will not be writing Oot Bytecode directly.

todo: probably want to add S-expr representation for 'subexpression' addr mode

The Oot Assembly format makes the constant tables implicit.

An Oot Assembly file consists of ASCII text. Its syntax is 'line oriented'; a 'line' is delimited by either the character '\n' or the character ';'. However, a backslash followed by a newline is ignored (removed from the Oot Assembly file at an early stage of processing), and a semicolon inside of a string literal is treated as an ordinary character. A double backslash is treated as a single backslash. A single backslash is often treated as a control character sequence. Whitespace, consisting of the characters [ \t\n\v\r\f] (space, horizontal tab, newline, vertical tab, carriage return, and form feed), acts as a separator and is otherwise ignored, except for \n, which terminates lines as described above.

The '#' character is used for single-line comments; the '#' character and anything following it, up to the end of the line, is ignored. (these are NOT included in the Oot Bytecode target using the 'comment' opcode).

Empty lines (including lines consisting of nothing other than whitespace and comments) are ignored.

A line's type is determined by its first non-whitespace character, which is one of the following:

variable)")

Identifiers match the regex "[a-zA-Z_][a-zA-Z0-9_]+". Labels and constants are identifiers.

Directives are:

Label declaration are either:

Ordinary label identifiers must not start with a lower case letter. Ordinary labels must be unique within a module (and cannot have the same name as a compile-time constant) and are mapped to integers and included in the module's label constant table in the generated Oot Bytecode. Relative labels are resolved at compile-time into PC-relative offsets.

Non-anonymous local labels must be unique within the module. Anonymous local labels need not be unique.

Literals are:

Predefined compile-time constants are:

Label references look like:

Variables look like either:

Ordinary lines look like:

instruction_name operand1, operand2, operand3

By convention, operand1 is the 'destination' or 'output' or 'write' operand and operand2 and operand3 are the 'input' or 'read' operands, but some instructions don't have 2 inputs and 1 output and hence don't follow this convention. Actually, the opcode itself is also an operand, which we call the 'opcode operand' or 'operand0'.

To syntactically indicate an immediate or constant mode operand, just write the constant as a literal (the assembler will automatically identify constants which don't fit in immediate mode and put them in the module constant table).

To syntactically indicate a register mode operand, just use a variable.

todo much of the following is out of date.

For example, the following has register mode for the first (destination) operand, and immediate mode for the second and third (input) operands:

ADD x, 2, 2

To syntactically indicate a register indirect operand, give a variable prefixed with an asterisk ('*'). For example:

ADD *x, 2, 2

To syntactically indicate an indirect-offset operand, give '*(' followed by a variable followed by a '+' followed by an unsigned decimal integer or hex literal in the range [0, 63] followed by ')':

ADD *(x+20), 2, 2

To syntactically indicate an indirect-read-post-increment-write-pre-decrement operand, use the form PUSH or PUSH(variable) or --variable for operand1 or POP or POP(variable) or variable++ for operand2 or operand3; if '(variable)' is omitted, then '(DS)' is assumed:

ADD PUSH, POP, POP

To syntactically indicate an indirect-read-post-decrement-write-pre-increment operand, use the form ++variable for operand1 or variable-- for operand2 or operand3; if '(variable)' is omitted, then '(DS)' is assumed:

ADD ++x, y--, z--

To syntactically indicate a subexpression mode operand, use the form (@label) (or () to indicate a 'subexpression return'):

ADD (), 2, (@COMPUTE_SECOND_ADDEND)

(todo: is there a way to directly do something like a subexpression addr mode on a 0-ary first-class function? would we like an addr mode for that? of course we can always put the first-class function in a labeled block and call it)

Unboxed address modes are used by default. To use boxed (eg addr modes with the MSB set; addr modes 8-15), prefix with $. Eg:

ADD $(), $2, $(@COMPUTE_SECOND_ADDEND)

yields addr modes 15, 8, 15 instead of 7, 0, 7.
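
Pulling the surface syntax above together, here is a rough, hedged sketch of classifying a textual operand into a descriptive addressing-mode name; the mode names and regexes are illustrative guesses only, and, as noted above, parts of this syntax are out of date.

import re

def classify_operand(text):
    boxed = text.startswith('$')          # '$' prefix selects the boxed variant
    t = text[1:] if boxed else text
    if t in ('PUSH', 'POP') or re.fullmatch(r'PUSH\(\w+\)|POP\(\w+\)|--\w+|\+\+\w+|\w+\+\+|\w+--', t):
        mode = 'stack (pre/post inc/dec)'
    elif re.fullmatch(r'\*\(\w+\+(\d+|0x[0-9a-fA-F]+)\)', t):
        mode = 'indirect-offset'
    elif t.startswith('*'):
        mode = 'register indirect'
    elif t == '()' or re.fullmatch(r'\(@\w+\)', t):
        mode = 'subexpression'
    elif re.fullmatch(r'-?\d+|0x[0-9a-fA-F]+', t):
        mode = 'immediate/constant'
    else:
        mode = 'register direct'
    return mode, boxed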

Instruction_name can be a numerical opcode or constant instead of an instruction name identifier.

To get the numerical opcode or constant refering to an instruction, prefix the instruction name with &. For example, to copy the numerical opcode for ADD into variable 'x', use:

CPY x, &ADD

To write an instruction with an addressing mode other than immediate or constant, replace instruction_name with a parentheses-surrounded operand expression. For example, to add 2 and 2 and place the result into variable 'y', use:

CPY x, &ADD
(x) y, 2, 2

todo:

Check out:

and some specifics:

are there subexpressions, blocks?

Limits

There are only 2k directly-accessible memory locations or 'locals' or 'registers'.

Implicit lexical nesting via the 'subexpression' addressing mode must be acyclic and is limited to a depth of 255.

Within any one module, there can be only:

(numbers 0-2047 of each of these are reserved for the Oot Bytecode Stdlib)

Note that only the first 4k of these (including the 2k Stdlib ones) are directly accessible in MEDIUM mode.

Within any one function, only:

In the textual Oot Assembly format, some assemblers may have the capability of automatically creating implicit structs and breaking modules apart into many submodules so as to give the illusion that some of these limits do not exist.

usage tips

Three ways to programatically generate Oot Bytecode:

todo, finish this 'spec'

todos


(copied from [2])

The idea of 'alias' and 'value' addressing is that each memory cell has two 'levels': the 'alias' level, which might contain a pointer (or actually, maybe an entire 'alias record', which is a pointer with some attached metadata), or nothing if the cell is not a symlink but an actual value; and the 'value' level, which in the case of non-symlinked cells contains the contents of the cell, and which in the case of symlinked cells looks like it contains the contents of the symlink target. So when you use value addressing on a symlink cell (that is, one with a non-empty alias level), you are really manipulating the cell that is the symlink target; and when you use alias addressing on such a cell, you are reading or mutating its symlink record.

In order to CREATE an alias, a third addressing mode is needed, because you want to take the 'address' of a cell, and place it into the alias level of a different cell. Instead of 2 different alias-related addressing modes, we could have some special instructions. Note that an 'address-of' mode could not be written to, only read, so maybe that one should be omitted?

We also need a sentinel to represent an empty alias cell. We can use 0, which would mean that the PC cannot be aliased, which is no great loss.

Also, is this 'alias level' identical to the 'meta level' above a 'perspective' (view), or do we need yet another mode for that? i'm guessing it's a field within the metadata in the meta level

if we used an 'alias' addressing mode, and a special instruction/opcode for GET_ADDRESS_IDENTIFIER and CREATE_ALIAS then:

If we wanted to look at or manipulate metainformation in the alias record, we would first move the alias record itself into an ordinary value cell, by issuing a CREATE_ALIAS instruction whose input had alias addressing (r) and whose destination had value addressing (*r) (this creates an alias from the destination cell to the alias record which is controlling the input cell; if we wanted to alias a destination cell to the same symlink target as a source cell, we would instead do CREATE_ALIAS with value addressing in the input (the more typical case)). (note: if this is what we're doing, tho, then 'alias' addressing in CREATE_ALIAS's output is useless, which seems like a waste). Note: this is irregular in that the CREATE_ALIAS and GET_ADDRESS_IDENTIFIER opcodes would interpret value addressing differently from every other opcode.

A more regular solution would be to use an address-of addressing mode, and to have GETALIAS and SETALIAS operations. To create an alias from r4 to r3, you would SETALIAS(input= address-of(r3), destination= address-of(r4)). To create an alias from r4 to whatever r3 is directly aliased to, GETALIAS(input=address-of(r3), destination=r1); SETALIAS(input= r1, destination= address-of(r4)). Etc. I like this better.

A downside there is that address-of can't go in the destination operand. But i guess that's good, it means we have a spare 1/2 addressing mode. hmm, mb too clever, but we could use that in place of SETALIAS... presumably SETALIAS will be somewhat common, but GETALIAS will not. to clear a symlink, we can copy 0 in using this addressing mode

Note that SETALIAS is a special case of setting .__GET and .__SET.

File format todo


at some point we'll want to have, written in Oot Bytecode, not just the OotB stdlib, but also an OotB disassembler, assembler, and API for OotB manipulation.

MB look to YASM's library for guidance on that stuff.

---

note similarity of primitive instruction set to https://en.wikipedia.org/wiki/Little_man_computer and https://en.wikipedia.org/wiki/CARDboard_Illustrative_Aid_to_Computation and https://github.com/jarcane/MicroMini/blob/master/docs/MicroMini.txt

---

if the second goal of Oot Assembly is to serve as a generic AST format that preserves HLL intent, then we want a clear way to represent an HLL call even if it has more than 2 arguments; so we want (1) an annotation to mark the beginning of a CALL block that pushes arguments onto the stack and then consumes them with a CALL, and (2) a stereotyped form for such a block. and what to do about keyword arguments?

also, generalize this; mb give it its own 'annotation' opcode, in addition to the usual one (and in addition to the 'prefix annotation' one and in addition to the 'data block (skip this)' one, if any, and in addition to the 'comment' one)

or, should we just use LONG format for this sort of thing (eg for CALLs with more than 2 arguments)? i guess that would mean finishing the LONG format.

---

it's essential that the interpreter know where all pointers are at all times, so that it can resize the stack and the heap and move stuff around as it does so.

if things are statically typed, great, but when things are of type DYN, we must make sure that there is a pointer map (like in Haskell) at all times -- this must be a required component of boxing, i guess, and DYN must always be boxed of course (to have the type tag, and now also to have the pointer map).
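
As a rough sketch of what such a boxed DYN representation might look like (assuming a bitmap-style pointer map; the field names and layout are my own, not the actual OotB? boxing format):

  #include <stdint.h>

  /* Hypothetical boxed DYN value. The header carries a type tag plus a
     pointer map so the runtime can find every pointer when it relocates
     the stack or heap. */
  typedef struct {
      uint32_t type_tag;     /* dynamic type of the payload */
      uint32_t num_words;    /* number of 64-bit payload words */
      uint64_t pointer_map;  /* bit i set => payload[i] is a pointer
                                (enough for up to 64 payload words) */
      uint64_t payload[];    /* the actual fields */
  } BoxedDyn;

  /* Example GC/relocation helper: visit every pointer field. */
  static void visit_pointers(BoxedDyn *b, void (*visit)(uint64_t *slot)) {
      for (uint32_t i = 0; i < b->num_words; i++)
          if (b->pointer_map & ((uint64_t)1 << i))
              visit(&b->payload[i]);
  }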

---

should custom instructions execute atomically? Maybe use STM there.

---

todo: at some point we need to add a way ('GRANTADDR'?) for one subaddress space to ACCESS another. And a way to delete those grants later. And a way to list them.

---

todo LOADL requires an effective address in both the 'input' and the output.

---

some sort of local .constant definition so as to reserve space in the constant table when a constant is only used in one place (so that the whole constant table has more chance of fitting in cache)?

---

in case thread or process forking isn't available, consider a non-cloning variation on forkthread and forkprocess

---

i'm still going back and forth between:

the arguments for the 'hooks' are:

the arguments against the 'hooks' are:

perhaps after implementing one of them i'll feel that it is or isn't too complicated.

---

have to think carefully about FFIs; that's one thing that i don't know much about that could have a large impact on the design, since a big part of the point of making OotB? easy to implement on various platforms including HLLs is to make interop easy with these HLLs.

---

TODO need a way to load a struct corresponding to metadata for any imported module (and remember, module 0 is self, 1 is stdlib). This allows you to get first-class-functions for module fns, and to access more than 256 exports.

---

ok so even if the stack is not mmapped, we still need a way to save it to the heap, and to swap in another stack saved in the heap (and create a new stack in the heap of a certain size, but assuming that the format of the stack when saved on the heap is not opaque, we can just use MALLOC for that)

we also probably need a way to get the current height of the stack
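
Purely as an illustration (the operation names SAVESTACK/SWAPSTACK/STACKHEIGHT and the representation of a saved stack as an ordinary heap block are assumptions, not proposed instructions), this could look like:

  #include <stdint.h>
  #include <stdlib.h>
  #include <string.h>

  /* Hypothetical interpreter state: a contiguous stack plus its height. */
  typedef struct {
      uint64_t *stack;
      size_t    height;     /* number of live slots */
      size_t    capacity;
  } Vm;

  /* STACKHEIGHT: get the current height of the stack. */
  static size_t stackheight(const Vm *vm) { return vm->height; }

  /* SAVESTACK: copy the live portion of the stack into a heap block
     (here just malloc'd, matching the 'we can just use MALLOC' remark). */
  static uint64_t *savestack(const Vm *vm) {
      uint64_t *saved = malloc((vm->height + 1) * sizeof(uint64_t));
      saved[0] = vm->height;                       /* record the height */
      memcpy(saved + 1, vm->stack, vm->height * sizeof(uint64_t));
      return saved;
  }

  /* SWAPSTACK: replace the current stack contents with a saved one
     (assumes vm->capacity is large enough). */
  static void swapstack(Vm *vm, const uint64_t *saved) {
      vm->height = (size_t)saved[0];
      memcpy(vm->stack, saved + 1, vm->height * sizeof(uint64_t));
  }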

---

ok, the primitive instruction set and instruction encoding are coming along. Things that remain to be done:

(copied from proj-oot-ootAssemblyNotes13).

---

Recall that the goal here is to be able to directly execute untrusted Oot Assembly code, sandboxing it only via placing restricted capabilities in the Capabilities register before calling it (via SEALCALL).

---

Regarding a program which writes instructions into an addr_subspace and then wants to execute them: the program should have to 'load in' the addr_subspace (either as raw, or as ob0), using one or more SYSCALLs. These SYSCALLs may not be available on every platform. Eg if you compile Oob to assembly for some embedded platform, then most likely you will not be able to compile dynamically-generated Oob to assembly on the embedded platform at runtime.

---

JZREL: in imm2. # conditional relative jump (branch if 'in' is zero). Note: imm2 input is signed. 1 is added to non-negative inputs (b/c 0 is just an infinite loop, not useful), so range is (-32768:-1 unioned with 1:32768)
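
A sketch of how that offset adjustment could be decoded, assuming imm2 is a signed 16-bit field and the jump is relative to the JZREL instruction itself (both are assumptions on my part):

  #include <stdint.h>

  /* Effective branch offset for a hypothetical JZREL.
     Raw imm2 range: -32768..32767; non-negative values are shifted up by 1,
     so the effective range is -32768..-1 union 1..32768 (never 0). */
  static int32_t jzrel_offset(int16_t imm2) {
      return (imm2 >= 0) ? (int32_t)imm2 + 1 : (int32_t)imm2;
  }

  /* In the interpreter loop (sketch):
     if (input_is_zero) pc += jzrel_offset(imm2); else pc += 1;  */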

---

i guess the constant table can be unified with the function table and the jump table

---

  1. memory leak when new registers are defined and then finished with, yet still referenced in addr_subspaces
  2. this is actually a little tricky since in theory Oob code could capture the registers and try to use them later. How to restrict that? Or do we just use refcounting or tracing GC? TODO

---

ordinary code shouldn't have execute permissions on a CODE pointer within custom syscalls

---

so we have 1 addr mode bit left (2 of them specify 1 of 4 basic addr modes: immediate, register, register indirect, and RESERVED; the other one specifies BOXED or unboxed).

when this bit is set, it could force us into REGISTER INDIRECT mode regardless of the other bits. The BOXED bit still does its thing, but the other 2 bits are now used to specify a displacement from +1 to +4 (no +0 because ordinary indirect mode already does that)
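
Here is one way the resulting 4-bit operand mode field could be decoded, purely as an illustration of the scheme in the two paragraphs above (the bit positions and enum names are my own assumptions):

  #include <stdint.h>

  typedef enum { MODE_IMMEDIATE, MODE_REGISTER, MODE_REG_INDIRECT, MODE_RESERVED } BasicMode;

  typedef struct {
      BasicMode mode;
      int       boxed;         /* the BOXED bit always keeps its meaning */
      int       displacement;  /* 0 normally; +1..+4 when the spare bit is set */
  } OperandMode;

  /* Hypothetical layout: bit 3 = spare bit, bit 2 = BOXED, bits 1..0 = basic mode. */
  static OperandMode decode_mode(uint8_t bits) {
      OperandMode m;
      m.boxed = (bits >> 2) & 1;
      if ((bits >> 3) & 1) {
          /* Spare bit set: forced register indirect, displacement +1..+4
             (+0 is excluded because plain register indirect already covers it). */
          m.mode = MODE_REG_INDIRECT;
          m.displacement = (bits & 3) + 1;
      } else {
          m.mode = (BasicMode)(bits & 3);
          m.displacement = 0;
      }
      return m;
  }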

---

i should probably re-describe the relationship between Boot, Ovm, and the Ovm encodings.

Boot is a dead-simple VM. Ovm runs on top of Boot. In addition, Ovm is an extension of Boot, in that the Boot instruction encoding is the same as the Ovm SHORT instruction encoding.

Ovm provides all the various instructions that Boot appears to leave out (eg signed integer arithmetic and floating point arithmetic). Ovm is more extensible and more powerful than Boot, with addressing modes, custom instructions and subaddress spaces, dynamic generic instructions, and first-class functions.

Ovm supports three instruction subencodings: SHORT (which is identical to Boot), MEDIUM (the 'main' one), and LONG (an extensible but more complex encoding).

Both Boot (=Ovm SHORT) and Ovm MEDIUM have fixed-length instruction encodings in which each instruction can be interpreted independently of the others. This is why, in MEDIUM encoding, we don't have quoted Raw bytestrings of binary data, or 64-bit absolute jump constants after a JMP instruction, or 64-bit LOADIs, or meta-commands that switch instruction sets or encodings. LONG can have that sort of thing. Ovm LONG has things like grouping constructs, quoted commands, transaction begin/ends, 64-bit LOADIs and jump constants, and quoted raw binary data.

---

toread

http://www.atmel.com/Images/compiler.pdf

---

in lowEndTargetsUnsorted2 i analyzed the histogram of how much memory various microcontrollers (aka MCUs aka uCs) have on Digikey. My analysis is:

recall also that:

Why do we care about fitting in cache when we're not performance-oriented, or about microcontrollers when Oot is not a systems language? The reason is that my idea is that programs written in Oot will be running on massively parallel machines, machines which have at least tens of thousands of processors, with each processor being relatively small and slow. We need to make sure that Oot's runtime fits in the small memory space of each of these processors. Since consumer machines like that don't exist yet, we have to guess what the memory limit will be.

Note that https://jaycarlson.net/microcontrollers/ defines microcontrollers (aka MCUs aka uCs) as "processors with completely self-contained RAM, flash, and peripherals". Is that really what we want to target? No, probably not -- we want LOTS of CPUs, for cheap (which probably implies high power efficiency too) -- we don't care if each one has peripherals (although we probably do need each one to have its own memory). It would probably be cheaper/more efficient to manufacture and distribute a ton of CPUs, each with their own memory, at once, without giving each one its own peripherals. So MCUs are just a guideline, not an actual target.

There are some MCUs, like the Propeller (8 cores) and XMOS (2-24 cores?) and Epiphany (18 cores in Parallela board, 1024 cores in Epiphany-V chip), which are all multicore. These would at first seem like good targets, but actually we want MASSIVE parallelism, which is probably very different from 8-core parallelism -- eg the XMOS has some stuff to help the cores synchronize and communicate and share memory and schedule threads, and without knowing much about it, i bet this stuff wouldn't scale from 8 cores to 64k cores very easily, just like the cache coherency in PC multicores won't scale. Presumably the Propeller (8 cores) is the same, although i'm not sure. Fans of the Propeller and the XMOS say that the point is not really parallel processing (for that, they recommend just running multiple threads on one higher-end core), but rather to do multiple real-time things with hardware without bothering with interrupts; the Propeller and the XMOS have fewer peripherals than other MCUs, but the idea is that you can devote some of the cores to acting as peripherals; and the reason that avoiding interrupts is good is that it makes it easier to prove that the system will be deterministic and will meet timing requirements (also it's simpler to understand).

However, it might be worthwhile to learn about the programming paradigm offered to programmers who write programs for the Propeller and the XMOS, because presumably this is a good paradigm for concurrency. (note: XMOS seems to be the usual message-passing over channels, plus some thread synchronization barriers (are those just for cores on the same tile?), plus maybe some event stuff that i haven't had time to look at yet)

The Epiphany is presumably more scalable -- they claim it should scale to a billion processors -- but so far there hasn't been evidence of more than 1024. Supposedly even the Parallela boards can be connected to each other: "The architecture also allows parallella boards to be combined into a cluster with a fast inter-chip 'eMesh' interconnect, extending the logical grid of cores (creating almost unlimited scaling potential)."

People talk about FPGAs but they also say that it's much, much harder to program these, compared to MCUs. I assume that with FPGAs you really have to get into Verilog/VHDL, RTL, and all of the timing and layout complexity (and hardware programming language non-intuitiveness) that that implies.

In the end the most cost-effective thing to do will probably end up being to use GPUs. GPUs are mostly SIMD but, at the cost of inefficiency, i think we can still use them to run many instances of Turing-complete state machines. Another alternative is Adapteva/Parallela/Epiphany.

The Epiphany-V local stores are 64k.

"so interesting numbers relating to amount of memory available per core are:

...

GPU program size limits:

https://www.opengl.org/discussion_boards/showthread.php/182182-Shader-Program-size-limit https://www.reddit.com/r/opengl/comments/2n8mjz/size_limits_for_shaders_on_various_gpus/ https://stackoverflow.com/questions/2617957/glsl-maximum-number-of-instructions

" -- lowEndTargets/gpus.txt

and i think the upshot is:

to summarize even shorter:

note: the tiled CPU-node, memory-node, network-node design at the end of ootAssembly19 only gives each CPU node 4k of local memory (plus 1k of non-local working memory at each of the 4 neighboring network nodes). So 8k 16-bit words is possibly 4x too much; we might really only have 4k bytes; otoh in that design each memory node also has 7k of fabric memory, and each CPU node is next to 4 memory nodes, so each CPU node could potentially be allocated another 28k -- but we should leave that for the user application to use, esp. since allocation isn't guaranteed. Let's split the difference; we'll rely on each CPU node being able to use the 4k of non-local working memory for our language in that design, so now we have 8k BYTES, not 8k words, of RAM; but this includes memory that needs to be used for implementing any fabric platform VM stuff, so again, the core interpreter stuff should fit in 4k BYTES. That's pretty tight; we may have to give up on multithreading implementation at the lowest level (and instead implement multithreading at a higher level). Otoh the interpreter code can go in 'ROM'; this RAM just has to fit the interpreter stack etc.

---

OK so as per the discussion in ootAssemblyNotes22 for the built-in 'chip' networking (imagine this as the on-chip network of the tiled/fabric architecture discussed in the previous section and in ootAssemblyNotes19):

messages:

channels:

shared memory blocks:

Note that these numbers are all derived from the 16-bit word size choice; if addresses are one word, that's 64k nodes; if there are 64k nodes, then for a data structure with one bit for every addressable node to fit within one message payload, the payload size must be >= 4k words (64k bits = 8k bytes = 4k 16-bit words).

Error correction/reliability/exactly-once-delivery is implicitly provided and may be assumed by the application, except for possible dropping of async messages; this is supposed to be like hardware/IPC, not like external network transmission (if you wanted to access a 'real' (external) network then you may need more than 16-bit addresses anyway).

Incoming packets may or may not be buffered. These buffers are opaque to the application (the buffer size cannot be observed or changed, and the application cannot tell if there are currently any packets queuing in its incoming buffers). The possible presence of a buffer is noted only because it affects the semantics of the network.

Special fixed-length message types (from the message size discussion in ootConcurrencyNotes7):

signals:

64-bit messages:

(alternate, probably better, idea: 64-BYTE messages, with 16-byte header (64-bit src addr, 64-bit dest addr) and 48-byte payload; this allows an interior/wrapped packet to have ample room for a 32-byte header and 16-byte payload, or a 16-byte header and 32-byte payload; note however that a single Ed25519 signature (512-bits/64-bytes) would still be too large to fit in the payload)
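
To make that alternate layout concrete, here is the 64-byte message as a C struct sketch; only the sizes (16-byte header of two 64-bit addresses, 48-byte payload) come from the note above, the field names are mine:

  #include <stdint.h>

  /* Hypothetical 64-byte message: 16-byte header + 48-byte payload. */
  typedef struct {
      uint64_t src_addr;     /* 64-bit source address */
      uint64_t dest_addr;    /* 64-bit destination address */
      uint8_t  payload[48];  /* room for a wrapped packet with a 32-byte header
                                and 16-byte payload (or 16+32), but not for a
                                64-byte Ed25519 signature */
  } Message64;

  _Static_assert(sizeof(Message64) == 64, "message must be exactly 64 bytes");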

---

todo (from ootAssemblyNotes19):

(what other modes do other ISAs have? scaled offsets come to mind)

so, we have three options:

---

In Boot we were going to make everything 16 bits, but many platforms don't support any float smaller than F32. So maybe either the floats should be F32, or we should support float arithmetic of at least 16 bits but not guarantee that it won't actually behave like F32, or even F64 or higher.

---