proj-oot-ootVm

Introduction and motivation

"Oot" has 3 well-defined compilation stages:

Oot -> Oot Core -> Oot VM -> platform-specific code

Oot is the high-level language. It is built from Oot Core using Oot Core's metaprogramming facilities.

Oot VM is a virtual machine targeted by Oot. Oot Assembly is the linear intermediate language/assembly language that the Oot VM runs. Oot Bytecode is a particular encoding (meta)format for Oot Assembly. Oot Core is interpreted/JIT compiled/compiled to Oot Assembly.

For bootstrapping, the Oot compiler is available in Oot Assembly.

Runtime 'administration/services' such as garbage collection, greenthread scheduling, copy-on-write, bounds checking, and non-strict evaluation are implemented 'beneath' Oot Assembly, in the Oot VM.

The rationale for Oot Core is:

Why have a VM?

Why have a linear Oot Assembly layer underneath Oot Core, instead of just interpreting the Oot Core AST directly, or instead of compiling Oot Core directly to machine code? The main reason is to serve as a thought experiment. This helps me:

Another reason is performance. Since i haven't implemented this yet, i don't have hard data on this, but I suspect that it will be easier to quickly execute a linear series of fixed-size instructions than to walk an AST, if only because this is close to the architecture of today's commodity computers. Wikipedia says "for interpreters, an AST causes more overhead than a bytecode interpreter, because of nodes related to syntax performing no useful work, of a less sequential representation (requiring traversal of more pointers) and of overhead visiting the tree" but "using AST has been proposed as a better intermediate format for just-in-time compilers than bytecode".

Another reason is that everybody else is doing it. Other recent interpreted high-level languages all seem to take this approach (eg Java, C#, Python, Perl, Ruby; even Lisp machines compiled Lisp to a stack-based linear language).

Note that although we are making an Oot VM, it might not be used by all implementations of Oot; optimized implementations of Oot might compile to another target, might directly interpret the AST (for JIT compilers that find this useful), or might create their own variant instruction set and bytecode (similar to LuaJIT? or Dalvik). In contrast to Oot Core, whose structure is occasionally visible to programs when reflection or other metaprogramming is used, the Oot VM is really just an implementation detail (except that some of Oot VM's limits might be inherited by the HLL, eg the limit of ~2048 local variables).

Why make our own new VM/bytecode, rather than targeting an existing one (eg JVM, .Net/CLR, Javascript, LLVM, BEAM, RPython, etc)? Because no existing VM quite meets our requirements:

Some properties we want Oot VM to have:

What do we mean by 'preserve intent', and why do we want that? We mean: don't map a specific operation S to OA code C if, by looking at the resulting C, you cannot tell that it is actually S without doing non-local analysis.

Some examples:

There are three reasons to 'preserve intent':

Efficiency: We want Oot Assembly to be reasonably efficient to interpret; however, typically efficiency should not be at the expense of preservation of intent or customizability. We expect that performance-minded platform-specific implementations might compile Oot Assembly into custom high-performance bytecode formats anyways.

Some features of Oot VM:

Encoding Format

Instructions are encoded in a sequence of 64-bit frames.

2 format bits + 62 bits depending on format

The format bits are as follows:

In the PRIMARY format, each 64-bit frame represents a single instruction. The PRIMARY format is the workhorse of OA. Most OA code can be straightforwardly represented in PRIMARY format. It is 3-operand register-based with addressing modes. You might call it a '4-operand format' because the opcode itself has an associated addressing mode, allowing 'first-class functions' to be stored in memory and called in one instruction (provided they take 2 arguments).

In the LONG format, many 64-bit frames may be combined to represent a single instruction or "sentence". The LONG format is an extremely extensible format which can represent each operand as a 'phrase' which may contain various modifiers and subphrases; this is an extensible AST serialization format. It is intended to be used when modes, modifiers, and metadata must be attached to instructions or to individual operands within instructions, as well as to allow OA to be extended to cleanly represent other programming styles.

In the SHORT format, one 64-bit frame represents 8 instructions. The SHORT format is intended to be used to concisely represent the calculation of single expressions. It is stack-based.

Short Format

1 format bit + (2 bit addr mode + 5 bit operand) + 7 x (1 subformat bit + 2 bit addr mode + 5 bit operand)

The SHORT format is a 64-bit packed grouping of 8 sub-instructions, each 8 bits long except the first, which has only 7 bits because the frame's first bit is consumed by the format bit (which is '0').

Because the leaf and interior nodes of an expression produce values which are consumed exactly once, and because the interior and top nodes consume their inputs (as opposed to assigning them to a variable from which they might be re-used in some other expression later), expressions can be represented in a concise RPN form: a sequence of operations which are all implicitly assumed to take their inputs from, and push their results onto, the stack. The SHORT format provides for arguments to be PUSHed onto the stack from any of the first 16 registers, for a sequence of any of the first 32 opcodes to be called with stack-implicit arguments, and for a single value produced at the end to be optionally POPed from the stack and stored into any of the first 16 registers. Stretches of code which require no control flow operations besides function calling and value selection, no more than the first 16 registers, the first 32 opcodes, and the first 4 addressing modes, and no metadata or modifiers, could conceivably be written entirely in this format.

The first instruction is a 'PUSH' instruction which pushes a value onto the stack. The next six instructions are each either PUSHes or operations which take their arguments from, and push their results to, the stack. The last instruction is either a POP instruction which pops a value from the stack, or another operation.

In the first subinstruction, the addressing mode and the operand specify the data which is to be PUSHed onto the stack.

In the next six subinstructions, the subformat bit chooses between 'PUSH' or 'OPERATION'. In the case of PUSH, the addressing mode and the operand are used as in the first subinstruction, to specify the data which is to be PUSHed onto the stack. In the case of OPERATION, the addressing mode and the operand are used to specify an opcode which is executed.

In the last, 8th subinstruction, the subformat bit chooses between 'POP' or 'OPERATION'. In the case of POP, the addressing mode and the operand are used to specify the effective address to which the result which is POPed from the stack is to be stored. In the case of OPERATION, the addressing mode and the operand are used as in the previous six instructions, to specify an opcode which is executed against the stack.

The inputs and outputs of the OPERATION instructions are all implicitly the data stack (stack0) (eg the inputs are POPed from the stack and the outputs are PUSHed to the stack). Note that OPERATION instructions may consume any number of inputs from the stack (todo: must this number be fixed for each operation, or can some eg consume the stack up to some terminator? i'm thinking if not fixed, then known at compile-time at least).
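To make the packing concrete, here is a minimal decoding sketch in Python. It assumes MSB-first bit packing within the frame, and assumes that subformat bit 0 selects PUSH/POP while 1 selects OPERATION; neither choice is pinned down above, so treat this as illustrative rather than normative.

    def decode_short(frame):
        """Split a 64-bit SHORT frame into 8 (kind, addr_mode, operand) triples."""
        assert (frame >> 63) == 0                  # SHORT: the first format bit is '0'
        # first subinstruction: 2-bit addr mode + 5-bit operand, always a PUSH
        subs = [('PUSH', (frame >> 61) & 0x3, (frame >> 56) & 0x1F)]
        # seven more subinstructions: 1 subformat bit + 2-bit addr mode + 5-bit operand
        for i in range(7):
            bits = (frame >> (48 - 8 * i)) & 0xFF
            subformat, addr_mode, operand = bits >> 7, (bits >> 5) & 0x3, bits & 0x1F
            if i < 6:
                kind = 'PUSH' if subformat == 0 else 'OPERATION'
            else:                                  # 8th slot: POP or OPERATION
                kind = 'POP' if subformat == 0 else 'OPERATION'
            subs.append((kind, addr_mode, operand))
        return subs

Under these assumptions, something like r2 := (r0 + r1) * r0 would occupy one frame as PUSH r0, PUSH r1, ADD, PUSH r0, MUL in the first five slots, padding operations in slots six and seven, and POP r2 in the final slot (the text does not yet specify a no-op for padding).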

Primary Format

2 format bits + (2 custom flag bits + 4 addr mode bits + 8 operand bits) + 3 x (4 addr mode bits + 12 operand bits)
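As a companion to the SHORT-format sketch above, field extraction for a PRIMARY frame might look as follows in Python, under the same MSB-first packing assumption (the actual values of the two format bits are likewise unspecified here):

    def decode_primary(frame):
        """Split a 64-bit PRIMARY frame into its fields."""
        fmt   = (frame >> 62) & 0x3      # 2 format bits
        flags = (frame >> 60) & 0x3      # 2 custom flag bits
        # opcode field: 4 addr mode bits + 8-bit operand
        opcode_field = ((frame >> 56) & 0xF, (frame >> 48) & 0xFF)
        # three operand fields: 4 addr mode bits + 12-bit operand each
        operands = [((frame >> (44 - 16 * i)) & 0xF,
                     (frame >> (32 - 16 * i)) & 0xFFF)
                    for i in range(3)]
        return fmt, flags, opcode_field, operands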

Opcodes

The first field is the opcode field. The first four bits of the opcode field are the addressing mode bits. The addressing mode is applied to the 8-bit operand to generate the opcode field value. The opcode field value is broken down into:

Long Format

todo

see ootAssemblyThoughts for the previous proposal. Note that now that we are stealing 2 format bits from the header, it would have to be modified at least a little. Maybe since we have short formats now we should ditch the 8-bit payload format; that saves 1 bit, we still need another. Also, the addr modes are probably custom, not 'split addressing'. Also, since we have the other formats now, no need to specify anything for role 6.

Also, do we want to add a 'map' that says what the child nodes will be and what their offsets are, rather than just linearly reading and encountering roles and then a stop?

Also, do we want to add backwards offsets at each child node to say how to find its parent?

Both of these ideas make sense if we think of this as an AST serialization rather than as a natural language 'stream encoding'.

Also, the format currently described at ootAssemblyThoughts was designed to permit arbitrarily long sentences, by using 'stops' instead of headers with length. But in the real world, we'd probably prefer to have a header, perhaps with a 'combine with next' semantic when the header gives the max length.

We might also want to have a map to the starting location (offset) of each child of the current node, and the role of that child. And maybe, if the node has a single parent, the offset back (or forwards) to that parent. Otoh this information is implicit in the data in any case, and who is to say that it is more efficient to store it explicitly; that depends a lot on the application. And besides, i'm not trying to make some complicated red-black tree or anything, i'm trying to make something very simple.

But if we did have a map, it would have to have length limits. And maybe the numbers 4 and 16 are useful here (and possibly 8). If we had a 4-bit role and a 4-bit offset for each child, that's 8 bits in the map, so with 8 children we'd already fill up a 64-bit frame. If we had a 3-bit role and a 3-bit offset, and 8 children, that's 48 bits, so at least we have room for a little data here, or maybe an offset to the parent. If the length needed for the children exceeds these limits, we can say 'use the parenthesis construct', which is a 64-bit frame that is mostly just a size of the subtree.

Custom Instructions

This proposal allows us to start with a tiny, very easy to implement set of primitives, and then expand them using the language itself, rather than making a tradeoff between ease of porting and expansiveness of the instruction set.

There are various types of 'custom' metaprogrammability, eg custom ops, custom types. i think this means that ops, types etc should be classified into:

My language may be a little sloppy below, since i made up that terminology right now, after writing the rest.

In addition to custom instructions, there are first-class functions that can be called like instructions by using the addressing mode on the opcode field; this functionality is referred to as 'call3'. Typically the opcode's addressing mode is immediate constant; but by using 'constant table' addressing mode, module functions can be called, and by using the other addressing modes, function pointers sitting in memory can be called (so in this last case, which function is executed is only known at run-time, not compile-time).

call3s are like macros: they substitute their 'body' inline wherever they appear, except that references to the special ARG0 memory location are replaced by the effective address indicated by the first operand, and references to ARG1 and ARG2 are replaced either by references to the values indicated by the second and third operands, or by their effective addresses, depending on whether the assigned opcode is less than 0xC00; a 'JMP' whose target is the instruction following the invocation is appended to the body. Custom instructions do create a new 'scope' with respect to label numbers: custom instructions are not aware of and cannot jump to labels outside of their own implementation, and labels within custom instructions are only accessible from jumps within those instructions; if custom instructions containing labels are inline expanded, the labels must be re-numbered so as not to conflict with labels in the calling code.

Some interpreters may implement custom instructions using a hidden stack instead of inline expansion.

Custom instructions may not invoke themselves, and they may not make use of custom instructions defined after themselves. There is no way to recursively call, much less 'tailcall', a custom instruction; when you define a custom instruction, please use iteration instead of recursion. This limitation ensures that custom instructions may be inlined (or, if they are implemented using a hidden stack, that there is a fixed upper bound on the hidden stack depth).
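As an illustration of these macro-like semantics, here is a toy inline expander in Python. The tuple-based instruction representation, the ('LABEL', n) encoding, and the helper names are all invented for this sketch, not part of OA:

    def expand_custom(body, args, return_label, fresh_label):
        """Inline-expand a custom instruction body at a call site.
        body: list of (opcode, operand, ...) tuples; operands may be the
        placeholders 'ARG0'/'ARG1'/'ARG2' or ('LABEL', n) pairs."""
        renames = {}
        out = []
        for op, *operands in body:
            new_ops = []
            for o in operands:
                if o in ('ARG0', 'ARG1', 'ARG2'):
                    o = args[int(o[-1])]     # caller's effective address or value
                elif isinstance(o, tuple) and o[0] == 'LABEL':
                    # labels get a fresh scope so they cannot collide with the caller's
                    o = ('LABEL', renames.setdefault(o[1], fresh_label()))
                new_ops.append(o)
            out.append((op, *new_ops))
        out.append(('JMP', return_label))    # resume at the instruction after the invocation
        return out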

todo: do we place some requirements on call3'd subroutines that they must complete quickly and cannot utilize the underlying platform, for the purpose of being able to use the underlying platform stack as the hidden stack without saving it when we want to return control to the underlying platform?

Custom instructions are declared by a map of opcodes to a tuple (module identifier, function).

Custom instructions must not have cyclic dependencies, and must be defined in order, such that when one of them is defined, any custom instructions used within its implementation have already been defined. (The order of the assigned opcodes may differ from dependency order.)

The form of first-class functions is the same as custom instruction declarations (with a 0 in the opcode field). If the opcode operand resolves to an integer, then an instruction (built-in or custom) is indicated; if it resolves to a pointer, a first-class function is indicated (note: this means that there must be a 'function type' which can be represented as either an integer or a pointer).

Function declarations include:

Custom addressing mode declarations include:

The first four custom addressing modes are 'direct'; their operands represent registers, and their effective address is that register; these are the only ones whose effective address may be a register.

Custom addressing modes must leave the stacks unchanged (except if they are direct and the register operand is a memory-mapped register). This makes it easier to analyze the program for changes to the stack.

TODO: actually what if we want to use the extra addr mode bits to represent per-operand custom flags? in that case we need to wrap the whole instruction, not just the argument evaluation.

There are also custom flag bits, which are defined by a wrapper that is called in place of, and wraps, the following op. Like call3s, they can look at the link register to get a pointer to the instruction that they are supposed to wrap. When they are called, INTROSPECT_INSTRUCTION is automatically called on the wrapped instruction to break that instruction into its parts (4 operands and 4 addressing modes) and put those parts on the stack. The custom flag implementation can, and typically does, use a form of EVAL to execute those parts, possibly after modifying them.
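A toy model of that wrapper protocol might look as follows; everything here (the two-tuple instruction model, the eval_parts callback) is invented for illustration:

    def introspect_instruction(instr):
        """Model of INTROSPECT_INSTRUCTION: split an instruction into its
        4 addressing modes and 4 operands."""
        modes, operands = instr              # instr modeled as two 4-tuples
        return list(modes) + list(operands)

    def trace_flag(wrapped, eval_parts):
        """Example custom flag: log the wrapped instruction's parts, then
        execute them via a form of EVAL (possibly after modifying them)."""
        parts = introspect_instruction(wrapped)
        print('about to execute:', parts)
        eval_parts(parts)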

Requirements on custom flag bit semantics:

OA does not support variadic instructions.

Self-modifying code per se is not supported, although EVAL is.

Addressing Modes

There are 4 built-in 'core' addressing modes and 12 custom addressing mode slots.

Built-in addr modes are:

'immediate constant' in the output addressing mode with operand '0' means 'discard'. Other operands for 'immediate constant' addressing mode in the output are forbidden, except in certain compile-time metaprogrammy operations, for example IMPORT.

'constant table' in the output addressing mode is forbidden unless the constant referenced is a global static pointer.

some ideas for other addressing modes:

Modules

Modules must declare custom instructions, custom addressing modes, and their constant table. All of these declarations must be done completely at compile-time. Custom types don't need to be declared; they are just stored as pointers.

There are two places to declare custom instructions: as unused opcode immediates, and as function-typed entries in a constant table. There (probably) is a separate constant table for instructions (ie for when the opcode field has constant table addressing mode). todo: is there any difference between custom instructions as immediate opcodes, and as function-typed entries in a constant table? I'm thinking that the latter would be more like library functions and the former would be more like new 'instructions', but what is the difference? Perhaps the former is hardwired as builtins in the OA implementation and the latter is per-module; or the former is per-whole-program (allowing faster dispatch) and the latter is per-module. Or the former can be either OA code or hardwired in the implementation, while the latter can be OA code or an external C library. Or the latter can be used by a module loaded at runtime but the former cannot. Or something. Currently my best guess is that only constant-table customs are per-module; the others are either per-program or part of the implementation (perhaps an interchangeable part, though; eg each program can specify an Oot 'profile', different ones of which have different selections, and some profiles are implementation-dependent).

Modules may import other modules. This is done by listing the other module in the imports table within the module info table. There is also a runtime instruction IMPORT. This instruction must use immediate-mode addressing to specify the destination integer identifier (this is to allow the implementation to know what module identifiers are being used at compile-time).

Modules may also import custom opcodes, types, and addressing modes from other modules into their own. This is done by listing the other module in the metaprogramming imports table within the module info table. There is no way to do this at runtime. Trying to import metaprogramming from different modules that define overlapping opcode, type, or addressing mode identifiers is a compile-time error.

There is a special module BUILTINS. This is meant to be an implementation-dependent module that contains opcodes, types, and addressing modes that might be given special treatment by an Oot Assembly implementation. By tracing imports of opcodes, types, and addressing modes from this module, the implementation can know at compile-time where its optimizations can be applied. The semantics of standard opcodes, types, and addressing modes, even non-core ones, must not be altered in the BUILTINS module's implementation; however BUILTINS can also define non-standard opcodes, types, and addressing modes. (todo is this a good idea? or just autoimport all implementation-dependent builtins to EVERY module?)

The BUILTINS module is auto-imported at module identifier 0. The current module is auto-imported at module identifier 1.

Absolute memory addresses (constant table mode), eg those in the operands of JMPs, are all implicitly module relative (pointers, however, can be shared between modules). Integer JMP addresses refer to a count of Oot Assembly instructions in the module.

Each module has:

Length fields are 16 bits, but their values cannot exceed 4k. Lengths count how many items each table has (not the number of bytes the table takes up).

(todo: is this really the best format? should we (just) place the lengths in the tables themselves, or only have lengths and let the pointers be computed?)

Symbol names data is the data which can be stripped for more compact builds. It contains ASCII string names and source maps for the module and for each defined custom addr mode, custom opcode, and function declaration.

Types

There are some built-in types that don't need to be declared (the first 16, perhaps?).

types todo: probably want at least integer, floating point, string (bytestring or unicode or both?), object?; the INSP of Parrot, and mb also of MoarVM?

another idea for 4 core types:

- pointer
- int16
- other/custom static type (has already been statically typechecked)
- dynamic (boxed/wrapped, tagged type; runtime typechecking requested)

an idea for another 12 'base' types:

- boolean/single bit
- unsigned int16
- byte
- float64
- arbitrarily large/precise rational
- unicode string
- array
- cons cell/linked list/binary tree
- graph/hash/dict (opaque, as opposed to AST)
- function/callable, thunk/promise/future/lazy value
- AST/expression/block/logic
- greenthread/process/channel(/actor?)

(note that some implementations might not support all of these types; eg an embedded implementation might not support floating point, arbitrarily large/precise rational, unicode)

OA is typed in the sense that every scope has a type declaration for every register it uses, and every instruction has a declared type signature; there is, however, a 'dynamic' type. 'dynamic' type means that the location can hold a value of any boxed type. Boxed types have a type tag. Values in locations with 'dynamic' type must be 'coerced' via a COERCE instruction to their actual type before they are moved to locations with their actual type, or passed as arguments to instructions demanding their actual types; the COERCE instruction causes a runtime exception if the value isn't actually the correct type (as determined by its type tag). Type identifiers are 12-bit integers.
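A sketch of what boxing and COERCE could look like, assuming a boxed value carries its 12-bit type tag; Box and CoerceError are illustrative names, not part of OA:

    class CoerceError(Exception):
        pass

    class Box:
        """A 'dynamic' value: the value plus its 12-bit type tag."""
        def __init__(self, type_tag, value):
            self.type_tag, self.value = type_tag, value

    def coerce(box, expected_tag):
        """COERCE: check the runtime type tag before the value may flow into
        a location or instruction demanding a specific type."""
        if box.type_tag != expected_tag:
            raise CoerceError(f'expected type {expected_tag}, got {box.type_tag}')
        return box.value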

There are also a few special memory locations whose type varies by scope, namely the call3 inputs and output registers. In addition, there are stacks; stacks are typed via 'stack maps' like in the JVM. Stack maps are like regexps, and can contain unbounded repetitions of patterns. Instruction type signatures include a specification of their alteration of the stack (in the form of eg: upon entry, i expect the top of the stack to contain an int8 and an int16, which will be consumed; upon exit, the top of the stack will contain an int16 (note: the instruction does not HAVE to consume the inputs, but if it does not, it must list these as outputs)).

(i suppose in the end we also need a way to coerce raw bytes to typed something, for interop?)

Some instructions have polymorphic forms, eg forms whose input type signature includes 'dynamic'. Some of these instructions also have specialized forms, that is, forms where the 'dynamic' in the type signature is replaced by a specific type. In this case, the instruction will have the same mnemonic but a different opcode, and the Oot implementation handles dispatch to the specific form if there is one (by, at runtime, checking for a specific form before each call to an instruction with a dynamic input type signature).
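Reusing the Box sketch above, that runtime dispatch might be modeled like this (the two-table layout is an assumption for illustration):

    def dispatch(mnemonic, boxed_args, specialized, generic):
        """Try a form specialized to the runtime tags; otherwise fall back to
        the polymorphic form (same mnemonic, different opcode)."""
        tags = tuple(a.type_tag for a in boxed_args)
        impl = specialized.get((mnemonic, tags))
        return (impl or generic[mnemonic])(*boxed_args)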

todo i think this is out of date: Just like memory locations are typed, the stack is typed. But since the program is allowed to arbitrarily manipulate the stack, we cannot always statically know what the 'type signature' of the stack (the stack map) will be. So long as the number of things on stack0 is predictable, and stack1 is only being accessed implicitly by CALLs and RETURNs and their associates, we can infer the stack map (like the JVM verifier does (used to; now i think stack maps are made by the compiler)). But if the program wants to change the stack in a weird, unpredictable way, then it has to include assertions saying "i'm about to add an arbitrary number of things to the stack" and "i'm about to directly manipulate stack1, consuming n elements" (so that the compiler can remember that, at that point in time, the stack-so-far above the n consumed elements is known, but everything added on top of that is unknown). When the program is done doing weird stack writes, it asserts how much it consumed, and the rest of the implicit stack map governs again. If the program wants to read the unpredictable regions of stack0, or any of stack1, then each time before reading it must assert that there is no stack underflow, and also assert the type of the thing it reads (which can be 'dynamic'). This allows the program to use the stack efficiently in normal circumstances; to manipulate the stack arbitrarily if it wants to, at the expense of runtime checks; and to do one-time manipulations (such as handling exceptions) which leave the stack in a predictable state, after which efficient stack usage can resume.

Although OA is designed for optional static typing, unlike eg the JVM, there may not be any typechecking done at runtime; typechecking is provided as a separate OA program (written in OA). OA users are urged to typecheck any untrusted or newly written code before running it.

(do we allow 'subtypes'? eg can we write a function that says "i'll take anything which is an int or a float" and then assert at runtime that some value is an int or a float, without saying which one?)

Type declarations include:

Values of the 'pointer' type are not necessarily 'pointers' on the underlying platform; they are simply some implementation-dependent reference value that allows dereferencing via the indirect addressing mode, and 'pointer arithmetic' to traverse adjacent fields. A 'pointer' could be represented by a machine pointer although with information about the width of record fields, by an integer into an array of emulated 'memory locations', by an object with methods to 'get' and 'set' its value and to get the 'next' and 'previous' locations, etc. Dereferencing a 'pointer' could involve calling garbage collection read-barriers, and writing to it could involve calling GC write-barriers, all 'under the hood'.
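For instance, a 'pointer' over an array of emulated memory cells might be modeled as below; this is one illustrative representation, not a requirement:

    class EmulatedPointer:
        """A 'pointer': get/set plus next/prev, over emulated memory cells."""
        def __init__(self, cells, index):
            self.cells, self.index = cells, index

        def get(self):
            # a real implementation might invoke a GC read barrier here
            return self.cells[self.index]

        def set(self, value):
            # ...and a GC write barrier here
            self.cells[self.index] = value

        def next(self):
            return EmulatedPointer(self.cells, self.index + 1)

        def prev(self):
            return EmulatedPointer(self.cells, self.index - 1)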

Memory

4k local variable locations or 'registers' (or fewer; each module header says how many it needs; perhaps memory sizes are always powers of two (this apparently makes eg pointer tagging easier))

memory map and memory-mapped special locations:

0: PC (program counter) (do we forbid writing to this? i think so; we want to make programs use control flow ops like JMP for easier analysis)
1: ERR
2: stack0 push/pop
3: direct read/write access to the top of stack0 (without pushing or popping)
4-7: GPRs
8: stack1 push/pop
9: direct read/write access to the top of stack1 (without pushing or popping)
10: current size of stack0
11: current size of stack1
12: MODE
13: ARG0 (call3 operand0/output)
14: ARG1 (call3 operand1/usually input)
15: ARG2 (call3 operand2/usually input)
0x010-0x7FF: general purpose registers (GPRs) (accessible in the opcode field)
0x800-0xFFF: general purpose registers (GPRs) (not accessible in the opcode field)

note: this is not to say that the virtual machine can only access 4k of memory, just that it has only 4k of registers. A pointer can be placed in any of these memory locations, and that pointer can point to somewhere else. Likewise, the PC can point to memory locations outside of this map. These alien memory locations are always considered to be 'higher' than this memory (eg if compared, an alien memory address is 'greater than' 0xFFF) but don't have to actually be that way in the underlying hardware. The alien memory addresses can be incremented as usual. These 4k locations are the only ones that can be accessed in direct addressing mode; however other locations can be accessed via indirect addressing mode. There are also absolute JMPs, which although 'absolute' are actually relative to each module.

There is no way to get a pointer to a register.

I call these locations 'registers' but really they are more like local variables.
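A sketch of how reads and writes to the memory-mapped locations might be dispatched; the VM class and error type are stand-ins, and ERR's mode-dependent write semantics (see the Errors section) are omitted here:

    class VMError(Exception):
        pass

    class VM:
        def __init__(self):
            self.registers = [0] * 0x1000      # the 4k register file
            self.stack0, self.stack1 = [], []

    def write_register(vm, n, value):
        if n == 0:
            raise VMError('writes to the PC are forbidden; use JMP')  # per the note above
        elif n == 2:
            vm.stack0.append(value)            # stack0 push
        elif n == 3:
            vm.stack0[-1] = value              # overwrite top of stack0
        elif n == 8:
            vm.stack1.append(value)            # stack1 push
        elif n == 9:
            vm.stack1[-1] = value              # overwrite top of stack1
        else:
            vm.registers[n] = value            # ERR/MODE/ARGn/GPRs

    def read_register(vm, n):
        if n == 2:
            return vm.stack0.pop()             # stack0 pop
        if n in (3, 9):
            return (vm.stack0 if n == 3 else vm.stack1)[-1]  # peek without popping
        if n == 8:
            return vm.stack1.pop()             # stack1 pop
        if n in (10, 11):
            return len(vm.stack0 if n == 10 else vm.stack1)  # current sizes
        return vm.registers[n]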

If you do an operation on 'the contents of register 7' but the operation calls for some struct, then it is implicit that register 7 actually holds a pointer to that struct. If the contents of register 7 is typed as a struct (of some specific type), then the semantics of 'copy register 7 into register 8' is to copy the STRUCT (with COW? maybe not, if it is not boxed), not the pointer.

It is illegal and undefined to read from uninitialized registers.

The machine handles much of memory allocation/garbage collection transparently.

Stack

There are two stacks, intended to be used as a data stack (stack0) and a call stack (stack1). However, these stacks have different properties, and furthermore are related. The state of the part of the data stack relating to the current (context? block? function?) (eg the current 'stack frame' or 'activation record') must be statically known at each line of code ('state' meaning its height, and the type of each object in it).

The call stack contains only one type of item, 'stack frames'/'activation records' (or pointers to them), which each contain a PC (program counter), registers (local variables), and the associated data stack 'frame'; however, unlike the data stack, the call stack can be arbitrarily manipulated, that is, there is no need to statically know its height.

If a program attempts to access beyond the end of either stack, or even beyond the end of the current 'stack frame' within the data stack, then the program can get any value as a result, or an error might occur.

Note that as we pass between different (contexts? blocks? functions?; is the current whatever-it-is at the head of the call stack?), the current stack frame of the data stack changes. If the call stack is manipulated, eg if two items are SWAPped, then it's as if the currently-unreachable part of the data stack below the current stack frame also changes. So in a way, manipulating the call stack also manipulates the data stack, but not the other way around (however, manipulating the call stack doesn't change the current data stack frame).

Note that the call and data stacks can/should be thought of, and implemented, as a single stack, composed of 'activation records' each of which contains a PC, registers, and a mini data stack. Whether the activation records are actually on this single stack, or whether the single stack contains pointers to heap-allocated activation records, or some mixture, is an implementation detail (probably the stack actually contains the activation records/stack frames when there is no need for a closure that outlasts the relevant dynamic context, and contains pointers to activation records when a closure over a record may be needed). The separation of data stack/call stack may be thought of as (a) a safety mechanism, to make it clear to which part of the stack the mandatory typechecking applies, and (b) an encapsulation and convenience, to prevent OA code from having to directly deal with stack pointers or the structure of activation records.
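A minimal model of that single combined stack, with activation records as plain data; whether records are stored inline or behind pointers is left to the implementation:

    from dataclasses import dataclass, field
    from typing import Any, List

    @dataclass
    class ActivationRecord:
        pc: int                                           # saved program counter
        registers: List[Any]                              # this frame's local variables
        data_stack: List[Any] = field(default_factory=list)  # this frame's mini data stack

    call_stack: List[ActivationRecord] = []   # stack1: records, or pointers to them

    def current_data_frame():
        """The current data-stack frame belongs to the record atop the call stack."""
        return call_stack[-1].data_stack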

Modes

The MODE register sets various mode bits. By default, it starts out in mode 0 (all bits zero).

bits in MODE from MSB to LSB:

Aside from 'external' state such as memory and I/O, and the stack, OA itself has no 'hidden state'; the state is visible in the registers.

Errors

There are four error handling modes, controlled by the MODE register.

In 'crash' mode, any attempt to write a non-zero value to the ERR register is immediately fatal.

In 'handle' mode, any attempt to write a non-zero value to the ERR register succeeds and then immediately causes the current value of the PC to be pushed onto stack1, and then causes a JMP to the beginning of the program. Note that programs run in this mode should be sure to check the ERR register at the beginning and if it is non-zero, to 'handle' the error in some fashion.

In 'restart' mode, any attempt to write a non-zero value to the ERR register immediately causes the stacks to be emptied and a JMP to the beginning of the program. Note that 'restart' could be emulated by putting a routine to clear the stacks and the ERR register at the beginning of the program and running in 'handle' mode.

In 'tolerant' mode, any attempt to write to the ERR register succeeds.

Note that it is possible for a subroutine to leave the ERR register in a non-zero state by saving the current MODE, setting the error-handling mode to 'tolerant', setting the ERR register, then restoring the error-handling mode. By this method (or by the NONFATALERROR opcode, which does the same thing), a subroutine can return an 'error code' which is zero (unchanged) upon success and non-zero otherwise, which may be more convenient than returning multiple values via the stack.
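The four modes can be summarized as ERR-write semantics; in this sketch the numeric encoding of the mode bits and the VM plumbing are assumptions:

    CRASH, HANDLE, RESTART, TOLERANT = range(4)   # assumed encoding of the mode bits

    class FatalError(Exception):
        pass

    def write_err(vm, value):
        if value == 0:
            vm.err = 0
            return
        if vm.error_mode == CRASH:
            raise FatalError(value)          # immediately fatal
        elif vm.error_mode == HANDLE:
            vm.err = value                   # the write succeeds...
            vm.stack1.append(vm.pc)          # ...the current PC is pushed onto stack1...
            vm.pc = 0                        # ...and control JMPs to the start of the program
        elif vm.error_mode == RESTART:
            vm.stack0.clear(); vm.stack1.clear()
            vm.pc = 0                        # stacks emptied, restart from the beginning
            # (whether ERR stays set across the restart is not pinned down above)
        else:                                # TOLERANT
            vm.err = value                   # the write simply succeeds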

Calling

There are (at least) three explicit CALL instructions, in addition to CALL3 functionality:

Limits

Note that although the instructions can only hold 12-bit immediate values, other addressing modes can be used to give larger values. Unless otherwise specified, anywhere an integer may be given, its size may range up to 64k - 1 (and possibly larger in some implementations). Many things have a limit of 4k - 1, however. These include:

(todo: is the 4k limit really needed for any of these?)

OA implementations may also limit the number of allowed labels. The limit must be at least 4k.

FLAGS (lack thereof)

There are two flag bits per instruction.

Note that further use-cases for operand-level flags can be emulated by custom opcodes.

Boxing

in the boxing:

Custom flag bit suggestions

Atomics and memory model

todo

how about:

but mb we won't have shared memory at all?? in which case we need at least some IPC/message-passing primitives instead.

Concurrency

The Oot VM offers only cooperative multitasking. The MAINT instruction must be called to cause the VM to consider yielding control to another greenthread, and one blocking greenthread causes the others to block.
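A minimal sketch of this cooperative scheme, modeling greenthreads as iterators of opcodes and treating 'MAINT' as the only switch point; the round-robin policy is an assumption:

    from collections import deque

    def run(threads, execute):
        """Round-robin greenthreads, switching only at MAINT instructions."""
        ready = deque(iter(t) for t in threads)
        while ready:
            t = ready.popleft()
            for op in t:
                if op == 'MAINT':
                    ready.append(t)          # consider yielding: rotate to the next thread
                    break
                execute(op)                  # no preemption between MAINTs

    # two threads interleaving at their MAINT points: prints a, c, b, d
    run([iter(['a', 'MAINT', 'b']), iter(['c', 'MAINT', 'd'])], execute=print)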

Note that Oot Core is preemptively multitasked; the compiler/interpreter inserts MAINTs upon:

and avoids blocking (or, creates another OS-level thread if a potentially blocking operation must be called).

An object passed between processes cannot contain pointers to outside of itself.

Garbage collection

todo: do we garbage collect cyclic references (like Python) or do we require weak pointers to break the cycle (like Apple Objective C ARC)? Do we have mark-and-sweep style GC or reference counting-style or some combination? Are we like Golang? Like Python? Like Inferno?

Garbage collection is only considered upon memory allocations (if there is no more memory available; todo do we do garbage collection before requesting more memory from the OS, or only if the OS says that no more memory is available?) or upon calls to the MAINT instruction.

todo: how does the HLL tell the VM where pointers are in data structures? What about pointers to locations 'inside' data structures (as opposed to the head of data structures)? do we have read barriers? write barriers? handles?

Typechecking

Explicit stack maps are not required (because requiring them would make it harder to write a code generator), and stack typechecking is not guaranteed (because it would increase startup time), but code that would fail stack typechecking is nevertheless illegal. A stack map inference engine and checker will be written in Oot and included in the Oot distribution; perhaps the executable will use it by default, or at least have an option to require it before executing code (other than the Oot language implementation itself).

Notes on some instructions

JMP: if given a pointer, JMPs to that pointer (can you JMP to 'data' locations or just locations marked as 'code'?); if given an integer argument, jumps to the n-th Oot Assembly instruction.

NONFATALERROR: see above.

Core instructions

These instructions must be implemented by someone porting OA. All other 'base' instructions are ultimately reduced to these.

When an instruction 'returns' a value, this means that the contents of effective address of operand 0 is set to this value.

probably not core but still considering:

add-on profile modules:

More Instructions (non-core instructions)

Note that Oot Core can be simpler than Oot Assembly in that there may be instructions in here which are not Oot Core primitives, but rather correspond to Oot Core Library constructs. In this case the compiler/interpreter uses the 'jet' system to recognize the Oot Core Library call and override it with a single Oot VM instruction.

functional:

collections:

hashing:

automata:

comparisons:

arithmetic:

moves:

control:

stack ops:

graph:

memory:

system calls:

promises:

(todo: what else?)