proj-oot-ootAssemblyThoughts

Oot Bytecode and Oot Assembly

See https://gitlab.com/oot/ovm

Introduction

Oot Bytecode (OotB?) is a simple, easy to implement virtual machine. Oot Core is implemented in OotB?. OotB? only has a few 'primitive' instructions; this language is called 'primitive OotB?'. The OotB? Stdlib contains implementations of other instructions built out of the primitive instructions. This makes it easy to bootstrap a naive implementation of Oot; all you have to do is to implement an interpreter for primitive OotB?.

Motivation

Why did we invent yet another bytecode format? Design goals include:

Popular existing bytecodes are not 'very easy' to port to a new platform, and are too 'low level' to directly preserve high-level language constructs (HLL constructs could be preserved via annotation, but then the annotation format effectively forms another programming language anyways).

Oot Bytecode is very easy to initially implement because it has so few primitive instructions. This naive implementation will be very slow but can be incrementally improved by overriding the default interpretation of various stddlib instructions with custom versions using the target platform's native constructs.

Oot Bytecode preserves high-level language constructs via the 'stdlib' instructions that are built out of the primitives. This allows a compiler to an HLL to emit somewhat readable target code that uses the platform's native data structures just by recognizing those stdlib instructions that have an obvious counterpart in the target platform.

Note that we want to be able to efficiently compile the Oot Core implementation from OotB? to even a very inexpressive, static language. Therefore we try to limit run-time dynamicity in OotB? and in the Oot Core implementation.

Overview

Oot Bytecode is the object language of an imperative virtual machine with very few primitive instructions. We hoped that typical Oot Bytecode will be able to be compiled with relatively little effort into relatively readable code in foreign HLLs.

It is similar to the ISA of a CISC 3-operand memory-memory machine, but it is more abstract (perhaps similarly to Lua bytecode). It abstracts away from details such as data representation format, and its 'memory locations' or 'registers' have semantics more similar to local variable identifiers rather than an array of bytes (in this way it is more similar to eg Lua bytecode). It has operations analogous to pointer arithmetic, but these actually correspond to moving across fields in a struct rather than to indexing into an array of bytes. It is relatively agnostic regarding the actual bit width of integers, pointers, etc; in fact, a single item in the constant table can represent an arbitrarily large object. It is a memory-memory (rather than register) machine, but it can directly address (with absolute immediate addressing) only 4k 'memory locations'; however, opaque pointers can point to other 'memory locations' beyond what can be directly addressed.

It has 8 addressing modes, plus a 'boxed' addressing mode bit. The use of 3-operand format and 8 addressing modes is designed to reduce the need for intermediate instructions moving data between virtual registers, which makes it slightly harder to compile programs in other VM architectures into readable HLL code.

It has 3 instruction encodings, SHORT, MEDIUM, and LONG, all three of which consist of a sequence of 64-bit units. SHORT packs 8 fixed-length instructions into each 64-bit unit; MEDIUM is a fixed-length encoding with 64-bits per instruction; and LONG is a variable-length encoding. MEDIUM is the basic encoding; SHORT is a denser encoding that can only represent a subset of instructions; and LONG is an extensible encoding.

The compilation unit for Oot Bytecode is an Oot Bytecode Module. Modules have important limits as a consequence of length limitations in the 64-bit MEDIUM encoding; most notably, the number of functions directly exported by one single module cannot exceed 2048 (a single module can, however, also export other modules, and collectively these can export more than 2048 functions).

Oot Bytecode has polymorphic typing and also has first-class functions and has a function calling syntax in which the opcode field of the instruction, instead of containing a number identifing a Oot Bytecode Core instruction, contains a reference to a memory location which itself contains a reference to a particular function. This is accomplished by giving an addressing mode to the opcode (hence, the opcode field is called the 'opcode operand').

In addition, many integer opcodes are assigned to refer not to Oot Bytecode Core primitive instructions, but rather to 'custom instructions' implemented in the Oot Bytecode Standard Library. These opcodes do not need to be implemented by an Oot Bytecode implementer; they need only implement the Oot Bytecode Core instructions, since the Library is ultimately defined in terms of those primitives. Advanced implementations can treat the custom instructions as compile-time macros and expand and inline them at compile-time; or they can replace the provided implementations with platform-native implementations.

Oot Assembly, the textual language, also has a separate compile-time macro facility, which are expanded during compilation to Oot Bytecode. Why have 'custom instructions' if you already have compile-time macros? Because Oot Bytecode 'custom instructions' can more easily be replaced by with platform-native implementations by an interpreter.

SootB

todo now it's the same as SHORT, plus 64-bit LOADI (16-bit immediates) with restricted addr modes, and maybe a STOREI that can write to 11-bits of registers

SootB? is a simple stack-machine single-threaded subset of OotB?. SootB? is designed to be really easy to implement an interpreter for. There is a compiler from OotB? interpreter to SootB? implemented in SootB?, so this allows you to bootstrap OotB? and hence Oot from just a SootB? interpreter. This will be slow and you will probably want to eventually extend your SootB? interpreter into a full OotB? interpreter, but in the meantime you'll have a working Oot installation. Since SootB? is a subset of OotB?, you can work on extending it incrementally (however you won't experience any speed gains until you are all done, at which point you can throw out the provided OotB?-on-SootB? interpreter and replace it with your own native OotB? interpreter).

In the following, TOS stands for 'top-of-stack', the value on top of the stack.

A SootB? machine has program code, a program counter, a stack, a linear memory, and an error register. It is untyped. Each location in the stack and the memory can hold a value which can be treated alternately as an unsigned short integer (which can at least hold values between 0 to 32767, inclusive), or as a pointer. Pointer arithmetic and arithmetic between pointers and integers is permitted.

Program inputs including any file descriptors needed for console I/O are placed on the stack before starting the program. The program terminates when a HALT instruction is encountered.

SootB?