Boot (Oot Boot) design choices discussion

:hardbreaks:

Motivation

Boot is one component of the larger Oot project, which aims to create a high-level programming language. Two of the goals for Oot are to be:

an interoperable glue language
easily portable

The plan is to build four layers, each of which is implemented upon the previous:

Boot, a low-level assembly-language-like VM which is dead simple to implement
Ovm, a "high-level assembly" VM with extensible instructions and data structures; these can be implemented by platform primitives where supported, and by Ovm code where not
Oot Core, a simple and powerful, but verbose, high-level language
Oot, which is Oot Core but with syntactic sugar and other conveniences added via metaprogramming, as well as a standard library

Boot is not just a substrate of Ovm, but also a subset of it. In addition to being the platform that the reference implementation of Ovm is implemented in, the Boot instruction and register set are a subset of Ovm's, and the Boot encoding is a dense encoding variant in Ovm.

It is not the case that all implementations of Oot or Ovm will be built on top of Boot. In fact, good interoperability and performance can most likely only be achieved by implementing Ovm directly on a target platform, or even by directly implementing Oot Core or Oot. However, Boot provides a 'good enough' platform that is quicker to port. In addition, because Ovm is an extension of Boot, the effort spent to create a Boot implementation can for the most part be re-used towards creating an Ovm implementation.

Boot is not designed to be directly written by humans; Ovm is the lowest level that is intended to be directly written. Rather, Boot is designed to be easy to implement, so that the whole stack is easy to port. If the only goal is easy implementation, then since Boot has stacks, why is it not just a stack machine, instead of also having registers? This is because Boot is also supposed to be a useful subset of Ovm; both so that a subset of Ovm can be more compactly encoded (Boot has 16-bit instructions, whereas Ovm MEDIUM encoding has 64-bit instruction), but also so that an implementation of Boot can be incrementally extended into a direct implementation of Ovm.

Why not just use LLVM, WebM, Lua, JVM, .NET, C, RISC-V, or MIPS?

To achieve the goal of creating an interoperable 'glue language', I want to make it easy for other people to create (slow, poorly integrated) implementations of Oot that are written in, and run on, a wide variety of platforms. For example, I want people to be able to implement Oot in Python, in Haskell, in Java, in C, in Rust, in Octave, in C#, in Javascript, etc. If Oot were implemeted in, say, LLVM, then that would mean creating an LLVM backend for each target platform. It seemed to me that all of these other 'common VMs/ISAs/languages' are too much trouble to implement/port/create backends for -- enough trouble that re-implementing one of them on a new platform would be a 'serious project' rather than something that one person who is not a programming language expert might attempt to do over a weekend.

In addition, many of these other VMs are not 'assembly-language-like', but rather, make assumptions about the sorts of control flow that might occur in the high-level languages that they support (typically they make 'structured programming' sorts of assumptions and encapsulate some aspects of the callstack); this can make it hard (or at least annoying and/or inefficient) for languages written on top of these VMs to provide things like tailcalls, continuations, coroutines (which may have to save/restore the stack when switching contexts), resumable exceptions, debugging facilities that inspect the stack, etc. Assembly language circumvents these restrictions by providing first-class stacks and 'goto'; therefore Boot is an 'assembly-language-like' VM.

Why have both Boot and Ovm? That's a lot of layers.

Originally I planned to just have Ovm, but when I first tried implementing it, i found that it was too complicated to implement over a weekend (well, i might be able to re-implement it over a weekend, but i couldn't ask that of someone else who doesn't know Ovm as well as i do).

Well what about only having Boot, and not Ovm?

I don't want to get rid of Ovm because:

Ovm allows parts of the platform to be implemented via platform primitives on platforms where appropriate primitives are available, but by Ovm source code on platforms where they are not. This is not just good for efficiency, it's also key to facilitating interoperability, because it allows the high-level language to use platform-native data structures throughout.
the reference implementation of Oot Core will be written in Ovm, and it would be a pain to write and debug a high-level language written in low-level assembly.

Why only unsigned integers? Why only a minimum bitwidth instead of a fixed bitwidth? Why no division instruction?

To facilitate ease of implementation on many platforms, in Boot we're looking for the lowest common denominator. We can implement these other things at the Ovm level.

Why 16-bit?

8- and 16-bit processors appear to be unnecessarily spartan; even in the embedded space there are lots of 32-bit processors available. But, within the family of embedded processors, 16-bits appears to be squeezed by 8-bit and 32-bit; that is, there are many 32-bit processors, and several 8-bit processors, but only a few 16-bit processors. So, if we are not going spartan, why not use 32-bit; and if we are, why not use 8-bit?

We chose 16-bit rather than 8-bit because many 8-bit processors use pointers of size greater than 8-bits. 16-bits seems to be the smallest multiple of 8-bits that is wide enough to address a 'decent' amount of memory (for example, the CPUs in old Apple IIs and Commodore 64s could address 16-bits of memory). We are going for simplicity here, and it's simpler if we assume that a pointer can fit in one word of memory.

Why am I even interested in going as low as 16-bits? The ultimate purpose of Boot is a substrate for Oot, and we don't imagine that a high-level language like Oot will be the best choice for the manual, resource-constrained projects that often characterize embedded systems programming; so why should we care about a small bitwidth? The reason is that I'm interested in creating a programming language that might be good for making massively concurrent programs, because I think the brain works that way (and by massively concurrent, I don't mean hundreds of processors, I mean tens of thousands, or preferably billions). In order to achieve this amount of concurrency, I imagine that a hardware architecture would probably scale way back on per-processor complexity. I doubt that the elementary components of the brain (whether these turn of to be synapses, neurons, or clusters of neurons) compute with a high degree of numerical precision -- and in fact, I believe that the brain's components have a lot of noise/errors, which would further limit accuracy to even less than whatever precision it has. So, I suspect that hardware that supports massive concurrency may be composed of individual processors that are simpler (smaller bitwidth, etc) than we are accustomed to today. Of course, it's exceptionally unlikely that I can guess future architectural specifications to a degree that I can make Boot compatible with a class of machine that doesn't yet exist --- but by limiting ourselves in this way we at least make it more likely that it will be less troublesome to create a variant of Boot in the future, than it would be if we based everything on, say, 64-bit floating point.

Interestingly, there's some evidence that artificial neural networks can use 8-bit values for execution, and 16-bits for training (although stochastic rounding, which is non-standard, may be needed). [1] [2] [3]. This is not conclusive, but merely adds one more application to the list of applications that don't need more than 16-bits.

Why not some number of bits in between 8 and 16? Practically, there aren't many implementations that use something in between, and more subjectively, 16 is a power-of-two which seems aesthetically appropriate for systems based on binary logic.

Tangentially but interestingly, 16 is a nice number subjectively/aesthetically/'numerologically', and has lots of connections to the numbers 2 and 4, and also has some symmetries relating to exponentiation of 2s and 4s. 16 = 2*2*2*2 (4 2s multiplied together). 2^(2^2) = (2^2)^2 = 16; 16 is the last power of two that can be written that way 'associatively' because exponentiation is not associative and for larger series of 2s raised to each other, the order of operations matters, eg (222)2 != 2(222). In fact, 16 is the only integer n such that there are distinct integers x,y such that: x^y = y^x (eg 2^4 = 4^2). Because 16 = 2^(2^2), it is a member of the sequence 1,2,4,16,65536,..., which starts with 1 and then raises 2 to the previous sequence member. This sequence has an 'application' in the following sequence of bitwidths: 1-bit; the bitwidth whose bits can themselves be addressed by a 1-bit address (which is 2-bit); the bitwidth whose bits can themselves be addressed by a 2-bit address (which is 4-bit); the bitwidth whose bits can themselves be addressed by a 1-bit address (which is 2-bit); the bitwidth whose bits can themselves be addressed by a 4-bit address (which is 16-bit); the bitwidth whose bits can themselves be addressed by a 16-bit address (which is 65536-bit); (also, 16 is the last widely available member of this sequence, because 65536-bit bitwidths is obviously way bigger than found in present-day commodity implementations). Another way to think of this 'application' is: start with a bit as an primitive index; you can index 2 items. Now, you can use your two-item memory to store 2 1-bit indices; this allows you to build a memory that can index 2^2 = 4 items (you have 2 dimensions of indices, and the index in each dimension is 1 bit, so you can index 2*2 = 4 items); now, you can use your 4-item memory to store 4 1-bit indices; etc. And because 16 = (2^2)^2, it is a member of the sequence 2,4,16,256,..., which starts with 2 and then squares the previous sequence member (perhaps there is a similarly compelling computersy 'application' for this sequence?). Simplifying the numbers in parentheses of 16 = 2^(2^2) = (2^2)^2, we see that 16 = 2^4 = 4^2. 16 is a power of a prime (2^4); since 4 is the first composite number, 16 is the first power of a prime where the number that the prime was raised to is composite. 16 is a also a square number (4^2). 16 is a 'practical' number (defined as a positive integer n such that all smaller positive integers can be represented as sums of distinct divisors of n) (all powers of 2 are practical). 16 has 4 identical prime factors (if you count multiplicity), and it is equal to 4*4. The 16-cell (sometimes called a '4-D hyperoctahedron', '4-dimensional cross-polytope' or '4-D orthoplex', or a 'hexadecachoron'), is one of only 6 regular polytopes in 4-D, and it has 16 faces. The 16-cell honeycomb is one of only 3 regular tesselations of (Euclidean) 4-D space. Of course, all small integers have lots of 'special properties' like these, and 8 has a similar list. However, 8 doesn't have the x^y = y^x property, nor can it be represented as a bunch of 2s exponentiated together, and although it's a power of 2, 8 is the third power of 2, and 3 is not a power of two; and its geometric relevance is mainly to 3-D, in contrast to 16's relevance to 4-D, and again 4 is a power of 2 and 3 is not; so in a way 16 seems to be more 'purely' connected to 2 and its powers, whereas 8 has a strong connection to 2 but also a connection to non-power-of-two 3 (in place of 16's connection to 4). Since usage of a number as a bitwidth is intimately connected both to exponentiation and to 2, aesthetically, 16 seems to have some qualities to recommend it for this purpose, even if 8, the next smaller power-of-two, is smaller and is therefore simpler. 32, the next larger power-of-two, lacks these qualities and is also larger, so aesthetically it's not as nice for this purpose. 4 is even more simple than 8 and unlike 8 is 'purely' connected to 2, but a bitwidth of 4 is clearly 'too small' in that it's not very available in commodity implementations. This 'numerological' stuff may not have any real practical bearing on the decision, but it shows that 16 is an aesthetically nice choice for a bitwidth, in a way that 8 and 32 are not.

CAP register

Although its behavior is left open, the CAP register is intended (but not required) to be used for access control/security/privilages/capabilities (hence the name CAP).

Lack of safety and security

If the purpose of Boot is to serve as a substrate for and a subset of Ovm, and if Ovm is to be a safe language, then why don't we require bounds-checking at the Boot level? Why do we allow invalid pointer operations to have undefined effects? Because Boot will be used to execute the Ovm implementation, and for efficiency we will trust the Ovm implementation to be safe and correct. Oot will be a safe language; Boot is not.

Note that this means that an Ovm implementation running on a Boot platform will not be able to simply identify parts of an untrusted program that it is running that contain only instructions in the Boot subset of Ovm, and blindly execute them directly on the Boot platform; it must prove safety or insert bounds-checks.

Undefined values for arithmetic over 16-bits

Not having a maximum bitwidth allows Boot programs to directly use whatever data types are native to the platform that Boot is being implemented on. This is good for efficiency, and for making it easy to implement Boot on new platforms, and for interoperation with other code native to the platform.

We allow the implementation freedom to produce arbitrary output values when the outputs 'should' overflow 16-bits because we want to allow Boot implementations to be able to simply use the native arithmetic operations of the underlying platform without inserting checks for overflow around each operation.

Syscall-y instruction choice

The reason for (malloc, mdealloc, mfree, open, close, read, write), is that they are fundamental and common. (walk) because it is fundamental. (poll) because it is called a lot in tight loops and so you want them to be efficient, and (time rand) because both they are called in tight loops, and they are fundamental.

Why MUL?

If we are targetting 16-bits, why include MUL? MUL instructions are not universal.

MUL seems to be fairly common among higher-end embedded processors and even where it is not present, it is simple for a Boot implementor to write an unsigned integer multiplication routine.

Why are OTHER and UNDEFINED redundant?

Since we don't define the semantics of either OTHER or UNDEFINED, these would seem to be redundant. The intent is that OTHERs may be created by correctly-behaving libraries and implementation extensions, whereas UNDEFINEDs are only created by erroneous programs.

What's the point of SMALLSTACK?

SMALLSTACK's capacity limit may allow some advanced implementations to optimize it away.

SMALLSTACK is a useful abstraction for Ovm's 'custom instructions', which can call each other but only in a non-recursive, inlinable way. In Ovm, many custom instructions will use SMALLSTACK, not DATASTACK or CALLSTACK or registers, for temporary processing, in order to make it easy for custom instruction subroutines to access manipulate data placed in registers and on DATASTACK and CALLSTACK by the main program; conversely, the 'main program' won't touch SMALLSTACK except as a temporary place to load arguments and retrieve results.

Why so many stacks?

Many languages combine CALLSTACK and DATASTACK into one, but a few like to have two different ones (eg see https://users.ece.cmu.edu/~koopman/stack_computers/sec2_1.html#211 ). In fact, some languages use even more (eg see http://docstore.mik.ua/orelly/perl/prog3/ch18_03.htm ). So I thought it might be nice to offer a second stack, although it's debatable whether we should, because a language that wanted two stacks if we only provided one could use the 'user stacks' facility.

As above, SMALLSTACK is a useful abstraction for Ovm's 'custom instructions', and also, i thought it would be nice to have at least one stack with a small, fixed capacity, so that advanced implementations could in some cases optimize it away.

Why are the syscally-instructions in here? Those aren't assembly language-y. Why not make them 'system libraries'?

You can think of these as calling into a system library. We include these in order to allow us to write Oot compilers and interpreters in pure Boot.

While many VMs don't have instructions for file I/O, some do (e.g. MoarVM?).

Why stackload and stacksave?

In order to support things like coroutines swapping stacks when changing context, we need first-class stacks.

Why the bitwise arithmetic?

These are not strictly necessary but are extremely common across many platforms.

Why do SKIPs and JRELs reserve two operands?

We want to reserve two operands for the implementation, so that, in Ovm's MEDIUM encoding, the implementation can inline encode relative jump magnitudes of up to 65535; this allows each instruction to be expanded to up to 255 native instructions.

Why pickk and rollk?

While not strictly necessary, these operations have a good power-to-weight ratio, since they include common operations dup, over, swap, rot as special cases. In Ovm, we intend for our custom instructions to mainly use SMALLSTACK, not registers, for temporary storage, and these stack operations (along with the capability of LOAD and STORE to peek/poke deep in the stack without popping, and the TOS alias to the top of SMALLSTACK) make it a lot easier to do stuff without registers.

Why do pickk and rollk force k to be a static constant?

First, this allows them to substitute for a bunch of instructions like DUP and SWAP. Second, in higher levels of the Oot stack we're going to require statically known stack maps, so dynamic pick and roll won't be useful (outside of trusted components such as the Oot implementation itself).

What if the stack size can't fit in a UINT?

We assume that the stack's size can fit in a UINT; this assumption shows up in the SYSINFO calls that retrieve stack size, and in the stacksave format that has the stack size at the beginning. This is because the simplest architecture uses registers and stack locations that can fit a native pointer anyways, and a word that can hold a native pointer should always be able to hold a UINT indicating the size of addressable memory.

Why are malloc, mrealloc, mfree in the one-operand opcodes? We could have put them in the zero-operand opcodes and used that space for more computation instructions

There's not a great reason for this, I was just out of space. These might be moved back to the zero-operand opcodes in the future. That being said, the current situation, where they are in the one-operand opcodes, has the virtue that the location of the affected pointer is always in the instruction, rather than having that pointer be pushed onto SMALLSTACK. This might be slightly more efficient in cases of frequent mallocs, and might make it slightly less annoying for implementations to trace pointer ownership.

Why are sysinfo, log in the one-operand opcodes? We could have put them in the zero-operand opcodes and used that space for more computation instructions

Each these have at least one argument that advanced implementations might like to know statically. Since the sysinfo query is staticly known, when the result of a sysinfo query is a staticly known constant, the query can be replaced with the constant at compile-time. When the result of a sysinfo query is something dynamic (like the stacksize, which in some circumstances could be queried frequently in an inner loop), a compiler might inline the code to compute the result of the query. Since the loglevel is staticly known, a compiler can be instructed to strip LOG instructions with a low loglevel.

Why is 'time' an instruction instead of a magic file similar to 'rand'?

Since the current time might be queried in an inner loop, it was thought that it might be good to make this efficient. We don't want to pre-open the open file descriptor, as with 'rand', because there may be many clocks available and we don't know which one a program wants, and there are too many potential clocks for us to want to pre-open them all. We don't want to have the program OPEN a clock, because then different programs, and possibly even different runs of the same program, will have different file descriptors for the same clock, meaning we can't just hardcode a special-case check for a magic file descriptor number in READ (as we can with 'rand'), which means that reading a clock in this manner may be less efficient than having an instruction.

Why is 'rand' a magic file instead of an instruction?

To save opcode space. This can even be efficient; since 'rand' is given a fixed open file descriptor, it seems easy to hardcode a special case into READ.

Why doesn't the textual assembly language provide labels?

Since Boot is not designed to be written by humans, ease of implementation trumps ease of use.

Can you quickly summarize the justifications of each instruction/class of instruction?

Annotations are useful for tooling. Sysinfo is useful for querying static implementation configuration choices as well as stuff like stack size.

We need to load constants up to MAX_UINT. We need to express constant pointers to locations in code, in order to do (effectively) absolute jumps.

We need addition. Subtraction, less-than-or-equals, and multiplication aren't really necessary primitives but they are very common and useful.

We need some sort of conditional (SKIPZ). Skipnz isn't strictly necessary but is very common and useful. We want JREL to avoid the annoyance of having a zillion absolute jumps everywhere.

We need JD for fully dynamic control flow.

We need to be able to add integers to pointers in order to reach a field when we are given a struct. It's useful to be able to subtract integers from pointers because this would be needed for efficient stack PUSHing, if we didn't already provide that as an instruction.

PUSH, POP aren't strictly necessary but are very common and useful. PICKI and ROLLI have good power for their weight, and we expect to be doing lots of stack manipulation on SMALLSTACK in Ovm. We need STACKLOAD and STACKSAVE to do things like switch contexts in coroutines.

If we are ever using shared memory for IPC, including for synchronization, then we need some synchronization primitive like CAS. If we are allowing a relaxed memory ordering in shared memory by default, then we need something like FENCE-SEQ to change this when needed.

The bitwise operations aren't strictly necessary but they are very common and useful.

Memory allocation, I/O, IPC, time, and rand would usually be in a library but we want everything here.

It's nice to have a LOG (instead of just a STDERR file) so that we can have staticly known loglevels, and so that the implementation can insert a timestamp to each logentry. BREAK might be useful to some debuggers.

Halt is not strictly necessary (the end of the program code could be implicit HALT) but it's common and useful.

Why not give 'loadi' 12 immediate bits instead of only 8?

Because you can instead use 'loadi' followed by 'loadlo' to load more than 8 bits.

Why have devices instead of additional opcodes?

First, there are limited opcodes. It's efficient to save these opcodes for frequently used, fast operations, and to use something more cumbersome, like devices, for operations which are relatively slow (like I/O) or infrequent.

Second, the notion of a 'space' of devices located by device id provides conceptual economy. We can reuse the same device API for many instances of this object; for example, we have one channel control device API that we then use for every channel. At the same time, different types of devices can have different device APIs.

Third, all of these device APIs use the same few device opcodes, allowing code analysis tools to make some basic assumptions about which registers and memory addresses might be affected by a given device instruction even without understand the device type.

Why don't signals have attached data? Why don't you have a count of pending signals, instead of just a bit?

If signals had attached data, then the implementation might:

store the attached data somewhere
block the sender and store a record that says that that sender is waiting to deliver this signal
block the sender, not record that, and do expensive polling when the signal is caught to find the sender and get their data

The last option is complex and inefficient, and all the others require the implementation to maintain some sort of buffer that could fill up if too many signals come in faster than they can be caught. Another option could be to provide an attached data option but not guarantee delivery; for example, if another signal of the same type comes in before the first one is handled, the latter signal's data could overwrite the former's. This would make the attached data less useful, however.

All of these options add edge cases for programs to work around, and all of them add implementation complexity (and also increase the latency of sending and receiving a signal). And if threads or processes want to send messages to each other, there is already a mechanism provided for that: channels.

Therefore, i chose for the effect of a signal only to set a one-bit flag (and to call the signal handler when the flag is set and the signal is unmasked). Signals are not mailboxes; signals are like the little flag on the mailbox.

Similarly, a count of pending signals could overflow or saturate.

Why doesn't receipt of a signal spawn a new thread, instead of interrupting an existing thread?

The whole point of signals is to provide a mechanism to interrupt processing in a single thread. If you want a new thread to handle some event, you can spawn one explicitly.

11 GPRs is too few. Why aren't there more?

If we have 16-bit 3-operand instructions, there's not much space. 4 bits for the opcode is already pushing it; this only gives us 16 3-operand instructions (or, as we do it, 15 3-operand instructions plus a number of instructions with less operands). We could take away some bits some of the other operands and make one of the other operands bigger, but then that's irregular and possibly annoying to write compilers for -- and that would privilege the lowest registers anyways because that's all that would fit in some of the operands.

So given that we have 16 registers, we spend 5 of them on special registers. So 11 are left

Why do you waste 5 registers on special registers? We could use more GPRs

The 5 special registers, R0 through R4, are:

0. 0 (constant 0 register) (also used to indicate SMALLSTACK to stack-aware instructions) 1. 1 (constant 1 register) (also used to indicate DATASTACK to stack-aware instructions) 2. ERR (also used to indicate CALLSTACK to stack-aware instructions) 3. PC 4. T (alias to the item on the top of SMALLSTACK)

None of these are need-to-haves but they are all very-nice-to-haves.

Having registers for SMALLSTACK, DATASTACK, and CALLSTACK allows the implementation to implement these stacks at a lower level. That may make accessing these stacks more efficient, and it allows the implementation to do something complicated behind the scenes of PUSH and POP, for example, dealing with segmented stacks, or in the case of SMALLSTACK dealing with register T. As a bonus, with non-stack-aware instructions these registers are not totally wasted, because they also serve as constants 0 and 1 and register ERR.

The constant zero register adds efficiency because we don't have an immediate addressing mode (or any addressing modes), so without this we'd have to have a bunch of 'LOADI 0's all over the place. Similarly for constant 1.

Register ERR replaces other architectures' carry bit, high-order multiply result register, and ERRNO function. This stuff would have to be stored somewhere anyhow, and making it an accessible register removes the need for special instructions to retrieve hidden carry/status/lasterror state.

Register PC allows somewhat efficient code to access memory addresses relative to the PC. In OVM, which will have addressing modes, and might have an addressing mode that adds together the contents of two registers and uses the result as an effective address, this might be even more efficient.

You might say that register T is not even a special register, but rather a GPR; it can be used just as the other GPRs, the only thing 'special' about it is that PUSH and POP treat it specially (you might think of SMALLSTACK as actually being a stack with 0 to 15 items, but PUSH pushes T onto that stack and then CPYs its target register into T, whereas POP copies T into its target register and then pops the stack into T). There are various other instructions that also treat T specially, namely by implicitly taking inputs from or placing outputs into T. Register T exists because stack machines often have a cache of the TOS, for efficiency. This is less useful in Boot but in OVM we'll probably add a stack addressing mode, in which case it will be more useful. Also, our primary intended use-case for SMALLSTACK is to enable OVM (Oot Virtual Machine, which will target Boot/BootX?) to have 'custom instructions' whose implementations are required to preserve the other registers and stacks; since computation requires at least a few temporary storage locations, SMALLSTACK serves that purpose. Since our instructions get and set registers, rather than pushing and popping their results from the stack, we need at least one register for the use of these custom instructions; T serves that purpose.

Why not have a TOS cache for DATASTACK and CALLSTACK too, and multiplex the stacks with these registers, instead of with constant registers and the ERR register?

Because you might want to LOAD and STORE to the addresses on the TOS of various stacks.

TODO amend this:

Why is there no way to coerce integers to pointers?

Pointers are supposed to be opaque. Programs aren't supposed to just create pointers; they are supposed to derive them by getting a pointer from elsewhere (for example, from MALLOC) and then (possibly) adding or subtracting uint indices. This makes it slightly harder to get pointer arithmetic bugs (although you still can; you can add or subtract too much, giving you a result outside of the allocated space for the object that you thought you were pointing into).

There is another reason. In traditional assembly language, memory is one large contiguous address space, and a pointer is equivalent to some integer offset from memory location zero. In this paradigm, any memory location can be specified by giving an offset, and any memory location can be reached by starting from any other memory location and then adding or subtracting some offset. But in Boot, we want to allow for other possibilities. We want to allow the Boot implementation to implement pointers as 'fat pointers' or 'descriptors' if it chooses, for example maybe each 'pointer' is actually a native pointer along with two more native pointers, giving a lower and an upper bound for the memory region being pointed into, to allow an implementation to do runtime bounds-checking. Or, perhaps the form of memory is something other than one large contiguous space; perhaps memory is implemented as many small pools of memory on various remote devices, and a 'pointer' consists of a device ID (or even a routing path) and then an offset.

Why is there no 'pointer literals', no way to construct immediate/constant pointers (not even the null pointer)?

First, pointers are supposed to be opaque; since we don't want to pin down the representation of a pointer, we can't specify pointer literals.

Second, on most systems, if you could construct memory location zero, you could then get a pointer to any memory location by adding some offset. But that's effectively like coercing integers to pointers, which we don't want to allow. (you could also allow some pointer literals, for example the null pointer, but then forbid addition and subtraction of offsets to these; but then we really have two types of pointers, one type that permits addition and subtraction and one that doesn't; which would be more complicated).

Why do you allow pointer subtraction?

First, this allows us to serialize complex data structures containing pointers (assuming all of the pointers are 'derived' from a common ancestor) by converting the pointers into integer offsets from each other.

Second, this allows you to check how much memory is left until you run out of room when you have, e.g., a stack or array pointer, and a base pointer to the memory region the stack is in, and knowledge of how big that region is.

Third, this allows you to add an index to a base pointer, store both the base pointer and the new pointer but throw away the index, and then later recover the index.

Why do you require derived pointers for LEQ-PTR and SUB-PTR?

As noted above, pointers are supposed to be opaque, both to help prevent pointer errors, and also to allow the topology of memory to be something other than a single large array. A pointer and another pointer derived from it CAN be thought of as existing at integer offsets within a single larger array, but this is not generally true of any two pointers.

Why do you require that, even if pointers aren't derived, LEQ-PTR can't say both P1 <= P2 and also P2 <= P1?

Because, if the pointers were derived, then this would imply that they are equal. That could lead to confusing bugs.

Why do you require that, even if pointers aren't derived, SUB-PTR can't say both P1 < P2 and also P2 < P1?

Because in this case < seems similar to <=, and if it were <=, and if the pointers were derived, then this would imply that they are equal. So this could also lead to confusing bugs.

Why is SUB-PTR a two-operand instruction instead of three-operand?

Just because we ran out of 3-operand opcodes, and this seems like it may be a less common operation than the other 3-operand arithmetic operations.

Why do you have BLEQ instructions instead of just LEQ?

The RISC-V people complain about the lack of compare-and-branch instructions in some other ISAs eg in Chapter 2 of [4].

Why is the opcode in the least-significant bits instead of the most-significant?

I didn't have a preference one way or the other, and that's the way RISC-V does it, so i figure maybe they know stuff i don't.

Why not use WASM, RISC-V, OpenRISC, ARM, x86, MIPS, SPARC, etc?

Even the simplest of these is harder to implement than i wanted. We want this to be really really easy to implement, so that it is easy to port to many platforms (because we want to port it to all of: hardware ISAs (like ARM and x86), common VMs (like JVM and CLR), and high-level languages (like Python and Haskell)).

Also, for easy implementation on top of a variety of other platforms (including high-level languages such as scripting languages), i wanted our ISA to be somewhat agnostic as to word size (the bit-width of integers and pointers), so that the implementation can just use whatever native number type is provided).

I doubt Boot will be as efficient as more serious efforts like WASM and RISC-V, but this is not a critical flaw, because implementations of Oot concerned with performance will eventually skip Boot and implement OVM or even Oot in native code on the target platform.

Why can the implementation choose to skip all-1 instructions (ANNOTATE 15 15 15) instead of crashing?

Because all other ANNOTATE instrucitions are skipped, and in some implementations it may be significantly faster to just skip over all ANNOTATE instructions without looking at their operands.

Since programs don't know the value of MAX_UINT beforehand, do they have to have extra code to deal with various possible sizes?

No. A valid Boot program is allowed to assume that MAX_UINT is a certain number and work correctly only on platforms that provide that size of UINT. For example, you could write a Boot program that only works correctly on 32-bit platforms. Of course, if you choose to, you are also able to write Boot programs that can handle multiple UINT sizes.

Why is there a cast-uint-ptr instruction in BootX? Doesn't that make it impossible to be safe?

This is an optional instruction and safe implementations should not implement it. This is included so that, for unsafe implementations, there is a platform-independent way to do this. This allows BootX? to be a target for C compilers.

What are some disadvantages of this design?

We are optimizing for ease of implementation over code density. If we had more instruction formats we could probably get denser code, but that would be harder to implement.
We can't handle absolute addresses as efficiently as many other ISAs. Many ISAs have many instructions which can take absolute addresses in their operands. For the most part, in Boot and BootX?, absolute addresses must be first loaded using 2 separate instructions (LOADI and then CVT-INT-PTR). In BootX?, you could provide immediate constants to instructions that take addresses like LD and ST, but these are still only 4-bit immediates.

Why can't you branch to offsets of -2, -1, or 0?

The motivation for excluding -2 thru 0 is that it is not useful; when the branch is called the PC has already been moved to point to the next instruction, which is 2 bytes after the branch instruction; so -2 would just branch back to the branch instruction, causing an infinite loop; -1 would branch into the middle of the 2-byte branch instruction, which is illegal; and 0 would mean continuing onwards without a branch, in which case the branch instruction would just be functioning as a NOP.

Why can't the offset on LD, ST take negative values?

I decided to use all of the 8 values for non-negative values so that we can reach deeper into stacks. Positive values are probably more useful than negative for other things, too (such as indexing into arrays or addressing items within structs).

Why are there stack-aware PUSH and POPs that PUSH and POP SMALLSTACK? We can already do PUSHes and POPs of SMALLSTACK by using CPY with register 2 operands, so this is redundant.

It's true that 'CPY 2 x' and 'PUSH 2 x' do the same things, and that 'CPY x 2' and 'POP x 2' do the same things.

This is because some parts of the toolchain might find it useful to be able to explicitly represent PUSHes and POPs of the SMALLSTACK.

In addition, if there was no special case, PUSH and POP would be kind of useless when given register 2 operands anyways, since they would PUSH or POP something to some user stack, but they would consume the stack pointer while doing so.

Is Boot a RISC instruction set?

For the most part, yes.

It meets the criterion of being a 'load/store architecture', that is, only a few special instruction access main memory. SMALLSTACK can be considered an additional set of 16 registers (that is, not part of main memory), in which case no instructions access main memory, except:

LD, ST, PUSH, POP, CAS
the "syscall"-ish instructions (open, close, read, write, malloc, etc).

It meets the criterion that the encoding format is regular and fixed-width.

It meets the criterion that the number of instructions is not terribly many, and that the amount of work done by each instruction (aside from the syscall-ish ones) is small.

Is BootX a RISC instruction set?

Probably not. Unlike Boot, BootX? has addressing modes, which allow multiple loads and stores to be embedded within almost every instruction.

It is also a variable-length instruction set (although still a simple one), with 8-bit, 16-bit, and 32-bit instructions.

Why don't you add some restrictions that would make Boot easier to compile? For example, require CVT-INT-CODEPTR to only be applied to newly-loaded immediate values, and require the layout of SMALLSTACK at any single point in the program to be fixed and staticly knowable

This would add complexity to the specification. More importantly, it would prevent certain kinds of programs from being written in Boot, for example a REPL (in which the argument of CVT-INT-CODEPTR, and the layout of SMALLSTACK, may vary based on user input).

Why don't you specify a fixed signature for LIBRARY? This would aid program analysis.

Not specifying a fixed signature for LIBRARY means that program analysis tools must be provided with the platform-specific signatures for any LIBRARY invocations.

This is not fixed because it is consistent with the spirit of omitting other restrictions that would make program analysis easier (see previous question).

Why does CVT-INT-CODEPTR exist in Boot? It makes program analysis harder.

This instruction exists so that programs have a way to do jumps farther than the 2KiB? (in either direction) limit of JREL.

In Boot, getting the PC with PUSHPC and then doing arithmetic on it would not be good enough, because pointer arithmetic is in units of abstract 'slots' which are the size of native pointers/ints, but program jumps need 16-bit granularity, and slots may be larger than 16-bits (in fact, we give jumps byte granularity, for upwards compatibility with BootX?, where we will have some 8-bit instrutions).

Why can't CVT-INT-CODEPTR also work for regular PTRs?

Because some implementations may have a different representation for CODEPTRs and regular PTRs.

Why isn't there a CVT-INT-PTR for regular PTRs?

There is no need for this unless you need to access something in memory outside of the Boot program (and outside of its allocated memory) which lives at a fixed address. In which case, this must be a platform-specific thing, so we leave it up to implementations to provide a facility to do this.

Why do you have a smallstack?

Stacks allow for higher code density, compared to adding an additional number of registers equal to the stack depth. This is because with registers, instruction encodings must specify which register is being used, but you don't have to specify which stack location you are using because it is assumed to be the top of the stack. It also allows for slightly easier program analysis in some cases (specifically, escape analysis), because items are popped off of the stack when they are used for the last time, instead of continuing to sit in a register after their last use.

Note that, in some cases, when the program uses the smallstack in a predictable manner, the compiler is advanced, and many registers are available on the underlying platform, compilers may be able to 'compile away' the smallstack and translate smallstack operations directly into register operations.

Why do you have two stacks? Why not just have the operand stack within each stack frame of the call stack like, eg, the JVM?

Having two stacks allows higher-level languages to more easily provide forms of function calling with less overhead, which encourages code to be factored at a finer grain. For example, let's say you have a place in your function where you sort the order of the top 5 items on the operand stack. Should you factor this out into a different function called SORT5? If the operand stack is within the stack frames of the call stack, if you call another function, you would have to push the return address onto the call stack, and then the newly-factored SORT5 function would have to be changed to reverse the 2nd, 3rd, 4th, 5th, and 6th items on the call stack, instead of the 1st, 2nd, 3rd, 4th, 5th items. So far so good, but what if the implementation of SORT5 itself calls other functions, which expect to access the 1st, 2nd, 3rd, 4th, 5th items on the stack? Now you have to make new versions of those functions, too. Or, you could just add a function prolog and function epilog to SORT5 which moves the return address up 5 spots in the call stack, exposing the 5 items to be sorted. But any of these alternatives have an efficiency cost, which discourages programmers from factoring code into subfunctions willy-nilly.

But if you have two stacks, then you can push the return address onto the call stack, and then SORT5 can immediately operate on the operand stack. There is no need to adjust the code of SORT5 or its dependencies or to add a function prolog or epilog. The only overhead is the push of the return address, the jump to the SORT5 routine, and the pop of the return address and the jump back to the caller -- this is still substantial but not as substantial.

To give another example of the potential efficiencies of two separate stacks, let's say that function f() is called with 8 arguments, and f() calls function g() and passes those 8 arguments along, and g() then calls h() and passes those 8 arguments along (and assume that none of these are tail calls, and also that neither f() nor g() need those 8 arguments after they pass them on). The calls from f() to g() and from g() to h() will typically be implemented by pushing a return address to the call stack, and then making copies of all 8 arguments and pushing them onto the call stack. If we had 2 stacks, however, we could just leave the arguments on the operand stack and push only the return addresses onto the call stack, saving us the trouble of copying those 8 arguments over and over. Similar savings are sometimes available in function returns.

A more general way of looking at this is the concept of 'concatenative programming languages' (for example, Forth is a concatenative programming language), in which the juxtaposition of expressions denotes function composition. If the effect of a function call on program state is confined to pushing a return address to a call stack and jumping, then any subset of a function can be moved into its own function without alteration, and still perform the same operation on program state; that is to say, code which was previously juxtaposed (eg two sequences of instructions which were next to each other) can be refactored into function composition without alteration.

Won't it be hard for implementations to allocate memory for a smallstack?

No, because the smallstack is of a fixed size (16). This means that the additional memory requirement for it is no worse than the memory requirement to store the 16 registers.

Which instructions in Boot do you think are the least important / which ones would you take out first?

SAR (or maybe SLR, since we have signed integers), and ROT (actually i did take out ROT!).

Why don't you have some instructions (for example, 'LOADI, Load Immediate constant') have an additional byte or bytes of data that comes right after it in the instruction stream?

I have this idea that some analysis tools could run more efficiently if they can represent the program as an array of instructions of some fixed size (for example, an array of 16-bit instructions, for Boot). If some instructions include 'data' after the main instruction, then either the analysis tool has to represent that instruction separately from the data (perhaps referenced by a pointer), or it has to increase the size of all of the items in the array just so that it can occasionally fit in extra data, or it has to reference the data in the instruction stream itself (which is itself made slightly more difficult by having the data after the main instruction).

Probably not that important. Maybe i should give in and add data.

Were there any design considerations in the ordering of opcodes?

Yes, they were chosen so that at runtime, an non-type-checking interpreter dispatcher could decide what pre-processing of input arguments needs to be done (such as applying stack-awareness or, in the case of BootX?, addressing modes) by checking for which range the subopcode is in, without storing a full table of instruction signatures.

Within each block of instructions sharing the same number of operands (and the same addressing mode, for BootX?), opcodes were chosen with the following considerings in mind, in descending order of priority:

ANNOTATE has opcode 0 in the 3-operand instructions
- the all-zero instruction bit pattern is illegal, so the 3-operand instruction with opcode 0 must be something that you would never want to do with all-zero operands anyways
other than ANNOTATE, the INSTR-X pseudo-instructions are first (highest opcodes)
implementation-dependent instructions come next
RESERVED instructions (and instructions which violate the following rules because they used to be RESERVED) come next
instructions with more constant-type operands (the 'c' type) have lower opcodes than instructions with less constant-type operands. This is mostly for consistency, because we want to apply the same design criterion to BootX?, where we have addressing modes, so that the decoder can say "if the opcode is above this number (and not one of the 'INSTR-X' pseudo-instructions indicating that the real instruction has less operands/uses one operand for a subopcode), then i should parse the addressing mode bits for all of the other operands". But it also has some theoretical benefits for Boot: if the decoder has any machinery that starts loading stuff from memory before the instruction is fully decoded, it can just look at the opcode to see which operands will definitely not need to be loaded from memory. Also, analysis tools can quickly compute conservatively depedencies this way.
instructions with less inputs operands should be above (lower opcode than) instructions with more inputs (a stack output (push) does not count as an input for the purpose of this rule)
instructions with more stack-aware operands should be above (lower opcode than) instructions with less stack-aware operands
control flow instructions have lower opcodes than non-control flow instructions, again so that analysis tools looking for control flow instructions can quickly rule out most non-control flow instructions.
instructions with more operand outputs have lower opcode than instructions with less operand outputs
similar instructions are kept next to each other

(todo: in BootX?, explore moving these different classes of instructions into different opcode addressing mode blocks)

Were there any design considerations in the ordering of operands within instructions?

TODO enforce this

Yes, in descending order of priority:

Constant operands precede everything else
Non-input operands precede input operands (stack aware output operands e.g. in PUSH do not count as an input operand)
Stack-aware input operands precede non-stack-aware input operands
Output operands precede non-output operands

Why is ANNOTATE opcode zero? We want the runtime to check for the all-zero bitpattern (and terminate with an error if it is found), but we also want it to be able to completely ignore ANNOTATE instructions.

It was decided to combine these two special cases into one opcode, as it seems conceptually nicer to group special cases together; for any given instruction, once it has been checked that the opcode is >0, neither special case needs to be thought about anymore.

Implementations are encouraged but not required to check for the all-zero bitpattern, so in situations in which it is much more efficient to completely ignore ANNOTATE instructions, an implementation may also ignore ANNOTATE 0 0 0.

Note that the 32-bit BootX? encoding permits ANNOTATE 0 0 0, because the format bits of the 32-bit encoding includes a 1, so there is no 32-bit pattern of 0s in that instruction.

Why are the 32-bit BootX INSTR-ZERO implementation-dependent?

For the sake of code density BootX? supports a mixed-width instruction set with 32-bit BootX? instructions, 16-bit Boot instructions, and 8-bit BootX? instructions.

However, the primary goal for Boot and BootX? is easy portability. On some platforms, a fixed-width instruction set may be more efficient or easier to implement. Therefore, BootX? also supports a fixed-width instruction set with only 32-bit BootX? instructions. Care has been taken so that the 32-bit BootX? instruction set can easily express any 16-bit Boot instruction as a special case. This requires that the 32-bit BootX? instruction set has room to express the 16 zero-operand INSTR-ZERO instructions from Boot.

What is relationship between Boot and BootX?

The motivation for defining both Boot and BootX? is that Boot is simpler to implement but BootX? is probably more performant.

BootX? can be compiled to Boot, with the exception of some of the optional instructions in the SYSCALL2 subset, which expose platform functionality not required by Boot (eg 'time').

Boot is a smaller, simpler ISA that is easier to implement. It is a RISC fixed-width load/store architecture. Any Boot program can be transformed into a subset of Boot in which implicit SMALLSTACK PUSHes and POPs are factored out into explicit PUSH and POP instructions.

BootX? is a larger, slightly more complex ISA that may permit better code density (especially when used in concert with Boot as a 32/16/8-bit mixed-width encoding) and performance. It is less RISC and more CISC-y (although not too much). It contains additional instructions for operations that could be expressed in software in Boot, but which can be expressed more concisely in Boot (e.g. JMP), or that are likely to be exposed as primitives on many underlying platforms (eg integer division, floating point arithmetic). BootX? implementations can support only a fixed-width instruction stream of 32-bit BootX? instructions, or they can support a mixed-width instruction stream of 32-bit BootX? instructions, 16-bit Boot instructions, and 8-bit BootX? instructions.

Why does Boot have undefined behavior?

For those illegal operations for which naive execution would cause undefined behavior on popular targets, we permit undefined behavior.

When an illegal operation is encountered, Boot implementations are permitted to forgoe raising an error, and may attempt to execute them regardless, provided that the potential result of executing them regardless is no worse than allowed for that sort of illegal operation. The motivation for this ambiguity is to allow implementations to dispense with error checking.

Why does Boot provide an instruction for adding a number of bytes to a codeptr?

Consider the following sequence of instructions:

1. Push return_address to stack 2. Push argument1 to stack 3. Push argument2 to stack 4. Jump to subroutine 5. (after return of subroutine) do other stuff

When the subroutine returns to return_address, you probably want it to return to the next instruction after the jump, #5. But if there is no way to do pointer arithmetic on CODEPTRs, but rather only to fetch the PC, then if you tried to implement step #1 with PUSHPC, you would get a codeptr pointing to the instruction at step #2, and there would be no way to convert it into a codeptr pointing to the instruction at #5.

Ordinary add-ptr-int can't be used here, because that is in units of 'slots' (the size of a native pointer), but instructions in code must be addressed in units of bytes. So we need a new instruction, so that we can do:

1. PUSHPC to stack 2. add-codeptr-int (the item you just pushed onto the stack) + (10 bytes) 3. Push argument1 to stack 4. Push argument2 to stack 5. Jump to subroutine 6. (after return of subroutine) do other stuff

Why do alignment requirements and branch and jump target units vary by instruction length?

The spec says that "N-bit instructions must be N-bit aligned. Branch and Jump targets are in units of N-bits. So, for example, 8-bit instructions must be 8-bit aligned, 16-bit instructions must be 16-bit aligned, 16-bit instruction branch and jump targets count in units of 16-bits, 32-bit instruction branch and jump targets count in units of 32-bits.".

The motivation for this is to help implementations which recompile everything to one instruction length, and then only support execution of instructions of that instruction length. This ensures that if e.g. only 32-bit instruction are used, that (a) 32-bit alignment is required, which simplifies fast hardware implementations that speculatively decode future instructions in parallel, and (b) 32-bit branch and jump targets count in units of 32-bits, which makes sense because if 32-bit alignment is required then these are the only possible targets anyways; therefore any other choice would waste instruction encoding space/unnecessarily reduce the effective max jump/branch length expressible by instructions.

As a bonus, these requirements help debugging, because now any region of memory can be decoded into instructions without first knowing the initial offset at which execution is to begin.

Given the alignment restrictions, why not treat each 16-bit chunk of 8-bit instructions as a packet, and treat each 32-bit chunk of 16-bit instructions as a packet, and in the second instruction in each packet, use the redundant instruction width bit for something else?

This would be more efficient but would also be more complicated, so i decided not to do it.

As a bonus, this form of redundancy allows the instruction stream to be somewhat self-synchronizing, that is, if the instruction stream is being read as a bytestream from media that doesn't specify alignment, alignment can be guessed by seeing where the 16-bit and 32-bit instruction boundaries tend to be.

Wouldn't it permit faster parallel speculative decoding hardware if there were only one instruction length?

Yes, but one of our goals is to support massively-concurrent, "brain-like" computing. The brain appears to have massive concurrency at the cost of relatively slow serial execution speed of each CPU.

In order to support more cores (here, i am using the word 'core' to refer to the unit which is being repeated for massive concurrency; a CPU paired with some local memory and networking capability) in a given die area or thermal envelope, at the cost of slower serial execution speed, one might not be doing fancy stuff like parallel speculative decoding anyways.

Multiple instruction lengths permit higher instruction density, and this allows one to give less local memory to each core, which would allow more cores to fit in a given die area or thermal envelope.

Wouldn't it be simpler if there were one instruction length?

Yes, and this is why Boot supports only one instruction length, and only the more complex BootX? supports multiple instruction lengths.

Why is Boot 16-bit focused instead of 32-bit focused?

It is true that contemporary processors and programming toolchains focus on 8-bits (e.g. byte-oriented addressing) and 32-bits (e.g. the default size of ints in many toolchains; the bitwidth of the most popular >8-bit MCU architecture (ARM)). So why do we focus on 16-bits?

Because 16-bits is the smallest power-of-two that addresses an amount of memory sufficient to hold useful/interesting general-purpose programs, for example, practical (if very small) compilers and interpreters.

Why do we care about choosing the SMALLEST power-of-two? Because, in order to support massively-parallel, "brain-like" computing, we want to have very small, simple cores in exchange for more parallelism. 16-bit CPU hardware will be smaller, and we can't fit much local memory into each core anyways.

Well in that case why include support for addressing larger amounts of memory, and for 32-bit instructions, at all? Because another goal of Boot is to be a portable and reasonably efficient target language on contemporary machines.

things to mb reconsider someday

does MUL deserve to be in the 3-operand instructions?
mb we do want loadd and stored in the three-operand instructions: b/c the reason for leaving them out is that you could just do pointer arithmetic to get the same thing in two instructions, but can you do pointer arithmetic on stack "pointers"? If not, then this doesn't work.
right now, any register, stack item, or memory location can contain either a uint or a pointer. Since most Boot programs will never use uints larger than 16 bits, this is a big waste. In addition, on some architectures we may have different registers for integers, floats, and pointers. Perhaps we should say that only some registers and stack items, or even only some memory locations, need to accomodate pointers -- or is that more complicated than its worth? We could also just go 64-bit.
do we really need/want LEQ? SUB also provides us with a boolean telling us whether two things are LEQ, placed in ERR

todo

make instructions that implicitly use the stack should use SMALLSTACK instead of DATASTACK -- b/c since SMALLSTACK is limited-size, SMALLSTACK ops probably be more easily compiled away.
see also those other lists of popular linux syscalls that i made, and my ideas regarding instructions for epoll, completion ports, etc
figure out fork/forkdone -- i'm not clear on if 'fork' is easily supported by all platforms; also 'forkdone' seems like something we don't need a separate opcode for, something that we could do via channels; also there's just a general fuzziness in my head around the exact way these two would work
do we really need/want walk? if it's just 'subopen', can't we just combine it with 'open'? Should we combine 'walk' into 'open' and then have 'list/query' instead of 'walk'? In HTTP and WebDav? it seems like 'list' is a fundamental.
console I/O conventions (eg how to OPEN stdin and stdout)
We should have a list of special pre-opened Open File Descriptors, like stdin, stdout, stderr (in which case mb get rid of LOG? actually i think LOG is better than stderr, timestamps and loglevels are good), rand, and possibly one or more clocks

"The point of native precision is to have a portable way to express computations, which may involve distances between pointers, and we dont know the pointer size so we cant know the integer size. but once we move outside of that, it's probably better to know how much precision you're dealing with."

new stuff 190526

3 of the registers are duplicates, so the state of the 16 registers can be stored in only 13 real registers
48 additional 'storage registers' that can only be used for storage

proj-oot-bootDesignOld200622