proj-oot-ootBootNotes2

Representation of zero: 0 00 000

Subnormal numbers, scaled to integers. The significand is extended with "0.":

0 00 001 = 0.001_2 * 2^x = 0.125 * 2^x = 1 (least subnormal number)
...
0 00 111 = 0.111_2 * 2^x = 0.875 * 2^x = 7 (greatest subnormal number)

Normalized numbers, scaled to integers. The significand is extended with "1.":

0 01 000 = 1.000_2 * 2^x = 1 * 2^x = 8 (least normalized number)
0 01 001 = 1.001_2 * 2^x = 1.125 * 2^x = 9
...
0 01 111 = 1.111_2 * 2^x = 1.875 * 2^x = 15
0 10 000 = 1.000_2 * 2^(x+1) = 1.000 * 2^(x+1) = 16
0 10 001 = 1.001_2 * 2^(x+1) = 1.125 * 2^(x+1) = 18
...
0 10 111 = 1.111_2 * 2^(x+1) = 1.875 * 2^(x+1) = 30
0 11 000 = +infinity
x 11 yyy (y != 0) = NaN

then the Wikipedia page concludes that the exponent bias in their example should be -2 for some reason (maybe -(x-1)?), and then i think they are saying that e.g. 0 00 001 represents 0.001_2 * 2^0 = 0.125 * 1 = 0.125. https://en.wikipedia.org/wiki/Exponent_bias implicitly suggests that the exponent bias should be 2^(exponent_bit_width-1) - 1, which for the example on the Minifloat Wikipedia page yields 7 (not -2 as they say; although the Exponent_bias wikipedia page says that the positive 'bias' is subtracted from the encoded exponent; note also that the Minifloat wikipedia page gives a table of 'All values as decimals' in which only the first row yields fractions, which seems odd), and in our case yields 1. But according to http://weitz.de/ieee/, even though the bias for 32-bit floats is 127 according to the Exponent_bias wikipedia page, the effective exponent for subnormals is -126, and the lowest effective exponent for normals is also -126. So i guess you add one to -bias to get the lowest exponent (so our lowest factor is 2^(-bias + 1) = 2^(-1 + 1) = 2^0). So following this pattern, with our 2-bit exponents (so bias = 1 by the reasoning above) we have:
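
To make that bias convention concrete, here's a little sketch (Python, just for checking these tables; not part of any spec) that decodes a minifloat given its field widths, using bias = 2^(exponent_bit_width-1) - 1 and using 1 - bias as the exponent for both subnormals and the smallest normals:

    def decode_minifloat(sign, exp, sig, ebits, mbits):
        # bias per https://en.wikipedia.org/wiki/Exponent_bias: 2^(ebits-1) - 1
        bias = (1 << (ebits - 1)) - 1
        if exp == (1 << ebits) - 1:                 # all-ones exponent: infinity or NaN
            value = float('inf') if sig == 0 else float('nan')
        elif exp == 0:                              # zero/subnormal: significand "0.xxx"
            value = (sig / (1 << mbits)) * 2.0 ** (1 - bias)
        else:                                       # normal: significand "1.xxx"
            value = (1 + sig / (1 << mbits)) * 2.0 ** (exp - bias)
        return -value if sign else value

    # e.g. 0 00 001 in the 2-exponent-bit, 3-significand-bit format:
    assert decode_minifloat(0, 0b00, 0b001, ebits=2, mbits=3) == 0.125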

Representation of zero: 0 00 000

Subnormal numbers. The significand is extended with "0.":

0 00 001 = 0.001_2 * 2^0 = 0.125 (least subnormal number)
...
0 00 111 = 0.111_2 * 2^0 = 0.875 (greatest subnormal number)

Normalized numbers. The significand is extended with "1.":

0 01 000 = 1.000_2 * 2^0 = 1 * 1 = 1 (least normalized number)
0 01 001 = 1.001_2 * 2^0 = 1.125 * 1 = 1.125
...
0 01 111 = 1.111_2 * 2^0 = 1.875 * 1 = 1.875
0 10 000 = 1.000_2 * 2^1 = 1.000 * 2 = 2
0 10 001 = 1.001_2 * 2^1 = 1.125 * 2 = 2.25
...
0 10 111 = 1.111_2 * 2^1 = 1.875 * 2 = 3.75
0 11 000 = +infinity
x 11 yyy (y != 0) = NaN

I gotta say, i might prefer more exponent bits than significand bits if the range is gonna be that small, especially for immediate operand literals. What if we had 3 exponent bits and 2 significand bits; i think we'd end up with exponent bias 3, so smallest exponent -2, so:

Representation of zero: 0 000 00

Subnormal numbers. The significand is extended with "0.":

0 000 01 = 0.01_2 * 2^-2 = 0.25/4 = 0.0625 (least subnormal number)
0 000 10 = 0.10_2 * 2^-2 = 0.50/4 = 0.125
0 000 11 = 0.11_2 * 2^-2 = 0.75/4 = 0.1875 (greatest subnormal number)

Normalized numbers. The significand is extended with "1.":

0 001 00 = 1.00_2 * 2^-2 = 1/4 = .25 (least normalized number)
0 001 01 = 1.01_2 * 2^-2 = 1.25/4 = .3125
0 001 10 = 1.10_2 * 2^-2 = 1.50/4 = .375
0 001 11 = 1.11_2 * 2^-2 = 1.75/4 = .4375
0 010 00 = 1.00_2 * 2^-1 = 1/2 = .5
0 010 01 = 1.01_2 * 2^-1 = 1.25/2 = .625
0 010 10 = 1.10_2 * 2^-1 = 1.50/2 = .75
0 010 11 = 1.11_2 * 2^-1 = 1.75/2 = .875
0 011 00 = 1.00_2 * 2^0 = 1*1 = 1
0 011 01 = 1.01_2 * 2^0 = 1.25*1 = 1.25
0 011 10 = 1.10_2 * 2^0 = 1.50*1 = 1.5
0 011 11 = 1.11_2 * 2^0 = 1.75*1 = 1.75
0 100 00 = 1.00_2 * 2^1 = 1*2 = 2
0 100 01 = 1.01_2 * 2^1 = 1.25*2 = 2.5
0 100 10 = 1.10_2 * 2^1 = 1.50*2 = 3
0 100 11 = 1.11_2 * 2^1 = 1.75*2 = 3.5
0 101 00 = 1.00_2 * 2^2 = 1*4 = 4
0 101 01 = 1.01_2 * 2^2 = 1.25*4 = 5
0 101 10 = 1.10_2 * 2^2 = 1.50*4 = 6
0 101 11 = 1.11_2 * 2^2 = 1.75*4 = 7
0 110 00 = 1.00_2 * 2^3 = 1*8 = 8
0 110 01 = 1.01_2 * 2^3 = 1.25*8 = 10
0 110 10 = 1.10_2 * 2^3 = 1.50*8 = 12
0 110 11 = 1.11_2 * 2^3 = 1.75*8 = 14
0 111 00 = +infinity
x 111 yy (y != 0) = NaN

that seems more useful for some reason. I guess what you lose out on with one less bit in the significand is the ability to say things like 1.125 and 1.875, and also 2.25 and 3.75; so you get fewer gradations above 1, in exchange for more gradations below 1 and a bigger range.
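
As a sanity check, enumerating all the non-negative bit patterns of this 3-exponent-bit / 2-significand-bit layout with the decoder sketched earlier reproduces the table above (0, 0.0625, ..., 14, then inf and the NaN patterns):

    for exp in range(8):                      # 3 exponent bits
        for sig in range(4):                  # 2 significand bits
            value = decode_minifloat(0, exp, sig, ebits=3, mbits=2)
            print(f"0 {exp:03b} {sig:02b} = {value}")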

What if we had 4 exponent bits and just 1 significand bit?

Representation of zero: 0 0000 0

Subnormal numbers. The significand is extended with "0.":

0 0000 1 = 0.1_2 * 2^-6 = 0.5/64 = 0.0078125 (least (and only, in this case) subnormal number)

Normalized numbers. The significand is extended with "1.":

0 0001 0 = 1.0_2 * 2^-6 = 1/64 = 0.015625 (least normalized number)
0 0001 1 = 1.1_2 * 2^-6 = 1.5/64 = 0.0234375
0 0010 0 = 1.0_2 * 2^-5 = 1/32 = 0.03125
0 0010 1 = 1.1_2 * 2^-5 = 1.5/32 = 0.046875
0 0011 0 = 1.0_2 * 2^-4 = 1/16 = 0.0625
0 0011 1 = 1.1_2 * 2^-4 = 1.5/16 = 0.09375
0 0100 0 = 1.0_2 * 2^-3 = 1/8 = .125
0 0100 1 = 1.1_2 * 2^-3 = 1.5/8 = .1875
0 0101 0 = 1.0_2 * 2^-2 = 1/4 = .25
0 0101 1 = 1.1_2 * 2^-2 = 1.5/4 = .375
0 0110 0 = 1.0_2 * 2^-1 = 1/2 = 0.5
0 0110 1 = 1.1_2 * 2^-1 = 1.5/2 = .75
0 0111 0 = 1.0_2 * 2^0 = 1*1 = 1
0 0111 1 = 1.1_2 * 2^0 = 1.5*1 = 1.5
0 1000 0 = 1.0_2 * 2^1 = 1*2 = 2
0 1000 1 = 1.1_2 * 2^1 = 1.5*2 = 3
0 1001 0 = 1.0_2 * 2^2 = 1*4 = 4
0 1001 1 = 1.1_2 * 2^2 = 1.5*4 = 6
0 1010 0 = 1.0_2 * 2^3 = 1*8 = 8
0 1010 1 = 1.1_2 * 2^3 = 1.5*8 = 12
0 1011 0 = 1.0_2 * 2^4 = 1*16 = 16
0 1011 1 = 1.1_2 * 2^4 = 1.5*16 = 24
0 1100 0 = 1.0_2 * 2^5 = 1*32 = 32
0 1100 1 = 1.1_2 * 2^5 = 1.5*32 = 48
0 1101 0 = 1.0_2 * 2^6 = 1*64 = 64
0 1101 1 = 1.1_2 * 2^6 = 1.5*64 = 96
0 1110 0 = 1.0_2 * 2^7 = 1*128 = 128
0 1110 1 = 1.1_2 * 2^7 = 1.5*128 = 192
0 1111 0 = +infinity
x 1111 1 = NaN

this shows another advantage of reducing the significand: fewer useless NaN bitpatterns, so more bitpatterns for numbers. Compared to the previous format, we lose every other number between .25 and 1, the only non-integer above 1 that we can express is 1.5, and our contiguous integer range is only 1-4 instead of 1-8. In exchange, we get 5 more small numbers ranging from .0078125 to 0.046875, and the top of our range is 192 instead of 14. I guess reclaiming the extra NaN patterns only had the effect of extending the range.

I think the second one (3 exponent bits, 2 significand bits) is best.

That one has:

an alternative would be to represent rational numbers. With one sign bit, and a naive representation of 2 bits each for the numerator and denominator (values 1 through 4), the only thing we'd get that we don't get above is x/3s (we lose a little due to redundancy; eliminating the redundancy would make the format more complicated). We could do a little better if we gave either the numerator or the denominator another bit. I think we should stick to floating point (better mechanical sympathy, i suspect).

---

these two papers both used 8-bit floats with 5 exponent bits and 2 significand bits:

https://papers.nips.cc/paper/2018/file/335d3d1cd7ef05ec77714a215134914c-Paper.pdf https://arxiv.org/pdf/1905.12334.pdf

this one used 6 bits for some things (gradients) but i don't see a breakdown into exponent and significand bits:

https://arxiv.org/abs/1606.06160

--- old

removed from LOVM

  1. ## Stacks

The LOVM implementation maintains two stacks in addition to the memory stack. These are called the primary stack, and the output stack. The primary stack and the output stack each have a capacity of 16 items.

The item on the top of the primary stack is aliased to register R1.

There is always at least one item on the primary stack; an attempt to pop the last item causes a stack underflow. This allows register R1, the top-of-stack alias, to always be available for reading and writing. The output stack may be empty. TODO maybe elide this if no allocation.

  1. ### The memory stack

The memory stack grows downwards in memory; the "top" of the stack is the item which is lowest in memory. When terms like 'below' are used with reference to the stack, the speaker should make it clear whether they mean below in terms of the direction of the stack, or below in terms of the direction in memory.

Items on the stack always occupy PTR_SIZE memory locations when in the memory stack, even if they would otherwise be smaller than a pointer.

  1. ### The stack cache

The primary stack and the output stack may be thought of as a cache of the tip of the memory stack. In memory, the primary stack goes below the bottom of the rest of the memory stack, and the output stack cache goes below the bottom of the primary stack cache (in terms of the direction in memory). In the following, the 'stack cache' consists of the primary stack and the output stack.

Changes to the stack cache are not guaranteed to be visible in memory until the stack cache is flushed to memory (see the flushsc2mem instruction). Changes to the stack in memory are not guaranteed to be visible in the stack cache until the memory stack is flushed to the stack cache (see the flushmem2sc instruction).

When there are changes to the memory stack that have not yet been flushed to the stack cache, we say that the stack cache is 'stale' (otherwise we say it is 'fresh'). When there are changes to the stack cache that have not yet been flushed to memory, we say that the stack cache is 'dirty' (otherwise we say it is 'clean').

When the stack cache is dirty, neither the memory stack nor the stack pointer may be written to; writes to the memory stack or the stack pointer, or executions of flushmem2sc, while the stack cache is dirty cause undefined behavior. When the stack cache is stale, the stack cache may not be modified; pushes, pops, or writes to the stack cache, or execution of flushsc2mem, while the stack cache is stale cause undefined behavior.

The stack pointer (register #2) is not updated as items are pushed or popped to the stack cache. The stack pointer is updated by flushsc2mem when changes are flushed. When the stack cache is clean and fresh, the stack pointer points to the top of the output stack in memory.
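
Here's a minimal sketch (Python, names other than the two flush instructions are invented for illustration) of the dirty/stale rules above, just to make the allowed orderings explicit:

    class StackCacheModel:
        def __init__(self):
            self.dirty = False   # stack cache has changes not yet flushed to memory
            self.stale = False   # memory stack has changes not yet flushed to the cache

        def modify_stack_cache(self):        # any push, pop, or write to the stack cache
            assert not self.stale, "undefined behavior: cache modified while stale"
            self.dirty = True

        def modify_memory_stack(self):       # any write to the memory stack or stack pointer
            assert not self.dirty, "undefined behavior: memory stack written while cache dirty"
            self.stale = True

        def flushsc2mem(self):               # flush cache to memory; also updates the stack pointer
            assert not self.stale, "undefined behavior: flushsc2mem while stale"
            self.dirty = False

        def flushmem2sc(self):               # re-read the cache from the memory stack
            assert not self.dirty, "undefined behavior: flushmem2sc while dirty"
            self.stale = False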

An implementation-dependent number of memory locations below the stack pointer are reserved for the use of the LOVM implementation. The LOVM implementation may or may not store part or all of the stack cache there.

i think most of this stuff should be implementation-dependent details. Is it really necessary to specify the layout of stack frames? Maybe just let that be opaque, and just provide ways to get a pointer to the opaque stack frame and its size; and/or provide instructions that operate on the stack at the granularity of a stack frame. Also, call it activation frame rather than stack frame, and make it so that it doesn't have to be on the stack!

  so there's sort of 3 levels to consider supporting here (in the reverse of the ordering of those last 3 bullet points):
  1) structured programing with CALL, TAILCALL, GETLOCAL, SETLOCAL, etc
  2) direct activation record manipulation, but the records themselves are opaque; you can ask the implementation for the location and length of activation records (so that you can copy them to save/restore, move them between the stack and the heap, etc), allocate new activation records, 'activate' an activation record, etc
  3) direct observation/manipulation of the entire activation record in memory according to the LOVM native default activation record layout
  Sounds like #3 is undesirable. But we can do #2.

---

removed from Boot

---

Serialize syscall

note: since you can't subtract opaque pointers in BootX, you can't serialize a data structure containing pointers (for storage in a file, for example). Just as we provided memcpy to get around this in-memory, we should probably provide a syscall to persist pointer-containing data structures to disk. Maybe this should be in LOVM though? alternately: just allow pointer-pointer subtraction and <= (but only for derived pointers). Note to disallow a < b < a (more generally: the leq relation on ptrs is a partial order). Can always disallow this higher up, and in some profiles.

actually... do you really need pointer subtraction for serialization? Maybe you don't! I guess all you really need is to be able to dereference pointers, if you know the data structure you are serializing (so you need either static knowledge of this, or tags/headers in the data). And this is mainly to figure out the size of the block of memory that is being pointed to.
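
For instance (a sketch, nothing Boot-specific): serializing a singly linked list only requires dereferencing and knowing the node layout; each pointer is replaced by the visit-order index of the node it points to, and no pointer subtraction is needed anywhere:

    def serialize_list(head):
        # 'head' is a node with .value and .next fields, None-terminated (illustrative layout)
        index_of, nodes = {}, []
        node = head
        while node is not None:
            index_of[id(node)] = len(nodes)   # assign each node an index in visit order
            nodes.append(node)
            node = node.next
        # emit (value, index-of-next) pairs; -1 stands in for the null pointer
        return [(n.value, index_of[id(n.next)] if n.next is not None else -1) for n in nodes]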

so maybe leave this as-is for awhile, and add something later only if needed.

--- old

removed from Boot reference

---

it bugs me that every value on the stack is padded up to max(INT64_SIZE,PTR_SIZE). What a waste of memory! But i can't think of a better way. Alternatives:

A) What we do; make everything too big on the stack.
+ This is feasible.

interestingly, both the Linux and Windows calling conventions require 16-byte alignment, which is even worse than what we do (and should we do that for compatibility? maybe! or maybe just say that the actual physical layout of the stack is opaque? it's already opaque in LOVM, maybe just leave it for now): https://stackoverflow.com/questions/49391001/why-does-the-x86-64-amd64-system-v-abi-mandate-a-16-byte-stack-alignment . Although i think they just mean that upon function entry/exit the stack pointer should be aligned, not necessarily that every item on the stack is aligned (but i could be wrong).

B) We could have everything take up its native size on the stack, and have everything that accesses the stack indicate the stack position by multiple indices taken together; one index for # of INT8s skipped over, one index for # of INT16s skipped over, one index for # of INT32s skipped over, one index for # of INT64s skipped over, one index for # of PTRs skipped over. The problem is that this takes a lot more bits, bits that we don't have in each operand field. We only have 4 bits.

B2) So instead, we could just specify that ints take the expected # of bytes in all cases, and then when accessing the stack specify only 2 things: # of bytes and # of ptrs. But we only have 4 bits, so split that into 2 bits each; instead of one range of 16 with 4 bits, we now have 2 ranges of 4, one for bytes and one for ptrs. And one 32-bit integer takes 4 bytes, so we can only skip over one of those. So in many cases we can't reach many function arguments with our stack-as-register addressing.

B3) Another thing we could do is pad ints smaller than INT32, and have 2 indices: # of INT32s, and # of max(INT64, PTR)s. This is much better; we can access at least 4 arguments in any case, and maybe up to 8 in the best case. But now we have an issue with stack ops like push, pop, dup, drop, swap, over, rot: we have to specify the bitwidth of each affected stack slot. This is do-able but ugly in the 32-bit encoding, and it probably takes a lot of stuff out of reach in the 16- and 8-bit encodings, because of how many more opcodes you'd need (forgetting about rots, you'd need 2 pushes, 2 pops, 2 dups, 2 drops, 4 swaps, 4 overs -- so 16 opcodes burned instead of 6).
- hmm, actually i guess that's feasible. And for stack-push-pop addr mode, you know the type of the operand, so you're good already. Might be worth it... but:
- on 32-bit or 16-bit machines, or 64-bit machines running a VM using 32-bit PTRs, this still doesn't help us with PTRs, which are still padded ridiculously
- are 32-bit ints used a lot? Yes: https://www.google.com/search?safe=active&client=ubuntu&hs=fyP&channel=fs&sxsrf=ALeKk0131_pKlU6iAEasO6cpCBTHB6InHA%3A1606019830606&ei=9uq5X7PEJM3Z-gSumgI&q=profile+usage+of+32-bit+64-bit+integers&oq=profile+usage+of+32-bit+64-bit+integers&gs_lcp=CgZwc3ktYWIQAzIICCEQFhAdEB4yCAghEBYQHRAeOgQIABBHOgQIIRAKUOEFWLILYIUMaABwAngAgAGVAYgB-AeSAQMwLjiYAQCgAQGqAQdnd3Mtd2l6yAEHwAEB&sclient=psy-ab&ved=0ahUKEwizl4WMqpXtAhXNrJ4KHS6NAAAQ4dUDCAw&uact=5 says "We also ignore issues related to the use of integers larger than 32 bits (since these are rarely necessary)"

this path (B3, i mean) is probably why the JVM has 2 sizes of stack slots; it allows you not to have everything be the biggest size, while still taking less bits (and opcode variants) to specify instructions than if you allowed stack items to be any arbitrary number of bytes (note: i think .
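
A sketch of how a (B3)-style stack-as-register operand could be resolved to a byte offset -- the 2+2 bit split and the concrete sizes here are assumptions for illustration, not anything specified:

    INT32_SIZE = 4
    PTR_SIZE = 8                          # assumed 64-bit platform; would be 4 on a 32-bit one
    BIG_SLOT = max(8, PTR_SIZE)           # max(INT64_SIZE, PTR_SIZE)

    def b3_stack_offset(operand4):
        n32 = operand4 & 0b11             # low 2 bits: how many INT32 slots to skip
        nbig = (operand4 >> 2) & 0b11     # high 2 bits: how many max(INT64, PTR) slots to skip
        return n32 * INT32_SIZE + nbig * BIG_SLOT

    assert b3_stack_offset(0b0110) == 2 * INT32_SIZE + 1 * BIG_SLOT   # skip 2 INT32s and 1 big slot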

C) Give up on abstraction over pointer size. Should we say everything is one location (most efficient for running on top of a HLL), or everything is native size (most efficient for compilation to native code)? If you want perf, compilation to native code is the best, so let's not cripple the best possibility; so make everything native size. But then we're still left with a hard problem; should PTRs be 32-bit or 64-bit? 64-bit is the obvious 'most portable' choice, but 32-bit is faster and uses less memory (i think WASM uses it for reasons like that), and here i was hoping for something that is 16-bit friendly.

i guess it wouldn't kill us to make things unportable by pointer size.

And even so, we still need to have 16 stack ops instead of 6, and have 2 ranges of 4 instead of one range of 16 in stack-as-register addr mode, like in (B3).

D) We could have something like function declarations that inform the implementation what the sizes are of each item near the tip of the stack. This could look like the function declarations in MIR (maybe some similar consideration is why MIR does that). It could also look like an assembler pragma that ranges over a linear set of static instructions. It could also look like a dynamic BootX instruction or special register that informs the VM about the sizes of things.

The problem with this is that it seems to me to be 'higher level' than we want:

So, many of these seem feasible, and they have their own plusses and minuses. To summarize:

A)

B) X not feasible, too many indices

B2) X not really feasible, can't reach enough int32 arguments

B3) + 4-8 stack slots reachable from addr mode

C)

D)

So it appears there are 3 feasible choices:

summary:

what do others do?

decision?: leaning towards (A) for now (simpler)

next day:

actually i thought of some stuff that makes (B3) more competitive:

With this, now i think (B3) is best. There are 3 big losses in (B3) compared to (A):

well -- wait a minute -- how does this affect R.O.C. implementation? If the implementation is caching some stack items in registers, how does it know the size of those items? Like, imagine that the stack cache contains an integer on top, followed by a pointer; now you do a flushroc2mem; the implementation has to know that there is an integer followed by a pointer in order to write them to memory in the correct way. We could describe the stack layout in the argument to flushroc2mem I guess, but this prevents early flushing. Actually, the implementation already knows the stack layout because each stack manip instruction specified its size, but that means that the implementation must keep track of it in order to implement stack caching correctly. That's a lot of implementation work.

You could try to store the stack cache as bytes, but if it's stored in registers, that makes it hard to operate on stack items, or even fetch them, because you have to compute a linear combination of INT32_SIZE and PTR_SIZE to identify the byte offset of any item in the stack, but registers can't be (directly) accessed via a numerical index.

so now i think (A) is best.

well.. hold on now.. imagine if we just had INT32s and PTRs taking one stack slot, and INT64s taking up 2. That seems more tractable somehow, why? Forget about PTRs for a moment and pretend the stack only had INT32s and INT64s. Even if registers are 64-bit, you could use the stack caching registers to hold only 32-bits each, and store the high 32-bits of each INT64 in a different register. Now you can use the s32 instruction to store the value in each register to the stack, without knowing if something was an INT32 or an INT64.

you still have the issue of not being able to numerically index into registers, however. And you need to burn a temporary register to do any computation with those INT64s (or maybe not? just load it into the first register). And, well, you don't really need to do the linear combination at runtime; the implementation could have a fixed table mapping each of the 16 possible combinations in a stack addr mode operand to a register.

Can the same sort of things rescue a mixture of 32-bit INT32s and 64-bit PTRs? First, assume that we are on a 'vanilla' platform in which 64-bit PTRs can be treated as 64-bit INTs (so you can still use s32 to write the low 32-bits of each PTR). An additional complexity is that now, instead of the program source code stack cache addr mode operand adding together the 1*INT32 and 2*INT64s, you have two things (INT32-count and PTR-count) that you have to add together, but as noted above, since the operand can only hold 16 possible combinations you can have a precomputed table in the implementation mapping these to registers. So i think we're good for the vanilla case, actually.
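
A sketch of that precomputed table (Python; the layout -- one 32-bit cache register per INT32 and two per 64-bit PTR -- is an assumption for illustration):

    # Map each of the 16 possible (ptr_count, int32_count) operand encodings to the
    # index of the cache register where the addressed item starts.
    REG_FOR_OPERAND = {}
    for ptr_count in range(4):
        for int32_count in range(4):
            operand = (ptr_count << 2) | int32_count
            REG_FOR_OPERAND[operand] = int32_count + 2 * ptr_count   # 1 reg per INT32, 2 per PTR

    # e.g. an operand meaning "skip 2 INT32s and 1 PTR" resolves to cache register 4
    assert REG_FOR_OPERAND[(1 << 2) | 2] == 4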

And what about the non-vanilla case? Now assume we're on a strictly typed platform that treats ptrs and ints differently. Well, in this case, in the old system where everything is the same size, the type of our stack cache items must already be a union type. If we want different sizes, the platform may insist that we prove to it using its type system that we know the type of each item in the cache. This would be a tall order, except that we can just switch from a union to a discriminated union. I can imagine that there may be some platforms that don't support/allow this. But we can probably opaque-ify (opacify) the actual stack layout to allow these platforms just to store everything in max(INT32_SIZE, PTR_SIZE).

well, wait a minute. Is adding in PTRs really as straightforward as i said two paragraphs above? Say that the stack holds 2 INT32s and 1 PTR (64-bit). Well, yes, it works, as long as you either store the INT32s packed (2 per 64-bit register), or store the PTR over 2 registers (you can still store all of it in the first one, as long as you also store the high 32-bits in the second one; although when you read in from memory you won't know if it's a 64-bit thing or two 32-bit things, so using the high 32 bits to store all of it is something you'd do later during computation, as a 'temporary' thing).

---

old

removed from boot

PTR_SIZE is guaranteed to be as large as the largest Boot type (so, any Boot type fits within a PTR_SIZE space in memory).

b/c:

---

interestingly, the C std specifies that the 'int' type could be 16-bit, actually. This is attractive to me b/c i like 16-bit (so that we can have brain-like massively parallel arrays of simple processors). But both LLVM and WASM specify integer types in terms of # of bits, so that's probably better. As has been much lamented and discussed (i think) in various Oot notes, i think the more practical decision is to go with 32 bits, not 16 (because: almost all commodity platforms use 32-bits instead of 16; and, many programs need to deal with data structures with more than 64k items).

---

one big disadvantage of opaque stacks is that you can't then use the stack ops on ordinary memory for 'user stacks' (at least, not without making the layout of those opaque too) -- well, i guess making those layouts opaque isn't the end of the world.

--- old

removed from boot_design.adoc (note; the old one is ootBootDesignOld201121.txt)

Boot is not just a substrate of Ovm, but also a subset of it. In addition to being the platform that the reference implementation of Ovm is implemented in, the Boot instruction and register set are a subset of Ovm's, and the Boot encoding is a dense encoding variant in Ovm.

Why only unsigned integers? Why only a minimum bitwidth instead of a fixed bitwidth? Why no division instruction?

To facilitate ease of implementation on many platforms, in Boot we're looking for the lowest common denominator. We can implement these other things at the Ovm level.

Lack of safety and security

If the purpose of Boot is to serve as a substrate for and a subset of Ovm, and if Ovm is to be a safe language, then why don't we require bounds-checking at the Boot level? Why do we allow invalid pointer operations to have undefined effects? Because Boot will be used to execute the Ovm implementation, and for efficiency we will trust the Ovm implementation to be safe and correct. Oot will be a safe language; Boot is not.

Note that this means that an Ovm implementation running on a Boot platform will not be able to simply identify parts of an untrusted program that it is running that contain only instructions in the Boot subset of Ovm, and blindly execute them directly on the Boot platform; it must prove safety or insert bounds-checks.

Undefined values for arithmetic over 16-bits

Not having a maximum bitwidth allows Boot programs to directly use whatever data types are native to the platform that Boot is being implemented on. This is good for efficiency, and for making it easy to implement Boot on new platforms, and for interoperation with other code native to the platform.

We allow the implementation freedom to produce arbitrary output values when the outputs 'should' overflow 16-bits because we want to allow Boot implementations to be able to simply use the native arithmetic operations of the underlying platform without inserting checks for overflow around each operation.

Syscall-y instruction choice

The reason for (malloc, mdealloc, mfree, open, close, read, write) is that they are fundamental and common. (walk) because it is fundamental. (poll) because it is called a lot in tight loops and so you want it to be efficient, and (time, rand) because they are both called in tight loops and they are fundamental.

Why are OTHER and UNDEFINED redundant?

Since we don't define the semantics of either OTHER or UNDEFINED, these would seem to be redundant. The intent is that OTHERs may be created by correctly-behaving libraries and implementation extensions, whereas UNDEFINEDs are only created by erroneous programs.

What's the point of SMALLSTACK?

SMALLSTACK's capacity limit may allow some advanced implementations to optimize it away.

SMALLSTACK is a useful abstraction for Ovm's 'custom instructions', which can call each other but only in a non-recursive, inlinable way. In Ovm, many custom instructions will use SMALLSTACK, not DATASTACK or CALLSTACK or registers, for temporary processing, in order to make it easy for custom instruction subroutines to access and manipulate data placed in registers and on DATASTACK and CALLSTACK by the main program; conversely, the 'main program' won't touch SMALLSTACK except as a temporary place to load arguments and retrieve results.

Why so many stacks?

Many languages combine CALLSTACK and DATASTACK into one, but a few like to have two different ones (eg see https://users.ece.cmu.edu/~koopman/stack_computers/sec2_1.html#211 ). In fact, some languages use even more (eg see http://docstore.mik.ua/orelly/perl/prog3/ch18_03.htm ). So I thought it might be nice to offer a second stack, although it's debatable whether we should, because a language that wanted two stacks, if we only provided one, could use the 'user stacks' facility.

As above, SMALLSTACK is a useful abstraction for Ovm's 'custom instructions', and also, i thought it would be nice to have at least one stack with a small, fixed capacity, so that advanced implementations could in some cases optimize it away.

Why are the syscall-y instructions in here? Those aren't assembly language-y. Why not make them 'system libraries'?

You can think of these as calling into a system library. We include these in order to allow us to write Oot compilers and interpreters in pure Boot.

While many VMs don't have instructions for file I/O, some do (e.g. MoarVM).

Why stackload and stacksave?

In order to support things like coroutines swapping stacks when changing context, we need first-class stacks.

Why do SKIPs and JRELs reserve two operands?

We want to reserve two operands for the implementation, so that, in Ovm's MEDIUM encoding, the implementation can inline encode relative jump magnitudes of up to 65535; this allows each instruction to be expanded to up to 255 native instructions.

Why pickk and rollk?

While not strictly necessary, these operations have a good power-to-weight ratio, since they include the common operations dup, over, swap, and rot as special cases (see the sketch below). In Ovm, we intend for our custom instructions to mainly use SMALLSTACK, not registers, for temporary storage, and these stack operations (along with the capability of LOAD and STORE to peek/poke deep in the stack without popping, and the TOS alias to the top of SMALLSTACK) make it a lot easier to do stuff without registers.
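
A sketch of why they subsume those, with a Python list as the stack (top of stack at the end; the indexing convention is an assumption, the real pickk/rollk may count differently):

    def pick(stack, k):
        stack.append(stack[-1 - k])       # copy the k-th item from the top onto the top

    def roll(stack, k):
        stack.append(stack.pop(-1 - k))   # move the k-th item from the top to the top

    s = [1, 2, 3]; pick(s, 0); assert s == [1, 2, 3, 3]   # DUP
    s = [1, 2, 3]; pick(s, 1); assert s == [1, 2, 3, 2]   # OVER
    s = [1, 2, 3]; roll(s, 1); assert s == [1, 3, 2]      # SWAP
    s = [1, 2, 3]; roll(s, 2); assert s == [2, 3, 1]      # ROT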

Why do pickk and rollk force k to be a static constant?

First, this allows them to substitute for a bunch of instructions like DUP and SWAP. Second, in higher levels of the Oot stack we're going to require statically known stack maps, so dynamic pick and roll won't be useful (outside of trusted components such as the Oot implementation itself).

What if the stack size can't fit in a UINT?

We assume that the stack's size can fit in a UINT; this assumption shows up in the SYSINFO calls that retrieve stack size, and in the stacksave format that has the stack size at the beginning. This is because the simplest architecture uses registers and stack locations that can fit a native pointer anyways, and a word that can hold a native pointer should always be able to hold a UINT indicating the size of addressable memory.

Why are malloc, mrealloc, mfree in the one-operand opcodes? We could have put them in the zero-operand opcodes and used that space for more computation instructions

There's not a great reason for this, I was just out of space. These might be moved back to the zero-operand opcodes in the future. That being said, the current situation, where they are in the one-operand opcodes, has the virtue that the location of the affected pointer is always in the instruction, rather than having that pointer be pushed onto SMALLSTACK. This might be slightly more efficient in cases of frequent mallocs, and might make it slightly less annoying for implementations to trace pointer ownership.

Why are sysinfo, log in the one-operand opcodes? We could have put them in the zero-operand opcodes and used that space for more computation instructions

Each of these has at least one argument that advanced implementations might like to know statically. Since the sysinfo query is statically known, when the result of a sysinfo query is a statically known constant, the query can be replaced with the constant at compile-time. When the result of a sysinfo query is something dynamic (like the stacksize, which in some circumstances could be queried frequently in an inner loop), a compiler might inline the code to compute the result of the query. Since the loglevel is statically known, a compiler can be instructed to strip LOG instructions with a low loglevel.

Why is 'time' an instruction instead of a magic file similar to 'rand'?

Since the current time might be queried in an inner loop, it was thought that it might be good to make this efficient. We don't want to pre-open the open file descriptor, as with 'rand', because there may be many clocks available and we don't know which one a program wants, and there are too many potential clocks for us to want to pre-open them all. We don't want to have the program OPEN a clock, because then different programs, and possibly even different runs of the same program, will have different file descriptors for the same clock, meaning we can't just hardcode a special-case check for a magic file descriptor number in READ (as we can with 'rand'), which means that reading a clock in this manner may be less efficient than having an instruction.

Why is 'rand' a magic file instead of an instruction?

To save opcode space. This can even be efficient; since 'rand' is given a fixed open file descriptor, it seems easy to hardcode a special case into READ.

Why not give 'loadi' 12 immediate bits instead of only 8?

Because you can instead use 'loadi' followed by 'loadlo' to load more than 8 bits.

Why have devices instead of additional opcodes?

First, there are limited opcodes. It's efficient to save these opcodes for frequently used, fast operations, and to use something more cumbersome, like devices, for operations which are relatively slow (like I/O) or infrequent.

Second, the notion of a 'space' of devices located by device id provides conceptual economy. We can reuse the same device API for many instances of this object; for example, we have one channel control device API that we then use for every channel. At the same time, different types of devices can have different device APIs.

Third, all of these device APIs use the same few device opcodes, allowing code analysis tools to make some basic assumptions about which registers and memory addresses might be affected by a given device instruction even without understanding the device type.

Why don't signals have attached data? Why don't you have a count of pending signals, instead of just a bit?

If signals had attached data, then the implementation might:

The last option is complex and inefficient, and all the others require the implementation to maintain some sort of buffer that could fill up if too many signals come in faster than they can be caught. Another option could be to provide an attached data option but not guarantee delivery; for example, if another signal of the same type comes in before the first one is handled, the latter signal's data could overwrite the former's. This would make the attached data less useful, however.

All of these options add edge cases for programs to work around, and all of them add implementation complexity (and also increase the latency of sending and receiving a signal). And if threads or processes want to send messages to each other, there is already a mechanism provided for that: channels.

Therefore, i chose for the effect of a signal only to set a one-bit flag (and to call the signal handler when the flag is set and the signal is unmasked). Signals are not mailboxes; signals are like the little flag on the mailbox.

Similarly, a count of pending signals could overflow or saturate.
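
A sketch of the chosen design (Python; the signal count, the handler table, and the polling-style dispatch are purely illustrative): raising a signal only sets a one-bit pending flag, and the handler runs when the flag is set and the signal is unmasked:

    NUM_SIGNALS = 8                                  # assumed count, for illustration
    handlers = [lambda: None] * NUM_SIGNALS          # per-signal handler table
    pending = [False] * NUM_SIGNALS                  # one bit per signal: no queue, no data
    masked = [False] * NUM_SIGNALS

    def raise_signal(sig):
        pending[sig] = True          # raising an already-pending signal changes nothing

    def dispatch_signals():
        for sig in range(NUM_SIGNALS):
            if pending[sig] and not masked[sig]:
                pending[sig] = False # clear the "flag on the mailbox" before handling
                handlers[sig]()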

Why doesn't receipt of a signal spawn a new thread, instead of interrupting an existing thread?

The whole point of signals is to provide a mechanism to interrupt processing in a single thread. If you want a new thread to handle some event, you can spawn one explicitly.

11 GPRs is too few. Why aren't there more?

If we have 16-bit 3-operand instructions, there's not much space. 4 bits for the opcode is already pushing it; this only gives us 16 3-operand instructions (or, as we do it, 15 3-operand instructions plus a number of instructions with fewer operands). We could take away some bits from some of the other operands and make one of the operands bigger, but then that's irregular and possibly annoying to write compilers for -- and it would privilege the lowest registers anyway, because those are all that would fit in the smaller operands.

So given that we have 16 registers, we spend 5 of them on special registers, so 11 are left.

Why do you waste 5 registers on special registers? We could use more GPRs

The 5 special registers, R0 through R4, are:

0. 0 (constant 0 register) (also used to indicate SMALLSTACK to stack-aware instructions)
1. 1 (constant 1 register) (also used to indicate DATASTACK to stack-aware instructions)
2. ERR (also used to indicate CALLSTACK to stack-aware instructions)
3. PC
4. T (alias to the item on the top of SMALLSTACK)

None of these are need-to-haves but they are all very-nice-to-haves.

Having registers for SMALLSTACK, DATASTACK, and CALLSTACK allows the implementation to implement these stacks at a lower level. That may make accessing these stacks more efficient, and it allows the implementation to do something complicated behind the scenes of PUSH and POP, for example, dealing with segmented stacks, or in the case of SMALLSTACK dealing with register T. As a bonus, with non-stack-aware instructions these registers are not totally wasted, because they also serve as constants 0 and 1 and register ERR.

The constant zero register adds efficiency because we don't have an immediate addressing mode (or any addressing modes), so without this we'd have to have a bunch of 'LOADI 0's all over the place. Similarly for constant 1.

Register ERR replaces other architectures' carry bit, high-order multiply result register, and ERRNO function. This stuff would have to be stored somewhere anyhow, and making it an accessible register removes the need for special instructions to retrieve hidden carry/status/lasterror state.

Register PC allows somewhat efficient code to access memory addresses relative to the PC. In OVM, which will have addressing modes, and might have an addressing mode that adds together the contents of two registers and uses the result as an effective address, this might be even more efficient.

You might say that register T is not even a special register, but rather a GPR; it can be used just as the other GPRs, the only thing 'special' about it is that PUSH and POP treat it specially (you might think of SMALLSTACK as actually being a stack with 0 to 15 items, but PUSH pushes T onto that stack and then CPYs its target register into T, whereas POP copies T into its target register and then pops the stack into T). There are various other instructions that also treat T specially, namely by implicitly taking inputs from or placing outputs into T.

Register T exists because stack machines often have a cache of the TOS, for efficiency. This is less useful in Boot but in OVM we'll probably add a stack addressing mode, in which case it will be more useful. Also, our primary intended use-case for SMALLSTACK is to enable OVM (Oot Virtual Machine, which will target Boot/BootX) to have 'custom instructions' whose implementations are required to preserve the other registers and stacks; since computation requires at least a few temporary storage locations, SMALLSTACK serves that purpose. Since our instructions get and set registers, rather than pushing and popping their results from the stack, we need at least one register for the use of these custom instructions; T serves that purpose.
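
A sketch of the PUSH/POP behavior described in the parenthetical above (Python model; the register names and the dict representation are illustrative):

    regs = {'T': 0, 'R5': 0}     # T caches the top of SMALLSTACK
    smallstack = []              # the 0..15 items *below* T

    def push(src):               # PUSH src: T goes down onto the stack, then src is copied into T
        smallstack.append(regs['T'])
        regs['T'] = regs[src]

    def pop(dst):                # POP dst: T is copied into dst, then the stack pops back into T
        regs[dst] = regs['T']
        regs['T'] = smallstack.pop()

    regs['R5'] = 42
    push('R5')                   # smallstack == [0], T == 42
    pop('R5')                    # R5 == 42, T == 0 again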

Why not have a TOS cache for DATASTACK and CALLSTACK too, and multiplex the stacks with these registers, instead of with constant registers and the ERR register?

Because you might want to LOAD and STORE to the addresses on the TOS of various stacks.

Why are there no 'pointer literals', no way to construct immediate/constant pointers (not even the null pointer)?

First, pointers are supposed to be opaque; since we don't want to pin down the representation of a pointer, we can't specify pointer literals.

Second, on most systems, if you could construct a pointer to memory location zero, you could then get a pointer to any memory location by adding some offset. But that's effectively like coercing integers to pointers, which we don't want to allow. (you could also allow some pointer literals, for example the null pointer, but then forbid addition and subtraction of offsets to these; but then we really have two types of pointers, one type that permits addition and subtraction and one that doesn't, which would be more complicated).

Why do you allow pointer subtraction?

First, this allows us to serialize complex data structures containing pointers (assuming all of the pointers are 'derived' from a common ancestor) by converting the pointers into integer offsets from each other.

Second, this allows you to check how much memory is left until you run out of room when you have, e.g., a stack or array pointer, and a base pointer to the memory region the stack is in, and knowledge of how big that region is.

Third, this allows you to add an index to a base pointer, store both the base pointer and the new pointer but throw away the index, and then later recover the index.
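
A sketch of that third use (pointers modeled as plain integers, and the element size is an assumed constant, purely for illustration):

    base = 0x1000                       # base pointer to some array
    ELEM_SLOTS = 3                      # assumed element size, in the same units SUB-PTR uses
    derived = base + 37 * ELEM_SLOTS    # ADD-PTR-INT: pointer to element 37; the index is then discarded
    # ... later, holding only 'base' and 'derived':
    recovered_index = (derived - base) // ELEM_SLOTS   # SUB-PTR, then divide by the element size
    assert recovered_index == 37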

Why do you require derived pointers for LEQ-PTR and SUB-PTR?

As noted above, pointers are supposed to be opaque, both to help prevent pointer errors, and also to allow the topology of memory to be something other than a single large array. A pointer and another pointer derived from it CAN be thought of as existing at integer offsets within a single larger array, but this is not generally true of any two pointers.

Why do you require that, even if pointers aren't derived, LEQ-PTR can't say both P1 <= P2 and also P2 <= P1?

Because, if the pointers were derived, then this would imply that they are equal. That could lead to confusing bugs.

Why do you require that, even if pointers aren't derived, SUB-PTR can't say both P1 < P2 and also P2 < P1?

Because in this case < seems similar to <=, and if it were <=, and if the pointers were derived, then this would imply that they are equal. So this could also lead to confusing bugs.

Why is SUB-PTR a two-operand instruction instead of three-operand?

Just because we ran out of 3-operand opcodes, and this seems like it may be a less common operation than the other 3-operand arithmetic operations.

Why can the implementation choose to skip all-1 instructions (ANNOTATE 15 15 15) instead of crashing?

Because all other ANNOTATE instructions are skipped, and in some implementations it may be significantly faster to just skip over all ANNOTATE instructions without looking at their operands.

Since programs don't know the value of MAX_UINT beforehand, do they have to have extra code to deal with various possible sizes?

No. A valid Boot program is allowed to assume that MAX_UINT is a certain number and work correctly only on platforms that provide that size of UINT. For example, you could write a Boot program that only works correctly on 32-bit platforms. Of course, if you choose to, you are also able to write Boot programs that can handle multiple UINT sizes.

Why is there a cast-uint-ptr instruction in BootX? Doesn't that make it impossible to be safe?

This is an optional instruction and safe implementations should not implement it. This is included so that, for unsafe implementations, there is a platform-independent way to do this. This allows BootX to be a target for C compilers.

What are some disadvantages of this design?

Why can't the offset on LD, ST take negative values?

I decided to use all of the 8 values for non-negative values so that we can reach deeper into stacks. Positive values are probably more useful than negative for other things, too (such as indexing into arrays or addressing items within structs).

Why are there stack-aware PUSH and POPs that PUSH and POP SMALLSTACK? We can already do PUSHes and POPs of SMALLSTACK by using CPY with register 2 operands, so this is redundant.

It's true that 'CPY 2 x' and 'PUSH 2 x' do the same things, and that 'CPY x 2' and 'POP x 2' do the same things.

This is because some parts of the toolchain might find it useful to be able to explicitly represent PUSHes and POPs of the SMALLSTACK.

In addition, if there was no special case, PUSH and POP would be kind of useless when given register 2 operands anyways, since they would PUSH or POP something to some user stack, but they would consume the stack pointer while doing so.

Is BootX a RISC instruction set?

Probably not. Unlike Boot, BootX has addressing modes, which allow multiple loads and stores to be embedded within almost every instruction.

It is also a variable-length instruction set (although still a simple one), with 8-bit, 16-bit, and 32-bit instructions.

Why don't you add some restrictions that would make Boot easier to compile? For example, require CVT-INT-CODEPTR to only be applied to newly-loaded immediate values, and require the layout of SMALLSTACK at any single point in the program to be fixed and statically knowable

This would add complexity to the specification. More importantly, it would prevent certain kinds of programs from being written in Boot, for example a REPL (in which the argument of CVT-INT-CODEPTR, and the layout of SMALLSTACK, may vary based on user input).

Why don't you specify a fixed signature for LIBRARY? This would aid program analysis.

Not specifying a fixed signature for LIBRARY means that program analysis tools must be provided with the platform-specific signatures for any LIBRARY invocations.

This is not fixed because it is consistent with the spirit of omitting other restrictions that would make program analysis easier (see previous question).

Why does CVT-INT-CODEPTR exist in Boot? It makes program analysis harder.

This instruction exists so that programs have a way to do jumps farther than the 2KiB (in either direction) limit of JREL.

In Boot, getting the PC with PUSHPC and then doing arithmetic on it would not be good enough, because pointer arithmetic is in units of abstract 'slots' which are the size of native pointers/ints, but program jumps need 16-bit granularity, and slots may be larger than 16 bits (in fact, we give jumps byte granularity, for upwards compatibility with BootX, where we will have some 8-bit instructions).

Why can't CVT-INT-CODEPTR also work for regular PTRs?

Because some implementations may have a different representation for CODEPTRs and regular PTRs.

Why isn't there a CVT-INT-PTR for regular PTRs?

There is no need for this unless you need to access something in memory outside of the Boot program (and outside of its allocated memory) which lives at a fixed address. In which case, this must be a platform-specific thing, so we leave it up to implementations to provide a facility to do this.

Why do you have a smallstack?

Stacks allow for higher code density, compared to adding an additional number of registers equal to the stack depth. This is because with registers, instruction encodings must specify which register is being used, but you don't have to specify which stack location you are using because it is assumed to be the top of the stack. It also allows for slightly easier program analysis in some cases (specifically, escape analysis), because items are popped off of the stack when they are used for the last time, instead of continuing to sit in a register after their last use.

Note that, in some cases, when the program uses the smallstack in a predictable manner, the compiler is advanced, and many registers are available on the underlying platform, compilers may be able to 'compile away' the smallstack and translate smallstack operations directly into register operations.

Why do you have two stacks? Why not just have the operand stack within each stack frame of the call stack like, eg, the JVM?

Having two stacks allows higher-level languages to more easily provide forms of function calling with less overhead, which encourages code to be factored at a finer grain. For example, let's say you have a place in your function where you sort the order of the top 5 items on the operand stack. Should you factor this out into a different function called SORT5? If the operand stack is within the stack frames of the call stack, if you call another function, you would have to push the return address onto the call stack, and then the newly-factored SORT5 function would have to be changed to reverse the 2nd, 3rd, 4th, 5th, and 6th items on the call stack, instead of the 1st, 2nd, 3rd, 4th, 5th items. So far so good, but what if the implementation of SORT5 itself calls other functions, which expect to access the 1st, 2nd, 3rd, 4th, 5th items on the stack? Now you have to make new versions of those functions, too. Or, you could just add a function prolog and function epilog to SORT5 which moves the return address up 5 spots in the call stack, exposing the 5 items to be sorted. But any of these alternatives have an efficiency cost, which discourages programmers from factoring code into subfunctions willy-nilly.

But if you have two stacks, then you can push the return address onto the call stack, and then SORT5 can immediately operate on the operand stack. There is no need to adjust the code of SORT5 or its dependencies or to add a function prolog or epilog. The only overhead is the push of the return address, the jump to the SORT5 routine, and the pop of the return address and the jump back to the caller -- this is still substantial but not as substantial.
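
A sketch of the two-stack version (Python model; the stacks, the call helper, and SORT5 itself are all illustrative): the factored-out SORT5 touches only the data stack, so its body is identical whether it is inlined or called, and the only call overhead is the return address on the call stack:

    data_stack = [9, 3, 7, 1, 5]
    call_stack = []

    def sort5():
        data_stack[-5:] = sorted(data_stack[-5:])   # operates directly on the top 5 items

    def call(fn, return_address):
        call_stack.append(return_address)           # the only per-call bookkeeping
        fn()
        call_stack.pop()                            # 'return' = pop the address and jump back

    call(sort5, return_address='after_sort5')
    assert data_stack == [1, 3, 5, 7, 9] and call_stack == []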

To give another example of the potential efficiencies of two separate stacks, let's say that function f() is called with 8 arguments, and f() calls function g() and passes those 8 arguments along, and g() then calls h() and passes those 8 arguments along (and assume that none of these are tail calls, and also that neither f() nor g() need those 8 arguments after they pass them on). The calls from f() to g() and from g() to h() will typically be implemented by pushing a return address to the call stack, and then making copies of all 8 arguments and pushing them onto the call stack. If we had 2 stacks, however, we could just leave the arguments on the operand stack and push only the return addresses onto the call stack, saving us the trouble of copying those 8 arguments over and over. Similar savings are sometimes available in function returns.

A more general way of looking at this is the concept of 'concatenative programming languages' (for example, Forth is a concatenative programming language), in which the juxtaposition of expressions denotes function composition. If the effect of a function call on program state is confined to pushing a return address to a call stack and jumping, then any subset of a function can be moved into its own function without alteration, and still perform the same operation on program state; that is to say, code which was previously juxtaposed (eg two sequences of instructions which were next to each other) can be refactored into function composition without alteration.

Won't it be hard for implementations to allocate memory for a smallstack?

No, because the smallstack is of a fixed size (16). This means that the additional memory requirement for it is no worse than the memory requirement to store the 16 registers.

Which instructions in Boot do you think are the least important / which ones would you take out first?

SAR (or maybe SLR, since we have signed integers), and ROT (actually i did take out ROT!).

Why don't you have some instructions (for example, 'LOADI, Load Immediate constant') have an additional byte or bytes of data that comes right after it in the instruction stream?

I have this idea that some analysis tools could run more efficiently if they can represent the program as an array of instructions of some fixed size (for example, an array of 16-bit instructions, for Boot). If some instructions include 'data' after the main instruction, then either the analysis tool has to represent that instruction separately from the data (perhaps referenced by a pointer), or it has to increase the size of all of the items in the array just so that it can occasionally fit in extra data, or it has to reference the data in the instruction stream itself (which is itself made slightly more difficult by having the data after the main instruction).

Probably not that important. Maybe i should give in and add data.

Were there any design considerations in the ordering of opcodes?

Yes, they were chosen so that at runtime, a non-type-checking interpreter dispatcher could decide what pre-processing of input arguments needs to be done (such as applying stack-awareness or, in the case of BootX, addressing modes) by checking which range the subopcode is in, without storing a full table of instruction signatures.

Within each block of instructions sharing the same number of operands (and the same addressing mode, for BootX), opcodes were chosen with the following considerations in mind, in descending order of priority:

(in BootX, explore moving these different classes of instructions into different opcode addressing mode blocks)

Were there any design considerations in the ordering of operands within instructions?

Yes, in descending order of priority:

Why is ANNOTATE opcode zero? We want the runtime to check for the all-zero bitpattern (and terminate with an error if it is found), but we also want it to be able to completely ignore ANNOTATE instructions.

It was decided to combine these two special cases into one opcode, as it seems conceptually nicer to group special cases together; for any given instruction, once it has been checked that the opcode is >0, neither special case needs to be thought about anymore.

Implementations are encouraged but not required to check for the all-zero bitpattern, so in situations in which it is much more efficient to completely ignore ANNOTATE instructions, an implementation may also ignore ANNOTATE 0 0 0.

Note that the 32-bit BootX encoding permits ANNOTATE 0 0 0, because the format bits of the 32-bit encoding include a 1, so there is no all-zero 32-bit pattern in that instruction.

Why are the 32-bit BootX INSTR-ZERO instructions implementation-dependent?

For the sake of code density, BootX supports a mixed-width instruction set with 32-bit BootX instructions, 16-bit Boot instructions, and 8-bit BootX instructions.

However, the primary goal for Boot and BootX is easy portability. On some platforms, a fixed-width instruction set may be more efficient or easier to implement. Therefore, BootX also supports a fixed-width instruction set with only 32-bit BootX instructions. Care has been taken so that the 32-bit BootX instruction set can easily express any 16-bit Boot instruction as a special case. This requires that the 32-bit BootX instruction set has room to express the 16 zero-operand INSTR-ZERO instructions from Boot.

What is the relationship between Boot and BootX?

The motivation for defining both Boot and BootX is that Boot is simpler to implement but BootX is probably more performant.

BootX can be compiled to Boot, with the exception of some of the optional instructions in the SYSCALL2 subset, which expose platform functionality not required by Boot (eg 'time').

Boot is a smaller, simpler ISA that is easier to implement. It is a RISC fixed-width load/store architecture. Any Boot program can be transformed into a subset of Boot in which implicit SMALLSTACK PUSHes and POPs are factored out into explicit PUSH and POP instructions.

BootX is a larger, slightly more complex ISA that may permit better code density (especially when used in concert with Boot as a 32/16/8-bit mixed-width encoding) and performance. It is less RISC and more CISC-y (although not too much). It contains additional instructions for operations that could be expressed in software in Boot, but which can be expressed more concisely in BootX (e.g. JMP), or that are likely to be exposed as primitives on many underlying platforms (eg integer division, floating point arithmetic). BootX implementations can support only a fixed-width instruction stream of 32-bit BootX instructions, or they can support a mixed-width instruction stream of 32-bit BootX instructions, 16-bit Boot instructions, and 8-bit BootX instructions.

Why does Boot have undefined behavior?

For those illegal operations for which naive execution would cause undefined behavior on popular targets, we permit undefined behavior.

When an illegal operation is encountered, Boot implementations are permitted to forgo raising an error, and may attempt to execute it anyway, provided that the potential result of doing so is no worse than allowed for that sort of illegal operation. The motivation for this ambiguity is to allow implementations to dispense with error checking.

Why does Boot provide an instruction for adding a number of bytes to a codeptr?

Consider the following sequence of instructions:

1. Push return_address to stack
2. Push argument1 to stack
3. Push argument2 to stack
4. Jump to subroutine
5. (after return of subroutine) do other stuff

When the subroutine returns to return_address, you probably want it to return to the next instruction after the jump, #5. But if there is no way to do pointer arithmetic on CODEPTRs, but rather only to fetch the PC, then if you tried to implement step #1 with PUSHPC, you would get a codeptr pointing to the instruction at step #2, and there would be no way to convert it into a codeptr pointing to the instruction at #5.

Ordinary add-ptr-int can't be used here, because that is in units of 'slots' (the size of a native pointer), but instructions in code must be addressed in units of bytes. So we need a new instruction, so that we can do:

1. PUSHPC to stack
2. add-codeptr-int (the item you just pushed onto the stack) + (10 bytes)
3. Push argument1 to stack
4. Push argument2 to stack
5. Jump to subroutine
6. (after return of subroutine) do other stuff

Why do alignment requirements and branch and jump target units vary by instruction length?

The spec says that "N-bit instructions must be N-bit aligned. Branch and Jump targets are in units of N-bits. So, for example, 8-bit instructions must be 8-bit aligned, 16-bit instructions must be 16-bit aligned, 16-bit instruction branch and jump targets count in units of 16-bits, 32-bit instruction branch and jump targets count in units of 32-bits.".

The motivation for this is to help implementations which recompile everything to one instruction length, and then only support execution of instructions of that length. This ensures that if, e.g., only 32-bit instructions are used, then (a) 32-bit alignment is required, which simplifies fast hardware implementations that speculatively decode future instructions in parallel, and (b) 32-bit branch and jump targets count in units of 32 bits, which makes sense because if 32-bit alignment is required then these are the only possible targets anyway; any other choice would waste instruction encoding space and unnecessarily reduce the effective maximum jump/branch distance expressible by instructions.

As a bonus, these requirements help debugging, because now any region of memory can be decoded into instructions without first knowing the initial offset at which execution is to begin.
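A sketch of the implied target computation (Python; the names are illustrative, and a pc-relative branch is assumed, leaving aside whether the base is the branch itself or the following instruction): for an N-bit instruction, each unit of branch offset is N/8 bytes, which is also the instruction's required alignment.

    def branch_target(pc, offset, instr_len_bytes):
        # N-bit instructions are N-bit aligned
        assert pc % instr_len_bytes == 0
        # branch offsets count in units of the instruction's own length,
        # e.g. one unit of offset is 4 bytes for a 32-bit instruction
        return pc + offset * instr_len_bytes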

Given the alignment restrictions, why not treat each 16-bit chunk of 8-bit instructions as a packet, and treat each 32-bit chunk of 16-bit instructions as a packet, and in the second instruction in each packet, use the redundant instruction width bit for something else?

This would be more efficient but would also be more complicated, so i decided not to do it.

As a bonus, this form of redundancy allows the instruction stream to be somewhat self-synchronizing, that is, if the instruction stream is being read as a bytestream from media that doesn't specify alignment, alignment can be guessed by seeing where the 16-bit and 32-bit instruction boundaries tend to be.

Wouldn't it permit faster parallel speculative decoding hardware if there were only one instruction length?

Yes, but one of our goals is to support massively-concurrent, "brain-like" computing. The brain appears to achieve massive concurrency at the cost of relatively slow serial execution speed in each individual processing unit.

In order to support more cores (here, i am using the word 'core' to refer to the unit which is being repeated for massive concurrency; a CPU paired with some local memory and networking capability) in a given die area or thermal envelope, at the cost of slower serial execution speed, one might not be doing fancy stuff like parallel speculative decoding anyways.

Multiple instruction lengths permit higher instruction density, and this allows one to give less local memory to each core, which would allow more cores to fit in a given die area or thermal envelope.

Wouldn't it be simpler if there were one instruction length?

Yes, and this is why Boot supports only one instruction length, and only the more complex BootX? supports multiple instruction lengths.

Why is Boot 16-bit focused instead of 32-bit focused?

It is true that contemporary processors and programming toolchains focus on 8-bits (e.g. byte-oriented addressing) and 32-bits (e.g. the default size of ints in many toolchains; the bitwidth of the most popular >8-bit MCU architecture (ARM)). So why do we focus on 16-bits?

Because 16 bits is the smallest power-of-two address width that addresses an amount of memory (64 KiB, byte-addressed) sufficient to hold useful/interesting general-purpose programs, for example, practical (if very small) compilers and interpreters.

Why do we care about choosing the SMALLEST power-of-two? Because, in order to support massively-parallel, "brain-like" computing, we want to have very small, simple cores in exchange for more parallelism. 16-bit CPU hardware will be smaller, and we can't fit much local memory into each core anyways.

Well in that case why include support for addressing larger amounts of memory, and for 32-bit instructions, at all? Because another goal of Boot is to be a portable and reasonably efficient target language on contemporary machines.

things to mb reconsider someday

new stuff 190526

new 201020

Typically we expect creators of software that runs on BootX, and of implementations of BootX, to focus on profiles rather than supporting arbitrary combinations of subextensions. Since each profile includes all smaller profiles, this allows creators of software that runs on Boot to prepare for O(n) possible configurations instead of O(2^n). However, the subextensions are also defined for use in situations in which the linear ordering of the profiles really doesn't fit.

--- old

removed from boot_design.md

To achieve the goal of creating an interoperable 'glue language', I want to make it easy for other people to create (slow, poorly integrated) implementations of Lo that are written in, and run on, a wide variety of platforms. For example, I want people to be able to implement Lo in Python, in Haskell, in Java, in C, in Rust, in Octave, in C#, in Javascript, etc. If Lo were implemented in, say, LLVM, then that would mean creating an LLVM backend for each target platform. It seemed to me that all of these other 'common VMs/ISAs/languages' are too much trouble to implement/port/create backends for -- enough trouble that re-implementing one of them on a new platform would be a 'serious project' rather than something that one person who is not a programming language expert might attempt to do over a weekend.

--- old

removed from boot_reference.md

Registers are encoded in operands as (treating operands as unsigned 7-bit integers; both register number and encoded value in decimal):

Other operand values are RESERVED.

---

executable size limits and representative app sizes on various platforms (listed even if there is a way around them or if they are old/obsolete):

4 GB

2 GB

60 MB, 80 MB (iOS) https://developer.apple.com/forums/thread/3757

64k reference limit (Android) https://developer.android.com/studio/build/multidex

150MB, 100MB (.ABB, APK file sizes, Android) https://medium.com/unpacking-react-native/google-play-store-app-size-f7a1b9fb58d0

80/60/400MB (apple) https://support.unity.com/hc/en-us/articles/208412186-IL2CPP-build-size-optimizations

cockroachdb 123MB https://www.cockroachlabs.com/blog/go-file-size/

gcc/cc1 27MB

100-140MB https://thephd.github.io/sol3-compile-times-binary-sizes

16 MB (embedded flash) https://semiengineering.com/embedded-flash-scaling-limits/

512 kB (mouser SOC category filters) https://www.mouser.com/Semiconductors/Embedded-Processors-Controllers/Systems-on-a-Chip-SoC/_/N-8h6y5 https://www.mouser.com/Semiconductors/Embedded-Processors-Controllers/Systems-on-a-Chip-SoC/SoC-FPGA/_/N-hx1619

>=64 MB (1 of 44906), >=8 MB (47/44906), >=4MB (348/44906), >=2MB (1535/44906) (mouser MCU category filters) https://www.mouser.com/Semiconductors/Embedded-Processors-Controllers/Microcontrollers-MCU/_/N-a85i8

"In short, an ESP8285 is like an ESP8266 but with 1Mb memory on chip"

"The on-board flash chip of the ESP8266 has plenty of space for your webpages, especially if you have the 1MB, 2MB or 4MB version. " https://tttapa.github.io/ESP8266/Chap11%20-%20SPIFFS.html

"ESP-03 only have 4 M bit (512 KB) flash, there are many limitation"

"First you need to know the size of the flash memory attached to esp8266 on your module. Common sizes are 512 kB (4Mbit), 1 MB (8Mbit) and 4 MB"


so alternatives seem to be:

---

given all that, embedded 32-bit label and jump data after instructions (what we have right now) is looking pretty good.

some other RISC-y solutions: load constants from a given offset in the instruction stream (but embedding the data right after the instruction is just a special case of that); or a load-high-immediate sort of thing, where it takes two instructions to load the large immediate label and the immediate is split across the operands of those two instructions. It seems to me the latter would make tooling and efficient execution harder rather than easier, even though now 'everything is an instruction'.
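For comparison, a rough sketch of that load-high-immediate alternative, along the lines of the RISC-V lui/addi split (Python; the 20/12 split and the sign adjustment are the RISC-V convention, not anything Boot defines, and 32-bit wraparound of the high part is ignored):

    def split_imm32(value):
        # split a 32-bit immediate across two instructions: one loads the
        # upper bits, the other adds a signed 12-bit low part
        lo = value & 0xfff
        if lo >= 0x800:      # the low part will be sign-extended as negative,
            lo -= 0x1000     # so the high part must be bumped to compensate
        hi = (value - lo) >> 12
        assert (hi << 12) + lo == value
        return hi, lo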

leaning towards sticking with what we have.

---

ok i think this is why RISC-V doesn't do this, and instead loads large immediates over multiple instructions:

imagine a system like ours, where occasionally you have a 32-bit instruction with a 32-bit data payload following it. Since this happens sporadically, sometimes an instruction and its payload will fall on opposite sides of a page boundary. So imagine a hardware CPU executing this. It executes the instruction, but now it wants to fetch the data, and that fetch hits a page fault, so the CPU may have to invoke the OS. The instruction isn't complete, though, so while this is happening the CPU has to remember that it was in the middle of that instruction. So it needs to store some extra state somewhere.

(this is mentioned briefly in the middle of here: https://stackoverflow.com/questions/21481427/what-is-the-advantage-of-having-instructions-in-a-uniform-format/21488077 )

this is concerning, but i'm not sure if it's really that much worse than doing an ordinary memory load. We still only hit memory once per instruction; it's just that now we're hitting program memory instead of data memory. Perhaps j32 is a little different because in that case something special is also occurring after the load. On a platform where that matters, though, j32 can be trivially compiled to ll32; jy, and for program analysis j32 is nicer because the immediate presence of the jump target isn't hidden by unnecessary indirection. Also, on a platform where that mattered, HINT instructions could be inserted as no-ops (or others; eg cp 0 0, etc) as necessary before any li32, ll32, or j32 to force them into 64-bit alignment.
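A sketch of that padding idea (Python; emit, NOP, and the notion of building the code into a bytearray are placeholders for illustration, not the real assembler): before emitting any li32, ll32, or j32, pad with a 32-bit no-op if the current offset is not 64-bit aligned, so the instruction and its 32-bit payload never straddle an 8-byte (and hence never a page) boundary.

    NEEDS_PAYLOAD = {'li32', 'll32', 'j32'}
    NOP = b'\x00\x00\x00\x00'  # placeholder encoding for a 32-bit no-op/HINT

    def emit(code, name, encoded):
        # code is a bytearray of already-emitted instruction bytes
        if name in NEEDS_PAYLOAD and len(code) % 8 != 0:
            code += NOP        # force 64-bit alignment of instruction + payload
        code += encoded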

so, still leaning towards sticking with what we've currently got

--- removed from boot_reference:

Terminate program, with RESULT code passed on top of the stack cache (so set it with appmn &2 &2 1; swc $result &2 0). The result code is interpreted in a platform-specific way (however, most typically, success is indicated with a result code of 0).

Register 2 is the stack pointer and must always contain a valid pointer. It is guaranteed to contain a valid pointer at the beginning of the program, and the program must not change it to something other than a valid pointer. At the beginning of the program, register 2 is pointing to a location on the stack which is available for the Boot program to use.

At the beginning of the program, register 2 is pointing to a location on the stack containing the first input argument for the Boot program. Above this in memory (so 'deeper' into the stack) are the other input arguments for the Boot program.

## syscall 1: memcpy(DST : ptr, SRC : ptr, SIZE : int32)

Copy SIZE bytes starting at memory location SRC to memory starting at memory location DST.

Arguments are passed on the stack, with DST on top of the stack, followed by SRC, followed by SIZE.

The caller must assume that the values in registers TODO may be overwritten during the call.
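A sketch of how an implementation might service this call (Python; mem as a flat bytearray, 32-bit little-endian pointers, and a downward-growing stack so that 'deeper' arguments sit at higher addresses are all assumptions for illustration):

    def syscall_memcpy(mem, sp):
        def load32(addr):
            return int.from_bytes(mem[addr:addr+4], 'little')
        dst = load32(sp)        # DST is on top of the stack
        src = load32(sp + 4)    # then SRC
        size = load32(sp + 8)   # then SIZE
        mem[dst:dst+size] = mem[src:src+size]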

---

some snippets while testing/debugging pyboot (pyboot3):

  1. printf "li 5 5\nsys 2\ncp 6 5 0\nli 5 5\nsys 2\nsp 5 5 0\nli a a\ns32 a 6 0\nli 7 3\ncp 2 6 0\nli 1 1\nap 3 2 1\ns32 1 3 0\nap 3 3 1\ns32 1 3 0\nap 3 3 1\ns32 1 3 0\nsys 1\nll32 a 0 0\n.d ffffffff\nbreak 0 0 0"
./basm.py ./boot.py - -vvv
  1. to check that loading 65536 with l16 gives 0:
  2. printf "li 5 5\nsys 2\nli 1 1000\nli 2 2\nmul32 1 1 2\nmul32 1 1 2\nmul32 1 1 2\nmul32 1 1 2\ns32 1 5 0\nl16 4 5 0\nbreak 0 0 0"
./basm.py ./boot.py - -vvv
  1. to check that l16 is working with negative values:
  2. printf "li 5 5\nsys 2\nli 1 1000\nli 2 2\nmul32 1 1 2\nmul32 1 1 2\nmul32 1 1 2\nmul32 1 1 2\nsub32 1 1 2\ns32 1 5 0\nl16 4 5 0\nbreak 0 0 0"
./basm.py ./boot.py - -vvv

---

implemented Boot in Python again on Nov 26 (Thanksgiving) and Nov 27 (commit 39df6da; "pyboot3").

---

one thing that i am still slightly unhappy with is the choice of 32-bit instead of 16-bit.

for all the reasons in ootNumerology section 'removed from boot_design.adoc === Why 16-bit?'. Also, i realized why i think even 64 is slightly better than 32; 16 and 64 are both square numbers, 32 is not! Also 16 and 64 are both even powers of 2, 32 is not; the square powers of 2 are identical to the even powers of 2.

let's enumerate the places where the 32-bitness matters (places where programs would break if they assume we are exactly 32-bit rather than just 32-bit or more). We can look thru boot.py for instances of '32' and '31' (later note: 31 didn't actually add any new instances to this list):

and i think that's it

The first 5 seem pretty clear. Let's focus on the last two, signedness and wraparound. What are some examples of observable differences if a program assumed a 32-bit width but then we changed it (say, to 16-bit or 64-bit)? I haven't double checked these:

---

perhaps we can say that wraparound semantics aren't provided, and as long as all of the values you are using stay in the following range:

unsigned: 0 thru 2^15-1
signed: -2^15 thru 2^15-1

then you won't detect a difference between the bitwidths.
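A quick illustration (Python; purely illustrative, not from boot.py): values kept within that range add identically at 16 and 32 bits, while values outside it make the bitwidth observable through wraparound.

    def add(a, b, bits):
        # modular addition at the given bitwidth
        return (a + b) & ((1 << bits) - 1)

    print(add(1000, 2000, 16), add(1000, 2000, 32))  # 3000 3000: in-range values agree
    print(add(40000, 40000, 16))   # 14464: 40000 is outside 0..2^15-1, wraps at 16 bits
    print(add(40000, 40000, 32))   # 80000: no wraparound at 32 bits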

This also suggests that we just define everything as signed in the range 0 thru 2^15-1, and don't make people worry about twos-complement.

We do need to provide a guaranteed way to detect overflow, however. For each arithmetic operation, go thru and see:

---

maybe no need to talk about 'BootS?' as a subset of Boot. Rather, you will have:

actually i changed my mind, how about using BootS? to talk about what i'm currently calling Boot:

Now you can use numbers to refer to profiles:

Boot0 compiles to BootS?0
Boot1 compiles to BootS?1
etc

Profile 0 could be tiny, static, pure:

Profile 1 could be what i currently call Boot

then some higher number could refer to all of the extensions needed for Oot. E.g. if 4 were that number, then

Boot4 compiles to BootS?4

and you would say that Ovm targets Boot4, and in order to implement it, you can implement Boot4 directly or you can just implement Boots4 and use the provided tools to compile/interpret Boot4 down to Boots4. Or, maybe better, whatever that number is is what 'Boot' without a number means.

So, we'll have a few reference documents:

The reason this wording is better is because the project as a whole should be called 'Boot', and it would be weird to have BootX? be the pinnacle of a project called Boot.

---

note: in Boot (what we used to call BootX?), we do need a separate F64_SIZE (rather than just saying it's the same as I64_SIZE)

---

could define a simple Boot calling convention:

btw, i think maybe it is best for syscalls to pass args in registers, because on some implementations the 'registers' are not physical and the syscalls will just be handled transparently by the underlying platform; in these cases it would be a waste for the program to do caller-saving. We should make one of the syscall operands be an immediate that passes information about which registers are in use, though, like we were planning to do with CALL.
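A hypothetical sketch of what that immediate could look like (Python; the bitmask layout, the register count, and the meaning of 'live' are inventions for illustration, not anything specified): a bitmask of live registers lets the implementation save only what the caller actually needs preserved.

    def live_registers(mask, nregs=8):
        # hypothetical: bit i of the syscall's immediate operand set means
        # register i is live across the call and must be preserved
        return [r for r in range(nregs) if mask & (1 << r)]

    print(live_registers(0b0001100))   # [2, 3]: only r2 and r3 are live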

--- old

renamed Boot -> Boots

--- old

removed from boots_design

Boot and BootS? have the same instruction set, opcodes, instruction semantics, and operand ranges, although their encoding format is different and BootS? supports address modes.

If the platform supports the required functionality (e.g. floating point, various I/O and IPC functionality, etc), an implementation of Boots can be incrementally extended to a full implementation of BootX?.

The rest of BootX? offers functionality that may not be readily available on every target platform.

Should I target Boot if I want to run something on a Boot platform? No, if you want to run something on a Boot platform, you should probably target the 'Small' profile of BootX?, "BootS?". BootS? programs can be compiled to Boot.

Boot is not designed to be directly targeted. Rather, Boot is designed to be easy to implement, so that the whole stack is easy to port. It is easier to compile a BootS? program efficiently than a Boot program. If you target BootS? instead of Boot, then if a BootS? implementation later becomes available on your platform, you can take advantage of the efficiency of BootS?. Also, the toolchain is written with the assumption that only BootS? will be targeted, and Boot is only produced as a result of the BootS? tooling.

The lowest level that is designed to be directly written by humans is Lo.

You should probably implement Boots first and then if you have time, extend it to Boot. Boots and Boot were designed so that an implementation of Boots can be incrementally extended into an implementation of Boot, and then to an implementation of BootX?. A direct implementation of BootS? will probably be able to compile or run programs much more efficiently than a direct implementation of Boot.

--- old


removed (immediate 0 now encodes displacement 0, not discard)

The only immediate value permitted to an operand to which the instruction does not read data but may write data is 0, and writes to an immediate-mode operand of 0 are discarded.

--- old

removed

Immediate mode, when the operand is of type int:

The 6 bits are treated as a twos-complement signed 6-bit integer. That is, 0 encodes 0, 1 encodes 1, ... 31 encodes 31, 32 encodes -32, 33 encodes -31, ..., 63 encodes -1.
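A one-line decode of that mapping (Python; just restating the rule above):

    def decode_imm6(x):
        # x is the raw 6-bit operand value, 0..63
        return x - 64 if x >= 32 else x   # 32..63 map to -32..-1

    assert decode_imm6(0) == 0 and decode_imm6(31) == 31
    assert decode_imm6(32) == -32 and decode_imm6(63) == -1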

Immediate constants

Integer: The operand data, interpreted as a signed 6-bit number using twos-complement (range -32..31).

TODO: probably more useful if some of these constants are multiplied by I32_SIZE and/or PTR_SIZE. How about: -16..15; I16_SIZE*-1..1 excluding 0; I32_SIZE*-8..7 excluding 0; PTR_SIZE*-8..7 excluding 0; 2^8, 2^16, 2^32, 2^64? x 2 (assumes that LABEL_SIZE and HANDLE_SIZE == PTR_SIZE)

##### Pointer

op in:

--- done

If you have n-bit immediate jumps, and k-bit immediate indices into a label constant table, then the binary size can be (n+k) bits

We also have the immediate-mode labels to work with in BootX?. Could create an assembler syntax to mark these, as well as having some implementation-dependent ones.

If Boot had labels, that would help with large binaries. I guess ultimately the assembler doesn't matter that much, though. I think maybe we should have load-constant instructions in Boot; this is more flexible than a constant table pointer. Also, maybe we could introduce an assembler syntax for a few global labels; that doesn't require two passes.

---