proj-oot-ootAssemblyNotes24

---

190529

we need to take a hard look at whether Boot is meeting its goals: i'm not sure that it's any easier to implement than RISC-V RV32I.

it provides other benefits:

it's better than WASM (no weird structured control flow), simpler than JVM, much simpler than CIL, simpler and lower level than Lua, and simpler than ARM (esp. in instruction encoding). But it may be more complicated than RISC-V. And RISC-V is more popular, so even if it were equal complexity, RISC-V should win.

i used to be scared of RISC-V's instruction encoding variants, but now we have a bunch of those too! And ours are (right now, at least) even less regular than theirs, and also theirs are more thought out and have a bunch of properties that they claim are important for simple, efficient hardware decoding:

Yes, their 32-bit and 64-bit platforms are different in more than just pointer size, which makes it harder for us to write code targeting both. But we already have an unknown pointer size in Boot; is it really that much trouble to have distinct 32-bit and 64-bit targets? And that trouble probably isn't enough to make it worth starting from scratch, toolchain-wise, with a new assembly language.

Yes, for our purposes, it's slightly simpler to include the necessary syscalls as opcodes. But again, not worth the complexity of introducing a new assembly language.

the only thing that i think might really be important is pointer opaqueness. I'm imagining interop situations where we are running inside a high-level platform like Python, and we want to interoperate with its data structures. In this context we don't have pointers, we have opaque references. We can store them, but not in integers. We can't do pointer arithmetic on them even if we want to. I'm imagining that we will be accessing struct fields (e.g. objects) by assigning a numerical offset to each field and then adding that offset to the base reference, which is why i include the operation of adding an int to a pointer. (now that i say this, it strikes me that even blt-ptr (bltp) is too much, ok i just took that out).
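a sketch of what this could look like on top of Python (all the names here are invented for illustration, not part of Boot): a 'pointer' is an opaque object-plus-field-number pair, the only legal arithmetic is adding an int, and loads go through the field numbering:

```python
# Hypothetical sketch of opaque Boot pointers over a Python host.
# A "pointer" is an (obj, offset) pair; the offset indexes into a
# numbered field list, so add-int-to-pointer is expressible but
# pointer+pointer arithmetic simply is not.

class OpaquePtr:
    def __init__(self, obj, offset=0):
        self.obj = obj          # host-level Python object (opaque)
        self.offset = offset    # integer field number

    def add_int(self, n):
        # the only arithmetic Boot allows on pointers
        return OpaquePtr(self.obj, self.offset + n)

def load(ptr, field_names):
    # field_names assigns the numerical offsets to the object's fields
    return getattr(ptr.obj, field_names[ptr.offset])
```

note that the reference never becomes an integer, so nothing here would break if the host moves the object around.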

in older versions of Boot we had systems that made the pointer size totally opaque; the program didn't even have to know it for memory layout b/c memory was accessed in terms of pointer-sized words. Should we go back to that?

In any case, we should at least work on our instruction encoding to make it more regular, more like RISC-V.

i think the takeaways are:

---

ok so thinking a little more, we need to go back a little to the more complex system where memory was a linear sequence of locations of opaque type and size. The reason is that the whole reason that we want totally opaque pointers is in order to use Boot code in situations like when we are running on top of Python; memory we allocate is Python arrays and we access Python objects by assigning a numbering to their fields and then treating the object reference as a pointer and adding integers to that pointer to specify the fields. The ability to do this is why Boot is more useful for interop than RISC-V.
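to make the 'memory we allocate is Python arrays' idea concrete, a minimal sketch, assuming allocation is backed by a Python list and with the handle type and function names invented here:

```python
# Hypothetical sketch of Boot-on-Python memory: allocation hands back an
# opaque handle into a Python list; loads and stores go through the
# handle, and the host never exposes a raw integer address.

class Handle:
    def __init__(self, cells, idx=0):
        self.cells = cells   # backing Python list (the "allocation")
        self.idx = idx       # position within it, moved only by add

    def add(self, n):
        # add-int-to-pointer
        return Handle(self.cells, self.idx + n)

def alloc(n_locs):
    return Handle([0] * n_locs)

def store(h, value):
    h.cells[h.idx] = value

def load(h):
    return h.cells[h.idx]
```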

so some considerations:

So, TODO:

---

so far i'm leaning towards/TODO:

---

if our memory locations could literally even be references to consecutive fields in a C struct, then some of those C struct fields can hold i32s, some can't, some can hold pointers, some can't, and this changes for different types of structs. So in that case, either you have to check the size of things at every single location, or the program has to have some knowledge at compile-time.

And if you are relying on the program to have compile-time knowledge about what fits where, then is there anything left to be done? And in that case, are we really any better off than RISC-V?

Perhaps we want the Boot program to have the POSSIBILITY of doing low-level interfacing like this, but to also have a sort of 'general case' memory type that it can do internal computations in, which is more predictable?

here's a few cases to consider:

If we rule out the idiosyncratic memories then things probably become more tractable. This means that memory locations never refer to things like references to consecutive fields in a C struct -- you would treat such things as fields of bytes, instead. But in this case how would we interop with structs on top of a typed HLL, like Haskell?

---

so, i think that direct interop with external data structure layout is always going to require either program knowledge, or a debilitating amount of runtime introspection. When we say that we want Boot programs to be universal, all that we can really ask for is the existence of uniform memory for internal data structures -- so e.g. if a hash map library is written in Boot, we want to be able to re-use that same code when running on top of Haskell, on top of Python, or on top of raw assembly. Now again, we can do this at compile time (have constants for the size of ptrs, the size of i32s, the size of i16s), or we can do this at runtime (introspection/sysinfo, possibly additional instructions such as addsizeof to make it easier to iterate through memory, although we still need to do other stuff when we need to e.g. compute how many memory locations an array of 100 ptrs takes up).
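the compile-time-constants option could look something like this (the SIZEOF table and function names are illustrative, with sizes for a hypothetical 64-bit byte-addressed target):

```python
# Sketch of the 'compile-time constants' option: the toolchain bakes
# in per-target sizes, and portable code (like a hash-map library)
# computes layouts from the constants instead of assuming a pointer
# size.

SIZEOF = {"i16": 2, "i32": 4, "ptr": 8}   # e.g. a 64-bit target

def array_bytes(elem_type, count):
    # how many bytes an array of `count` elements occupies
    # (e.g. the "array of 100 ptrs" case from the note)
    return SIZEOF[elem_type] * count
```

the runtime alternative would instead fetch the same table from a sysinfo-style instruction at startup; the layout code downstream would be identical.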

so, can we/should we just use RISC-V for this? maybe... the semantics we are using are pretty different (pointer arithmetic is invalid) but that may not be a problem. should think further. my instinct is no, the semantics are so different that you don't gain much by using it, so just make something new, it'll confuse ppl less anyways. but that may just be Not Invented Here syndrome.

---

okay i've thought about this a little today, i think it's still worth doing, and here's why:

the other option we are considering is just using RISC-V, but adapted for our case.

this means just specifying that pointers are opaque.

so what happens when you load a pointer? what instruction do you use? We need to add a 'load pointer' instruction, and a 'store pointer' instruction.

and what happens when you add an int to a pointer? what instruction do you use? You could just use 'add'. But then the VM has to check what type is in the registers each time it encounters an 'add', in order to know whether we are adding integers, or adding one pointer and one integer. it would be better to have an add-int-ptr, like Boot.
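the cost difference is easy to see in a toy VM sketch (the tagging scheme, with a tuple standing in for a pointer, is invented here):

```python
# With one generic 'add', the VM must inspect operand types on every
# execution to see whether this is int+int or ptr+int; dedicated
# opcodes make the distinction static, so no per-instruction check.

def generic_add(a, b):
    # runtime check on every 'add' -- the cost of reusing RISC-V's add
    if isinstance(a, tuple):        # (base, offset) stands in for a pointer
        return (a[0], a[1] + b)
    return a + b

def add_int_ptr(ptr, n):
    # dedicated opcode: operand roles are known from the opcode itself
    return (ptr[0], ptr[1] + n)

def add_int(a, b):
    return a + b
```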

what happens if you try to do pointer arithmetic? it fails at runtime.

So, we need at least some new instructions (load and store pointer) and could really use some others (add int to pointer). That's already probably enough to make us think about a new language. And pointer registers seem to fit really well too.

we could always do an extension to RISC-V tho...

and there are other minor reasons to go with a new language. We're going to have to redefine the RISC-V semantics a little (no pointer arithmetic), and we're going to have to add the syscall fns. It's more confusing to have an overlay like this than just to make our own new thing.

---

can we make better immediate formats? actually i don't buy RISC-V's idea of splitting the immediates to keep the sign bit in the most-significant-bit. I think it's simpler to keep them together.

---

this 'BootX? should have no runtime requirement over Boot' is getting too expensive. I put in a special 32-to-8 and 8-to-32 cpy (mov) instruction (with its own instruction format) so we could access all 32 BootX? registers from Boot. And i put in extra cpfis, cptis, cpfps, cptps (cpy from int stack, cpy to int stack, cp from pointer stack, cpy to pointer stack) instructions so that we could access deep into the stack without pushing and popping. And i kept the intstack and ptrstack at capacity 8 so that those instructions can access all of it.

But this is getting ridiculous. I'm going to take all that out and assume that BootX? will need extra state and can store it in allocated memory.

---

i used to have this:

Five formats. For each format, from LSB to MSB, the fields are:

In graphical form (from MSB to LSB):

but i'm going to change it to consolidate i5, i8, i12 into one format with a 12-bit signed immediate and a 3-bit opcode. The instructions can then add in the other operands themselves. This is slightly less efficient (now JPC only gets an 11-bit signed immediate; also the encoding logic has to split up the immediate), but it's slightly simpler/less scary (b/c 2 fewer formats), and it has the property that RISC-V likes that the sign bit of the immediate is in the same place everywhere.
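a sketch of one such consolidated I-format, assuming a 3-bit opcode and a 12-bit signed immediate within a 16-bit word (the exact bit positions are my assumption, and the length-format bit is omitted):

```python
# Encode/decode for a single consolidated I-format: 3 opcode bits on
# top of a 12-bit signed immediate field. Instructions that used to be
# i5 or i8 would carve their register operands out of the immediate
# themselves.

def encode_i(opcode, imm12):
    assert -2048 <= imm12 < 2048
    return (opcode & 0x7) << 12 | (imm12 & 0xFFF)

def decode_i(word):
    opcode = (word >> 12) & 0x7
    imm = word & 0xFFF
    if imm & 0x800:           # sign-extend the 12-bit immediate
        imm -= 0x1000
    return opcode, imm
```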

i saved the former one in boot_old/pyboot_190529_i5_i8_i12_encoding/encoding.py

later that day:

ok i did that but i think it's more complicated, not less.

i saved it in

../boot_old/pyboot_190530_r_i_only_encoding

Also i reread the RISC-V comments about keeping the sign bit in one place and i think they are talking about when they have branch offset immediates in units of 2; rather than just multiplying the immediate by 2 they do some complicated thing where they swap around the location of bits, but they leave the sign bit in one place, and i think that's what they were talking about.

Regarding keeping the registers in one place, since in the immediate formats some registers are being used as immediates, i think it's fine for those to move, which is what the old format did.

Regarding the simplicity of having only R- and I- formats, i found it was cleaner to add in R2 R1 R0 subformats anyways, and i think the complexity of juggling the immediate bits outweighs not having I5 I8 I12 formats.

So forget about swapping around the sign bit, that's stupid.

But i like having a fixed number of opcode bits for the Is, even at the expense of reducing JPC range by 2x -- it's probably much less efficient in terms of code density, but it's cleaner. So let's do:

---

ok i've mostly implemented Boot now but i havent finished testing it yet. It took me much of yesterday to write the initial encode/decode logic, then today I spent about half a day changing the encoding, and the other half writing the actual VM, and the first draft of the VM is done now but i've only tested the first few instructions. So i think it's likely that this is a two-day job in total (which i guess means a two-weekend job for hobbyists) (although, even though someone else wouldn't have to spend time changing the design of Boot itself, they would have to spend more time reading and understanding the spec i wrote, so maybe it really is a three-day job).

Anyhow, the point is, unlike the previous attempt, this is a success, in that:

it's sufficiently easy to implement!!!

(although i dunno if i'd call it 'dead simple'; i could have left out the I encodings b/c 32-bits is bigger anyways, i could have left out the rtwo rone rzero/MORE instructions, i could have chosen either a register or a stack machine rather than mixing them, i could have had a variable-length encoding that just had whatever data was needed after a one-byte opcode).

However i noticed that i now think that BootX?, as i currently envision it, wouldn't really be that much more complicated to implement, at least not if you don't include all the instructions. I think much of the complexity that i ran into before came from complicated ideas for the addressing modes etc, for example, having 'stack-aware' instructions that take an operand of $S to mean, not auto-push-pop to/from the stack, but rather, do something to the stack itself. Also my CALL3 idea led to difficulty in terms of how the return argument from the instruction being called would get assigned to the register indicated in the destination operand (although as i think about it at this moment, i don't see the problem; just have the RET instruction look at the jump target and if it's right after a CALL3 instruction then look at that instruction to get the return register).

Anyway, so i've simplified the addressing modes, and i don't think this would be a problem anymore. In fact, the encoding might even be easier, since with bigger operands the need for immediate encoding formats is not so important (with a 32-bit encoding you wouldn't need those immediate formats so encoding/decoding gets simpler). Also, now that Boot and BootX? have different numbers of registers, the story of incrementally extending Boot to BootX? is not so good as it was. And because the 16-bit encoding was so cramped (only +-1KiB? direct jumps), i felt the need to let a constant table creep in to Boot, which is complexity we wouldn't need with BootX? b/c we could have a 21-bit immediate constant.

And there was other stuff i was dreaming about with BootX?, about having it be a good emulator and having all these hooks and having capabilities and having polymorphic functions, but now i think that that belongs only in OVM if anywhere. BootX? is just a lowest-common-denominator target platform to make it easier for me to port everything on top of it.

So what about just calling the thing we're calling BootX?, "Boot", and having "Boot0" just be a subset of Boot's instructions (also Boot0 would only use the 32-bit instruction encoding, whereas Boot could have more)?

To be clear, what i currently mean by "Boot" (was: BootX?) is:

32-bit fixed length encoding: 3 encoding length format bits, 6 opcode bits, 2 opcode addr mode bits, 3 x (5 operand bits, 2 addr mode bits)
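the field layout above can be pulled apart mechanically; a decode/encode sketch (LSB-first field order is my assumption):

```python
# Decode/encode for the 32-bit format: 3 encoding length format bits,
# 6 opcode bits, 2 opcode addr-mode bits, then three operands of
# 5 bits each with 2 addr-mode bits apiece (3+6+2+3*7 = 32 bits).

def decode32(word):
    fields = {}
    fields["fmt"] = word & 0x7;          word >>= 3
    fields["opcode"] = word & 0x3F;      word >>= 6
    fields["op_mode"] = word & 0x3;      word >>= 2
    for i in range(3):
        fields[f"operand{i}"] = word & 0x1F;  word >>= 5
        fields[f"mode{i}"] = word & 0x3;      word >>= 2
    return fields

def encode32(fmt, opcode, op_mode, operands):
    # operands: list of three (reg, mode) pairs
    word = fmt | opcode << 3 | op_mode << 9
    shift = 11
    for reg, mode in operands:
        word |= reg << shift | mode << (shift + 5)
        shift += 7
    return word
```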

Boot0 would have the addressing modes (this removes the necessity for us to provide immediate-mode instruction variants like ADDI that are only needed b/c we don't have an immediate constant addr mode), but might not have the instructions supporting any types beyond int and ptr. Boot0 would have the 32x2 registers, plus 32x2 stack items.

to be clear, i am really excited that the 16-bit encoding of Boot seems to be really complete and dense, possibly even denser than anything else out there. And working with those constraints did help me hone my ideas about what is really essential and fundamental at this level of the stack of intermediate languages. But now i have those ideas honed, and the primary remaining goal for Boot is ultimately just to be an easy-to-implement, easy-to-use, portable target, not to win contests about code density and beautiful compactness. If the 32-bit format is easier to compile to than the 16-bit format (b/c of the greater number of registers), and not tons harder to implement, and if we'd have to compile Boot-32 to a Boot-16-only machine (due to the greater number of registers in Boot-32), rather than easily extend the Boot-16 machine, then we should just start out implementing the 32-bit one, and consider the 16-bit format just an optimization.

---

if we start out with 32-bits though i'm just bothered by having so much state. If there's 32 registers, times two banks, and 16 stack locs, times two stacks, on a 64-bit platform that's:

32*4 + 32*8 + 16*4 + 16*8 = 576 bytes of state

so how about just going back to 8 addressing modes. Now we have:

16*4 + 16*8 + 16*4 + 16*8 = 384 bytes of state

Still a ton but not quite as excessive.

First three registers are still 0, stack, tos, and still have their special behavior (this is a little redundant, but it makes them more useful in the other addr modes, and makes everything more uniform). Except i had a new idea: there should be a 'zero page' of pre-allocated memory, and the zero register in the pointer register bank should point to it, rather than be the PC (for the PC, there is a PC-relative addr mode, see below).

Now we have:

3 encoding length format bits, 4 opcode bits, 3 opcode addr mode bits, 1 opcode bit to control whether or not addr modes are in use or are just more opcodes, 3 x (4 operand bits, 3 addr mode bits)

(16*8 + 16) opcodes = 144

addr modes (note that some of these have two forms, int and ptr, for the two 'banks'):

Should &0 hold a null pointer? Or maybe some platforms won't let us have a null pointer? Let's see:

so... i guess this doesn't decide it? Sounds like on most systems, there is a null pointer so we are good, but there isn't always, but otoh, even on systems where there isn't, we could always allocate one byte somewhere and then declare that to be OUR null pointer.

Regarding which instructions take forced immediate operands (which is important/special because then the 3 addressing mode bits are part of that immediate), we could just start by saying that only that weird bank of 16 opcodes in the 'immediate' mode do this (or, do we want to use 'immediate' mode as a CALL3 to nearby PC-relative locations? i don't think that's very useful...).

---

i dunno, now that i look at all that, i think it does make it more complex after all:

soo... let's just have the encoding so that we can support the addr modes later, but still separate 'Boot' and 'BootX?'. BootX? is still the better name so that programmers won't be intimidated and think they have to do all the addr mode stuff before they can say they implemented the 'real' Boot.

Also, on that note, since we are focusing on simplicity and not code density now, maybe choose between register and stack machine? Maybe choose register machine and just say 'don't use registers $1 or $2 or &1 or &2, they are reserved for BootX?', and then provide special instructions to push and pop from the stacks? No, actually this is an issue, because now compiling BootX? to Boot means the runtime needs to reserve some registers to temporarily store stuff that should be on the stack in BootX?. If we support the stacks, then the whole point is that this sort of thing only requires us to increase the stack capacity instead, which is invisible to the instruction encoding.

i guess another thing i am still thinking about is: if we don't allow stack mode (stack as registers) in Boot, then ppl might implement the stack in a way that does not allow register access to it (if there are no other places available to temporarily store 16 stack locs when we have to pop/push to get at the deeper items). I guess we can solve that by requiring support for Boot instructions that do that. -- ok i added lsi lsp ssi ssp to remind me to do that.

also we should specify that Boot has a few more than 16 locs on each stack, so that the BootX? implementation has somewhere to work.

we should also probably have a pushpc command.

---

So if we are going to have 128 opcodes, does that fit all the other data types that we need?

Let's see, here's what we have right now:

ann lh lhu lb lbu sw sh sb sp add addi sub mul sll sru srs and or xor ap bne blt jd cas fence appi apwi reserved_r_27 reserved_r_28 reserved_r_29 reserved_r_30 rtwo

sysinfo cp cpp lsi lsp ssi ssp rone

halt lkp rzero

break bad1

  1. i5 lw lp beq beqp
  2. i8 li sai
  3. i11 jk jpc

total so far: 49

we gotta add long, float, double variants of stuff, and conversions, and division:

ll ld lf sl sd sf addl addd addf subl subd subf mull muld mulf slll srul srsl andl orl xorl beql bltl beqf bltf beqd bltd (note, these float/double branches use f.p. total ordering, not floating-point comparison) lsl lsf lsd ssl ssf ssd apl l2i i2l f2i i2f d2l l2d f2d d2f i2lu l2iu f2iu i2fu (we could also convert directly between i and d, which would be 4 more) divf divd divi divl remi reml cmplt fcmpeq fcmplt fneg fsqrt, fcopysign, fmin, fmax fabs fceil floor ftrunc fnearest fpow fisfinite fisinf fisnan

that's 69 right there.

49 + 69 = 118

we also wanted: pushpc

so that's 119.

also we actually will get a few more opcodes than this because we haven't done our MORE/rtwo trick yet. So we have a few more spots for read, write, poll, etc.

so we can make it!

---

how many extra stack locs should we allocate for Boot to ensure that BootX? can compile to Boot and have 16 stack locs left for BootX? user programs?

we need at least 1 for doing addr mode computations (i think we only need one, assuming we have SWAP available for each stack; so i just added swapsw and swapsp).

i don't think we need 2 for CALL3; the input args go onto the ordinary stacks and take up space

so i think we really only need 1. Of course we may want to add some instructions later that need some, too. Such as software floating point. That sounds like a beast, so who knows how many it needs. Also, doubles will take up two slots on the int stack.

so let's say 6. 3 sounds universal (i tried to look up how many you actually need for universality but i couldn't immediately find it). 6 is 3 x doubles. So that's 22 stack locs. So on a 64-bit system, we're wasting 6*4 + 6*8 = 72 bytes per VM state to allow for implementation temporaries. Not cheap, but what can ya do. We can reduce it later if we find we don't need it.

---

if the 16-bit encoding does end up being for efficiency only, maybe it would be time to more seriously consider things like RISC-V's idea with their compressed encoding that you should prioritize using the first few registers, but also duplicating operands (op0 and op2 become identical? but mb not because op2 is often the immediate. So mb op0 and op1). So far with the 16-bit encoding we have 3-address-code (3 operands) and 8 regs throughout, mb it would be good to have encoding format options for this (spend another bit?). If it's just for compression, we won't need all the opcodes anyways.

---

i guess an obvious idea is to get rid of some of the large immediates, to make room for a subencoding in which you have 4 bits per operand, but only 2 operands instead of 3:

1 length encoding bit, 1 R-format identifier bit, then:

note that this would eat up 4 more R format opcodes (because the I format opcode is now only 2 bits, not 3)

also an interesting thing is that, with 4 operand bits, now the 'rtwo' instruction can access 16 instructions instead of 8! so we have more instructions to work with.

---

maybe we should make Boot that revised 16-bit format?

it's not THAT much more complex than dealing with the 32-bit format, and it's half the size.

otoh if we really want the 16-bit format to be a COMPRESSED encoding, we might want to use a different set of instructions later to compensate for the lack of addr modes (e.g. some instructions with extra immediate fields).

otoh that would make decoding and analysis more complex b/c now the opcodes wouldnt match between the 16-bit and 32-bit format.

otoh they don't quite match anyways because of the I format opcodes? i guess we could just put those at the end, though.

---

yeah, y'know, if we made Boot the 32-bit encoding there are actually a few probs that make it just as complex:

so actually, just stick with the 16-bit encoding. Yay, i don't have to throw out that work.

---

so what would the UNPACKED encoding look like? By which i mean, separate conceptual fields are separate, each field is at least 8-bits and is a power-of-two data type, and each field is large enough to hold its largest possible conceptual value? That is, like the C struct that a non-peephole, not-too-fancy code optimizer would probably have.

If we relax the 'large enough to hold its largest possible conceptual value' requirement then we fit nicely into 64-bits:

So, 8 bytes if we don't include the encoding length format bits. Of course, the optimizer's struct would probably add its own metadata, like at what position in the instruction stream is this (or possibly just a (probably 64-bit) pointer to that position).

But with the 'large enough to hold its largest possible conceptual value' requirement, operand 0 could hold a 21-bit value, operand 1 could hold a 14-bit value, and operand 2 could hold a 7-bit value, so:

So now we're at 12 bytes (without encoding bits).
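one hypothetical layout consistent with the 12-byte arithmetic above, using Python's struct module: operand 0 widened to a u32 (holds its 21-bit value), operand 1 to a u16 (14 bits), operand 2 to a u8 (7 bits), plus a u16 opcode and a u8 addr-mode byte. Only the operand widths come from the note; the opcode/mode field choices and the field order are my guess.

```python
# An unpacked, byte-aligned instruction record: each conceptual field
# is its own power-of-two-sized slot, padded out to 12 bytes.
import struct

# little-endian, no implicit alignment:
#   u16 opcode, u8 addr modes, u32 operand0, u16 operand1, u8 operand2,
#   2 pad bytes
UNPACKED = struct.Struct("<HBIHBxx")
```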

---

so what would OVM look like?

the first key choice is: is this really a traditional VM (with linear bytecode), or some kind of AST with treewalking? (or something else)

the second key choice if this has linear bytecode is: do we want to support e.g. 64k local variables at this point, or do we still have more draconian 'limits', e.g. limit the number of variables to 2^12 (4k).

I think that whatever limits we have for the OVM should be the same in Oot itself, because the OVM implementation is supposed to take care of all of the annoying 'administrative stuff', and register allocation fits within that. That is to say, the layers above OVM (like the translation from Oot Core to OVM) shouldn't have to do any register allocation.

The argument for 64k limits is that, while 4k seems like a lot, if you think about it there are a few times when it is exceeded (such as # of symbols in a module). And we imagine that intermediate representations used by an optimizer might want to exceed it even more. Also sometimes machine-generated code will exceed this, which will annoy Oot users. Also, with 64k limits the 'unpacked' representation can be almost the same as the 'packed' representation.

The argument for 4k limits is that it makes OVM instructions 16 bits shorter.

if we have 64k limits then we still have to decide on the limits for immediate formats that squash multiple operands together; should the maximum immediate operand be 32 bits or 64 bits?

if we have 64k limits with a 64-bit maximum immediate operand then we're looking at:

total: 20 or 21 bytes (w/o vs w encoding length format) note: the opcode and operands themselves are exactly 16 bytes here

if we have 64k limits with a 32-bit maximum immediate we have:

total: 16 or 17 bytes note: the opcode and operands themselves are exactly 12 bytes here

so the maximum 32-bit immediate is pretty attractive in this case, as it allows us to fit the unpacked format in exactly 2 64-bit aligned fields. I'm leaning towards that.

This means that Oot code that wants to give a 64-bit immediate operand will be translated into 2 OVM instructions, and also that (most of) OVM's optimizations won't be able to 'see into' that.

Does that mean we can't have floating-point constant folding for f64? No, we can still have instructions representing floating-point constants of f64 with 32-bit immediate operands; it just means that if a floating-point constant is greater than what can fit in an f32, then constant folding will stop at that point. For the few people for whom this is important, fancy optimizers can use a different representation, but we want to choose our formats based on making unfancy optimization better.

note that there is no space for encoding length format bits here. So this isn't a BootX? LARGE format, this is a totally different representation.

So the candidate OVM instruction representation is:

---

having duplicate-operand format in the 16-bit encoding brings up trouble for 'CP' if we want to copy in R format. Those are very important, so might have to burn a 3-operand opcode here? Alternately, could get rid of the rule that the opcodes are the same for R- and O-format. Same for:

so, i guess we should actually just have two sets of opcodes. oh well.

so we need to find space for 6 more instructions. what requires 3 operands? cas. beq beqp blt require 3 operands but we'd like to replace those with skips. can also get rid of fence ann b/c we only need one of them. can get rid of bne b/c you don't need both bne and beq. So getting rid of 4 (cas bne fence ann) and adding 6 (and changing 3 to skips: beq beqp blt). So we need to find two more to get rid of.

our i8s can't apply to the last 16 registers, so we're already without li. If you add that, you either have to take away one of the i8s or i5s (e.g. jk or sai), or you have to take away a whole bunch of instructions.

ugh more complexity.. won't this make this platform hard to target? not necessarily; could just tell the compiler that only 8 GPRs are available, and also 8 other 'local variables'. Or even just tell the compiler only that only 8 registers are available.

this suggests that we get rid of computational operations for our other two. e.g. mul, shifts, logical. I think mul and shifts and logical are the least used operations. I think amongst those, mul is the least, xor is the most, shl is sometimes used a lot. So srl sra and or are the least. I think srl and and are used more than sra and or. Sub is pretty unpopular too, and we already have signed ADD...

another route is to say we don't really need those immediate variants like addi appi apwi, so get rid of those. But those are useful for iterating, and a 'storage-only register' sounds like it could make a good 'index register' because you'd park pointers there (for similar reasons i like turning the branches to skips rather than getting rid of them). Could also get rid of some of the smaller bitwidth load/store variants.

in dynamic frequency counts the arithmetic/shifts/logical are less than those things tho. So let's get rid of two of these three: sra or sub. Choose sub b/c it's inessential (this sub can only mutate the source operand, remember; also it can't subtract from pointers). Mb choose OR.

i dunno man. we're basically saying that a code generator from BootX? to Boot has to replace some instructions with sequences of instructions if they use the wrong register. So beq $10 $11 #3 becomes cpy $s $10, cpy $s $11, beq $s $s #3. sub $10 $11 $12 becomes cpy $s $11, cpy $s $12, sub $s $s $s, cpy $10 $s. The skips are optimizations that a code generator doesn't have to use. I guess we're having to replace single instructions with sequences anyways in the case of addr modes. So i guess this is fine.
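the register-rewriting pass sketched above could look like this (the textual instruction form is illustrative, and details like immediates and destination copy-back for instructions like sub are omitted for brevity):

```python
# Sketch of the BootX->Boot rewrite: when an instruction names a
# register outside the low bank, route that operand through the stack
# register $s by inserting a cpy first.

LOW_REGS = set(range(8))

def expand(op, args):
    # args are integer register numbers; returns a list of instruction
    # strings, copying high registers through $s as needed
    out = []
    new_args = []
    for a in args:
        if a in LOW_REGS:
            new_args.append(f"${a}")
        else:
            out.append(f"cpy $s ${a}")
            new_args.append("$s")
    out.append(f"{op} " + " ".join(new_args))
    return out
```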

Which reminds me; each instruction has 4 addr modes, not 1, so might we need more than 6 stack temporary spaces? Let's see; the worst one is probably CALL3 with all addr modes. To make the call at the end, we'll need two slots for the arguments, plus one for the address; in addition we need temporaries for calculation. But we can do these calculations in sequence. So 6 is probably enough; 3 for the call address and two arguments to the call, and then 3 more for intermediate calculation of the operands.

---

a more serious concern: the R-format variants of instructions can't be handled by just a uniform wrapper function, because sometimes when there is an immediate (as in the loads, where there is an immediate offset) the immediate should be set to zero, whereas other times (such as in ADDI), the immediate is still present and the other two arguments are unified.

so now this is getting more complicated than just a variant of the 32-bit encoding.

so i guess we should just provide the cp cpp lsi ssi lsp ssp in R-format, and leave the rest out of Boot. Most of the R-format becomes just a BootX? optimization.

Y'know, that actually makes it much simpler to implement in x86_64 assembly using my only-lowest-8-regs-in-registers idea.

---

one remaining issue is that there is nothing to force the Boot implementor to make room for 64-bit registers anywhere, so they might still have trouble moving up to BootX? in that sense. Should we make them make room, so that they don't have to add state when moving to BootX?, or do we not care (b/c the 64-bit registers are conceptually sort of separate anyhow), or do we NOT want to make them mix, b/c we don't want to make ppl deal with telling foreign backend optimizers such as LLVM about how when you use int64 register X you also occupy int32 register Y?

one option is to take the middle ground: pretend that the registers become 64-bit and can hold floating-point. Advantages:

the disadvantages are:

i think it's worth it. Seems like the simplest thing.

one other issue though: if a platform does use separate FP registers (as i expect that most will), then an integer write to a register currently holding an fp value won't actually erase it, which may present security concerns. We'd have to specify that, but even so, compilers probably won't bother (and this complicates the semantics).

but, the same issue applies to implementations that have actual separate 64-bit registers.

also, even if in theory we have less state with shared Boot registers, on platforms with separate fp register banks this actually makes implementation more complicated, because the implementation has to remember whether each register currently contains an FP value or not, and check that each time the register is read from.

hmm...

i dunno man, let's have a separate fp register bank then. That's what most CPUs do anyways. We can always implement it as a register bank in memory on 32-bit processors. It makes implementation simpler, even if it does increase state.

now what about 64-bit registers?

i guess most implementations will not have separate banks of them. So we'll keep them in the registers like we said before.

---

another issue is jump ranges. code expands as we compile from BootX? to Boot, and it's possible that something that starts out in jump range could end up outside of it. I haven't computed the maximal amount of expansion, so we don't have a simple way of being sure that our BootX? jump ranges are compilable.

one thing to note: if Boot is 16-bit and BootX? is 32-bit, then because the conditional branch max offset is larger in BootX? than in Boot, sometimes a conditional branch directly to code will have to be replaced by a branch to a jump to the code, which inflates code a little bit (at least that jump, plus the jump for the code that used to be just before the code being moved). If they are both 32-bit then no problem.

i think this argues for a 32-bit Boot, to make compilation simpler; although the problem really isn't solved either way, because we still have to deal with the fact that each instruction could be expanded into many instructions, so either we have to greatly restrict jump ranges in BootX?, or we have to deal with this in the compiler anyway.

And compiling from BootX? to Boot is supposed to be a program written once by us, not written every time by every porter.
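The relaxation problem above can be sketched in Python. This is an illustrative model only (the mnemonics, range, and instruction tuples are all made up): when a conditional branch target falls out of range after code expansion, the branch is rewritten as an inverted short branch that skips over an unconditional long jump (a trampoline).

```python
# Illustrative sketch of branch relaxation: if expansion pushes a target
# out of the conditional-branch range, branch over a long jump instead.
# All mnemonics and the range value are hypothetical, not Boot's actual ones.

BRANCH_RANGE = 64  # pretend conditional branches reach +/-64 instructions


def relax(branch_at, target):
    """Return the instruction sequence for a conditional branch to
    `target`, inserting a trampoline when the target is out of range."""
    if abs(target - branch_at) <= BRANCH_RANGE:
        return [("bcc", target)]                # direct branch, in range
    # out of range: invert the condition to skip over a long jump
    return [("bcc_inverted", branch_at + 2),    # skip the trampoline if not taken
            ("jump_long", target)]              # trampoline to the real target
```

Note how the out-of-range case inflates the code by one instruction, which is exactly why expansion can cascade: inserting the trampoline may push yet another branch out of range.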

---

in my head i've been thinking about an amd64 (x86_64) implementation and i should set it down in writing:

the lowest 8 registers of each register bank are stored in the 16 x86_64 registers. The upper registers, and the stacks, are stored in memory. The special registers are stored as follows:

$0: doesn't need to be stored. The corresponding platform register is available for compiler/Boot implementation use.
$s: stores the current intstack depth (as an integer)
$t: stores the intstack TOS

&0: holds a reference to the global page
&s: stores the current ptrstack depth (as an integer)
&t: stores the ptrstack TOS

Pointers to the stacks themselves, and pointers to the upper register banks and floating-point register bank, don't need to be stored, because these stacks and register banks are found at fixed offsets from the global page pointer &0. This is possible because all of the global page, the stacks, and the register banks are of fixed sizes. The stacks are ring buffers (circular buffers) of size 15 (each TOS is stored only in its TOS register).

If the compiler/implementation needs 2 more registers for its own use, it can choose to not bother caching the stack TOSs.
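The stack scheme above (15-slot ring buffer in memory, TOS cached in a register) can be modeled in Python. This is a hypothetical sketch, not the actual implementation; the class and field names are illustrative, with `depth` and `tos` standing in for the $s/$t (or &s/&t) registers.

```python
STACK_SLOTS = 15  # ring-buffer slots in memory; the TOS lives in a register


class RingStack:
    """Sketch of the scheme described above: the top-of-stack value is
    cached in a dedicated register ($t or &t) while the remaining 15
    slots live in a fixed-size circular buffer in memory."""

    def __init__(self):
        self.buf = [0] * STACK_SLOTS  # memory-resident ring buffer
        self.depth = 0                # what $s / &s would hold
        self.tos = 0                  # what $t / &t would hold

    def push(self, value):
        # spill the old TOS into the ring, then cache the new value
        self.buf[self.depth % STACK_SLOTS] = self.tos
        self.depth += 1
        self.tos = value

    def pop(self):
        value = self.tos
        self.depth -= 1
        self.tos = self.buf[self.depth % STACK_SLOTS]
        return value
```

Because the buffer and register banks are all fixed-size, they can sit at fixed offsets from &0, as noted above; older entries are silently overwritten once the ring wraps.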

---

so currently still aiming at a 16-bit Boot and a 32-bit BootX?. Why not a 32-bit Boot?

---

i guess if we said that any BootX? instruction could be compiled to 8 Boot instructions, then we could reduce the jump range for BootX? by 3 bits (from signed 11 to 8). But i think it might take more than 8 in some cases; if some of the addressing modes take 2 instructions to load, and we're using CALL3, then that's 4 things that need 2 instructions, and then we have the actual instruction at the end, which would be 9 total. Otoh if all of the addressing modes take only 1 instruction to load then we can do it under 8. So let's say 3 or 4 bits.

So if we limit jumps to 11 - 4 = 7 bits signed (6 bits of magnitude), that's a jump range of only +-64. Seems like we'd definitely need to use longjump (jk) functionality for even small programs.

So maybe instead of having a jump range restriction we should mandate long jump functionality, and let the BootX? compiler decide what is a long jump and what is a short jump?

Which suggests that we specify how the long jumps are encoded in the binary. I think the simplest thing is to put a jump table close to the beginning of the program. How about this: we leave one instruction at the beginning for a 'jump' instruction to the real code, then there's a jump table. Actually, better: the first entry in the jump table specifies the actual entry point; the VM starts execution at that code location without having to dispatch an initial jump.

actually, let's make the second entry in the table specify the entry point; that way the constant pointer table is the same as the jump table, and the first entry in the constant pointer table can be a pointer to the global page.
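The header layout just described could look like the following sketch. Everything here is hypothetical (names, values, and the flat-list file model are illustrative): entry 0 of the table is the global page pointer, entry 1 is the entry point, and later entries are long-jump targets.

```python
# Hypothetical sketch of the proposed file layout: the program begins
# with one table that serves as both the constant pointer table and the
# jump table. Entry 0 points at the global page; entry 1 is the entry
# point, so the VM starts there without dispatching an initial jump.


def load_program(table):
    global_page_ptr = table[0]      # first entry: pointer to the global page
    entry_point = table[1]          # second entry: initial PC
    long_jump_targets = table[2:]   # remaining entries: long-jump targets
    return global_page_ptr, entry_point, long_jump_targets


# usage sketch: global page at 0x1000, execution starts at code offset 8
gp, pc, jumps = load_program([0x1000, 8, 24, 40])
```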

---

we don't need duplicates of one- and two- operand instructions over both the P-type and R-type Boot 16-bit instructions; e.g. we only need 'cp' in one of them

---

(later: copied to ootOvmNotes1)

so i think some stuff that OVM should implement is:

todo: should also read more about the OOP parts of JVM and CIL
todo: should also read about Cmm/C-- runtime support
todo: update Oot todos

as you can see, the purpose of OVM is coming into focus:

---

Three kinds of I/o: shared memory (even if writeonly or read-only on each side), msgs, streams

---

some old notes:

I guess vanilla Boot should have the two banks of eight registers, so that we can easily compile to platform assembly on, for example, x86_64

We need a 'library' opcode that, rather than having a fixed set of functions, calls a given function pointer using the platform ABI

---

why not use the following existing platforms:

RISC-V looks pretty simple. The encoding is slightly more complicated than we would like, but that's okay. There's only a few issues with it:

---

zig has facilities for making dealing with errors easier, without using exceptions.

zig has facilities to replace C's preprocessor with something that compiles faster etc

---

yknow, although it's good to finish the 16-bit Boot format for learning (to make sure everything can fit), ultimately i think even the little bit of encode/decode complexity in there is too much for it to be 'dead simple'. We should ultimately change to a 'mostly unpacked' 32-bit format:

16-bit operands (i11 format) will still have to be represented using more than one of the op fields.

Note that there is no format length encoding bit(s) so this format is not binary compatible with BootX? at all.

Although i guess we could smuggle some format length encoding bits into the opcode field? Not sure what the point is though, as i can't see this ever getting mixed with the BootX? formats in the same bytestream (in the sense that one instruction is 32-bit BootX?, the next instruction is this format) for any reason.

Note that the 16-bit limits will remain (only 8 registers except for cp and l/ss* cmds, immediate range of 8, etc)

PC offsets should be unchanged from the 16-bit encoding, which means that PC offsets should be multiplied by 2 (if interpreted in terms of bytes, b/c now instrs are 4 bytes instead of 2), or divided by 2 (if interpreted in terms of instructions)

the 2 I11-type instructions should still be treated differently. Should they be the first two instructions? Probably, yes. Should the I8s be next, for good measure?

should we reserve ordered portions of the opcode space for what will become I-type (4 opcodes) and P-type (32+8+8+8 = 56) and R-type (32 + 16 + 16 = 64)? So we could have I-type, plus 4 reserved instructions, plus P-type (64 so far), plus R-type (128 so far)

also, should the branch instructions all go in a certain part of opcode space (so that you can deal with their special immediate offset coding all at the same time by checking for a certain opcode range)?

---

here's a reason not to make Boot 32-bits:

What if code dynamically computes a code pointer by adding an offset to the PC, then using the JY instruction to jump to that address?

In Boot we could just have an INSTRUCTION_SIZE in SYSINFO. But in BootX? the code will rely upon the relative sizes of 8-bit, 16-bit, and 32-bit instructions.

If we say that dynamic computation of code pointers is not allowed, that makes it impossible to do PC-relative jumps to farther away than the provided jump immediate instructions.

... on the other hand, it would be hard to compile BootX? code that computes such dynamic PC-relative jumps into Boot code anyways, because the compiler needs to know the jump target as a label since it will be changing the distances between various pieces of code. And if the compiler can label the jump target, then it can recompute the offsets correctly.

Btw this provides a strong argument for a tablejump instruction, because such an instruction is a form of indirect jump that the compiler can easily recompile to different instruction sizes; we probably want to rule out general unlabeled computed indirect jumps for this reason, i guess i see why now.

also i guess i had forgotten, but it's hard to compile unlabeled computed indirect jumps to platforms that require labels. So mb we should forbid that sort of thing anyhow? No, actually, this is a difficult topic that touches on the heart of the reason for Boot in the first place: to escape the strictures of control flow in various high-level platforms:

These platforms which require possible jump targets to be labeled still make use of unlabeled indirect jumping in the form of a call stack. But we are at a lower level than a call-stack here. In order to be able to implement a call stack, we need unrestricted indirect jumps. Which means that there can't be a direct compilation from Boot to many of those platforms; Boot can only be interpreted on those platforms (although OVM can probably be compiled directly to them).

Now, we don't necessarily need indirect jumps to a code address that has been manipulated by pointer arithmetic. So the issue of instruction sizes is somewhat separate. But we can't just get rid of the JY instruction completely; the most we could do is specify that pointer arithmetic on code pointers is invalid (thereby ruling out use of jump tables).

So i guess my conclusions are:

---

a defining property of Boot is turning out to be the bifurcation between integers and pointers. I think i should rename the language Duo, or Biplane, or Gemini.

or Bicycle or t42 or bivalve

i guess bivalve is best because it's a Bi VM; bivm, bivasm

---

Mb 4 kinds of ANNotations: ASCII comments; source maps, identifiers, and labels; types, stack maps, and other standardized dedications; implementation-dependent and reserved-for-runtime

---

yknow, for transpilation an AST-based OVM would be best. But for easy interpreter-with-interop (interop with the platform's data structures and GC), a linear bytecode-based OVM would be best. So perhaps we need to expand the OVM layer into two layers.

---

yknow, having written Boot, i'm thinking of just merging Boot and BootX?. I'm guessing that implementing BootX? will only take about 2x as much labor as implementing Boot. So it may not be worth the ecosystem complexity (and i'm feeling like it will be much easier to implement BootX? than to implement that BootX?->Boot compiler!). Also most of the additional opcodes are stuff which are easy to implement on platforms that support them at all, such as floating-point and 64-bit. And honestly, i'm not going to implement soft-fp and soft-64-bit anytime soon anyhow.

otoh, once we get into using the MORE trick on the 32-bit representation, we can fit a bunch more instructions... which suggests adding more complicated things like arrays and structs and maybe strings and dicts, in which case we do probably need a separation between Boot and BootX?.

otoh, while that's a noble goal, another goal is just to finish the project soon, and it would be quicker to just start working on a full BootX? implementation at this point, and stop worrying about Boot for now (since BootX?, and even OVM, can and will be directly implemented on an underlying platform, for the most part). Perhaps we could come back to Boot later, rather than evolve them simultaneously.

i should probably get Boot to a stopping place first -- test it, document it

---

yknow, if indirect jump targets are indeed dynamically generated, with no labels, it's going to be very hard to compile BootX? to Boot perfectly in all cases. Either we should unify Boot and BootX?, or BootX? should require all jump targets to be labeled.

i guess the fact that this difficulty even exists for BootX?->Boot argues strongly for requiring labels in BootX?; as i've said before, it's going to be hard to compile to any non-processor-ISA target without required labels.

i guess this also means that Boot CAN'T be easily compiled to x86, because we don't know how to map computed indirect jumps.

should look at the 4 strategies i enumerated in plChAssemblyLanguagePrincipals for dealing with this:

---

so in OVM i'm thinking about making the GC/refcounting implicit. Is that what Python does with its stack-based bytecode?

Yes:

" >>> def test(): foo = 5

>>> import dis
>>> dis.dis(test)
  2           0 LOAD_CONST               1 (5)
              3 STORE_FAST               0 (foo)
              6 LOAD_CONST               0 (None)
              9 RETURN_VALUE

... Note though, that this does not increase the reference count of that value. That's what happens with LOAD_CONST, or any other byte code instruction that gets a value from somewhere:

TARGET(LOAD_CONST) {
    x = GETITEM(consts, oparg);
    Py_INCREF(x);
    PUSH(x);
    FAST_DISPATCH();
}
"

[4]
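The quoted disassembly (which is from an older CPython 2.x) can be reproduced today with the `dis` module. Exact offsets and auxiliary opcodes vary across CPython versions, but LOAD_CONST and STORE_FAST still show up for `foo = 5`:

```python
import dis


def test():
    foo = 5


# Opnames for the function body; in modern CPython there may be extra
# opcodes (e.g. RESUME) and different offsets than in the 2.x quote above.
ops = [ins.opname for ins in dis.get_instructions(test)]
assert "LOAD_CONST" in ops and "STORE_FAST" in ops
```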

---

Guile (Scheme)'s VM appears to place the procedure being called in the first argument slot on the stack when calling any procedure (presumably for efficient recursion?) [5] [6]

interesting idea.

---

so i guess the minimum state per thread is around:

16*5 = 80 words

+ a few words for state not in the registers, e.g. the PC

Sounds like a lot. What does Erlang do? Oh, actually they have even more: 309 words[7] (although it's not really comparable b/c Erlang processes also get their own heap area)

So actually we might still be sufficiently lean.

---

frame pointer in calling convention?

---

so i guess the OVM instruction format is looking like:

3 x (16-bit operands) + 1 x (32-bit operand (larger operand for 32-bit immediates)) + 4 x (8-bit addr modes) = 14 bytes?

let's add another operand (so 1 opcode and 4 operands) so that we can do e.g. CAS easily. MB this can double as a generic polymorphism type operand sometimes. So:

4 x (16-bit operands) + 1 x (32-bit operand) + 4 x (8-bit addr modes) = 16 bytes

we're short one addr mode... that's probably ok tho, either the 4th operand (which is rarely used) or the opcode can do without an addr mode (e.g. we used to use the opcode addr mode to do CALL3, but now we could just have a CALL3 opcode and do that with the 4th operand). Probably the opcode should do without.
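The 16-byte layout above can be prototyped with `struct`. This is only a sketch under the stated assumptions (one 16-bit opcode slot, three 16-bit operands, one 32-bit operand, four 8-bit addressing-mode bytes covering the four operands, with the opcode going without a mode); the field order and endianness are arbitrary choices here, not a spec.

```python
import struct

# Hypothetical packing of the 16-byte OVM instruction format sketched
# above: 16-bit opcode + 3 x 16-bit operands + 1 x 32-bit operand +
# 4 x 8-bit addr modes (one per operand; none for the opcode).
FMT = "<4H I 4B"  # little-endian; 4*2 + 4 + 4 = 16 bytes


def encode(opcode, ops3, big_op, modes4):
    return struct.pack(FMT, opcode, *ops3, big_op, *modes4)


def decode(insn):
    fields = struct.unpack(FMT, insn)
    opcode, ops3 = fields[0], list(fields[1:4])
    big_op, modes4 = fields[4], list(fields[5:9])
    return opcode, ops3, big_op, modes4
```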

---

the best way i can think of to make Boot compilable is to start defining annotations.

Specify that each JI instruction must be immediately followed by an annotation. The choices for this annotation are:

i dunno man this is getting pretty complicated for little old Boot. All the compiler really needs to know is the complete set of locations that might be indirectly jumped to anywhere in the code without acquiring the target address at runtime (that is, it needs all the labels), then it can replace each JI with a much slower command that does a lookup into that table.

even if we just do that though, should we require ANN instructions in the program code next to the jump targets (slows the program down), or just provide the compiler with a table of all of them at the beginning (more complicated file encoding)? And shouldn't we have a jump table instruction?

Let's look at the form of the only indirect jump instruction, JY:

  jy &targetbase #imm6: dYnamic (indirect) jump

This instruction is pretty wacky/powerful. It can index into a dynamically chosen jump table, but the index into the table is static/immediate. So this does not replace a jump table instruction, which i define as a dynamic index into a static/immediate table.

Note that the idea of indexing into a dynamically-provided jump table would require informing the compiler that each of the entries in the table is a jump target (unless the table is dynamically provided by foreign code, in which case our Boot program had better understand the platform conventions for code size). However, since the table itself is dynamically provided, this sort of thing alone would not save us from having to do a slow lookup at runtime.

A little unrelated, but note also that if a program WRITES BOOT CODE AT RUNTIME INTO MEMORY and then JUMPS TO THE CODE IT WROTE, we can't do that at all without an interpreter.

i think maybe what we need to do is introduce the concept of a list of source addresses that might be jumped to (a jump table); there can be multiple such lists in the program; the list of global labels is one such list; but there can be others too. Each such list is in the file somewhere (near or in the constant table i guess). Each JY must be annotated with which of these tables it uses, or that it uses a dynamically provided address (from PUSHPC or the like). Even if lookup is required, then, this ensures that it's kinda quick. But then we may as well go the extra mile and have a jump table instruction. Also, may as well allow for the possibility of a totally unconstrained JY (for debuggers and the like).

ok here's what i think we need:

ok i did that
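The annotated-table scheme above can be sketched as follows. This is a hypothetical model (function and variable names are illustrative): each JY carries an annotation naming which jump-target list it may use, so the compiler can validate a runtime address against that bounded list and map it to the relocated label.

```python
# Sketch of the proposal above: the program carries one or more lists of
# possible jump targets; each JY is annotated with the list it may use.
# A compiler can then translate a runtime code address via a lookup that
# is slow but bounded by the size of the annotated table.


def compile_jy(annotated_tables, table_id, runtime_addr, relocation):
    """Translate a runtime JY target, using the jump-target list this JY
    was annotated with, into the relocated address of the same label."""
    table = annotated_tables[table_id]
    if runtime_addr not in table:
        raise ValueError("JY target not in its annotated jump table")
    return relocation[runtime_addr]  # where that label lives after recompilation
```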

---

note that the higher-level VMs avoid pushing the complexity of calling conventions onto the user by simply providing CALL instructions that provide each called function with a fresh set of registers (and presumably implementing calling conventions on the underlying platform within the implementation of the CALL instruction).

---

still going back and forth about the xcall calling convention. Should it use the registers or smallstack or both?

arguments for using registers:

arguments for using smallstack:

i'm leaning towards registers, just because with smallstack we wouldn't want to pass too many arguments before resorting to the variadic 'rest' case.

---

there's a problem with the variadic 'rest' case though. Who deallocates that memory? If we want to be completely agnostic about whether that memory is on a stack or is malloc'd, then the caller must deallocate, because the callee doesn't know how to deallocate (pop a stack pointer, or mfree?). However, this is bad for tail calls! For tail calls, we want callee cleanup, because control will never return to the caller.

i guess you could still do tail calls this way but with a 'wrapper' function around the tail-calling function that sets up and tears-down its memory.

Perhaps a simpler solution is just to say that these registers are callee-saved, as usual for argument-passing registers, and if the caller needs to deallocate, it must duplicate those pointers in other caller-saved registers before the call.

---

and for xcall/xpostcall, should there be any caller-saved registers, or should we guarantee callee-saving of everything? Caller-saving is a waste in many implementations because we'll just store the whole register file and external code won't touch it. Otoh if we just use the same convention which is suited for actual Boot code, we retain the option of doing xcalls as normal calls into other Boot code if we need to for some reason. I guess that's worth something.

---

ok i think i decided. Register argument-passing calling convention. Caller-cleanup for the extra registers, because we are agnostic about call stacks at this level. I updated boot_reference accordingly.

---

something still to be decided: what to put in the R- instruction slots, which can access all 16 registers (as opposed to the P-instruction slots, which can only access 8).

I ran out of room in P-1 so i moved HALT to R-1 (since HALT is 'syscally')

also, now that we have room again, should put back those other syscalls, eg POLL

---

an alternative would be to provide alternatives to xcall:

or maybe just at least the 0, 1, and 2-argument smallstack variants: xcalls_i_ii, xcalls_p_ii, xcalls_i_ip, xcalls_i_pp, etc. There are 8 such 2-argument variants, 2 such 0-argument variants (based on return value type), and 4 such 1-argument variants, for 14 such instructions total.

this has the advantage of allowing the program to xcall without saving any registers, and also of providing Boot primitives for each of the 8 CALL3 types/cases in BootX?.

we could also have another calling convention to mimic what happens when a Boot program itself is run; all registers except r3 are caller-saved (that is, available for the Boot "callee"), and an integer value from any register is returned.

ok, i did that.

---

thinking about my new r12 calling convention, in which r3 is reserved for a stack pointer or somesuch (so that the caller has SOMEwhere to store their data), makes me think maybe we should just specify r3 as a memory stack pointer after all.

A stack pointer is useless if it might have remaining capacity 0, so we could also specify that it has capacity at least 16 of the larger of 32-bit integers or pointers.

And that makes me realize that my planned displacement r3 mode in BootX? is pretty useless, because what units would an immediate displacement be in (since we don't know the size of either 32-bit integers or pointers)?

I think the only sensible way to do that would be to go back to a split mode where, out of the 4 operand bits, 2 of them are a count 'x' of 32-bit integers (so between 0 and 3), and two of them are a count 'y' of pointers (so between 0 and 3), and then the displacement is x*INT_SIZE + y*PTR_SIZE. So now our displacement mode can always address only up to 3 items into the memstack (and can sometimes go up to 6), which isn't so hot, and is much less useful than 16 items.
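Decoding that split mode is a one-liner worth writing down. This sketch assumes (arbitrarily) that the high 2 bits hold x and the low 2 bits hold y; the sizes passed in are platform parameters.

```python
# Decode the split displacement mode described above: 2 bits give a count
# x of 32-bit integers, 2 bits a count y of pointers, and the displacement
# is x*INT_SIZE + y*PTR_SIZE. Bit assignment here is an assumption.


def split_displacement(operand4, int_size, ptr_size):
    x = (operand4 >> 2) & 0b11  # count of 32-bit ints, 0..3
    y = operand4 & 0b11         # count of pointers, 0..3
    return x * int_size + y * ptr_size
```

For example, on a platform with 4-byte ints and 8-byte pointers, operand 0b0101 (x=1, y=1) yields a displacement of 12.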

might be more useful to have an index address mode that specifies an integer register with the operand, and then adds the signed contents of that register to the contents of ptr-r3 to get the effective address.

heh heh.. or mb a whole suite of 16 special addr modes that do something split-register-y and fix both registers: &3+$3, &3+$3*INT_SZ, &3+$3*PTR_SZ, &3+$3*INT_SZ+$4*PTR_SZ, &3+$4, &4+$3, &3+INT_SZ, &3+PTR_SZ, (*&3)+$3*PTR_SZ, &3+(*&4), &3+$T, &T+$T, &T+$3, ...etc... i guess what we want here is combinations of &S, $S, &T, $T, $3, &3, +-INT_SZ, +-PTR_SZ (forget the r4s). hmm... here's 15 such combinations:

&S + $S, &S + $T, &S + $3, &S + INT_SZ, &S + PTR_SZ, &T + $S, &T + $T, &T + $3, &T + INT_SZ, &T + PTR_SZ, &3 + $S, &3 + $T, &3 + $3, &3 + INT_SZ, &3 + PTR_SZ

it would be nice to get the -INT_SZ, -PTR_SZs too, but we only have one spot left. Actually the INT_SZ, PTR_SZs aren't as important as they seem, because we have constant INT_SZ, PTR_SZ (and their negations) in BootX? as immediate constants, so we don't need to do this to do increments and decrements. Without those guys we would have 9 items here (7 spots left to get to 16). How about:

&S + $S, &S + $T, &S + $3, &S + INT_SZ, &S + PTR_SZ, &S - INT_SZ, &S - PTR_SZ, &T + $S, &T + $T, &T + $3, &T + INT_SZ, &T + PTR_SZ, &T - INT_SZ, &T - PTR_SZ, &3 + $S, &3 + $T

no, i like having &3 + $3. How about:

&S + $S, &S + $T, &S + $3, &S + INT_SZ, &S + PTR_SZ, &S - INT_SZ, &S - PTR_SZ, &T + $S, &T + $T, &T + $3, &T + INT_SZ, &T + PTR_SZ, &T - PTR_SZ, &3 + $S, &3 + $T, &3 + $3

y'know, i think this sort of thing is a good idea, but all the possibilities here just sort of demonstrate that we should just have one RESERVED addressing mode, and then decide what to do with it later via profiling real code.

---

another thing to remember: in OVM we want to be able to run code sandboxed. So yeah we do need capabilities or at least ambient authority, and no, we can't turn off boundary-checking unless you have the capabilities to do so.

---

not sure if the r12 calling convention is really needed but i think it's nice to have something symmetric with the Boot program's 'calling convention' itself. I added text saying that Boot programs should restore &3 to its original content upon exit.

---

also, yknow, it's easier to type 'i3' and 'p3' than '$3' and '&3'. So maybe call the registers that? although it's easier to read instruction specifications like

lw $dest &addr

than

lw idest paddr

---

if we are doing memory-management within the OVM implementation, are we doing capabilities in there? If so, then how does E do it?

Based on the tiny 'mint' example in [8], it looks like:

Wikipedia says that Pony was inspired by E, btw. Pony has reference capabilities to enforce Milewski-like access limits on variables for the purpose of safe concurrency (e.g. "Box. This is for references to data that is read-only to you").

Another cool thing that E does is it has both 'call' and 'eventual send' (distributed (message-based?) call, which may get lost and never execute and returns a promise). E talks about how its eventual send and capabilities allow for distributed computation and 'mobile code'. And I recall that AliceVM? and SEAM also talked about 'mobile code'.

i guess if we are putting bounds checking and capabilities in OVM, it's appropriate to have some sort of an object model in there too. So OVM would have bounds-checking, memory-management, capabilities (including seal/unseal, and Pony-like reference capabilities), an object model, guaranteed privacy within the object model, promises, eventual send/distributed message-passing, channels, serialization, mb laziness? This is getting pretty intense; are we sure we don't want to leave some or all of this stuff for Oot Core?

right now, i'm thinking no, this really is all OVM stuff. The reason is that it:

Are we just making Oot Core here? No, because the OVM level has to be very un-dynamic (remember the lesson from PyPy?).

i think it's time to create a new series of notes files for OVM. Done.