---

190529

we need to take a hard look at if Boot is meeting its goals: i'm not sure that it's any easier to implement than RISC-V RV32I.

it provides other benefits:

the only thing that depends on native pointer size is pointer size
it's 32-bit
pointers are opaque
it will incorporate opcodes for all the syscall primitives necessary for bootstrapping and core interop
because of the secondary registers and smallstack, you can probably write a bunch of core algorithms in it directly
it's a 16-bit encoding instead of 32-bits
it has that cool Smallstack, which probably increases code density
it's a stepping stone to BootX?, which is pretty expressive

it's better than WASM (no weird structured control flow), simpler than JVM, much simpler than CIL, simpler and lower level than Lua, and simpler than ARM (esp. in instruction encoding). But it may be more complicated than RISC-V. And RISC-V is more popular, so even if it were equal complexity, RISC-V should win.

i used to be scared of RISC-V's instruction encoding variants, but now we have a bunch of those too! And ours are (right now, at least) even less regular than theirs, and also theirs are more thought out and have a bunch of properties that they claim are important for simple, efficient hardware decoding:

their register operands are always in the same place
the high bit (sign bit) of their immediate operand is always in the MSb bit (bit 31)
also, they didn't mention this, but i notice that their rd (destination register) is nearest to the opcode, whereas we have it in the other order -- not sure if that's important though

Yes, their 32-bit and 64-bit platforms are different not only in pointer size, which makes it harder for us to write code targeting both. But we already have an unknown pointer size in Boot, is it really that much trouble to have distinct 32-bit and 64-bit targets? And it's probably not so much trouble to make it worth it to have to start from scratch toolchain-wise with a new assembly language.

Yes, for our purposes, it's slightly simpler to include the necessary syscalls as opcodes. But again, not worth the complexity of introducting a new assembly language.

the only thing that i think might really be important is pointer opaqueness. I'm imagining interop situations where we are running inside a high-level platform like Python, and we want to interoperate with its data structures. In this context we don't have pointers, we have opaque references. We can store them, but not in integers. We can't do pointer arithmetic on them even if we want to. I'm imagining that we will be accessing struct fields (e.g. objects) by assigning a numerical offset for each field and then adding that offset to the base 'reference's, which is why i include the operation of adding an int to a pointer. (now that i say this, it strikes me that even blt-ptr (bltp) is too much, ok i just took that out).

in older versions of Boot we had systems that made the pointer size totally opaque; the program didn't even have to know it for memory layout b/c memory was accessed in terms of pointer-sized words. Should we go back to that?

In any case, we should at least work on our instruction encoding to make it more regular, more like RISC-V.

i think the takeaways are:

yes, it may still be worth it to do Boot, because pointer opaqueness is a big deal
i took out blt-ptr just now
need to improve instruction encoding. Probably should split up the opcodes, like RISC-V does, and also put high bits of immediates at the end, they seem to think that's important.
consider going back to memory operations that operate in terms of words (slots) of the same size as the ptr. we can still have quantities restricted to 32 bits... this would allow us to have programs that are completely architecture-independent, which is a big win

---

ok so thinking a little more, we need to go back a little to the more complex system where memory was a linear sequence of locations of opaque type and size. The reason is that the whole reason that we want totally opaque pointers is in order to use Boot code in situations like when we are running on top of Python; memory we allocate is Python arrays and we access Python objects by assigning a numbering to their fields and then treating the object reference as a pointer and adding integers to that pointer to specify the fields. The ability to do this is why Boot is more useful for interop than RISC-V.

so some considerations:

we need to bring back the OTHER type that i took out. Namely, Boot memory locations and registers (which register bank? they might be big, like pointers, but they can't be dereferenced) can hold opaque values of unknown type (these represent objects in some type that the underlying platform knows about but that the Boot implementor hasn't yet made interoperable with Boot). All you can do with these guys is copy them around.
we may need to go back to making int registers of an unknown bitwidth which is >= 32 bits. The semantics of arithmetic may need to change so that wrapping at 2^32 is no longer guaranteed. This would introduce issues; now it's hard to check for overflow (you can't just see if the resulting value is too small), so we may need to go back to arithmetic with overflow checks, but then that can't be compiled to efficient code on many platforms. Alternately, we can keep int registers at 32 bits, which is what i'm leaning towards. This would imply that OTHERs definitely go in ptr regs.
we need to bring back the more complicated semantics that i just deleted from the previous version about how it's an error to try and dereference OTHER types or ints
it's not as easy as just saying that every addressable memory location can hold a pointer. Sometimes you want to navigate memory in smaller chunks than that, for example, when navigating the Boot instruction stream you want offsets in units of one byte because of the BootX? 8-bit instructions. If the pointer size is 64 bits, you don't want to spend 8 bytes to represent each byte of Boot code just because the smallest size that you can address is one pointer-word. Before i had this additional type called CODEPTR. Now i think we don't need CODEPTR, we just need to have a single general system for addressing memory.
should we rethink dual register banks? No, i still like them; (a) if we do go with 32-bit arithmetic then we definitely need one bank that does that and another bank that can hold ptrs+OTHERs (b) prevents programmer errors (c) simplifies implementation when refs are weird (you're not always special-case checking if the value of your int register is actually some other type) (d) i was going to say it makes semantics simpler, but now that we have OTHERS in one or the other of the banks too it's not so simple
i guess mainly what we need, in addition to a way to get the PTRSZ, is a fix for the issue mentioned above, when each addressable memory location can already hold an i32; you don't want to take up 4 of these for each 32-bit number in such a case.
Note that perhaps the amount that each memory location can hold might change for different (base) pointers! For example, one pointer may refer to a Boot-allocated array of bytes (that would be good for holding Boot instruction code); another pointer may refer to a Python object (each of whose fields might hold a reference). These two examples bring up another issue; some pointers might refer to memory locations that cannot hold pointers or OTHERs no matter how many adjacent memory locations you smash together

So, TODO:

need to improve instruction encoding. Probably should split up the opcodes, like RISC-V does, and also put high bits of immediates at the end, they seem to think that's important.
think about memory addressing and pointer sizes
- can the values in INT registers sometimes be >2^32? i'm thinking NO
- keep in mind the running-on-top-of-Python-memory-offsets-are-struct-fields case
- some pointers but not others may refer to byte-addressed regions of memory?
- some pointers but not others may refer to regions of memory in which ptrs cant be stored? for example if we are running on top of Python, so 'pointers' are opaque references, but we have a pointer to an array of bytes to put Boot code into...
- the issue where, if each addressable memory loc can hold 64-bits, we waste 4 x 64-bits to hold each 32-bit quantity
- do we want to bring back PEEK and POKE?
bring back OTHER, and bring back the semantics wording relating to that (about when you try to dereference an OTHER, etc)

---

so far i'm leaning towards/TODO:

having the int registers really only be 32 bits no matter what (this allows you to use typical operations to check for overflow, which means code will translate efficiently into assembly/contemporary platform primitives)
having a way to query any pointer and get: whether pointers can be stored here, how many locs are needed for each of: i8, i16, i32, ptr (and is there an assumption that any address you can get by adding to that pointer has the same properties? i think so). Should this be a compile-time or run-time operation? If compile-time it can't really vary by pointer, can it?
lb lh lw lp sb sh sw sp all read from/write to as many adjacent locations as are needed
still need to improve instruction encoding: see about having simple opcodes split into multiple places, keeping registers in the same place, having the immediate at the end instead of the middle

---

if our memory locations could literally even be references to consecutive fields in a C struct, then some of those C struct fields can hold i32s, some can't, some can hold pointers, some can't, and this changes for different types of structs. So in that case, either you can to check the size of things at every single location, or the program has to have some knowledge at compile-time.

And if you are relying on the program to have compile-time knowledge about what fits where, then is there anything left to be done? And in that case, are we really any better off than RISC-V?

Perhaps we want the Boot program to have the POSSIBILTY of doing low-level interfacing like this, but to also have a sort of 'general case' memory type that it can do internal computations in, which is more predictable?

here's a few cases to consider:

memories that can hold i32 values, possibly using more than one memory loc, but not ptrs at all
memories that can hold both i32 values and ptrs, possible with different sizes
memories whose sizes are uniform
memories whose sizes are idiosyncratic at each location (e.g. typed structs)

If we rule out the idiosyncratic memories then things probably become more tractable. This means that memory locations never refer to things like references to consecutive fields in a C struct -- you would treat such things as fields of bytes, instead. But in this case how would we interop with structs on top of a typed HLL language, like Haskell?

---

so, i think that direct interop with external data structure layout is always going to require either program knowledge, or a debilitating amount of runtime introspection. When we say that we want Boot programs to be universal, all that we can really ask for is for the existence of uniform memory for internal data structures -- so e.g. if a hash map library is written in Boot, we want to be able to re-use that same code when running on top of Haskell, on top of Python, or on top of raw assembly. Now again, we can do this at compile time (have constants for the size of ptrs, the sizeof i323s, the size of i16s), or we can do this at runtime (introspection/sysinfo, possibly additional instructions such as addsizeof to make it easier to iterate through memory, although we still need to do other stuff when we need to e.g. compute how many memory locations an array of 100 ptrs takes up).

so, can we/should we just use RISC-V for this? maybe... the semantics we are using are pretty different (pointer arithmetic is invalid) but that may not be a problem. should think further. my instinct is no, the semantics are so different that you don't gain much by using it, so just make something new, it'll confuse ppl less anyways. but that may just be Not Invented Here syndrome.

---

okay i've thought about this a little today, i think it's still worth doing, and here's why:

the other option we are considering is just using RISC-V, but adapted for our case.

this means just specifying that pointers are opaque.

so what happens when you load a pointer? what instruction do you use? We need to add a 'load pointer' instruction, and a 'store pointer' instruction.

and what happens when you add an int to a pointer? what instruction do you use? You could just use 'add'. But then the VM has to check what type is in the registers each type it encounters an 'add', in order to know whether we are adding integers, or adding one pointer and one integer. it would be better to have an add-int-ptr, like Boot.

what happens if you try to do pointer arithmetic? it fails at runtime.

So, we need at least some new instructions (load and store pointer) and could really use some others (add int to pointer). That's already probably enough to make us think about a new language. And pointer registers seem to fit really well too.

we could always do an extension to RISC-V tho...

and there are other minor reasons to go with a new language. We're going to have to redefine the RISC-V semantics a little (no pointer arithmetic), and we're going to have to add the syscall fns. It's more confusing to have an overlay like this than just to make our own new thing.

---

can we make better immediate formats? actually i don't buy RISC-V's idea of splitting the immediates to keep the sign bit in the most-significant-bit. I think it's simpler to keep them together.

---

this 'BootX? should have no runtime requirement over Boot' is getting too expensive. I put in a special 32-to-8 and 8-to-32 cpy (mov) instruction (with its own instruction format) so we could access all 32 BootX? registers from Boot. And i put in extra cpfis, cptis, cpfps, cptps (cpy from int stack, cpy to int stack, cp from pointer stack, cpy to pointer stack) instructions so that we could access deep into the stack without pushing and popping. And i kept the intstack and ptrstack at capacity 8 so that those instructions can access all of it.

But this is getting ridiculous. I'm going to take all that out and assume that BootX? will need extra state and can store it in allocated memory.

---

i used to have this:

Five formats. For each format, from LSB to MSB, the fields are:

I12-format: 0000, signed immediate operand 0 (12-bit) (jpc)
I8-format: 0001, I8-opcode (1 bit), signed immediate operand 1 (8 bits), operand 0 (3 bits). (ldi, sai)
I5-format: 001, I5-opcode (2 bits), signed immediate operand 2 (5 bits), operand 1 (3 bits), operand 0 (3 bits). (ld, ldp, beq, beqp)
R3-format: 01, R3-opcode (5 bits), operand 2 (3 bits), operand 1 (3 bits), operand 0 (3 bits).
R2-format: 0111111, R2-opcode (3 bits), operand 1 (3 bits), operand 0 (3 bits).
R1-format: 0111111111, R1-opcode (3 bits), operand 0 (3 bits).
R0-format: 0111111111111, R0-opcode (3 bits)

In graphical form (from MSB to LSB):

R3-format: xxxyyyzzzooooo10
I5-format: xxxyyyiiiiioo100
I8-format: xxxiiiiiiiio1000
I12-format: iiiiiiiiiiii0000
R2-format: xxxyyyooo1111110
R1-format: xxxooo1111111110
R0-format: ooo1111111111110

but i'm gonig to change it to consolidate i5, i8, i12 into one format with a 2-bit signed immediate and a 3-bit opcode. The instructions can then add in the other operands themselves. This is slightly less efficient (now JPC only gets an 11-bit signed immediate; also the encoding logic has to split up the immediate), but it's slightly simpler/less scary (b/c 2 less formats), and it has the property that RISC-V likes that the signed bit of the immediate is in the same place everywhere.

i saved the former one in boot_old/pyboot_190529_i5_i8_i12_encoding/encoding.py

later that day:

ok i did that but i think it's more complicated, not less.

i saved it in

../boot_old/pyboot_190530_r_i_only_encoding

Also i reread the RISC-C comments about keeping the sign bit in one place and i think they are talking about when they have branch offset immediates in units of 2; rather than just multiplying the immediate by 2 they do some complicated thing where they swap around the location of bits, but they leave the sign bit in one place, and i think that's what they were talking about.

Regarding keeping the registers in one place, since in the immediate formats some registers are being used as immediates, i think it's fine for those to move, which is what the old format did.

Regarding the simplicity of having only R- and I- formats, i found it was cleaner to add in R2 R1 R0 subformats anyways, and i think the complexity of juggling the immediate bits outweighs not having I5 I8 I12 formats.

So forget about swapping around the sign bit, that's stupid.

But i like having a fixed number of opcode bits for the Is, even at the expense of reducing JPC range by 2x -- it's probably much less efficient in terms of code density, but it's cleaner. So let's do:

R-format: xxxyyyzzzooooo10
I-format: xxxyyyzzziiooo00
I5-subformat: xxxyyyiiiiioo000
I8-subformat: xxxiiiiiiiioo100
I12-subformat: iiiiiiiiiiio1100
R3-subformat: xxxyyyzzzooooo10
R2-subformat: xxxyyyooo1111110
R1-subformat: xxxooo1111111110
R0-subformat: ooo1111111111110

---

ok i've mostly implemented Boot now but i havent finished testing it yet. It took me much of yesterday to write the initial encode/decode logic, then today I spent about half a day changing the encoding, and the other half writing the actual VM, and the first draft of the VM is done now but i've only tested the first few instruction. So i think it's likely that this is a two-day job in total (which i guess means a two-weekend job for hobbyists) (although, even though someone else wouldn't have to spend time changing the design of Boot itself, they would have to spend more time reading and understanding the spec i wrote, so maybe it really is a three-day job).

Anyhow, the point is, unlike the previous attempt, this is a success, in that:

it's sufficiently easy to implement!!!

(although i dunno if i'd call it 'dead simple'; i could have left out the I encodings b/c 32-bits is bigger anyways, i could have left out the rtwo rone rzero/MORE instructions, i could have chosen either a register or a stack machine rather than mixing them, i could have had a variable-length encoding that just had whatever data was needed after a one-byte opcode).

However i noticed that i now think that BootX?, as i currently envision it, wouldn't really be that much more complicated to implement, at least not if you don't include all the instructions. I think much of the complexity that i ran into before came from complicated ideas for the addressing modes etc, for example, having 'stack-aware' instructions that take an operand of $S to mean, not auto-push-pop to/from the stack, but rather, do something to the stack itself. Also my CALL3 idea led to difficulty in terms of how the return argument from the instruction being called would get assigned to the register indicated in the destination operand (although as i think about it at this moment, i don't see the problem; just have the RET instruction look at the jump target and if it's right after a CALL3 instruction then look at that instruction to get the return register).

Anyway, so i've simplified the addressing modes, and i don't think this would be a problem anymore. In fact, the encoding might even be easier, since with bigger operands the need for immediate encoding formats is not so important (with a 32-bit encoding you wouldn't need those immediate formats so encoding/decoding gets simpler). Also, now that Boot and BootX? have different numbers of registers, the story of incrementally extending Boot to BootX? is not so good as it was. And because the 16-bit encoding was so cramped (only +-1KiB? direct jumps), i felt the need to let a constant table creep in to Boot, which is complexity we wouldn't need with BootX? b/c we could have a 21-bit immediate constant.

And there was other stuff i was dreaming about with BootX?, about having it be a good emulator and having all these hooks and having capabilites and having polymorphic functions, but now i think that that belongs only in OVM if anywhere. BootX? is just a lowest-common-denominator target platform to make it easier for me to port everything on top of it.

So what about just calling the thing we're calling BootX?, "Boot", and having "Boot0" just be a subset of Boot's instructions (also Boot0 would only use the 32-bit instruction encoding, whereas Boot could have more)?

To be clear, what i currently mean by "Boot" (was: BootX?) is:

32-bit fixed length encoding: 3 encoding length format bits, 6 opcode bits, 2 opcode addr mode bits, 3 x (5 operand bits, 2 addr mode bits)

Boot0 would have the addressing modes (this removes the necessity for us to provide immediate-mode instruction variants like ADDI that are only needed b/c we don't have an immediate constant addr mode), but might not have the instructions supporting any types beyond int and ptr. Boot0 would have the 32x2 registers, plus 32x2 stack items.

to be clear, i am really excited that the 16-bit encoding of Boot seems to be really complete and dense, possibly even denser than anything else out there. And working with those constraints did help me hone my ideas about what is really essential and fundamental at this level of the stack of intermediate languages. But now i have those ideas honed, and the primary remaining goal for Boot is ultimately just to be an easy-to-implement, easy-to-use, portable target, not to win contests about code density and beautiful compactness. If the 32-bit format is easier to compile to than the 16-bit format (b/c of the greater number of registers), and not tons harder to implement, and if we'd have to compile Boot-32 to a Boot-16-only machine (due to greater number of registers in Boot-32), rather than easily extend the Boot 16 machine , then we should just start out implementing the 32-bit one, and consider the 16-bit format just an optimization.

---

if we start out with 32-bits though i'm just bothered by having so much state. If there's 32 registers, times two banks, and 16 stack locs, times two stacks, on a 64-bit platform that's:

32*4 + 32*8 + 16*4 + 16*8 = 576 bytes of state

so how about just going back to 8 addressing modes. Now we have:

16*4 + 16*8 + 16*4 + 16*8 = 384 bytes of state

Still a ton but not quite as excessive.

First three registers are still 0, stack, tos, and still have their special behavior (this is a little redundant, but it make them more useful in the other addr modes, and makes everything more uniform). Except i had a new idea: there should be a 'zero page' of pre-allocated memory, and the zero register in the pointer register bank should point to it, rather than be the PC (for the PC, there is a PC-relative addr mode, see below).

Now we have:

3 encoding length format bits, 4 opcode bits, 3 opcode addr mode bits, 1 opcode bit to control whether or not addr modes are in use or are just more opcodes, 3 x (4 operand bits, 3 addr mode bits)

(16*8 + 16) opcodes = 144

addr modes (note that some of these have two forms, int and ptr, for the two 'banks'):

immediate; the 16 immediate integer constants are: -2, -1, 1, 2, 3, 4, 5, 6, PTR_SIZE*(1, 2, 3, -1), I32_SIZE*(1, 2, 3, -1). The 16 immediate pointer "constants" are PC-relative (PC -8..+7) addresses (note: you cannot neccesarily write to PC address space!)
constant tables (note: implementation-dependent way of loading the constant table)
register direct
register indirect (accessing R1 here pops the smallstack, and then accesses that memory location)
access stack locations as registers
push/pop user stacks (accessing R1 here pops the smallstack, expects the result to be a stack pointer, and then pushes/pops to/from that stack) (int/ptr form determine how far the stack ptr is moved)
displacement (from base pointer in &t, in terms of I32_SIZE or PTR_SIZE) (signed)
"absolute"/zero page/scratch page -- the VM provides a pre-allocated 'zero page' (not necessarily corresponding to the zero-numbered memory on the host platform -- the zero page is at least 16*max(I32_SIZE,PTR_SIZE) = >=64 bytes of free memory which is allocated/available for the Boot program's usage) -- this is different from registers because you can get a pointer to it (constant pointer #0) (should different VM 'threads' in the same VM 'process' share one zero page?). Could be useful as a scratch buffer for e.g. reading/writing to files/channels, and/or for storing root pointers to blocks of allocated memory so you don't have to always keep them in memory, and/or for storing globals. The operand here is treated as a positive displacement from constant pointer #0, in units of I32_SIZE or PTR_SIZE. Because the term 'zero page' is not familiar to everyone, and neither 'absolute' nor 'zero' are quite correct, let's officially call it the 'global page'.

Should &0 holds a null pointer? Or maybe some platforms won't let us have a null pointer? Let's see:

RISC-V and ARM have null pointers
JVM and CIL have null references
Python has nullable everything (None)
LuaJIT?2 has nil
LLVM has a Null pointer constant with identifier 'null' [1]
WASM however allows 0 to be an ordinary memory location [2]
- one of the comments on that page points out that technically, the C standard doesn't require the null pointer to have value 0, it just requires that there be some pointer which is considered the null pointer, although most C code doesn't respect this
- "Emscripten is very much like embedded software - linear memory model, NULL pointer is a valid pointer, strict code size constraints, , etc. - and like in many embedded systems, heap is implemented using sbrk...In Emscripten, the linear memory space consists of the stack, that has a fixed size, and a heap that follows after the stack." [3]

so... i guess this doesn't decide it? Sounds like on most systems, there is a null pointer so we are good, but there isn't always, but otoh, even on systems where there isn't, we could always allocate one byte somewhere and then declare that to be OUR null pointer.

Regarding which instructions take forced immediate operands (which is important/special because then the 3 addressing mode bits are part of that immediate), we could just start by saying that only that weird bank of 16 opcodes in the 'immediate' mode do this (or, do we want to use 'immediate' mode as a CALL3 to nearby PC-relative locations? i don't think that's very useful...).

---

i dunno, now that i look at all that, i think it does make it more complex after all:

the semantics of those addr modes are a lot cleaner than the first time i thought of it, but having them there still makes things more intimidating
more importantly, we want to be able to compile those addr modes to a sublanguage that only uses register direct addr mode (and for specially encoding instructions, immediates). So shouldn't we just implement that sublanguage first?
also btw i think that was the issue with CALL3 and returns -- if we compile it to a sublanguage, then what does RET do? Surely we have to compile those CALL3s into multiple instructions, which use the stack(s) to push the two arguments and pop the return argument. Which means that at the least, Boot0 needs a slightly larger-capacity bigger stack than BootX?. And this brings up another issue; how would those CALL3s know whether the arguments and return value are integers or pointers? Recall that CALL3 is just being implemented by the opcode field referencing a register or something like that. heh actually i have the perfect solution to this, but you aren't going to like it. We have two types (i32 and ptr), and two input arguments, and one output argument. So that's 2^3 = 8 possibilities for the type of subroutines that we can call with CALL3. And we have 8 addressing modes.. so get rid of those addressing modes, use those bits to include the type of the function being called in the CALL3-ing instruction, and only allow CALL3 to occur on register direct. The reason i say that you aren't going to like it is just that this prevents us from using CALL3 along with the constant table to allow each module to extend the assembly language with 16 special functions. Otoh it does reduce such a call to an easy-to-recognize sequence of two instructions; LOADK, CALL3. Also note that this means we can't use the 'immediate mode' opcodes as opcodes when the 'call3 bit' is set; so we only have 128 opcodes, not 144. A little tighter..

soo... let's just have the encoding so that we can support the addr modes later, but still separate 'Boot' and 'BootX?'.BootX? is still the better name so that programmers won't be intimidated and think they have to do all the addr mode stuff before they can say they implemented the 'real' Boot.

Also, on that note, since we are focusing on simplicity and not code density now, maybe choose between register and stack machine? Maybe choose register machine and just say 'don't use registers $1 or $2 or &1 or &2, they are reserved for BootX?', and then provide special instructions to push and pop from the stacks? No, actually this is an issue, because now compiling BootX? to Boot means the runtime needs to reserve some registers to temporarily store stuff that should be on the stack in BootX?. If we support the stacks, then the whole point is that this sort of thing only requires us to increase the stack capacity instead, which is invisible to the instruction encoding.

i guess another thing i am still thinking about is: if we don't allow stack mode (stack as registers) in Boot, then ppl might implement the stack in a way that does not allow register access to it (if there are no other places available to temporarily store 16 stack locs if we have to pop/push to get at the deeper items. I guess we can solve that by requiring support for Boot instructions that do that. -- ok i added lsi lsp ssi ssp to remind me to do that.

also we should specify that Boot has a few more than 16 locs on each stack, so that the BootX? implementation has somewhere to work.

we should also probably have a pushpc command.

---

So if we are going to have 128 opcodes, does that fit all the other data types that we need?

Let's see, here's what we have right now:

ann lh lhu lb lbu sw sh sb sp add addi sub mul sll sru srs and or xor ap bne blt jd cas fence appi apwi reserved_r_27 reserved_r_28 reserved_r_29 reserved_r_30 rtwo

sysinfo cp cpp lsi lsp ssi ssp rone

halt lkp rzero

break bad1

i5 lw lp beq beqp
i8 li sai
i11 jk jpc

total so far: 49

we gotta add long, float, double variants of stuff, and conversions, and division:

ll ld lf sl sd sf addl addd addf subl subd subf mull muld mulf slll srul srsl andl orl xorl beql bltl beqf bltf beqd bltd (note, these float/double branches use f.p. total ordering, not floating-point comparison) lsl lsf lsd ssl ssf ssd apl l2i i2l f2i i2f d2l l2d f2d d2f i2lu l2iu f2iu i2fu (we could also convert directly between i and d, which would be 4 more) divf divd divi divl remi reml cmplt fcmpeq fcmplt fneg fsqrt, fcopysign, fmin, fmax fabs fceil floor ftrunc fnearest fpow fisfinite fisinf fisnan

that's 69 right there.

49 + 69 = 118

we also wanted: pushpc

so that's 119.

also we actually will get a few more opcodes than this because we haven't done our MORE/rtwo trick yet. So we have a few more spots for read, write, poll, etc.

so we can make it!

---

how many extra stack locs should we allocate for Boot to ensure that BootX? can compile to Boot and have 16 stack locs left for BootX? user programs?

we need at least 1 for doing addr mode computations (i think we only need one, assuming we have SWAP available for each stack; so i just added swapsw and swapsp).

i don't think we need 2 for CALL3; the input args go onto the ordinary stacks and take up space

so i think we really only need 1. Of course we may want to add some instructions later that need some, too. Such as software floating point. That sounds like a beast, so who knows how many it needs. Also, doubles will take up two slots on the int stack.

so let's say 6. 3 sounds universal (i tried to lookup how many you actually need for universality but i couldn't immediately find it). 6 is 3 x doubles. So that's 22 stack locs. So on a 64-bit system, we're wasting 6*4 + 6*8 = 72 bytes per VM state to allow for implementation temporaries. Not cheap, but what can ya do. We can reduce it later if we find we don't need it.

---

if the 16-bit encoding does become for efficiency only, maybe it would be time to more seriously consider things like RISC-V's idea with their compressed encoding that you should prioritize using the first few registers, but also duplicating operands (op0 and op2 become identical? but mb not because op2 is often the immediate. So mb op0 and op1). So far with the 16-bit encoding we have 3-address-code (3 operands) and 8 regs throughout, mb it would be good to have encoding format options for this (spend another bit?). If it's just for compression, we won't need all the opcodes anyways.

---

i guess an obvious idea is to get rid of some of the large immediates, to make room for a subencoding in which you have 4 bits per operand, but only 2 operands instead of 3:

1 length encoding bit, 1 R-format identifier bit, then:

R format: 5 opcode bits, then 3x3 operand bits
or, 1 immediate vs 2-op bit, then:
- for large immediate format, 2 opcode bits, either 11 immediate bits, or 4 operand bits and 8 immediate bits
- for 2-op format, 5 opcode bits, 2x4 operand bits

note that this would eat up 4 more R format opcodes (because the I format opcode is now only 2 bits, not 3)

also an interesting thing is that, with 4 operand bits, now the 'rtwo' instruction can access 16 instructions instead of 8! so we have more instructions to work with.

---

maybe we should make Boot that revised 16-bit format?

it's not THAT much more complex than dealing with the 32-bit format, and it's half the size.

otoh if we really want the 16-bit format to be a COMPRESSED encoding, we might want to use a different set of instructions later to compensate for the lack of addr modes (e.g. some instructions with extra immediate fields).

otoh that would make decoding and analysis more complex b/c now the opcodes wouldnt match between the 16-bit and 32-bit format.

otoh they don't quite match anyways because of the I format opcodes? i guess we could just put those at the end, though.

---

yeah, y'know, if we made Boot the 32-bit encoding there are actually a few probs that make it just as complex:

even though there are no addr modes, we want the encoding to match what it will be in BootX?, when there are addr modes. So, for those instructions with an immediate operand, we want to fill in the addr mode bits to indicate immediate addr mode, even though the decoder can ignore those bits and just hardcode the addr mode. Which means that the encoder at least DOES have to know about addr modes, at least a little.
Either we resign ourselves to only having 12 bit immediates, or we have 21 bit immediates but we have a special immediate format for some opcodes. We want the latter, obviously. But this adds complexity similar to the I-format in 16-bits.

so actually, just stick with the 16-bit encoding. Yay, i don't have to throw out that work.

---

so what would the UNPACKED encoding look like? By which i mean, separate conceptual fields are separate, each field is at least 8-bits and is a power-of-two data type, and each field is large enough to hold its largest possible conceptual value? That is, like the C struct that a non-peephole, not-too-fancy code optimizer would probably have.

If we relax the 'large enough to hold its largest possible conceptual value' requirement then we fit nicely into 64-bits:

encoding length format bit(s): 1 byte (maybe not included)
opcode: 1 byte
opcode addr mode/call3 bit and type: 1 byte
operand 2: 1 byte
operand 2 addr mode: 1 byte
operand 1: 1 byte
operand 1 addr mode: 1 byte
operand 0: 1 byte
operand 0 addr mode: 1 byte

So, 8 bytes if we don't include the encoding length format bits. Of course, the optimizer's struct would probably add its own metadata, like at what position in the instruction stream is this (or possibly just a (probably 64-bit) pointer to that position).

But with the 'large enough to hold its largest possible conceptual value' requirement, operand 0 could hold a 21-bit value, operand 1 could hold a 14-bit value, and operand 2 could hold 7-bit value, so:

encoding length format bit(s): 1 byte (maybe not included)
opcode: 1 byte
opcode addr mode/call3 bit and type: 1 byte
operand 2: 1 byte
operand 2 addr mode: 1 byte
operand 1: 2 bytes
operand 1 addr mode: 1 byte
operand 0: 4 bytes
operand 0 addr mode: 1 byte

So now we're at 12 bytes (without encoding bits).

---

so what would OVM look like?

the first key choice is: is this really a traditional VM (with linear bytecode), or some kind of AST with treewalking? (or something else)

the second key choice if this has linear bytecode is: do we want to support e.g. 64k local variables at this point, or do we still have more draconian 'limits', e.g. limit the number of variables to 2^12 (4k).

I think that whatever limits we have for the OVM should be the same in Oot itself, because the OVM implementation is supposed to take care of all of the annoying 'administrative stuff', and register allocation fits within that. That is to say, the layers above OVM (like the translation from Oot Core to OVM) shouldn't have to do any register allocation.

The argument for 64k limits is that, while 4k seems like a lot, if you think about it there are a few times when it is exceeded (such as # of symbols in a module). And we imagine that intermediate representations used by an optimizer might want to exceed it even more. Also sometimes machine-generated code will exceed this, which will annoy Oot users. Also, with 64k limits the 'unpacked' representation can be almost the same as the 'packed' representation.

The argument for 4k limits is that it makes OVM instructions 16 bits shorter.

if we have 64k limits then we still have to decide on the limits for immediate formats that squash multiple operands together; should the maximum immediate operand by 32-bits or 64-bits?

if we have 64k limits with a 64-bit maximum immediate operand then we're looking at:

encoding length format bit(s): 1 byte (maybe not included)
opcode: 2 bytes
opcode addr mode/call3 bit and type: 1 byte
operand 2: 2 bytes
operand 2 addr mode: 1 byte
operand 1: 4 bytes
operand 1 addr mode: 1 byte
operand 0: 8 bytes
operand 0 addr mode: 1 byte

total: 20 or 21 bytes (w/o vs w encoding length format) note: the opcode and operands themselves are exactly 16 bytes here

if we have 64k limits with a 32-bit maximum immediate we have:

total: 16 or 17 bytes note: the opcode and operands themselves are exactly 12 bytes here

so the maximum 32-bit immediate is pretty attractive in this case, as it allows us to fit the unpacked format in exactly 2 64-bit aligned fields. I'm leaning towards that.

This means that Oot code that wants to give a 64-bit immediate operand will be translated into 2 OVM instructions, and also that (most of) OVM's optimizations won't be able to 'see into' that.

Does that mean we can't have floating-point constant folding for f64? No, we can still have instructions representing floating-point constants of f64 with 32-bit immediate operands; it just means that if a floating-point constant is greater than what can fit in an f32, then constant folding will stop at that point. For the few people for whom this is important, fancy optimizers can use a different representation, but we want to choose our formats based on making unfancy optimization better.

note that there is no space for encoding length format bits here. So this isn't a BootX? LARGE format, this is a totally different representation.

So the candidate OVM instruction representation is:

opcode: 2 bytes
operand 2: 2 bytes
operand 1: 4 bytes
operand 0: 8 bytes
opcode addr mode/call3 bit and type: 1 byte
operand 2 addr mode: 1 byte
operand 1 addr mode: 1 byte
operand 0 addr mode: 1 byte

---

having duplicate-operand format in the 16-bit encoding brings up trouble for 'CP' if we want to copy in R format. Those are very important, so might have to burn a 3-operand opcode here? Alternately, could get rid of the rule that the opcodes are the same for R- and O-format. Same for:

so, i guess we should actually just have two sets of opcodes. oh well.

so we need to find space for 6 more instructions. what requires 3 operands? cas. beq beqp ble require 3 operands but we'd like to replace those with skips. can also get rid of fence ann b/c we only need one of them. can get rid of bne b/c you don't need both bne and beq. So getting rid of 4 (cas bne fence ann) and adding 6 (and changing 3 to skips: beq beqp ble). So we need to find two more to get rid of.

our i8s can't apply to the last 16 registers, so we're already without li. If you add that, you either have to take away one of the i8s or i5s (e.g. jk or sai), or you have to take away a whole bunch of instructions.

ugh more complexity.. won't this make this platform hard to target? not necessarily; could just tell the compiler that only 8 GPRs are available, and also 8 other 'local variables'. Or even just tell the compiler only that only 8 registers are available.

this suggests that we get rid of computational operations for our other two. e.g. mul, shifts, logical. I think mul and shifts and logical are the least used operations. I think amongst those, mul is the least, xor is the most, shl is sometimes used a lot. So srl sra and or are the least. I think srl and and are used more than sra and or. Sub is pretty unpopular too, and we already have signed ADD...

another route is to say we don't really need those immediate variants like addi appi apwi, so get rid of those. But those are useful for iterating, and a 'storage-only register' sounds like it could make a good 'index register' because you'd park pointers there (for similar reasons i like turning the branches to skips rather than getting rid of them). Could also get rid of some of the smaller bitwidth load/store variants.

in dynamic frequency counts the arithmetic/shifts/logical are less than those things tho. So let's get rid of two of these three: sra or sub. Choose sub b/c it's inessential (this sub can only mutate the source operand, remember; also it can't subtract from pointers). Mb choose OR.

i dunno man. we're basically saying that a code generator from BootX? to Boot has to replace some instructions with sequences of instructions if they use the wrong register. So beq $10 $11 #3 becomes cpy $s 10, cpy $s 11, beq $s $s #3. sub $10 $11 $12 becomes cpy $s $11, cpy $s $12, sub $s $s $s, cpy $10 $s. The skips are optimizations that a code generator dosen't have to use. I guess we're having to replace single instructions with sequences anyways in the case of addr modes. So i guess this is fine.

Which reminds me; each instruction has 4 addr modes, not 1, so might we need more than 6 stack temporary spaces? Let's see; the worst one is probably CALL3 with all addr modes. To make the call at the end, we'll need two slots for the arguments, plus one for the address; in addition we need temporaries for calculation. But we can do these cals in sequence. So 6 is probably enough; 3 for call address and two arguments to the call; and then 3 more for intermediate calculation of the operands.

---

a more serious concern; the R-format variants of instructions can't be handled by just a uniform wrapper function; because sometimes when there is an immediate (as in the loads, where there is an immediate offset) the immediate should be set to zero, whereas other times (such as in ADDI), the immediate is still present and the other two arguments are unified.

so now this is getting more complicated than just a variant of the 32-bit encoding.

so i guess we should just provide the cp cpp lsi ssi lsp ssp in R-format, and leave the rest out of Boot. Most of the R-format becomes just a BootX? optimization.

Y'know, that actually makes it much simpler to implement in x86_64 assembly using my only-lowest-8-regs-in-registers idea.

---

one remaining issue is that there is nothing to force the Boot implementor to make room for 64-bit registers anywhere, so they might still have trouble moving up to BootX? in that sense. Should we make them make room, so that they don't have to add state when moving to BootX?, or do we not care (b/c the 64-bit registers are conceptually sort of separate anyhow), or do we NOT want to make them mix, b/c we don't want to make ppl deal with telling foreign backend optimizers such as LLVM about how when you use int64 register X you also occupy int32 register Y?

one option is to take the middle ground: pretend that the registers become 64-bit and can hold floating-point. Advantages:

on actual 64-bit architectures, this means that we don't waste two BootX? registers for every 1 64-bit value
on 32-bit architectures, this admits a simple implementation technique where each 64-bit quantity in a register is replaced by a pointer to 2 bytes in memory
on architectures that have combined FP and integer registers, we make it easy to keep track of which registers are in use

the disadvantages are:

on architectures with separate FP registers, we waste a BootX? integer register for each FP value being held
on 32-bit architectures, this makes it hard do use an implementation technique where two adjacent 32-bit registers are virtually combined into one BootX? 64-bit register
programs could try to coerce these values without converting them, which will cause implementation-dependent behavior

i think it's worth it. Seems like the simplest thing.

one other issue though: if a platform does use separate FP registers (as i expect that most will), then an integer write to a register currently holding an fp value won't actually erase it, which may present security concerns. We'd have to specify that, but even so, compilers probably won't bother (and this complicates the semantics).

but, the same issue applies to implementations that have actual separate 64-bit registers.

also, even if in theory we have less state with shared Boot registers, on platforms with separate fp register banks this actually makes implementation more complicated, because the implementation has to remember whether each register currently contains an FP value or not, and check that each time the register is read from.

hmm...

i dunno man, let's have a separate fp register bank then. That's what most CPUs do anyways. We can always implement it as a register bank in memory on 32-bit processors. It makes implementation simpler, even if it does increase state.

now what about 64-bit registers?

i guess most implementations will not have separate banks of them. So we'll keep them in the registers like we said before.

---

another issue is jump ranges. code expands as we compile from BootX? to Boot, and it's possible that something that starts out in jump range could end up outside of it. I haven't computed the maximal amount of expansion, so we don't have a simple way of being sure that our BootX? jump ranges are compilable.

one thing to note: if Boot is 16-bit and BootX? is 32-bit, then because the conditional branch max offset is larger in BootX? than in Boot, sometimes a conditional branch directly to code will have to be replaced by a branch to a jump to the code, which inflates code a little bit (at least that jump, plus the jump for the code that used to be just before the code being moved). If they are both 32-bit then no problem.

i think this argues for a 32-bit Boot, to make compilation simpler; although the problem really isn't solved either way, because we still have to deal with the fact that each instruction could be expanded into many instructions, so either we have to either greatly restrict jump ranges in BootX?, or we have to deal with this in the compiler anyways.

And compiling from BootX? to Boot is supposed to be a program written once by us, not written every time by every porter.

---

in my head i've been thinking about an amd64 (x86_64) implementation and i should set it down in writing:

the lowest 8 registers of each register bank are stored in the 16 x86_64 registers. The upper registers, and the stacks, are stored in memory. The special registers are stored as follows:

$0: doesn't need to be stored. The corresponding platform register is available for compiler/Boot implementation use. $s: stores the current intstack depth (as an integer) $t: stores the intstack TOS

&0: holds a reference to the global page &s: stores the current ptrstack depth (as an integer) &t: stores the ptrstack TOS

Pointers to the stacks themselves, and pointers to the upper register banks and floating-point register bank, don't need to be stored, because these stacks and register banks are found at fixed offsets from the global page pointer &0. This is possible because all of the global page, the stacks, and the register banks are of fixed sizes. The stacks are ring buffers (circular buffers) of size 15 (each TOS is stored only in its TOS register).

If the compiler/implementation needs 2 more registers for its own use, it can choose to not bother caching the stack TOSs.

---

so currently still aiming at a 16-bit Boot and a 32-bit BootX?. Why not a 32-bit Boot?

i notice that i am still learning things (e.g. having 8 GPR registers in each bank, e.g. which instructions to have) about Boot from considering constraints imposed by the 16-bit format
seems like you need a compile step from BootX? to Boot in any case, so you have to deal with expanding code and what to do about jump ranges which can't be represented in Boot in any case

---

i guess if we said that any BootX? instruction could be compiled to 8 Boot instructions, then we could reduce the jump range for BootX? by 3 bits (from signed 11 to 8). But i think it might take more than 8 in some cases; if some of the addressing modes take 2 instructions to load, and we're using CALL3, then that's 4 things that need 2 instructions, and then we have the actual instruction at the end, which would be 9 total. Otoh if all of the addressing modes take only 1 instruction to load then we can do it under 8. So let's say 3 or 4 bits.

So if we limit jumps to 11 - 4 = 7 signed = 6 unsigned, that's a jump range of only +-64. Seems like we'd definitely need to use longjump (jk) functionality for even small programs.

So maybe instead of having a jump range restriction we should mandate long jump functionality, and let the BootX? compiler decide what is a long jump and what is a short jump?

Which suggests that we specify how the long jumps are encoded in the binary. I think the simplest thing is to put a jump table close to the beginning of the program. How about this: we leave one instruction at the beginning for a 'jump' instruction to the real code, then there's a jump table. Actually, better; the first entry in the jump table specifies the actual entry point; the VM starts execution at that code location without having to dispatch an initial jump.

actually, lets make it the second entry in the table specifies the entry point; that way the constant pointer table is the same as the jump table and the first entry in the constant pointer table can be a pointer to the global page.

---

we don't need duplicates of one- and two- operand instructions over both the P-type and R-type Boot 16-bit instructions; e.g. we only need 'cp' in one of them

---

(later: copied to ootOvmNotes1)

so i think some stuff that OVM should implement is:

tail calls
management of activation frames/stack frames (in some way) (e.g. save them on the heap when they need to escape for closures)
- so do we introduce the confining stack abstraction here? i think so but am not sure
"a 'closure' instruction which handles all of the issues related to nested lexical scoping, and the management of activation frames" or something
pickling (see SEAM and their graph store (SEAM Abstract Store) for purposes of GC and pickling; for each bit of code they store both "Alice Abstract Code (so that it can be pickled), and also as compiled code (either compiled to bytecode or to native code). Compiled code is represented as binary 'chunk' nodes in the SEAM Abstract Store; since these are opaque to SEAM, the garbage collector can't tell which other nodes they reference, therefore binary code is wrapped by a pair node which contains (points to) both a binary chunk node and also to an "immediate environment" which is a node holding references to other nodes that are referred to by the binary code. Binary code is position-independent in anticipation of being moved around by the garbage collector.")
- note: this layer can't by itself pickle functions, although it offers facilities for other layers to do so: because that requires some sort of programming language representation -- however, it can be extended to custom node types, so the HLL can define some sort of bytecode or AST representation, and the OVM backend/platform can define native code blobs; so, assuming that the HLL did that, functions can be pickled
- note: one of the important functions here is to serialize the (probably cyclic) graph of pointers in stuff in memory
GC/memory management, read/write barriers
- ownership, e.g. distinction b/t 'move' and 'copy'
- aliases, value types vs reference types ?
- finalizers, 'resources' (in the SEAM sense of wrapping of external objects that cannot be fully reified inside the VM's object graph)
register assignment (on the backend)
arrays, aggregates, dicts, linked lists/S-expressions
- and interop with native data structures
something like futures (see SEAM transients)
concurrency spawning, synchronization and communication
- see L4
- see Golang
- see Erlang
scheduler (note: interacts with I/O because knows not to reschedule a thread waiting on I/O until the I/O completes; also must support nonblocking I/O)
- see L4
- see Python event loops
- since the HLL implementation on the layer above is trusted, we can use cooperative multitasking; the HLL must implement pre-emption if desired
bounds-checking
capabilities/sandboxing
- however, the HLL implementation on the layer above is trusted; this is a service to make it easy for the HLL to implement capabilities, not a sandbox of the HLL implementation itself
virtualization
calling conventions
- interop; glue with calling to/from platform, and with platform globals and other stuff
do we need to support any OOP stuff?
polymorphism? parameteric or adhoc?
expanded set of primitives (e.g. map, reduce); see ootAssemblyOps*
first class functions? suspended computations (is this different from futures)?
package management/binary module stuff
- dynamic loading/importing/linking/hotpatching of modules (mb call them 'components')
runtime
- constant tables
parsing and/or regexps?
direct compilation possible even to some platforms that Boot can only be interpreted on due to its use of unrestricted indirect jumps to implement a call stack (provided that these platforms provide enough flexibility in terms of stuff like tail calls, first-class continuations, closures, lookup tables, activation records on the stack, etc to do the sort of control flow that we need)
i guess it has to have structured control flow constructs so that it can support transpilation to native platforms. Should look at Haxe ( https://community.haxe.org/t/how-to-implement-new-haxe-target/1599/4 ).
copy-on-write

todo should also read more about the OOP parts of JVM and CIL. todo should also read about Cmm/C-- runtime support todo update Oot todos

as you can see, the purpose of OVM is coming into focus:

services e.g. GC that i think are annoying/administrative rather than 'cool semantics'
- note; these services should be able to be overridden by their platform-native equivalents (e.g. in Python or Java, use Python's or Java's GC if possible). Note that this means that an Oot program can't depend on their exact semantics, since this will vary with platform
code generation stuff that i think is annoying (eg register allocation)
primitives
maybe some optimization?
interop

---

Three kinds of I/o: shared memory (even if writeonly or read-only on each side), msgs, streams

---

some old notes:

I guess vanilla boot should have the two banks of eight registers, so that we can easily compile to platform assembly on for example x86_64

We need a 'library' opeco that rather than having a fixed set of functions, calls a given function pointer using the platform ABI

---

why not use the following existing platforms:

LLVM: too complicated to port
webassembly (WASM): almost perfect, but too complicated to target and also to port due to structured control flow
ARM: too complicated to port
CIL: too complicated to port
JVM: a little too complicated to port
LuaJIT?2: too high level
RISC-V: almost perfect but see below

RISC-V looks pretty simple. The encoding is slightly more complicated than we would like, but that's okay. There's only a few issues with it:

the semantics is that pointers are just integers -- we would have to use our own non-standard semantics to forbid this
similarly, we'd want to add instructions to load/store pointers, and to get the sizeof pointers
although the base instruction set is simple, the floating-point extension is too featureful -- many of our target platforms don't support rounding and exception modes, or classification to the extent that RISC-V does

---

zig has facilities for making dealing with errors easier, without using exceptions.

zig has facilities to replace C's preprocessor with something that compiles faster etc

---

yknow although it's good to finish the 16-bit Boot format for learning (to make sure everything can fit), ultimately i think even the little bit of encode/decode complexity in there is too much for it to be 'dead simple'. We should ultimately change to an 'mostly unpacked' 32-bit format:

opcode: 1 byte
op0: 1 byte
op1: 1 byte
op2: 1 byte

16-bit operands (i11 format) will still have to be represented using more than one of the op fields.

Note that there is no format length encoding bit(s) so this format is not binary compatible with BootX? at all.

Although i guess we could smuggle some format length encoding bits into the opcode field? Not sure what the point is though, as i can't see this ever getting mixed with the BootX? formats in the same bytestream (in the sense that one instruction is 32-bit BootX?, the next instruction is this format) for any reason.

Note that the 16-bit limits will remain (only 8 registers except for cp and l/ss* cmds, immediate range of 8, etc)

PC offsets should be unchanged from the 16-bit encoding, which means that PC offsets should be multiplied by 2 (if interpreted in terms of bytes, b/c now instrs are 4 bytes instead of 2), or divided by 2 (if interpreted in terms of instructions)

the 2 I11-type instructions should still be treated differently. Should they be the first two instructions? Probably, yes. Should the I8s be next, for good measure?

should we reserve ordered portions of the opcode space for what will become I-type (4 opcodes) and P-type (32+8+8+8 = 56) and R-type (32 + 16 + 16 = 64)? So we could have I-type, plus 4 reserved instructions, plus P-type (64 so far), plus R-type (128 so far)

also, should the branch instructions all go in a certain part of opcode space (so that you can deal with their special immediate offset coding all at the same time by checking for a certain opcode range)?

---

here's a reason not to make Boot 32-bits:

What if code dynamically computes a code pointer by adding an offset to the PC, then using the JY instruction to jump to that address?

In Boot we could just have an INSTRUCTION_SIZE in SYSINFO. But in BootX? the code will rely upon the relative sizes of 8-bit, 16-bit, and 32-bit instructions.

If we say that dynamic computation of code pointers is not allowed, that makes it impossible to do PC-relative jumps to farther away than the provided jump immediate instructions.

... on the other hand, it would be hard to compile BootX? code that computes such dynamic PC-relative jumps into Boot code anyways, because the compiler needs to know the jump target as a label since it will be changing the distances between various pieces of code. And if the compiler can label the jump target, then it can recompute the offsets correctly.

Btw this provides a strong argument for a tablejump instruction, because such an instruction is a form of indirect jump that the compiler can easily recompile to different instruction sizes; we probably want to rule out general unlabeled computed indirect jumps for this reason, i guess i see why now.

also i guess i had forgotten, but it's hard to compile unlabeled computed indirect jumps to platforms that require labels. So mb we should forbid that sort of thing anyhow? No, actually, this is a difficult topic that touches on the heartof the reason for Boot in the first place: to escape the strictures of control flow in various high-level platforms:

These platforms which require possible jump targets to be labeled still make use of unlabeled indirect jumping in the form of a call stack. But we are at a lower level than a call-stack here. In order to be able to implement a call stack, we need unrestricted indirect jumps. Which means that there can't be a direct compilation from Boot to many of those platforms; Boot can only be interpreted on those platforms (although OVM can probably be compiled directly to them).

Now, we don't necessarily need indirect jumps to a code address that has been manipulated by pointer arithmetic. So the issue of instruction sizes is somewhat separate. But we can't just get rid of the JY instruction completely; the most we could do is specify that pointer arithmetic on code pointers is invalid (thereby ruling out use of jump tables).

So i guess my conclusions are:

allow indirect jumps including pointer arithmetic on code pointers
in Boot, have INSTRUCTION_SIZE in SYSINFO
in BootX?, have INSTRUCTION_8_SIZE, INSTRUCTION_16_SIZE, INSTRUCTION_32_SIZE in SYSINFO

---

a defining property of Boot is turning out to be the bifurcation between integers and pointers. I think i should rename the language Duo, or Biplane, or Gemini.

or Bicycle or t42 or bivalve

i guess bivalve is best because it's a Bi VM; bivm, bivasm

---

Mb 4 kinds of ANNotations: ascii comments, source maps and identifiers and labels, types and stack maps and other standardized dedications, implementation dependent and reserved for run time

---

yknow, for transpilation an AST-based OVM would be best. But for easy interpreter-with-interop (interop with the platforms's data structures and GC), a linear bytecode-based OVM would be best. So perhaps we need to expand the OVM layer into two layers.

---

yknow, having written Boot, i'm thinking of just merging Boot and BootX?. I'm guessing that implementing BootX? will only take about 2x as much labor as implementing Boot. So it may not be worth the ecosystem complexity (and i'm feeling like it will be much easier to implement BootX? than to implement that BootX?->Boot compiler!). Also most of the additional opcodes are stuff which are easy to implement on platforms that support them at all, such as floating-point and 64-bit. And honestly, i'm not going to implement soft-fp and soft-64-bit anytime soon anyhow.

otoh, once we get into using the MORE trick on the 32-bit representation, we can fit a bunch more instructions... which suggests adding more complicated things like arrays and structs and maybe strings and dicts, in which case we do probably need a separation between Boot and BootX?.

otoh, while that's a noble goal, another goal is just to finish the project soon, and it would be quicker to just start working on a full BootX? implementation at this point, and stop worrying about Boot for now (since BootX?, and even OVM, can and will be directly implemented on an underlying platform, for the most part). Perhaps we could come back to Boot later, rather than evolve them simultaneously.

i should probably get Boot to a stopping place first -- test it, document it

---

yknow, if indirect jump targets are indeed dynamically generated, with no labels, it's going to be very hard to compile BootX? to Boot perfectly in all cases. Either we should unify Boot and BootX?, or BootX? should require all jump targets to be labeled.

i guess the fact that this difficulty even exists for BootX?->Boot argues strongly for requiring labels in BootX?; as i've said before, it's going to be hard to compile to any non-processor-ISA target without required labels.

i guess this also means that Boot CAN'T be easily compiled to x86, because we don't know how to map computed indirect jumps.

should look at the 4 strategies i enumerated in plChAssemblyLanguagePrincipals for dealing with this:

jump targets that are a snapshot of the PC sometime in the past (eg function call return addresses)
enumerated targets for each indirect jump
higher-level primitives for common cases e.g. jump tables being indexed into
a global/shared pool of 'labeled targets' which can be a target of any non-enumerated indirect jump

---

so in OVM i'm thinking about making the GC/refcounting implicit. Is that what Python does with its stack-based bytecode?

Yes:

" >>> def test(): foo = 5

>>> import dis >>> dis.dis(test) 2 0 LOAD_CONST 1 (5) 3 STORE_FAST 0 (foo) 6 LOAD_CONST 0 (None) 9 RETURN_VALUE ... Note though, that this does not increase the reference count of that value. That’s what happens with LOAD_CONST, or any other byte code instruction that gets a value from somewhere:

TARGET(LOAD_CONST) { x = GETITEM(consts, oparg); Py_INCREF(x); PUSH(x); FAST_DISPATCH(); } "

[4]

---

Guile (Scheme)'s VM appears to place the procedure being called in the first argument slot on the stack when calling any procedure (presumably for efficient recursion?) [5] [6]

interesting idea.

---

so i guess the minimum state per thread is around:

16 registers * 2 banks = 32 words for regs
16 smallstacks * 2 banks = 32 words for smallstacks
16 global page words

16*5 = 80 words

+ a few words for state not in the registers, e.g. the PC

Sounds like a lot. What does Erlang do? Oh, actually they have even more: 309 words[7] (although it's not really comparable b/c Erlang processes also get their own heap area)

So actually we might still be sufficiently lean.

---

frame pointer in calling convention?

---

so i guess the OVM instruction format is looking like:

3 x (16-bit operands) + 1 x (32-bit operand (larger operand for 32-bit immediates)) + 4 x (8-bit addr modes) = 14 bytes?

let's add another operand (so 1 opcode and 4 operands) so that we can do e.g. CAS easily. MB this can double as a generic polymorphism type operand sometimes. So:

4 x (16-bit operands) + 1 x (32-bit operand) + 4 x (8-bit addr modes) = 16 bytes

we're short one addr mode... that's probably ok tho, either the 4th operand (which is rarely used) or the opcode can do without an addr mode (e.g. we used to use the opcode addr mode to do CALL3, but now we could just have a CALL3 opcode and do that with the 4th operand). Probably the opcode should do without.

---

the best way i can think of to make Boot compilable is to start defining annotations.

Specify that each JI instruction must be immediate followed by an annotation. The choices for this annotation are:

totally unconstrained (makes Boot uncompilable unless the compiler is willing to make a dense map for all of the Boot program code addresses to the entire compiled code and put that in the target)
only jumps either a global label or one of the following enumerated local labels (and now we have to have more ANNs to list the enumerated local labels) (and we also have to define label annotations now; probably define two types, local lables @ and global labels @@ which go in the constant table)
only jumps to some address which is acquired at runtime by either a PUSHPC or a callback given to use by foreign code, and which was never added to or subtracted from
only jumps into the following jump table (?jump table instruction?)
only jumps into any of the previous three

i dunno man this is getting pretty complicated for little old Boot. All the compiler really needs to know is the complete set of locations that might be indirectly jumped to anywhere in the code without acquiring the target address at runtime (that is, it needs all the labels), then it can replace each JI with a much slower command that does a lookup into that table.

even if we just do that though, should we require ANN instructions in the program code next to the jump targets (slows the program down), or just provide the compiler with a table of all of them at the beginning (more complicated file encoding)? And shouldn't we have a jump table instruction?

Let's look at the form of the only indirect jump instruction, JY:

  jy &targetbase #imm6: dYnamic (indirect) jump

This instruction is pretty wacky/powerful. It can index into a dynamically chosen jump table, but the index into the table is static/immediate. So this does not replace a jump table instruction, which i define as a dynamic index into a static/immediate table.

Note that the idea of indexing into a dynamically-provided jump table would require informing the compiler that each of the entries in the table are a table (unless the table is dynamically provided by foreign code, in which case our Boot program had better understand the platform conventions for code size). However, since the table itself is dynamically proveded, this sort of thing alone would not save us from having to do a slow lookup at runtime.

A little unrelated, but note also that if a program WRITES BOOT CODE AT RUNTIME INTO MEMORY and then JUMPS TO THE CODE IT WROTE, we can't do that at all without an interpreter.

i think maybe what we need to do is introduce the concept of a list of source addresses that might be jumped to (a jump table); there can be multiple such lists in the program; the list of global labels is one such list; but there can be others too. Each such list is in the file somewhere (near or in the constant table i guess). Each JY must be annotated with which of these tables it uses, or that it uses a dynamically provided address (from PUSHPC or the like). Even if lookup is required, then, this ensures that it's kinda quick. But then we may as well go the extra mile and have a jump table instruction. Also, may as well allow for the possibility of a totally unconstrained JY (for debuggers and the like).

ok here's what i think we need:

JY: get rid of the immediate offset. JY shall only be for runtime-provided addresses (PUSHPC and foreign callbacks)
JK: separate the table of globally labeled/constant addresses from the other constants, so that the compiler knows it needs to map these
new instruction: JT: jump table $reg: index into jump table to immediately follow, first table size, then listed in terms of PC-relative jump from the JT instruction. Jump past table size falls thru to first instruction after table. Consider putting immediate tablesize into JT instruction? That would greatly limit the size of the table tho, as well as consume another 3-operand instruction

ok i did that

---

note that the higher-level VMs avoid pushing the complexity of calling conventions onto the user by simply providing CALL instructions that provide each called function with a fresh set of registers (and presumably implementing calling conventions on the underlying platform within the implementation of the CALL instruction).

---

still going back and forth about the xcall calling convention. Should it use the registers or smallstack or both?

arguments for using registers:

Boot needs a calling convention for registers anyways. Why not make it the same as the xcall convention?
using the smallstack requires keeping some stack locs free just for this in any leaf node that may need to xcall
the smallstack could be reserved for 'toolchain-y' things
this may make xcall on implementations that compile to a platform register machine (like x86 assembly) more efficient -- which is the time you most want it to be efficient
- it's dubious that it will actually be significantly more efficient, though. The platform probably has a different calling convention and probably has its own stack anyway, so xcall probably still has to move a bunch of registers around and push many of them to some internal stack regardless
the register calling convention is composable (not sure why this would matter tho)
you don't want to be using a variable number of smallstack slots depending on how many arguments you are passing, which means that if you use smallstack you can only pass about 2 arguments anyways before resorting to the variadic 'rest' case.

arguments for using smallstack:

simpler API
- special registers are dumb
on platforms that are not being compiled to a register machine, the effort by Boot programs to save registers in order to free them up for use by an xcall will be wasted

i'm leaning towards registers, just because with smallstack we wouldn't want to pass too many arguments before resorting to the variadic 'rest' case..

---

there's a problem with the variadic 'rest' case though. Who deallocates that memory? If we want to be completely agnostic about whether that memory is on a stack or is malloc'd, then the caller must deallocate, because the callee doesn't know how to deallocate (pop a stack pointer, or mfree?). However, this is bad for tail calls! For tail calls, we want callee cleanup, because control will never return to the caller.

i guess you could still do tail calls this way but with a 'wrapper' function around the tail-calling function that sets up and tears-down its memory.

Perhaps a simpler solution is just to say that these registers are callee-saved, as usual for argument-passing registers, and if the caller needs to deallocate, it must duplicate those pointers in other caller-saved registers before the call.

---

and for xcall/xpostcall, should there be any caller-saved registers, or should we guarantee callee-saving of everything? Caller-saving is a waste in many implementations because we'll just store the whole register file and external code won't touch it. Otoh if we just use the same convention which is suited for actual Boot code, we retain the option of doing xcalls as normal calls into other Boot code if we need to for some reason. I guess that's worth something.

---

ok i think i decided. Register argument passing calling convention. CallER?-cleanup for the extra registers, because we are agnostic about call stacks at this level. I updated boot_reference accordingly.

---

something still to be decided: what to put in the R- instruction slots, which can access all 16 registers (as opposed to the P-instruction slots, which can only access 8).

I ran out of room in P-1 so i moved HALT to R-1 (since HALT is 'syscally')

also, now that we have room again, should put back those other syscalls, eg POLL

---

an alternative would be to provide alternatives to xcall:

xcall0..5 for specific number of arguments/registers (same number for both integers and pointers)
xcall0..5_0..5 for specific number of arguments/registers (different numbers for integers and pointers)
xcalls0..5 and xcalls0..5_0..5 for smallstack variants (note: we also need to indicate return value types/counts tho)

or maybe just at least the 0, 1, and 2-argument smallstack variants: xcalls_i_ii, xcalls_p_ii, xcalls_i_ip, xcalls_i_pp, etc. There are 8 such 2-argument variants, 2 such 0-argument variants (based on return value type), and 4 such 1-argument variants, for 14 such instructions total.

this has the advantage of allowing the program to xcall without saving any registers, and also of providing Boot primitives for each of the 8 CALL3 types/cases in BootX?.

we could also have another calling convention to mimic what happens what a Boot program itself is run; all registers except r3 are caller-saved (that is, available for the Boot "callee"), and an integer value from any register is returned.

ok, i did that.

---

thinking about my new r12 calling convention, in which r3 is reserved for a stack pointer or somesuch (so that the caller has SOMEwhere to store their data), makes me think maybe we should just specify r3 as a memory stack pointer after all.

A stack pointer is useless if it might have remaining capacity 0, so we could also specify that it has capacity at least 16 of the larger of 32-bit integers or pointers.

And that makes me realize that my planned displacement r3 mode in BootX? is pretty useless because an immediate displacement would be in what units (since we don't know the size of either 32-bit integers or pointers)?

I think the only sensible way to do that would be to go back to a split-mode, where, out of the 4 operand bits, 2 of them are for a count 'x' of 32-bit integers (so between 0 and 3), and two of them are for a count 'y' of pointers (so bettween 0 and 3), and then the displacement is = x*INT_SIZE + y*PTR_SIZE. So now our displacement mode can only always address up to 3 items into the memstack (and can sometimes go up to 6), which isn't so hot, and is much less useful than 16 items.

might be more useful to have an index address mode that specifies an integer register with the operand, and then adds the signed contents of that register to the contents of ptr-r3 to get the effective address.

heh heh.. or mb a whole suite of 16 special addr modes that do something split-register-y and fix both registers: &3+$3 &3+$3*INT_SZ &3+$3*PTR_SZ &3+$3*INT_SZ+$4*PTR_SZ &3+$4 &4+$3 &3+INT_SZ &3+PTR_SZ (*&3)+$3*PTR_SZ &3+(*&4) &3+$T &T+$T &T+$3 ...etc... i guess what we want here is combinations of &S, $S, &T, $T, $3, &3, +-INT_SZ, +-PTR_SZ (forget the r4s). hmm... here's 15 such combinations:

&S + $S &S + $T &S + $3 &S + INT_SZ &S + PTR_SZ &T + $S &T + $T &T + $3 &T + INT_SZ &T + PTR_SZ &3 + $S &3 + $T &3 + $3 &3 + INT_SZ &3 + PTR_SZ

would be nice to get the -INT_SZ, -PTR_SZs too but we only have one spot left.. actually the INT_SZ, PTR_SZs aren't as important as they seem because we have constant INT_SZ, PTR_SZ (and their negation) in BootX? as immediate constants, so we don't need to do this to do increments and decrements. Without those guys we would have 9 items here (7 spots left to get to 16). How about:

&S + $S &S + $T &S + $3 &S + INT_SZ &S + PTR_SZ &S - INT_SZ &S - PTR_SZ &T + $S &T + $T &T + $3 &T + INT_SZ &T + PTR_SZ &T - INT_SZ &T - PTR_SZ &3 + $S &3 + $T

no, i like having &3 + $3. How about:

&S + $S &S + $T &S + $3 &S + INT_SZ &S + PTR_SZ &S - INT_SZ &S - PTR_SZ &T + $S &T + $T &T + $3 &T + INT_SZ &T + PTR_SZ &T - PTR_SZ &3 + $S &3 + $T &3 + $3

y'know, i think this sort of thing is a good idea, but all the possibilities here just sort of demonstrate that we should just have one RESERVED addressing mode, and then decide what to do with it later via profiling real code.

---

another thing to remember: in OVM we want to be able to run code sandboxed. So yeah we do need capabilities or at least ambient authority, and no, we can't turn off boundary-checking unless you have the capabilities to do so.

---

not sure if the r12 calling convention is really needed but i think it's nice to have something symmetric with the Boot program's 'calling convention' itself. I added text saying that Boot programs should restore &3 to its original content upon exit.

---

also, yknow, its easier to type 'i3' and 'p3' than $i and &3. So maybe call the registers that? although it's easier to read instruction specifications like

lw $dest &addr

than

lw idest paddr

---

if we are doing memory-management within the OVM implementation, are we doing capabilities in there? If so, then how does E do it?

Based on the tiny 'mint' example in [8], it looks like:

OOP, where private methods and functions are really/guaranteed private
first class functions
a 'seal/unseal' mechanism that creates a pair of functions, 'seal'/'unseal'; 'seal' can "encrypt" a references to an object or function such that that "encrypted" reference can be passed around and held by untrusted code, but it cannot be dereferenced until it is "unsealed" using the corresponding unseal function ("unsealer")
static typing

Wikipedia says that Pony was inspired by E, btw. Pony has reference capabilities to enforce Milewski-like access limits on variables for the purpose of safe concurrency (e.g. "Box. This is for references to data that is read-only to you").

Another cool thing that E does is it has both 'call' and 'eventual send' (distributed (message-based?) call, which may get lost and never execute and returns a promise). E talks about how its eventual send and capabilities allow for distributed computation and 'mobile code'. And I recall that AliceVM? and SEAM also talked about 'mobile code'.

i guess if we are putting bounds checking and capabilities in OVM, it's appropriate to have some sort of an object model in there too. So OVM would have bounds-checking, memory-management, capabilities (including seal/unseal, and Pony-like reference capabilities), object model, guaranteed privacy within the object model, promises, eventual send/distributed message-passing, channels, serialization, mb laziness?. This is getting pretty intense, are we sure we don't want to leave some or all of this stuff for Oot Core?

right now, i'm thinking no, this really is all OVM stuff. The reason is that it:

is the sort of thing that we'd want to interoperate with the host language
is the sort of thing that is sufficiently frequently executed at runtime and sufficiently low-level that it could greatly benefit from platform-specific optimizations

Are we just making Oot Core here? No, because the OVM level has to be very un-dynamic (remember the lesson from PyPy?).

i think it's time to create a new series of notes files for OVM. Done.

proj-oot-ootAssemblyNotes24

16*5 = 80 words