proj-oot-ootAssemblyNotes24

Difference between revision 6 and current revision

No diff available.

---

190529

we need to take a hard look at if Boot is meeting its goals: i'm not sure that it's any easier to implement than RISC-V RV32I.

it provides other benefits:

it's better than WASM (no weird structured control flow), simpler than JVM, much simpler than CIL, simpler and lower level than Lua, and simpler than ARM (esp. in instruction encoding). But it may be more complicated than RISC-V. And RISC-V is more popular, so even if it were equal complexity, RISC-V should win.

i used to be scared of RISC-V's instruction encoding variants, but now we have a bunch of those too! And ours are (right now, at least) even less regular than theirs, and also theirs are more thought out and have a bunch of properties that they claim are important for simple, efficient hardware decoding:

Yes, their 32-bit and 64-bit platforms are different not only in pointer size, which makes it harder for us to write code targeting both. But we already have an unknown pointer size in Boot, is it really that much trouble to have distinct 32-bit and 64-bit targets? And it's probably not so much trouble to make it worth it to have to start from scratch toolchain-wise with a new assembly language.

Yes, for our purposes, it's slightly simpler to include the necessary syscalls as opcodes. But again, not worth the complexity of introducting a new assembly language.

the only thing that i think might really be important is pointer opaqueness. I'm imagining interop situations where we are running inside a high-level platform like Python, and we want to interoperate with its data structures. In this context we don't have pointers, we have opaque references. We can store them, but not in integers. We can't do pointer arithmetic on them even if we want to. I'm imagining that we will be accessing struct fields (e.g. objects) by assigning a numerical offset for each field and then adding that offset to the base 'reference's, which is why i include the operation of adding an int to a pointer. (now that i say this, it strikes me that even blt-ptr (bltp) is too much, ok i just took that out).

in older versions of Boot we had systems that made the pointer size totally opaque; the program didn't even have to know it for memory layout b/c memory was accessed in terms of pointer-sized words. Should we go back to that?

In any case, we should at least work on our instruction encoding to make it more regular, more like RISC-V.

i think the takeaways are:

---

ok so thinking a little more, we need to go back a little to the more complex system where memory was a linear sequence of locations of opaque type and size. The reason is that the whole reason that we want totally opaque pointers is in order to use Boot code in situations like when we are running on top of Python; memory we allocate is Python arrays and we access Python objects by assigning a numbering to their fields and then treating the object reference as a pointer and adding integers to that pointer to specify the fields. The ability to do this is why Boot is more useful for interop than RISC-V.

so some considerations:

So, TODO:

---

so far i'm leaning towards/TODO:

---

if our memory locations could literally even be references to consecutive fields in a C struct, then some of those C struct fields can hold i32s, some can't, some can hold pointers, some can't, and this changes for different types of structs. So in that case, either you can to check the size of things at every single location, or the program has to have some knowledge at compile-time.

And if you are relying on the program to have compile-time knowledge about what fits where, then is there anything left to be done? And in that case, are we really any better off than RISC-V?

Perhaps we want the Boot program to have the POSSIBILTY of doing low-level interfacing like this, but to also have a sort of 'general case' memory type that it can do internal computations in, which is more predictable?

here's a few cases to consider:

If we rule out the idiosyncratic memories then things probably become more tractable. This means that memory locations never refer to things like references to consecutive fields in a C struct -- you would treat such things as fields of bytes, instead. But in this case how would we interop with structs on top of a typed HLL language, like Haskell?

---

so, i think that direct interop with external data structure layout is always going to require either program knowledge, or a debilitating amount of runtime introspection. When we say that we want Boot programs to be universal, all that we can really ask for is for the existence of uniform memory for internal data structures -- so e.g. if a hash map library is written in Boot, we want to be able to re-use that same code when running on top of Haskell, on top of Python, or on top of raw assembly. Now again, we can do this at compile time (have constants for the size of ptrs, the sizeof i323s, the size of i16s), or we can do this at runtime (introspection/sysinfo, possibly additional instructions such as addsizeof to make it easier to iterate through memory, although we still need to do other stuff when we need to e.g. compute how many memory locations an array of 100 ptrs takes up).

so, can we/should we just use RISC-V for this? maybe... the semantics we are using are pretty different (pointer arithmetic is invalid) but that may not be a problem. should think further. my instinct is no, the semantics are so different that you don't gain much by using it, so just make something new, it'll confuse ppl less anyways. but that may just be Not Invented Here syndrome.

---

okay i've thought about this a little today, i think it's still worth doing, and here's why:

the other option we are considering is just using RISC-V, but adapted for our case.

this means just specifying that pointers are opaque.

so what happens when you load a pointer? what instruction do you use? We need to add a 'load pointer' instruction, and a 'store pointer' instruction.

and what happens when you add an int to a pointer? what instruction do you use? You could just use 'add'. But then the VM has to check what type is in the registers each type it encounters an 'add', in order to know whether we are adding integers, or adding one pointer and one integer. it would be better to have an add-int-ptr, like Boot.

what happens if you try to do pointer arithmetic? it fails at runtime.

So, we need at least some new instructions (load and store pointer) and could really use some others (add int to pointer). That's already probably enough to make us think about a new language. And pointer registers seem to fit really well too.

we could always do an extension to RISC-V tho...

and there are other minor reasons to go with a new language. We're going to have to redefine the RISC-V semantics a little (no pointer arithmetic), and we're going to have to add the syscall fns. It's more confusing to have an overlay like this than just to make our own new thing.

---

can we make better immediate formats? actually i don't buy RISC-V's idea of splitting the immediates to keep the sign bit in the most-significant-bit. I think it's simpler to keep them together.

---

this 'BootX? should have no runtime requirement over Boot' is getting too expensive. I put in a special 32-to-8 and 8-to-32 cpy (mov) instruction (with its own instruction format) so we could access all 32 BootX? registers from Boot. And i put in extra cpfis, cptis, cpfps, cptps (cpy from int stack, cpy to int stack, cp from pointer stack, cpy to pointer stack) instructions so that we could access deep into the stack without pushing and popping. And i kept the intstack and ptrstack at capacity 8 so that those instructions can access all of it.

But this is getting ridiculous. I'm going to take all that out and assume that BootX? will need extra state and can store it in allocated memory.

---

i used to have this:

Five formats. For each format, from LSB to MSB, the fields are:

In graphical form (from MSB to LSB):

but i'm gonig to change it to consolidate i5, i8, i12 into one format with a 2-bit signed immediate and a 3-bit opcode. The instructions can then add in the other operands themselves. This is slightly less efficient (now JPC only gets an 11-bit signed immediate; also the encoding logic has to split up the immediate), but it's slightly simpler/less scary (b/c 2 less formats), and it has the property that RISC-V likes that the signed bit of the immediate is in the same place everywhere.

i saved the former one in boot_old/pyboot_190529_i5_i8_i12_encoding/encoding.py

later that day:

ok i did that but i think it's more complicated, not less.

i saved it in

../boot_old/pyboot_190530_r_i_only_encoding

Also i reread the RISC-C comments about keeping the sign bit in one place and i think they are talking about when they have branch offset immediates in units of 2; rather than just multiplying the immediate by 2 they do some complicated thing where they swap around the location of bits, but they leave the sign bit in one place, and i think that's what they were talking about.

Regarding keeping the registers in one place, since in the immediate formats some registers are being used as immediates, i think it's fine for those to move, which is what the old format did.

Regarding the simplicity of having only R- and I- formats, i found it was cleaner to add in R2 R1 R0 subformats anyways, and i think the complexity of juggling the immediate bits outweighs not having I5 I8 I12 formats.

So forget about swapping around the sign bit, that's stupid.

But i like having a fixed number of opcode bits for the Is, even at the expense of reducing JPC range by 2x -- it's probably much less efficient in terms of code density, but it's cleaner. So let's do:

---

ok i've mostly implemented Boot now but i havent finished testing it yet. It took me much of yesterday to write the initial encode/decode logic, then today I spent about half a day changing the encoding, and the other half writing the actual VM, and the first draft of the VM is done now but i've only tested the first few instruction. So i think it's likely that this is a two-day job in total (which i guess means a two-weekend job for hobbyists) (although, even though someone else wouldn't have to spend time changing the design of Boot itself, they would have to spend more time reading and understanding the spec i wrote, so maybe it really is a three-day job).

Anyhow, the point is, unlike the previous attempt, this is a success, in that:

it's sufficiently easy to implement!!!

(although i dunno if i'd call it 'dead simple'; i could have left out the I encodings b/c 32-bits is bigger anyways, i could have left out the rtwo rone rzero/MORE instructions, i could have chosen either a register or a stack machine rather than mixing them, i could have had a variable-length encoding that just had whatever data was needed after a one-byte opcode).

However i noticed that i now think that BootX?, as i currently envision it, wouldn't really be that much more complicated to implement, at least not if you don't include all the instructions. I think much of the complexity that i ran into before came from complicated ideas for the addressing modes etc, for example, having 'stack-aware' instructions that take an operand of $S to mean, not auto-push-pop to/from the stack, but rather, do something to the stack itself. Also my CALL3 idea led to difficulty in terms of how the return argument from the instruction being called would get assigned to the register indicated in the destination operand (although as i think about it at this moment, i don't see the problem; just have the RET instruction look at the jump target and if it's right after a CALL3 instruction then look at that instruction to get the return register).

Anyway, so i've simplified the addressing modes, and i don't think this would be a problem anymore. In fact, the encoding might even be easier, since with bigger operands the need for immediate encoding formats is not so important (with a 32-bit encoding you wouldn't need those immediate formats so encoding/decoding gets simpler). Also, now that Boot and BootX? have different numbers of registers, the story of incrementally extending Boot to BootX? is not so good as it was. And because the 16-bit encoding was so cramped (only +-1KiB? direct jumps), i felt the need to let a constant table creep in to Boot, which is complexity we wouldn't need with BootX? b/c we could have a 21-bit immediate constant.

And there was other stuff i was dreaming about with BootX?, about having it be a good emulator and having all these hooks and having capabilites and having polymorphic functions, but now i think that that belongs only in OVM if anywhere. BootX? is just a lowest-common-denominator target platform to make it easier for me to port everything on top of it.

So what about just calling the thing we're calling BootX?, "Boot", and having "Boot0" just be a subset of Boot's instructions (also Boot0 would only use the 32-bit instruction encoding, whereas Boot could have more)?

To be clear, what i currently mean by "Boot" (was: BootX?) is:

32-bit fixed length encoding: 3 encoding length format bits, 6 opcode bits, 2 opcode addr mode bits, 3 x (5 operand bits, 2 addr mode bits)

Boot0 would have the addressing modes (this removes the necessity for us to provide immediate-mode instruction variants like ADDI that are only needed b/c we don't have an immediate constant addr mode), but might not have the instructions supporting any types beyond int and ptr. Boot0 would have the 32x2 registers, plus 32x2 stack items.

to be clear, i am really excited that the 16-bit encoding of Boot seems to be really complete and dense, possibly even denser than anything else out there. And working with those constraints did help me hone my ideas about what is really essential and fundamental at this level of the stack of intermediate languages. But now i have those ideas honed, and the primary remaining goal for Boot is ultimately just to be an easy-to-implement, easy-to-use, portable target, not to win contests about code density and beautiful compactness. If the 32-bit format is easier to compile to than the 16-bit format (b/c of the greater number of registers), and not tons harder to implement, and if we'd have to compile Boot-32 to a Boot-16-only machine (due to greater number of registers in Boot-32), rather than easily extend the Boot 16 machine , then we should just start out implementing the 32-bit one, and consider the 16-bit format just an optimization.

---

if we start out with 32-bits though i'm just bothered by having so much state. If there's 32 registers, times two banks, and 16 stack locs, times two stacks, on a 64-bit platform that's:

32*4 + 32*8 + 16*4 + 16*8 = 576 bytes of state

so how about just going back to 8 addressing modes. Now we have:

16*4 + 16*8 + 16*4 + 16*8 = 384 bytes of state

Still a ton but not quite as excessive.

First three registers are still 0, stack, tos, and still have their special behavior (this is a little redundant, but it make them more useful in the other addr modes, and makes everything more uniform). Except i had a new idea: there should be a 'zero page' of pre-allocated memory, and the zero register in the pointer register bank should point to it, rather than be the PC (for the PC, there is a PC-relative addr mode, see below).

Now we have:

3 encoding length format bits, 4 opcode bits, 3 opcode addr mode bits, 1 opcode bit to control whether or not addr modes are in use or are just more opcodes, 3 x (4 operand bits, 3 addr mode bits)

(16*8 + 16) opcodes = 144

addr modes (note that some of these have two forms, int and ptr, for the two 'banks'):

Should &0 holds a null pointer? Or maybe some platforms won't let us have a null pointer? Let's see:

so... i guess this doesn't decide it? Sounds like on most systems, there is a null pointer so we are good, but there isn't always, but otoh, even on systems where there isn't, we could always allocate one byte somewhere and then declare that to be OUR null pointer.

Regarding which instructions take forced immediate operands (which is important/special because then the 3 addressing mode bits are part of that immediate), we could just start by saying that only that weird bank of 16 opcodes in the 'immediate' mode do this (or, do we want to use 'immediate' mode as a CALL3 to nearby PC-relative locations? i don't think that's very useful...).

---

i dunno, now that i look at all that, i think it does make it more complex after all:

soo... let's just have the encoding so that we can support the addr modes later, but still separate 'Boot' and 'BootX?'.BootX? is still the better name so that programmers won't be intimidated and think they have to do all the addr mode stuff before they can say they implemented the 'real' Boot.

Also, on that note, since we are focusing on simplicity and not code density now, maybe choose between register and stack machine? Maybe choose register machine and just say 'don't use registers $1 or $2 or &1 or &2, they are reserved for BootX?', and then provide special instructions to push and pop from the stacks? No, actually this is an issue, because now compiling BootX? to Boot means the runtime needs to reserve some registers to temporarily store stuff that should be on the stack in BootX?. If we support the stacks, then the whole point is that this sort of thing only requires us to increase the stack capacity instead, which is invisible to the instruction encoding.

i guess another thing i am still thinking about is: if we don't allow stack mode (stack as registers) in Boot, then ppl might implement the stack in a way that does not allow register access to it (if there are no other places available to temporarily store 16 stack locs if we have to pop/push to get at the deeper items. I guess we can solve that by requiring support for Boot instructions that do that. -- ok i added lsi lsp ssi ssp to remind me to do that.

also we should specify that Boot has a few more than 16 locs on each stack, so that the BootX? implementation has somewhere to work.

we should also probably have a pushpc command.

---

So if we are going to have 128 opcodes, does that fit all the other data types that we need?

Let's see, here's what we have right now:

ann lh lhu lb lbu sw sh sb sp add addi sub mul sll sru srs and or xor ap bne blt jd cas fence appi apwi reserved_r_27 reserved_r_28 reserved_r_29 reserved_r_30 rtwo

sysinfo cp cpp lsi lsp ssi ssp rone

halt lkp rzero

break bad1

  1. i5 lw lp beq beqp
  2. i8 li sai
  3. i11 jk jpc

total so far: 49

we gotta add long, float, double variants of stuff, and conversions, and division:

ll ld lf sl sd sf addl addd addf subl subd subf mull muld mulf slll srul srsl andl orl xorl beql bltl beqf bltf beqd bltd (note, these float/double branches use f.p. total ordering, not floating-point comparison) lsl lsf lsd ssl ssf ssd apl l2i i2l f2i i2f d2l l2d f2d d2f i2lu l2iu f2iu i2fu (we could also convert directly between i and d, which would be 4 more) divf divd divi divl remi reml cmplt fcmpeq fcmplt fneg fsqrt, fcopysign, fmin, fmax fabs fceil floor ftrunc fnearest fpow fisfinite fisinf fisnan

that's 69 right there.

49 + 69 = 118

we also wanted: pushpc

so that's 119.

also we actually will get a few more opcodes than this because we haven't done our MORE/rtwo trick yet. So we have a few more spots for read, write, poll, etc.

so we can make it!

---

how many extra stack locs should we allocate for Boot to ensure that BootX? can compile to Boot and have 16 stack locs left for BootX? user programs?

we need at least 1 for doing addr mode computations (i think we only need one, assuming we have SWAP available for each stack; so i just added swapsw and swapsp).

i don't think we need 2 for CALL3; the input args go onto the ordinary stacks and take up space

so i think we really only need 1. Of course we may want to add some instructions later that need some, too. Such as software floating point. That sounds like a beast, so who knows how many it needs. Also, doubles will take up two slots on the int stack.

so let's say 6. 3 sounds universal (i tried to lookup how many you actually need for universality but i couldn't immediately find it). 6 is 3 x doubles. So that's 22 stack locs. So on a 64-bit system, we're wasting 6*4 + 6*8 = 72 bytes per VM state to allow for implementation temporaries. Not cheap, but what can ya do. We can reduce it later if we find we don't need it.

---

if the 16-bit encoding does become for efficiency only, maybe it would be time to more seriously consider things like RISC-V's idea with their compressed encoding that you should prioritize using the first few registers, but also duplicating operands (op0 and op2 become identical? but mb not because op2 is often the immediate. So mb op0 and op1). So far with the 16-bit encoding we have 3-address-code (3 operands) and 8 regs throughout, mb it would be good to have encoding format options for this (spend another bit?). If it's just for compression, we won't need all the opcodes anyways.

---

i guess an obvious idea is to get rid of some of the large immediates, to make room for a subencoding in which you have 4 bits per operand, but only 2 operands instead of 3:

1 length encoding bit, 1 R-format identifier bit, then:

note that this would eat up 4 more R format opcodes (because the I format opcode is now only 2 bits, not 3)

also an interesting thing is that, with 4 operand bits, now the 'rtwo' instruction can access 16 instructions instead of 8! so we have more instructions to work with.

---

maybe we should make Boot that revised 16-bit format?

it's not THAT much more complex than dealing with the 32-bit format, and it's half the size.

otoh if we really want the 16-bit format to be a COMPRESSED encoding, we might want to use a different set of instructions later to compensate for the lack of addr modes (e.g. some instructions with extra immediate fields).

otoh that would make decoding and analysis more complex b/c now the opcodes wouldnt match between the 16-bit and 32-bit format.

otoh they don't quite match anyways because of the I format opcodes? i guess we could just put those at the end, though.

---

yeah, y'know, if we made Boot the 32-bit encoding there are actually a few probs that make it just as complex: