proj-oot-ootBootNotes1

(note: this file was created pretty late in the game; Boot notes used to be in the ootAssemblyNotes series, and maybe also the ootOvmNotes series)

---

---

---

if we have 2 banks of 7 regs each, that's more than the 8 regs in x86. But it's less than the 16 regs in x86_64, so i think it's okay.

ARM has 16 regs but note that this includes the stack pointer, link register, and PC. So i think we're okay but not 100% sure.

---

mb don't specify bitwidth, except that it's at least 32. Not a problem b/c e.g. 32-bit width can be emulated by the program by bitmasking (ANDing with 2^32 - 1) after each op. int32 ops in LOVM could compile to op+bitmasking in Boot.
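The masking trick above looks like this (a sketch; the op names are illustrative, not Boot mnemonics):

```python
# Sketch: emulating 32-bit wraparound arithmetic on a wider machine by
# masking after every operation. Function names here are illustrative.
MASK32 = (1 << 32) - 1  # 0xFFFFFFFF, i.e. 2^32 - 1

def add32(a, b):
    return (a + b) & MASK32

def mul32(a, b):
    return (a * b) & MASK32

print(add32(0xFFFFFFFF, 1))   # 0 (wraps)
print(mul32(0x80000000, 2))   # 0 (wraps)
```

So each 32-bit LOVM op costs at most one extra AND on a Boot target of unknown width.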

---

---

how does AArch64 do it?

https://stackoverflow.com/questions/44949124/absolute-jump-with-a-pc-relative-data-source-aarch64

says to "rely on the compiler generated literal pools as usual":

    LDR x9, =0xBADC0FFEE0DDF00D
    BR  x9

https://modexp.wordpress.com/2018/10/30/arm64-assembly/#branch also has a 'B' mnemonic tho

https://wiki.cdot.senecacollege.ca/wiki/AArch64_Register_and_Instruction_Quick_Start uses 'BR'

according to https://armkeil.blob.core.windows.net/developer/Files/pdf/graphics-and-multimedia/ARMv8_InstructionSetOverview.pdf , B has a range of +-128 MB; that's about 28 bits

so, seems like the AArch64 way is:

according to https://developer.arm.com/documentation/dui0473/c/writing-arm-assembly-language/literal-pools the literal pool offset from the instruction can be +-4k or +1k for 16-bit Thumb code.

mb see also discussion at https://stackoverflow.com/questions/41906688/what-are-the-semantics-of-adrp-and-adrl-instructions-in-arm-assembly

however, in Boot, since we may be compiling Boot to some other ISA, we don't like indirect jumps to loaded constants, because the compiler needs to be able to identify all potential jump target immediates to fix them up once the actual locations are known. But perhaps there's some advantage to grouping together these addresses into a 'literal pool'? Not sure.

---

okay i split jrel into 3 variants:

the idea is that:

a fancy assembler can offer a jrl pseudoinstruction and then just use the smallest one it can for each relative jump.
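That smallest-fits selection might look like this (a sketch; the three variant names and their signed offset widths are hypothetical, not from the spec):

```python
# Sketch of how an assembler could lower a 'jrl' pseudoinstruction to the
# smallest relative-jump variant whose signed offset field fits. The three
# variants and their offset widths here are made up for illustration.
VARIANTS = [
    ("jrel8", 8),
    ("jrel16", 16),
    ("jrel24", 24),
]

def lower_jrl(offset):
    for name, bits in VARIANTS:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
        if lo <= offset <= hi:
            return name
    raise ValueError("offset out of range for all jrel variants")

print(lower_jrl(100))     # jrel8
print(lower_jrl(40000))   # jrel24
```

Note this needs the usual relaxation caveat: picking a smaller variant can shrink the code, which can in turn shrink other offsets, so a real assembler may iterate.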

---

i guess one benefit to 'literal pools' is that literals can have natural alignment. In my current proposal, in the 16-bit or variable-length encoding, a 32-bit constant might not be 32-bit aligned.

but, in the variable length encoding, you're always reading instructions which aren't aligned anyways, and in the 16-bit encoding, things are only 16-bit aligned which probably is smaller than the wordsize on the underlying platform anyway. And having a literal pool (a) means maybe more memory traffic, because the literal pool is not necessarily next to the instruction, and (b) might complicate the compiler, which has to fix up things that are not known to be jump targets except that some jump instruction refers to them (esp. if the referring jump is after the literal pool).

i am not enough of an expert of this stuff to know which of those is worse. For now i'll leave it like this.

---

riscv really does include the jump immediate in the instruction, mb we should too, for the 16 bit encoding?

done

---

could have a special register that is (or may be in some cases/is permitted to be) cleared at the end of each instruction except for li. that allows the implementation to treat li to that register as fused with the following instr, allowing the benefits of having some instructions be followed by an immediate value without actually doing that. or, could get a similar effect while still having the 'real' opcode first by constraining the data in the second part to be some special instruction (but then the data has to be parsed out...)
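A tiny interpreter sketch of the clearing semantics (the register number r7 and the opcodes are hypothetical): because the register is zeroed after every non-li instruction, its value is only ever observable by the instruction immediately following the li, which is what makes fusion safe.

```python
# Sketch: r7 (hypothetical) is cleared at the end of every instruction except
# 'li', so 'li r7, imm' + the next instruction may be treated as fused.
FUSE_REG = 7

def run(program):
    regs = [0] * 8
    for op, *args in program:
        if op == "li":
            regs[args[0]] = args[1]
        elif op == "add":
            d, a, b = args
            regs[d] = regs[a] + regs[b]
        if op != "li":
            regs[FUSE_REG] = 0  # cleared after every non-li instruction
    return regs

# 'li r7, 40' followed by 'add r1, r0, r7' behaves like an add-immediate:
print(run([("li", 0, 2), ("li", 7, 40), ("add", 1, 0, 7)])[1])  # 42
```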

---

hmm i dunno i'm thinking again about the choice to have jump instructions that reference words in the instruction stream. won't this mess up hardware implementations bigtime? Because a hardware implementation can't assume that it has all the information it needs to execute an instruction after loading the instruction.

otoh i guess ARM literal pool load instructions like

LDR x9, =0xBADC0FFEE0DDF00D

'should' have the same problem, and they're fine. All i'm talking about is something like that, except that you load from the literal pool into the PC instead of into some other register. So imo it actually shouldn't be much harder for the chip designer.

so i think it's good.

---

removed this from undef behavior section (in 16 bits this is impossible):

---

removed/rewrote this:

Instructions with exactly one unsigned immediate operand may have any unsigned value from 0 thru 4194303, inclusive, in that operand. Instructions with two operands, where unsigned immediates, may have any value from 0 thru 63, inclusive, in the first operand, and any value from 0 thru 65535, inclusive, in the second operand. Instructions with three operands, where unsigned immediates, may have any value from 0 thru 63, inclusive, in the first operand and any value from 0 to 255 in each other operand. When an immediate operand type for this instruction is signed, unsigned ranges from 0 to 63 are replaced by signed ranges from -32 to 31, unsigned ranges from 0 to 255 are replaced by signed ranges from -128 to 127, unsigned ranges from 0 to 65535 are replaced by signed ranges from -32768 to 32767, unsigned ranges from 0 to 4194303 are replaced by signed ranges from -2097152 to 2097151.
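The removed ranges above can be restated as a small validity check (a sketch; note the quoted text applies signedness per operand type, while for brevity this applies it to all operands of an instruction at once):

```python
# Sketch of the removed immediate-range rules. Ranges follow the quoted text:
# 1 immediate: 22 bits; 2 immediates: 6 + 16 bits; 3 immediates: 6 + 8 + 8 bits.
UNSIGNED_RANGES = {
    1: [(0, 4194303)],
    2: [(0, 63), (0, 65535)],
    3: [(0, 63), (0, 255), (0, 255)],
}

def to_signed_range(lo, hi):
    # e.g. 0..255 becomes -128..127, 0..65535 becomes -32768..32767
    n = hi + 1
    return (-(n // 2), n // 2 - 1)

def operands_valid(operands, signed=False):
    ranges = UNSIGNED_RANGES[len(operands)]
    for val, (lo, hi) in zip(operands, ranges):
        if signed:
            lo, hi = to_signed_range(lo, hi)
        if not lo <= val <= hi:
            return False
    return True

print(operands_valid([63, 65535]))              # True
print(operands_valid([-32, 100], signed=True))  # True
```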

removed/rewrote this:

  1. ## Reserved instruction opcodes ### Boot only defines opcodes under 64; higher opcodes are reserved for extensions. Opcode 62 is implementation-defined. Opcode 63 is reserved for future use.
      1. Reserved sinfo query types ### Boot defines sinfo query numbers 0-3. Queries 247 thru 254 inclusive are implementation-defined. Other query numbers are reserved for extensions.
      2. Reserved xlib functions ### Boot defines no xlib functions, but reserves numbers 0 thru 127 for extensions. Numbers 128 thru 254 are implementation-defined.

removed/rewrote this:

Note that the opcodes (see table below) have the following properties:

Note that later additions that assign instruction(s) to the RESERVED opcode may break the 'only such' parts of these properties.

The short descriptions under Instructions, above, have been kept to at most 80 characters per line.

---

nah, i have a sweet spot for 16-bit

---

old:

---

Still reconsidering overlapping all the register banks, and adding a 6-element small stack to Boot. This would keep Boot implementable in 16 registers. It would mean that LOVM would have 32+32 = 64 items of state = 512 bytes of state instead of 1k+fp_regs like in the current proposal (assuming each register is 64 bits). Semantically we would still ban integers and pointers from mixing -- so an implementation could choose to keep those registers separate if it wanted. Would we have one small stack or two? The overlap way of doing things would say one, I guess. Don't know if this is worth it because Boot is in some ways actually less expressive because it loses the ability to directly address 16 registers -- although it gains the ability to have more than eight registers of each type. The benefit that I like the most is that now if you want a small stack you don't have to change the semantics of Boot at all when moving to LOVM. A tangential benefit is that since it's easy to reduce the size of the stack to six, you can give the platform two more hidden registers (one as a small stack TOS ptr and one as a temporary). I don't know, man. This would make type checking require extra work, and it goes against the principle that more registers is better in LOVM. Could just provide the small stacks anyway.

---

Still considering providing two small stacks in boot.

done

---

i'm still not happy with the lack of zero-operand instructions for the smallstacks. The 8-bit encoding isn't able to help much here because it has so few bits, and because i really want to be able to copy/load from/to any of the first 4 registers. The main value of the stack in this design is only that it gives access to a few more 'registers' without having to make operands bigger (10 registers instead of 7; it could be more but for now the 10 is chosen because i'm worried about implementations on 32-register machines having enough hidden registers for its own use).

an alternative would be to just make the 64 8-bit ops be the 64 opcodes, with all operands assumed to be the pseudostack register -- and then to special case the obviously useless ops (e.g. cp from/to the smallstack is useless)

---

the motivation for requiring that stack depth be statically knowable at each instruction of the program is so that a compiler can map the stack locations to fixed register locations.

The reason we only count variance at the end of an instruction is so that ENTRY at the beginning of a function can clean it up.
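The static-knowability requirement can be sketched as a straight-line check (opcodes and their depth effects are hypothetical here; at control-flow joins the depths along all incoming paths would additionally have to agree, which this sketch ignores):

```python
# Sketch: if smallstack depth is statically knowable at every instruction,
# each stack slot can be assigned a fixed register. Per-opcode depth effects
# below are made up for illustration ('add' pops 2 and pushes 1).
EFFECT = {"push": +1, "pop": -1, "add": -1, "nop": 0}

def static_depths(program):
    depth, depths = 0, []
    for op in program:
        depth += EFFECT[op]
        if depth < 0:
            raise ValueError("stack underflow is statically detectable")
        depths.append(depth)  # depth at the END of each instruction
    return depths

# Slot i of the stack can then be mapped to a fixed register R(i+1):
print(static_depths(["push", "push", "add", "push"]))  # [1, 2, 1, 2]
```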

---

actually that presents a bit of a conundrum.

first, if we disallow smallstack depth variance at the end of an instruction, then immediately after a RET instruction (which restores the stack to the way it was before ENTRY), we'll have forbidden depth variance at that moment. The better solution would seem to be to exempt one of ENTRY or RET from the condition (ENTRY if the condition is enforced before the instruction, RET if after).

Say that stack locations TOS, TOS-1, TOS-2 are mapped to registers R3, R2, R1 in function1. Now function1 calls function2. So at this point the compiler would want to replace TOS with R3 in function2.

But if function3 has an empty stack when it calls function2, then it will have stack location TOS mapped to R1. So the compiled code for function2 conflicts with this.

A solution would be to make the entire stack caller-save, and require the stack to be empty upon each function call (except perhaps for arguments passed on the stack). So make the caller push everything from the smallstack to memstack as part of each calling sequence. But this negates some of the benefits of having a stack (quick function calls).

What we'd like instead to do is to allow the implementation to do some behind-the-scenes register window stuff to spill from the bottom of the smallstack if it wants, and maintain the stack as a stack rather than mapping it to fixed registers, while exposing to the program the illusion that it was emptied after the function call. This mojo would occur upon encountering the ENTRY instruction, so we'd make the requirement for a constant stack apply at the beginning, not end, of instructions, and relax that requirement for a constant stack.

Another way of doing this (which i think i like better) is that if the smallstack doesn't all fit in registers, the part in registers is merely the top few items. By indexing from the top instead of the bottom we avoid this problem, but we get another one: pushing or popping the smallstack requires us to do n extra MOVs, where n is the number of stack items in registers.

And a third way is for the implementation to have indirection when accessing the smallstack, rather than mapping it to a fixed register.
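The second way can be sketched like this (the register/memory split is illustrative): only the top N items live in registers, indexed from the top, so the mapping is the same no matter how deep the caller's stack is, but every push shifts all N cached items.

```python
# Sketch: only the top n_regs smallstack items are register-cached; deeper
# items spill to memory. Indexing from the top avoids the cross-function
# mapping conflict, at the cost of n-1 register-to-register moves per push.
class CachedStack:
    def __init__(self, n_regs=3):
        self.regs = [0] * n_regs  # regs[0] plays the role of TOS
        self.mem = []             # spilled items (most recently spilled last)

    def push(self, v):
        self.mem.append(self.regs[-1])   # spill the bottom cached item
        self.regs[1:] = self.regs[:-1]   # the n-1 extra MOVs mentioned above
        self.regs[0] = v

    def pop(self):
        v = self.regs[0]
        self.regs[:-1] = self.regs[1:]   # refill from below
        self.regs[-1] = self.mem.pop() if self.mem else 0
        return v

s = CachedStack()
for x in (1, 2, 3, 4):
    s.push(x)
print(s.pop(), s.pop(), s.pop(), s.pop())  # 4 3 2 1
```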

actually i think we should remove the restriction on smallstack depth variance:

so, (in almost all cases) the (entire) stack cannot be statically assigned to fixed registers (without any indirection).

---

we could encode the branch and jump targets to avoid useless stuff like zero offsets from the next instruction, but:

---

removed from spec (this should go in a design doc, not in the spec):

  1. ### Reserved instruction encoding space ####

The 0 in the most-significant bit is intended to allow Boot to be made a part of other instruction formats which use 0 in this bit to indicate a Boot instruction, a 1 to indicate something else (for example, instructions of different lengths). That is to say, instructions with a 1 in the most-significant bit are reserved for extensions.

--- design

do we implement labels in Boot assembly? i'm thinking not. LOVM will have a more usable assembler, Boot is just a target language.

---

i decided to leave them embedded in the instruction stream. Otherwise labels could only be loaded within an 8k range, which would either make implementations jump thru hoops by doing things like branching to somewhere near a desired label just to collect the label, and then branching back (at presumably great cost to performance); or it would mean that the Oot implementation would have to be careful never to compile to something with more than 8k within a function (LLVM only allows indirect branching within a function), which i'm not sure would be feasible.

As a bonus, now we can assume that any Boot implementation supports larger programs.

We can have a BootX? anti-extension that disallows this.

--- design

Boot Assembly is spartan because Boot is not really meant to be written by humans.

--- design

sinfo queries are our way of defining platform-dependent constants. sinfo queries are static so that a Boot compiler can replace them with load-immediate-constant instructions, possibly allowing further optimizations through constant propagation.
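The fold described above might look like this (a sketch; the query numbers, their meanings, and the instruction tuples are hypothetical):

```python
# Sketch: since sinfo queries are static, a Boot compiler can replace each one
# with a load-immediate. Query numbers and values here are made up.
SINFO = {0: 64, 1: 8}  # e.g. 0 = pointer size in bits, 1 = some alignment

def fold_sinfo(program):
    out = []
    for instr in program:
        if instr[0] == "sinfo":
            _, dest, query = instr
            out.append(("li", dest, SINFO[query]))  # now a plain constant
        else:
            out.append(instr)
    return out

print(fold_sinfo([("sinfo", 1, 0), ("add", 2, 1, 1)]))
# -> [('li', 1, 64), ('add', 2, 1, 1)]
```

After the fold, the 'li' is an ordinary constant and downstream constant propagation can fire on it.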

--- design

this is undefined b/c it's undefined in LLVM (cite https://llvm.org/docs/LangRef.html#addresses-of-basic-blocks ):

"- comparison of code pointers to any pointer except the null pointer"

generally, whenever the naive translation of some Boot program to some common platform causes undefined behavior, then we want to specify that as undefined behavior in the Boot spec

--- old

mb there should be a way (like an instruction or annotation) for the program to give a total ordering on all registers on all banks so that the implementation knows which ones to preferentially map/cache to physical registers

so for each of 16 or 32 physical registers (minus 4 or 8 for implementation stuff), need to specify how many of each of the four register banks to store. Can round to powers of two or something to save bits.

Can probably do it in nine bits (x2 for each of 16 and 32 registers) by: for each of the first three register banks specify three bits for the power of two registers to cache from that bank (e.g.: 4 int32 regs, zero any regs, 4 codeptr regs). The last register bank is pointers, and is all of the remaining registers.

Actually for a compiler it can do its own analysis I guess (register allocation), and for an interpreter you want to minimize the possible combinations of assignments of virtual registers to physical, because each such assignment requires an entire new binary of the interpreter (unless you want to do a table lookup at runtime upon each register access). So maybe just two bits per register bank for six bits total? Still too many combinations (64); could lead to a huge interpreter binary. Maybe offer just four combinations.

Although really I think you'd want at least 256 combinations; that would allow you to say for each of the first three register banks: we need less of these than baseline, or more of these than baseline, or about the same amount of these as baseline.

(in the following, when i say banking registers, i mean that since the instruction determines the type of register in each operand, if you have e.g. 4 types, then each register number can map to one of 4 different physical registers, depending on which type it is; this allows you to address 4x more registers than you otherwise would in a fixed number of bits; for example, 3 bits can address not just 8 but 4x8=32 registers if registers are banked and if the operand determines the register. You can think of this as the operand determining 1 of 4 banks and so providing 2 extra bits, so the effective register specifier is 3+2=5 bits instead of just 3. anyway)
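The arithmetic in that parenthetical, as a sketch (bank names are illustrative):

```python
# Sketch of banked register addressing: with 4 banks and 3-bit operands, the
# operand's type (fixed by the instruction) supplies 2 implicit bits, so
# 3 explicit bits effectively name one of 4*8 = 32 registers.
BANKS = ["int32", "any", "codeptr", "ptr"]  # illustrative bank names

def effective_index(bank_name, operand):
    assert 0 <= operand < 8           # 3-bit operand
    bank = BANKS.index(bank_name)     # 2 implicit bits from the instruction
    return bank * 8 + operand         # 5-bit effective register specifier

print(effective_index("int32", 3))    # 3
print(effective_index("ptr", 3))      # 27 -- same operand bits, other bank
```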

Anyway the insight above about specifying which registers should be mapped/cached is based on the following question:

How can it be that making the programming language MORE expressive (by having 4x banked registers, so now 3 bits of operand can address 32 registers) can TAKE AWAY information from the implementation (about which registers should be cached/mapped to physical registers if you have less than 32 physical registers available)? If you have 3 operand bits and non-banked, you only have 8 registers, so if you have more than 8 but less than 32 physical registers available, then obviously you should map all 8 VM registers to physical registers; but if you have banked registers, you can only map some of them, so let's say you map the first 2 registers in each bank, but then the program uses a lot of pointers and no anytypes; you wasted 2 of your physical registers on anytypes. This seems paradoxical, because it seems like the programming language got more expressive (we can talk about 32 registers instead of just 8), but less information seems to have been communicated to the implementation (about how the physical registers can best be used).

One way to think about why a VM with less register banks can be more efficient is that it forces the programmer/compiler to pre-compute which variables are important enough to be cached in registers. An approximation to this is to say which register banks are important enough to be assigned registers, hence the idea of specifying that.

However, this way you still don't get the savings of dynamic typing in the sense that a program would be allowed to use the same physical register for different banks at different times within the program. To get this you could also annotate the program with which physical registers you mean when you use each register. E.g. before an instruction that stores a pointer to pointer register bank #3, you could have an annotation that says 'i recommend you store this to physical register #6'.

A softer way to do this would be to annotate each store with a priority level for storing that value in a physical register. Priority level of one bit is like having a register keyword. At least two bits is more appropriate for us because we have both the 16 register case and the 32 register case.

But again, back to the seeming paradox; it seems weird because it seems like if the program is providing less information (as i think it does with fewer register banks), then how is it that it is providing more information about how to execute the program efficiently? Could it be that selecting which information to include is in itself a subtle statement? No, actually I think that's wrong and the fewer register banks really do provide less information.

In fact now that I think about it there's another way: the register numbers themselves provide a priority. For example, if the program never uses registers 1 through 27 in some bank, but only uses registers 28-31, that could be a signal that there is very low priority for caching these in physical registers. In fact, the program could pretend that the banks overlap, that register n in one bank is actually the same as register n in another bank and would overwrite it. Of course that breaks if the program needs more than 31 total registers.

This suggests a simple scheme; up to some register number n (perhaps specified by annotation, although for the sake of tooling like LLVM, maybe better to do it fixed), the program assumes that register banks overlap in all banks. Higher register numbers are assumed to be distinct. The cutoff could be 16, to keep it simple. Except that I guess the small stack register is always distinct, as are the special registers (eg stack pointer vs other #2 registers), so let's say the first two are always distinct. So that's eight in total. So maybe make the cut off eight instead of 16. So now the total number of registers is 8+(24*4)=104 rather than 128. That's still a lot honestly. If the cut off were 16: 16+(16*4) = 80.

Issue is now that in order to spill to memstack, since we separate types on the stack and also different types are different sizes in memory (the memstack is in memory), we may need to specify two bits for each of the 16 unified locations, to say what type it is.

Alternately just make everything on the memstack PTR_SIZE, now only need to specify how many of those 16 are in use, and then how many of each of the other four banks are in use. Which is 5 bits (to specify up to 32 registers in use) * 4 banks = 20 bits. Or, if only powers of 2 registers-in-use-counts are allowed, then you only need 3 bits per bank, so 12 bits.

Even simpler: just say that all 32 registers are shared, and only the small stacks are banked. Now total state is 32*5 words = 160 words = 640 bytes (32bit) or 1280 bytes (64bit)

do we lose our free type checking though? Actually, no; since we have strict typing, the implementation is free to regard the unified register file of 32 registers as actually consisting of 4 separate register banks of 32 registers each. We can specify that this is permissible simply by specifying that using a value in a register as if it were a different type than it is, is undefined behavior.

Could also consider unifying the small stacks. But there, i don't see how an implementation could choose to implement a unified small stack as four separate small stacks, so this would destroy free type checking, I think. So I think I don't like it. Also, for similar reasons, it would provide less flexibility about what is on top of the stack (if you push an int32 and then a pointer, the ptr is the only thing on top of the unified stack, whereas with 4 separate stacks an instruction expecting an int32 would be able to see that one on top of the int32 stack). So yeah I don't like this.

So, have 32 unified registers, + 4 * 32 small stacks.

But wait, there's a serious downside to this. With 32 GPRs we lose the ability to use 3-bit operands in the 8-bit encoding to reach 8 registers of each type for a total of 32 -- unless we permanently specify the type of each of those 32 registers, in which case the 32-bit encoding has only eight registers of each type again, and furthermore the 32-bit encoding is redundant by two bits per operand because the opcode already determines the register type.

But what we gain by having unified registers is the ability to have the right values mapped into registers. Mb split the difference: first four registers are banked, the rest are unified.

No, that's too many banked, because then we already have 16 banked registers (12 if you don't count zero registers), which is our limit on some embedded and x86 (12 is our limit, not 16, b/c the implementation needs a few regs for PC, ptr to implementation memory storage, scratch). So how about just banking the first two registers (the zero register and the smallstack TOS), which is kind of what we really want to be banked anyways. So in the 12 physical registers we are storing 4 banked and 6 unified VM registers, so three-bit register operands (in a hypothetical 16-bit encoding) could address 10 registers.

We could bank another register(s), but that's less clean, because with register 1 we had the motivation that it's the smallstack TOS and there will be 4 smallstacks in LOVM, but with other registers we don't have that motivation. However, on the other hand, when the target has 32 registers, being able to reach only 10 of them is not much. So we could consider banking more registers. How many? Let's do a geometric mean (exp of mean of logs) between the case where our target platform has 32 regs and where it has 16 regs: for 32 registers we have about 26 free, so using half of those for banked registers would be 13 physical registers used for banked registers, divide by 4 to get about 3 banked virtual registers; and for 16-register targets, banking half of the available 12 would be 6 physical registers, used for in-between 1 and 2 banked virtual regs. In log 2, 3 becomes about 1.5, and 1.5 becomes about 0.5. The average of that is 1, and 2^1 = 2. So this suggests banking two virtual registers (TOS and one other).
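The geometric-mean estimate above, made concrete:

```python
# Sketch: exp of the mean of logs of the banked-register counts suggested by
# the 32-register target (~3 banked virtual regs) and the 16-register
# target (~1.5 banked virtual regs).
import math

def geometric_mean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

print(round(geometric_mean([3, 1.5]), 2))  # 2.12, i.e. bank about two regs
```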

This would allow a 3-bit operand to address 13 registers (4+4+5, not including the zero reg).

Could we go higher? Sure. No matter how many VM regs are defined as banked, the implementation doesn't really have to cache/map all of those to physical registers. So it wouldn't be crazy to bank the first 4 virtual registers (3 excluding the zero reg) (and expect implementations to 16-bit targets to not map all of those banked registers to physical ones).

So consider banking 1 tos reg, 1 zero reg, one special reg, and one caller save reg. Then the next 28 registers are unified. Now excluding zero registers, three bits can reach 16 registers (3 * 4 + 4). Pretty snazzy. So 0-3 would be banked and 4-31 would be unified GPRs. On the other hand, with this system you can't just say that all virtual registers below 8 are probably mapped into physical registers, whereas with the 3-register-banked system maybe you can say that.

To simplify specification in the 32-bit encoding, having the first four registers banked (including 0 reg) could correspond to assigning types to registers 1-12 -- now in the 32-bit encoding Boot spec you don't have to talk about banking at all, just about the type of these registers. This also means there would only be 32 registers total (including 0 reg) instead of 40 (0 reg + 4*3 banked regs + regs 4-31), which is a power of two.

if you had only the first 3 banked, then you only assign types to regs 1-8.

should the type specificity of some registers be mandatory or just recommended? I guess it's okay to make 8 or 12 registers type specific (mandatory). Compilers that can't handle that can simply choose to never use those registers. And it might make hardware implementations slightly smaller.

To further simplify the Boot spec, in Boot we could just define only the banked registers or only the unbanked/unified registers, and leave the full defn to LOVM. I guess it's simpler for most implementors if we define only the unbanked/unified registers (implementors who can only implement typed regs can just multiply the registers by 4...). Now, in the 16-bit encoding, we are going to want to access some of those banked regs (at least the smallstack TOSs!) but i guess the mapping from 16-bit encoding regs to LOVM regs doesn't have to be the identity map.

eh, that's not really great; the registers that the 16-bit encoding can access should be the same ones that are mapped to physical regs; so in the case where LOVM is being compiled to Boot, that means that we do want Boot to know about those banked regs

we should also provide a tip about which regs should be cached/mapped into physical registers first.

---

to summarize the previous section, i'm leaning towards:

---

if we do that, how do the argument registers square with LOVM's input/output argument registers?

could treat the 16 argument regs as 'output' regs; but that's kind of a waste b/c wouldn't it be nicer to just push outgoing arguments onto a smallstack? it's really the input arguments that you want there

also should leave some room in the 'map lower-numbered' tip for mapping smallstacks, right? or, is that tip only applicable to Boot, and if we compile LOVM to Boot then caching the smallstacks will be done in regs, so it's taken care of?

and maybe the previous two ideas should fit together.. the place we plan to cache Boot smallstacks should fit with the LOVM argument passing convention so that it is compatible with the Boot argument-passing convention

---

so how about the same as previous, for Boot, and in LOVM we add:

umm, no, two issues with that.

---

ok so i think if we want Boot programs to 'interoperate' with LOVM, with the same calling convention, and if we want LOVM to pass arguments in the smallstacks, then:

this is inefficient but it's okay b/c LOVM is for efficiency, not Boot. Boot is for easiness.

if LOVM is COMPILED TO Boot (or interpreted by a Boot interpreter), then we can have non-identity mapping of registers from LOVM to Boot to make space to cache the LOVM smallstacks

this also implies that if you are calling a Boot fn from LOVM, you should flush the smallstacks, and if the smallstacks are empty but you look deep into them, you should view the memstack (b/c they are caches)

---

is it REALLY worth it to have banked smallstacks?

benefits:

costs:

also think about how the smallstacks could be 'secretly' stored on the memstack below the virtual SP by the implementation. The stacks are being pushed and popped, and an interpreter may not know a priori how big they'll get, so - no that's wrong, in LOVM we can require a max stack size allocation in ENTRY. Boot was where i was saying we want to leave the stack semantics uncoupled from function calling, but then we removed the smallstacks from Boot
- but this is still kinda dumb; at the beginning, a function can know the max # of output regs it'll need, but not exactly how many it'll actually use of each type, because different control flow paths lead to different function calls (and even on one control flow path different functions take different numbers of arguments of each type). So for example if you build the smallstacks covertly below the virtual SP, and at one point you will need to call a fn taking 16 anytypes, then the max anytype stack must be at least size 16. So if anytype goes below ptr then the implementation will have 16 slots on the memstack below the ptr smallstack for the anytype smallstack. So now if you end up calling another fn which only takes ptr arguments, then you must address up past all those anytype slots before you can even reach the ptr smallstack. This hurts cache, may be greater than immediate register offset addressing ranges, etc. Worse, you have to pass state to that other fn somehow telling them how many items are on the anytype smallstack; they can't just assume that exactly the arguments that they need are the numbers of things on each smallstack; the alternative to passing this information is to recopy everything onto memstack the way the callee expects it.
- altho this passed state could just be hidden in the implementation (the current stack depth). But no, we want to flush the stacks to memstack upon fn call, right?
- whereas if you did not bank the smallstacks, you could just push the arguments in order and make the call, without passing extra state

let's think about what we really want here:

i think the answer is to split each 32-item smallstack into 2 16-item smallstacks. An 'input+local' smallstack and an 'output' smallstack. These are essentially two stack frames (or at least, the smallstacks portion of stack frames). The input smallstacks let you read arguments you were passed. The output smallstacks let you write arguments to a function you call. The callee only needs to know the sizes of the output smallstacks, which they already do.

This takes extra state in the interpreter to store the size of 8 16-item smallstacks instead of 4 32-item smallstacks, but luckily that still fits into only one 32-bit register.

so i think we can do this.
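The input/output split can be sketched as two per-frame stacks (the Frame class and call mechanics are illustrative, not spec):

```python
# Sketch of the input/output smallstack split: the caller pushes arguments
# onto its OUTPUT smallstack, which becomes the callee's INPUT smallstack on
# call. The callee only needs the output stack's size, which it already knows.
class Frame:
    def __init__(self, inputs=None):
        self.input = inputs or []   # arguments we were passed (read here)
        self.output = []            # arguments for functions we call

def call(caller_frame):
    # caller's output smallstack becomes the callee's input smallstack
    callee = Frame(inputs=caller_frame.output)
    caller_frame.output = []
    return callee

main = Frame()
main.output.extend([10, 20])        # push two arguments
callee = call(main)
print(callee.input)                 # [10, 20]
```

With the split, no extra depth state crosses the call: the callee's input smallstack holds exactly the arguments, and its own output smallstack starts empty.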

--- old

(this note was cut-and-pasted possibly out-of-order from another file)

I don't know, maybe do have banking as long as it is both dynamically computable (keep active bank sizes in a dynamic register) and statically computable (lexical scoping of active bank sizes)

No, even if it's dynamically and statically computable, there's an issue with the memory bandwidth caused by saving the active stack bank metadata on the stack.

---

so:

note: do we want a reg for platform use?

(note: AArch64 doesn't have thread or global ptr reg, only RISC-V)

---

actually we still have a problem here. Say you pass me 3 int32s. Now i want to push a 4th int32 onto the local smallstack. But this will hit the ptr smallstack, so i can't; unless ENTRY resizes the smallstacks, which means copying stuff unnecessarily.

the obvious solution is to subdivide the smallstacks even farther, into an input, a local, and an output. but:

i guess the implementation could secretly divide the input/local smallstack into input and local. but that's a little more state. and also this is getting convoluted.

i think, for simplicity, it might be better to let go of banked smallstacks!

in which case i don't think we have enough motivation for banked registers at all.

let's review:

seen this way, it looks like eliminating banked regs may not be optimal, but it might be simpler

so, get rid of all banked smallstacks and regs? - 1 register set of 32 regs

also, leaning towards no separate callstack, now that i see how having just one stack meshes so well with the concept of a stack

leaning towards yes

---

decision: introduced the handle datatype. Can't be used for blind copying, though.

---

---

next question: should we have a separate fp bank?

the possibility makes me reconsider having other banks too..

recall that the reason i turned away from register banking was to give the program a simple way to communicate which virtual registers should be 'cached'/mapped as physical ones. However, most popular physical architectures have their own FPU registers anyhow (16 or 32 of them), so that might be a reason to separate them.

also, if some low registers were not banked and some high ones were, that would increase expressiveness while still giving the program a way to declare which registers were most important. Could think of the banked ones as a separate 'addressing mode'. Say, the top 16 registers could be banked.

if register types are immediate, ptr, label, handle, 32-bit float, 64-bit float, then could have 6 banks, for 6*16 = 96 banked words.
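A minimal sketch of the 'banked top 16 registers' idea: one shared register file for r0..r15, plus a per-type bank selected by the operand's type for r16..r31. The class and method names are illustrative only, not part of any spec.

```python
# Hypothetical sketch: registers 0..15 are shared; 16..31 select into a
# per-type bank, so the same register number names different storage
# depending on the operand's declared type. 6 banks * 16 = 96 banked words.

BANKS = ('imm', 'ptr', 'label', 'handle', 'f32', 'f64')

class RegFile:
    def __init__(self):
        self.low = [0] * 16                       # unbanked r0..r15
        self.high = {b: [0] * 16 for b in BANKS}  # banked r16..r31, per type

    def read(self, regnum, banktype):
        if regnum < 16:
            return self.low[regnum]
        return self.high[banktype][regnum - 16]

    def write(self, regnum, banktype, value):
        if regnum < 16:
            self.low[regnum] = value
        else:
            self.high[banktype][regnum - 16] = value
```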

hmm.. so, first i thought mb registers should be unbanked, b/c of the problem of communicating the priority of banked registers. But then, i thought smallstacks should be unbanked, b/c of the problem of addressing items on a smallstack when the smallstack is stored on a memory stack. But really, high banked registers would probably be stored on a memory stack too. So we'd have the same problem there. So maybe we shouldn't bank anything, just for that reason.

Is it really that bad? The capacities of each register bank, and of each smallstack, are statically knowable for each function, which means that the offsets on the memory stack could be statically computed. But... statically knowable is not the same as immediately present in the instruction encoding -- if this were a HLL you could compute the offsets and use them to generate the code, but this isn't; we don't want that sort of intermediate compilation step. And, if the offset were immediately present, you wouldn't have banking; that would be ordinary stack-pointer-relative addressing.

so, this actually kind of rises to the level of a fundamental principle of assembly vs higher level languages; i can't quite get the phrasing right, but it's something like, you don't want to translate instructions in the assembler to machine code instructions in a 1-to-many way that involves integrating non-local information in the assembly (the reason i say i can't get the phrasing right is that it would be okay to have a macroinstruction using a macroconstant, right? and what's really the difference between that and this? but there is a difference, imo; mb the point is just that such a macrolanguage would be a tiny HLL in its own right).

the other option is to just save all those registers upon context switch, and not have some of them 'turned off'. But that's too much state imo.

anyhow, so i'm still thinking: no banking

does this mean no FP register banking, even? It's a tough call because the most 'mechanical sympathy' would be the option of having an FP bank but no others. But, otoh that's 32 extra regs of state to save, that we're probably not going to be using much.

---

what does WASM do? WASM has up-to-32-bit (variable length!) integers to represent the index of locals. A far cry from our current 32 registers. (see https://github.com/WebAssembly/design/blob/master/BinaryEncoding.md#function-bodies ). And they require 32 bits just to address a local. They have a variable-length bytecode with one-byte opcodes.

i guess my response is, yeah, if you have 32-bit operands then you can address a lot more variables than our 7-bit operands, but we don't want such big operands. Otoh they have this compact-ish variable-length stack machine rather than our 3-operand stuff.
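For reference, WASM's 'up-to-32-bit variable length' indices are unsigned LEB128: 7 payload bits per byte, with the high bit marking continuation. A standard decoder looks like this (not taken from any Boot/WASM implementation, just the usual algorithm):

```python
# Unsigned LEB128 decode, as used by WASM for local indices:
# 7 payload bits per byte, high bit set on all but the last byte.

def decode_uleb128(buf, pos=0):
    """Return (value, next_pos) for an unsigned LEB128 int in buf at pos."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return result, pos
        shift += 7
```

So a small index costs one byte, and only large indices pay for more; that's how WASM affords a huge local-index space while our fixed 7-bit operands cannot.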

---

this was an interesting read:

http://troubles.md/wasm-is-not-a-stack-machine/

the article points out that if WASM were a stack machine, liveness analysis would be trivial, but since it also has locals, to emit good code you have to do liveness analysis on locals, which is dumb b/c the compiler that emitted the WASM presumably already did that.

Instead he recommends eliminating locals and allowing blocks to take arguments (which are then placed on the stack), and return multiple values, and links to:

https://github.com/WebAssembly/multi-value/blob/master/proposals/multi-value/Overview.md

but for us the lesson is that SSA (immutable locals) is easy to optimize, stack machines are easy to optimize, but register machines require liveness analysis to optimize

eliminating our registers would be kind of the opposite direction to the one i was going in. i guess this highlights the importance of an 'undef' annotation, so the work of a higher compiler is not lost. but it does make me think.

---

leaning towards no FP reg bank.

we should have locals tho

---

the motivation of the 'register offset cache' semantics is to allow us to be agnostic as to whether the implementation caches the area immediately above the stack pointer (and, in LOVM, also immediately above the frame pointer). The purpose is to allow us to be agnostic about whether up to 16 function arguments are passed in registers or on the stack; since the calling convention is to pass arguments via the cache, implementations that want to pass them in registers can do so without hitting main memory, and implementations that want to pass them on the stack can do so too. The complicated semantics are necessary because we don't want to burden the implementations with intercepting reads and writes to main memory to see if they hit the cache, so the main memory could at some points in time diverge from the cache, and this would be visible to the program. And we want the same Boot program to be able to run on various implementations regardless of whether they use a cache or not. So since the indeterminacy of the state of the cache or memory in some situations is exposed, we have to specify it.

--- removed (more detailed than necessary)

Make space to store a cache of size 16*PTR_SIZE, as well as a copy of register 2 "R2_OLD", and a count "CACHE_USE" of how much of the cache is in use. When flushmem2roc is executed, set CACHE_USE=0. When flushroc2mem or lwc or lpc or swc or spc are executed, first check if R2 is different from R2_OLD, and if it is, set CACHE_USE=0 and set R2_OLD=R2; then, do the following. When flushroc2mem is executed, copy cache locations 0..CACHE_USE to memory locations R2..R2+CACHE_USE (with x..y interpreted as half-open range, that is, a range from x to y-1 including x and y-1 but not y). When swc or spc are executed, write the value into the cache and set CACHE_USE=max(CACHE_USE, (#imm7 from the instruction) + 1). When lwc or lpc are executed, if the #imm7 in the instruction is < CACHE_USE, then read from cache, otherwise read from memory.

- like #3 but instead of CACHE_USE, maintain a 'dirty' bit for each item in the cache to mark if it has been written to, and upon flushroc2mem (if register 2 is unchanged), flush only the items which are dirty.
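The CACHE_USE variant above can be modeled directly as interpreter state. This is just a sketch of the spec text, assuming word-granularity addresses and a dict for memory; the class and attribute names are made up.

```python
# Model of the register-offset cache (CACHE_USE variant).
# 'mem' is a dict address -> value; r2 is the base-pointer register.

CACHE_SIZE = 16

class ROCache:
    def __init__(self, mem):
        self.mem = mem
        self.cache = [0] * CACHE_SIZE
        self.cache_use = 0
        self.r2 = 0
        self.r2_old = 0

    def _check_base(self):
        # If R2 moved since the cache was filled, invalidate the cache.
        if self.r2 != self.r2_old:
            self.cache_use = 0
            self.r2_old = self.r2

    def flushmem2roc(self):
        self.cache_use = 0

    def flushroc2mem(self):
        self._check_base()
        for i in range(self.cache_use):
            self.mem[self.r2 + i] = self.cache[i]

    def swc(self, imm7, value):
        # Store into the cache; grow the in-use count to cover imm7.
        self._check_base()
        self.cache[imm7] = value
        self.cache_use = max(self.cache_use, imm7 + 1)

    def lwc(self, imm7):
        # Load: from cache if the slot is in use, else fall through to memory.
        self._check_base()
        if imm7 < self.cache_use:
            return self.cache[imm7]
        return self.mem.get(self.r2 + imm7, 0)
```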

---

thinking again about banking the upper 16 regs.

for: generally a good thing to have more regs

against:

leaning against

--- old

removed this from BootX?

## The Boot Calling Convention ##

looking ahead to lovm, mb we dont want to save any arguments in these registers, so that we put them all in the smallstacks and have our nice register window-like convention to rotate them from output argument registers to input argument registers. OTOH we want some of our syscalls to work without memory access. Maybe the thing to do is to have a separate calling convention for syscalls. Or, maybe just call that the 'Boot calling convention', and have a different one for LOVM.?

Registers:

note: the numbers below are out of date

note: now that we have the smallstack, probably pass all arguments on the smallstack (right-to-left order; the last argument is pushed first). Also consider leaving the last smallstack spot open? otoh that would only be 3 arguments.
- so i guess the convention is that the smallstack must be empty except for these arguments?
- also consider putting the first argument in a register, or even the first 2, so that (i) fns with few args can have them all in regs, (ii) the first arg is available at any time. Not sure if that's a good idea tho.
- if you do stuff like that, mb make sure at least 2 (regs+stack locs) are free even with the max # of args, so that you can at least do a SWAP w/o spilling anything to memory first

note: I read somewhere that you should have callEE clean up of the stack for tail calls? I don't understand that but anyway shouldn't that apply to our extended argument allocated memory also? because that allocated memory could be on the stack.

note: Also should we just use the stack pointer register to point to the allocated memory for extra arguments (e.g. most of the time this means putting it on the stack)?

Registers 5 and 6 (both integer bank and pointer bank) are used to pass arguments and return values (arguments are assigned left to right to registers from lower to higher number). If the function is variadic (that is, if the number of arguments is not fixed), then the number of arguments is passed in $1. If more than 2 integer arguments or more than 2 pointer arguments must be passed, then a pointer to the arguments that do not fit in registers is passed in &1. The callee may overwrite the memory passed in &1 but must not deallocate it. The conventions for passing return values are the same as the conventions for passing arguments, except that the number of return values must be fixed and must fit in registers 5 and 6 (both banks) plus whatever memory, if any, was allocated for additional arguments and passed in &1.

So, upon being called, the callee expects to find the first two int32 arguments in $5 and $6, the first two pointer arguments in &5 and &6, and the return address in register &7. If the function is variadic, the number of arguments is in $1. If there are more arguments than fit in registers $5, &5, $6, &6, a pointer to the additional arguments is in &1.

There may or may not be a memory stack, but if there is a stack, then the recommended convention is: the memory stack pointer is in &3, and the stack grows downward. Note: when there is a stack, and &1 is being used to point to additional arguments, &1 may or may not point into the stack.

The callee can overwrite and use registers 4 thru 7 (both banks) for any purpose. If additional arguments were passed in memory, the callee can also overwrite those arguments.

Upon returning, the caller expects a fixed number of return values. These are stored first in registers 5 thru 6 (both banks), and then, only if the caller passed a pointer to additional memory in &1, in that memory. The values in registers 1 thru 3 (both banks) are the same as they were at the beginning of the call. If the caller passed a pointer to additional memory in &1, then that memory has not been deallocated by the callee.
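The argument-passing rules above can be sketched as follows. Assumptions (not stated in the convention text): the overflow block holds the arguments beyond the first two of each kind, laid out as extra ints followed by extra ptrs; and the &1 "pointer" is modeled here as the block itself.

```python
# Sketch of the Boot calling convention's argument marshalling.
# Registers are modeled as a dict keyed by names like '$5' and '&5'.

def marshal_args(int_args, ptr_args):
    """Return (regs, overflow) for a call with the given int/ptr arguments."""
    regs = {}
    # First two of each kind go in registers 5 and 6 of each bank.
    regs['$5'], regs['$6'] = (int_args + [0, 0])[:2]
    regs['&5'], regs['&6'] = (ptr_args + [0, 0])[:2]
    # The rest go in a memory block; &1 points at it (assumed layout:
    # extra ints first, then extra ptrs).
    overflow = int_args[2:] + ptr_args[2:]
    if overflow:
        regs['&1'] = overflow  # stands in for a pointer to this memory
    return regs, overflow
```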

...describe also the convention for spilling the smallstacks to the stack...

frame layout of top frame on memory stack: ... (this is planned to be used to store the counts of integer smallstack items and pointer smallstack items spilled to the memory stack, in conjunction with $3; altho mb not b/c we don't really need that many bits for that)

...

## Profiles ##

The profiles are arranged in a linear ordering from smallest to largest, and each profile includes all of the smaller profiles.

The following profiles are defined:

Any of these may be suffixed with either 'vanilla 32-bit' or 'vanilla 64-bit' to indicate the combination of the indicated profile with the indicated 'vanilla' restrictions.

current space left:

3-operand instructions: 18
2-operand instructions: 6
1-operand instructions: 3
0-operand instructions: 5

### Tiny profile ###

The Tiny profile consists of the following new instruction extensions:

count of new 3-operand instructions: 8
count of new 2-operand instructions: 11
count of new 1-operand instructions: 0
count of new 0-operand instructions: 1

note: gonna have to introduce 2 new pseudoinstructions (similar to opcode 63) to house all these 2-operand fp instructions

Semantics notes:

The floating-point width is implementation-dependent.

Most of the details of the semantics of floating point is implementation-dependent.

New sinfo queries:

These sinfo query results are static.

todo: do we really want to have fcp fld fst i32tof ftoi32 without doing MORE16 so as to save encoding space?

--- old

removed from Boot reference (i guess since < and <= are only defined for integers we implicitly have these properties)

--- old

removed from bootX reference:

--- old

removed from bootX reference:

--- old

removed from bootX ref:

-- -- --

## Registers ##

32 general purpose registers:

Registers are referred to by Rn where n is the register number, e.g. R0, R1, R31.

Registers also have secondary nicknames that assist in remembering their functions:

There are also 32 floating point registers, F0..F31. F0 is constant zero, F1-F16 are caller-save, F17-F31 are callee-save. Nicknames:

Note that the item on the top of the stack, s0, is aliased to register r1, so it is read/written to using 'register addressing mode' rather than 'stack cache read/write addressing mode'.

---

old ideas for BootX? profiles

## Floating-point extension ##

New architectural features:

The following new instructions are defined:

integer division: div32
floating point: fadd fsub fmul fdiv fcmp i32tof ftoi32 fcp fld fst beqf bltf floor ceil nearest trunc binf bnan

f32add f32sub f32mul f32div f32cmp i32tof32 f32toi32 fcp fld fst beqf bltf floor ceil nearest trunc binf bnan

New undefined behavior:

### Small profile ###

The Small profile consists of the following new instruction extensions:

plus the following new lib call extensions:

The following new instructions are defined:

The following lib calls are defined:

- xlib 2: malloc(size: $5)
- xlib 3: mfree(region: &5)

New undefined behavior:

### Standard profile ###

The Standard profile includes everything in the Small profile, plus the following extensions:

Instruction:

Syscall:

### Performance profile ###

The Performance profile includes everything in the Standard profile, plus the following extensions:

Instruction:

Syscall:

Functionality extensions

### Syscall functionality extension ###

### Oot assembly functionality extension ###

(longer mnemonics?)

Instruction extensions

These extensions add new instructions.

### Floating point 1 instruction extension ###

floating point: i2f lf sf ceil floor trunc nearest addf subf mulf divf copysign bnan binf beqf bltf fcmp

could save a few fp opcodes by having ftoi take an immediate operand specifying rounding type: floor ceil round nearest. mb could elide either bltf beqf or fcmp. could compress isinf isnan into fclass like riscv does. so: fadd fsub fmul fdiv fclass fcmp itof ftoi fcp fld fst. that's still a lot tho.
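To illustrate the fclass idea, here's a simplified sketch that folds the isinf/isnan (plus sign-of-zero) tests into a single result bitmask. The bit layout here is invented; RISC-V's actual FCLASS distinguishes more classes (subnormals, signaling vs quiet NaN).

```python
import math

# Simplified fclass-style sketch: classify a float into exactly one of
# seven invented class bits (subnormals and sNaN/qNaN not distinguished).

NEG_INF, NEG, NEG_ZERO, POS_ZERO, POS, POS_INF, NAN = (1 << i for i in range(7))

def fclass(x):
    if math.isnan(x):
        return NAN
    if math.isinf(x):
        return NEG_INF if x < 0 else POS_INF
    if x == 0.0:
        # -0.0 == 0.0 in Python, so use the sign bit to tell them apart.
        return NEG_ZERO if math.copysign(1.0, x) < 0 else POS_ZERO
    return NEG if x < 0 else POS
```

With this, `binf` and `bnan` become a branch on `fclass(x) & (NEG_INF | POS_INF)` or `fclass(x) & NAN`, which is the opcode-count saving being proposed.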

### Floating point 2 instruction extension ###

Includes Floating point 1 and adds:

additional floating point: remf sqrt minf maxf powf bnonfinite bgtf beqtotf blttotf ftotcmp

### Floating point Triglog instruction extension ###

Includes Floating point 2 and adds:

additional floating point:

trig, log, exp etc fns from math.h (at this point are we missing any other math from C library or math.h?)

### Floating point elusive eight instruction extension ###

additional floating point:

https://www.evanmiller.org/statistical-shortcomings-in-standard-math-libraries.html#functions

### Atomics seqc 1 instruction extension ###

atomics (sequential consistency): lpsc lwsc spsc swsc casrmwsc casprmwsc fencesc

### Atomics seqc 2 instruction extension ###

atomic additional rmw ops (sequential consistency): addrmwa aprmwa andrmwa orrmwa xorrmwa

### Atomics rc 1 instruction extension ###

atomics (release consistency): lprc lwrc sprc swrc casrmwrc casprmwrc

### Atomics rc 2 instruction extension ###

atomics (relaxed consistency): lprlx lwrlx sprlx swrlx casrmwrlx casprmwrlx
atomic additional rmw ops (release consistency): addrmwrc aprmwrc andrmwrc orrmwrc xorrmwrc
atomic additional rmw ops (relaxed consistency): addrmwrlx aprmwrlx andrmwrlx orrmwrlx xorrmwrlx

### Non-branching conditionals instruction extension ###

### SIMD extension ###

Syscall extensions

These extensions add new syscalls.

### Filesys syscall extension ###

Syscalls: explain syscalls only; mb put these sorts of extensions under a separate heading?

filesys: read write open close seek flush poll
### Environment variables extension ###

### IPC 1 syscall extension ###

### IPC 2 syscall extension ###

### TUI syscall extension ###

### Process control syscall extension ###

### Local memory allocation syscall extension ###

### Shared memory allocation syscall extension ###

memory allocation: malloc_shared malloc_local

(which of malloc_shared/malloc_local is ordinary malloc? i think the ordinary malloc is already malloc_local)

### Clocks syscall extension ###

Restrictive extensions

These extensions further specify details which are unspecified in Boot.

### Vanilla 32-bit extension ###

### Vanilla 64-bit extension ###

## Details about semantics ##

### Stubs ###

Much functionality could be trivially implemented in a way that always returns a null result or an exceptional condition, or in a way that does not take advantage of native platform facilities, even when present. This is compliant provided that the corresponding functionality is indicated as 'stub' or 'partially stubbed'.

For example, if the Small profile is implemented but every attempt to allocate memory returns null, the implementation must not be described as "implementing the BootX? Small profile", but may be described as "implementing the BootX? Small profile, partially stubbed".

For another example, if the Standard profile is implemented in a way that prevents more than one thread/process from executing simultaneously, yet the target platform natively provides true parallel processing, then the implementation must be described as "Standard profile, partially stubbed".

Note that "stub" doesn't have to be said in cases where the reader already knows that the native target does not support the given functionality.

Boot instructions fall into three categories:

arithmetic of ints (result is undefined if the result is greater than 32 bits):

standard profile adds (52 instructions, for 92 total; all opcodes are below 128):

non-branching conditionals: cmovi cmovip cmovpp
other control flow: lpc

optional instructions (all opcodes are 128 or greater):

implementation-defined: impl1 thru impl16
interop: xentry xcall0 xcalli xcallp xcallii xcallmm xcallim xcallip xcallpm xcallpp xcall xcallv xlibcall0 xlibcalli xlibcallm xlibcallp xlibcall xlibcallv xret0 xreti xretp xpostcall
indirect control flow: lci jy

general lci, with target that doesnt have to be xentry

lpc, for (intrusive) debuggers?

Some instructions are followed by data.

A semicolon means that the instruction is followed by data; 'instr ; data'.

jump constants only 32 bits

lentry and JMP data is relative to beginning of program

note in boot spec that bootx will define some syscalls below 128? and some sinfos? and some/all instructions? or maybe just dont mention it much? or maybe say that some things are RESERVED for extension languages?

interop xcall xentr xaftr xret0 xreti xretp xtail

If the function being called takes a variable number of arguments, then the total number of integer arguments is passed in register $11 and the total number of pointer arguments is passed in register $12.

If more than 3 integer arguments or more than 3 pointer arguments need to be passed, then a pointer to the remaining integer arguments is passed in &11 and/or a pointer to the remaining pointer arguments is passed in &12. The contents of the memory holding the additional arguments may be overwritten by the callee, just as with registers 5,6,7. However, the registers 11,12 (both banks) themselves are still callee-saved and, if modified, must be restored before return. The callee must not deallocate the memory pointed to by pointer registers 11 or 12 (that is, the memory holding the additional arguments).

When arguments are passed which are neither 32-bit integers nor pointers: if the value is guaranteed to fit within 32 bits, it is passed as an integer; otherwise, the value is stored in memory and a pointer to the value is passed.

undef behav:

split 8-bit immediate offsets into 2 4-bit immediate offsets, and have one of those be ints, and the other be ptrs, so that you can specify an offset into a struct mixing ints and ptrs
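A sketch of that split-immediate address computation. The field order within the byte and the slot sizes (4-byte ints, 8-byte pointers) are assumptions, chosen only to make the idea concrete.

```python
# Sketch of the split 8-bit immediate: a 4-bit int-slot offset plus a
# 4-bit ptr-slot offset, so one operand can index into a struct that
# mixes ints and ptrs. Slot sizes are assumed, not specified.

INT_SIZE, PTR_SIZE = 4, 8

def split_imm8(imm8):
    """Return (int_offset, ptr_offset) from the packed 8-bit immediate."""
    return imm8 & 0xF, (imm8 >> 4) & 0xF

def field_addr(base, imm8):
    n_int, n_ptr = split_imm8(imm8)
    return base + n_int * INT_SIZE + n_ptr * PTR_SIZE
```

So the field after 2 ints and 1 pointer is reached with the single immediate `(1 << 4) | 2`, instead of needing a precomputed byte offset that bakes in the platform's pointer size.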

I/O:

y'know, if we just make device a register, but make both STDIN and STDOUT device &0 (null ptr), then we're probably set (b/c the &0 register works for that). Then we dont need the #imm8u. Although we may still want in1 and out1 so that we can do i/o from registers, w/o main memory. Can i think of a clever way to stuff those into one instruction without changing the 'addressing mode' of operands depending on circumstance?

maybe for in, if &dest is &0 then $len is the dest? and for out, if &src is &0 then $len is the src?

i dunno, that's a little ugly. At least we can stuff in1 and out1 into MORE.
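To make the proposal concrete anyway, here's a sketch of the &0-sentinel decoding for `in` (all names hypothetical; `out` would mirror it with the source operand):

```python
# Sketch of the '&0 as sentinel' idea: if the pointer operand is
# register 0 (the null-pointer register), the $len operand is
# reinterpreted as the data register, so the same opcode does
# register I/O without touching main memory.

def decode_in(dest_ptr_reg, len_reg):
    """Return ('mem', ptr_reg, len_reg) or ('reg', dest_reg) for 'in'."""
    if dest_ptr_reg == 0:          # &0: register form
        return ('reg', len_reg)    # $len operand names the dest register
    return ('mem', dest_ptr_reg, len_reg)
```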

## I/O ##

If standard console streams STDIN, STDOUT exist on the platform and are supported by the implementation, they must be devices #0, #1, respectively, and device #2 must be STDERR if it exists, and otherwise should be an alias to STDOUT or may be a null device (one which never emits anything and to which writing has no effect).

An implementation does not have to support INP, OUTP.

(actually later i decided that both stdin and stdout should be device #0)

When using xlib, there is no need to set the link register, these instructions will set it if needed. Upon making a call, up to 3 integer arguments are in integer registers 1, 2, 4, and up to 3 pointer arguments are in pointer registers 1, 2, 4, and the return address is found in pointer register 5. Upon returning from a call, up to 3 integer and up to 3 pointer return values will be found in registers 1,2,4 using the same convention as for calling.

in, out

57,in,0,ri,rf,u3,1,0,0,0,0,0
58,out,0,rf,rf,ri,1,0,0,0,0,0

i decided to leave them embedded in the instruction stream. Otherwise labels could only be loaded with an 8k range, which would make implementations jump thru hoops. As a bonus, now we can assume that any Boot implementation supports larger programs.

We can have a BootX? anti-extension that disallows this.

-- -- --


old

removed from Boot/BootX?

However, if you want to implement register-associated cache in some fancy way (for example, flushing the cache in a background thread concurrently with execution of the Boot program), BootX? provides an extension to broaden permissive register-associated cache behavior.

---

should we provide a facility for constant pointers, code pointers (labels), handles, embedded in the instruction stream?

on the one hand, this would seem to be the practical choice; eg. i get the impression that often linkers or loaders or maybe compilers will insert pointers to the platform/process-specific address of the process's globals as constants embedded in the instruction stream, rather than passing this in a gp register like RISC-V does.

on the other hand, as these things are opaque, their meaning and even their size is non-portable. And i thought we were saving constant table functionality for LOVM.

i guess since their size is non-portable, they really can't be embedded in the instruction stream. And since their semantics aren't defined, there's no portable way to even indicate within the BootX? language what you want the linker/loader to give you.

also, since we do have a global pointer (gp) in a register, the constant table could be over there. although maybe we should have a read-only global pointer too! and maybe the gp tp (and read-only globals) registers should be constants themselves, like the zero register.

although... that suggests a use for immediate addressing mode with pointer (source) operands: they could refer to constant pointers!

hmm.. actually it's still not obvious how to treat constant tables. The issue is that, for example, in an object file you might write out 32-bit integers as 4 bytes, but then in memory (running on top of Python) they might occupy just one memory location. But maybe the implementation doesn't know about the boundaries between constants, and so just maps the file into memory as 1-byte:1-storage-location. But, maybe not: maybe constants occupy the same sizes as everything else in memory. And what about bytestrings, is each byte one storage location? These questions suggest that we should provide constant table read instructions, even if the entries in the constant tables themselves can only be populated in an implementation-dependent way.

---