proj-oot-ootBootNotes1

(note: this file was created pretty late in the game; Boot notes used to be in the ootAssemblyNotes series, and maybe also the ootOvmNotes series)

---

---

---

if we have 2 banks of 7 regs each, that's more than the 8 regs in x86. But it's less than the 16 regs in x86_64, so i think it's okay.

ARM has 16 regs but note that this includes the stack pointer, link register, and PC. So i think we're okay but not 100% sure.

---

mb don't specify bitwidth, except that it's at least 32. Not a problem b/c e.g. 32-bit width can be emulated by the program by bitmasking (ANDing with 2^32 - 1) after each op. int32 ops in LOVM could compile to op+bitmasking in Boot.
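The masking trick above looks like this (a sketch; the op names are illustrative, not Boot mnemonics):

```python
# Sketch: emulating 32-bit wraparound arithmetic on a wider machine by
# masking after every operation. Function names here are illustrative.
MASK32 = (1 << 32) - 1  # 0xFFFFFFFF, i.e. 2^32 - 1

def add32(a, b):
    return (a + b) & MASK32

def mul32(a, b):
    return (a * b) & MASK32

print(add32(0xFFFFFFFF, 1))   # 0 (wraps)
print(mul32(0x80000000, 2))   # 0 (wraps)
```

So each 32-bit LOVM op costs at most one extra AND on a Boot target of unknown width.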

---

---

how does AArch64 do it?

https://stackoverflow.com/questions/44949124/absolute-jump-with-a-pc-relative-data-source-aarch64

says to "rely on the compiler generated literal pools as usual":

    LDR x9, =0xBADC0FFEE0DDF00D
    BR  x9

https://modexp.wordpress.com/2018/10/30/arm64-assembly/#branch also has a 'B' mnemonic tho

https://wiki.cdot.senecacollege.ca/wiki/AArch64_Register_and_Instruction_Quick_Start uses 'BR'

according to https://armkeil.blob.core.windows.net/developer/Files/pdf/graphics-and-multimedia/ARMv8_InstructionSetOverview.pdf , B has a range of +-128 MB; that's about 28 bits

so, seems like the AArch64 way is:

according to https://developer.arm.com/documentation/dui0473/c/writing-arm-assembly-language/literal-pools the literal pool offset from the instruction can be +-4k or +1k for 16-bit Thumb code.

mb see also discussion at https://stackoverflow.com/questions/41906688/what-are-the-semantics-of-adrp-and-adrl-instructions-in-arm-assembly

however, in Boot, since we may be compiling Boot to some other ISA, we don't like indirect jumps to loaded constants, because the compiler needs to be able to identify all potential jump target immediates to fix them up once the actual locations are known. But perhaps there's some advantage to grouping together these addresses into a 'literal pool'? Not sure.

---

okay i split jrel into 3 variants:

the idea is that:

a fancy assembler can offer a jrl pseudoinstruction and then just use the smallest one it can for each relative jump.
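That smallest-fits selection might look like this (a sketch; the three variant names and their signed offset widths are hypothetical, not from the spec):

```python
# Sketch of how an assembler could lower a 'jrl' pseudoinstruction to the
# smallest relative-jump variant whose signed offset field fits. The three
# variants and their offset widths here are made up for illustration.
VARIANTS = [
    ("jrel8", 8),
    ("jrel16", 16),
    ("jrel24", 24),
]

def lower_jrl(offset):
    for name, bits in VARIANTS:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
        if lo <= offset <= hi:
            return name
    raise ValueError("offset out of range for all jrel variants")

print(lower_jrl(100))     # jrel8
print(lower_jrl(40000))   # jrel24
```

Note this needs the usual relaxation caveat: picking a smaller variant can shrink the code, which can in turn shrink other offsets, so a real assembler may iterate.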

---

i guess one benefit to 'literal pools' is that literals can have natural alignment. In my current proposal, in the 16-bit or variable-length encoding, a 32-bit constant might not be 32-bit aligned.

but, in the variable length encoding, you're always reading instructions which aren't aligned anyways, and in the 16-bit encoding, things are only 16-bit aligned which probably is smaller than the wordsize on the underlying platform anyway. And having a literal pool (a) means maybe more memory traffic, because the literal pool is not necessarily next to the instruction, and (b) might complicate the compiler, which has to fix up things that are not known to be jump targets except that some jump instruction refers to them (esp. if the referring jump is after the literal pool).

i am not enough of an expert of this stuff to know which of those is worse. For now i'll leave it like this.

---

riscv really does include the jump immediate in the instruction, mb we should too, for the 16 bit encoding?

done

---

could have a special register that is (or may be in some cases/is permitted to be) cleared at the end of each instruction except for li. that allows the implementation to treat li to that register as fused with the following instr, allowing the benefits of having some instructions be followed by an immediate value without actually doing that. or, could get a similar effect while still having the 'real' opcode first by constraining the data in the second part to be some special instruction (but then the data has to be parsed out...)
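A tiny interpreter sketch of the clearing semantics (the register number r7 and the opcodes are hypothetical): because the register is zeroed after every non-li instruction, its value is only ever observable by the instruction immediately following the li, which is what makes fusion safe.

```python
# Sketch: r7 (hypothetical) is cleared at the end of every instruction except
# 'li', so 'li r7, imm' + the next instruction may be treated as fused.
FUSE_REG = 7

def run(program):
    regs = [0] * 8
    for op, *args in program:
        if op == "li":
            regs[args[0]] = args[1]
        elif op == "add":
            d, a, b = args
            regs[d] = regs[a] + regs[b]
        if op != "li":
            regs[FUSE_REG] = 0  # cleared after every non-li instruction
    return regs

# 'li r7, 40' followed by 'add r1, r0, r7' behaves like an add-immediate:
print(run([("li", 0, 2), ("li", 7, 40), ("add", 1, 0, 7)])[1])  # 42
```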

---

hmm i dunno i'm thinking again about the choice to have jump instructions that reference words in the instruction stream. won't this mess up hardware implementations bigtime? Because a hardware implementation can't assume that it has all the information it needs to execute an instruction after loading the instruction.

otoh i guess ARM literal pool load instructions like

LDR x9, =0xBADC0FFEE0DDF00D

'should' have the same problem, and they're fine. All i'm talking about is something like that, except that you load from the literal pool into the PC instead of into some other register. So imo it actually shouldn't be much harder for the chip designer.

so i think it's good.

---

removed this from undef behavior section (in 16 bits this is impossible):

---

removed/rewrote this:

Instructions with exactly one unsigned immediate operand may have any unsigned value from 0 thru 4194303, inclusive, in that operand. Instructions with two operands, where unsigned immediates, may have any value from 0 thru 63, inclusive, in the first operand, and any value from 0 thru 65535, inclusive, in the second operand. Instructions with three operands, where unsigned immediates, may have any value from 0 thru 63, inclusive, in the first operand and any value from 0 to 255 in each other operand. When an immediate operand type for this instruction is signed, unsigned ranges from 0 to 63 are replaced by signed ranges from -32 to 31, unsigned ranges from 0 to 255 are replaced by signed ranges from -128 to 127, unsigned ranges from 0 to 65535 are replaced by signed ranges from -32768 to 32767, unsigned ranges from 0 to 4194303 are replaced by signed ranges from -2097152 to 2097151.
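The removed ranges above can be restated as a small validity check (a sketch; note the quoted text applies signedness per operand type, while for brevity this applies it to all operands of an instruction at once):

```python
# Sketch of the removed immediate-range rules. Ranges follow the quoted text:
# 1 immediate: 22 bits; 2 immediates: 6 + 16 bits; 3 immediates: 6 + 8 + 8 bits.
UNSIGNED_RANGES = {
    1: [(0, 4194303)],
    2: [(0, 63), (0, 65535)],
    3: [(0, 63), (0, 255), (0, 255)],
}

def to_signed_range(lo, hi):
    # e.g. 0..255 becomes -128..127, 0..65535 becomes -32768..32767
    n = hi + 1
    return (-(n // 2), n // 2 - 1)

def operands_valid(operands, signed=False):
    ranges = UNSIGNED_RANGES[len(operands)]
    for val, (lo, hi) in zip(operands, ranges):
        if signed:
            lo, hi = to_signed_range(lo, hi)
        if not lo <= val <= hi:
            return False
    return True

print(operands_valid([63, 65535]))              # True
print(operands_valid([-32, 100], signed=True))  # True
```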

removed/rewrote this:

  1. ## Reserved instruction opcodes ### Boot only defines opcodes under 64; higher opcodes are reserved for extensions. Opcode 62 is implementation-defined. Opcode 63 is reserved for future use.
      1. Reserved sinfo query types ### Boot defines sinfo query numbers 0-3. Queries 247 thru 254 inclusive are implementation-defined. Other query numbers are reserved for extensions.
      2. Reserved xlib functions ### Boot defines no xlib functions, but reserves numbers 0 thru 127 for extensions. Numbers 128 thru 254 are implementation-defined.

removed/rewrote this:

Note that the opcodes (see table below) have the following properties:

Note that later additions that assign instruction(s) to the RESERVED opcode may break the 'only such' parts of these properties.

The short descriptions under Instructions, above, have been kept to at most 80 characters per line.

---

nah, i have a sweet spot for 16-bit

---

old:

---

Still reconsidering overlapping all the register banks, and adding a 6-element small stack to Boot. This would keep Boot implementable in 16 registers. It would mean that LOVM would have 32+32 = 64 items of state = 512 bytes of state instead of 1k+fp_regs like in the current proposal (assuming each register is 64 bits). Semantically we would still ban integers and pointers from mixing -- so an implementation could choose to keep those registers separate if it wanted. Would we have one small stack or two? The overlap way of doing things would say one, I guess. Don't know if this is worth it because Boot is in some ways actually less expressive because it loses the ability to directly address 16 registers -- although it gains the ability to have more than eight registers of each type. The benefit that I like the most is that now if you want a small stack you don't have to change the semantics of Boot at all when moving to LOVM. A tangential benefit is that since it's easy to reduce the size of the stack to six, you can give the platform two more hidden registers (one as a small stack TOS ptr and one as a temporary). I don't know, man. This would make type checking require extra work, and it goes against the principle that more registers is better in LOVM. Could just provide the small stacks anyway.

---

Still considering providing two small stacks in boot.

done

---

i'm still not happy with the lack of zero-operand instructions for the smallstacks. The 8-bit encoding isn't able to help much here because it has so few bits, and because i really want to be able to copy/load from/to any of the first 4 registers. The main value of the stack in this design is only that it gives access to a few more 'registers' without having to make operands bigger (10 registers instead of 7; it could be more but for now the 10 is chosen because i'm worried about implementations on 32-register machines having enough hidden registers for its own use).

an alternative would be to just make the 64 8-bit ops be the 64 opcodes, with all operands assumed to be the pseudostack register -- and then to special case the obviously useless ops (e.g. cp from/to the smallstack is useless)

---

the motivation for requiring that stack depth be statically knowable at each instruction of the program is so that a compiler can map the stack locations to fixed register locations.

The reason we only count variance at the end of an instruction is so that ENTRY at the beginning of a function can clean it up.
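The static-knowability requirement can be sketched as a straight-line check (opcodes and their depth effects are hypothetical here; at control-flow joins the depths along all incoming paths would additionally have to agree, which this sketch ignores):

```python
# Sketch: if smallstack depth is statically knowable at every instruction,
# each stack slot can be assigned a fixed register. Per-opcode depth effects
# below are made up for illustration ('add' pops 2 and pushes 1).
EFFECT = {"push": +1, "pop": -1, "add": -1, "nop": 0}

def static_depths(program):
    depth, depths = 0, []
    for op in program:
        depth += EFFECT[op]
        if depth < 0:
            raise ValueError("stack underflow is statically detectable")
        depths.append(depth)  # depth at the END of each instruction
    return depths

# Slot i of the stack can then be mapped to a fixed register R(i+1):
print(static_depths(["push", "push", "add", "push"]))  # [1, 2, 1, 2]
```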

---

actually that presents a bit of a conundrum.

first, if we disallow smallstack depth variance at the end of an instruction, then immediately after a RET instruction (which restores the stack to the way it was before ENTRY), we'll have forbidden depth variance at that moment. The better solution would seem to be to exempt one of ENTRY or RET from the condition (ENTRY if the condition is enforced before the instruction, RET if after).

Say that stack locations TOS, TOS-1, TOS-2 are mapped to registers R3, R2, R1 in function1. Now function1 calls function2. So at this point the compiler would want to replace TOS with R3 in function2.

But if function3 has an empty stack when it calls function2, then it will have stack location TOS mapped to R1. So the compiled code for function2 conflicts with this.

A solution would be to make the entire stack caller-save, and require the stack to be empty upon each function call (except perhaps for arguments passed on the stack). So make the caller push everything from the smallstack to memstack as part of each calling sequence. But this negates some of the benefits of having a stack (quick function calls).

What we'd like instead to do is to allow the implementation to do some behind-the-scenes register window stuff to spill from the bottom of the smallstack if it wants, and maintain the stack as a stack rather than mapping it to fixed registers, while exposing to the program the illusion that it was emptied after the function call. This mojo would occur upon encountering the ENTRY instruction, so we'd make the requirement for a constant stack apply at the beginning, not end, of instructions, and relax that requirement for a constant stack.

Another way of doing this (which i think i like better) is that if the smallstack doesn't all fit in registers, the part in registers is merely the top few items. By indexing from the top instead of the bottom we avoid this problem, but we get another one: pushing or popping the smallstack requires us to do n extra MOVs, where n is the number of stack items in registers.

And a third way is for the implementation to have indirection when accessing the smallstack, rather than mapping it to a fixed register.
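The second way can be sketched like this (the register/memory split is illustrative): only the top N items live in registers, indexed from the top, so the mapping is the same no matter how deep the caller's stack is, but every push shifts all N cached items.

```python
# Sketch: only the top n_regs smallstack items are register-cached; deeper
# items spill to memory. Indexing from the top avoids the cross-function
# mapping conflict, at the cost of n-1 register-to-register moves per push.
class CachedStack:
    def __init__(self, n_regs=3):
        self.regs = [0] * n_regs  # regs[0] plays the role of TOS
        self.mem = []             # spilled items (most recently spilled last)

    def push(self, v):
        self.mem.append(self.regs[-1])   # spill the bottom cached item
        self.regs[1:] = self.regs[:-1]   # the n-1 extra MOVs mentioned above
        self.regs[0] = v

    def pop(self):
        v = self.regs[0]
        self.regs[:-1] = self.regs[1:]   # refill from below
        self.regs[-1] = self.mem.pop() if self.mem else 0
        return v

s = CachedStack()
for x in (1, 2, 3, 4):
    s.push(x)
print(s.pop(), s.pop(), s.pop(), s.pop())  # 4 3 2 1
```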

actually i think we should remove the restriction on smallstack depth variance:

so, (in almost all cases) the (entire) stack cannot be statically assigned to fixed registers (without any indirection).

---

we could encode the branch and jump targets to avoid useless stuff like zero offsets from the next instruction, but:

---

removed from spec (this should go in a design doc, not in the spec):

  1. ### Reserved instruction encoding space ####

The 0 in the most-significant bit is intended to allow Boot to be made a part of other instruction formats which use 0 in this bit to indicate a Boot instruction, a 1 to indicate something else (for example, instructions of different lengths). That is to say, instructions with a 1 in the most-significant bit are reserved for extensions.

--- design

do we implement labels in Boot assembly? i'm thinking not. LOVM will have a more usable assembler, Boot is just a target language.

---

i decided to leave them embedded in the instruction stream. Otherwise labels could only be loaded within an 8k range, which would either make implementations jump thru hoops by doing things like branching to somewhere near a desired label just to collect the label, and then branching back (at presumably great cost to performance); or it would mean that the Oot implementation would have to be careful never to compile to something with more than 8k within a function (LLVM only allows indirect branching within a function), which i'm not sure would be feasible.

As a bonus, now we can assume that any Boot implementation supports larger programs.

We can have a BootX? anti-extension that disallows this.

--- design

Boot Assembly is spartan because Boot is not really meant to be written by humans.

--- design

sinfo queries are our way of defining platform-dependent constants. sinfo queries are static so that a Boot compiler can replace them with load-immediate-constant instructions, possibly allowing further optimizations through constant propagation.
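The fold described above might look like this (a sketch; the query numbers, their meanings, and the instruction tuples are hypothetical):

```python
# Sketch: since sinfo queries are static, a Boot compiler can replace each one
# with a load-immediate. Query numbers and values here are made up.
SINFO = {0: 64, 1: 8}  # e.g. 0 = pointer size in bits, 1 = some alignment

def fold_sinfo(program):
    out = []
    for instr in program:
        if instr[0] == "sinfo":
            _, dest, query = instr
            out.append(("li", dest, SINFO[query]))  # now a plain constant
        else:
            out.append(instr)
    return out

print(fold_sinfo([("sinfo", 1, 0), ("add", 2, 1, 1)]))
# -> [('li', 1, 64), ('add', 2, 1, 1)]
```

After the fold, the 'li' is an ordinary constant and downstream constant propagation can fire on it.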

--- design

this is undefined b/c it's undefined in LLVM (cite https://llvm.org/docs/LangRef.html#addresses-of-basic-blocks ):

"- comparison of code pointers to any pointer except the null pointer"

generally, whenever the naive translation of some Boot program to some common platform causes undefined behavior, then we want to specify that as undefined behavior in the Boot spec

--- old

mb there should be a way (like an instruction or annotation) for the program to give a total ordering on all registers on all banks so that the implementation knows which ones to preferentially map/cache to physical registers

so for each of 16 or 32 physical registers (minus 4 or 8 for implementation stuff), need to specify how many of each of the four register banks to store. Can round to powers of two or something to save bits.

Can probably do it in nine bits (x2 for each of 16 and 32 registers) by: for each of the first three register banks specify three bits for the power of two registers to cache from that bank (e.g.: 4 int32 regs, zero any regs, 4 codeptr regs). The last register bank is pointers, and is all of the remaining registers.

Actually for a compiler it can do its own analysis I guess (register allocation), and for an interpreter you want to minimize the possible combinations of assignments of virtual registers to physical, because each such assignment requires an entire new binary of the interpreter (unless you want to do a table lookup at runtime upon each register access). So maybe just two bits per register bank for six bits total? Still too many combinations (64); could lead to a huge interpreter binary. Maybe offer just four combinations.

Although really I think you'd want at least 256 combinations; that would allow you to say for each of the first three register banks: we need less of these than baseline, or more of these than baseline, or about the same amount of these as baseline.

(in the following, when i say banking registers, i mean that since the instruction determines the type of register in each operand, if you have e.g. 4 types, then each register number can map to one of 4 different physical registers, depending on which type it is; this allows you to address 4x more registers than you otherwise would in a fixed number of bits; for example, 3 bits can address not just 8 but 4x8=32 registers if registers are banked and if the operand determines the register. You can think of this as the operand determining 1 of 4 banks and so providing 2 extra bits, so the effective register specifier is 3+2=5 bits instead of just 3. anyway)
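The arithmetic in that parenthetical, as a sketch (bank names are illustrative):

```python
# Sketch of banked register addressing: with 4 banks and 3-bit operands, the
# operand's type (fixed by the instruction) supplies 2 implicit bits, so
# 3 explicit bits effectively name one of 4*8 = 32 registers.
BANKS = ["int32", "any", "codeptr", "ptr"]  # illustrative bank names

def effective_index(bank_name, operand):
    assert 0 <= operand < 8           # 3-bit operand
    bank = BANKS.index(bank_name)     # 2 implicit bits from the instruction
    return bank * 8 + operand         # 5-bit effective register specifier

print(effective_index("int32", 3))    # 3
print(effective_index("ptr", 3))      # 27 -- same operand bits, other bank
```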

Anyway the insight above about specifying which registers should be mapped/cached is based on the following question:

How can it be that making the programming language MORE expressive (by having 4x banked registers, so now 3 bits of operand can address 32 registers) can TAKE AWAY information from the implementation (about which registers should be cached/mapped to physical registers if you have less than 32 physical registers available)? If you have 3 operand bits and non-banked, you only have 8 registers, so if you have more than 8 but less than 32 physical registers available, then obviously you should map all 8 VM registers to physical registers; but if you have banked registers, you can only map some of them, so let's say you map the first 2 registers in each bank, but then the program uses a lot of pointers and no anytypes; you wasted 2 of your physical registers on anytypes. This seems paradoxical, because it seems like the programming language got more expressive (we can talk about 32 registers instead of just 8), but less information seems to have been communicated to the implementation (about how the physical registers can best be used).

One way to think about why a VM with less register banks can be more efficient is that it forces the programmer/compiler to pre-compute which variables are important enough to be cached in registers. An approximation to this is to say which register banks are important enough to be assigned registers, hence the idea of specifying that.

However, this way you still don't get the savings of dynamic typing in the sense that a program would be allowed to use the same physical register for different banks at different times within the program. To get this you could also annotate the program with which physical registers you mean when you use each register. E.g. before an instruction that stores a pointer to pointer register bank #3, you could have an annotation that says 'i recommend you store this to physical register #6'.

A softer way to do this would be to annotate each store with a priority level for storing that value in a physical register. Priority level of one bit is like having a register keyword. At least two bits is more appropriate for us because we have both the 16 register case and the 32 register case.

But again, back to the seeming paradox; it seems weird because it seems like if the program is providing less information (as i think it does with fewer register banks), then how is it that it is providing more information about how to execute the program efficiently? Could it be that selecting which information to include is in itself a subtle statement? No, actually I think that's wrong and the fewer register banks really do provide less information.

In fact now that I think about it there's another way: the register numbers themselves provide a priority. For example, if the program never uses registers 1 through 27 in some bank, but only uses registers 28-31, that could be a signal that there is very low priority for caching these in physical registers. In fact, the program could pretend that the banks overlap, that register n in one bank is actually the same as register n in another bank and would overwrite it. Of course that breaks if the program needs more than 31 total registers.

This suggests a simple scheme; up to some register number n (perhaps specified by annotation, although for the sake of tooling like LLVM, maybe better to do it fixed), the program assumes that register banks overlap in all banks. Higher register numbers are assumed to be distinct. The cutoff could be 16, to keep it simple. Except that I guess the small stack register is always distinct, as are the special registers (eg stack pointer vs other #2 registers), so let's say the first two are always distinct. So that's eight in total. So maybe make the cut off eight instead of 16. So now the total number of registers is 8+(24*4)=104 rather than 128. That's still a lot honestly. If the cut off were 16: 16+(16*4) = 80.

Issue is now that in order to spill to memstack, since we separate types on the stack and also different types are different sizes in memory (the memstack is in memory), we may need to specify two bits for each of the 16 unified locations, to say what type it is.

Alternately just make everything on the memstack PTR_SIZE, now only need to specify how many of those 16 are in use, and then how many of each of the other four banks are in use. Which is 5 bits (to specify up to 32 registers in use) * 4 banks = 20 bits. Or, if only powers of 2 registers-in-use-counts are allowed, then you only need 3 bits per bank, so 12 bits.

Even simpler: just say that all 32 registers are shared, and only the small stacks are banked. Now total state is 32*5 words = 160 words = 640 bytes (32bit) or 1280 bytes (64bit)

do we lose our free type checking though? Actually, no; since we have strict typing, the implementation is free to regard the unified register file of 32 registers as actually consisting of 4 separate register banks of 32 registers each. We can specify that this is permissible simply by specifying that using a value in a register as if it were a different type than it is, is undefined behavior.

Could also consider unifying the small stacks. But there, i don't see how an implementation could choose to implement a unified small stack as four separate small stacks, so this would destroy free type checking, I think. So I think I don't like it. Also, for similar reasons, it would provide less flexibility about what is on top of the stack (if you push an int32 and then a pointer, the ptr is the only thing on top of the unified stack, whereas with 4 separate stacks an instruction expecting an int32 would be able to see that one on top of the int32 stack). So yeah I don't like this.

So, have 32 unified registers, + 4 * 32 small stacks.

But wait, there's a serious downside to this. With 32 GPRs we lose the ability to use 3-bit operands in the 8-bit encoding to reach 8 registers of each type for a total of 32 -- unless we permanently specify the type of each of those 32 registers, in which case the 32-bit encoding has only eight registers of each type again, and furthermore the 32-bit encoding is redundant by two bits per operand because the opcode already determines the register type.

But what we gain by having unified registers is the ability to have the right values mapped into registers. Mb split the difference: first four registers are banked, the rest are unified.

No, that's too many banked, because then we already have 16 banked registers (12 if you don't count zero registers), which is our limit on some embedded and x86 (12 is our limit, not 16, b/c the implementation needs a few regs for PC, ptr to implementation memory storage, scratch). So how about just banking the first two registers (the zero register and the smallstack TOS), which is kind of what we really want to be banked anyways. So in the 12 physical registers we are storing 4 banked and 6 unified VM registers, so three-bit register operands (in a hypothetical 16-bit encoding) could address 10 registers.

We could bank another register(s), but that's less clean, because with register 1 we had the motivation that it's the smallstack TOS and there will be 4 smallstacks in LOVM, but with other registers we don't have that motivation. However, on the other hand, when the target has 32 registers, being able to reach only 10 of them is not much. So we could consider banking more registers. How many? Let's do a geometric mean (exp of mean of logs) between the case where our target platform has 32 regs and where it has 16 regs: for 32 registers we have about 26 free, so using half of those for banked registers would be 13 physical registers used for banked registers, divide by 4 to get about 3 banked virtual registers; and for 16-register targets, banking half of the available 12 would be 6 physical registers, used for in-between 1 and 2 banked virtual regs. In log 2, 3 becomes about 1.5, and 1.5 becomes about 0.5. The average of that is 1, and 2^1 = 2. So this suggests banking two virtual registers (TOS and one other).
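The geometric-mean estimate above, made concrete:

```python
# Sketch: exp of the mean of logs of the banked-register counts suggested by
# the 32-register target (~3 banked virtual regs) and the 16-register
# target (~1.5 banked virtual regs).
import math

def geometric_mean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

print(round(geometric_mean([3, 1.5]), 2))  # 2.12, i.e. bank about two regs
```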

This would allow a 3-bit operand to address 13 registers (4+4+5, not including the zero reg).

Could we go higher? Sure. No matter how many VM regs are defined as banked, the implementation doesn't really have to cache/map all of those to physical registers. So it wouldn't be crazy to bank the first 4 virtual registers (3 excluding the zero reg) (and expect implementations to 16-bit targets to not map all of those banked registers to physical ones).

So consider banking 1 tos reg, 1 zero reg, one special reg, and one caller save reg. Then the next 28 registers are unified. Now excluding zero registers, three bits can reach 16 registers (3 * 4 + 4). Pretty snazzy. So 0-3 would be banked and 4-31 would be unified GPRs. On the other hand, with this system you can't just say that all virtual registers below 8 are probably mapped into physical registers, whereas with the 3-register-banked system maybe you can say that.

To simplify specification in the 32-bit encoding, having the first four registers banked (including 0 reg) could correspond to assigning types to registers 1-12 -- now in the 32-bit encoding Boot spec you don't have to talk about banking at all, just about the type of these registers. This also means there would only be 32 registers total (including 0 reg) instead of 40 (0 reg + 4*3 banked regs + regs 4-31), which is a power of two.

if you had only the first 3 banked, then you only assign types to regs 1-8.

should the type specificity of some registers be mandatory or just recommended? I guess it's okay to make 8 or 12 registers type specific (mandatory). Compilers that can't handle that can simply choose to never use those registers. And it might make hardware implementations slightly smaller.

To further simplify the Boot spec, in Boot we could just define only the banked registers or only the unbanked/unified registers, and leave the full defn to LOVM. I guess it's simpler for most implementors if we define only the unbanked/unified registers (implementors who can only implement typed regs can just multiply the registers by 4...). Now, in the 16-bit encoding, we are going to want to access some of those banked regs (at least the smallstack TOSs!) but i guess the mapping from 16-bit encoding regs to LOVM regs doesn't have to be the identity map.

eh, that's not really great; the registers that the 16-bit encoding can access should be the same ones that are mapped to physical regs; so in the case where LOVM is being compiled to Boot, that means that we do want Boot to know about those banked regs

we should also provide a tip about which regs should be cached/mapped into physical registers first.

---

to summarize the previous section, i'm leaning towards:

---

if we do that, how do the argument registers square with LOVM's input/output argument registers?

could treat the 16 argument regs as 'output' regs; but that's kind of a waste b/c wouldn't it be nicer to just push outgoing arguments onto a smallstack? it's really the input arguments that you want there

also should leave some room in the 'map lower-numbered' tip for mapping smallstacks, right? or, is that tip only applicable to Boot, and if we compile LOVM to Boot then caching the smallstacks will be done in regs, so it's taken care of?

and maybe the previous two ideas should fit together.. the place we plan to cache Boot smallstacks should fit with the LOVM argument passing convention so that it is compatible with the Boot argument-passing convention

---

so how about the same as previous, for Boot, and in LOVM we add:

umm, no, two issues with that.

---

ok so i think if we want Boot programs to 'interoperate' with LOVM, with the same calling convention, and if we want LOVM to pass arguments in the smallstacks, then:

this is inefficient but it's okay b/c LOVM is for efficiency, not Boot. Boot is for easiness.

if LOVM is COMPILED TO Boot (or interpreted by a Boot interpreter), then we can have non-identity mapping of registers from LOVM to Boot to make space to cache the LOVM smallstacks

this also implies that if you are calling a Boot fn from LOVM, you should flush the smallstacks, and if the smallstacks are empty but you look deep into them, you should view the memstack (b/c they are caches)

---

is it REALLY worth it to have banked smallstacks?

benefits:

costs:

also think about how the smallstacks could be 'secretly' stored on the memstack below the virtual SP by the implementation. The stacks are being pushed and popped, and an interpreter may not know a priori how big they'll get, so - no that's wrong, in LOVM we can require a max stack size allocation in ENTRY. Boot was where i was saying we want to leave the stack semantics uncoupled from function calling, but then we removed the smallstacks from Boot
- but this is still kinda dumb; at the beginning, a function can know the max # of output regs it'll need, but not exactly how many it'll actually use of each type, because different control flow paths lead to different function calls (and even on one control flow path different functions take different numbers of arguments of each type). So for example if you build the smallstacks covertly below the virtual SP, and at one point you will need to call a fn taking 16 anytypes, then the max anytype stack must be at least size 16. So if anytype goes below ptr then the implementation will have 16 slots on the memstack below the ptr smallstack for the anytype smallstack. So now if you end up calling another fn which only takes ptr arguments, then you must address up past all those anytype slots before you can even reach the ptr smallstack. This hurts cache, may be greater than immediate register offset addressing ranges, etc. Worse, you have to pass state to that other fn somehow telling them how many items are on the anytype smallstack; they can't just assume that exactly the arguments that they need are the numbers of things on each smallstack; the alternative to passing this information is to recopy everything onto memstack the way the callee expects it.
- altho this passed state could just be hidden in the implementation (the current stack depth). But no, we want to flush the stacks to memstack upon fn call, right?
- whereas if you did not bank the smallstacks, you could just push the arguments in order and make the call, without passing extra state

let's think about what we really want here:

i think the answer is to split each 32-item smallstack into 2 16-item smallstacks. An 'input+local' smallstack and an 'output' smallstack. These are essentially two stack frames (or at least, the smallstacks portion of stack frames). The input smallstacks let you read arguments you were passed. The output smallstacks let you write arguments to a function you call. The callee only needs to know the sizes of the output smallstacks, which they already do.

This takes extra state in the interpreter to store the size of 8 16-item smallstacks instead of 4 32-item smallstacks, but luckily that still fits into only one 32-bit register.

so i think we can do this.
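The input/output split can be sketched as two per-frame stacks (the Frame class and call mechanics are illustrative, not spec):

```python
# Sketch of the input/output smallstack split: the caller pushes arguments
# onto its OUTPUT smallstack, which becomes the callee's INPUT smallstack on
# call. The callee only needs the output stack's size, which it already knows.
class Frame:
    def __init__(self, inputs=None):
        self.input = inputs or []   # arguments we were passed (read here)
        self.output = []            # arguments for functions we call

def call(caller_frame):
    # caller's output smallstack becomes the callee's input smallstack
    callee = Frame(inputs=caller_frame.output)
    caller_frame.output = []
    return callee

main = Frame()
main.output.extend([10, 20])        # push two arguments
callee = call(main)
print(callee.input)                 # [10, 20]
```

With the split, no extra depth state crosses the call: the callee's input smallstack holds exactly the arguments, and its own output smallstack starts empty.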

--- old

(this note was cut-and-pasted possibly out-of-order from another file)

I don't know, maybe do have banking as long as it is both dynamically computable (keep active bank sizes in a dynamic register) and statically computable (lexical scoping of active bank sizes)

No, even if it's dynamically and statically computable, there's an issue with the memory bandwidth caused by saving the active stack bank metadata on the stack.

---

so:

note: do we want a reg for platform use?

(note: AArch64 doesn't have thread or global ptr reg, only RISC-V)

---

actually we still have a problem here. Say you pass me 3 int32s. Now i want to push a 4th int32 onto the local smallstack. But this will hit the ptr smallstack, so i can't; unless ENTRY resizes the smallstacks, which means copying stuff unnecessarily.

the obvious solution is to subdivide the smallstacks even farther, into an input, a local, and an output. but:

i guess the implementation could secretly divide the input/local smallstack into input and local. but that's a little more state. and also this is getting convoluted.

i think, for simplicity, it might be better to let go of banked smallstacks!

in which case i don't think we have enough motivation for banked registers at all.

let's review:

seen this way, it looks like eliminating banked regs may not be optimal, but it might be simpler

so, get rid of all banked smallstacks and regs? - 1 register set of 32 regs

also, leaning towards no separate callstack, now that i see how having just one stack meshes so well with the concept of a stack

leaning towards yes

---

decision: introduced the handle datatype. Can't be used for blind copying, though.

---

---

next question: should we have a separate fp bank?

the possibility makes me reconsider having other banks too..

recall that the reason i turned away from register banking was to give the program a simple way to communicate which virtual registers should be 'cached'/mapped as physical ones. However, most popular physical architectures have their own FPU registers anyhow (16 or 32 of them), so that might be a reason to separate them.

also, if some low registers were not banked and some high ones were, that would increase expressiveness while still giving the program a way to declare which registers were most important. Could think of the banked ones as a separate 'addressing mode'. Say, the top 16 registers could be banked.

if register types are immediate, ptr, label, handle, 32-bit float, 64-bit float, then could have 6 banks, for 6*16 = 96 banked words.
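A minimal sketch of the 'banked top 16 registers' idea: one shared register file for r0..r15, plus a per-type bank selected by the operand's type for r16..r31. The class and method names are illustrative only, not part of any spec.

```python
# Hypothetical sketch: registers 0..15 are shared; 16..31 select into a
# per-type bank, so the same register number names different storage
# depending on the operand's declared type. 6 banks * 16 = 96 banked words.

BANKS = ('imm', 'ptr', 'label', 'handle', 'f32', 'f64')

class RegFile:
    def __init__(self):
        self.low = [0] * 16                       # unbanked r0..r15
        self.high = {b: [0] * 16 for b in BANKS}  # banked r16..r31, per type

    def read(self, regnum, banktype):
        if regnum < 16:
            return self.low[regnum]
        return self.high[banktype][regnum - 16]

    def write(self, regnum, banktype, value):
        if regnum < 16:
            self.low[regnum] = value
        else:
            self.high[banktype][regnum - 16] = value
```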

hmm.. so, first i thought mb registers should be unbanked, b/c of the problem of communicating the priority of banked registers. But then, i thought smallstacks should be unbanked, b/c of the problem of addressing items on a smallstack when the smallstack is stored on a memory stack. But really, high banked registers would probably be stored on a memory stack too. So we'd have the same problem there. So maybe we shouldn't bank anything, just for that reason.

Is it really that bad? The capacities of each register bank, and of each smallstack, are statically knowable for each function, which means that the offsets on the memory stack could be statically computed. But... statically knowable is not the same as immediately present in the instruction encoding -- if this were a HLL you could compute the offsets and use them to generate the code, but this isn't; we don't want that sort of intermediate compilation step. And, if the offset were immediately present, you wouldn't have banking; that would be ordinary stack-pointer-relative addressing.

so, this actually kind of rises to the level of a fundamental principle of assembly vs higher level languages; i can't quite get the phrasing right, but it's something like, you don't want to translate instructions in the assembler to machine code instructions in a 1-to-many way that involves integrating non-local information in the assembly (the reason i say i can't get the phrasing right is that it would be okay to have a macroinstruction using a macroconstant, right? and what's really the difference between that and this? but there is a difference, imo; mb the point is just that such a macrolanguage would be a tiny HLL in its own right).

the other option is to just save all those registers upon context switch, and not have some of them 'turned off'. But that's too much state imo.

anyhow, so i'm still thinking: no banking

does this mean no FP register banking, even? It's a tough call because the most 'mechanical sympathy' would be the option of having an FP bank but no others. But, otoh that's 32 extra regs of state to save, that we're probably not going to be using much.

---

what does WASM do? WASM has up-to-32-bit (variable length!) integers to represent the index of locals. A far cry from our current 32 registers. (see https://github.com/WebAssembly/design/blob/master/BinaryEncoding.md#function-bodies ). And they require 32 bits just to address a local. They have a variable-length bytecode with one-byte opcodes.

i guess my response is, yeah, if you have 32-bit operands then you can address a lot more variables than our 7-bit operands, but we don't want such big operands. Otoh they have this compact-ish variable-length stack machine rather than our 3-operand stuff.
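For reference, WASM's 'up-to-32-bit variable length' indices are unsigned LEB128: 7 payload bits per byte, with the high bit marking continuation. A standard decoder looks like this (not taken from any Boot/WASM implementation, just the usual algorithm):

```python
# Unsigned LEB128 decode, as used by WASM for local indices:
# 7 payload bits per byte, high bit set on all but the last byte.

def decode_uleb128(buf, pos=0):
    """Return (value, next_pos) for an unsigned LEB128 int in buf at pos."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return result, pos
        shift += 7
```

So a small index costs one byte, and only large indices pay for more; that's how WASM affords a huge local-index space while our fixed 7-bit operands cannot.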

---

this was an interesting read:

http://troubles.md/wasm-is-not-a-stack-machine/

the article points out that if WASM were a stack machine, liveness analysis would be trivial, but since it also has locals, to emit good code you have to do liveness analysis on locals, which is dumb b/c the compiler that emitted the WASM presumably already did that.

Instead he recommends eliminating locals and allowing blocks to take arguments (which are then placed on the stack), and return multiple values, and links to:

https://github.com/WebAssembly/multi-value/blob/master/proposals/multi-value/Overview.md

but for us the lesson is that SSA (immutable locals) is easy to optimize, stack machines are easy to optimize, but register machines require liveness analysis to optimize

eliminating our registers would be kind of the opposite direction to the one i was going in. i guess this highlights the importance of an 'undef' annotation, so the work of a higher compiler is not lost. but it does make me think.

---

leaning towards no FP reg bank.

we should have locals tho

---

the motivation of the 'register offset cache' semantics is to allow us to be agnostic as to whether the implementation caches the area immediately above the stack pointer (and, in LOVM, also immediately above the frame pointer). The purpose is to allow us to be agnostic about whether up to 16 function arguments are passed in registers or on the stack; since the calling convention is to pass arguments via the cache, implementations that want to pass them in registers can do so without hitting main memory, and implementations that want to pass them on the stack can do so too. The complicated semantics are necessary because we don't want to burden the implementations with intercepting reads and writes to main memory to see if they hit the cache, so the main memory could at some points in time diverge from the cache, and this would be visible to the program. And we want the same Boot program to be able to run on various implementations regardless of whether they use a cache or not. So since the indeterminacy of the state of the cache or memory in some situations is exposed, we have to specify it.

--- removed (more detailed than necessary)

Make space to store a cache of size 16*PTR_SIZE, as well as a copy of register 2 "R2_OLD", and a count "CACHE_USE" of how much of the cache is in use. When flushmem2roc is executed, set CACHE_USE=0. When flushroc2mem or lwc or lpc or swc or spc are executed, first check if R2 is different from R2_OLD, and if it is, set CACHE_USE=0 and set R2_OLD=R2; then, do the following. When flushroc2mem is executed, copy cache locations 0..CACHE_USE to memory locations R2..R2+CACHE_USE (with x..y interpreted as half-open range, that is, a range from x to y-1 including x and y-1 but not y). When swc or spc are executed, write the value into the cache and set CACHE_USE=max(CACHE_USE, (#imm7 from the instruction) + 1). When lwc or lpc are executed, if the #imm7 in the instruction is < CACHE_USE, then read from cache, otherwise read from memory.

- like #3 but instead of CACHE_USE, maintain a 'dirty' bit for each item in the cache to mark if it has been written to, and upon flushroc2mem (if register 2 is unchanged), flush only the items which are dirty.
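The CACHE_USE variant above can be modeled directly as interpreter state. This is just a sketch of the spec text, assuming word-granularity addresses and a dict for memory; the class and attribute names are made up.

```python
# Model of the register-offset cache (CACHE_USE variant).
# 'mem' is a dict address -> value; r2 is the base-pointer register.

CACHE_SIZE = 16

class ROCache:
    def __init__(self, mem):
        self.mem = mem
        self.cache = [0] * CACHE_SIZE
        self.cache_use = 0
        self.r2 = 0
        self.r2_old = 0

    def _check_base(self):
        # If R2 moved since the cache was filled, invalidate the cache.
        if self.r2 != self.r2_old:
            self.cache_use = 0
            self.r2_old = self.r2

    def flushmem2roc(self):
        self.cache_use = 0

    def flushroc2mem(self):
        self._check_base()
        for i in range(self.cache_use):
            self.mem[self.r2 + i] = self.cache[i]

    def swc(self, imm7, value):
        # Store into the cache; grow the in-use count to cover imm7.
        self._check_base()
        self.cache[imm7] = value
        self.cache_use = max(self.cache_use, imm7 + 1)

    def lwc(self, imm7):
        # Load: from cache if the slot is in use, else fall through to memory.
        self._check_base()
        if imm7 < self.cache_use:
            return self.cache[imm7]
        return self.mem.get(self.r2 + imm7, 0)
```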

---

thinking again about banking the upper 16 regs.

for: generally a good thing to have more regs

against:

leaning against

--- old

removed this from BootX?

## The Boot Calling Convention ##

looking ahead to lovm, mb we dont want to save any arguments in these registers, so that we put them all in the smallstacks and have our nice register window-like convention to rotate them from output argument registers to input argument registers. OTOH we want some of our syscalls to work without memory access. Maybe the thing to do is to have a separate calling convention for syscalls. Or, maybe just call that the 'Boot calling convention', and have a different one for LOVM.?

Registers:

note: the numbers below are out of date

note: now that we have the smallstack, probably pass all arguments on the smallstack (right-to-left order; the last argument is pushed first). Also consider leaving the last smallstack spot open? otoh that would only be 3 arguments.
- so i guess the convention is that the smallstack must be empty except for these arguments?
- also consider putting the first argument in a register, or even the first 2, so that (i) fns with few args can have them all in regs, (ii) the first arg is available at any time. Not sure if that's a good idea tho.
- if you do stuff like that, mb make sure at least 2 (regs+stack locs) are free even with the max # of args, so that you can at least do a SWAP w/o spilling anything to memory first

note: I read somewhere that you should have callEE clean up of the stack for tail calls? I don't understand that but anyway shouldn't that apply to our extended argument allocated memory also? because that allocated memory could be on the stack.

note: Also should we just use the stack pointer register to point to the allocated memory for extra arguments (e.g. most of the time this means putting it on the stack)?

Registers 5 and 6 (both integer bank and pointer bank) are used to pass arguments and return values (arguments are assigned left to right to registers from lower to higher number). If the function is variadic (that is, if the number of arguments is not fixed), then the number of arguments is passed in $1. If more than 2 integer arguments or more than 2 pointer arguments must be passed, then a pointer to the arguments that do not fit in registers is passed in &1. The callee may overwrite the memory passed in &1 but must not deallocate it. The conventions for passing return values are the same as the conventions for passing arguments, except that the number of return values must be fixed and must fit in registers 5 and 6 (both banks) plus whatever memory, if any, was allocated for additional arguments and passed in &1.

So, upon being called, the callee expects to find the first two int32 arguments in $5 and $6, the first two pointer arguments in &5 and &6, and the return address in register &7. If the function is variadic, the number of arguments is in $1. If there are more arguments than fit in registers $5, &5, $6, &6, a pointer to the additional arguments is in &1.

There may or may not be a memory stack, but if there is a stack, then the recommended convention is: the memory stack pointer is in &3, and the stack grows downward. Note: when there is a stack, and &1 is being used to point to additional arguments, &1 may or may not point into the stack.

The callee can overwrite and use registers 4 thru 7 (both banks) for any purpose. If additional arguments were passed in memory, the callee can also overwrite those arguments.

Upon returning, the caller expects a fixed number of return values. These are stored first in registers 5 thru 6 (both banks), and then, only if the caller passed a pointer to additional memory in &1, in that memory. The values in registers 1 thru 3 (both banks) are the same as they were at the beginning of the call. If the caller passed a pointer to additional memory in &1, then that memory has not been deallocated by the callee.
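The argument-passing rules above can be sketched as follows. Assumptions (not stated in the convention text): the overflow block holds the arguments beyond the first two of each kind, laid out as extra ints followed by extra ptrs; and the &1 "pointer" is modeled here as the block itself.

```python
# Sketch of the Boot calling convention's argument marshalling.
# Registers are modeled as a dict keyed by names like '$5' and '&5'.

def marshal_args(int_args, ptr_args):
    """Return (regs, overflow) for a call with the given int/ptr arguments."""
    regs = {}
    # First two of each kind go in registers 5 and 6 of each bank.
    regs['$5'], regs['$6'] = (int_args + [0, 0])[:2]
    regs['&5'], regs['&6'] = (ptr_args + [0, 0])[:2]
    # The rest go in a memory block; &1 points at it (assumed layout:
    # extra ints first, then extra ptrs).
    overflow = int_args[2:] + ptr_args[2:]
    if overflow:
        regs['&1'] = overflow  # stands in for a pointer to this memory
    return regs, overflow
```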

...describe also the convention for spilling the smallstacks to the stack...

frame layout of top frame on memory stack: ... (this is planned to be used to store the counts of integer smallstack items and pointer smallstack items spilled to the memory stack, in conjunction with $3; altho mb not b/c we don't really need that many bits for that)

...

## Profiles ##

The profiles are arranged in a linear ordering from smallest to largest, and each profile includes all of the smaller profiles.

The following profiles are defined:

Any of these may be suffixed with either 'vanilla 32-bit' or 'vanilla 64-bit' to indicate the combination of the indicated profile with the indicated 'vanilla' restrictions.

current space left:

3-operand instructions: 18
2-operand instructions: 6
1-operand instructions: 3
0-operand instructions: 5

### Tiny profile ###

The Tiny profile consists of the following new instruction extensions:

count of new 3-operand instructions: 8
count of new 2-operand instructions: 11
count of new 1-operand instructions: 0
count of new 0-operand instructions: 1

note: gonna have to introduce 2 new pseudoinstructions (similar to opcode 63) to house all these 2-operand fp instructions

Semantics notes:

The floating-point width is implementation-dependent.

Most of the details of the semantics of floating point is implementation-dependent.

New sinfo queries:

These sinfo query results are static.

todo: do we really want to have fcp fld fst i32tof ftoi32 without doing MORE16 so as to save encoding space?

--- old

removed from Boot reference (i guess since < and <= are only defined for integers we implicitly have these properties)

--- old

removed from bootX reference:

--- old

removed from bootX reference:

--- old

removed from bootX ref:

-- -- --

## Registers ##

32 general purpose registers:

Registers are referred to by Rn where n is the register number, e.g. R0, R1, R31.

Registers also have secondary nicknames that assist in remembering their functions:

There are also 32 floating point registers, F0..F31. F0 is constant zero, F1-F16 are caller-save, F17-F31 are callee-save. Nicknames:

Note that the item on the top of the stack, s0, is aliased to register r1, so it is read/written to using 'register addressing mode' rather than 'stack cache read/write addressing mode'.

---

old ideas for BootX? profiles

## Floating-point extension ##

New architectural features:

The following new instructions are defined:

integer division: div32
floating point: fadd fsub fmul fdiv fcmp i32tof ftoi32 fcp fld fst beqf bltf floor ceil nearest trunc binf bnan

f32add f32sub f32mul f32div f32cmp i32tof32 f32toi32 fcp fld fst beqf bltf floor ceil nearest trunc binf bnan

New undefined behavior:

### Small profile ###

The Small profile consists of the following new instruction extensions:

plus the following new lib call extensions:

The following new instructions are defined:

The following lib calls are defined:

- xlib 2: malloc(size: $5)
- xlib 3: mfree(region: &5)

New undefined behavior:

### Standard profile ###

The Standard profile includes everything in the Small profile, plus the following extensions:

Instruction:

Syscall:

### Performance profile ###

The Performance profile includes everything in the Standard profile, plus the following extensions:

Instruction:

Syscall:

Functionality extensions

### Syscall functionality extension ###

### Oot assembly functionality extension ###

(longer mnemonics?)

Instruction extensions

These extensions add new instructions.

### Floating point 1 instruction extension ###

floating point: i2f lf sf ceil floor trunc nearest addf subf mulf divf copysign bnan binf beqf bltf fcmp

could save a few fp opcodes by having ftoi take an immediate operand specifying rounding type: floor ceil round nearest. mb could elide either bltf beqf or fcmp. could compress isinf isnan into fclass like riscv does. so: fadd fsub fmul fdiv fclass fcmp itof ftoi fcp fld fst. that's still a lot tho.
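To illustrate the fclass idea, here's a simplified sketch that folds the isinf/isnan (plus sign-of-zero) tests into a single result bitmask. The bit layout here is invented; RISC-V's actual FCLASS distinguishes more classes (subnormals, signaling vs quiet NaN).

```python
import math

# Simplified fclass-style sketch: classify a float into exactly one of
# seven invented class bits (subnormals and sNaN/qNaN not distinguished).

NEG_INF, NEG, NEG_ZERO, POS_ZERO, POS, POS_INF, NAN = (1 << i for i in range(7))

def fclass(x):
    if math.isnan(x):
        return NAN
    if math.isinf(x):
        return NEG_INF if x < 0 else POS_INF
    if x == 0.0:
        # -0.0 == 0.0 in Python, so use the sign bit to tell them apart.
        return NEG_ZERO if math.copysign(1.0, x) < 0 else POS_ZERO
    return NEG if x < 0 else POS
```

With this, `binf` and `bnan` become a branch on `fclass(x) & (NEG_INF | POS_INF)` or `fclass(x) & NAN`, which is the opcode-count saving being proposed.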

### Floating point 2 instruction extension ###

Includes Floating point 1 and adds:

additional floating point: remf sqrt minf maxf powf bnonfinite bgtf beqtotf blttotf ftotcmp

### Floating point Triglog instruction extension ###

Includes Floating point 2 and adds:

additional floating point:

trig, log, exp etc fns from math.h (at this point are we missing any other math from C library or math.h?)

### Floating point elusive eight instruction extension ###

additional floating point:

https://www.evanmiller.org/statistical-shortcomings-in-standard-math-libraries.html#functions

### Atomics seqc 1 instruction extension ###

atomics (sequential consistency): lpsc lwsc spsc swsc casrmwsc casprmwsc fencesc

### Atomics seqc 2 instruction extension ###

atomic additional rmw ops (sequential consistency): addrmwa aprmwa andrmwa orrmwa xorrmwa

### Atomics rc 1 instruction extension ###

atomics (release consistency): lprc lwrc sprc swrc casrmwrc casprmwrc

### Atomics rc 2 instruction extension ###

atomics (relaxed consistency): lprlx lwrlx sprlx swrlx casrmwrlx casprmwrlx
atomic additional rmw ops (release consistency): addrmwrc aprmwrc andrmwrc orrmwrc xorrmwrc
atomic additional rmw ops (relaxed consistency): addrmwrlx aprmwrlx andrmwrlx orrmwrlx xorrmwrlx

### Non-branching conditionals instruction extension ###

### SIMD extension ###

Syscall extensions

These extensions add new syscalls.

### Filesys syscall extension ###

Syscalls: explain syscalls only; mb put these sorts of extensions under a separate heading?

filesys: read write open close seek flush poll
### Environment variables extension ###

### IPC 1 syscall extension ###

### IPC 2 syscall extension ###

### TUI syscall extension ###

### Process control syscall extension ###

### Local memory allocation syscall extension ###

### Shared memory allocation syscall extension ###

memory allocation: malloc_shared malloc_local

(which of malloc_shared/malloc_local is ordinary malloc? i think the ordinary malloc is already malloc_local)

### Clocks syscall extension ###

Restrictive extensions

These extensions further specify details which are unspecified in Boot.

### Vanilla 32-bit extension ###

### Vanilla 64-bit extension ###

## Details about semantics ##

### Stubs ###

Much functionality could be trivially implemented in a way that always returns a null result or an exceptional condition, or in a way that does not take advantage of native platform facilities, even when present. This is compliant provided that the corresponding functionality is indicated as 'stub' or 'partially stubbed'.

For example, if the Small profile is implemented but every attempt to allocate memory returns null, the implementation must not be described as "implementing the BootX? Small profile", but may be described as "implementing the BootX? Small profile, partially stubbed".

For another example, if the Standard profile is implemented in a way that prevents more than one thread/process from executing simultaneously, yet the target platform natively provides true parallel processing, then the implementation must be described as "Standard profile, partially stubbed".

Note that "stub" doesn't have to be said in cases where the reader already knows that the native target does not support the given functionality.

Boot instructions fall into three categories:

arithmetic of ints (result is undefined if the result is greater than 32 bits):

standard profile adds (52 instructions, for 92 total; all opcodes are below 128):

non-branching conditionals: cmovi cmovip cmovpp
other control flow: lpc

optional instructions (all opcodes are 128 or greater):

implementation-defined: impl1 thru impl16
interop: xentry xcall0 xcalli xcallp xcallii xcallmm xcallim xcallip xcallpm xcallpp xcall xcallv xlibcall0 xlibcalli xlibcallm xlibcallp xlibcall xlibcallv xret0 xreti xretp xpostcall
indirect control flow: lci jy

general lci, with target that doesnt have to be xentry

lpc, for (intrusive) debuggers?

Some instructions are followed by data.

A semicolon means that the instruction is followed by data; 'instr ; data'.

jump constants only 32 bits

lentry and JMP data is relative to beginning of program

note in boot spec that bootx will define some syscalls below 128? and some sinfos? and some/all instructions? or maybe just dont mention it much? or maybe say that some things are RESERVED for extension languages?

interop xcall xentr xaftr xret0 xreti xretp xtail

If the function being called takes a variable number of arguments, then the total number of integer arguments is passed in register $11 and the total number of pointer arguments is passed in register $12.

If more than 3 integer arguments or more than 3 pointer arguments need to be passed, then a pointer to the remaining integer arguments is passed in &11 and/or a pointer to the remaining pointer arguments is passed in &12. The contents of the memory holding the additional arguments may be overwritten by the callee, just as with registers 5,6,7. However, the registers 11,12 (both banks) themselves are still callee-saved and, if modified, must be restored before return. The callee must not deallocate the memory pointed to by pointer registers 11 or 12 (that is, the memory holding the additional arguments).

When arguments are passed which are neither 32-bit integers nor pointers: if the value is guaranteed to fit within 32 bits, it is passed as an integer; otherwise, the value is stored in memory and a pointer to the value is passed.

undef behav:

split 8-bit immediate offsets into 2 4-bit immediate offsets, and have one of those be ints, and the other be ptrs, so that you can specify an offset into a struct mixing ints and ptrs
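A sketch of that split-immediate address computation. The field order within the byte and the slot sizes (4-byte ints, 8-byte pointers) are assumptions, chosen only to make the idea concrete.

```python
# Sketch of the split 8-bit immediate: a 4-bit int-slot offset plus a
# 4-bit ptr-slot offset, so one operand can index into a struct that
# mixes ints and ptrs. Slot sizes are assumed, not specified.

INT_SIZE, PTR_SIZE = 4, 8

def split_imm8(imm8):
    """Return (int_offset, ptr_offset) from the packed 8-bit immediate."""
    return imm8 & 0xF, (imm8 >> 4) & 0xF

def field_addr(base, imm8):
    n_int, n_ptr = split_imm8(imm8)
    return base + n_int * INT_SIZE + n_ptr * PTR_SIZE
```

So the field after 2 ints and 1 pointer is reached with the single immediate `(1 << 4) | 2`, instead of needing a precomputed byte offset that bakes in the platform's pointer size.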

I/O:

y'know, if we just make device a register, but make both STDIN and STDOUT device &0 (null ptr), then we're probably set (b/c the &0 register works for that). Then we dont need the #imm8u. Although we may still want in1 and out1 so that we can do i/o from registers, w/o main memory. Can i think of a clever way to stuff those into one instruction without changing the 'addressing mode' of operands depending on circumstance?

maybe for in, if &dest is &0 then $len is the dest? and for out, if &src is &0 then $len is the src?

i dunno, that's a little ugly. At least we can stuff in1 and out1 into MORE.
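To make the proposal concrete anyway, here's a sketch of the &0-sentinel decoding for `in` (all names hypothetical; `out` would mirror it with the source operand):

```python
# Sketch of the '&0 as sentinel' idea: if the pointer operand is
# register 0 (the null-pointer register), the $len operand is
# reinterpreted as the data register, so the same opcode does
# register I/O without touching main memory.

def decode_in(dest_ptr_reg, len_reg):
    """Return ('mem', ptr_reg, len_reg) or ('reg', dest_reg) for 'in'."""
    if dest_ptr_reg == 0:          # &0: register form
        return ('reg', len_reg)    # $len operand names the dest register
    return ('mem', dest_ptr_reg, len_reg)
```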

## I/O ##

If standard console streams STDIN, STDOUT exist on the platform and are supported by the implementation, they must be devices #0, #1, respectively, and device #2 must be STDERR if it exists, and otherwise should be an alias to STDOUT or may be a null device (one which never emits anything and to which writing has no effect).

An implementation does not have to support INP, OUTP.

(actually later i decided that both stdin and stdout should be device #0)

When using xlib, there is no need to set the link register, these instructions will set it if needed. Upon making a call, up to 3 integer arguments are in integer registers 1, 2, 4, and up to 3 pointer arguments are in pointer registers 1, 2, 4, and the return address is found in pointer register 5. Upon returning from a call, up to 3 integer and up to 3 pointer return values will be found in registers 1,2,4 using the same convention as for calling.

in, out

57,in,0,ri,rf,u3,1,0,0,0,0,0
58,out,0,rf,rf,ri,1,0,0,0,0,0

i decided to leave them embedded in the instruction stream. Otherwise labels could only be loaded with an 8k range, which would make implementations jump thru hoops. As a bonus, now we can assume that any Boot implementation supports larger programs.

We can have a BootX? anti-extension that disallows this.

-- -- --


old

removed from Boot/BootX?

However, if you want to implement register-associated cache in some fancy way (for example, flushing the cache in a background thread concurrently with execution of the Boot program), BootX? provides an extension to broaden permissive register-associated cache behavior.

---

should we provide a facility for constant pointers, code pointers (labels), handles, embedded in the instruction stream?

on the one hand, this would seem to be the practical choice; eg. i get the impression that often linkers or loaders or maybe compilers will insert pointers to the platform/process-specific address of the process's globals as constants embedded in the instruction stream, rather than passing this in a gp register like RISC-V does.

on the other hand, as these things are opaque, their meaning and even their size is non-portable. And i thought we were saving constant table functionality for LOVM.

i guess since their size is non-portable, they really can't be embedded in the instruction stream. And since their semantics aren't defined, there's no portable way to even indicate within the BootX? language what you want the linker/loader to give you.

also, since we do have a global pointer (gp) in a register, the constant table could be over there. although maybe we should have a read-only global pointer too! and maybe the gp tp (and read-only globals) registers should be constants themselves, like the zero register.

although... that suggests a use for immediate addressing mode with pointer (source) operands: they could refer to constant pointers!

hmm.. actually it's still not obvious how to treat constant tables. The issue is that, for example, in an object file you might write out 32-bit integers as 4 bytes, but then in memory (running on top of Python) they might occupy just one memory location. But maybe the implementation doesn't know about the boundaries between constants, and so just maps the file into memory as 1-byte:1-storage-location. But, maybe not: maybe constants occupy the same sizes as everything else in memory. And what about bytestrings, is each byte one storage location? These questions suggest that we should provide constant table read instructions, even if the entries in the constant tables themselves can only be populated in an implementation-dependent way.

---