proj-oot-ootAssemblyNotes29

ok i'm trying to understand which calling convention we should have, and tail calls, etc.

i've read two things:

ok, on the callee cleanup:

perhaps the 'share the same stack frame layout' idea is at fault. Imagine that C takes more arguments than B. So eg to clean up B's stack frame, maybe you have to bump up the stack pointer by 10, but to clean up C's stack frame, you have to bump it up by 20. Now if control is returned directly from C to A (because B tail-called C), and if A is supposed to do cleanup, all A knows is that it called B, so it'll bump up the stack pointer by 10, which is wrong, because the size-20 stack frame is the one in effect, due to the call to C.

ok so this makes the vararg thing almost but not quite make sense, i think. if A decides to call B with a size 30 stack this time, and then B calls C, and we have callee cleanup, then C cleans up its own stack and jumps back to A, but no one cleans up B's stack, because B was supposed to clean that up but B got skipped.

i think we may need a bigger example to demonstrate. A->B->C->D. All in tail position, so we want to return directly to A. B->C is a vararg call. So the two ppl in a position to know about the size of the vararg stack frame are B and C, but both of them got skipped on the return.

this still doesn't quite make sense though, because what is preventing C from cleaning up the vararg stack frame before the call to D?

https://stackoverflow.com/questions/35626835/make-a-variable-argument-function-callee-cleanup maybe sheds some light here. it seems like the convention is to put the return address last (at the top of the stack/bottom of memory), below the arguments. So maybe you don't want to bump up the stack pointer before returning because then the return address is in the de-allocated-ish area below the stack pointer. And i guess the compiler can't be bothered to just manually move the return address up, overwriting earlier arguments. There is apparently an instruction in x86, 'RET #x', with #x immediate, that reads the return address, moves the stack pointer up by x, and then jumps to the return address which was read. But if it's a vararg then x is only known dynamically, not statically, and it's an immediate, so it doesn't work.
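to make the 'RET #x' point concrete, here's a rough sketch in Python (simulating a downward-growing stack as a list plus an index; the slot sizes and numbers are made up, just to illustrate why a static immediate can't clean up a dynamically-sized vararg area):

    # minimal sketch of an x86-style 'RET imm' with callee cleanup.
    # memory[sp] is the top of the stack; pushing decrements sp.
    def ret_n(memory, sp, imm):
        ret_addr = memory[sp]   # return address sits below the arguments
        sp += 1                 # pop the return address
        sp += imm               # pop 'imm' argument slots; imm is a static immediate
        return ret_addr, sp     # jump to ret_addr with the caller's args cleaned up

    memory = [None] * 16
    sp = 10
    memory[sp] = 0x4000         # return address
    memory[sp + 1] = 1          # arg0
    memory[sp + 2] = 2          # arg1
    print(ret_n(memory, sp, 2)) # (16384, 13) -- works only because '2' was known statically

    # for a vararg callee the argument count is only known at run time, so there is
    # no single immediate to bake into the RET instruction.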

so that last bit about the vararg sounds like it's something specific to limitations in the x86 convention, conveniences in the instruction set, and the way that compilers (and probably branch predictors) insist on using that.

more about vararg stuff [2]

so it sounds like callee cleanup is the way to go.

other notes i found while reading about this:

"The fewer times you modify the value of esp explicitly, the better. There is more going on in these instructions than meets the eye." -- https://cboard.cprogramming.com/tech-board/117965-callee-cleanup-variable-number-parameters.html - that's nice but it sounds fancy -- and if we're doing fancy stuff i'm betting tail call elimination is more important

strangely https://www.agner.org/optimize/calling_conventions.pdf doesn't have much to say about this.

https://github.com/microsoft/ChakraCore/issues/796 has a good point about another way to do varargs with TCO: with caller-cleanup, if A->B->C->D and B->C is vararg and D skips over B right back to A when returning, it's true that A doesn't know how much stack B allocated to hold the varargs, but A can still make things right just by remembering what its stack pointer was before the call (that is, treating SP as caller-save instead of callee-save) -- of course since it's the stack pointer itself that is being lost, it has to save it in some other callee-save register, rather than on the stack. They don't like doing this because it uses up another callee-save register on each such function call (so one less register, and also all those saves and restores). They seem to think a better alternative would be callee-save, but they have a lot of vararg code that would need to be rewritten or something. Sounds like they decided to do nothing in the end (not support TCO).
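here's a rough sketch of that caller-saves-SP idea in Python (stack as an index into a list; the frame sizes and names are invented, just to show why A doesn't need to know how much vararg space got allocated):

    # A saves its SP before the call; when control eventually comes back -- possibly
    # directly from D, skipping B and C via tail calls -- A restores SP itself.
    def call_chain(memory, sp):
        saved_sp = sp      # A: save SP (conceptually, in a callee-save register)
        sp -= 5            # A pushes args for B
        sp -= 7            # B pushes a dynamically-sized vararg area for C
        # ... C tail-calls D, D returns directly to A without any cleanup ...
        sp = saved_sp      # A: restore SP; all the intermediate frames vanish at once
        return sp

    print(call_chain([0] * 64, 32))   # 32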

Q: "Everywhere it says that the called function doesn't know how many parameters have been passed, but in that case why isn't it possible to simply put that data into a register or push it onto the top of the stack for the called function to use?"

A: " There is no fundamental reason why the callee couldn't clean up space for variables. In most architectures, the standard calling convention doesn't handle variables this way, but that doesn't mean it's not possible to do so. For variable-length argument lists, you could either pass data about the number of arguments as a hidden parameter (like how this is handled in many object-oriented languages), or put a pointer on the stack that shows where the arguments end, etc. " -- https://stackoverflow.com/questions/19822045/what-prevents-the-callee-from-cleaning-up-the-stack

"The callee can clean variable arguments from the stack. I made it once, actually. But the code is pretty big.

Anyway, the main reason why in cdecl convention, the caller cleans the stack is other ...On some architectures (usually very small, as the old 8080 or 6800), there is no ret n instruction that to automate the stack cleaning and as a rule they can't make arithmetics with the stack pointer as well.

So, the callee have to first pop the return address from the stack in order to reach the arguments, then to pop all arguments and then push back the return address....When cdecl convention is used 2 instructions and one register use are spared:...And because for single language it is better to use single calling convention on all platforms, the CCALL as more simple and universal looks better. (C language is created in times when 6800 was a high tech).

But notice, that on these platforms, the assembly programs and the native languages (for example different kind of BASICs) usually use register argument passing which is of course much more faster on such small systems. " -- https://stackoverflow.com/questions/19822045/what-prevents-the-callee-from-cleaning-up-the-stack

some languages put (RAIIish?) destructors in their (caller?) cleanup routines, preventing TCO:

" Many Perl idioms rely on strict stack discipline. For example, implementing cleanup code as the destructor of an object held in a lexical variable in a particular stack frame. The result is that implicit tail-call optimisation would break all sorts of things. Hence tail-call semantics are only safe to invoke *explicitly*, which is what Sub::Call::Tail provides. " [7]

---

it sounds to me like callee cleanup is the way to go. TCO may be important, varargs probably not (can always store those in memory).

---

next question: store the first few regs in registers in Boot calling convention?

pro: faster, if regs are really regs

pro: this allows Boot syscalls with few arguments to avoid hitting 'main memory' (until we're in the actual implementation of the call)

con: in LOVM, seems cleaner to put args in smallstacks. That way they can be shifted similarly to 'input argument' and 'output argument' register window sections. if we have the first few args in registers and the rest in smallstacks, that's confusing

my leaning: at first i was leaning to not use regs at all (just stack) for a cleaner transition to lovm, but it galls me to have to hit memory just to do a syscall. So now i'm leaning towards using regs. I kinda think there might be a way to make the register window thing work cleanly anyway with the regs.. after all if you need to preserve the input args while you are writing the output args you'd just move them onto the stack right? So mb it's not that much worse. So i kinda think the thing to do is try using regs and reconsider if necessary after implementing x86_64 assembly version of Boot and LOVM (in the far far distant future).

decision: for now, use regs

---

next question: how many regs?

so, should we pass 2 or 3 args in registers?

if 3, then no space for a temp, and maybe we should add instructions: push-int-to-SP-and-decrement-SP and push-ptr-to-SP-and-decrement-SP

or at least add back in:

because currently, without a free temp register, in Boot, you can't even push to stack (you can't predecrement the SP without loading an immediate into a register)

otoh mb most of the syscalls we want to make can use only 2 (remember we have 2 banks, so 4 spots in total, so as long as both integers and ptrs are being passed, you get more than 2 arguments), so maybe it's okay to just pass 2 args in registers and leave a scratch register.

also, it sucks to add appm and ap32m in Boot, b/c (a) you may want to add two versions of them b/c we don't have signed immediates in oot, and (b) in LOVM they are a duplication because we can have a general instruction like this with both an immediate and register addr mode for op2.

regarding (a), well i guess an upward-growing stack is POSTincrement, which isn't a problem (copy the argument to the stack, then overwrite that register with the immediate, then add that immediate to the stack pointer (postincrement); now you have one free register); it's the predecrement convention in the downward-growing stack that's a problem. So you don't really need to add one version for each direction, even if the immediate imm7 is unsigned. So it's only 2 more instructions, and hey maybe they will be useful anyways b/c they will have a 7-bit unsigned immediate range rather than a 6-bit signed immediate range (so 5-bit unsigned).
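here's a rough sketch of that ordering argument in Python (registers as a dict, memory as a list; the register names and slot size are made up):

    # post-increment push on an upward-growing stack with no spare temp register:
    regs = {'sp': 8, 'r1': 42}   # r1 holds the value we want to push
    mem = [0] * 16

    mem[regs['sp']] = regs['r1'] # 1. store r1 at [sp] -- the value is now safe, r1 is free
    regs['r1'] = 1               # 2. reuse r1 to hold the immediate slot size
    regs['sp'] += regs['r1']     # 3. post-increment: bump sp *after* the store

    # with a pre-decrement (downward-growing) convention, sp has to move *before*
    # the store, so the immediate needs a register while r1 still holds the value --
    # which is exactly the temp register we don't have.
    print(regs, mem[8])          # {'sp': 9, 'r1': 1} 42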

so far i've looked at a few syscalls tho and none really need more than 2 args of one type for the most basic functionality:

open, read, malloc

i'm leaning towards just using 2 regs for arguments:

decision: 2 regs for now

---

next question (which arose while researching the previous one):

what to do when, depending on platform, APIs might return an integer or a pointer?

For example, on Linux, the open() syscall returns an integer file descriptor. On Windows, the CreateFileA syscall returns a ('file'?) handle, and the (older?) winbase.h OpenFile also returns a file handle.

https://stackoverflow.com/questions/902967/what-is-a-windows-handle

https://www.codeproject.com/Articles/1044/A-Handy-Guide-To-Handling-Handles

what does windows API mean by 'handle'? I like Dan Moulding's stackoverflow answer:

" A HANDLE is a context-specific unique identifier. By context-specific, I mean that a handle obtained from one context cannot necessarily be used in any other aribtrary context that also works on HANDLEs.

For example, GetModuleHandle returns a unique identifier to a currently loaded module. The returned handle can be used in other functions that accept module handles. It cannot be given to functions that require other types of handles. For example, you couldn't give a handle returned from GetModuleHandle to HeapDestroy and expect it to do something sensible.

The HANDLE itself is just an integral type. Usually, but not necessarily, it is a pointer to some underlying type or memory location. For example, the HANDLE returned by GetModuleHandle is actually a pointer to the base virtual memory address of the module. But there is no rule stating that handles must be pointers. A handle could also just be a simple integer (which could possibly be used by some Win32 API as an index into an array). "

This resurfaces the motivation for the ANY type registers that i had in there awhile ago but removed. Here we have something that:

interestingly, we already have something like this: codeptrs. If you think about it, in an interpreter a codeptr could be represented as an integer index into a list of instructions, rather than as an actual pointer into memory (with 'instructions' conceptually residing in a special 'code memory' which is not part of or accessible from data memory).
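a rough sketch of that in Python (the instruction format here is invented, just to show that a codeptr can be a plain integer index into a separate 'code memory'):

    code = [
        ('loadi', 'r1', 5),
        ('loadi', 'r2', 7),
        ('add',   'r3', 'r1', 'r2'),
        ('halt',),
    ]

    pc = 0            # a codeptr: just an index, never a data-memory address
    regs = {}
    while code[pc][0] != 'halt':
        op = code[pc]
        if op[0] == 'loadi':
            regs[op[1]] = op[2]
        elif op[0] == 'add':
            regs[op[1]] = regs[op[2]] + regs[op[3]]
        pc += 1       # codeptr arithmetic never touches data memory
    print(regs['r3']) # 12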

so should we (re)create a third register bank for these guys (what i used to call ANY type)? it might be sort of nice for the implementation to distinguish between these guys and dereferencable pointers: for garbage collection purposes. These guys could be 'boxed' with additional bits that say whether they are really integers or really pointers.

Disadvantages:

Advantages:

out of these, i think the most important considerations are:

so the proposal is:

i'm leaning towards this.

---

we may want to introduce alignment restrictions to prevent the problem noted above about another thread seeing an invalid pointer in the middle of a write? otoh if each thread has its own GC then the GC and the implementation can have a mutex for when the implementation main thread is writing to memory. if ptrs are opaque, how to align? mb would need an instruction for that

done

the stack also has alignment restrictions on some targets (eg 16-byte alignment on x86, at least on x64 windows)

i think that's an ABI thing though, so i guess we can ignore it in Boot (where we have Boot's own convention, rather than trying to interop with the underlying platform convention). And in LOVM we can abstract it.

---

what do garbage collectors do about pointers that may be invalid (e.g. a C pointer that points past the end of an array)?

https://www.hboehm.info/gc/04tutorial.pdf says

" Pointer validity check

elsewhere in those slides it's clear that memory allocation is done thru the GC, so the GC is aware of where the valid memory is

"Unlike [BoehmChase?92] we start with the assumptionthat the garbage collector recognizes all pointers to theinterior of an object, not just to the first byte of the object.Recent experience suggests that this is the right framework,particularly for typical C++ implementations whichimplicitly generate pointers to the interior of an object. Thetechniques of [Boehm93] can greatly reduce the danger ofspace leakage that we previously associated with thisapproach. This new assumption greatly simplifies our task.Unfortunately, it also usually invalidates assumption (A) of[BoehmChase?92], so our correctness argument has to bedifferent

...

We assume that the garbage collector recognizes any address corresponding to some place inside a heap allocated object as a valid pointer. ([Boehm95] satisfies these assumptions in its default configuration"

[8] "Simple Garbage-Collector-SafetyHans?-J. Boehm"

so it looks like it actually checks potential ptrs to see if they are in valid memory before it derefs them. oh man what a pain (but oh man so amazing)

---

"OurMagpietool? converts C source programs to cooperatewith precise GC ... Some constraints on the C program depend on the specific GCimplementation that is used with Magpie-converted code. The GCmay supportinterior pointers, i.e., pointers into the middle of a GC-allocated object, including to the end of the object. In practice, wefind that interior pointers are common in real C programs, whileprograms specifically adjusted to work with precise GC (such asthe PLT Scheme runtime system) can avoid them to help improveGC performance. Note that a pointer just past the end of an arrayis legal according to the C standard, and Magpie considers suchpointers to be valid interior pointers " [9]

---

also, mb i should allow <= comparison on pointer values. I didn't use to allow it because it may not make sense. But there are times (e.g. sort a list of pointers so that you can do binary search to determine the presence of a pointer in that list) when you really want to compare pointers, and you don't care if the comparison has any other meaning aside from being a total ordering.
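a rough sketch of that use case in Python (object ids standing in for opaque pointer values; the ordering has no meaning other than being total):

    import bisect

    objs = [object() for _ in range(100)]
    table = sorted(id(o) for o in objs)   # sort by the arbitrary-but-total ordering

    def contains(ptr_key):
        i = bisect.bisect_left(table, ptr_key)
        return i < len(table) and table[i] == ptr_key

    print(contains(id(objs[3])))    # True
    print(contains(id(object())))   # False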

what do other languages with opaque pointers/no pointer arithmetic do about pointer comparison? e.g. what does Golang do?

golang allows equality comparison only, but not greater-than, less-than: https://stackoverflow.com/questions/45818964/how-can-i-compare-pointers-in-go

---

later i decided to remove register/smallstack banking again -- see discussion in [[ootBootNotes1]].

---

so what about unified register files with floating point as ordinary gprs?

https://stackoverflow.com/questions/51471978/is-there-any-architecture-that-uses-the-same-register-space-for-scalar-integer-a

sounds contentious but seems like today's best practice is to bank them, eg separate floating point

also most mechanical sympathy on popular archs is to bank fp

so bank the fp regs, but not the smallstacks (b/c fps on the stack are in the same linear memory substacks as everything else). btw https://stackoverflow.com/a/52175510/171761 suggests having more caller-save registers and fewer callee-save regs in the fp bank, unlike his suggestion for the integer bank

--- old

maybe we can get away with having the primary stack be above the stack pointer and the output stack be below the stack pointer, and pushing to the output stack all at once in a calling sequence. Alternately, since the primary stack and output stack capacities are fixed, if you save those capacities in a register then you can offset the beginning of the output stack from the frame pointer.

yeah I think we need a register to hold the frame allocation information in. At least this is cheaper than fixed-size register windows where you always have to push a certain window size onto the stack. Again consider making those allocation quantities powers of two -- then maybe we really can fit in all the variables we may need. I think we do need an extra register for holding the primary stack position. Maybe reserve a register for the PC. I'm still thinking if there's a way for the Boot calling convention to put arguments in registers. I guess one way is just to use the out stack addr mode as registers! That would imply that the recipient, not the caller, is the one who shifts out to local. Also maybe out should be registers, not a stack, even in lovm. Consider however that a nice thing about Forth was that function calling was cheap (eg no callee-saving etc to be done).

Maybe you can simplify all of this by having one stack and specifying how many output args there are when you call the next function

mb don't even let the program read the stack pointer when the stack is not flushed; that way we don't have to have an extra register for the tip of the cached stack

The thing is, if it's one stack then that stack is interrupted by the return address and the frame pointer and any other callee saves, local variables, or other junk we have in the stack frame. So that makes it inefficient to translate a stack position to an address. If it's two stacks then it's inefficient to push more to the argument input stack, because directly after that is the stuff in our stack frame. This suggests that the argument input stack should be mapped to registers, and the only stack should be the out stack

So callee cleanup means that the input arguments are cleaned up too, right? So our output arguments are passed where the input arguments were on the stack?

I think what you're looking for is that out could really be registers. We could rely on the callee to flush out to the stack if needed.

But in that case, how do we use the stack pointer in boot?

I guess maybe if the program never touches the stack pointer then this stuff is abstract

---

i can't think of a good way for the LOVM calling convention to match up with a Boot all-register calling convention without actually moving the smallstack addr modes from LOVM to Boot. In which case we could do this:

caching non-guarantees:

some benefits to this:

done

---

so is there a way to do the previous without actually importing stack cache addr mode semantics into Boot?

The issue is probably that we want to allow the implementation to choose to store the smallstacks in memory, or to store them in regs (or in a memory cache). But if they are in memory then that's visible to Boot. But they may or may not be in memory. This indeterminacy is visible and so must be mentioned in the spec.

Can we just tell Boot to put things on the stack in memory? If we want interoperability (programs mixing Boot and LOVM functions), that's hard for the implementation to detect and map into smallstacks. Imagine if a Boot function called a LOVM leaf function. The LOVM leaf function would not feel a need to flush anything (and we don't want to make it), because it wants to just use the output stack as its arguments, since it doesn't need fresh output stack for its own outgoing arguments. It can just not adjust the frame pointer, and offset its own locals off the stack pointer. But the Boot program put the input arguments onto the stack in memory.

Is there any point keeping LOVM separate from BootX, if we are putting all these complicated semantics into Boot?

There may be another way. It may be hard to detect a write to the stack pointer in general, but if we have a PUSH instruction, or even better a PUSHMEMSTACK instruction, and a POPMEMSTACK instruction, that is easier to detect. The implementation could intercept that instruction and have it write to the output smallstack cache instead. We could have flushStackToMemory and flushMemoryToStacks instructions and specify that until they are called, as above, there are caching non-guarantees. The benefit of these gymnastics is that we maintain our load-store architecture and our address-mode-less architecture in Boot, rather than having these memory-associated locations be targetable by any operand.

why just push and pop? perhaps push, pop, read, and write. In fact, maybe just read and write -- then we can add the requirement that you flush before moving the stack pointer, which may make things even simpler for the implementation, b/c it doesn't have to keep track of stack ptr moves. Then we can drop this requirement for LOVM.

seems like this may make it harder to abstract stack frames later though? but maybe we are saving that for OVM.

i feel like there is a more abstract way to do this. But i can't think of anything that wouldn't be too abstract for Boot.

done (register-associated cache)

---

LOVM is looking more and more similar to Boot. Maybe just merge LOVM and BootX? (could use the name LOVM for something further up the stack later, as previously planned)

---

so i was thinking about calling conventions. I was hoping there was a way to do all of:

i don't think it's possible. In fact, i don't think even the first three are possible simultaneously:

now, the main issue of them having different calling conventions is: what about the syscalls? Boot is supposed to be a subset of LOVM, so the syscalls should work the same in both cases.

which is also the reason we wanted to pass arguments in registers

so, we could just say that Boot passes stuff on the stack, but that means you have to have available stack memory to do a syscall, and also that you have to take the performance hit of hitting memory to do a syscall.

---

how does linux do syscalls on x86?

there are three types: legacy syscalls with the 'int' instruction, 'fast' 32-bit syscalls with the SYSENTER instruction, and 'fast' 64-bit system calls with the SYSCALL instruction.

With legacy syscalls, the syscall number and up to 6 args are passed in registers [10]

with fast 32-bit system calls, the syscall number and up to 5 args are passed in registers [11]

with fast 64-bit system calls, the syscall number and up to 6 args are passed in registers [12]

how does windows nt do syscalls on x86? presumably it uses the same instructions; but user code calls the windows API using ordinary library calling conventions. The Win32 API uses the stdcall calling convention: https://en.wikipedia.org/wiki/X86_calling_conventions#stdcall which passes arguments on the stack, not registers.

not sure if there is a 'win64' API or if it has a different calling convention. https://wiki.freepascal.org/Win64/AMD64_API#Calling_convention suggests that it does, and that this one passes the first 4 params in regs. https://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions says the same. Although the win64 calling convention passes the first 4 args in registers, it also requires that 32 bytes of 'shadow space' be allocated for them on the stack.

so my conclusion is, yes, there does seem to be a need to pass syscall args in registers

---

so i think we're back to what i was planning above:

my idea about a loadwithcache, storewithcache for every register, not just SP -- altho in Boot, to save operands (so that we could go back to a 16-bit Boot if we wanted), only require the implementation to accept these with register SP (so that they can be elided to their two-operand forms).

done (register-associated cache)

---

just some random thoughts on bitwise data manipulation after a dream i had

this first paragraph is just me writing down whatever nonsense was in my head right after waking up: Predication with two lanes (so each bit can specify a lane). Forwards and backwards in time? Mb not. Each instruction has an and and an or. Src/dst? Arc?. Not sure if this is right but 2 lanes * (src and or dst) where each of src dst and or is 4 bits would be 32 bits. Quantum computing with all qubits restricted to the standard basis?? Quantum computing with just one qubit??

Note that a qubit has two degrees of freedom. I think 16 bits can specify one 2-qubit quantum gate (64 bits would be needed to specify one three qubit quantum gate I think). I think four bits can specify one 1-qubit quantum gate (so can't achieve quantum universality this way). Note that in actual quantum computing The elements of the gate matrix are real numbers not binary, so to reiterate, we are only talking about a subset of this.

In addition the numbers in the matrix and quantum computing can be complex. And in addition the matrix must be unitary. So my accounts above seem not quite correct. What if we restrict to 1 0 i -i, and unitary? How many bits does it take to specify a one wire quantum gate?

Mb for each bit input (wire input) you can have 4 choices: f t x -x.

A classical gate on bits with two inputs and two outputs has four input conditions and for each of those specifies one of four possible output conditions. So: 4^4 possible gates (and the table consists of four pairs of two output bits each, so eight bits).

So to specify a gate with two bits of input and one bit of output is four bits. So if we have two predicate bits then we can specify any combined predication with four bits, and we can specify any gate on the bits themselves with 8 bits.

Src and dst could be registers for a conditional move, or they could be a choice of predication bits out of a larger set of predication bits

Okay but moving on from the dream and just thinking rationally: if you have two predication bits p0 p1, what would be some good things to do with that? you probably don't need to predicate each instruction with the full 4-bit comprehensive conditional over those two bits. Instead you might have two bits for each instruction to specify one of: if p0, if p1, if p0 or p1, if p0 and p1. then you might occasionally have an 8-bit predication gate on the predication bits? Not quite right yet probably

I guess if you have 16 predication bits then in 16 bits you can say: specify bit number one, specify bit number two, specify gate. 4 bits + 4 bits + 8 bits for a total of 16 bits.

Another thing you can do with 16-bit instructions and 16 1-bit registers is specify each of two source operands and one destination operand (12 bits total) and then specify one operation in 4 bits.

And remember that the original idea with two lanes was that each instruction has two stop bits which determine whether it can be executed in parallel with the previous instruction in that lane (an instruction with both stop bits set waits on both lanes, so it synchronizes)

Other fun things that you can do with a single register of many bits is permute the bits, or permute-with-replacement. either of those takes many bits to specify the permutation though; for a 16-bit register, permute with replacement means making 16 choices 16 times, so 4*16 = 64 bits. Since we're certainly not going to have instructions greater than 64 bits at this level, that suggests that a 16-bit bit register is too many bits. What about a 4-bit bit register? Permute with replacement is now four choices four times, which is 8 bits. And the two-wire gate with source and destination operands out of four choices requires 8 bits for the gate and 4 bits for the operands for a total of 12 bits. A three-wire gate with two inputs and one output, and four bit registers, requires four bits for the gate and six bits for the operands for a total of 10 bits. so now we're getting to The sorts of instructions that we can comfortably fit in 16 bits.

Regarding the stop bits and the two lanes: if you apply this to all instructions without other constraints, you are providing redundant information, because all you are doing is specifying dependencies between instructions, and the data dependencies already specify the necessary dependencies. But the time this would really be helpful would be in memory operations, because with memory operations you might want to specify ordering dependencies between operations on different memory locations, and that is not implied by the instructions themselves.

And now let's look at if we only have two bit registers. now we don't need permutation because the 2-bit (2-wire) gate operations contain this, nor do we need three operand instructions because again the 2-wire gate operations do everything. So in fact all we have to consider is the 8 bits to specify a two-wire gate. Furthermore to specify any function on these two bits with a one bit output takes four bits (one bit for each possible input condition). For predication however it really seems like there's some operations which are more useful than others. If bit 0, if bit 1, if bit 0 and bit 1, if bit zero or bit one, if true, if not bit 0, if not bit 1, ??. So three bit predication would be quite useful here, Even though it would take four bits to be complete. And of course two bit predication would be pretty good too: always, If bit 0, if bit 1, if bit 0 and bit 1. Or how about: always, If bit 0, if not bit 0, If bit 0 and bit 1. Yes, if you only have two predication-specification bits per instruction, the last one is more useful, because if you have an 'if' conditional that you are trying to express without branching, being able to do something exactly when bit 0 is 0 and also something else exactly when bit 0 is 1 is useful.

Source registers and destination registers with complete gate specification waste bits, because many gates either ignore one or both source operands, or are the same as other gates except that the order of the source operands is switched. It is less redundant to have opcodes and operands, with multiple instruction encodings to support opcodes that don't need both source operands.

Gates make sense, however, if the operands are fixed, like wires.

---

to summarize the most useful ideas from the above:

---

just to review my current ideas for the stack:

---

i think we should reintroduce an LOVM layer. It will be more like WASM in that there are a lot of 'locals', rather than just 32 registers. Then:

---

How does WASM handle limits on the number of function arguments?

There are no limits; the WASM 'call' instruction consumes arguments from the WASM stack, which is afaict unbounded.

---

so what should the encoding format of LOVM be? Completely variable-length bytecode, or 64-bit? Leaning towards 64-bit.

i have 3 ideas for 64-bit:

nice things about 12 operand bits:

nice things about 10 operand bits:

otoh LOVM's goal is to be on the level of WASM/LLVM so maybe it should just be variable-length to do away with the remaining unusual 'limits' on #locals, #fn args, etc.

should LOVM (or its language, Lo) have wren-like efficient classes?

---

ideas about how 8 bits could be used to represent types for polymorphism:

but really we probably can limit polymorphism to having the destination operand type being determined once the two source operand types are determined. So now we can do:

4 bits is enough to encode 16 types which is just enough to fit most languages' unboxed primitive types.

We could also just have one 'type' with 256 choices. Most of the time there needs to be only one type to polymorph on; e.g. ADD uint32 uint32 -> uint32 vs ADD int16 int16 -> int16; e.g. OOP 'receivers'. This lets us leave most of the types undefined as 'custom' in the spec and let the program interpret what they mean.
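a rough sketch of the single-type-operand idea in Python (the type codes and variants here are made up for illustration; undefined codes could be left as 'custom' types for the program to interpret):

    def wrap_signed(v, bits):
        # reinterpret the low 'bits' bits of v as a signed integer
        v &= (1 << bits) - 1
        return v - (1 << bits) if v >> (bits - 1) else v

    ADD_VARIANTS = {
        0: lambda a, b: (a + b) & 0xFFFFFFFF,    # uint32
        1: lambda a, b: wrap_signed(a + b, 16),  # int16
        2: lambda a, b: wrap_signed(a + b, 32),  # int32
    }

    def add(type_code, a, b):
        # one ADD opcode; the type operand selects the polymorphic variant
        return ADD_VARIANTS[type_code](a, b)

    print(add(1, 0x7FFF, 1))      # -32768 (int16 wraparound)
    print(add(0, 0xFFFFFFFF, 1))  # 0 (uint32 wraparound)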

---

in Go, pointers can only be compared even for equality if they are the same type (or the null ptr):

https://go101.org/article/pointer.html

---

it's a good idea to recommend that the sp and fp r.o.c.s be the same size as each other, even though in leaf functions you don't need to use fp at all, because otherwise if the fp cache was smaller than the sp cache, then in non-leaf functions when you do 'cp fp sp' in the prolog, you'd have to spill the excess items from the sp cache to memory, negating the purpose of the r.o.c.s (to give similar performance as storing args in registers).

---

many of the odd things about Boot/BootX directly follow from the desire to make Boot/BootX programs portable over different target platforms, while still allowing for the possibility of performant compiling implementations without a large runtime/memory footprint:

---

with a stack cache, when the stack is pushed, the shallowest item of the stack cache becomes dirty. If there were already dirty items at/near the top of the stack, they get pushed down, so there is a contiguous region of dirty items starting at the shallowest and possibly continuing down some ways. To summarize, when the stack is pushed, the shallowest items become dirty, down to some depth.

when the stack is popped, the deepest items become stale, up to some depth.

So keeping track of this requires enough state to hold two numbers, each of which needs as many bits as it takes to encode the depth of the stack cache.

so if the stack cache is 16 items, that's 2*4 = 8 bits total. And if there are two stack caches, that's 16 bits total.
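a rough sketch of that bookkeeping in Python (the cache size, the exact dirty/stale update rules, and the bit packing are all illustrative, not a spec):

    class StackCache:
        SIZE = 16

        def __init__(self):
            self.items = [None] * self.SIZE
            self.dirty = 0   # items [0, dirty) need flushing down to memory
            self.stale = 0   # items [SIZE - stale, SIZE) need refilling from memory

        def push(self, value):
            self.items.insert(0, value)
            self.items.pop()                           # deepest cached item falls off
            self.dirty = min(self.dirty + 1, self.SIZE)
            self.stale = max(self.stale - 1, 0)

        def pop(self):
            value = self.items.pop(0)
            self.items.append(None)                    # deepest slot is now garbage
            self.dirty = max(self.dirty - 1, 0)
            self.stale = min(self.stale + 1, self.SIZE)
            return value

        def packed_state(self):
            return (self.dirty << 4) | self.stale      # 2*4 = 8 bits for one cache

    c = StackCache()
    c.push(10); c.push(20)
    print(c.pop(), bin(c.packed_state()))   # 20 0b10001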

is doing all that calculation upon stack cache access actually faster than just hitting main memory, which is presumably in L1 cache anyways? https://stackoverflow.com/questions/10274355/cycles-cost-for-l1-cache-hit-vs-register-on-x86 and https://stackoverflow.com/questions/4087280/approximate-cost-to-access-various-caches-and-main-memory https://stackoverflow.com/questions/14504734/cache-or-registers-which-is-faster suggest it might be (these sources suggest about a 3 cycle penalty for reading from l1 cache (that is, in the case of a hit) vs a register, and i bet we can do whatever extra manipulations we need in 3 cycles or less) but it's unclear. But, it should allow optimizers to optimize better, since they can assume that modifications to the stack cache don't need to be visible in main memory before a flush (this allows them to model the stack cache as registers, and model the flushes and reads or writes to all of those registers at once).

note: if you were programming a cache like this, it would be useful to have a predicated-load-from-memory instruction; because if someone reads from the cache, and you want the cache to transparently read from memory when it contains garbage, then you'd ordinarily need to test for staleness and then branch; so you could avoid the branch if you had a non-branching-conditional-read

yknow actually, if there's a 3-cycle penalty for L1 cache hits, then maintaining a 4 item cache is actually a little slower than memory, because e.g. upon a popc, the implementation has to:

cp dest <- cache0
cp cache0 <- cache1
cp cache1 <- cache2
cp cache2 <- cache3
mark cache3 as stale (possibly just load the next value; or possibly multiple quick operations to increment a count packed into a register keeping track of dirty and stale cache stuff; or at least one operation to set a particular bit in that register)

that's at least 5 operations, vs. at least 3 for L1 cache hit.

there's other benefits to stack caching though (easier program optimization b/c can assume that changes to cache aren't visible in memory unless flushed), so it may still be worth it.

however this suggests that a sweet spot for # of items in the stack cache may be 4.

Also, if you had two caches (sp and fp), and for each cache you had one bit for each item to indicate staleness, and one bit for each item to indicate dirtiness, and you had 16 items, then that would be 64 bits. Which suggests that you don't want more than 16 items in the cache, because then that implementation strategy would take up multiple platform registers. And 16 is what the stack cache and frame cache addressing modes can address anyways.

---


i think instead of using memcpy to copy blindly, we should go back to having an opaque type (what to call it? opaque? black-box? hidden? blind? byte? any? probably 'opaque' or 'hidden'). That lets us move the memcpy syscall into BootX.

mb also move 'halt' into BootX? The thing is, Boot programs are going to have to have implementation-defined conventions for argument and result passing anyhow. Otoh, for a compiler or interpreter, you already have an I/O stream, so mb not. Perhaps make 'halt' an instruction, then? I'm trying to avoid defining the availability of the stack. Then again, maybe having a stack is okay.

Doesn't a compiler need 'malloc' anyways? Or at least sbrk, or a data segment register?

one argument for memcpy is that imagine an implementation using capabilities (and perhaps tagged memory) where some words in memory hold special capabilities. But maybe if you load a capability into an ordinary register and then move it into a different register and then store it, it loses its special capability-ness. Maybe you have to do something special to move capabilities in memory. So our memcpy could do that.

eh, i guess it's not so clear. i'll put this aside.

---

yknow, as simple as it is, the 32-bit encoding is just too much work to implement for Boot. For me, it already took some extra time to implement encoding in basm, so imagine how hard it would be for someone who is not comfortable with bitwise operations.

There is not really a way to make it simpler -- the need for the 3 format encoding bits in one place means that either at least one field must be split in an awkward way in between bytes, or we restrict the opcodes, or we restrict one of the operand ranges, or one of the operands gives up addr modes.

we already decided that we don't care if Boot code can interoperate with BootX code because everything is going to be implemented not in Boot but rather in BootS, the Boot subset of BootX, and then compiled from BootS to Boot. Which is why we were able to move the register-offset cache stuff out of Boot into BootX -- we just decide that the way to compile BootS to Boot is the uncached style of r.o.c. implementation (that is, roc instructions accessing the 'cache' just directly access stack memory instead).

so there is not much lost by saying that BootX has a different encoding than Boot. Now we can make it simple.

---

 "
 Iron-code Summary•Section A.2—Use general-purpose registers with a load-store architecture. •Section A.3—Support these addressing modes: displacement (with an address offset size of 12 to 16 bits), immediate (size 8 to 16 bits), and register indirect. •Section A.4—Support these data sizes and types: 8-, 16-, 32-, and 64-bit integers and 64-bit IEEE 754 floating-point numbers. –Now we see 16-bit FP for deep learning in GPU•http://www.nextplatform.com/2016/09/13/nvidia-pushes-deep-learning-inference-new-pascal-gpus/•Section A.5—Support these simple instructions, since they will dominate the number of instructions executed: load, store, add, subtract, move register-register, and shift. •Section A.6—Compare equal, compare not equal, compare less, branch (with a PC-relative address at least 8 bits long), jump, call, and return. •Section A.7—Use fixed instruction encoding if interested in performance, and use variable instruction encoding if interested in code size. •Section A.8—Provide at least 16 general-purpose registers, be sure all addressing modes apply to all data transfer instructions, and aim for a minimalist IS–Often use separate floating-point registers. –The justification is to increase the total number of registers without raising problems in the instruction format or in the speed of the general-purpose register file. This compromise, however, is not orthogonal.
 "
 -- https://passlab.github.io/CSCE513/notes/lecture03_ISA_Principles.pdf

see also slide Top 10 Instructions for 80x86 in https://passlab.github.io/CSCE513/notes/lecture03_ISA_Principles.pdf

(from most to least freq: load, conditional branch, compare, store, add, and, sub, move register-register, call, return)

---

"Most loops go from 0 to n." [13]

---

implemented Boot in Python again on Nov 26 (Thanksgiving) and Nov 27 (commit 39df6da; "pyboot3").

---

is it worth the more complicated thing i thought of that allows 32-bit integers, 64-bit integers, and pointers to have different sizes on the stack?

https://developer.arm.com/documentation/ihi0055/latest/ suggests they might use 64-bit 'stack slots'. https://techpubs.jurassic.nl/manuals/0630/developer/Mpro_n32_ABI/sgi_html/ch02.html uses fixed-size stack slots. Looks like here stack arguments are aligned to 32 bits: https://stackoverflow.com/questions/60959847/why-are-parameters-arranged-this-way-on-the-stack-when-a-function-is-called . Looks like here everything on the stack is 64 bits: http://www.windbg.xyz/windbg/article/111-64bit-Calling-Convention . Looks like here everything on the stack is 32 bits (except 64-bit quantities take up two 'slots'): https://www.tenouk.com/Bufferoverflowc/Bufferoverflow2a.html . This mentions 8-byte stack slots: https://developer.apple.com/documentation/xcode/writing_arm64_code_for_apple_platforms . https://www.lri.fr/~filliatr/ens/compil/x86-64.pdf says that "The pushq and popq combine a move with an adjustment to %rsp. Note that the stack should stay 8-byte aligned at all times".

So it sounds like everyone else just has uniform-sized stack slots that are the same as the machine word size, even in native calling conventions. So maybe i should do that too.

mb i should also require that any Boot object size larger than the word size should be a multiple of the word size?

---

on using C# for low-level stuff: "since C# 7.0 or so: Span<T> for example is a revolution; But I use things like stackalloc, ref, fixed, Unsafe.CopyBlockUnaligned, & MemoryMarshal.GetReference etc, on a daily basis to get that extra performance where it matters;"

---

a big issue with tradeoffs is bitwidth, slots, and padding. I'm going to write it down now while i still remember.

The most space-efficient (memory-efficient) thing to do would be to represent each item in memory with the minimal size needed to hold it. So an 8-bit value would occupy 1 byte, a 16-bit value occupies 2 bytes, a 32-bit value occupies 4 bytes, a 64-bit value occupies 8 bytes, and a pointer occupies the size of a pointer (usually 8 or 4 bytes).

However, how much the minimal space is depends on the platform. The most common hardware platforms today always have 8/16/32/64-bit integers as 1/2/4/8 bytes, and have pointer sizes of 32- or 64-bit. Even the most common embedded/MCU platform, ARM, has 32-bit pointers. But we are also targeting HLL platforms here. So think of memory as a Python array. We want to be able to put an arbitrary Python reference in a cell of "memory". But in this case 1 single cell of memory can also hold an entire 64-bit value (or a 32-, 16-, or 8-bit one). So here the minimum is smaller than 1/2/4/8 cells of memory, and the ratios between 8-, 16-, 32- and 64-bit value storage spaces (the ratios of 16-, 32-, and 64-bit storage spaces to 8-bit storage spaces are all 1 on HLL platforms) differ from the hardware platform case (where those ratios are 2/4/8).

In Boots code we adapt to this by assuming that INT8_SIZE is 1, and storing the constants INT16_SIZE, INT32_SIZE, INT64_SIZE, PTR_SIZE in sinfo. So an interpreted Boots program must do run-time arithmetic to translate statically known memory layout into bytes. A compiled Boots program can do this calculation at compile time (if the compiler is smart enough to propagate constants and do the constant arithmetic); but the key concept is that Boots itself can't represent memory layout in terms of a single unit of bytes, it can only represent it as a linear combination of these many size constants.
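a rough sketch of what that linear-combination arithmetic looks like in Python (the struct and field names are made up; only the size-constant names come from the sinfo idea above):

    LAYOUT = [('flags', 'INT8_SIZE'), ('count', 'INT32_SIZE'), ('next', 'PTR_SIZE')]

    def offset_of(field, sizes):
        off = 0
        for name, size_const in LAYOUT:
            if name == field:
                return off
            off += sizes[size_const]     # run-time arithmetic when interpreted
        raise KeyError(field)

    hw64 = {'INT8_SIZE': 1, 'INT32_SIZE': 4, 'PTR_SIZE': 8}   # 64-bit hardware target
    hll  = {'INT8_SIZE': 1, 'INT32_SIZE': 1, 'PTR_SIZE': 1}   # HLL 'one value per cell' target

    print(offset_of('next', hw64))  # 5
    print(offset_of('next', hll))   # 2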

That's all well and good, but when we get to Boot, for better code density, we'd like to use addressing modes to encode immediate offsets to things in the instruction encoding. For example, we'd like to reference things like: "the 3rd item from the top of the stack", "the fourth item in the struct pointed to by register r10". In other assembly languages, we'd specify these offsets in bytes. But the offset, in bytes, from the start of a struct to the 4th item in that struct depends on the size in bytes of the 1st, 2nd, and 3rd items in that struct -- similarly, the offset, in bytes, of the third item from the top of the stack depends on the size in bytes of the 1st and 2nd items on the stack above it.

So instead we have to specify offsets as linear combinations of those size constants. But now we run into a problem, because in our Boot 32-bit instruction encoding, bits are at a premium. For example, if you have 4 bits to specify an offset, you can specify a choice of 16 byte offsets, but you can't even specify one bit for each of INT8_SIZE, INT16_SIZE, INT32_SIZE, INT64_SIZE, PTR_SIZE -- that would be 5 bits already.

1) One solution to this is to give up on size-efficiency and size-expressivity on some platforms; for example, the VM spec specifies that INT8_SIZE, INT16_SIZE, INT32_SIZE, INT64_SIZE, PTR_SIZE = 1,2,4,8,4. That's great for 32-bit hardware platforms, but on 64-bit hardware platforms, only 4GB of memory can be addressed, and on HLL platforms, everything except INT8_SIZE uses way more memory cells than it needs.

2) Another solution to this is to give up on portability, and require the program to be specialized to a given target, with that target's own set of storage requirements or at least ratios; so for example when targeting a 32-bit hardware machine, the program assumes that INT8_SIZE, INT16_SIZE, INT32_SIZE, INT64_SIZE, PTR_SIZE = 1,2,4,8,4; when targeting a 64-bit hardware machine, it assumes 1,2,4,8,8; and when targeting a HLL, it assumes 1,1,1,1,1.

3) A third solution is to pad small values into 'slots', and express everything in terms of slots. For example, the VM spec specifies that immediate offsets are in units of max(INT8_SIZE, INT16_SIZE, INT32_SIZE, INT64_SIZE, PTR_SIZE). So, for the offsets to be useful, the program must lay out the relevant memory into chunks of the slot-size; when a smaller-size value must be stored there, it needs to be padded to the slot-size. This is really just another way to say solution (1), but with a focus on having larger slots and a lot of padding; e.g. if slots are defined so that everything fits into 1 slot, then that's like specifying INT8_SIZE, INT16_SIZE, INT32_SIZE, INT64_SIZE, PTR_SIZE = 1,1,1,1,1. Or if it's specified that everything fits into 1 slot except for 64-bit integers, that's like specifying INT8_SIZE, INT16_SIZE, INT32_SIZE, INT64_SIZE, PTR_SIZE = 1,1,1,2,1. The key here is that there's only a single base constant, and a bunch of ratios, so you don't need to express immediate offsets in terms of a linear combination.

4) The third solution, slots, can be done in a half-way kind of way too; have multiple constant bases (slot types) but not as many as there could be. For example, instead of specifying offsets as a linear combination of INT8_SIZE, INT16_SIZE, INT32_SIZE, INT64_SIZE, PTR_SIZE, specify them as a linear combination of INT32_SIZE, PTR_SIZE. If we have 4 bits for the offset, now each of these gets 2 bits, so we can specify at least 4 separate offsets in any situation, and in some cases (where the offsets are jumping over both pointers and ints) we can specify up to 8. In some cases, even if you were okay with using non-portable code, this could produce higher code density, because normally a 4-bit quantity would allow you to index at most 16 bytes, and if e.g. PTR_SIZE=8, then that would only be 2 PTRs, but this scheme lets you reach at least 4 PTRs (and at least 4 INT32s, but only 1 INT8)

Note that with this solution there should be a sinfo defining max(PTR_SIZE, INT32_SIZE). Might be good to have that anyways.
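a rough sketch of the solution-4 encoding in Python (the 2+2 bit split and the field order are just one possible choice):

    def encode_offset(n_int32, n_ptr):
        assert 0 <= n_int32 < 4 and 0 <= n_ptr < 4
        return (n_ptr << 2) | n_int32        # 4 bits total in the instruction

    def resolve_offset(encoded, int32_size, ptr_size):
        n_int32 = encoded & 0b11
        n_ptr = (encoded >> 2) & 0b11
        return n_int32 * int32_size + n_ptr * ptr_size

    enc = encode_offset(1, 2)                # skip one INT32 and two PTRs
    print(resolve_offset(enc, 4, 8))         # 20 on a 64-bit hardware target
    print(resolve_offset(enc, 1, 1))         # 3 on an HLL target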

5) Provide a way to define struct layout in the language, with lexical scope, and have offsets refer to items in a particular struct (perhaps all offsets encountered within some linear region of code refer to one struct layout)

6) Provide an architectural register that maps offsets to linear combinations of INT8_SIZE, INT16_SIZE, INT32_SIZE, INT64_SIZE, PTR_SIZE by essentially encoding a struct layout relative to offset 0. So, for example, if you want to compute 2*INT16_SIZE + INT32_SIZE + 3*PTR_SIZE, you might put "2*INT16_SIZE, INT32_SIZE, 3*PTR_SIZE" into that register and then set the offset to 6 (the idea is that you are telling the language the layout of items in memory and then telling it how many items in total to jump over, as if you defined a struct and were indexing into it). This is like (5) but with the struct definitions dynamically scoped rather than lexically scoped.

7) Give up on portability, but just for offsets

8) Like (4) but with INT8_SIZE (which is =1) and PTR_SIZE as the provided bases; if the runtime wants to use offsets to jump over INT32_SIZE, it must wastefully make use of the Boots guarantee that INT32_SIZE <= 4; but with 4 bit offsets, not even a single INT32 can be jumped over in this fashion. On platforms where INT32_SIZE is less than that (HLLs), this will waste space for INT32s by using more memory locations than needed (but, on such platforms, it probably matters less; and in places in the program where it does matter, just don't use offsets).

9) Like (4) but with: one bit for 1, one bit for 2, one bit for 4, one bit for PTR_SIZE. Using possibly wasteful encoding of INT16, INT32, this allows you to jump over: 7 INT8s, 3 INT16s, 1 INT32, 1 PTR, or any combination of these. This may be the best if a mixture of all 4 of these types is likely. Note that we can't hop over even a single INT64.

10) Of course even if you have a flat offset (4 bits specifies an offset of up to 16 memory locations), you can do what solution 9 does when you only are hopping over ints, but not PTRs.

11) Don't use the offsets. RESERVE them for lovm/lava's use (where they can be an index into fields of lexically defined structs).

Solution 4 seems like a good tradeoff between simplicity and efficiency, and between efficiency of instruction encoding (code density) and efficiency of memory storage representation. But it has problems:

This means that in solution 4, it's important to choose the special integer bitwidth right. All 3 obvious choices (int16, int32, int64) have their merits:

Choice int8 is equivalent to solution #8.

What do other systems do?

Seems like on the stack, the x86_64 linux calling convention pads everything to 8 bytes (so solution 3; 8,8,8,8,8). The JVM stores everything in 1 slot except for int64s (and float64s) which take 2 slots (so 1,1,1,2,1).

windows appears to pad to 8 bytes:

http://www.gamesetwatch.com/StackFrameAnatomy.png (from https://www.gamasutra.com/view/news/178446/Indepth_Windows_x64_ABI_Stack_frames.php)

Even the ARM calling convention pads everything on the stack to 4 bytes (so 4,4,4,4,4):

" If the argument is an integral Fundamental Data Type that is smaller than a word, then it is zero- or sign-extended to a full word and its size is set to 4 bytes. If the argument is a Half-precision Floating Point Type its size is set to 4 bytes as if it had been copied to the least significant bits of a 32-bit register and the remaining bits filled with unspecified values.

B.3.cp If the argument is a CPRC then any preparation rules for that co-processor register class are applied.

B.4 If the argument is a Composite Type whose size is not a multiple of 4 bytes, then its size is rounded up to the nearest multiple of 4. "

-- Procedure Call Standard for the Arm Architecture - ABI 2020Q2 documentation

So everyone seems to be prioritizing uniformity over space efficiency, at least on the stack. Although their reasons may be different from ours; they want things to be aligned, for perf reasons (which perhaps should concern us too).

however, Apple, which is famously memory frugal, apparently permits even 1-byte arguments on the stack:

https://developer.apple.com/documentation/xcode/writing_arm64_code_for_apple_platforms says: "Function arguments may consume slots on the stack that are not multiples of 8 bytes. If the total number of bytes for stack-based arguments is not a multiple of 8 bytes, insert padding on the stack to maintain the 8-byte alignment requirements. ... The caller of a function is responsible for signing or zero-extending any argument with fewer than 32 bits. The standard ABI expects the callee to sign or zero-extend those arguments. ... The following example illustrates how Apple platforms specify stack-based arguments that are not multiples of 8 bytes. On entry to the function, s0 occupies one byte at the current stack pointer (sp), and s1 occupies one byte at sp+1. The compiler still adds padding after s1 to satisfy the stack’s 16-byte alignment requirements.

void two_stack_args(char w0, char w1, char w2, char w3, char w4, char w5, char w6, char w7, char s0, char s1) {} ...(in a section on variadic arguments:) The C language requires the promotion of arguments smaller than int before a call. Beyond that, the Apple platforms ABI doesn’t add unused bytes to the stack. ... "

is it important to care about memory efficiency on the stack? Yes, sorta:

https://stackoverflow.com/questions/23472055/is-it-ok-to-allocate-big-structures-on-stac "It depends on your software architecture and your target.

If you are writing kernel code you don't want to use a lot of stack space; Linux kernel stacks are small.

User programs often have a 1 Mb or larger stack, so unless you have a lot of recursive routines, allocating a couple of kB on the stack usually isn't a problem."

how small are Linux kernel stacks? 8k (there was an attempt to move to 4k but it failed: https://elinux.org/Kernel_Small_Stacks#ARM_4K_Stacks "simply not enough space for the deep 60+ function call stacks we see with even trivial storage stack configurations"):

https://stackoverflow.com/questions/6270945/linux-stack-sizes

x86's 'push' instruction basically only supports the word size (sorta): https://stackoverflow.com/questions/45127993/how-many-bytes-does-the-push-instruction-push-onto-the-stack-when-i-dont-specif

Leaning towards solution 4, with int32 (so int32 and ptrs; so 1,1,1,2,?, because you still have two bases). This makes me sad though, because it means that we can't make good use of memory in a massively parallel 16-bit machine (or, we could not use those special immediate offset addr modes, and sacrifice code density, which is also memory). But, practically speaking, that's many years away, so i should probably be practical about it and give up on 16-bit. I mean, almost nothing on the market now is 16-bit; ARM and RISC-V are both 32-bit. And a toolchain with 16-bit addressing can't easily deal with many of today's programs, which tend to have executables (and in-memory data structures during runtime) larger than 64k.

If we do what everyone else does, though, it would look more like either 1,1,1,1,1, or 1,1,1,2,1; note the single base, which makes things simpler, and squares the number of guaranteed immediate offset choices for a given bit size (so, if you have 4 bits total, you can have 16 choices guaranteed instead of 4 guaranteed). If pointers are 64-bits, though, then this means that int32s take up twice as much space as they need to. Otoh maybe that's good anyways for alignment reasons. This is also significantly simpler. And, with 1,1,1,2,1, some implementations could do like WASM and have 32-bit VM pointers even if the underlying hardware has 64-bit pointers.

so now i'm leaning towards stack slots (one base), with 1,1,1,2,1.

in some cases, when you expect an offset into a homogenous array but different arrays can have items of different sizes, you can have an offset and a scale factor. This is more efficient to encode than a linear combination of many things b/c the scale factor only appears once and only needs 2 or 3 bits.
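a rough sketch of the offset-plus-scale idea in Python (the 2-bit scale encoding is illustrative):

    SCALES = {0: 'INT8_SIZE', 1: 'INT32_SIZE', 2: 'INT64_SIZE', 3: 'PTR_SIZE'}

    def element_addr(base, index, scale_code, sizes):
        # homogeneous array: one scale factor, encoded once, times a plain index
        return base + index * sizes[SCALES[scale_code]]

    sizes = {'INT8_SIZE': 1, 'INT32_SIZE': 4, 'INT64_SIZE': 8, 'PTR_SIZE': 8}
    print(hex(element_addr(0x1000, 5, 3, sizes)))   # 0x1028: the 6th pointer in an array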

now i'm leaning towards one of solutions #4 (2 kinds of slots with int32,ptr), #8 (2 kinds of slots with int8,ptr), or #3 (stack slots) with 1,1,1,2,1. Note that with #3:1,1,1,2,1 there should be a sinfo defining max(PTR_SIZE, INT32_SIZE).

  1. #8 is in a way more flexible because the program can just use offsets to talk about single memory locations if it wishes (with a reduced range of 4 instead of 16; otoh if it needs to jump over ptrs it can do so).

Note that even with #4, #8, #3, the program doesn't have to use offsets, so it doesn't have to use stack slots. Programs that wish to be more frugal with memory can do so, and just not use offsets.

I guess, pragmatically, #4 is better than #8 because realistically we will have INT32s more often than INT8s and INT16s (and when int8s and int16s are present, offsets don't have to be used). So the contest is between #3:1,1,1,2,1 and #4. In typical situations (both hardware and HLL), #3:1,1,1,2,1 in practice means that the slot size is the size of one PTR. So on most platforms the practical difference between #3:1,1,1,2,1 and #4 is that #3:1,1,1,2,1 can express ranges of 0 to 15 PTRs (and the program is free to use INT32s in PTR-sized slots), whereas #4 can express ranges of 0 to 3 PTRs, intermixed with 0 to 3 INT32s.
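To make that comparison concrete, here is an assumed decoding of a 4-bit offset field under each scheme (the 2+2 bit split for #4 is my illustration of "0 to 3 PTRs intermixed with 0 to 3 INT32s"; the helper names are hypothetical):

    #include <stdint.h>

    /* #3:1,1,1,2,1 -- all 4 bits count slots of size max(INT32_SIZE, PTR_SIZE). */
    static uintptr_t decode_offset_3(uint8_t field, uintptr_t slot_size) {
        return (uintptr_t)(field & 0xF) * slot_size;       /* reaches 0..15 slots */
    }

    /* #4 -- 2 bits count INT32s, 2 bits count PTRs. */
    static uintptr_t decode_offset_4(uint8_t field, uintptr_t int32_size, uintptr_t ptr_size) {
        unsigned n_int32 = field & 0x3;                    /* 0..3 int32s */
        unsigned n_ptr   = (field >> 2) & 0x3;             /* 0..3 ptrs   */
        return n_int32 * int32_size + n_ptr * ptr_size;
    }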

 (wrong:
  well actually, it's not correct to say that no one will be required to use stack slots if we do this, because the Boot calling convention mandates the use of the ROC, and stack slots would be used to index into the ROC. Is there a better way?
  Can/should we get rid of the ROC and just let advanced implementations cache it? Perhaps keep the semantic non-guarantees about whether stack accesses show up in memory / what reads and writes see when offsetting from SP, keep the flushes, but get rid of the requirement to use the special ROC load/store instructions?
  actually i don't think it's a problem; for the ROC you could provide multiple types of load/store instructions that index in different ways. As long as one of them indexes using single memory locations, you're good. So it's incorrect that this would force anyone to use stack slots.
  )

leaning towards #4. This is all just an optimization anyways, the simplest compilers won't even use the offsets.

---

another consideration related to the previous is that we want programs to be able to choose how they represent stuff on the stack, while still potentially being able to use offsets and stack ops.

So let's consider supporting the following choices of size basis:

  int8
  int16
  int32
  2*int32
  max(int32, ptr)
  max(int64, ptr)
  int64
  ptr

We can only choose one of these for offsets; can stack ops still support the others? Yes, because we can have multiple sets of stack op instructions, one for each combination of sizes (although this would waste a lot of opcodes).

On different systems, the sizes of int8, int16, int32, 2*int32, max(int32, ptr), max(int64, ptr), int64, ptr are all different quantities, so you might really need many different stack ops. Although, i guess i only see 8 items there (so 3 bits can select one), which means we can instead have an immediate argument to the stack op that specifies two sizes.
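A sketch of what such an immediate might look like -- with 8 candidate sizes, two 3-bit selectors (e.g. for a SWAP of two differently sized items) fit in 6 bits. The encoding and the names here are assumptions for illustration, not anything specified:

    #include <stdint.h>

    enum size_sel {                       /* 8 candidates -> 3 bits each */
        S_INT8, S_INT16, S_INT32, S_2INT32,
        S_MAX_I32_PTR, S_MAX_I64_PTR, S_INT64, S_PTR
    };

    /* 6-bit immediate naming the size of the top item and the item under it. */
    static inline uint8_t swap_imm(enum size_sel top, enum size_sel under) {
        return (uint8_t)(((unsigned)under << 3) | (unsigned)top);
    }
    /* e.g. swap_imm(S_INT32, S_PTR) describes swapping an int32-sized item with
       the ptr-sized item beneath it. */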

however all this (including the prospect of having a zillion stack op variants; somehow JVM gets away with less? see https://stackoverflow.com/questions/57398474/is-there-a-way-to-swap-long-or-double-and-reference-values-on-jvm-stack ) makes me think that perhaps we should KISS and just pad everything to max(int64, ptr). If people don't want to do that, fine, they don't have to, but they won't be able to use the stack ops or the offsets.

or, we could just say that stack slots are always the size of PTR (the presumptive 'word size'). Reject 16-bit systems, guarantee that ptrs are at least 32-bits, and make 64-bit quantities take up 2 stack slots. Using this system we would go back to a simple displacement (16 choices of # of stack slots) and we would need 2 variants of stack ops (one for single stack slot, one for double). I guess this is actually just the same as #3:1,1,1,2,1, except i guess i'm realizing there is a distinction between guaranteeing that PTR is at least 32-bits, and saying stack slots are max(int32, PTR). Maybe the latter is better...

leaning towards #3:1,1,1,2,1 (stack slots of size max(INT32_SIZE, PTR_SIZE))

---

tangentially relevant to the previous but very good blog post (gives methods of dealing with AArch64's requirement that SP be 16-byte aligned when implementing VMs that expect to be able to push or pop single registers): https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/using-the-stack-in-aarch64-implementing-push-and-pop
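One approach along those lines is to stop using the hardware sp for the VM's value stack altogether; the following is my own C-level caricature of that "use a different register as the stack pointer" idea, not code from the post:

    #include <stdint.h>

    /* The VM's value stack lives in ordinary memory and is bumped one word at a
       time through its own pointer; the hardware sp stays 16-byte aligned and is
       only adjusted in aligned chunks for native frames. */
    static uint64_t vm_stack[1024];
    static uint64_t *vm_sp = vm_stack + 1024;   /* grows downward */

    static inline void     vm_push(uint64_t v) { *--vm_sp = v; }
    static inline uint64_t vm_pop(void)        { return *vm_sp++; }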

---

yknow, stack slots just aren't good enough for displacement, b/c displacement should be usable to index into structs, and allowing only structs whose fields are all one fixed slot size is kind of crippling. So, in displacements, i think we really need at least INT32_SIZE and PTR_SIZE (solution #4). So should we just do that on the stack also? Probably

leaning towards #4

---

this (#4) seems overly complicated, but the alternatives seem to be fixed-size slots everywhere, or not using offsets, or even more complicated stuff like lexically-scoped struct definitions, or giving up portability to some extent.

can't we just give up portability a little? NO, because then Boot is really only an intermediate language, not suitable as a source language (you want a human to manually rewrite and separately maintain the entire Boot program for each port?) -- and for bootstrapping, we need a source language.

any individual program could give up displacement offsets, however. And it could give up on the ROC addressing modes and just use the ROC instructions directly (a single program can't totally give up on the ROC unless it doesn't need to be compliant with the Boot calling convention, but that's okay, because it can use the ROC instructions directly instead of the stack cache and frame cache addressing modes).

so:

this suggests that something similar to #4 is pretty good. There are a few choices. Consider a 4-bit offset field:

So even supporting any 16-bit offsets at all comes at a significant price to the number of popular 32-bit offsets you can support.

For example, the first choice on the above list can support at least one offset of any size up to 32-bit, but can only support 1 offset of the most popular sizes, and can't support any 64-bit offset. The last choice maximizes the most popular sizes (3 instead of 1), and can support one 64-bit offset, but can't support smaller sizes at all.

It's clear that with any of these choices, few programs will be able to use only these offsets -- most structs will require larger offsets. This is in contrast to the fixed slot-size world, in which case 4 bits gives you 16 slots, which is more than enough for most uses.

If you had an 8-bit offset field, it would be easier to support a variety of field sizes while still supporting multiple offsets in every size:

Here, we can support every size and still get 7 of the most popular sizes.

What about the 6 bits that we'll probably actually have?

This could probably address into a sizable minority, or perhaps majority, of structs.

Given that it is becoming clear that these in-instruction addressing-mode offsets and displacements will not be able to handle a lot of common use cases, perhaps we should optimize for the common case (32-bit ints and pointers) rather than for comprehensiveness. Also, C pads structures at the end so that the total size of the struct is a multiple of its alignment (the alignment of its most strictly aligned member) -- since this will often be at least 32-bit alignment, this is even more reason for wanting more 32-bit range.
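That tail-padding behavior is easy to check in standard C (any common ABI should print the same numbers):

    #include <stdint.h>
    #include <stdio.h>

    struct s { int32_t a; char b; };   /* most strictly aligned member: int32_t */

    int main(void) {
        /* typically prints "8 4": 3 bytes of tail padding make the struct's size
           a multiple of its alignment. */
        printf("%zu %zu\n", sizeof(struct s), _Alignof(struct s));
        return 0;
    }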

So this is a long-winded analysis that just concludes what we had already started with; INT32_SIZE is the best integer size basis, and it should have half the bits.

So, for 4-bit offsets:

and for 6-bit displacement:

And we resign ourselves to the fact that these within-instruction addressing-mode offsets/displacements can only be used in some cases, as an optimization (which can be added in a peephole pass), rather than as the main facility for indexing into structs.

This does have a conceptual coherency to it, though. The main thing that you can't get around, for portability, is that we don't know the pointer size. So, we must have at least two bases/fields, one for pointer sizes and one for integer sizes. And our 'default' integer size is 32-bit, because it's the first one large enough to not make the programmer think about it in most cases.

---

should we guarantee that a PTR_SIZE location can fit an INT32, i.e. PTR_SIZE >= INT32_SIZE (in Boot, not in Boots)? I think so.

---

in general, i think we should move towards tailoring this stuff, efficiency-wise, for hardware platform sizes (e.g. INT32_SIZE is 4, not 1). The HLL sizes have to exist for the sake of interop with native data structures. But if a Boot program can better utilize offsets and displacement instructions by assuming in its own internal data structures that INT32s take up 4 memory locations, maybe do that. So, for example, if we only provide INT32_SIZE as a basis within stack offsets and displacements, then programs must multiply that by two to guarantee enough space for an int64; this is wasteful on HLLs, where INT32_SIZE = INT64_SIZE = 1, but that's okay b/c it makes sense on hardware platforms, which is where you care the most about efficiency anyhow.
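A tiny worked example of that tradeoff (the concrete sizes are assumptions for illustration):

    /* Reserving int64 space using only the INT32_SIZE basis means 2 * INT32_SIZE:
         hardware target: INT32_SIZE = 4 locations (bytes) -> 8, exactly right
         HLL target:      INT32_SIZE = 1 cell              -> 2 cells reserved,
                          even though an int64 also fits in 1 cell there. */
    #define HW_INT32_SIZE  4
    #define HLL_INT32_SIZE 1

    enum {
        hw_int64_reservation  = 2 * HW_INT32_SIZE,   /* 8 */
        hll_int64_reservation = 2 * HLL_INT32_SIZE   /* 2, 1 cell wasted */
    };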

---

while working on the Boot addressing modes:

or, at least, maybe most of this fancy stuff for code density (like addressing modes) should be split off and go into a Boot extension. People writing tooling targeting Boot want to know about instructions like 64-bit int and float arithmetic, and various comparison and conversion ops. They might not want to know about this code-density stuff (which will probably just be added transparently by peephole optimizers in most cases).

otoh each of these little complexities seems to be orthogonal to the others, and people targeting Boot can just ignore them if they want.

---

Consider label size distinct from pointer size

later: no, i think we already have too many size types; as we recently saw, that makes it hard to encode offsets into small numbers of bits

---

Does WASM have a limit on the number of arguments in a function call?

not really -- arguments are just values on the operand stack, and the core spec doesn't impose a limit (embeddings do set implementation limits, though; the JS embedding caps function parameters at 1000, iirc)