proj-oot-ootAssemblyNotes29

ok i'm trying to understand which calling convention we should have, and tail calls, etc.

i've read two things:

ok, on the callee cleanup:

perhaps the 'share the same stack frame layout' idea is at fault. Imagine that C takes more arguments than B. So eg to clean up B's stack frame, maybe you have to bump up the stack pointer by 10, but to clean up C's stack frame, you have to bump it up by 20. Now if control is returned directly from C to A (because B tail-called C), and if A is supposed to do cleanup, all A knows is that it called B, so it'll bump up the stack pointer by 10, which is wrong, because the size-20 stack frame is the one in effect, due to the call to C.

ok so this makes the vararg thing almost but not quite make sense, i think. if A decides to call B with a size 30 stack this time, and then B calls C, and we have callee cleanup, then C cleans up its own stack and jumps back to A, but no one cleans up B's stack, because B was supposed to clean that up but B got skipped.

i think we may need a bigger example to demonstrate. A->B->C->D. All in tail position, so we want to return directly to A. B->C is a vararg call. So the two ppl in a position to know about the size of the vararg stack frame are B and C, but both of them got skipped on the return.

this still doesn't quite make sense though, because what is preventing C from cleaning up the vararg stack frame before the call to D?

https://stackoverflow.com/questions/35626835/make-a-variable-argument-function-callee-cleanup maybe sheds some light here. it seems like the convention is to put the return address last (at the top of the stack/bottom of memory), below the arguments. So maybe you don't want to bump up the stack pointer before returning because then the return address is in the de-allocated-ish area below the stack pointer. And i guess the compiler can't be bothered to just manually move the return address up, overwriting earlier arguments. There is apparently an instruction in x86, 'RET #x', with #x immediate, that reads the return address, moves the stack pointer up by x, and then jumps to the return address which was read. But if it's a vararg then x is only known dynamically, not statically, and it's an immediate, so it doesn't work.
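to make the 'RET #x' point concrete, here's a rough sketch in Python (simulating a downward-growing stack as a list plus an index; the slot sizes and numbers are made up, just to illustrate why a static immediate can't clean up a dynamically-sized vararg area):

    # minimal sketch of an x86-style 'RET imm' with callee cleanup.
    # memory[sp] is the top of the stack; pushing decrements sp.
    def ret_n(memory, sp, imm):
        ret_addr = memory[sp]   # return address sits below the arguments
        sp += 1                 # pop the return address
        sp += imm               # pop 'imm' argument slots; imm is a static immediate
        return ret_addr, sp     # jump to ret_addr with the caller's args cleaned up

    memory = [None] * 16
    sp = 10
    memory[sp] = 0x4000         # return address
    memory[sp + 1] = 1          # arg0
    memory[sp + 2] = 2          # arg1
    print(ret_n(memory, sp, 2)) # (16384, 13) -- works only because '2' was known statically

    # for a vararg callee the argument count is only known at run time, so there is
    # no single immediate to bake into the RET instruction.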

so that last bit about the vararg sounds like it's something specific to limitations in the x86 convention, conveniences in the instruction set, and the way that compilers (and probably branch predictors) insist on using that.

more about vararg stuff [2]

so it sounds like callee cleanup is the way to go.

other notes i found while reading about this:

"The fewer times you modify the value of esp explicitly, the better. There is more going on in these instructions than meets the eye." -- https://cboard.cprogramming.com/tech-board/117965-callee-cleanup-variable-number-parameters.html - that's nice but it sounds fancy -- and if we're doing fancy stuff i'm betting tail call elimination is more important

strangely https://www.agner.org/optimize/calling_conventions.pdf doesn't have much to say about this.

https://github.com/microsoft/ChakraCore/issues/796 has a good point about another way to do varargs with TCO: with caller-cleanup, if A->B->C->D and B->C is vararg and D skips over B right back to A when returning, it's true that A doesn't know how much stack B allocated to hold the varargs, but A can still make things right just by remembering what its stack pointer was before the call (that is, treating SP as caller-save instead of callee-save) -- of course since it's the stack pointer itself that is being lost, it has to save it in some other callee-save register, rather than on the stack. They don't like doing this because it uses up another callee-save register on each such function call (so one less register, and also all those saves and restores). They seem to think a better alternative would be callee-save, but they have a lot of vararg code that would need to be rewritten or something. Sounds like they decided to do nothing in the end (not support TCO).
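here's a rough sketch of that caller-saves-SP idea in Python (stack as an index into a list; the frame sizes and names are invented, just to show why A doesn't need to know how much vararg space got allocated):

    # A saves its SP before the call; when control eventually comes back -- possibly
    # directly from D, skipping B and C via tail calls -- A restores SP itself.
    def call_chain(memory, sp):
        saved_sp = sp      # A: save SP (conceptually, in a callee-save register)
        sp -= 5            # A pushes args for B
        sp -= 7            # B pushes a dynamically-sized vararg area for C
        # ... C tail-calls D, D returns directly to A without any cleanup ...
        sp = saved_sp      # A: restore SP; all the intermediate frames vanish at once
        return sp

    print(call_chain([0] * 64, 32))   # 32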

Q: "Everywhere it says that the called function doesn't know how many parameters have been passed, but in that case why isn't it possible to simply put that data into a register or push it onto the top of the stack for the called function to use?"

A: " There is no fundamental reason why the callee couldn't clean up space for variables. In most architectures, the standard calling convention doesn't handle variables this way, but that doesn't mean it's not possible to do so. For variable-length argument lists, you could either pass data about the number of arguments as a hidden parameter (like how this is handled in many object-oriented languages), or put a pointer on the stack that shows where the arguments end, etc. " -- https://stackoverflow.com/questions/19822045/what-prevents-the-callee-from-cleaning-up-the-stack

"The callee can clean variable arguments from the stack. I made it once, actually. But the code is pretty big.

Anyway, the main reason why in cdecl convention, the caller cleans the stack is other ...On some architectures (usually very small, as the old 8080 or 6800), there is no ret n instruction that to automate the stack cleaning and as a rule they can't make arithmetics with the stack pointer as well.

So, the callee have to first pop the return address from the stack in order to reach the arguments, then to pop all arguments and then push back the return address....When cdecl convention is used 2 instructions and one register use are spared:...And because for single language it is better to use single calling convention on all platforms, the CCALL as more simple and universal looks better. (C language is created in times when 6800 was a high tech).

But notice, that on these platforms, the assembly programs and the native languages (for example different kind of BASICs) usually use register argument passing which is of course much more faster on such small systems. " -- https://stackoverflow.com/questions/19822045/what-prevents-the-callee-from-cleaning-up-the-stack

some languages put (RAIIish?) destructors in their (caller?) cleanup routines, preventing TCO:

" Many Perl idioms rely on strict stack discipline. For example, implementing cleanup code as the destructor of an object held in a lexical variable in a particular stack frame. The result is that implicit tail-call optimisation would break all sorts of things. Hence tail-call semantics are only safe to invoke *explicitly*, which is what Sub::Call::Tail provides. " [7]

---

it sounds to me like callee cleanup is the way to go. TCO may be important, varargs probably not (can always store those in memory).

---

next question: store the first few regs in registers in Boot calling convention?

pro: faster, if regs are really regs

pro: this allows Boot syscalls with few arguments to avoid hitting 'main memory' (until we're in the actual implementation of the call)

con: in LOVM, seems cleaner to put args in smallstacks. That way they can be shifted similarly to 'input argument' and 'output argument' register window sections. if we have the first few args in registers and the rest in smallstacks, that's confusing

my leaning: at first i was leaning to not use regs at all (just stack) for a cleaner transition to lovm, but it galls me to have to hit memory just to do a syscall. So now i'm leaning towards using regs. I kinda think there might be a way to make the register window thing work cleanly anyway with the regs.. after all if you need to preserve the input args while you are writing the output args you'd just move them onto the stack right? So mb it's not that much worse. So i kinda think the thing to do is try using regs and reconsider if necessary after implementing x86_64 assembly version of Boot and LOVM (in the far far distant future).

decision: for now, use regs

---

next question: how many regs?

so, should we pass 2 or 3 args in registers?

if 3, then no space for a temp, and maybe we should add instructions: push-int-to-SP-and-decrement-SP and push-ptr-to-SP-and-decrement-SP

or at least add back in:

because currently, without a free temp register, in Boot, you can't even push to stack (you can't predecrement the SP without loading an immediate into a register)

otoh mb most of the syscalls we want to make can use only 2 (remember we have 2 banks, so 4 spots in total, so as long as both integers and ptrs are being passed, you get more than 2 arguments), so maybe it's okay to just pass 2 args in registers and leave a scratch register.

also, it sucks to add appm and ap32m in Boot, b/c (a) you may want to add two versions of them b/c we don't have signed immediates in oot, and (b) in LOVM they are a duplication because we can have a general instruction like this with both an immediate and register addr mode for op2.

regarding (a), well i guess an upward-growing stack is POSTincrement, which isn't a problem (copy the argument to the stack, then overwrite that register with the immediate, then add that immediate to the stack pointer (postincrement); now you have one free register); it's the predecrement convention in the downward-growing stack that's a problem. So you don't really need to add one version for each direction, even if the immediate imm7 is unsigned. So it's only 2 more instructions, and hey maybe they will be useful anyways b/c they will have a 7-bit unsigned immediate range rather than a 6-bit signed immediate range (so 5-bit unsigned).
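here's a rough sketch of that ordering argument in Python (registers as a dict, memory as a list; the register names and slot size are made up):

    # post-increment push on an upward-growing stack with no spare temp register:
    regs = {'sp': 8, 'r1': 42}   # r1 holds the value we want to push
    mem = [0] * 16

    mem[regs['sp']] = regs['r1'] # 1. store r1 at [sp] -- the value is now safe, r1 is free
    regs['r1'] = 1               # 2. reuse r1 to hold the immediate slot size
    regs['sp'] += regs['r1']     # 3. post-increment: bump sp *after* the store

    # with a pre-decrement (downward-growing) convention, sp has to move *before*
    # the store, so the immediate needs a register while r1 still holds the value --
    # which is exactly the temp register we don't have.
    print(regs, mem[8])          # {'sp': 9, 'r1': 1} 42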

so far i've looked at a few syscalls tho and none really need more than 2 args of one type for the most basic functionality:

open, read, malloc

i'm leaning towards just using 2 regs for arguments:

decision: 2 regs for now

---

next question (which arose while researching the previous one):

what to do when, depending on platform, APIs might return an integer or a pointer?

For example, on Linux, the open() syscall returns an integer file descriptor. On Windows, the CreateFileA syscall returns a ('file'?) handle, and the (older?) winbase.h OpenFile also returns a file handle.

https://stackoverflow.com/questions/902967/what-is-a-windows-handle

https://www.codeproject.com/Articles/1044/A-Handy-Guide-To-Handling-Handles

what does windows API mean by 'handle'? I like Dan Moulding's stackoverflow answer:

" A HANDLE is a context-specific unique identifier. By context-specific, I mean that a handle obtained from one context cannot necessarily be used in any other aribtrary context that also works on HANDLEs.

For example, GetModuleHandle returns a unique identifier to a currently loaded module. The returned handle can be used in other functions that accept module handles. It cannot be given to functions that require other types of handles. For example, you couldn't give a handle returned from GetModuleHandle to HeapDestroy and expect it to do something sensible.

The HANDLE itself is just an integral type. Usually, but not necessarily, it is a pointer to some underlying type or memory location. For example, the HANDLE returned by GetModuleHandle is actually a pointer to the base virtual memory address of the module. But there is no rule stating that handles must be pointers. A handle could also just be a simple integer (which could possibly be used by some Win32 API as an index into an array). "

This resurfaces the motivation for the ANY type registers that i had in there awhile ago but removed. Here we have something that:

interestingly, we already have something like this: codeptrs. If you think about it, in an interpreter a codeptr could be represented as an integer index into a list of instructions, rather than as an actual pointer into memory (with 'instructions' conceptually residing in a special 'code memory' which is not part of or accessible from data memory).
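a rough sketch of that in Python (the instruction format here is invented, just to show that a codeptr can be a plain integer index into a separate 'code memory'):

    code = [
        ('loadi', 'r1', 5),
        ('loadi', 'r2', 7),
        ('add',   'r3', 'r1', 'r2'),
        ('halt',),
    ]

    pc = 0            # a codeptr: just an index, never a data-memory address
    regs = {}
    while code[pc][0] != 'halt':
        op = code[pc]
        if op[0] == 'loadi':
            regs[op[1]] = op[2]
        elif op[0] == 'add':
            regs[op[1]] = regs[op[2]] + regs[op[3]]
        pc += 1       # codeptr arithmetic never touches data memory
    print(regs['r3']) # 12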

so should we (re)create a third register bank for these guys (what i used to call ANY type)? it might be sort of nice for the implementation to distinguish between these guys and dereferencable pointers: for garbage collection purposes. These guys could be 'boxed' with additional bits that say whether they are really integers or really pointers.

Disadvantages:

Advantages:

out of these, i think the most important considerations are:

so the proposal is:

i'm leaning towards this.

---

we may want to introduce alignment restrictions to prevent the problem noted above about another thread seeing an invalid pointer in the middle of a write? otoh if each thread has its own GC then the GC and the implementation can have a mutex for when the implementation main thread is writing to memory. if ptrs are opaque, how to align? mb would need an instruction for that

done

the stack also has alignment restrictions on some targets (eg 16-byte alignment on x86, at least on x64 windows)

i think that's an ABI thing though, so i guess we can ignore it in Boot (where we have Boot's own convention, rather than trying to interop with the underlying platform convention). And in LOVM we can abstract it.

---

what do garbage collectors do about pointers that may be invalid (e.g. a C pointer that points past the end of an array)?

https://www.hboehm.info/gc/04tutorial.pdf says

" Pointer validity check

elsewhere in those slides it's clear that memory allocation is done thru the GC, so the GC is aware of where the valid memory is

"Unlike [BoehmChase?92] we start with the assumptionthat the garbage collector recognizes all pointers to theinterior of an object, not just to the first byte of the object.Recent experience suggests that this is the right framework,particularly for typical C++ implementations whichimplicitly generate pointers to the interior of an object. Thetechniques of [Boehm93] can greatly reduce the danger ofspace leakage that we previously associated with thisapproach. This new assumption greatly simplifies our task.Unfortunately, it also usually invalidates assumption (A) of[BoehmChase?92], so our correctness argument has to bedifferent

...

We assume that the garbage collector recognizes any address corresponding to some place inside a heap allocated object as a valid pointer. ([Boehm95] satisfies these assumptions in its default configuration"

[8] "Simple Garbage-Collector-SafetyHans?-J. Boehm"

so it looks like it actually checks potential ptrs to see if they are in valid memory before it derefs them. oh man what a pain (but oh man so amazing)

---

"OurMagpietool? converts C source programs to cooperatewith precise GC ... Some constraints on the C program depend on the specific GCimplementation that is used with Magpie-converted code. The GCmay supportinterior pointers, i.e., pointers into the middle of a GC-allocated object, including to the end of the object. In practice, wefind that interior pointers are common in real C programs, whileprograms specifically adjusted to work with precise GC (such asthe PLT Scheme runtime system) can avoid them to help improveGC performance. Note that a pointer just past the end of an arrayis legal according to the C standard, and Magpie considers suchpointers to be valid interior pointers " [9]

---

also, mb i should allow <= comparison on pointer values. I didn't use to allow it because it may not make sense. But there are times (e.g. sort a list of pointers so that you can do binary search to determine the presence of a pointer in that list) when you really want to compare pointers, and you don't care if the comparison has any other meaning aside from being a total ordering.
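a rough sketch of that use case in Python (object ids standing in for opaque pointer values; the ordering has no meaning other than being total):

    import bisect

    objs = [object() for _ in range(100)]
    table = sorted(id(o) for o in objs)   # sort by the arbitrary-but-total ordering

    def contains(ptr_key):
        i = bisect.bisect_left(table, ptr_key)
        return i < len(table) and table[i] == ptr_key

    print(contains(id(objs[3])))    # True
    print(contains(id(object())))   # False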

what do other languages with opaque pointers/no pointer arithmetic do about pointer comparison? e.g. what does Golang do?

golang allows equality comparison only, but not greater-than, less-than: https://stackoverflow.com/questions/45818964/how-can-i-compare-pointers-in-go

---

later i decided to remove register/smallstack banking again -- see discussion in [[ootBootNotes1]].

---

so what about unified register files with floating point as ordinary gprs?

https://stackoverflow.com/questions/51471978/is-there-any-architecture-that-uses-the-same-register-space-for-scalar-integer-a

sounds contentious but seems like today's best practice is to bank them, eg separate floating point

also most mechanical sympathy on popular archs is to bank fp

so bank the fp regs, but not the smallstacks (b/c fps on the stack are in the same linear memory substacks as everything else). btw https://stackoverflow.com/a/52175510/171761 suggests having more caller-save registers and fewer callee-save regs in the fp bank, unlike his suggestion for the integer bank

--- old

maybe we can get away with having the primary stack be above the stack pointer and the output stack be below the stack pointer, and pushing to the output stack all at once in a calling sequence. Alternately, since the primary stack and output stack capacities are fixed, if you save those capacities in a register then you can offset the beginning of the output stack from the frame pointer.

yeah I think we need a register to hold the frame allocation information in. At least this is cheaper than fixed-size register windows where you always have to push a certain window size onto the stack. Again consider making those allocation quantities powers of two -- then maybe we really can fit in all the variables we may need. I think we do need an extra register for holding the primary stack position. Maybe reserve a register for the PC. I'm still thinking if there's a way for the Boot calling convention to put arguments in registers. I guess one way is just to use the out stack addr mode as registers! That would imply that the recipient, not the caller, is the one who shifts out to local. Also maybe out should be registers, not a stack, even in lovm. Consider however that a nice thing about Forth was that function calling was cheap (eg no callee-saving etc to be done).

Maybe you can simplify all of this by having one stack and specifying how many output args there are when you call the next function

mb don't even let the program read the stack pointer when the stack is not flushed; that way we don't have to have an extra register for the tip of the cached stack

The thing is, if it's one stack then that stack is interrupted by the return address and the frame pointer and any other callee saves, local variables, or other junk we have in the stack frame. So that makes it inefficient to translate a stack position to an address. If it's two stacks then it's inefficient to push more to the argument input stack, because directly after that is the stuff in our stack frame. This suggests that the argument input stack should be mapped to registers, and the only stack should be the out stack

So callee cleanup means that the input arguments are cleaned up too, right? So our output arguments are passed where the input arguments were on the stack?

I think what you're looking for is that out could really be registers. We could rely on the callee to flush out to the stack if needed.

But in that case, how do we use the stack pointer in boot?

I guess maybe if the program never touches the stack pointer then this stuff is abstract

---

i can't think of a good way for the LOVM calling convention to match up with a Boot all-register calling convention without actually moving the smallstack addr modes from LOVM to Boot. In which case we could do this:

caching non-guarantees:

some benefits to this:

done

---

so is there a way to do the previous without actually importing stack cache addr mode semantics into Boot?

The issue is probably that we want to allow the implementation to choose to store the smallstacks in memory, or to store them in regs (or in a memory cache). But if they are in memory then that's visible to Boot. But they may or may not be in memory. This indeterminacy is visible and so must be mentioned in the spec.

Can we just tell Boot to put things on the stack in memory? If we want interoperability (programs mixing Boot and LOVM functions), that's hard for the implementation to detect and map into smallstacks. Imagine if a Boot function called a LOVM leaf function. The LOVM leaf function would not feel a need to flush anything (and we don't want to make it), because it wants to just use the output stack as its arguments, since it doesn't need fresh output stack for its own outgoing arguments. It can just not adjust the frame pointer, and offset its own locals off the stack pointer. But the Boot program put the input arguments onto the stack in memory.

Is there any point keeping LOVM separate from BootX, if we are putting all these complicated semantics into Boot?

There may be another way. It may be hard to detect a write to the stack pointer in general, but if we have a PUSH instruction, or even better a PUSHMEMSTACK instruction, and a POPMEMSTACK instruction, that is easier to detect. The implementation could intercept that instruction and have it write to the output smallstack cache instead. We could have flushStackToMemory and flushMemoryToStacks instructions and specify that until they are called, as above, there are caching non-guarantees. The benefit of these gymnastics is that we maintain our load-store architecture and our address-mode-less architecture in Boot, rather than having these memory-associated locations be targetable by any operand.

why just push and pop? perhaps push, pop, read, and write. In fact, maybe just read and write -- then we can add the requirement that you flush before moving the stack pointer, which may make things even simpler for the implementation, b/c it doesn't have to keep track of stack ptr moves. Then we can drop this requirement for LOVM.

seems like this may make it harder to abstract stack frames later though? but maybe we are saving that for OVM.

i feel like there is a more abstract way to do this. But i can't think of anything that wouldn't be too abstract for Boot.

done (register-associated cache)

---

LOVM is looking more and more similar to Boot. Maybe just merge LOVM and BootX? (could use the name LOVM for something further up the stack later, as previously planned)

---

so i was thinking about calling conventions. I was hoping there was a way to do all of:

i don't think it's possible. In fact, i don't think even the first three are possible simultaneously:

now, the main issue of them having different calling conventions is: what about the syscalls? Boot is supposed to be a subset of LOVM, so the syscalls should work the same in both cases.

which is also the reason we wanted to pass arguments in registers

so, we could just say that Boot passes stuff on the stack, but that means you have to have available stack memory to do a syscall, and also that you have to take the performance hit of hitting memory to do a syscall.

---

how does linux do syscalls on x86?

there are three types: legacy syscalls with the 'int' instruction, 'fast' 32-bit syscalls with the SYSENTER instruction, and 'fast' 64-bit system calls with the SYSCALL instruction.

With legacy syscalls, the syscall number and up to 6 args are passed in registers [10]

with fast 32-bit system calls, the syscall number and up to 5 args are passed in registers [11]

with fast 64-bit system calls, the syscall number and up to 6 args are passed in registers [12]

how does windows nt do syscalls on x86? presumably it uses the same instructions; but user code calls the windows API using ordinary library calling conventions. The Win32 API uses the stdcall calling convention: https://en.wikipedia.org/wiki/X86_calling_conventions#stdcall which passes arguments on the stack, not registers.

not sure if there is a 'win64' API or if it has a different calling convention. https://wiki.freepascal.org/Win64/AMD64_API#Calling_convention suggests that it does, and that this one passes the first 4 params in regs. https://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions says the same. Although the win64 calling convention passes the first 4 args in registers, it also requires that 32 bytes of 'shadow space' be allocated for them on the stack.

so my conclusion is, yes, there does seem to be a need to pass syscall args in registers

---

so i think we're back to what i was planning above:

my idea about a loadwithcache, storewithcache for every register, not just SP -- altho in Boot, to save operands (so that we could go back to a 16-bit Boot if we wanted), only require the implementation to accept these with register SP (so that they can be elided to their two-operand forms).

done (register-associated cache)

---

just some random thoughts on bitwise data manipulation after a dream i had

this first paragraph is just me writing down whatever nonsense was in my head right after waking up: Predication with two lanes (so each bit can specify a lane). Forwards and backwards in time? Mb not. Each instruction has an and and an or. Src/dst? Arc?. Not sure if this is right but 2 lanes * (src and or dst) where each of src dst and or is 4 bits would be 32 bits. Quantum computing with all qubits restricted to the standard basis?? Quantum computing with just one qubit??

Note that a qubit has two degrees of freedom. I think 16 bits can specify one 2-qubit quantum gate (64 bits would be needed to specify one three qubit quantum gate I think). I think four bits can specify one 1-qubit quantum gate (so can't achieve quantum universality this way). Note that in actual quantum computing The elements of the gate matrix are real numbers not binary, so to reiterate, we are only talking about a subset of this.

In addition the numbers in the matrix and quantum computing can be complex. And in addition the matrix must be unitary. So my accounts above seem not quite correct. What if we restrict to 1 0 i -i, and unitary? How many bits does it take to specify a one wire quantum gate?

Mb for each bit input (wire input) you can have 4 choices: f t x -x.

A classical gate on bits with two inputs and two outputs has four input conditions and for each of those specifies one of four possible output conditions. So: 4^4 possible gates (and the table consists of four pairs of two output bits each, so eight bits).

So to specify a gate with two bits of input and one bit of output is four bits. So if we have two predicate bits then we can specify any combined predication with four bits, and we can specify any gate on the bits themselves with 8 bits.

Src and dst could be registers for a conditional move, or they could be a choice of predication bits out of a larger set of predication bits

Okay but moving on from the dream and just thinking rationally: if you have two predication bits p0 p1, what would be some good things to do with that? you probably don't need to predicate each instruction with the full 4-bit comprehensive conditional over those two bits. Instead you might have two bits for each instruction to specify one of: if p0, if p1, if p0 or p1, if p0 and p1. then you might occasionally have an 8-bit predication gate on the predication bits? Not quite right yet probably

I guess if you have 16 predication bits then in 16 bits you can say: specify bit number one, specify bit number two, specify gate. 4 bits + 4 bits + 8 bits for a total of 16 bits.

Another thing you can do with 16-bit instructions and 16 1-bit registers is specify each of two source operands and one destination operand (12 bits total) and then specify one operation in 4 bits.

And remember that the original idea with two lanes was that each instruction has two stop bits which determine whether it can be executed in parallel with the previous instruction in that lane (an instruction with both stop bits set waits on both lanes, so it synchronizes)

Other fun things that you can do with a single register of many bits is permute the bits, or permute-with-replacement. either of those takes many bits to specify the permutation though; for a 16-bit register, permute with replacement means making 16 choices 16 times, so 4*16 = 64 bits. Since we're certainly not going to have instructions greater than 64 bits at this level, that suggests that a 16-bit bit register is too many bits. What about a 4-bit bit register? Permute with replacement is now four choices four times, which is 8 bits. And the two-wire gate with source and destination operands out of four choices requires 8 bits for the gate and 4 bits for the operands for a total of 12 bits. A three-wire gate with two inputs and one output, and four bit registers, requires four bits for the gate and six bits for the operands for a total of 10 bits. so now we're getting to The sorts of instructions that we can comfortably fit in 16 bits.

Regarding the stop bits and the two lanes: if you apply this to all instructions without other constraints, you are providing redundant information, because all you are doing is specifying dependencies between instructions, and the data dependencies already specify the necessary dependencies. But the time this would really be helpful would be in memory operations, because with memory operations you might want to specify ordering dependencies between operations on different memory locations, and that is not implied by the instructions themselves.

And now let's look at if we only have two bit registers. now we don't need permutation because the 2-bit (2-wire) gate operations contain this, nor do we need three operand instructions because again the 2-wire gate operations do everything. So in fact all we have to consider is the 8 bits to specify a two-wire gate. Furthermore to specify any function on these two bits with a one bit output takes four bits (one bit for each possible input condition). For predication however it really seems like there's some operations which are more useful than others. If bit 0, if bit 1, if bit 0 and bit 1, if bit zero or bit one, if true, if not bit 0, if not bit 1, ??. So three bit predication would be quite useful here, Even though it would take four bits to be complete. And of course two bit predication would be pretty good too: always, If bit 0, if bit 1, if bit 0 and bit 1. Or how about: always, If bit 0, if not bit 0, If bit 0 and bit 1. Yes, if you only have two predication-specification bits per instruction, the last one is more useful, because if you have an 'if' conditional that you are trying to express without branching, being able to do something exactly when bit 0 is 0 and also something else exactly when bit 0 is 1 is useful.

Source registers and destination registers with complete gate specification waste bits, because many gates either ignore one or both source operands, or are the same as other gates except that the order of the source operands is switched. It is less redundant to have opcodes and operands, with multiple instruction encodings to support opcodes that don't need both source operands.

Gates make sense, however, if the operands are fixed, like wires.

---

to summarize the most useful ideas from the above:

---

just to review my current ideas for the stack:

---

i think we should reintroduce an LOVM layer. It will be more like WASM in that there are a lot of 'locals', rather than just 32 registers. Then:

---

How does WASM handle limits on the number of function arguments?

There are no limits; the WASM 'call' instruction consumes arguments from the WASM stack, which is afaict unbounded.

---

so what should the encoding format of LOVM be? Completely variable-length bytecode, or 64-bit? Leaning towards 64-bit.

i have 3 ideas for 64-bit:

nice things about 12 operand bits:

nice things about 10 operand bits:

otoh LOVM's goal is to be on the level of WASM/LLVM so maybe it should just be variable-length to do away with the remaining unusual 'limits' on #locals, #fn args, etc.

should LOVM (or its language, Lo) have wren-like efficient classes?

---

ideas about how 8 bits could be used to represent types for polymorphism:

but really we probably can limit polymorphism to having the destination operand type being determined once the two source operand types are determined. So now we can do:

4 bits is enough to encode 16 types which is just enough to fit most languages' unboxed primitive types.

We could also just have one 'type' with 256 choices. Most of the time there needs to be only one type to polymorph on; e.g. ADD uint32 uint32 -> uint32 vs ADD int16 int16 -> int16; e.g. OOP 'receivers'. This lets us leave most of the types undefined as 'custom' in the spec and let the program interpret what they mean.
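a rough sketch of the single-type-operand idea in Python (the type codes and variants here are made up for illustration; undefined codes could be left as 'custom' types for the program to interpret):

    def wrap_signed(v, bits):
        # reinterpret the low 'bits' bits of v as a signed integer
        v &= (1 << bits) - 1
        return v - (1 << bits) if v >> (bits - 1) else v

    ADD_VARIANTS = {
        0: lambda a, b: (a + b) & 0xFFFFFFFF,    # uint32
        1: lambda a, b: wrap_signed(a + b, 16),  # int16
        2: lambda a, b: wrap_signed(a + b, 32),  # int32
    }

    def add(type_code, a, b):
        # one ADD opcode; the type operand selects the polymorphic variant
        return ADD_VARIANTS[type_code](a, b)

    print(add(1, 0x7FFF, 1))      # -32768 (int16 wraparound)
    print(add(0, 0xFFFFFFFF, 1))  # 0 (uint32 wraparound)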

---

in Go, pointers can only be compared even for equality if they are the same type (or the null ptr):

https://go101.org/article/pointer.html

---

it's a good idea to recommend that the sp and fp r.o.c.s be the same size as each other, even though in leaf functions you don't need to use fp at all, because otherwise if the fp cache was smaller than the sp cache, then in non-leaf functions when you do 'cp fp sp' in the prolog, you'd have to spill the excess items from the sp cache to memory, negating the purpose of the r.o.c.s (to give similar performance as storing args in registers).

---

many of the odd things about Boot/BootX directly follow from the desire to make Boot/BootX programs portable over different target platforms, while still allowing for the possibility of performant compiling implementations without a large runtime/memory footprint:

---

with a stack cache, when the stack is pushed, the shallowest item of the stack cache becomes dirty. If there were already dirty items at/near the top of the stack, they get pushed down, so there is a contiguous region of dirty items starting at the shallowest and possibly continuing down some ways. To summarize, when the stack is pushed, the shallowest items become dirty, down to some depth.

when the stack is popped, the deepest items become stale, up to some depth.

So keeping track of this requires enough state to hold two numbers, each of which needs as many bits as it takes to encode the depth of the stack cache.

so if the stack cache is 16 items, that's 2*4 = 8 bits total. And if there are two stack caches, that's 16 bits total.
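a rough sketch of that bookkeeping in Python (the cache size, the exact dirty/stale update rules, and the bit packing are all illustrative, not a spec):

    class StackCache:
        SIZE = 16

        def __init__(self):
            self.items = [None] * self.SIZE
            self.dirty = 0   # items [0, dirty) need flushing down to memory
            self.stale = 0   # items [SIZE - stale, SIZE) need refilling from memory

        def push(self, value):
            self.items.insert(0, value)
            self.items.pop()                           # deepest cached item falls off
            self.dirty = min(self.dirty + 1, self.SIZE)
            self.stale = max(self.stale - 1, 0)

        def pop(self):
            value = self.items.pop(0)
            self.items.append(None)                    # deepest slot is now garbage
            self.dirty = max(self.dirty - 1, 0)
            self.stale = min(self.stale + 1, self.SIZE)
            return value

        def packed_state(self):
            return (self.dirty << 4) | self.stale      # 2*4 = 8 bits for one cache

    c = StackCache()
    c.push(10); c.push(20)
    print(c.pop(), bin(c.packed_state()))   # 20 0b10001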

is doing all that calculation upon stack cache access actually faster than just hitting main memory, which is presumably in L1 cache anyways? https://stackoverflow.com/questions/10274355/cycles-cost-for-l1-cache-hit-vs-register-on-x86 and https://stackoverflow.com/questions/4087280/approximate-cost-to-access-various-caches-and-main-memory https://stackoverflow.com/questions/14504734/cache-or-registers-which-is-faster suggest it might be (these sources suggest about a 3 cycle penalty for reading from l1 cache (that is, in the case of a hit) vs a register, and i bet we can do whatever extra manipulations we need in 3 cycles or less) but it's unclear. But, it should allow optimizers to optimize better, since they can assume that modifications to the stack cache don't need to be visible in main memory before a flush (this allows them to model the stack cache as registers, and model the flushes and reads or writes to all of those registers at once).

note: if you were programming a cache like this, it would be useful to have a predicated-load-from-memory instruction; because if someone reads from the cache, and you want the cache to transparently read from memory when it contains garbage, then you'd ordinarily need to test for staleness and then branch; so you could avoid the branch if you had a non-branching-conditional-read

yknow actually, if there's a 3-cycle penalty for L1 cache hits, then maintaining a 4 item cache is actually a little slower than memory, because e.g. upon a popc, the implementation has to:

cp dest <- cache0
cp cache0 <- cache1
cp cache1 <- cache2
cp cache2 <- cache3
mark cache3 as stale (possibly just load the next value; or possibly multiple quick operations to increment a count packed into a register keeping track of dirty and stale cache stuff; or at least one operation to set a particular bit in that register)

that's at least 5 operations, vs. at least 3 for L1 cache hit.

there's other benefits to stack caching though (easier program optimization b/c can assume that changes to cache aren't visible in memory unless flushed), so it may still be worth it.

however this suggests that a sweet spot for # of items in the stack cache may be 4.

Also, if you had two caches (sp and fp), and for each cache you had one bit for each item to indicate staleness, and one bit for each item to indicate dirtiness, and you had 16 items, then that would be 64 bits. Which suggests that you don't want more than 16 items in the cache, because then that implementation strategy would take up multiple platform registers. And 16 is what the stack cache and frame cache addressing modes can address anyways.

---


i think instead of using memcpy to copy blindly, we should go back to having an opaque type (what to call it? opaque? black-box? hidden? blind? byte? any? probably 'opaque' or 'hidden'). That lets us move the memcpy syscall into BootX.

mb also move 'halt' into BootX? The thing is, Boot programs are going to have to have implementation-defined conventions for argument and result passing anyhow. Otoh, for a compiler or interpreter, you already have an I/O stream, so mb not. Perhaps make 'halt' an instruction, then? I'm trying to avoid defining the availability of the stack. Then again, maybe having a stack is okay.

Doesn't a compiler need 'malloc' anyways? Or at least sbrk, or a data segment register?

one argument for memcpy is that imagine an implementation using capabilities (and perhaps tagged memory) where some words in memory hold special capabilities. But maybe if you load a capability into an ordinary register and then move it into a different register and then store it, it loses its special capability-ness. Maybe you have to do something special to move capabilities in memory. So our memcpy could do that.

eh, i guess it's not so clear. i'll put this aside.

---

yknow, as simple as it is, the 32-bit encoding is just too much work to implement for Boot. For me, it already took some extra time to implement encoding in basm, so imagine how hard it would be for someone who is not comfortable with bitwise operations.

There is not really a way to make it simpler -- the need for the 3 format encoding bits in one place means that either at least one field must be split in an awkward way in between bytes, or we restrict the opcodes, or we restrict one of the operand ranges, or one of the operands gives up addr modes.

we already decided that we don't care if Boot code can interoperate with BootX code because everything is going to be implemented not in Boot but rather in BootS, the Boot subset of BootX, and then compiled from BootS to Boot. Which is why we were able to move the register-offset cache stuff out of Boot into BootX -- we just decide that the way to compile BootS to Boot is the uncached style of r.o.c. implementation (that is, roc instructions accessing the 'cache' just directly access stack memory instead).

so there is not much lost by saying that BootX has a different encoding than Boot. Now we can make it simple.

---

 "
 Iron-code Summary•Section A.2—Use general-purpose registers with a load-store architecture. •Section A.3—Support these addressing modes: displacement (with an address offset size of 12 to 16 bits), immediate (size 8 to 16 bits), and register indirect. •Section A.4—Support these data sizes and types: 8-, 16-, 32-, and 64-bit integers and 64-bit IEEE 754 floating-point numbers. –Now we see 16-bit FP for deep learning in GPU•http://www.nextplatform.com/2016/09/13/nvidia-pushes-deep-learning-inference-new-pascal-gpus/•Section A.5—Support these simple instructions, since they will dominate the number of instructions executed: load, store, add, subtract, move register-register, and shift. •Section A.6—Compare equal, compare not equal, compare less, branch (with a PC-relative address at least 8 bits long), jump, call, and return. •Section A.7—Use fixed instruction encoding if interested in performance, and use variable instruction encoding if interested in code size. •Section A.8—Provide at least 16 general-purpose registers, be sure all addressing modes apply to all data transfer instructions, and aim for a minimalist IS–Often use separate floating-point registers. –The justification is to increase the total number of registers without raising problems in the instruction format or in the speed of the general-purpose register file. This compromise, however, is not orthogonal.
 "
 -- https://passlab.github.io/CSCE513/notes/lecture03_ISA_Principles.pdf

see also slide Top 10 Instructions for 80x86 in https://passlab.github.io/CSCE513/notes/lecture03_ISA_Principles.pdf

(from most to least freq: load, conditional branch, compare, store, add, and, sub, move register-register, call, return)

---

"Most loops go from 0 to n." [13]

---

implemented Boot in Python again on Nov 26 (Thanksgiving) and Nov 27 (commit 39df6da; "pyboot3").

---

is it worth the more complicated thing i thought of that allows 32-bit integers, 64-bit integers, and pointers to have different sizes on the stack?

https://developer.arm.com/documentation/ihi0055/latest/ suggests they might use 64-bit 'stack slots'. https://techpubs.jurassic.nl/manuals/0630/developer/Mpro_n32_ABI/sgi_html/ch02.html uses fixed-size stack slots. Looks like here stack arguments are aligned to 32 bits: https://stackoverflow.com/questions/60959847/why-are-parameters-arranged-this-way-on-the-stack-when-a-function-is-called . Looks like here everything on the stack is 64 bits: http://www.windbg.xyz/windbg/article/111-64bit-Calling-Convention . Looks like here everything on the stack is 32 bits (except 64-bit quantities take up two 'slots'): https://www.tenouk.com/Bufferoverflowc/Bufferoverflow2a.html . This mentions 8-byte stack slots: https://developer.apple.com/documentation/xcode/writing_arm64_code_for_apple_platforms . https://www.lri.fr/~filliatr/ens/compil/x86-64.pdf says that "The pushq and popq combine a move with an adjustment to %rsp. Note that the stack should stay 8-byte aligned at all times".

So it sounds like everyone else just has uniform-sized stack slots that are the same as the machine word size, even in native calling conventions. So maybe i should do that too.

mb i should also require that any Boot object size larger than the word size should be a multiple of the word size?

---

on using C# for low-level stuff: "since C# 7.0 or so: Span<T> for example is a revolution; But I use things like stackalloc, ref, fixed, Unsafe.CopyBlockUnaligned, & MemoryMarshal.GetReference etc, on a daily basis to get that extra performance where it matters;"

---

a big issue with tradeoffs is bitwidth, slots, and padding. I'm going to write it down now while i still remember.

The most space-efficient (memory-efficient) thing to do would be to represent each item in memory with the minimal size needed to hold it. So an 8-bit value would occupy 1 byte, a 16-bit value occupies 2 bytes, a 32-bit value occupies 4 bytes, a 64-bit value occupies 8 bytes, and a pointer occupies the size of a pointer (usually 8 or 4 bytes).

However, how much the minimal space is depends on the platform. The most common hardware platforms today always have 8/16/32/64-bit integers as 1/2/4/8 bytes, and have pointer sizes of 32- or 64-bit. Even the most common embedded/MCU platform, ARM, has 32-bit pointers. But we are also targeting HLL platforms here. So think of memory as a Python array. We want to be able to put an arbitrary Python reference in a cell of "memory". But in this case 1 single cell of memory can also hold an entire 64-bit value (or a 32-, 16-, or 8-bit one). So here the minimum is smaller than 1/2/4/8 cells of memory, and the ratios between 8-, 16-, 32- and 64-bit value storage spaces (the ratios of 16-, 32-, and 64-bit storage spaces to 8-bit storage spaces are all 1 on HLL platforms) differ from the hardware platform case (where those ratios are 2/4/8).

In Boots code we adapt to this by assuming that INT8_SIZE is 1, and storing the constants INT16_SIZE, INT32_SIZE, INT64_SIZE, PTR_SIZE in sinfo. So an interpreted Boots program must do run-time arithmetic to translate statically known memory layout into bytes. A compiled Boots program can do this calculation at compile time (if the compiler is smart enough to propagate constants and do the constant arithmetic); but the key concept is that Boots itself can't represent memory layout in terms of a single unit of bytes, it can only represent it as a linear combination of these many size constants.
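a rough sketch of what that linear-combination arithmetic looks like in Python (the struct and field names are made up; only the size-constant names come from the sinfo idea above):

    LAYOUT = [('flags', 'INT8_SIZE'), ('count', 'INT32_SIZE'), ('next', 'PTR_SIZE')]

    def offset_of(field, sizes):
        off = 0
        for name, size_const in LAYOUT:
            if name == field:
                return off
            off += sizes[size_const]     # run-time arithmetic when interpreted
        raise KeyError(field)

    hw64 = {'INT8_SIZE': 1, 'INT32_SIZE': 4, 'PTR_SIZE': 8}   # 64-bit hardware target
    hll  = {'INT8_SIZE': 1, 'INT32_SIZE': 1, 'PTR_SIZE': 1}   # HLL 'one value per cell' target

    print(offset_of('next', hw64))  # 5
    print(offset_of('next', hll))   # 2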

That's all well and good, but when we get to Boot, for better code density, we'd like to use addressing modes to encode immediate offsets to things in the instruction encoding. For example, we'd like to reference things like: "the 3rd item from the top of the stack", "the fourth item in the struct pointed to by register r10". In other assembly languages, we'd specify these offsets in bytes. But the offset, in bytes, from the start of a struct to the 4th item in that struct depends on the size in bytes of the 1st, 2nd, and 3rd items in that struct -- similarly, the offset, in bytes, of the third item from the top of the stack depends on the size in bytes of the 1st and 2nd items on the stack above it.

So instead we have to specify offsets as linear combinations of those size constants. But now we run into a problem, because in our Boot 32-bit instruction encoding, bits are at a premium. For example, if you have 4 bits to specify an offset, you can specify a choice of 16 byte offsets, but you can't even specify one bit for each of INT8_SIZE, INT16_SIZE, INT32_SIZE, INT64_SIZE, PTR_SIZE -- that would be 5 bits already.

1) One solution to this is to give up on size-efficiency and size-expressivity on some platforms; for example, the VM spec specifies that INT8_SIZE, INT16_SIZE, INT32_SIZE, INT64_SIZE, PTR_SIZE = 1,2,4,8,4. That's great for 32-bit hardware platforms, but on 64-bit hardware platforms, only 4GB of memory can be addressed, and on HLL platforms, everything except INT8_SIZE uses way more memory cells than it needs.

2) Another solution to this is to give up on portability, and require the program to be specialized to a given target, with that target's own set of storage requirements or at least ratios; so for example when targeting a 32-bit hardware machine, the program assumes that INT8_SIZE, INT16_SIZE, INT32_SIZE, INT64_SIZE, PTR_SIZE = 1,2,4,8,4; when targeting a 64-bit hardware machine, it assumes 1,2,4,8,8; and when targeting a HLL, it assumes 1,1,1,1,1.

3) A third solution is to pad small values into 'slots', and express everything in terms of slots. For example, the VM spec specifies that immediate offsets are in units of max(INT8_SIZE, INT16_SIZE, INT32_SIZE, INT64_SIZE, PTR_SIZE). So, for the offsets to be useful, the program must lay out the relevant memory into chunks of the slot-size; when a smaller-size value must be stored there, it needs to be padded to the slot-size. This is really just another way to say solution (1), but with a focus on having larger slots and a lot of padding; e.g. if slots are defined so that everything fits into 1 slot, then that's like specifying INT8_SIZE, INT16_SIZE, INT32_SIZE, INT64_SIZE, PTR_SIZE = 1,1,1,1,1. Or if it's specified that everything fits into 1 slot except for 64-bit integers, that's like specifying INT8_SIZE, INT16_SIZE, INT32_SIZE, INT64_SIZE, PTR_SIZE = 1,1,1,2,1. The key here is that there's only a single base constant, and a bunch of ratios, so you don't need to express immediate offsets in terms of a linear combination.

4) The third solution, slots, can be done in a half-way kind of way too; have multiple constant bases (slot types) but not as many as there could be. For example, instead of specifying offsets as a linear combination of INT8_SIZE, INT16_SIZE, INT32_SIZE, INT64_SIZE, PTR_SIZE, specify them as a linear combination of INT32_SIZE, PTR_SIZE. If we have 4 bits for the offset, now each of these gets 2 bits, so we can specify at least 4 separate offsets in any situation, and in some cases (where the offsets are jumping over both pointers and ints) we can specify up to 8. In some cases, even if you were okay with using non-portable code, this could produce higher code density, because normally a 4-bit quantity would allow you to index at most 16 bytes, and if e.g. PTR_SIZE=8, then that would only be 2 PTRs, but this scheme lets you reach at least 4 PTRs (and at least 4 INT32s, but only 1 INT8)

Note that with this solution there should be a sinfo defining max(PTR_SIZE, INT32_SIZE). Might be good to have that anyways.
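a rough sketch of the solution-4 encoding in Python (the 2+2 bit split and the field order are just one possible choice):

    def encode_offset(n_int32, n_ptr):
        assert 0 <= n_int32 < 4 and 0 <= n_ptr < 4
        return (n_ptr << 2) | n_int32        # 4 bits total in the instruction

    def resolve_offset(encoded, int32_size, ptr_size):
        n_int32 = encoded & 0b11
        n_ptr = (encoded >> 2) & 0b11
        return n_int32 * int32_size + n_ptr * ptr_size

    enc = encode_offset(1, 2)                # skip one INT32 and two PTRs
    print(resolve_offset(enc, 4, 8))         # 20 on a 64-bit hardware target
    print(resolve_offset(enc, 1, 1))         # 3 on an HLL target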

5) Provide a way to define struct layout in the language, with lexical scope, and have offsets refer to items in a particular struct (perhaps all offsets encountered within some linear region of code refer to one struct layout)

6) Provide an architectural register that maps offsets to linear combinations of INT8_SIZE, INT16_SIZE, INT32_SIZE, INT64_SIZE, PTR_SIZE by essentially encoding a struct layout relative to offset 0. So, for example, if you want to compute 2*INT16_SIZE + INT32_SIZE + 3*PTR_SIZE, you might put "2*INT16_SIZE, INT32_SIZE, 3*PTR_SIZE" into that register and then set the offset to 6 (the idea is that you are telling the language the layout of items in memory and then telling it how many items in total to jump over, as if you defined a struct and were indexing into it). This is like (5) but with the struct definitions dynamically scoped rather than lexically scoped.

7) Give up on portability, but just for offsets

8) Like (4) but with INT8_SIZE (which is =1) and PTR_SIZE as the provided bases; if the runtime wants to use offsets to jump over INT32_SIZE, it must wastefully make use of the Boots guarantee that INT32_SIZE <= 4; but with 4 bit offsets, not even a single INT32 can be jumped over in this fashion. On platforms where INT32_SIZE is less than that (HLLs), this will waste space for INT32s by using more memory locations than needed (but, on such platforms, it probably matters less; and in places in the program where it does matter, just don't use offsets).

9) Like (4) but with: one bit for 1, one bit for 2, one bit for 4, one bit for PTR_SIZE. Using possibly wasteful encoding of INT16, INT32, this allows you to jump over: 7 INT8s, 3 INT16s, 1 INT32, 1 PTR, or any combination of these. This may be the best if a mixture of all 4 of these types is likely. Note that we can't hop over even a single INT64.

10) Of course even if you have a flat offset (4 bits specifies an offset of up to 16 memory locations), you can do what solution 9 does when you only are hopping over ints, but not PTRs.

11) Don't use the offsets. RESERVE them for lovm/lava's use (where they can be an index into fields of lexically defined structs).

Solution 4 seems like a good tradeoff between simplicity and efficiency, and between efficiency of instruction encoding (code density) and efficiency of memory storage representation. But it has problems:

This means that in solution 4, it's important to choose the special integer bitwidth right. All 3 obvious choices (int16, int32, int64) have their merits:

Choice int8 is equivalent to solution #8.

What do other systems do?

Seems like on the stack, the x86_64 linux calling convention pads everything to 8 bytes (so solution 3; 8,8,8,8,8). The JVM stores everything in 1 slot except for int64s (and float64s) which take 2 slots (so 1,1,1,2,1).

windows appears to pad to 8 bytes:

http://www.gamesetwatch.com/StackFrameAnatomy.png (from https://www.gamasutra.com/view/news/178446/Indepth_Windows_x64_ABI_Stack_frames.php)

Even the ARM calling convention pads everything on the stack to 4 bytes (so 4,4,4,4,4):

" If the argument is an integral Fundamental Data Type that is smaller than a word, then it is zero- or sign-extended to a full word and its size is set to 4 bytes. If the argument is a Half-precision Floating Point Type its size is set to 4 bytes as if it had been copied to the least significant bits of a 32-bit register and the remaining bits filled with unspecified values.

B.3.cp If the argument is a CPRC then any preparation rules for that co-processor register class are applied.

B.4 If the argument is a Composite Type whose size is not a multiple of 4 bytes, then its size is rounded up to the nearest multiple of 4. "

-- Procedure Call Standard for the Arm Architecture - ABI 2020Q2 documentation

So everyone seems to be prioritizing uniformity over space efficiency, at least on the stack. Although their reasons may be different from ours; they want things to be aligned, for perf reasons (which perhaps should concern us too).

however, Apple, which is famously memory frugal, apparently permits even 1-byte arguments on the stack:

https://developer.apple.com/documentation/xcode/writing_arm64_code_for_apple_platforms says: "Function arguments may consume slots on the stack that are not multiples of 8 bytes. If the total number of bytes for stack-based arguments is not a multiple of 8 bytes, insert padding on the stack to maintain the 8-byte alignment requirements. ... The caller of a function is responsible for signing or zero-extending any argument with fewer than 32 bits. The standard ABI expects the callee to sign or zero-extend those arguments. ... The following example illustrates how Apple platforms specify stack-based arguments that are not multiples of 8 bytes. On entry to the function, s0 occupies one byte at the current stack pointer (sp), and s1 occupies one byte at sp+1. The compiler still adds padding after s1 to satisfy the stack’s 16-byte alignment requirements.

void two_stack_args(char w0, char w1, char w2, char w3, char w4, char w5, char w6, char w7, char s0, char s1) {} ...(in a section on variadic arguments:) The C language requires the promotion of arguments smaller than int before a call. Beyond that, the Apple platforms ABI doesn’t add unused bytes to the stack. ... "

is it important to care about memory efficiency on the stack? Yes, sorta:

https://stackoverflow.com/questions/23472055/is-it-ok-to-allocate-big-structures-on-stac "It depends on your software architecture and your target.

If you are writing kernel code you don't want to use a lot of stack space; Linux kernel stacks are small.

User programs often have a 1 Mb or larger stack, so unless you have a lot of recursive routines, allocating a couple of kB on the stack usually isn't a problem."

how small are Linux kernel stacks? 8k (there was an attempt to move to 4k but it failed: https://elinux.org/Kernel_Small_Stacks#ARM_4K_Stacks "simply not enough space for the deep 60+ function call stacks we see with even trivial storage stack configurations"):

https://stackoverflow.com/questions/6270945/linux-stack-sizes

x86's 'push' instruction basically only supports the word size (sorta): https://stackoverflow.com/questions/45127993/how-many-bytes-does-the-push-instruction-push-onto-the-stack-when-i-dont-specif

Leaning towards solution 4, with int32 (so int32 and ptrs; so 1,1,1,2,?, because you still have two bases). This makes me sad though, because it means that we can't make good use of memory in a massively parallel 16-bit machine (or, we could not use those special immediate offset addr modes, and sacrifice code density, which is also memory). But, practically speaking, that's many years away, so i should probably be practical about it and give up on 16-bit. I mean, almost nothing on the market now is 16-bit; ARM and RISC-V are both 32-bit. And a toolchain with 16-bit addressing can't easily deal with many of today's programs, which tend to have executables (and in-memory data structures during runtime) larger than 64k.

If we do what everyone else does, though, it would look more like either 1,1,1,1,1, or 1,1,1,2,1; note the single base, which makes things simpler, and squares the number of guaranteed immediate offset choices for a given bit size (so, if you have 4 bits total, you can have 16 choices guaranteed instead of 4 guaranteed). If pointers are 64-bits, though, then this means that int32s take up twice as much space as they need to. Otoh maybe that's good anyways for alignment reasons. This is also significantly simpler. And, with 1,1,1,2,1, some implementations could do like WASM and have 32-bit VM pointers even if the underlying hardware has 64-bit pointers.

so now i'm leaning towards stack slots (one base), with 1,1,1,2,1.

in some cases, when you expect an offset into a homogenous array but different arrays can have items of different sizes, you can have an offset and a scale factor. This is more efficient to encode than a linear combination of many things b/c the scale factor only appears once and only needs 2 or 3 bits.
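a rough sketch of the offset-plus-scale idea in Python (the 2-bit scale encoding is illustrative):

    SCALES = {0: 'INT8_SIZE', 1: 'INT32_SIZE', 2: 'INT64_SIZE', 3: 'PTR_SIZE'}

    def element_addr(base, index, scale_code, sizes):
        # homogeneous array: one scale factor, encoded once, times a plain index
        return base + index * sizes[SCALES[scale_code]]

    sizes = {'INT8_SIZE': 1, 'INT32_SIZE': 4, 'INT64_SIZE': 8, 'PTR_SIZE': 8}
    print(hex(element_addr(0x1000, 5, 3, sizes)))   # 0x1028: the 6th pointer in an array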

now i'm leaning towards one of solutions #4 (2 kinds of slots with int32,ptr), #8 (2 kinds of slots with int8,ptr), or #3 (stack slots) with 1,1,1,2,1. Note that with #3:1,1,1,2,1 there should be a sinfo defining max(PTR_SIZE, INT32_SIZE).

  1. #8 is in a way more flexible because the program can just use offsets to talk about single memory locations if it wishes (with a reduced range of 4 instead of 16; otoh if it needs to jump over ptrs it can do so).

Note that even with #4, #8, #3, the program doesn't have to use offsets, so it doesn't have to use stack slots. Programs that wish to be more frugal with memory can do so, and just not use offsets.

I guess, pragmatically, #4 is better than #8 because realistically we will have INT32s more often than INT8s and INT16s (and when int8s and int16s are present, offsets don't have to be used). So the contest is between #3:1,1,1,2,1 and #4. In typical situations (both hardware and HLL), #3:1,1,1,2,1 in practice means that the slot size is the size of one PTR. So on most platforms the practical difference between #3:1,1,1,2,1 and #4 is that #3:1,1,1,2,1 can express ranges of 0 to 15 PTRs (and the program is free to use INT32s in PTR-sized slots), whereas #4 can express ranges of 0 to 3 PTRs, intermixed with 0 to 3 INT32s.
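To make that comparison concrete, here is an assumed decoding of a 4-bit offset field under each scheme (the 2+2 bit split for #4 is my illustration of "0 to 3 PTRs intermixed with 0 to 3 INT32s"; the helper names are hypothetical):

    #include <stdint.h>

    /* #3:1,1,1,2,1 -- all 4 bits count slots of size max(INT32_SIZE, PTR_SIZE). */
    static uintptr_t decode_offset_3(uint8_t field, uintptr_t slot_size) {
        return (uintptr_t)(field & 0xF) * slot_size;       /* reaches 0..15 slots */
    }

    /* #4 -- 2 bits count INT32s, 2 bits count PTRs. */
    static uintptr_t decode_offset_4(uint8_t field, uintptr_t int32_size, uintptr_t ptr_size) {
        unsigned n_int32 = field & 0x3;                    /* 0..3 int32s */
        unsigned n_ptr   = (field >> 2) & 0x3;             /* 0..3 ptrs   */
        return n_int32 * int32_size + n_ptr * ptr_size;
    }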

 (wrong:
  well actually, it's not correct to say that no one will be required to use stack slots if we do this, because the Boot calling convention mandates the use of the ROC, and stack slots would be used to index into the ROC. Is there a better way?
  Can/should we get rid of the ROC and just let advanced implementations cache it? Perhaps keep the semantic non-guarantees about whether stack accesses show up in memory / what reads and writes see when offsetting from SP, keep the flushes, but get rid of the requirement to use the special ROC load/store instructions?
  actually i don't think it's a problem; for the ROC you could provide multiple types of load/store instructions that index in different ways. As long as one of them indexes using single memory locations, you're good. So it's incorrect that this would force anyone to use stack slots.
  )

leaning towards #4. This is all just an optimization anyways, the simplest compilers won't even use the offsets.

---

another consideration related to the previous is that we want programs to be able to choose how they represent stuff on the stack, while still potentially being able to use offsets and stack ops.

So let's consider supporting the following choices of size basis:

  int8
  int16
  int32
  2*int32
  max(int32, ptr)
  max(int64, ptr)
  int64
  ptr

We can only choose one of these for offsets; can stack ops still support the others? Yes, because we can have multiple sets of stack op instructions, one for each combination of sizes (although this would waste a lot of opcodes).

On different systems, the sizes of int8, int16, int32, 2*int32, max(int32, ptr), max(int64, ptr), int64, ptr are all different quantities, so you might really need many different stack ops. Although, i guess i only see 8 items there (so 3 bits can select one), which means we can instead have an immediate argument to the stack op that specifies two sizes.
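A sketch of what such an immediate might look like -- with 8 candidate sizes, two 3-bit selectors (e.g. for a SWAP of two differently sized items) fit in 6 bits. The encoding and the names here are assumptions for illustration, not anything specified:

    #include <stdint.h>

    enum size_sel {                       /* 8 candidates -> 3 bits each */
        S_INT8, S_INT16, S_INT32, S_2INT32,
        S_MAX_I32_PTR, S_MAX_I64_PTR, S_INT64, S_PTR
    };

    /* 6-bit immediate naming the size of the top item and the item under it. */
    static inline uint8_t swap_imm(enum size_sel top, enum size_sel under) {
        return (uint8_t)(((unsigned)under << 3) | (unsigned)top);
    }
    /* e.g. swap_imm(S_INT32, S_PTR) describes swapping an int32-sized item with
       the ptr-sized item beneath it. */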

however all this (including the prospect of having a zillion stack op variants; somehow JVM gets away with less? see https://stackoverflow.com/questions/57398474/is-there-a-way-to-swap-long-or-double-and-reference-values-on-jvm-stack ) makes me think that perhaps we should KISS and just pad everything to max(int64, ptr). If people don't want to do that, fine, they don't have to, but they won't be able to use the stack ops or the offsets.

or, we could just say that stack slots are always the size of PTR (the presumptive 'word size'). Reject 16-bit systems, guarantee that ptrs are at least 32-bits, and make 64-bit quantities take up 2 stack slots. Using this system we would go back to a simple displacement (16 choices of # of stack slots) and we would need 2 variants of stack ops (one for single stack slot, one for double). I guess this is actually just the same as #3:1,1,1,2,1, except i guess i'm realizing there is a distinction between guaranteeing that PTR is at least 32-bits, and saying stack slots are max(int32, PTR). Maybe the latter is better...

leaning towards #3:1,1,1,2,1 (stack slots of size max(INT32_SIZE, PTR_SIZE))

---

tangentially relevant to the previous but very good blog post (gives methods of dealing with AArch64's requirement that SP be 16-byte aligned when implementing VMs that expect to be able to push or pop single registers): https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/using-the-stack-in-aarch64-implementing-push-and-pop
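One approach along those lines is to stop using the hardware sp for the VM's value stack altogether; the following is my own C-level caricature of that "use a different register as the stack pointer" idea, not code from the post:

    #include <stdint.h>

    /* The VM's value stack lives in ordinary memory and is bumped one word at a
       time through its own pointer; the hardware sp stays 16-byte aligned and is
       only adjusted in aligned chunks for native frames. */
    static uint64_t vm_stack[1024];
    static uint64_t *vm_sp = vm_stack + 1024;   /* grows downward */

    static inline void     vm_push(uint64_t v) { *--vm_sp = v; }
    static inline uint64_t vm_pop(void)        { return *vm_sp++; }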

---

yknow, stack slots just aren't good enough for displacement, b/c displacement should be usable to index into structs, and allowing only structs whose fields are all one fixed slot size is kind of crippling. So, in displacements, i think we really need at least INT32_SIZE and PTR_SIZE (solution #4). So should we just do that on the stack also? Probably

leaning towards #4

---

this (#4) seems overly complicated, but the alternatives seem to be fixed-size slots everywhere, or not using offsets, or even more complicated stuff like lexically-scoped struct definitions, or giving up portability to some extent.

can't we just give up portability a little? NO, because then Boot is really only an intermediate language, not suitable as a source language (you want a human to manually rewrite and separately maintain the entire Boot program for each port?) -- and for bootstrapping, we need a source language.

any individual program could give up displacement offsets, however. And it could give up on the ROC addressing modes and just use the ROC instructions directly (a single program can't totally give up on the ROC unless it doesn't need to be compliant with the Boot calling convention, but that's okay, because it can use the ROC instructions directly instead of the stack cache and frame cache addressing modes).

so:

this suggests that something similar to #4 is pretty good. There are a few choices. Consider a 4-bit offset field:

So even supporting any 16-bit offsets at all comes at a significant price to the number of popular 32-bit offsets you can support.

For example, the first choice on the above list can support at least one offset of any size up to 32-bit, but can only support 1 offset of the most popular sizes, and can't support any 64-bit offset. The last choice maximizes the most popular sizes (3 instead of 1), and can support one 64-bit offset, but can't support smaller sizes at all.

It's clear that with any of these choices, few programs will be able to use only these offsets -- most structs will require larger offsets. This is in contrast to the fixed slot-size world, in which case 4 bits gives you 16 slots, which is more than enough for most uses.

If you had an 8-bit offset field, it would be easier to support a variety of field sizes while still supporting multiple offsets in every size:

Here, we can support every size and still get 7 of the most popular sizes.

What about the 6 bits that we'll probably actually have?

This could probably address into a sizable minority, or perhaps majority, of structs.

Given that it is becoming clear that these in-instruction addressing-mode offsets and displacements will not be able to handle a lot of common use cases, perhaps we should optimize for the common case (32-bit ints and pointers) rather than for comprehensiveness. Also, C pads structures at the end so that the total size of the struct is a multiple of its alignment (the alignment of its most strictly aligned member) -- since this will often be at least 32-bit alignment, this is even more reason for wanting more 32-bit range.
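That tail-padding behavior is easy to check in standard C (any common ABI should print the same numbers):

    #include <stdint.h>
    #include <stdio.h>

    struct s { int32_t a; char b; };   /* most strictly aligned member: int32_t */

    int main(void) {
        /* typically prints "8 4": 3 bytes of tail padding make the struct's size
           a multiple of its alignment. */
        printf("%zu %zu\n", sizeof(struct s), _Alignof(struct s));
        return 0;
    }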

So this is a long-winded analysis that just concludes what we had already started with; INT32_SIZE is the best integer size basis, and it should have half the bits.

So, for 4-bit offsets:

and for 6-bit displacement:

And we resign ourselves to the fact that these within-instruction addressing-mode offsets/displacements can only be used in some cases, as an optimization (which can be added in a peephole pass), rather than as the main facility for indexing into structs.

This does have a conceptual coherency to it, though. The main thing that you can't get around, for portability, is that we don't know the pointer size. So, we must have at least two bases/fields, one for pointer sizes and one for integer sizes. And our 'default' integer size is 32-bit, because it's the first one large enough to not make the programmer think about it in most cases.

---

should we guarantee that a PTR_SIZE location can fit an INT32, i.e. PTR_SIZE >= INT32_SIZE (in Boot, not in Boots)? I think so.

---

in general, i think we should move towards tailoring this stuff, efficiency-wise, for hardware platform sizes (e.g. INT32_SIZE is 4, not 1). The HLL sizes have to exist for the sake of interop with native data structures. But if a Boot program can better utilize offsets and displacement instructions by assuming in its own internal data structures that INT32s take up 4 memory locations, maybe do that. So, for example, if we only provide INT32_SIZE as a basis within stack offsets and displacements, then programs must multiply that by two to guarantee enough space for an int64; this is wasteful on HLLs, where INT32_SIZE = INT64_SIZE = 1, but that's okay b/c it makes sense on hardware platforms, which is where you care the most about efficiency anyhow.
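A tiny worked example of that tradeoff (the concrete sizes are assumptions for illustration):

    /* Reserving int64 space using only the INT32_SIZE basis means 2 * INT32_SIZE:
         hardware target: INT32_SIZE = 4 locations (bytes) -> 8, exactly right
         HLL target:      INT32_SIZE = 1 cell              -> 2 cells reserved,
                          even though an int64 also fits in 1 cell there. */
    #define HW_INT32_SIZE  4
    #define HLL_INT32_SIZE 1

    enum {
        hw_int64_reservation  = 2 * HW_INT32_SIZE,   /* 8 */
        hll_int64_reservation = 2 * HLL_INT32_SIZE   /* 2, 1 cell wasted */
    };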

---

while working on the Boot addressing modes:

or, at least, maybe most of this fancy stuff for code density (like addressing modes) should be split off and go into a Boot extension. People writing tooling targeting Boot want to know about instructions like 64-bit int and float arithmetic, and various comparison and conversion ops. They might not want to know about this code-density stuff (which will probably just be added transparently by peephole optimizers in most cases).

otoh each of these little complexities seems to be orthogonal to the others, and people targeting Boot can just ignore them if they want.

---

Consider label size distinct from pointer size

later: no, i think we already have too many size types; as we recently saw, that makes it hard to encode offsets into small numbers of bits

---

Does WASM have a limit on the number of arguments in a function call?

not really -- arguments are just values on the operand stack, and the core spec doesn't impose a limit (embeddings do set implementation limits, though; the JS embedding caps function parameters at 1000, iirc)