proj-oot-ootAssemblyNotes29

ok i'm trying to understand which calling convention we should have, and tail calls, etc.

i've read two things:

ok, on the callee-vs-caller cleanup question:

perhaps the 'share the same stack frame layout' assumption is at fault. Imagine that C takes more arguments than B. So e.g. to clean up B's stack frame, maybe you have to bump up the stack pointer by 10, but to clean up C's stack frame, you have to bump it up by 20. Now if control is returned directly from C to A (because B tail-called C), and if A is supposed to do cleanup, all A knows is that it called B, so it'll bump up the stack pointer by 10, which is wrong, because the size-20 stack frame is in effect because of the call to C.

ok so this makes the vararg thing almost, but not quite, make sense, i think. if A decides to call B with a size-30 stack frame this time (B being vararg), and then B tail-calls C, and we have callee cleanup, then C cleans up its own stack frame and jumps back to A, but no one cleans up B's stack frame, because B was supposed to clean that up but B got skipped.

i think we may need a bigger example to demonstrate: A->B->C->D, all calls in tail position, so we want to return directly to A. B->C is a vararg call. So the two parties in a position to know the size of the vararg stack frame are B and C, but both of them got skipped on the return.

this still doesn't quite make sense though, because what is preventing C from cleaning up the vararg stack frame before the call to D?

https://stackoverflow.com/questions/35626835/make-a-variable-argument-function-callee-cleanup maybe sheds some light here. it seems like the convention is to put the return address last (at the top of the stack/bottom of memory), below the arguments. So maybe you don't want to bump up the stack pointer before returning, because then the return address is in the de-allocated-ish area below the stack pointer. And i guess the compiler can't be bothered to just manually move the return address up, overwriting earlier arguments. There is an instruction in x86, 'RET n' (with n an immediate), that reads the return address, moves the stack pointer up by n, and then jumps to the return address which was read. But if the function is vararg then n is only known dynamically, not statically, and it has to be an immediate, so it doesn't work.

so that last bit about the vararg sounds like it's something specific to limitations in the x86 convention, conveniences in the instruction set, and the way that compilers (and probably branch predictors) insist on using that.

more about vararg stuff [2]

so it sounds like callee cleanup is the way to go.

other notes i found while reading about this:

"The fewer times you modify the value of esp explicitly, the better. There is more going on in these instructions than meets the eye." -- https://cboard.cprogramming.com/tech-board/117965-callee-cleanup-variable-number-parameters.html - that's nice but it sounds fancy -- and if we're doing fancy stuff i'm betting tail call elimination is more important

strangely https://www.agner.org/optimize/calling_conventions.pdf doesn't have much to say about this.

https://github.com/microsoft/ChakraCore/issues/796 has a good point about another way to do varargs with TCO: with caller-cleanup, if A->B->C->D and B->C is vararg and D skips over B right back to A when returning, it's true that A doesn't know how much stack B allocated to hold the varargs, but A can still make things right just by remembering what its stack pointer was before the call (that is, treating SP as caller-save instead of callee-save) -- of course since it's the stack pointer itself that is being lost, it has to save it in some other callee-save register, rather than on the stack. They don't like doing this because it uses up another callee-save register on each such function call (so one less register, and also all those saves and restores). They seem to think a better alternative would be callee-save, but they have a lot of vararg code that would need to be rewritten or something. Sounds like they decided to do nothing in the end (not support TCO).

Q: "Everywhere it says that the called function doesn't know how many parameters have been passed, but in that case why isn't it possible to simply put that data into a register or push it onto the top of the stack for the called function to use?"

A: " There is no fundamental reason why the callee couldn't clean up space for variables. In most architectures, the standard calling convention doesn't handle variables this way, but that doesn't mean it's not possible to do so. For variable-length argument lists, you could either pass data about the number of arguments as a hidden parameter (like how this is handled in many object-oriented languages), or put a pointer on the stack that shows where the arguments end, etc. " -- https://stackoverflow.com/questions/19822045/what-prevents-the-callee-from-cleaning-up-the-stack

"The callee can clean variable arguments from the stack. I made it once, actually. But the code is pretty big.

Anyway, the main reason why in cdecl convention, the caller cleans the stack is other ...On some architectures (usually very small, as the old 8080 or 6800), there is no ret n instruction that to automate the stack cleaning and as a rule they can't make arithmetics with the stack pointer as well.

So, the callee have to first pop the return address from the stack in order to reach the arguments, then to pop all arguments and then push back the return address....When cdecl convention is used 2 instructions and one register use are spared:...And because for single language it is better to use single calling convention on all platforms, the CCALL as more simple and universal looks better. (C language is created in times when 6800 was a high tech).

But notice, that on these platforms, the assembly programs and the native languages (for example different kind of BASICs) usually use register argument passing which is of course much more faster on such small systems. " -- https://stackoverflow.com/questions/19822045/what-prevents-the-callee-from-cleaning-up-the-stack
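a rough C sketch of the hidden-count idea from the first answer above (the convention and 'sum_n' are made up for illustration, not any real ABI): the caller passes an explicit argument count, so the callee knows exactly how many argument slots it consumed and could pop them itself under a callee-cleanup convention:

    #include <stdarg.h>
    #include <stdio.h>

    /* hypothetical: the 'hidden' argument count made explicit as a first parameter */
    static long sum_n(int nargs, ...) {
        va_list ap;
        long total = 0;
        va_start(ap, nargs);
        for (int i = 0; i < nargs; i++)
            total += va_arg(ap, long);  /* the callee knows exactly how many slots
                                           it consumed, so a callee-cleanup ABI
                                           could pop them here */
        va_end(ap);
        return total;
    }

    int main(void) {
        printf("%ld\n", sum_n(3, 1L, 2L, 3L));  /* prints 6 */
        return 0;
    }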

some languages put (RAIIish?) destructors in their (caller?) cleanup routines, preventing TCO:

" Many Perl idioms rely on strict stack discipline. For example, implementing cleanup code as the destructor of an object held in a lexical variable in a particular stack frame. The result is that implicit tail-call optimisation would break all sorts of things. Hence tail-call semantics are only safe to invoke *explicitly*, which is what Sub::Call::Tail provides. " [7]

---

it sounds to me like callee cleanup is the way to go. TCO may be important, varargs probably not (can always store those in memory).

---

next question: store the first few regs in registers in Boot calling convention?

pro: faster, if regs are really regs

pro: this allows Boot syscalls with few arguments to avoid hitting 'main memory' (until we're in the actual implementation of the call)

con: in LOVM, seems cleaner to put args in smallstacks. That way they can be shifted similarly to 'input argument' and 'output argument' register window sections. if we have the first few args in registers and the rest in smallstacks, that's confusing

my leaning: at first i was leaning towards not using regs at all (just the stack) for a cleaner transition to LOVM, but it galls me to have to hit memory just to do a syscall. So now i'm leaning towards using regs. I kinda think there might be a way to make the register window thing work cleanly anyway with the regs.. after all, if you need to preserve the input args while you are writing the output args, you'd just move them onto the stack, right? So mb it's not that much worse. So i kinda think the thing to do is try using regs and reconsider if necessary after implementing an x86_64 assembly version of Boot and LOVM (in the far far distant future).

decision: for now, use regs

---

next question: how many regs?

so, should we pass 2 or 3 args in registers?

if 3, then no space for a temp, and maybe we should add instructions: push-int-to-SP-and-decrement-SP, push-ptr-to-SP-and-decrement-SP

or at least add back in:

because currently, without a free temp register, in Boot, you can't even push to stack (you can't predecrement the SP without loading an immediate into a register)

otoh mb most of the syscalls we want to make can use only 2 (remember we have 2 banks, so 4 spots in total, so as long as both integers and ptrs are being passed, you get more than 2 arguments), so maybe it's okay to just pass 2 args in registers and leave a scratch register.

also, it sucks to add appm and ap32m in Boot, b/c (a) you may want to add two versions of them b/c we don't have signed immediates in oot, and (b) in LOVM they are a duplication because we can have a general instruction like this with both an immediate and register addr mode for op2.

regarding (a), well i guess an upward-growing stack is POSTincrement, which isn't a problem (copy the argument to the stack, then overwrite that register with the immediate, then add that immediate to the stack pointer (postincrement); now you have one free register); it's the predecrement convention in the downward-growing stack that's a problem. So you don't really need to add one version for each direction, even if the immediate imm7 is unsigned. So it's only 2 more instructions, and hey, maybe they will be useful anyway b/c they will have a 7-bit unsigned immediate range rather than a 6-bit signed immediate range (which gives only 5 bits of unsigned range).
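a minimal C sketch of that point (push_up/push_down are made-up names, simulating the stack in an array, not Boot instructions): the upward-growing push can store first and then add an unsigned immediate to the pointer, while the downward-growing push has to move the pointer down before the store:

    #include <stdint.h>

    void push_up(uint32_t **sp, uint32_t v) {
        **sp = v;     /* store at the current slot ... */
        *sp += 1;     /* ... then bump the pointer up (an add of an unsigned immediate) */
    }

    void push_down(uint32_t **sp, uint32_t v) {
        *sp -= 1;     /* must move the pointer down first (needs a subtract,
                         i.e. a signed/negative immediate) ... */
        **sp = v;     /* ... before the store lands in the new slot */
    }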

so far i've looked at a few syscalls tho and none really need more than 2 args of one type for the most basic functionality:

open, read, malloc

i'm leaning towards just using 2 regs for arguments:

decision: 2 regs for now

---

next question (which arose while researching the previous one):

what to do when, depending on platform, APIs might return an integer or a pointer?

For example, on Linux, the open() syscall returns an integer file descriptor. On Windows, the CreateFileA call returns a ('file'?) handle, and the (older?) winbase.h OpenFile also returns a file handle.

https://stackoverflow.com/questions/902967/what-is-a-windows-handle

https://www.codeproject.com/Articles/1044/A-Handy-Guide-To-Handling-Handles

what does the Windows API mean by 'handle'? I like Dan Moulding's stackoverflow answer:

" A HANDLE is a context-specific unique identifier. By context-specific, I mean that a handle obtained from one context cannot necessarily be used in any other aribtrary context that also works on HANDLEs.

For example, GetModuleHandle? returns a unique identifier to a currently loaded module. The returned handle can be used in other functions that accept module handles. It cannot be given to functions that require other types of handles. For example, you couldn't give a handle returned from GetModuleHandle? to HeapDestroy? and expect it to do something sensible.

The HANDLE itself is just an integral type. Usually, but not necessarily, it is a pointer to some underlying type or memory location. For example, the HANDLE returned by GetModuleHandle? is actually a pointer to the base virtual memory address of the module. But there is no rule stating that handles must be pointers. A handle could also just be a simple integer (which could possibly be used by some Win32 API as an index into an array). "

This resurfaces the motivation for the ANY type registers that i had in there awhile ago but removed. Here we have something that:

interestingly, we already have something like this: codeptrs. If you think about it, in an interpreter a codeptr could be represented as an integer index into a list of instructions, rather than as an actual pointer into memory (with 'instructions' conceptually residing in a special 'code memory' which is not part of or accessible from data memory).
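a minimal C sketch of that (Instr and the opcodes are invented for illustration): the codeptr is just an index into the instruction array, a jump is just an index assignment, and code memory never has to be addressable as data:

    #include <stddef.h>

    typedef struct { int op; int arg; } Instr;

    typedef size_t codeptr;   /* an index into code[], not a data pointer */

    enum { OP_HALT, OP_JMP };

    void run(const Instr *code, size_t ncode) {
        codeptr pc = 0;
        while (pc < ncode) {
            Instr i = code[pc];
            switch (i.op) {
            case OP_JMP:  pc = (codeptr)i.arg; break;  /* 'jump' = assign a new index */
            case OP_HALT: return;
            default:      pc++; break;
            }
        }
    }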

so should we (re)create a third register bank for these guys (what i used to call the ANY type)? it might be sort of nice for the implementation to distinguish between these guys and dereferenceable pointers, for garbage collection purposes. These guys could be 'boxed' with additional bits that say whether they are really integers or really pointers.

Disadvantages:

Advantages:

out of these, i think the most important considerations are:

so the proposal is:

i'm leaning towards this.

---

we may want to introduce alignment restrictions to prevent the problem noted above about another thread seeing an invalid pointer in the middle of a write? otoh if each thread has its own GC then the GC and the implementation can have a mutex for when the implementation main thread is writing to memory. if ptrs are opaque, how do we align? mb we'd need an instruction for that

done

the stack also has alignment restrictions on some targets (eg 16-byte alignment on x86-64, at least on x64 windows)

i think that's an ABI thing though, so i guess we can ignore it in Boot (where we have Boot's own convention, rather than trying to interop with the underlying platform convention). And in LOVM we can abstract it.

---

what do garbage collectors do about pointers that may be invalid (e.g. a C pointer that points past the end of an array)?

https://www.hboehm.info/gc/04tutorial.pdf says

" Pointer validity check

elsewhere in those slides it's clear that memory allocation is done thru the GC, so the GC is aware of where the valid memory is

"Unlike [BoehmChase?92] we start with the assumptionthat the garbage collector recognizes all pointers to theinterior of an object, not just to the first byte of the object.Recent experience suggests that this is the right framework,particularly for typical C++ implementations whichimplicitly generate pointers to the interior of an object. Thetechniques of [Boehm93] can greatly reduce the danger ofspace leakage that we previously associated with thisapproach. This new assumption greatly simplifies our task.Unfortunately, it also usually invalidates assumption (A) of[BoehmChase?92], so our correctness argument has to bedifferent

...

We assume that the garbage collector recognizes any address corresponding to some place inside a heap-allocated object as a valid pointer. ([Boehm95] satisfies these assumptions in its default configuration)"

[8] "Simple Garbage-Collector-SafetyHans?-J. Boehm"

so it looks like it actually checks potential ptrs to see if they are in valid memory before it derefs them. oh man what a pain (but oh man so amazing)
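a minimal C sketch of that validity check (heap_lo/heap_hi are a stand-in for the GC's real segment tables): before treating a word as a pointer, check whether it lands inside memory the allocator actually handed out:

    #include <stdint.h>
    #include <stdbool.h>

    static uintptr_t heap_lo, heap_hi;   /* maintained by the allocator (stand-in
                                            for the real per-segment bookkeeping) */

    bool maybe_pointer(uintptr_t word) {
        /* interior pointers count too, per the quote above: anything landing
           inside [heap_lo, heap_hi) is treated as a potential reference */
        return word >= heap_lo && word < heap_hi;
    }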

---

"OurMagpietool? converts C source programs to cooperatewith precise GC ... Some constraints on the C program depend on the specific GCimplementation that is used with Magpie-converted code. The GCmay supportinterior pointers, i.e., pointers into the middle of a GC-allocated object, including to the end of the object. In practice, wefind that interior pointers are common in real C programs, whileprograms specifically adjusted to work with precise GC (such asthe PLT Scheme runtime system) can avoid them to help improveGC performance. Note that a pointer just past the end of an arrayis legal according to the C standard, and Magpie considers suchpointers to be valid interior pointers " [9]

---

also, mb i should allow <= comparison on pointer values. I didn't use to allow it because it may not make sense. But there are times (e.g. sort a list of pointers so that you can do binary search to determine the presence of a pointer in that list) when you really want to compare pointers, and you don't care if the comparison has any other meaning aside from being a total ordering.
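a minimal C sketch of that use case (ordering on the numeric value of the pointer, with no other meaning attached): sort the pointers once, then test membership by binary search:

    #include <stdlib.h>
    #include <stdint.h>
    #include <stdbool.h>

    static int cmp_ptr(const void *a, const void *b) {
        uintptr_t pa = (uintptr_t)*(void *const *)a;
        uintptr_t pb = (uintptr_t)*(void *const *)b;
        return (pa > pb) - (pa < pb);   /* any total order will do */
    }

    /* sorts in place; in real code you'd sort once and search many times */
    bool contains(void **ptrs, size_t n, void *needle) {
        qsort(ptrs, n, sizeof *ptrs, cmp_ptr);
        return bsearch(&needle, ptrs, n, sizeof *ptrs, cmp_ptr) != NULL;
    }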

what do other languages with opaque pointers/no pointer arithmetic do about pointer comparison? e.g. what does Golang do?

golang allows equality comparison only, but not greater-than, less-than: https://stackoverflow.com/questions/45818964/how-can-i-compare-pointers-in-go

---

later i decided to remove register/smallstack banking again -- see discussion in [[ootBootNotes1]].

---

so what about unified register files with floating point as ordinary gprs?

https://stackoverflow.com/questions/51471978/is-there-any-architecture-that-uses-the-same-register-space-for-scalar-integer-a

sounds contentious but seems like today's best practice is to bank them, eg separate floating point

also most mechanical sympathy on popular archs is to bank fp

so bank the fp regs, but not the smallstacks (b/c fp values on the stack are in the same linear-memory substacks as everything else). btw https://stackoverflow.com/a/52175510/171761 suggests having more caller-save registers and fewer callee-save regs in the fp bank, unlike his suggestion for the integer bank

--- old

maybe we can get away with having the primary stack be above the stack pointer and the output stack be below the stack pointer, and pushing to the output stack all at once in a calling sequence. Alternately, since the primary stack and output stack capacities are fixed, if you save those capacities in a register then you can offset the beginning of the output stack from the frame pointer.

yeah I think we need a register to hold the frame allocation information in. At least this is cheaper than fixed-size register windows where you always have to push a certain window size onto the stack. Again consider making those allocation quantities powers of two -- then maybe we really can fit in all the variables we may need. I think we do need an extra register for holding the primary stack position. Maybe reserve a register for the PC. I'm still thinking about whether there's a way for the Boot calling convention to put arguments in registers. I guess one way is just to use the out stack addr mode as registers! That would imply that the recipient, not the caller, is the one who shifts out to local. Also maybe out should be registers, not a stack, even in LOVM. Consider however that a nice thing about Forth was that function calling was cheap (eg no callee-saving etc to be done).

Maybe you can simplify all of this by having one stack and specifying how many output args there are when you call the next function

mb don't even let the program read the stack pointer when the stack is not flushed; that way we don't have to have an extra register for the tip of the cached stack

The thing is, if it's one stack, then that stack is interrupted by the return address and the frame pointer and any other callee saves, local variables, or other junk we have in the stack frame. So that makes it inefficient to translate a stack position to an address. If it's two stacks then it's inefficient to push more to the argument input stack, because directly after that is the stuff in our stack frame. This suggests that the argument input stack should be mapped to registers, and the only stack should be the out stack.

So callee cleanup means that the input arguments are cleaned up too, right? So our output arguments are passed where the input arguments were on the stack?

I think what you're looking for is that out could really be registers. We could rely on the callee to flush out to the stack if needed.

But in that case, how do we use the stack pointer in boot?

I guess maybe if the program never touches the stack pointer then this stuff is abstract

---

i can't think of a good way for the LOVM calling convention to match up with a Boot all-register calling convention without actually moving the smallstack addr modes from LOVM to Boot. In which case we could do this:

caching non-guarantees:

some benefits to this:

done

---

so is there a way to do the previous without actually importing stack cache addr mode semantics into Boot?

The issue is probably that we want to allow the implementation to choose to store the smallstacks in memory, or to store them in regs (or in a memory cache). But if they are in memory then that's visible to Boot. But they may or may not be in memory. This indeterminacy is visible and so must be mentioned in the spec.

Can we just tell Boot to put things on the stack in memory? If we want interoperability (programs mixing Boot and LOVM functions), that's hard for the implementation to detect and map into smallstacks. Imagine if a Boot function called a LOVM leaf function. The LOVM leaf function would not feel a need to flush anything (and we don't want to make it), because it wants to just use the output stack as its arguments, since it doesn't need fresh output stack for its own outgoing arguments. It can just not adjust the frame pointer, and offset its own locals off the stack pointer. But the Boot program put the input arguments onto the stack in memory.

Is there any point keeping LOVM separate from BootX if we are putting all these complicated semantics into Boot?

There may be another way. It may be hard to detect a write to the stack pointer in general, but if we have a PUSH instruction, or even better a PUSHMEMSTACK instruction and a POPMEMSTACK instruction, that is easier to detect. The implementation could intercept those instructions and have them write to the output smallstack cache instead. We could have flushStackToMemory and flushMemoryToStacks instructions and specify that until they are called, as above, there are caching non-guarantees. The benefit of these gymnastics is that we maintain our load-store architecture and our address-mode-less architecture in Boot, rather than having these memory-associated locations be targetable by any operand.
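a rough C sketch of that interception scheme (StackCache, push_cached, etc. are made-up names; assumes a downward-growing stack in linear memory): pushes and pops hit a small register-like cache, and only the explicit flush makes the cached tip visible in memory:

    #include <stdint.h>

    #define CACHE_SLOTS 4

    typedef struct {
        uint32_t *mem_sp;               /* stack pointer into linear memory (grows down) */
        uint32_t  cache[CACHE_SLOTS];   /* cached top-of-stack items; cache[0] is oldest */
        int       cached;               /* how many items live only in the cache */
    } StackCache;

    void push_cached(StackCache *s, uint32_t v) {
        if (s->cached == CACHE_SLOTS) {          /* cache full: spill the oldest item */
            *--s->mem_sp = s->cache[0];
            for (int i = 1; i < CACHE_SLOTS; i++)
                s->cache[i - 1] = s->cache[i];
            s->cached--;
        }
        s->cache[s->cached++] = v;
    }

    uint32_t pop_cached(StackCache *s) {
        if (s->cached == 0)                      /* cache empty: refill from memory */
            return *s->mem_sp++;
        return s->cache[--s->cached];
    }

    void flush_stack_to_memory(StackCache *s) {
        for (int i = 0; i < s->cached; i++)      /* oldest first, so memory layout matches */
            *--s->mem_sp = s->cache[i];
        s->cached = 0;                           /* after this, memory and SP agree */
    }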

why just push and pop? perhaps push, pop, read, and write. In fact, maybe just read and write -- then we can add the requirement that you flush before moving the stack pointer, which may make things even simpler for the implementation, b/c it doesn't have to keep track of stack ptr moves. Then we can drop this requirement for LOVM.

seems like this may make it harder to abstract stack frames later though? but maybe we are saving that for OVM.

i feel like there is a more abstract way to do this. But i can't think of anything that wouldn't be too abstract for Boot.

done (register-associated cache)

---

LOVM is looking more and more similar to Boot. Maybe just merge LOVM and BootX? (could use the name LOVM for something further up the stack later, as previously planned)

---

so i was thinking about calling conventions. I was hoping there was a way to do all of:

i don't think it's possible. In fact, i don't think even the first three are possible simultaneously:

now, the main issue of them having different calling conventions is: what about the syscalls? Boot is supposed to be a subset of LOVM, so the syscalls should work the same in both cases.

which is also the reason we wanted to pass arguments in registers

so, we could just say that Boot passes stuff on the stack, but that means you have to have available stack memory to do a syscall, and also that you have to take the performance hit of hitting memory to do a syscall.

---

how does linux do syscalls on x86?

there are three types: legacy syscalls with the 'int' instruction, 'fast' 32-bit syscalls with the SYSENTER instruction, and 'fast' 64-bit system calls with the SYSCALL instruction.

With legacy syscalls, the syscall number and up to 6 args are passed in registers [10]

with fast 32-bit system calls, the syscall number and up to 5 args are passed in registers [11]

with fast 64-bit system calls, the syscall number and up to 6 args are passed in registers [12]
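a minimal sketch in C (x86-64 Linux only, using gcc/clang inline asm; it bypasses libc and is for illustration) showing that the syscall number and the first few arguments travel entirely in registers:

    #include <sys/syscall.h>   /* SYS_write */

    long raw_write(int fd, const void *buf, unsigned long len) {
        long ret;
        __asm__ volatile (
            "syscall"
            : "=a"(ret)                              /* rax: return value */
            : "0"((long)SYS_write),                  /* rax: syscall number */
              "D"((long)fd), "S"(buf), "d"(len)      /* rdi, rsi, rdx: args 1-3 */
            : "rcx", "r11", "memory");               /* clobbered by SYSCALL */
        return ret;
    }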

how does windows nt do syscalls on x86? presumably it uses the same instructions; but user code calls the windows API using ordinary library calling conventions. The Win32 API uses the stdcall calling convention: https://en.wikipedia.org/wiki/X86_calling_conventions#stdcall which passes arguments on the stack, not registers.

not sure if there is a 'win64' API or if it has a different calling convention. https://wiki.freepascal.org/Win64/AMD64_API#Calling_convention suggests that it does, and that this one passes the first 4 params in regs. https://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_calling_conventions says the same. Although the win64 calling convention passes the first 4 args in registers, it also requires that 32 bytes of 'shadow space' be allocated for them on the stack.

so my conclusion is, yes, there does seem to be a need to pass syscall args in registers

---

so i think we're back to what i was planning above:

my idea about a loadwithcache, storewithcache for every register, not just SP -- altho in Boot, to save operands (so that we could go back to a 16-bit Boot if we wanted), only require the implementation to accept these with register SP (so that they can be elided to their two-operand forms).

done (register-associated cache)

---

just some random thoughts on bitwise data manipulation after a dream i had

this first paragraph is just me writing down whatever nonsense was in my head right after waking up: Predication with two lanes (so each bit can specify a lane). Forwards and backwards in time? Mb not. Each instruction has an and and an or. Src/dst? Arc?. Not sure if this is right but 2 lanes * (src and or dst) where each of src dst and or is 4 bits would be 32 bits. Quantum computing with all qubits restricted to the standard basis?? Quantum computing with just one qubit??

Note that a qubit has two degrees of freedom. I think 16 bits can specify one 2-qubit quantum gate (64 bits would be needed to specify one 3-qubit quantum gate, I think). I think four bits can specify one 1-qubit quantum gate (so we can't achieve quantum universality this way). Note that in actual quantum computing the elements of the gate matrix are real numbers, not binary, so to reiterate, we are only talking about a subset of this.

In addition, the numbers in the matrix in quantum computing can be complex. And in addition the matrix must be unitary. So my counts above seem not quite correct. What if we restrict to 1, 0, i, -i, and unitary? How many bits does it take to specify a one-wire quantum gate?

Mb for each bit input (wire input) you can have 4 choices: f t x -x.

A classical gate on bits with two inputs and two outputs has four input conditions and for each of those specifies one of four possible output conditions. So 4^4 = 256 possible gates (and the table consists of four pairs of two output bits each, so eight bits).

So specifying a gate with two bits of input and one bit of output takes four bits. So if we have two predicate bits then we can specify any combined predication with four bits, and we can specify any gate on the bits themselves with 8 bits.
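a tiny C sketch of that counting argument: a gate with two input bits and one output bit is fully described by a 4-bit truth table (one output bit per input combination), so applying it is just an index into the table:

    #include <stdint.h>

    /* bit i of table4 holds the output for input pair (b1,b0) = i */
    static inline unsigned apply_gate(uint8_t table4, unsigned b0, unsigned b1) {
        unsigned idx = (b1 << 1) | b0;   /* 0..3 */
        return (table4 >> idx) & 1u;
    }

    /* e.g. AND is the table 0x8 (only input 3 yields 1), OR is the table 0xE */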

Src and dst could be registers for a conditional move, or they could be a choice of predication bits out of a larger set of predication bits

Okay but moving on from the dream and just thinking rationally: if you have two predication bits p0, p1, what would be some good things to do with them? you probably don't need to predicate each instruction with the full 4-bit comprehensive conditional over those two bits. Instead you might have two bits per instruction to specify one of: if p0, if p1, if p0 or p1, if p0 and p1. Then you might occasionally have an 8-bit predication gate on the predication bits? Not quite right yet probably

I guess if you have 16 predication bits then in 16 bits you can say: specify bit number one, specify bit number two, specify gate. 4 bits + 4 bits + 8 bits for a total of 16 bits.

Another thing you can do with 16-bit instructions and 16 1-bit registers is specify each of two source operands and one destination operand (12 bits total) and then specify one operation in 4 bits.

And remember that the original idea with two lanes was that each instruction has two stop bits which determine whether it can be executed in parallel with the previous instruction in that lane (an instruction with both stop bits set waits on both lanes, so it synchronizes)

Other fun things that you can do with a single register of many bits is permute the bits, or permute-with-replacement. either of those takes many bits to specify the permutation though; for a 16-bit register, permute with replacement means making 16 choices 16 times, so 4*16 = 64 bits. Since we're certainly not going to have instructions greater than 64 bits at this level, that suggests that a 16-bit bit register is too many bits. What about a 4-bit bit register? Permute with replacement is now four choices four times, which is 8 bits. And the two-wire gate with source and destination operands out of four choices requires 8 bits for the gate and 4 bits for the operands for a total of 12 bits. A three-wire gate with two inputs and one output, and four bit registers, requires four bits for the gate and six bits for the operands for a total of 10 bits. So now we're getting to the sorts of instructions that we can comfortably fit in 16 bits.

Regarding the stop bits and the two lanes: if you apply this to all instructions without other constraints, you would be providing redundant information, because all you would be doing is specifying dependencies between instructions, and the data dependencies already specify the necessary dependencies. But the time this would really be helpful would be for memory operations, because with memory operations you might want to specify ordering dependencies between operations on different memory locations, and that is not implied by the instructions themselves.

And now let's look at if we only have two bit registers. now we don't need permutation because the 2-bit (2-wire) gate operations contain this, nor do we need three-operand instructions because again the 2-wire gate operations do everything. So in fact all we have to consider is the 8 bits to specify a two-wire gate. Furthermore, to specify any function on these two bits with a one-bit output takes four bits (one bit for each possible input condition). For predication however it really seems like there are some operations which are more useful than others: if bit 0, if bit 1, if bit 0 and bit 1, if bit 0 or bit 1, if true, if not bit 0, if not bit 1, ??. So three-bit predication would be quite useful here, even though it would take four bits to be complete. And of course two-bit predication would be pretty good too: always, if bit 0, if bit 1, if bit 0 and bit 1. Or how about: always, if bit 0, if not bit 0, if bit 0 and bit 1. Yes, if you only have two predication-specification bits per instruction, the last one is more useful, because if you have an 'if' conditional that you are trying to express without branching, being able to do something exactly when bit 0 is 0 and also something else exactly when bit 0 is 1 is useful.
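a tiny C sketch of that last 2-bit predication encoding (always / if bit 0 / if not bit 0 / if bit 0 and bit 1); the encoding values are made up:

    #include <stdbool.h>

    bool pred_taken(unsigned pred2, bool p0, bool p1) {
        switch (pred2 & 3u) {
        case 0:  return true;        /* always */
        case 1:  return p0;          /* if bit 0 */
        case 2:  return !p0;         /* if not bit 0 */
        default: return p0 && p1;    /* if bit 0 and bit 1 */
        }
    }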

Source registers and destination registers with complete gate specification waste bits, because many gates either ignore one or both source operands, or are the same as other gates except that the order of the source operands is switched. It is less redundant to have opcodes and operands, with multiple instruction encodings to support opcodes that don't need both source operands.

Gates make sense, however, if the operands are fixed, like wires.

---

to summarize the most useful ideas from the above:

---

just to review my current ideas for the stack:

---

i think we should reintroduce an LOVM layer. It will be more like WASM in that there are a lot of 'locals', rather than just 32 registers. Then:

---

How does WASM handle limits on the number of function arguments?

There are no limits; the WASM 'call' instruction consumes arguments from the WASM stack, which is afaict unbounded.

---

so what should the encoding format of LOVM be? Completely variable-length bytecode, or 64-bit? Leaning towards 64-bit.

i have 3 ideas for 64-bit:

nice things about 12 operand bits:

nice things about 10 operand bits:

otoh LOVM's goal is to be on the level of WASM/LLVM so maybe it should just be variable-length to do away with the remaining unusual 'limits' on #locals, #fn args, etc.

should LOVM (or its language, Lo) have wren-like efficient classes?

---

ideas about how 8 bits could be used to represent types for polymorphism:

but really we probably can limit polymorphism to having the destination operand type be determined once the two source operand types are determined. So now we can do:

4 bits is enough to encode 16 types which is just enough to fit most languages' unboxed primitive types.

We could also just have one 'type' with 256 choices. Most of the time there needs to be only one type to polymorph on; e.g. ADD uint32 uint32 -> uint32 vs ADD int16 int16 -> int16; e.g. OOP 'receivers'. This lets us leave most of the types undefined as 'custom' in the spec and let the program interpret what they mean.
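a minimal C sketch of the single-type-tag idea (the tag values are made up; in the spec most of the 256 values could stay 'custom'): one 8-bit tag selects the interpretation of both sources and the result:

    #include <stdint.h>

    enum { T_INT16 = 0, T_UINT32 = 1 };   /* invented tag values */

    uint64_t poly_add(uint8_t type, uint64_t a, uint64_t b) {
        switch (type) {
        case T_INT16:  return (uint16_t)((int16_t)a + (int16_t)b);  /* ADD int16 int16 -> int16 */
        case T_UINT32: return (uint32_t)a + (uint32_t)b;            /* ADD uint32 uint32 -> uint32 */
        default:       return 0;   /* 'custom' tags: interpretation left to the program */
        }
    }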

---

in Go, pointers can only be compared even for equality if they are the same type (or the null ptr):

https://go101.org/article/pointer.html

---

it's a good idea to recommend that the sp and fp r.o.c.s be the same size as each other, even though in leaf functions you don't need to use fp at all, because otherwise, if the fp cache were smaller than the sp cache, then in non-leaf functions when you do 'cp fp sp' in the prologue, you'd have to spill the excess items from the sp cache to memory, negating the purpose of the r.o.c.s (to give similar performance to storing args in registers).

---

many of the odd things about Boot/BootX directly follow from the desire to make Boot/BootX programs portable over different target platforms, while still allowing for the possibility of performant compiling implementations without a large runtime/memory footprint: