proj-oot-ootAssemblyNotes29

ok i'm trying to understand which calling convention we should have, and tail calls, etc.

i've read two things:

ok, on the callee cleanup:

perhaps the 'share the same stack frame layout' assumption is at fault. Imagine that C takes more arguments than B. So, eg, to clean up B's stack frame you bump up the stack pointer by 10, but to clean up C's stack frame you bump it up by 20. Now if control is returned directly from C to A, and if A is supposed to do cleanup, all A knows is that it called B, so it'll bump up the stack pointer by 10, which is wrong, because the size-20 stack frame is the one in effect, due to the call to C.

ok so this makes the vararg thing almost, but not quite, make sense, i think. if A decides to call B with a size-30 stack frame this time, and then B calls C in tail position, and we have callee cleanup, then C cleans up its own stack frame and jumps back to A, but no one cleans up B's stack frame, because B was supposed to clean that up and B got skipped.

i think we may need a bigger example to demonstrate: A->B->C->D, all calls in tail position, so we want to return directly to A. B->C is a vararg call. So the two parties in a position to know the size of the vararg stack frame are B and C, but both of them get skipped on the return.
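The A->B->C->D scenario can be sketched as a toy simulation in C (the frame sizes and the single byte-count 'stack pointer' are made up for illustration, not real machine code): under caller cleanup with tail calls, the pops that B and C would have done after "returning" never happen, while callee cleanup (each callee popping its own incoming arguments before the tail jump) stays balanced.

```c
#include <assert.h>

/* Toy model: "sp" is just a count of stack bytes in use. */
static int sp;

static void push_frame(int n) { sp += n; }
static void pop_frame(int n)  { sp -= n; }

/* Caller cleanup + tail calls: the pops that would run "after the call
 * returns" inside B and C never execute, because control goes from D
 * straight back to A.  A only pops what *it* pushed. */
static int caller_cleanup_tail(void) {
    sp = 0;
    push_frame(10);   /* A pushes args for B                             */
    push_frame(20);   /* B pushes (vararg) args for C; B never resumes,
                         so B's pop is skipped                           */
    push_frame(10);   /* C pushes args for D; C's pop is skipped too     */
    /* D returns directly to A */
    pop_frame(10);    /* A pops only the 10 bytes it pushed for B        */
    return sp;        /* leftover bytes = leaked frames                  */
}

/* Callee cleanup: each function pops its own incoming argument area
 * before transferring control onward, so nothing is left for A. */
static int callee_cleanup_tail(void) {
    sp = 0;
    push_frame(10);   /* A pushes args for B                 */
    pop_frame(10);    /* B cleans its own incoming args      */
    push_frame(20);   /* ...then pushes vararg args for C    */
    pop_frame(20);    /* C cleans its incoming args          */
    push_frame(10);   /* ...then pushes args for D           */
    pop_frame(10);    /* D cleans its incoming args          */
    /* D returns directly to A; stack is already balanced */
    return sp;
}
```

The 30 leftover bytes in the caller-cleanup case are exactly the frames pushed by B and C, whose cleanup code never ran.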

this still doesn't quite make sense though, because what is preventing C from cleaning up the vararg stack frame before the call to D?

https://stackoverflow.com/questions/35626835/make-a-variable-argument-function-callee-cleanup maybe sheds some light here. it seems like the convention is to put the return address last (at the top of the stack / bottom of memory), below the arguments. So maybe you don't want to bump up the stack pointer before returning, because then the return address would be in the de-allocated-ish area below the stack pointer. And i guess the compiler can't be bothered to manually move the return address up, overwriting earlier arguments. There is an instruction in x86, 'RET n' with n an immediate, that reads the return address, moves the stack pointer up by n, and then jumps to the return address it read. But if it's a vararg call then n is only known dynamically, not statically, and since it's an immediate, that doesn't work.
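A little model of what 'RET n' does, using a downward-growing stack stored in an array (the addresses and values are made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Toy downward-growing stack: stack[sp] is the current top.  The return
 * address sits below (at a lower index than) the arguments, as in the
 * x86 convention discussed above. */
static uint32_t stack[16];
static int sp;

/* Model of x86 `RET n`: pop the return address, discard n more slots of
 * arguments, then "jump" to the popped address (here: just return it).
 * n is an immediate, so it must be known statically -- which is exactly
 * what a vararg callee lacks. */
static uint32_t ret_n(int n) {
    uint32_t ret_addr = stack[sp++];  /* pop return address       */
    sp += n;                          /* discard the argument area */
    return ret_addr;
}

/* A call site: push three args, then the return address on top. */
static uint32_t call_site_demo(void) {
    sp = 16;               /* empty stack                                */
    stack[--sp] = 111;     /* arg 2                                      */
    stack[--sp] = 222;     /* arg 1                                      */
    stack[--sp] = 333;     /* arg 0                                      */
    stack[--sp] = 0xCAFE;  /* return address, pushed last, at the top    */
    return ret_n(3);       /* callee cleans its 3 args on the way out    */
}
```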

so that last bit about the vararg sounds like it's something specific to limitations in the x86 convention, conveniences in the instruction set, and the way that compilers (and probably branch predictors) insist on using that.

more about vararg stuff [2]

so it sounds like callee cleanup is the way to go.

other notes i found while reading about this:

" The fewer times you modify the value of esp explicitly, the better. There is more going on in these instructions than meets the eye. " -- https://cboard.cprogramming.com/tech-board/117965-callee-cleanup-variable-number-parameters.html -- that's nice, but it sounds fancy -- and if we're doing fancy stuff, i'm betting tail call elimination is more important

strangely https://www.agner.org/optimize/calling_conventions.pdf doesn't have much to say about this.

https://github.com/microsoft/ChakraCore/issues/796 has a good point about another way to do varargs with TCO: with caller cleanup, if A->B->C->D and B->C is vararg and D skips over B right back to A when returning, it's true that A doesn't know how much stack B allocated to hold the varargs, but A can still make things right just by remembering what its stack pointer was before the call (that is, treating SP as caller-save instead of callee-save) -- of course, since it's the stack pointer itself that is being lost, A has to save it in some other callee-save register, rather than on the stack. They don't like doing this because it uses up another callee-save register on each such function call (so one less register, and also all those saves and restores). They seem to think a better alternative would be callee cleanup, but they have a lot of vararg code that would need to be rewritten or something. Sounds like they decided to do nothing in the end (i.e. not support TCO).
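The ChakraCore workaround can be sketched in the same toy style (sp modeled as a plain counter; the 'callee-save register' is just a local variable here): A snapshots SP before the call and restores it wholesale afterwards, so the sizes of the skipped frames never matter.

```c
#include <assert.h>

/* Toy stack-pointer counter, as before. */
static int sp2;

/* A treats SP as caller-save: stash it in a "callee-save register"
 * before the call, restore it after, regardless of who actually
 * returned to A or how much vararg space the middle frames used. */
static int a_with_sp_save(void) {
    sp2 = 0;
    int saved_sp = sp2;       /* the extra callee-save register       */
    /* call: A->B->C->D with tail calls; frames of sizes A cannot
     * know pile up and their owners never clean them */
    sp2 += 10;                /* args for B                           */
    sp2 += 20;                /* B's vararg args for C (size unknown
                                 to A)                                */
    sp2 += 10;                /* args for D                           */
    /* D returns directly to A; A restores SP wholesale */
    sp2 = saved_sp;
    return sp2;               /* balanced again                       */
}
```

The cost the ChakraCore thread complains about is visible here: `saved_sp` occupies a register across the whole call, plus a save and a restore per call site.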

Q: "Everywhere it says that the called function doesn't know how many parameters have been passed, but in that case why isn't it possible to simply put that data into a register or push it onto the top of the stack for the called function to use?"

A: " There is no fundamental reason why the callee couldn't clean up space for variables. In most architectures, the standard calling convention doesn't handle variables this way, but that doesn't mean it's not possible to do so. For variable-length argument lists, you could either pass data about the number of arguments as a hidden parameter (like how this is handled in many object-oriented languages), or put a pointer on the stack that shows where the arguments end, etc. " -- https://stackoverflow.com/questions/19822045/what-prevents-the-callee-from-cleaning-up-the-stack
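A minimal sketch of the 'hidden parameter' idea in C: the count travels as an explicit first argument, so a callee-cleanup convention would know exactly how much argument space to pop (here we just use it to walk the va_list).

```c
#include <assert.h>
#include <stdarg.h>

/* Variadic sum where the argument count is passed explicitly -- the
 * "hidden parameter" approach from the quote above.  With the count
 * in hand, a callee could compute its own cleanup size at runtime. */
static int sum_n(int count, ...) {
    va_list ap;
    int total = 0;
    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);  /* consume exactly `count` ints */
    va_end(ap);
    return total;
}
```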

"The callee can clean variable arguments from the stack. I made it once, actually. But the code is pretty big.

Anyway, the main reason why in cdecl convention, the caller cleans the stack is other ...On some architectures (usually very small, as the old 8080 or 6800), there is no ret n instruction that to automate the stack cleaning and as a rule they can't make arithmetics with the stack pointer as well.

So, the callee have to first pop the return address from the stack in order to reach the arguments, then to pop all arguments and then push back the return address....When cdecl convention is used 2 instructions and one register use are spared:...And because for single language it is better to use single calling convention on all platforms, the CCALL as more simple and universal looks better. (C language is created in times when 6800 was a high tech).

But notice, that on these platforms, the assembly programs and the native languages (for example different kind of BASICs) usually use register argument passing which is of course much more faster on such small systems. " -- https://stackoverflow.com/questions/19822045/what-prevents-the-callee-from-cleaning-up-the-stack

some languages put (RAIIish?) destructors in their (caller?) cleanup routines, preventing TCO:

" Many Perl idioms rely on strict stack discipline. For example, implementing cleanup code as the destructor of an object held in a lexical variable in a particular stack frame. The result is that implicit tail-call optimisation would break all sorts of things. Hence tail-call semantics are only safe to invoke *explicitly*, which is what Sub::Call::Tail provides. " [7]

---

it sounds to me like callee cleanup is the way to go. TCO may be important, varargs probably not (can always store those in memory).

---

next question: store the first few regs in registers in Boot calling convention?

pro: faster, if regs are really regs

pro: this allows Boot syscalls with few arguments to avoid hitting 'main memory' (until we're in the actual implementation of the call)

con: in LOVM, seems cleaner to put args in smallstacks. That way they can be shifted similarly to 'input argument' and 'output argument' register window sections. if we have the first few args in registers and the rest in smallstacks, that's confusing

my leaning: at first i was leaning towards not using regs at all (just the stack) for a cleaner transition to LOVM, but it galls me to have to hit memory just to do a syscall. So now i'm leaning towards using regs. I kinda think there might be a way to make the register window thing work cleanly with the regs anyway; after all, if you need to preserve the input args while you are writing the output args, you'd just move them onto the stack, right? So mb it's not that much worse. So i kinda think the thing to do is try using regs and reconsider if necessary after implementing the x86_64 assembly version of Boot and LOVM (in the far far distant future).

decision: for now, use regs

---

next question: how many regs?

so, should we pass 2 or 3 args in registers?

if 3, then there's no space for a temp, and maybe we should add instructions: push-int-to-SP-and-decrement-SP, push-ptr-to-SP-and-decrement-SP

or at least add back in:

because currently, without a free temp register, in Boot you can't even push to the stack (you can't predecrement the SP without first loading an immediate into a register)

otoh mb most of the syscalls we want to make can use only 2 (remember we have 2 banks, so 4 argument spots in total; as long as both integers and ptrs are being passed, you get more than 2 arguments), so maybe it's okay to just pass 2 args in registers and leave a scratch register.

also, it sucks to add appm and ap32m in Boot, b/c (a) you may want to add two versions of them b/c we don't have signed immediates in oot, and (b) in LOVM they are a duplication because we can have a general instruction like this with both an immediate and register addr mode for op2.

regarding (a), well, i guess an upward-growing stack is POSTincrement, which isn't a problem (copy the argument to the stack, then overwrite that register with the immediate, then add that immediate to the stack pointer (postincrement); now you have one free register); it's the predecrement convention in the downward-growing stack that's a problem. So you don't really need to add one version for each direction, even if the immediate imm7 is unsigned. So it's only 2 more instructions, and hey, maybe they will be useful anyway b/c they will have a 7-bit unsigned immediate range rather than a 6-bit signed immediate range (so 5-bit unsigned).
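A toy model of that asymmetry (hypothetical Boot-like constraints: no add-immediate instruction, so any immediate must first be loaded into a register; `mem`, `sp`, and `r0` are made-up names): with a postincrement push you store first, which frees the value register to hold the immediate, while a predecrement push needs the immediate before the store, so it needs a third register.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t mem[16];   /* toy data memory */

/* Upward-growing stack, POSTincrement push.  Two registers suffice:
 * store the value first, then r0 is free to hold the immediate. */
static uint32_t push_up(uint32_t sp, uint32_t r0) {
    mem[sp] = r0;    /* copy the argument to the stack          */
    r0 = 1;          /* r0 now free: load the immediate into it */
    sp = sp + r0;    /* postincrement the stack pointer         */
    return sp;
}

/* Downward-growing stack, PREdecrement push.  The decrement must
 * happen before the store, while r0 still holds the value -- so a
 * third, temp register is unavoidable. */
static uint32_t push_down(uint32_t sp, uint32_t r0, uint32_t *tmp) {
    *tmp = 1;        /* temp register holds the immediate */
    sp = sp - *tmp;  /* predecrement                      */
    mem[sp] = r0;    /* then store                        */
    return sp;
}
```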

so far i've looked at a few syscalls tho, and none really need more than 2 args of one type for the most basic functionality:

open, read, malloc

i'm leaning towards just using 2 regs for arguments:

decision: 2 regs for now

---

next question (which arose while researching the previous one):

what to do when, depending on platform, APIs might return an integer or a pointer?

For example, on Linux, the open() syscall returns an integer file descriptor. On Windows, the CreateFileA call returns a ('file'?) handle, and the (older?) winbase.h OpenFile also returns a file handle.

https://stackoverflow.com/questions/902967/what-is-a-windows-handle

https://www.codeproject.com/Articles/1044/A-Handy-Guide-To-Handling-Handles

what does the windows API mean by 'handle'? I like Dan Moulding's stackoverflow answer:

" A HANDLE is a context-specific unique identifier. By context-specific, I mean that a handle obtained from one context cannot necessarily be used in any other arbitrary context that also works on HANDLEs.

For example, GetModuleHandle returns a unique identifier to a currently loaded module. The returned handle can be used in other functions that accept module handles. It cannot be given to functions that require other types of handles. For example, you couldn't give a handle returned from GetModuleHandle to HeapDestroy and expect it to do something sensible.

The HANDLE itself is just an integral type. Usually, but not necessarily, it is a pointer to some underlying type or memory location. For example, the HANDLE returned by GetModuleHandle is actually a pointer to the base virtual memory address of the module. But there is no rule stating that handles must be pointers. A handle could also just be a simple integer (which could possibly be used by some Win32 API as an index into an array). "
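The 'handle as index into a private table' flavor can be sketched like this (all names hypothetical): the caller only ever sees a small integer, never a dereferencable pointer, and validity can be checked before use.

```c
#include <assert.h>

#define MAX_FILES 8

/* Private table the caller never touches directly; entries stand in
 * for whatever resource a handle identifies. */
static const char *file_table[MAX_FILES];

typedef int HANDLE_;   /* opaque to the caller: just an integral type */

/* Allocate the first free slot and hand back its index as the handle,
 * or -1 on exhaustion. */
static HANDLE_ open_thing(const char *name) {
    for (int h = 0; h < MAX_FILES; h++) {
        if (file_table[h] == 0) {
            file_table[h] = name;
            return h;
        }
    }
    return -1;
}

/* Only this context knows how to interpret the handle; bogus handles
 * are caught by a range/occupancy check. */
static const char *lookup_thing(HANDLE_ h) {
    if (h < 0 || h >= MAX_FILES)
        return 0;
    return file_table[h];
}
```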

This resurfaces the motivation for the ANY-type registers that i had in there awhile ago but removed. Here we have something that:

interestingly, we already have something like this: codeptrs. If you think about it, in an interpreter a codeptr could be represented as an integer index into a list of instructions, rather than as an actual pointer into memory (with 'instructions' conceptually residing in a special 'code memory' which is not part of or accessible from data memory).

so should we (re)create a third register bank for these guys (what i used to call ANY type)? it might be sort of nice for the implementation to distinguish between these guys and dereferencable pointers: for garbage collection purposes. These guys could be 'boxed' with additional bits that say whether they are really integers or really pointers.
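One common way to 'box' such ANY values is low-bit tagging, sketched here (this assumes real pointers are at least 2-byte aligned, so bit 0 is free to mean 'really an integer'; a GC could then skip tagged integers when scanning):

```c
#include <assert.h>
#include <stdint.h>

/* An ANY value: either a small integer or an aligned pointer, with
 * bit 0 as the tag.  Pointers keep bit 0 == 0 (alignment guarantees
 * it); integers are shifted left and tagged with 1. */
typedef uintptr_t any_t;

static any_t box_int(uintptr_t i)    { return (i << 1) | 1; }
static any_t box_ptr(void *p)        { return (uintptr_t)p; }

static int is_int(any_t v)           { return (int)(v & 1); }
static uintptr_t unbox_int(any_t v)  { return v >> 1; }
static void *unbox_ptr(any_t v)      { return (void *)v; }
```

The price is one bit of integer range; the benefit is that the GC (and the implementation generally) can tell dereferencable pointers from integer-like handles at runtime.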

Disadvantages:

Advantages:

out of these, i think the most important considerations are:

so the proposal is:

i'm leaning towards this.

---

we may want to introduce alignment restrictions to prevent the problem noted above about another thread seeing an invalid pointer in the middle of a write? otoh, if each thread has its own GC, then the GC and the implementation can share a mutex for when the implementation main thread is writing to memory. also, if ptrs are opaque, how would you align? mb we would need an instruction for that

done

the stack also has alignment restrictions on some targets (eg 16-byte alignment on x86-64, at least on x64 windows)

i think that's an ABI thing though, so i guess we can ignore it in Boot (where we have Boot's own convention, rather than trying to interop with the underlying platform convention). And in LOVM we can abstract it.
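For reference, the usual way to round a downward-growing stack pointer to such a boundary (16-byte on the x64 ABIs) is a mask; with opaque pointers, this is the kind of thing a dedicated align instruction would do internally:

```c
#include <assert.h>
#include <stdint.h>

/* Round sp down to the given power-of-two boundary, as a downward-
 * growing stack requires before a call on 16-byte-aligned ABIs. */
static uintptr_t align_down(uintptr_t sp, uintptr_t align) {
    return sp & ~(align - 1);   /* align must be a power of two */
}
```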

---

what do garbage collectors do about pointers that may be invalid (e.g. a C pointer that points past the end of an array)?

https://www.hboehm.info/gc/04tutorial.pdf says

" Pointer validity check

elsewhere in those slides it's clear that memory allocation is done thru the GC, so the GC is aware of where the valid memory is

"Unlike [BoehmChase92] we start with the assumption that the garbage collector recognizes all pointers to the interior of an object, not just to the first byte of the object. Recent experience suggests that this is the right framework, particularly for typical C++ implementations which implicitly generate pointers to the interior of an object. The techniques of [Boehm93] can greatly reduce the danger of space leakage that we previously associated with this approach. This new assumption greatly simplifies our task. Unfortunately, it also usually invalidates assumption (A) of [BoehmChase92], so our correctness argument has to be different

...

We assume that the garbage collector recognizes any address corresponding to some place inside a heap allocated object as a valid pointer. ([Boehm95] satisfies these assumptions in its default configuration"

[8] "Simple Garbage-Collector-Safety", Hans-J. Boehm

so it looks like it actually checks potential ptrs to see if they are in valid memory before it derefs them. oh man what a pain (but oh man so amazing)
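A sketch of that validity check: since all allocation goes through the GC, it can keep a table of block ranges and treat any address inside a block (including interior pointers) as valid. (Whether the one-past-the-end address counts varies; this sketch rejects it. The addresses below are made up for illustration.)

```c
#include <assert.h>
#include <stdint.h>

/* The GC's record of allocated heap blocks: [start, end) ranges. */
typedef struct { uintptr_t start, end; } block_t;

static block_t heap_blocks[4];
static int n_blocks;

/* Conservative validity check: return the index of the block that
 * addr points into (interior pointers included), or -1 if addr is
 * not a pointer into any allocated block -- in which case the GC
 * must not dereference it. */
static int find_block(uintptr_t addr) {
    for (int i = 0; i < n_blocks; i++)
        if (addr >= heap_blocks[i].start && addr < heap_blocks[i].end)
            return i;
    return -1;
}
```

A real collector replaces the linear scan with a lookup keyed on the address's page or segment, but the logic is the same.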

---

"Our Magpie tool converts C source programs to cooperate with precise GC ... Some constraints on the C program depend on the specific GC implementation that is used with Magpie-converted code. The GC may support interior pointers, i.e., pointers into the middle of a GC-allocated object, including to the end of the object. In practice, we find that interior pointers are common in real C programs, while programs specifically adjusted to work with precise GC (such as the PLT Scheme runtime system) can avoid them to help improve GC performance. Note that a pointer just past the end of an array is legal according to the C standard, and Magpie considers such pointers to be valid interior pointers " [9]

---

also, mb i should allow <= comparison on pointer values. I didn't use to allow it because it may not make sense. But there are times (e.g. sort a list of pointers so that you can do binary search to determine the presence of a pointer in that list) when you really want to compare pointers, and you don't care if the comparison has any other meaning aside from being a total ordering.
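A sketch of the sort-then-binary-search use case in C: casting to uintptr_t gives a total order that means nothing beyond consistency, which is all a membership test needs.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator ordering pointers by their uintptr_t value.
 * The order is arbitrary but total and stable across comparisons. */
static int cmp_ptr(const void *a, const void *b) {
    uintptr_t x = (uintptr_t)*(void *const *)a;
    uintptr_t y = (uintptr_t)*(void *const *)b;
    return (x > y) - (x < y);
}

/* Binary search for p in a sorted array of n pointers. */
static int ptr_in_set(void **set, int n, void *p) {
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        uintptr_t m = (uintptr_t)set[mid], q = (uintptr_t)p;
        if (m == q) return 1;
        if (m < q) lo = mid + 1;
        else       hi = mid - 1;
    }
    return 0;
}
```

(In strict standard C, ordered comparison of unrelated raw pointers is undefined, which is why the cast to uintptr_t is the idiomatic escape hatch; an opaque-pointer VM would face the same choice of exposing only such an order-without-meaning.)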

what do other languages with opaque pointers/no pointer arithmetic do about pointer comparison? e.g. what does Golang do?

golang allows equality comparison only, but not greater-than, less-than: https://stackoverflow.com/questions/45818964/how-can-i-compare-pointers-in-go

---

later i decided to remove register/smallstack banking again -- see discussion in [[ootBootNotes1]]