ok i'm trying to understand which calling convention we should have, and tail calls, etc.
i've read two things:
ok, on the callee cleanup:
perhaps the 'share the same stack frame layout' thing is at fault. Imagine that C takes more arguments than B. So eg to clean up B's stack frame, maybe you have to bump up the stack pointer by 10, but to clean up C's stack frame, you have to bump it up by 20. Now if control is returned directly from C to A, and if A is supposed to do cleanup, all A knows is that it called B, so it'll bump up the stack pointer by 10, which is wrong, because the size-20 stack frame is in effect due to the call to C.
ok so this makes the vararg thing almost but not quite make sense, i think. if A decides to call B with a size 30 stack this time, and then B calls C, and we have callee cleanup, then C cleans up its own stack and jumps back to A, but no one cleans up B's stack, because B was supposed to clean that up but B got skipped.
i think we may need a bigger example to demonstrate. A->B->C->D. All in tail position, so we want to return directly to A. B->C is a vararg call. So the two ppl in a position to know about the size of the vararg stack frame are B and C, but both of them got skipped on the return.
this still doesn't quite make sense though, because what is preventing C from cleaning up the vararg stack frame before the call to D?
https://stackoverflow.com/questions/35626835/make-a-variable-argument-function-callee-cleanup maybe sheds some light here. it seems like the convention is to put the return address last (at the top of the stack / bottom of memory), below the arguments. So maybe you don't want to bump up the stack pointer before returning, because then the return address is in the de-allocated-ish area below the stack pointer. And i guess the compiler can't be bothered to just manually move the return address up, overwriting earlier arguments. There is apparently an instruction in x86, 'RET #x', with #x an immediate, that reads the return address, moves the stack pointer up by x, and then jumps to the return address which was read. But if it's a vararg call then x is only known dynamically, not statically, and it's an immediate, so it doesn't work.
so that last bit about the vararg sounds like it's something specific to limitations in the x86 convention, conveniences in the instruction set, and the way that compilers (and probably branch predictors) insist on using that.
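to make that concrete, here's a rough C rendering (purely illustrative, not code from any real ABI; the stack is modeled in 32-bit slots rather than bytes) of what 'RET imm' does, and why the cleanup amount has to be static:

    #include <stdint.h>

    /* Illustrative only: what x86 'ret imm16' does, spelled out, with the stack
       modeled as an array of 32-bit slots that grows downward. The return
       address sits at the top of the stack, just below the caller-pushed
       arguments. */
    typedef struct {
        uint32_t *sp;   /* points at the top of the stack (the return address) */
    } Cpu;

    static uint32_t ret_imm(Cpu *cpu, unsigned imm_slots) {
        uint32_t ret_addr = *cpu->sp;   /* read the return address */
        cpu->sp += 1 + imm_slots;       /* pop it, plus 'imm_slots' worth of args --
                                           a static immediate, which is exactly what
                                           a vararg callee doesn't have */
        return ret_addr;                /* control then transfers to ret_addr */
    }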
more about vararg stuff [2]
so it sounds like callee cleanup is the way to go.
other notes i found while reading about this:
" The fewer times you modify the value of esp explicitly, the better. There is more going on in these instructions than meets the eye. " -- https://cboard.cprogramming.com/tech-board/117965-callee-cleanup-variable-number-parameters.html - that's nice but it sounds fancy -- and if we're doing fancy stuff i'm betting tail call elimination is more important
strangely https://www.agner.org/optimize/calling_conventions.pdf doesn't have much to say about this.
https://github.com/microsoft/ChakraCore/issues/796 has a good point about another way to do varargs with TCO: with caller-cleanup, if A->B->C->D and B->C is vararg and D skips over B right back to A when returning, it's true that A doesn't know how much stack B allocated to hold the varargs, but A can still make things right just by remembering what its stack pointer was before the call (that is, treating SP as caller-save instead of callee-save) -- of course since it's the stack pointer itself that is being lost, it has to save it in some other callee-save register, rather than on the stack. They don't like doing this because it uses up another callee-save register on each such function call (so one less register, and also all those saves and restores). They seem to think a better alternative would be callee-save, but they have a lot of vararg code that would need to be rewritten or something. Sounds like they decided to do nothing in the end (not support TCO).
Q: "Everywhere it says that the called function doesn't know how many parameters have been passed, but in that case why isn't it possible to simply put that data into a register or push it onto the top of the stack for the called function to use?"
A: " There is no fundamental reason why the callee couldn't clean up space for variables. In most architectures, the standard calling convention doesn't handle variables this way, but that doesn't mean it's not possible to do so. For variable-length argument lists, you could either pass data about the number of arguments as a hidden parameter (like how this is handled in many object-oriented languages), or put a pointer on the stack that shows where the arguments end, etc. " -- https://stackoverflow.com/questions/19822045/what-prevents-the-callee-from-cleaning-up-the-stack
"The callee can clean variable arguments from the stack. I made it once, actually. But the code is pretty big.
Anyway, the main reason why in cdecl convention, the caller cleans the stack is other ...On some architectures (usually very small, as the old 8080 or 6800), there is no ret n instruction that to automate the stack cleaning and as a rule they can't make arithmetics with the stack pointer as well.
So, the callee have to first pop the return address from the stack in order to reach the arguments, then to pop all arguments and then push back the return address....When cdecl convention is used 2 instructions and one register use are spared:...And because for single language it is better to use single calling convention on all platforms, the CCALL as more simple and universal looks better. (C language is created in times when 6800 was a high tech).
But notice, that on these platforms, the assembly programs and the native languages (for example different kind of BASICs) usually use register argument passing which is of course much more faster on such small systems. " -- https://stackoverflow.com/questions/19822045/what-prevents-the-callee-from-cleaning-up-the-stack
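as a source-level illustration of the 'hidden parameter' idea from the first answer above (the function name and explicit count are made up for the example), this is roughly what it looks like when the count is passed explicitly -- which is the same information a callee-cleanup convention would need in order to pop a variable-sized argument area:

    #include <stdarg.h>
    #include <stdio.h>

    /* Sketch of the 'pass the count as a hidden parameter' idea: here the count
       is simply an explicit first argument. A callee-cleanup convention could
       use the same information to know how much argument space to pop on return. */
    static int sum_n(int count, ...) {
        va_list ap;
        va_start(ap, count);
        int total = 0;
        for (int i = 0; i < count; i++)
            total += va_arg(ap, int);   /* the callee walks exactly 'count' arguments */
        va_end(ap);
        return total;
    }

    int main(void) {
        printf("%d\n", sum_n(3, 10, 20, 30));   /* prints 60 */
        return 0;
    }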
some languages put (RAIIish?) destructors in their (caller?) cleanup routines, preventing TCO:
" Many Perl idioms rely on strict stack discipline. For example, implementing cleanup code as the destructor of an object held in a lexical variable in a particular stack frame. The result is that implicit tail-call optimisation would break all sorts of things. Hence tail-call semantics are only safe to invoke *explicitly*, which is what Sub::Call::Tail provides. " [7]
---
it sounds to me like callee cleanup is the way to go. TCO may be important, varargs probably not (can always store those in memory).
---
next question: store the first few regs in registers in Boot calling convention?
pro: faster, if regs are really regs
pro: this allows Boot syscalls with few arguments to avoid hitting 'main memory' (until we're in the actual implementation of the call)
con: in LOVM, seems cleaner to put args in smallstacks. That way they can be shifted similarly to 'input argument' and 'output argument' register window sections. if we have the first few args in registers and the rest in smallstacks, that's confusing
my leaning: at first i was leaning to not use regs at all (just stack) for a cleaner transition to LOVM, but it galls me to have to hit memory just to do a syscall. So now i'm leaning towards using regs. I kinda think there might be a way to make the register window thing work cleanly anyway with the regs... after all if you need to preserve the input args while you are writing the output args you'd just move them onto the stack right? So mb it's not that much worse. So i kinda think the thing to do is try using regs and reconsider if necessary after implementing the x86_64 assembly version of Boot and LOVM (in the far far distant future).
decision: for now, use regs
---
next question: how many regs?
so, should we pass 2 or 3 args in registers?
if 3, then no space for a temp, and maybe we should add instructions: push-int-to-SP-and-decrement-SP and push-ptr-to-SP-and-decrement-SP
or at least add back in:
because currently, without a free temp register, in Boot, you can't even push to stack (you can't predecrement the SP without loading an immediate into a register)
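for concreteness, here's a minimal sketch of one possible reading of the proposed push-int-to-SP-and-decrement-SP, written as an interpreter handler (all names invented, not Boot spec; it assumes SP points at the next free slot of a downward-growing stack):

    #include <stdint.h>

    /* Hypothetical interpreter state; field names are invented for illustration. */
    typedef struct {
        int32_t  intreg[4];   /* integer register bank */
        int32_t *sp;          /* stack pointer; grows downward, points at the next free slot */
    } BootState;

    /* Sketch of 'push-int-to-SP-and-decrement-SP': store the source register at
       the slot SP points to, then move SP down to the next free slot, all in one
       instruction -- so Boot code never needs a scratch register just to adjust SP. */
    static void op_push_int(BootState *s, int src) {
        *s->sp = s->intreg[src];
        s->sp -= 1;
    }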
otoh mb most of the syscalls we want to make can use only 2 (remember we have 2 banks, so 4 spots in total, so as long as both integers and ptrs are being passed, you get more than 2 arguments), so maybe it's okay to just pass 2 args in registers and leave a scratch register.
also, it sucks to add appm and ap32m in Boot, b/c (a) you may want to add two versions of them b/c we don't have signed immediates in oot, and (b) in LOVM they are a duplication because we can have a general instruction like this with both an immediate and register addr mode for op2.
regarding (a), well i guess an upward-growing stack is POSTincrement, which isn't a problem (copy the argument to the stack, then overwrite that register with the immediate, then add that immediate to the stack pointer (postincrement); now you have one free register); it's the predecrement convention in the downward-growing stack that's a problem. So you don't really need to add one version for each direction, even if the immediate imm7 is unsigned. So it's only 2 more instructions, and hey maybe they will be useful anyways b/c they will have a 7-bit unsigned immediate range rather than a 6-bit signed immediate range (so 5-bit unsigned).
so far i've looked at a few syscalls tho and none really need more than 2 args of one type for the most basic functionality:
open, read, malloc
i'm leaning towards just using 2 regs for arguments:
decision: 2 regs for now
---
next question (which arose while researching the previous one):
what to do when, depending on platform, APIs might return an integer or a pointer?
For example, on Linux, the open() syscall returns an integer file descriptor. On Windows, the CreateFileA API call returns a ('file'?) handle, and the (older?) winbase.h OpenFile also returns a file handle.
https://stackoverflow.com/questions/902967/what-is-a-windows-handle
https://www.codeproject.com/Articles/1044/A-Handy-Guide-To-Handling-Handles
what does the Windows API mean by 'handle'? I like Dan Moulding's stackoverflow answer:
" A HANDLE is a context-specific unique identifier. By context-specific, I mean that a handle obtained from one context cannot necessarily be used in any other aribtrary context that also works on HANDLEs.
For example, GetModuleHandle? returns a unique identifier to a currently loaded module. The returned handle can be used in other functions that accept module handles. It cannot be given to functions that require other types of handles. For example, you couldn't give a handle returned from GetModuleHandle? to HeapDestroy? and expect it to do something sensible.
The HANDLE itself is just an integral type. Usually, but not necessarily, it is a pointer to some underlying type or memory location. For example, the HANDLE returned by GetModuleHandle? is actually a pointer to the base virtual memory address of the module. But there is no rule stating that handles must be pointers. A handle could also just be a simple integer (which could possibly be used by some Win32 API as an index into an array). "
This resurfaces the motivation for the ANY type registers that i had in there awhile ago but removed. Here we have something that:
interestingly, we already have something like this: codeptrs. If you think about it, in an interpreter a codeptr could be represented as an integer index into a list of instructions, rather than as an actual pointer into memory (with 'instructions' conceptually residing in a special 'code memory' which is not part of or accessible from data memory).
so should we (re)create a third register bank for these guys (what i used to call ANY type)? it might be sort of nice for the implementation to distinguish between these guys and dereferencable pointers: for garbage collection purposes. These guys could be 'boxed' with additional bits that say whether they are really integers or really pointers.
Disadvantages:
Advantages:
out of these, i think the most important considerations are:
so the proposal is:
i'm leaning towards this.
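for illustration, one possible encoding of the 'boxed with additional bits' idea above (names invented; it assumes real pointers are at least 2-byte aligned, so the low bit is free to use as a tag):

    #include <stdint.h>
    #include <stdbool.h>

    /* One possible encoding for a value in a hypothetical 'ANY' bank: the low
       bit says whether the payload is a real, dereferenceable pointer (which
       the GC should trace) or an opaque integer such as a file descriptor or a
       Windows-style HANDLE (which it should ignore). Assumes pointers are at
       least 2-byte aligned. */
    typedef uintptr_t AnyVal;

    static AnyVal any_from_int(uintptr_t i)  { return i << 1; }              /* tag bit = 0 */
    static AnyVal any_from_ptr(void *p)      { return (uintptr_t)p | 1u; }   /* tag bit = 1 */

    static bool      any_is_ptr(AnyVal v)    { return (v & 1u) != 0; }
    static void     *any_as_ptr(AnyVal v)    { return (void *)(v & ~(uintptr_t)1); }
    static uintptr_t any_as_int(AnyVal v)    { return v >> 1; }

a GC scanning this bank would then trace only the values whose tag bit says 'pointer'.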
---
we may want to introduce alignment restrictions to prevent the problem noted above about another thread seeing an invalid pointer in the middle of a write? otoh if each thread has its own GC then the GC and the implementation can have a mutex for when the implementation main thread is writing to memory. if ptrs are opaque, how to align? mb would need an instruction for that
done
the stack also has alignment restrictions on some targets (eg 16-byte alignment on x86-64, at least on x64 Windows)
i think that's an ABI thing though, so i guess we can ignore it in Boot (where we have Boot's own convention, rather than trying to interop with the underlying platform convention). And in LOVM we can abstract it.
---
what do garbage collectors do about pointers that may be invalid (e.g. a C pointer that points past the end of an array)?
https://www.hboehm.info/gc/04tutorial.pdf says
" Pointer validity check
elsewhere in those slides it's clear that memory allocation is done thru the GC, so the GC is aware of where the valid memory is
"Unlike [BoehmChase?92] we start with the assumptionthat the garbage collector recognizes all pointers to theinterior of an object, not just to the first byte of the object.Recent experience suggests that this is the right framework,particularly for typical C++ implementations whichimplicitly generate pointers to the interior of an object. Thetechniques of [Boehm93] can greatly reduce the danger ofspace leakage that we previously associated with thisapproach. This new assumption greatly simplifies our task.Unfortunately, it also usually invalidates assumption (A) of[BoehmChase?92], so our correctness argument has to bedifferent
...
We assume that the garbage collector recognizes anyaddress corresponding to some place inside a heap allocatedobject as a valid pointer. ([Boehm95] satisfies theseassumptions in its default configuration"
[8] "Simple Garbage-Collector-SafetyHans?-J. Boehm"
so it looks like it actually checks potential ptrs to see if they are in valid memory before it derefs them. oh man what a pain (but oh man so amazing)
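roughly, the check looks something like this (an illustrative sketch, not Boehm's actual code; a real collector uses block headers or a bitmap rather than a linear scan):

    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Before treating a word found in the roots or in an object as a pointer,
       a conservative collector checks whether it falls inside memory that the
       allocator actually handed out. */
    typedef struct {
        uintptr_t start;   /* first address of an allocated heap block */
        uintptr_t end;     /* one past its last address */
    } HeapBlock;

    static bool looks_like_heap_pointer(uintptr_t candidate,
                                        const HeapBlock *blocks, size_t nblocks) {
        for (size_t i = 0; i < nblocks; i++) {
            /* interior pointers count: any address inside the block is accepted */
            if (candidate >= blocks[i].start && candidate < blocks[i].end)
                return true;
        }
        return false;
    }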
---
"OurMagpietool? converts C source programs to cooperatewith precise GC ... Some constraints on the C program depend on the specific GCimplementation that is used with Magpie-converted code. The GCmay supportinterior pointers, i.e., pointers into the middle of a GC-allocated object, including to the end of the object. In practice, wefind that interior pointers are common in real C programs, whileprograms specifically adjusted to work with precise GC (such asthe PLT Scheme runtime system) can avoid them to help improveGC performance. Note that a pointer just past the end of an arrayis legal according to the C standard, and Magpie considers suchpointers to be valid interior pointers " [9]
---
also, mb i should allow <= comparison on pointer values. I didn't use to allow it because it may not make sense. But there are times (e.g. sort a list of pointers so that you can do binary search to determine the presence of a pointer in that list) when you really want to compare pointers, and you don't care if the comparison has any other meaning aside from being a total ordering.
what do other languages with opaque pointers/no pointer arithmetic do about pointer comparison? e.g. what does Golang do?
golang allows equality comparison only, but not greater-than, less-than: https://stackoverflow.com/questions/45818964/how-can-i-compare-pointers-in-go
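for illustration, the sort-and-binary-search use case above only needs *some* total order on pointers, not a meaningful one; in C terms it's just (names are only for the example):

    #include <stdint.h>
    #include <stdlib.h>

    /* We don't care what the order of pointers *means*, only that it is a total
       order, so a list of pointers can be sorted and then binary-searched for
       membership. Comparing the integer representations gives exactly that. */
    static int cmp_ptr(const void *a, const void *b) {
        uintptr_t pa = (uintptr_t)*(void *const *)a;
        uintptr_t pb = (uintptr_t)*(void *const *)b;
        return (pa > pb) - (pa < pb);
    }

    /* usage:
         qsort(ptrs, n, sizeof *ptrs, cmp_ptr);
         void *key = p;
         int present = bsearch(&key, ptrs, n, sizeof *ptrs, cmp_ptr) != NULL;
    */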
---
later i decided to remove register/smallstack banking again -- see discussion in [[ootBootNotes1]].
---
so what about unified register files with floating point as ordinary gprs?
sounds contentious but seems like today's best practice is to bank them, eg separate floating point
also most mechanical sympathy on popular archs is to bank fp
so bank the fp regs, but not the smallstacks (b/c fps on the stack are in the same linear memory substacks as everything else). btw https://stackoverflow.com/a/52175510/171761 suggests having more caller-save registers and fewer callee-save regs in the fp bank, unlike his suggestion for the integer bank
--- old
maybe we can get away with having the primary stack be above the stack pointer and the output stack be below the stack pointer, and pushing to the output stack all at once in a calling sequence. Alternately, since the primary stack and output stack capacities are fixed, if you save those capacities in a register then you can offset the beginning of the output stack from the frame pointer.
yeah I think we need a register to hold the frame allocation information in. At least this is cheaper than fixed-size register windows where you always have to push a certain window size onto the stack. Again, consider making those allocation quantities powers of two -- then maybe we really can fit in all the variables we may need. I think we do need an extra register for holding the primary stack position. Maybe reserve a register for the PC. I'm still thinking if there's a way for the Boot calling convention to put arguments in registers. I guess one way is just to use the out stack addr mode as registers! That would imply that the recipient, not the caller, is the one who shifts out to local. Also maybe out should be registers, not a stack, even in LOVM. Consider however that a nice thing about Forth was that function calling was cheap (eg no callee-saving etc to be done).
Maybe you can simplify all of this by having one stack, and specifying how many output args there are when you call the next function.
mb don't even let the program read the stack pointer when the stack is not flushed; that way we don't have to have an extra register for the tip of the cached stack.
The thing is, if it's one stack, then that stack is interrupted by the return address and the frame pointer and any other callee saves, local variables, or other junk we have in the stack frame. So that makes it inefficient to translate a stack position to an address. If it's two stacks, then it's inefficient to push more to the argument input stack, because directly after that is the stuff in our stack frame. This suggests that the argument input stack should be mapped to registers, and the only stack should be the out stack.
So callee cleanup means that the input arguments are cleaned up too, right? So are our output arguments passed where the input arguments were on the stack?
I think what you're looking for is that out could really be registers. We could rely on the callee to flush out to the stack if needed.
But in that case, how do we use the stack pointer in boot?
I guess maybe if the program never touches the stack pointer then this stuff is abstract
---
i can't think of a good way for the LOVM calling convention to match up with a Boot all-register calling convention without actually moving the smallstack addr modes from LOVM to Boot. In which case we could do this:
caching non-guarantees:
some benefits to this:
done
---
so is there a way to do the previous without actually importing stack cache addr mode semantics into Boot?
The issue is probably that we want to allow the implementation to choose to store the smallstacks in memory, or to store them in regs (or in a memory cache). But if they are in memory then that's visible to Boot. But they may or may not be in memory. This indeterminacy is visible and so must be mentioned in the spec.
Can we just tell Boot to put things on the stack in memory? If we want interoperability (programs mixing Boot and LOVM functions), that's hard for the implementation to detect and map into smallstacks. Imagine if a Boot function called a LOVM leaf function. The LOVM leaf function would not feel a need to flush anything (and we don't want to make it), because it wants to just use the output stack as its arguments, since it doesn't need fresh output stack for its own outgoing arguments. It can just not adjust the frame pointer, and offset its own locals off the stack pointer. But the Boot program put the input arguments onto the stack in memory.
Is there any point keeping LOVM separate from BootX if we are putting all these complicated semantics into Boot?
There may be another way. It may be hard to detect a write to the stack pointer in general, but if we have a PUSH instruction, or even better a PUSHMEMSTACK instruction, and a POPMEMSTACK instruction, that is easier to detect. The implementation could intercept that instruction and have it write to the output smallstack cache instead. We could have flushStackToMemory and flushMemoryToStacks instructions and specify that until they are called, as above, there are caching non-guarantees. The benefit of these gymnastics is that we maintain our load-store architecture and our address-mode-less architecture in Boot, rather than having these memory-associated locations be targetable by any operand.
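here's a speculative sketch of what that interception could look like inside an interpreter (every instruction and field name is invented for illustration; none of this is Boot/LOVM spec):

    #include <stdint.h>

    /* The implementation keeps the most recently pushed output-stack slots in a
       small cache (standing in for registers) and only writes them to linear
       memory when the cache overflows or an explicit flushStackToMemory runs.
       Until then, Boot code gets no guarantee that the values are visible in
       memory. */
    enum { CACHE_SLOTS = 4 };

    typedef struct {
        int32_t  mem[1 << 16];       /* data memory, word-addressed for simplicity */
        uint32_t sp;                 /* stack pointer (index into mem); grows downward */
        int32_t  cache[CACHE_SLOTS]; /* top-of-stack values not yet written to mem */
        int      cached;             /* how many slots currently live only in the cache */
    } VmState;

    static void flush_stack_to_memory(VmState *vm) {
        for (int i = 0; i < vm->cached; i++)   /* oldest first, so it lands deepest */
            vm->mem[--vm->sp] = vm->cache[i];
        vm->cached = 0;
    }

    /* PUSHMEMSTACK: logically a push to the memory stack, but the value may
       stay in the cache until a flush. */
    static void op_pushmemstack(VmState *vm, int32_t val) {
        if (vm->cached == CACHE_SLOTS)
            flush_stack_to_memory(vm);
        vm->cache[vm->cached++] = val;
    }

    /* POPMEMSTACK: pops from the cache if possible, otherwise from memory. */
    static int32_t op_popmemstack(VmState *vm) {
        if (vm->cached > 0)
            return vm->cache[--vm->cached];
        return vm->mem[vm->sp++];
    }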
why just push and pop? perhaps push, pop, read, and write. In fact, maybe just read and write -- then we can add the requirement that you flush before moving the stack pointer, which may make things even simpler for the implementation, b/c it doesn't have to keep track of stack ptr moves. Then we can drop this requirement for LOVM.
seems like this may make it harder to abstract stack frames later though? but maybe we are saving that for OVM.
i feel like there is a more abstract way to do this. But i can't think of anything that wouldn't be too abstract for Boot.
done (register-associated cache)
---
LOVM is looking more and more similar to Boot. Maybe just merge LOVM and BootX? (could use the name LOVM for something further up the stack later, as previously planned)
---
so i was thinking about calling conventions. I was hoping there was a way to do all of:
i don't think it's possible. In fact, i don't think even the first three are possible simultaneously:
now, the main issue of them having different calling conventions is: what about the syscalls? Boot is supposed to be a subset of LOVM, so the syscalls should work the same in both cases.
which is also the reason we wanted to pass arguments in registers
so, we could just say that Boot passes stuff on the stack, but that means you have to have available stack memory to do a syscall, and also that you have to take the performance hit of hitting memory to do a syscall.
---
how does linux do syscalls on x86?
there are three types: legacy syscalls with the 'int 0x80' instruction, 'fast' 32-bit syscalls with the SYSENTER instruction, and 'fast' 64-bit system calls with the SYSCALL instruction.
With legacy syscalls, the syscall number and up to 6 args are passed in registers [10]
with fast 32-bit system calls (SYSENTER), the syscall number and up to 5 args are passed in registers [11]
with fast 64-bit system calls (SYSCALL), the syscall number and up to 6 args are passed in registers [12]
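for reference, here's a minimal sketch of an all-register syscall on x86-64 Linux (assumes GCC/Clang extended asm; syscall number in rax, args in rdi/rsi/rdx, result back in rax) -- this is the kind of thing a register-based Boot syscall convention could map onto without touching memory:

    #include <stddef.h>

    /* Minimal sketch: write(fd, buf, len) as a raw x86-64 Linux syscall. The
       syscall number goes in rax and the arguments in rdi, rsi, rdx; nothing is
       passed through memory. The kernel clobbers rcx and r11. */
    static long sys_write(int fd, const void *buf, size_t len) {
        long ret;
        __asm__ volatile ("syscall"
                          : "=a"(ret)
                          : "a"(1L),          /* __NR_write on x86-64 */
                            "D"((long)fd),    /* rdi: arg 1 */
                            "S"(buf),         /* rsi: arg 2 */
                            "d"(len)          /* rdx: arg 3 */
                          : "rcx", "r11", "memory");
        return ret;
    }

    /* usage: sys_write(1, "hi\n", 3); */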