SBCL has a way to write assembly, called VOPs (virtual operations), so that you can use Lisp macros with your assembly:
https://news.ycombinator.com/item?id=378581 https://pvk.ca/Blog/2014/08/16/how-to-define-new-intrinsics-in-sbcl/
i think we can unify the cache-like registers with the stack as follows:
so now we can have something like:
or, we can go with our old idea of having a few less than 32 regs in total
and/or we can split the GPRs into: special, caller-save, callee-save
or, we can recognize the stack as a form of callee-save, and so have more caller-save regs
or, we can recognize the stack as a form of caller-save, and so have more callee-save regs
or, we can specify that the top half of the stack is 'reserved' from the callee (as a mini callstack), and also that the caller can only store that amount / 4 when calling (so you can go 4 calls deep)
eg if 16 stack regs, each callee can use 8 stack regs as temporaries, and each caller can store 2 items on the stack as caller-saves; this allows for calls of depth 4
this seems suitable for the 'macroinstructions' i have in mind for later, but it seems it would be better to make that flexible, and let the program divide up the stack regs as needed, with prototypes for each macro specifying its stack space reqs
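the arithmetic of that last option, as a rough python sketch. the function name, the dict keys and the `max_depth` parameter are made up for illustration; only the 16 / 8 / 2 / 4 numbers come from the example above:

```python
def split_stack_regs(n_stack_regs, max_depth=4):
    """Split the stack regs into callee temporaries and a mini callstack.

    Hypothetical helper: the top half is the mini callstack (off-limits to
    the callee's temporaries), and each caller may store 1/max_depth of it.
    """
    reserved = n_stack_regs // 2             # mini callstack for caller-saves
    callee_temps = n_stack_regs - reserved   # what each callee may clobber
    per_caller = reserved // max_depth       # items each caller may store
    depth = reserved // per_caller if per_caller else 0
    return {
        "callee_temps": callee_temps,
        "per_caller_saves": per_caller,
        "max_call_depth": depth,
    }


if __name__ == "__main__":
    print(split_stack_regs(16))
```

with 16 stack regs this gives 8 callee temporaries, 2 caller-saves per caller and a call depth of 4, matching the example. with fewer than 8 stack regs the per-caller share rounds down to 0, so the scheme stops being usable.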
so one idea is:
another idea (to leave more room for the implementation):
i dunno, the latter seems like overkill; if the implementation only has 16 regs, this is still too big; if the implementation has 31 regs, now it has 7 regs to itself, which seems like too much; otoh the implementation might have its own 'special register' overhead. Also it lets the implementation maintain a cache of the in-memory callstack.
so one idea is:
otoh i like the idea of the implementation being able to cache the in-memory callstack. Wasn't the idea to pass all arguments on the stack but allow this mechanism to achieve the same speed efficiencies as passing in regs?
i guess in those cases, the callee can reuse the registers that held arguments as temporaries, which we can't do (in the same way) if 'architecturally' everything is being passed on the in-memory stack. But RISC-V has 6 other temporaries in addition to the argument regs. Windows has 3 other temporaries, plus 2 FP/vector temporaries [4] [5]. in linux, "If the callee wishes to use registers RBX, RSP, RBP, and R12–R15, it must restore their original values before returning control to the caller. All other registers must be saved by the caller if it wishes to preserve their values." [6]; https://www.agner.org/optimize/calling_conventions.pdf lists about 3 ordinary ('R'-prefixed) non-argument temporaries (RAX, R10, R11).
and 64-bit windows has about 6 non-special callee-save regs, linux has 5 (in both cases i'm considering RBP as special b/c it's often the frame ptr), risc-v has 11.
oh yeah and what about arm64? we have 8 regs for arg passing, 7 temporaries, 10 callee-saves (and 3 reserved regs) [7]
so it seems like all of them have a few temps aside from arg passing, but fewer of those than callee-saves; but otoh more (temps + arg passing) than callee-saves.
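a quick tally to check that pattern, as a python sketch. the temporary and callee-save counts are the ones quoted above (including the "about 6" for windows); the argument-register counts for windows (4), linux (6) and risc-v (8) are not in these notes and are filled in from the respective calling conventions, so treat them as assumptions. the table and function names are made up for illustration:

```python
# (arg regs, non-arg temporaries, non-special callee-saves)
ABIS = {
    "win64": (4, 3, 6),        # arg count assumed from the Windows x64 ABI
    "linux-x86-64": (6, 3, 5), # arg count assumed from the System V ABI
    "risc-v": (8, 6, 11),      # arg count assumed from the RISC-V ABI
    "arm64": (8, 7, 10),       # all three counts quoted in the notes
}


def fits_pattern(args, temps, callee_saves):
    """True if temps < callee-saves < temps + args."""
    return temps < callee_saves < temps + args


if __name__ == "__main__":
    for name, counts in ABIS.items():
        print(name, fits_pattern(*counts))
```

all four conventions satisfy the inequality with these counts.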
so now i'm leaning towards:
This leaves the implementation some room to cache the in-memory stack, which means that we can achieve the speed efficiencies of passing args in regs without specifying how many args get passed in regs (this is assuming that the implementation is itself running on a 32-register platform).
note: one reason to have a separate smallstack, rather than just cached value of the in-memory stack, is that this allows us to have the items-pushed-too-deep-just-disappear semantics, which allows us to use it as immutable SSA values without bothering to pop off items when we are done with them
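a minimal python sketch of those semantics, assuming a fixed-depth stack where pushing onto a full stack silently drops the deepest item. the class and method names are hypothetical, and the depth of 4 is only a default:

```python
from collections import deque


class SmallStack:
    """Fixed-depth stack; items pushed too deep just disappear."""

    def __init__(self, depth=4):
        # A bounded deque discards from the opposite end when full.
        self._items = deque(maxlen=depth)

    def push(self, value):
        # New items go on top (index 0); the deepest one falls off if full.
        self._items.appendleft(value)

    def peek(self, i=0):
        # Read the i-th item from the top without popping it.
        return self._items[i]

    def __len__(self):
        return len(self._items)


if __name__ == "__main__":
    s = SmallStack(depth=4)
    for v in range(6):      # push 0..5; values 0 and 1 fall off the bottom
        s.push(v)
    print([s.peek(i) for i in range(len(s))])
```

used this way the program never pops: each push creates a fresh value that stays addressable by depth until enough later pushes shove it off the end.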
---
so:
---
check if each part of that proposal can fit in the corresponding part of other 32-register ISAs (risc-v, arm64):
risc-v:
arm64:
also check forwardcom:
hmm, so even on risc-v and arm64, you can't easily fit this stuff into their existing calling conventions.
however, you could reduce the stack regs to 4. This would reduce each quantity by 2, and bring them up to or under those limits. It would also mean there are exactly 16 ordinary regs (still not quite good enough to fit within a 16-register system like x86-64 tho, b/c the additional special regs still take it past 16)
---
in forwardcom, agner fog says "A dedicated flags or status register is unfeasible for vector processing, parallel processing, out-of-order processing, and instruction scheduling."
i guess following this maxim would prevent us from returning a carry in a dedicated result register
---
actually having any of the smallstack be callee-saved is a little dumb because it prevents it from being used in the push-only style (until you save it; but if you always have to save it, then it should be caller-saved)
so it should be all caller-saved (temporary). And we don't need 4+8 temporaries, so maybe the smallstacks should only be 4 items. So:
---
so this doesn't quite fit in either risc-v or arm64. Given that, is it even worth it to do the stack cache thing?
oh right, we wanted that so that LOVM can use the same calling convention but can pass everything on the stack (from ootAssemblyNotes29.txt). That's still useful.
i'm sad that the caller-save components don't quite fit in risc-v or arm64, but it's close (7 instead of 8), and i think it's more important to have powers of 2 than to exactly fit, esp. because it won't fit in x86-64 anyways.
i'm sad that smallstack is only a pathetic 4 registers instead of a much more usable 8, but looking at these other 32-register architectures, they really do want 8 regs for argument passing and >4 special regs, so i don't want to push it, especially because, across procedure calls, this is being used more like 4 more temporaries than like a traditional stack. i'll keep considering it though.
---
an old tab:
https://en.wikipedia.org/wiki/Sbrk
---
y'know... you could always make 4 of the callee-save registers into a second smallstack. So:
---
even with just 4 stack reg temporaries, we can use these for instructions like compare-and-swap that need more than two input arguments
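a rough python sketch of what that could look like, assuming a two-operand instruction format where the third input (the expected value) is read from the top of the smallstack. the operand order, the function name and the choice to read without popping are all assumptions for illustration, not a defined instruction:

```python
def cas(memory, addr, new_value, smallstack):
    """Compare-and-swap with the expected value taken from the smallstack.

    Hypothetical semantics: addr and new_value are the two register
    operands; the expected value is the top of the smallstack (last list
    element here), read without popping.
    """
    expected = smallstack[-1]
    old = memory[addr]
    if old == expected:
        memory[addr] = new_value
    return old   # caller compares this against expected to see if it won


if __name__ == "__main__":
    mem = [7]
    stack = [7]                     # expected value pushed beforehand
    print(cas(mem, 0, 9, stack), mem)
```

this simulation is not atomic; it only shows where the three inputs come from.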
---
with two smallstacks and a register-associated cache (stack cache), the implementation might have to store at least 3 things in platform registers (not available to the program)
---
i talk about OVM maybe having 'custom instructions' provided by a standard library which builds up capabilities like hash tables; i had been thinking of runtime mechanisms for dynamically loading these but actually now i realize that you can make the custom instructions compile-time only and then treat them like compile-time macros.
maybe the registers that i am reserving so that the implementation has room to cache the top of the in-memory stack could do double-duty of being available for 'custom instructions'. Could reserve 8 registers, and then divide them into 4 registers just for macros and a 4-item smallstack just for macros.
otoh OVM has a larger instruction encoding and so more registers, right? So why use the 32 registers that Oot Assembly uses, why not use higher-numbered registers, to reduce contention?
---
"Stack architectures are pretty easy to generate code for and have a (slight) advantage with code density, but their memory access patterns are hard to optimise for, and they are even harder to make superscalar." [8]
---
https://web.archive.org/web/20200610223140/http://home.pipeline.com/~hbaker1/ForthStack.html particularly section STACK MACHINE IMPLEMENTATION
---
make a list of those instructions used in sectorforth, sectorlisp, jonesforth, other small assembly language programming language implementations (see a list of ones that i'm going to explore in ootOvmNotes2). Those are probably the instructions you need.
---
" We did not include special instruction set support for overflow checks on integer arithmetic operations in the base instruction set, as many overflow checks can be cheaply implemented using RISC-V branches. Overflow checking for unsigned addition requires only a single additional branch instruction after the addition: add t0, t1, t2; bltu t0, t1, overflow. For signed addition, if one operand’s sign is known, overflow checking requires only a single branch after the addition: addi t0, t1, +imm; blt t0, t1, overflow. This covers the common case of addition with an immediate operand. For general signed addition, three additional instructions after the addition are required, " -- RISC-V spec v2.2 section 2.4
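the two single-branch checks from that quote, simulated in python as a sketch. the 32-bit width, the function names and the restriction to a non-negative immediate are assumptions for illustration; the general signed case from the quote is not covered:

```python
XLEN = 32                      # assumed register width
MASK = (1 << XLEN) - 1


def to_signed(x):
    """Interpret the low XLEN bits of x as a two's-complement integer."""
    x &= MASK
    return x - (1 << XLEN) if x >> (XLEN - 1) else x


def add_unsigned_checked(t1, t2):
    """add t0, t1, t2; bltu t0, t1, overflow"""
    t0 = (t1 + t2) & MASK      # wrapping add
    return t0, t0 < t1         # sum wrapped iff it is below an operand


def addi_signed_checked(t1, imm):
    """addi t0, t1, +imm; blt t0, t1, overflow (imm known to be >= 0)"""
    assert imm >= 0, "this check only works when the sign of imm is known"
    t0 = to_signed(t1 + imm)   # wrapping signed add
    return t0, t0 < t1         # adding a non-negative value must not decrease


if __name__ == "__main__":
    print(add_unsigned_checked(0xFFFFFFFF, 1))
    print(addi_signed_checked(2**31 - 1, 1))
```

in both cases the overflow test is a single comparison of the result against one of the inputs, which is why one branch instruction is enough.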
---
"Sweet spot seems to be 16-bit instructions with 32/64-bit registers. With 64-bit registers you need some clever way to load your immediates, e.g., like the shift/offset in ARM instructions." [9]