proj-oot-ootAssemblyNotes31

SBCL has a way to write assembly routines, called VOPs (virtual operations), that lets you use Lisp macros in your assembly:

https://news.ycombinator.com/item?id=378581 https://pvk.ca/Blog/2014/08/16/how-to-define-new-intrinsics-in-sbcl/


i think we can unify the cache-like registers with the stack as follows:

so now we can have something like:

or, we can go with our old idea of having a few less than 32 regs in total

and/or we can split the GPRs into: special, caller-save, callee-save
or, we can recognize the stack as a form of callee-save, and so have more caller-save regs
or, we can recognize the stack as a form of caller-save, and so have more callee-save regs
or, we can specify that the top half of the stack is 'reserved' from the callee (as a mini callstack), and also that the caller can only store that amount / 4 when calling (so you can go 4 calls deep). eg if 16 stack regs, each callee can use 8 stack regs as temporaries, and each caller can store 2 items on the stack as caller-saves; this allows for calls of depth 4. this seems suitable for the 'macroinstructions' i have in mind for later, but it seems it would be better to make that flexible, and let the program divide up the stack regs as needed, with prototypes for each macro specifying its stack space reqs

so one idea is:

another idea (to leave more room for the implementation):

i dunno, the latter seems like overkill; if the implementation only has 16 regs, this is still too big; if the implementation has 31 regs, now it has 7 regs to itself, which seems like too much; otoh the implementation might have its own 'special register' overhead. Also it lets the implementation maintain a cache of the in-memory callstack.

so one idea is:

otoh i like the idea of the implementation being able to cache the in-memory callstack. Wasn't the idea to pass all arguments on the stack but allow this mechanism to achieve the same speed efficiencies as passing in regs?

i guess in those cases, the callee can reuse the registers that held arguments as temporaries, which we can't do (in the same way) if 'architecturally' everything is being passed on the in-memory stack. But RISC-V has 6 other temporaries in addition to the argument regs. Windows has 3 other temporaries, plus 2 FP/vector temporaries [4] [5]. in linux, "If the callee wishes to use registers RBX, RSP, RBP, and R12–R15, it must restore their original values before returning control to the caller. All other registers must be saved by the caller if it wishes to preserve their values.[25]: 16  " [6]; https://www.agner.org/optimize/calling_conventions.pdf lists about 3 ordinary ('R'-prefixed) non-argument temporaries (RAX, R10, R11).

and 64-bit windows has about 6 non-special callee-save regs, linux has 5 (in both cases i'm considering RBP as special b/c it's often the frame ptr), risc-v has 11.

oh yeah and what about arm64? we have 8 regs for arg passing, 7 temporaries, 10 callee-saves (and 3 reserved regs) [7]

so it seems like all of them have a few temps aside from arg passing, but fewer of those than callee-saves; but otoh more (temps + arg passing) than callee-saves.

so now i'm leaning towards:

This leaves the implementation some room to cache the in-memory stack, which means that we can achieve the speed efficiencies of passing args in regs without specifying how many args get passed in regs (this is assuming that the implementation is itself running on a 32-register platform).

note: one reason to have a separate smallstack, rather than just cached value of the in-memory stack, is that this allows us to have the items-pushed-too-deep-just-disappear semantics, which allows us to use it as immutable SSA values without bothering to pop off items when we are done with them
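
a minimal sketch of those semantics (Python; the depth and names here are hypothetical, just for illustration):

    # sketch of a fixed-depth smallstack where items pushed too deep just disappear,
    # so values can be treated as immutable and never need to be popped
    class SmallStack:
        def __init__(self, depth=8):
            self.slots = [0] * depth      # slots[0] is the top of the stack

        def push(self, value):
            # every push shifts the whole stack down; the oldest item silently
            # falls off the bottom, so there is never any need to pop
            self.slots = [value] + self.slots[:-1]

        def read(self, i):
            # read the i-th item below the top, like reading a register
            return self.slots[i]

    s = SmallStack()
    s.push(10); s.push(20); s.push(30)
    assert s.read(0) == 30 and s.read(2) == 10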

---

so:

---

check if each part of that proposal can fit in the corresponding part of other 32-register ISAs (risc-v, arm64):

risc-v:

arm64:

also check forwardcom:

hmm, so even on risc-v and arm64, you can't easily fit this stuff into their existing calling conventions.

however, you could reduce the stack regs to 4. This would reduce each quantity by 2, bringing them to or under those limits. It would also mean there are exactly 16 ordinary regs (still not quite good enough to fit within a 16-register system like x86-64 tho, b/c the additional special regs still take it past 16)

---

in forwardcom, agner fog says "A dedicated flags or status register is unfeasible for vector processing, parallel processing, out-of-order processing, and instruction scheduling."

i guess following this maxim would prevent us from returning a carry in a dedicated result register

---

actually having any of the smallstack be callee-saved is a little dumb because it prevents it from being used in the push-only style (until you save it; but if you always have to save it, then it should be caller-saved)

so it should be all caller-saved (temporary). And we don't need 4+8 temporaries, so maybe the smallstacks should only be 4 items. So:

---

so this doesn't quite fit in either risc-v or arm64. Given that, is it even worth it to do the stack cache thing?

oh right, we wanted that so that LOVM can use the same calling convention but can pass everything on the stack (from ootAssemblyNotes29.txt). That's still useful.

i'm sad that the caller-save components don't quite fit in risc-v or arm64, but it's close (7 instead of 8), and i think it's more important to have powers of 2 than to exactly fit, esp. because it won't fit in x86-64 anyways.

i'm sad that smallstack is only a pathetic 4 registers instead of a much more usable 8, but looking at these other 32-register architectures, they really do want 8 regs for argument passing and more than 4 special regs, so i don't want to push it, especially because, across procedure calls, this is being used more like 4 more temporaries than like a traditional stack. i'll keep considering it though.

---

an old tab:

https://en.wikipedia.org/wiki/Sbrk

---

y'know... you could always make 4 of the callee-save registers into a second smallstack. So:

---

even with just 4 stack reg temporaries, we can use these for instructions like compare-and-swap that need more than two input arguments
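
e.g. a hypothetical interpreter fragment (Python; not a settled encoding) where compare-and-swap takes its third input from the top of the smallstack:

    # compare-and-swap needs 3 inputs (address, expected, new) but the format
    # only has 2 input operands, so the 'new' value is read from the smallstack
    memory = {0x100: 7}
    smallstack = [42]                     # 'new' value was pushed here beforehand

    def cas(addr, expected):
        new = smallstack[-1]              # third input comes from the smallstack
        old = memory[addr]
        if old == expected:
            memory[addr] = new
        return old                        # old value goes to the output operand

    assert cas(0x100, 7) == 7 and memory[0x100] == 42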

---

with two smallstacks and a register-associated cache (stack cache), the implementation might have to store at least 3 things in platform registers (not available to the program)

---

i talk about OVM maybe having 'custom instructions' provided by a standard library which builds up capabilities like hash tables; i had been thinking of runtime mechanisms for dynamically loading these but actually now i realize that you can make the custom instructions compile-time only and then treat them like compile-time macros.

maybe the registers that i am reserving so that the implementation has room to cache the top of the in-memory stack could do double-duty of being available for 'custom instructions'. Could reserve 8 registers, and then divide them into 4 registers just for macros and a 4-item smallstack just for macros.

otoh OVM has a larger instruction encoding and so more registers, right? So why use the 32 registers that Oot Assembly uses, why not use higher-numbered registers, to reduce contention?

---

"Stack architectures are pretty easy to generate code for and have a (slight) advantage with code density, but their memory access patterns are hard to optimise for, and they are even harder to make superscalar." [8]

---

https://web.archive.org/web/20200610223140/http://home.pipeline.com/~hbaker1/ForthStack.html particularly section STACK MACHINE IMPLEMENTATION

---

make a list of those instructions used in sectorforth, sectorlisp, jonesforth, other small assembly language programming language implementations (see a list of ones that i'm going to explore in ootOvmNotes2). Those are probably the instructions you need.

---

" We did not include special instruction set support for overflow checks on integer arithmetic operations in the base instruction set, as many overflow checks can be cheaply implemented using RISC-V branches. Overflow checking for unsigned addition requires only a single additional branch instruction after the addition: add t0, t1, t2; bltu t0, t1, overflow. For signed addition, if one operand’s sign is known, overflow checking requires only a single branch after the addition: addi t0, t1, +imm; blt t0, t1, overflow. This covers the common case of addition with an immediate operand. For general signed addition, three additional instructions after the addition are required, " -- RISC-V spec v2.2 section 2.4

---

"Sweet spot seems to be 16-bit instructions with 32/64-bit registers. With 64-bit registers you need some clever way to load your immediates, e.g., like the shift/offset in ARM instructions." [9]

---

done

---

hmm.. actually it would be nice to have 'global' addr modes, rather than per-operand, so that you can do things like base_reg+offset_const+index_reg*scale_const. Also, yeah, when you want immediates, you want to use more bits than when you want register, and those extra bits are what you would be using as 'addr mode' bits.

So how many bits do we have?

We want to have 8-bit opcodes + at least 2 format bits = 10 bits so far

and for LDI (load constant immediate) we want to have a 16-bit immediate operand and a 5-bit op0 register destination operand, which is 21 bits.

So we have 31 bits so far. So only 1 bit is left for 'global addr mode bits', which i'm guessing we should use to specify whether or not we have a large immediate (when the instruction has only two operands, the input operand is a 16-bit immediate; when the instruction has three operands, the immediate bit means that one input operand is an 8-bit immediate and the other one is normal), and then we can use some of those 21 bits for subaddr modes or other purposes when we are not using a 16-bit immediate (when the instruction has 3 operands and one of them is an 8-bit immediate, the other two have at least 5 bits of operand data, so we have 18 bits in total, which is 3 less than 21, so in this case we can use those other 3 bits for addr modes).

the reasoning here is that it seems silly to not be able to represent 16-bit immediates in 32-bit instructions
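
a quick sketch to check that the LDI layout fills exactly 32 bits (Python; the field order here is hypothetical, chosen just for illustration):

    # hypothetical 32-bit LDI layout: 8-bit opcode + 2 format bits + 1 global
    # immediate bit + 5-bit op0 destination + 16-bit immediate = 32 bits
    def encode_ldi(opcode, fmt, dest_reg, imm16):
        assert 0 <= opcode < 256 and 0 <= fmt < 4 and 0 <= dest_reg < 32
        word = opcode                     # bits 0..7   : opcode
        word |= fmt << 8                  # bits 8..9   : format bits
        word |= 1 << 10                   # bit  10     : global immediate bit
        word |= dest_reg << 11            # bits 11..15 : op0 (destination reg)
        word |= (imm16 & 0xFFFF) << 16    # bits 16..31 : 16-bit immediate
        return word

    assert encode_ldi(0x12, 0b01, 3, -1).bit_length() <= 32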

---

an argument for having one bit for pointers and four bits for bytes is that that way you can represent 8 bytes, which will be needed for the offset addressing mode when incrementing loop counters

---

i think i should make op0 the first input operand, op1 the second input operand (when present) and op2 the output operand. Yes, this means that op1 will sometimes not be present even when op2 is, but i think it makes more sense to have the input operands earlier in the instruction stream than the output operand.

---

how about:

This has the advantages of:

and disadvantages of:

An alternative is to get rid of the immediate addr mode bit and instead have immediate-mode instruction variants. The extra bit could be repurposed as another format bit (which would allow one more free bit for either the 8-bit or the 16-bit encoding), or as an addr mode bit for op0 (which, however, needs it less than op1 and op2 because op0 will never need: immediate mode, autoincrement/decrement mode; also, in general, i think the addr modes are less useful than i would like, b/c: we already have stack push/pop in the 'magic registers' beyond r15; memstack addressing mode can't reach too many things b/c of the small size of the 5-bit operand, and the necessity of being agnostic about pointer size).

An argument for putting the addr mode in instruction variants rather than addr mode bits is that this would allow addr modes to apply to more than one operand, e.g. things like (base + offset) addr mode. Cons of that include: (a) in this case you may want different instruction encoding formats for different addr modes, which would add complexity (b) no parallel decoding of operands. My current thought on that is no, it's more complex; or rather, sure, do that if you want, but do it as ordinary instruction behavior rather than at the instruction encoding level.

---

one question is, for the deep smallstack regs, should they be writable?

pros of writable:

cons of writable:

i'm leaning towards writable, because i don't really know enuf about hardware to know if there is really any significant hardware savings, and i certainly don't see this architecture in the wild except for some very niche/researchy things; so, writable is more 'do what everyone else is doing'.

---

i'm thinking that for stack addressing, we just want two components, instead of 4 like i currently have in boot_reference:

This allows addressing of up to 4 pointers and up to 4 int16s (or 2 int32s).

The small number of things that can be addressed reduces the usefulness of this addressing mode.

---

want to reserve a large number of opcodes for extensions. This allows HLL VMs to implement their high-level instructions as extensions, and use the same instruction encoding and instructions as Boot to represent simple stuff.

---

RISC-V has an interesting way of encoding instructions >32 bits. The two least-significant bits of a 32-bit instruction are 11 (so, since the encoding is little-endian, at the start of the instruction in memory), and then the next 3 bits are special; >32-bit instructions have 111 in those next 3 bits, and 32-bit instructions have any of the other 7 possibilities there. So, by not making a rigid distinction b/t format encoding bits and other bits, we can make available more space for 32-bit instructions at the expense of less for >32-bit instructions, which makes sense b/c bits are cheaper for the larger instructions.

we should consider doing this.
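
a sketch of decoding length under that scheme (Python; this just follows the rule as described above, not the full RISC-V length table):

    # decode instruction length from the low bits of the first 16-bit parcel
    def instr_length(first_parcel):
        if first_parcel & 0b11 != 0b11:
            return 16                     # low two bits not 11: 16-bit instruction
        if (first_parcel >> 2) & 0b111 != 0b111:
            return 32                     # any of the other 7 values there: 32-bit
        return "more than 32"             # 111 there: 48-bit or longer

    assert instr_length(0b00010) == 16
    assert instr_length(0b00011) == 32
    assert instr_length(0b11111) == "more than 32"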

---

risc-v uses a little-endian encoding which is why the opcode is in the LSB, not the MSB. i guess we should do the same (see section 1.5, page 10, of https://github.com/riscv/riscv-isa-manual/releases/download/draft-20211216-5651528/riscv-spec.pdf )

risc-v puts rd (destination operand) next to the opcode, providing support for our op0, op1, op2 (but why did they do it that way?)

can't say that i understand why risc-v reorders the immediate bits sometimes, eg section 2.5 page 21 diagram below "Plain unconditional jumps (assembler pseudoinstruction J) are encoded as a JAL with rd=x0.".

---

i'm liking the idea of having base_reg+offset_const+index_reg*scaled_const via just specifying one register (base), and mandating that the index register be the next register. Now this takes up 7+offset_bits bits instead of 12+offset_bits.

or, specify zero registers, and have 3 sets of pairs of registers be base_reg and index_reg for each of op0, op1, op2. Now for each of the three operands, have a 2-bit scale immediate constant, and a 5-bit offset/displacement immediate constant. The only problem is that that's 7*3=21 bits for all three operands; there aren't any bits left to say which operands are using this addr mode and which are ordinary register-mode operands. And since there's 3 of them, that's 8 combinations, which is too many to fit into the opcode. So i guess it won't work.
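
a sketch of the first idea, the implicit index register (Python; the register numbers and field widths here are just illustrative):

    # base_reg+offset_const+index_reg*scale_const addressing where only the base
    # register is encoded and the index register is implicitly the next register
    regs = [0] * 32

    def effective_address(base_reg, scale_log2, offset):
        base = regs[base_reg]
        index = regs[base_reg + 1]        # index register is implied: base + 1
        return base + offset + index * (1 << scale_log2)

    regs[6], regs[7] = 0x1000, 3          # base in r6, index in r7 (= r6 + 1)
    assert effective_address(6, scale_log2=3, offset=16) == 0x1000 + 16 + 3 * 8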

---

maybe we should go back to the goal of "wasm, but with gotos, and fixed-length encoding" or "risc-v, but slightly simpler".

maybe BootS is close to perfect, even though it wastes a bunch of bits with 8-bit register operand encoding. Also, BootS doesn't have 2 format bits; if it did, there would only be 6 bits for the opcode, which isn't enough.

---

if opcode is 8 bits and operands are 5 bits then each op has three bits left -- use the three bits for the destination operand for global stuff: two format bits and one immediate bit. There are now six addr mode bits left. One of them is the immediate bit for the other input operand. Now there are five left for actual address modes and/or augmenting the number of bits in the input operands in some of these modes.
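
a throwaway check that this budget sums to 32 bits (field names hypothetical):

    # 8-bit opcode, three 5-bit operands, and 3 leftover bits per operand: the
    # destination's 3 are 2 format bits + 1 immediate bit; of the remaining 6,
    # 1 is the other input's immediate bit and 5 are addr-mode/extra bits
    fields = {"opcode": 8, "op0": 5, "op1": 5, "op2": 5,
              "format": 2, "global_imm": 1, "other_input_imm": 1,
              "addr_modes_or_extra_operand_bits": 5}
    assert sum(fields.values()) == 32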

---

how long can RISC-V opcodes get?

riscv r-type instructions have effectively 15-bit opcodes (seven bits in the "opcode" field minus two format bits, because the 2 lsbs of 32-bit instructions must be 11, plus 3 bits in the "funct3" field plus 7 bits in the "funct7" field)

---

we probably want to avoid such 'variable width opcodes', but i'm not sure; mb have the trick where, if the instruction needs fewer operands, one operand field can be an additional opcode

---

we want to leave half the opcodes for custom extensions; so the msb opcode bit should always be 0 in the defined instructions. So we only have 128 instructions, max, to specify. Should custom instructions be able to have their own custom instruction encoding too (e.g. to leave room for stuff like variable width opcodes)? Maybe; but the instruction length should be unchangeable/specified (a custom extension can use the custom instruction length if it wants longer instructions, or something else weird).

---

all immediates are sign-extended

---

so we are looking at:

out of the 21 remaining bits, if the global immediate bit is set:

out of the 21 remaining bits, if the global immediate bit is not set:

the question marks are because we might want to let the local immediate flags indicate just a 5-bit immediate, in which case we have more bits remaining; or we might not want local immediate flags at all

note that risc-v has 12-bit immediates for a variety of instructions... so maybe if the global immediate bit is set for a 3-operand instruction, we want: 11-bit immediate, two 5-bit operand data fields

all immediates are sign-extended

---

so how about:

out of the 21 remaining bits, if the immediate bit is set:

out of the 21 remaining bits, if the immediate bit is not set:

when there are bits remaining, they are used for addr modes, which might include:

(note: immediate mode is already covered via the immediate bit; note that only one operand per instruction can have a non-default addr mode, so we don't need to provide a way for one operand to be immediate and another one to be in a non-default addr mode; may want to revisit this however, as stuff like add(immediate_constant, register_indirect) would be useful; could provide that sort of thing only for smaller immediates, in which case we would burn one bit to indicate whether the other operand is immediate when the 'immediate' bit is not set, as we'd have 10 and 5 bits remaining, respectively)

ideas for ?? are:

notes:

32 regs:
4 caller-saved GPRs
4 callee-saved GPRs
4 caller-saved smallstack regs
4 callee-saved smallstack regs
8 magic regs (zero reg, memstack (stack ptr), 2 push/pop to smallstacks regs, pc, mb some of: global ptr, thread ptr, frame ptr, ?todo look at what we used to have for special regs, look at what risc-v, x86, amd32, amd64 has?)
8 unused (the idea here is to allow the implementation to use these internally and to be able to use this same instruction format for an intermediate representation in which it references its internal registers; in addition, this eases the memory burden on machines without enough registers to represent all the regs that we demand)

---

arm64: "Has dedicated zero or stack pointer (SP) register (depending on instruction)."

---

should mb have a way to encode LEA (base + scale*index + immediate_offset)

hmm... i can encode LEA with an 8-bit displacement if there is no destination register specified

---

mb the reason riscv has that weird encoding with split immediates is that it wants one register operand data field in each byte? almost; rd and rs1 are both at the beginning of a byte, but rs2 is sometimes next to rs1, at the end of that byte. (the spec's own stated reason is that it keeps rs1/rs2/rd at the same positions in all formats and always puts the immediate's sign bit in bit 31, to simplify decoding and sign-extension.)

---

should have an addressing mode that is indexed address mode but uses pre-specified registers for base and index and scale, and where the operand data is used for an immediate displacement; also one with no displacement but index in the operand; etc

---

should we have a few prefix bits in the opcode that specify the instruction format?

actually i think our old system was fine; have the instruction format be determined by contiguous ranges in the instruction encoding (possibly still constrained to fall on even boundaries so that you can determine the instruction format from a prefix of the full opcode bits). That way, formats that don't have many instructions (like the one-big-immediate format that relative jumps use) don't take up much space.
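
a sketch of the contiguous-ranges idea (Python; these particular ranges and format names are made up for illustration):

    # the format is determined by which contiguous opcode range the opcode falls in
    FORMAT_RANGES = [
        (0x00, 0x0F, "one-big-immediate"),   # small range for the jump-style format
        (0x10, 0x7F, "three-operand"),
        (0x80, 0xFF, "custom-extension"),    # msb set: reserved for extensions
    ]

    def format_of(opcode):
        for lo, hi, fmt in FORMAT_RANGES:
            if lo <= opcode <= hi:
                return fmt
        raise ValueError("undefined opcode")

    assert format_of(0x03) == "one-big-immediate"
    assert format_of(0x42) == "three-operand"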

---

and we might want to just do the usual thing and wrap the addr mode into the opcode

and in that case we can have various instruction formats; various indexed & displacement formats; a four operand format; a format with a long opcode; etc

mb this time we should just start with sticking to the basics (eg addr modes/immediates similar to what everyone else is doing, instead of immediately jumping in with my own silly ideas)

---

are we giving up on sz 1 int32s?

what about the idea of defining structs? that seems too advanced for assembly language tho.

maybe back to using 5 bits for: int8, int16, int32, int64, ptr

and when we have 4 bits (eg a scale constant), we need to include ptrs. Ideas:

2,4,8,ptr
1,4,8,ptr
1,2,4,ptr

eh, mb only int8 and ptr. Let implementations over HLLs suffer. Also, even a Python implementation could use bytestrings for memory.

---

also need register (not immediate) displacement in index addr mode for computed struct size (must be computed bc pointer size agnostic)

what i mean is, we need an addressing mode like displacement addressing but with a register holding the displacement rather than an immediate. altho, i guess that's the same as indexed addr mode, so mb don't bother

---

even wasm already uses more than 128, and almost 256, opcodes

---

i guess the reason for the popularity of stack machines is not really the efficiency (although the theoretically higher code density is sometimes cited), but rather the simplicity of not having registers (-> not having to have operands in your instruction encoding; not having to fix a number of registers; letting different instructions have different numbers of inputs and outputs). If you're just trying to formally define a very simple VM this seems simpler. Also, when you use registers you do sometimes feel like you are doing something that could be better expressed with a stack (e.g. call instruction F with arguments A,B,C,D).

---

so maybe:

the caller-saved ("temporary") stack can't be popped (it's just a circular buffer that goes in one direction only); the idea is that this is just used for temporaries, including passing inputs to/receiving outputs from instructions when you don't have enough operands, either because the instruction takes more than 2 inputs or yields more than 1 output, or because you are using a compact encoding.

by contrast, the callee-saved stack is poppable and even has a way to push something onto the bottom of the stack at the same time that you are popping something off the top of it (this allows for the callee to restore the state of this stack before exiting). This stack is more suitable for stuff like expression evaluation, or for calling subroutines up to a fixed level (eg special assembler macros)
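
the caller-saved stack behaves like the push-only sketch earlier; here's a sketch of the callee-saved stack's pop-with-bottom-refill trick (Python; the depth and operation names are hypothetical):

    class SavedStack:
        def __init__(self, depth=4):
            self.slots = [0] * depth
        def push(self, v):                 # pushing spills the bottom item
            self.slots = [v] + self.slots[:-1]
        def pop(self, refill=0):           # pop the top, feed refill in at the bottom
            top, self.slots = self.slots[0], self.slots[1:] + [refill]
            return top

    s = SavedStack()
    s.slots = [11, 22, 33, 44]             # caller's values; 44 is the bottom item
    spilled = s.slots[-1]                  # callee saves the item its push will spill
    s.push(99)                             # callee uses the stack: [99, 11, 22, 33]
    s.pop(refill=spilled)                  # pop 99, refill 44 at the bottom
    assert s.slots == [11, 22, 33, 44]     # caller's state restored before returning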

---

"You can’t copy code with memcpy; code is more complicated than that" https://devblogs.microsoft.com/oldnewthing/20211229-00/?p=106061

mostly talks about code not being position-independent

---

4 archs that a linux kernel contributor (Ingo Molnar) built for first are:

ARM64, X86 and then MIPS, Sparc [10]

---

" My roommate had an assembler program for the computer, along with a manual. I pored over them and because BASIC has such a paucity of features, it was not too difficult to convert the lexer to assembler, though it took me a few days.

But I had to call this routine from BASIC. That’s an odd combination of CLEAR (reserve memory), DATA (define your program), POKE (put your program in memory), DEFUSR... (define the user routine used to call your program) and USR... (call the user routine) commands. Here’s a small program to print out the ASCII value of whatever character the user typed (from the user manual ) (warning: pdf): " [11]

---

" Since its initial release, Go has used a stack-based calling convention based on the Plan 9 ABI, in which arguments and result values are passed via memory on the stack. This has significant simplicity benefits: the rules of the calling convention are simple and build on existing struct layout rules; all platforms can use essentially the same conventions, leading to shared, portable compiler and runtime code; and call frames have an obvious first-class representation, which simplifies the implementation of the go and defer statements and reflection calls. Furthermore, the current Go ABI has no callee-save registers, meaning that no register contents live across a function call (any live state in a function must be flushed to the stack before a call). This simplifies stack tracing for garbage collection and stack growth and stack unwinding during panic recovery. " -- https://go.googlesource.com/proposal/+/refs/changes/78/248178/1/design/40724-register-calling.md

" > A recent improvement was in version 1.17, which passes function arguments and results in registers instead of on the stack. This was a significant improvement for GoAWK’s? old tree-walking interpreter, where I saw a 38% increase in speed on one microbenchmark, and a 17% average speedup across all microbenchmarks" -- https://benhoyt.com/writings/go-version-performance/ via https://news.ycombinator.com/item?id=30201969

nappy-doo (in that HN thread): "We published a 5% win, but it was really closer to 10% (for amd64). We were conservative in our marketing. The arm64 register ABI (which will be in 1.18 [March]) is showing performance improvements of ≥ 15%. (Rob Pike saw 20% on some of his stuff.)"
