proj-oot-ootAssemblyNotes31

SBCL has a way to write assembly called VOPs, so that you can use Lisp macros with your assembly:

https://news.ycombinator.com/item?id=378581 https://pvk.ca/Blog/2014/08/16/how-to-define-new-intrinsics-in-sbcl/


i think we can unify the cache-like registers with the stack as follows:

so now we can have something like:

or, we can go with our old idea of having a few less than 32 regs in total

and/or we can split the GPRs into: special, caller-save, callee-save
or, we can recognize the stack as a form of callee-save, and so have more caller-save regs
or, we can recognize the stack as a form of caller-save, and so have more callee-save regs
or, we can specify that the top half of the stack is 'reserved' from the callee (as a mini callstack), and also that the caller can only store that amount / 4 when calling (so you can go 4 calls deep); eg if 16 stack regs, each callee can use 8 stack regs as temporaries, and each caller can store 2 items on the stack as caller-saves; this allows for calls of depth 4. this seems suitable for the 'macroinstructions' i have in mind for later, but it seems it would be better to make that flexible, and let the program divide up the stack regs as needed, with prototypes for each macro specifying its stack space reqs

so one idea is:

another idea (to leave more room for the implementation):

i dunno, the latter seems like overkill; if the implementation only has 16 regs, this is still too big; if the implementation has 31 regs, now it has 7 regs to itself, which seems like too much; otoh the implementation might have its own 'special register' overhead. Also it lets the implementation maintain a cache of the in-memory callstack.

so one idea is:

otoh i like the idea of the implementation being able to cache the in-memory callstack. Wasn't the idea to pass all arguments on the stack but allow this mechanism to achieve the same speed efficiencies as passing in regs?

i guess in those cases, the callee can reuse the registers that held arguments as temporaries, which we can't do (in the same way) if 'architecturally' everything is being passed on the in-memory stack. But RISC-V has 6 other temporaries in addition to the argument regs. Windows has 3 other temporaries, plus 2 FP/vector temporaries [4] [5]. in linux, "If the callee wishes to use registers RBX, RSP, RBP, and R12–R15, it must restore their original values before returning control to the caller. All other registers must be saved by the caller if it wishes to preserve their values.[25]: 16  " [6]; https://www.agner.org/optimize/calling_conventions.pdf lists about 3 ordinary ('R'-prefixed) non-argument temporaries (RAX, R10, R11).

and 64-windows has about 6 non-special callee-save regs, linux has 5, (in both cases i'm considering RBP as special b/c it's often the frame ptr), risc-v has 11.

oh yeah and what about arm64? we have 8 regs for arg passing, 7 temporaries, 10 callee-saves (and 3 reserved regs) [7]

so it seems like, all of them have a few temps aside from arg passing, but less of those than callee-saves; but otoh more (temps + arg passing) than callee-saves.
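to sanity-check that pattern, a quick tally (Python just for illustration; the counts are the approximate ones gathered above, with RBP treated as special rather than callee-save, and the usual 4/6 argument regs assumed for win64/linux64):

```python
# Approximate per-ABI register classes from the notes above.
# (win64 = x86-64 Windows, linux64 = x86-64 System V.)
abis = {
    "risc-v":  {"args": 8, "temps": 6, "callee_saves": 11},
    "arm64":   {"args": 8, "temps": 7, "callee_saves": 10},
    "win64":   {"args": 4, "temps": 3, "callee_saves": 6},
    "linux64": {"args": 6, "temps": 3, "callee_saves": 5},
}

for name, r in abis.items():
    # fewer bare temps than callee-saves...
    assert r["temps"] < r["callee_saves"], name
    # ...but (temps + args) exceeds callee-saves
    assert r["temps"] + r["args"] > r["callee_saves"], name
```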

so now i'm leaning towards:

This leaves the implementation some room to cache the in-memory stack, which means that we can achieve the speed efficiencies of passing args in regs without specifying how many args get passed in regs (this is assuming that the implementation is itself running on a 32-register platform).

note: one reason to have a separate smallstack, rather than just cached value of the in-memory stack, is that this allows us to have the items-pushed-too-deep-just-disappear semantics, which allows us to use it as immutable SSA values without bothering to pop off items when we are done with them
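a minimal sketch of those semantics (a hypothetical 4-deep smallstack; Python only for illustration):

```python
class SmallStack:
    """Fixed-depth push-only smallstack: items pushed too deep just
    disappear, so values can be used as immutable SSA-style temporaries
    without ever popping them off when done."""
    def __init__(self, depth=4):
        self.slots = [0] * depth      # slots[0] is the top

    def push(self, value):
        # everything shifts down one slot; the deepest item vanishes
        self.slots = [value] + self.slots[:-1]

    def peek(self, i=0):
        return self.slots[i]

s = SmallStack(depth=4)
for v in [10, 20, 30, 40, 50]:
    s.push(v)
assert s.slots == [50, 40, 30, 20]    # the 10 silently fell off the bottom
```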

---

so:

---

check if each part of that proposal can fit in the corresponding part of other 32-register ISAs (risc-v, arm64):

risc-v:

arm64:

also check forwardcom:

hmm, so even on risc-v and arm64, you can't easily fit this stuff into their existing calling conventions.

however, you could reduce the stack regs to 4. This would reduce each quantity by 2, and bring them up to or under those limits. It would also mean there are exactly 16 ordinary regs (still not quite good enough to fit within a 16-register system like x86-64 tho, b/c the additional special regs still take it past 16)

---

in forwardcom, agner fog says "A dedicated flags or status register is unfeasible for vector processing, parallel processing, out-of-order processing, and instruction scheduling."

i guess following this maxim would prevent us from returning a carry in a dedicated result register

---

actually having any of the smallstack be callee-saved is a little dumb because it prevents it from being used in the push-only style (until you save it; but if you always have to save it, then it should be caller-saved)

so it should be all caller-saved (temporary). And we don't need 4+8 temporaries, so maybe the smallstacks should only be 4 items. So:

---

so this doesn't quite fit in either risc-v or arm64. Given that, is it even worth it to do the stack cache thing?

oh right, we wanted that so that LOVM can use the same calling convention but can pass everything on the stack (from ootAssemblyNotes29.txt). That's still useful.

i'm sad that the caller-save components don't quite fit in risc-v or arm64, but it's close (7 instead of 8), and i think it's more important to have powers of 2 than to exactly fit, esp. because it won't fit in x86-64 anyways.

i'm sad that smallstack is only a pathetic 4 registers instead of a much more usable 8, but looking at these other 32-register architectures, they really do want 8 regs for argument passing and >4 special regs, so i don't want to push it, especially because, across procedure calls, this is being used more like 4 more temporaries than like a traditional stack. i'll keep considering it though.

---

an old tab:

https://en.wikipedia.org/wiki/Sbrk

---

y'know... you could always make 4 of the callee-save registers into a second smallstack. So:

---

even with just 4 stack reg temporaries, we can use these for instructions like compare-and-swap that need more than two input arguments

---

with two smallstacks and a register-associated cache (stack cache), the implementation might have to store at least 3 things in platform registers (not available to the program)

---

i talk about OVM maybe having 'custom instructions' provided by a standard library which builds up capabilities like hash tables; i had been thinking of runtime mechanisms for dynamically loading these but actually now i realize that you can make the custom instructions compile-time only and then treat them like compile-time macros.

maybe the registers that i am reserving so that the implementation has room to cache the top of the in-memory stack could do double-duty of being available for 'custom instructions'. Could reserve 8 registers, and then divide them into 4 registers just for macros and a 4-item smallstack just for macros.

otoh OVM has a larger instruction encoding and so more registers, right? So why use the 32 registers that Oot Assembly uses, why not use higher-numbered registers, to reduce contention?

---

"Stack architectures are pretty easy to generate code for and have a (slight) advantage with code density, but their memory access patterns are hard to optimise for, and they are even harder to make superscalar." [8]

---

https://web.archive.org/web/20200610223140/http://home.pipeline.com/~hbaker1/ForthStack.html particularly section STACK MACHINE IMPLEMENTATION

---

make a list of those instructions used in sectorforth, sectorlisp, jonesforth, other small assembly language programming language implementations (see a list of ones that i'm going to explore in ootOvmNotes2). Those are probably the instructions you need.

---

" We did not include special instruction set support for overflow checks on integer arithmetic operations in the base instruction set, as many overflow checks can be cheaply implemented using RISC-V branches. Overflow checking for unsigned addition requires only a single additional branch instruction after the addition: add t0, t1, t2; bltu t0, t1, overflow. For signed addition, if one operand’s sign is known, overflow checking requires only a single branch after the addition: addi t0, t1, +imm; blt t0, t1, overflow. This covers the common case of addition with an immediate operand. For general signed addition, three additional instructions after the addition are required, " -- RISC-V spec v2.2 section 2.4

---

"Sweet spot seems to be 16-bit instructions with 32/64-bit registers. With 64-bit registers you need some clever way to load your immediates, e.g., like the shift/offset in ARM instructions." [9]

---

done

---

hmm.. actually it would be nice to have 'global' addr modes, rather than per-operand, so that you can do things like base_reg+offset_const+index_reg*scale_const. Also, yeah, when you want immediates, you want to use more bits than when you want register, and those extra bits are what you would be using as 'addr mode' bits.

So how many bits do we have?

We want to have 8bit opcodes + at least 2 format bits = 10 bits so far

and for LDI (load constant immediate) we want to have a 16-bit immediate operand and a 5-bit op0 register destination operand, which is 21 bits.

So we have 31 bits so far, which leaves only 1 bit for 'global addr mode bits'. i'm guessing we should use that bit to specify whether or not we have a large immediate: when the instruction has only two operands, the input operand is a 16-bit immediate; when the instruction has three operands, the immediate bit means that one input operand is an 8-bit immediate and the other one is normal. Then we can use some of those 21 bits for subaddr modes or other purposes when we are not using a 16-bit immediate (when the instruction has 3 operands and one of them is an 8-bit immediate, the other two have at least 5 bits of operand data, so we have 18 bits in total, which is 3 less than 21; in this case we can use those other 3 bits for addr modes).

the reasoning here is that it seems silly to not be able to represent 16-bit immediates in 32-bit instructions
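a sketch of that bit budget as a pack/unpack pair (the field order and names are my own hypothetical choices; only the widths come from the notes above):

```python
# Hypothetical 32-bit LDI layout: 2 format bits, 8-bit opcode,
# 1 'large immediate' bit, 5-bit destination register, 16-bit
# immediate = 2 + 8 + 1 + 5 + 16 = 32 bits exactly.
def pack_ldi(fmt, opcode, imm_bit, dest, imm16):
    assert fmt < 4 and opcode < 256 and imm_bit < 2
    assert dest < 32 and imm16 < 65536
    return (fmt << 30) | (opcode << 22) | (imm_bit << 21) | (dest << 16) | imm16

def unpack_ldi(word):
    return (word >> 30, (word >> 22) & 0xFF, (word >> 21) & 1,
            (word >> 16) & 0x1F, word & 0xFFFF)

w = pack_ldi(fmt=1, opcode=0x42, imm_bit=1, dest=7, imm16=0xBEEF)
assert unpack_ldi(w) == (1, 0x42, 1, 7, 0xBEEF)
```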

---

an argument for having one bit for pointers and four bits for bytes is that way you can represent 8 bytes which will be needed for offset addressing mode when incrementing loop counters

---

i think i should make op0 the first input operand, op1 the second input operand (when present) and op2 the output operand. Yes, this means that op1 will sometimes not be present even when op2 is, but i think it makes more sense to have the input operands earlier in the instruction stream than the output operand.

---

---

how about:

This has the advantages of:

and disadvantages of:

An alternative is to get rid of the immediate addr mode bit and instead have immediate-mode instruction variants. The extra bit could be repurposed as another format bit (which would allow one more free bit for either the 8-bit or the 16-bit encoding), or as an addr mode bit for op0 (which, however, needs it less than op1 and op2 because op0 will never need: immediate mode, autoincrement/decrement mode; also, in general, i think the addr modes are less useful than i would like, b/c: we already have stack push/pop in the 'magic registers' beyond r15; memstack addressing mode can't reach too many things b/c of the small size of the 5-bit operand, and the necessity of being agnostic about pointer size).

An argument for putting the addr mode in instruction variants rather than addr mode bits is that this would allow addr modes to apply to more than one operand, e.g. things like (base + offset) addr mode. Cons of that include: (a) in this case you may want different instruction encoding formats for different addr modes, which would add complexity (b) no parallel decoding of operands. My current thought on that is no, it's more complex; or rather, sure, do that if you want, but do it as ordinary instruction behavior rather than at the instruction encoding level.

---

one question is, for the deep smallstack regs, should they be writable?

pros of writable:

cons of writable:

i'm leaning towards writable, because i don't really know enuf about hardware to know if there is really any significant hardware savings, and i certainly don't see this architecture in the wild except for some very niche/researchy things; so, writable is more 'do what everyone else is doing'.

---

i'm thinking that for stack addressing, we just want two components, instead of 4 like i currently have in boot_reference:

This allows addressing of up to 4 pointers and up to 4 int16s (or 2 int32s).

The small number of things that can be addressed reduces the usefulness of this addressing mode.

---

want to reserve a large number of opcodes for extensions. This allows HLL VMs to implement their high-level instructions as extensions, and use the same instruction encoding and instructions as Boot to represent simple stuff.

---

RISC-V has an interesting way of encoding instructions >32bits. 32-bit instruction start (end?) with 11, and then next 3 bits are special; >32-bit instructions have 111 in those next 3 bits, and 32-bit instructions have anything of the other 7 possibilities there. So, by not making a rigid distinction b/t format encoding bits and other bits, we can make available more space for 32-bit instructions at the expense of less for >32-bit instructions, which makes sense b/c bits are cheaper for the larger instructions.

we should consider doing this.
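a simplified sketch of that length-decoding scheme (the 48-bit case is a stand-in; the real RISC-V scheme further distinguishes 48/64-bit and longer encodings):

```python
def instr_length_bits(first_byte):
    """Length decoding in the style described above: low two bits
    != 0b11 -> 16-bit instruction; == 0b11 with the next three bits
    != 0b111 -> 32-bit; otherwise a longer (>32-bit) instruction."""
    if first_byte & 0b11 != 0b11:
        return 16
    if (first_byte >> 2) & 0b111 != 0b111:
        return 32
    return 48   # really ">32"; the full scheme subdivides further

assert instr_length_bits(0x00) == 16
assert instr_length_bits(0x03) == 32    # ...00011: 32-bit
assert instr_length_bits(0x1F) == 48    # ...11111: >32-bit
```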

---

risc-v uses a little-endian encoding which is why the opcode is in the LSB, not the MSB. i guess we should do the same (see section 1.5, page 10, of https://github.com/riscv/riscv-isa-manual/releases/download/draft-20211216-5651528/riscv-spec.pdf )

risc-v puts rd (destination operand) next to the opcode, providing support for our op0, op1, op2 (but why did they do it that way?)

can't say that i understand why risc-v reorders the immediate bits sometimes, eg section 2.5 page 21 diagram below "Plain unconditional jumps (assembler pseudoinstruction J) are encoded as a JAL with rd=x0.".

---

i'm liking the idea of having base_reg+offset_const+index_reg*scaled_const via just specifying one register (base), and mandating that the index register be the next register. Now this takes up 7+offset_bits bits instead of 12+offset_bits.

or, specify zero registers, and have 3 sets of pairs of registers be base_reg and index_reg for each of op0, op1, op2. Now for each of the three operands, have a 2-bit scale immediate constant, and a 5-bit offset/displacement immediate constant. The only problem is that that's 7*3=21 bits for all three operands; there aren't any bits left to say which operands are using this addr mode and which are ordinary register-mode operands. And since there's 3 of them, that's 8 combinations, which is too many to fit into the opcode. So i guess it won't work.
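a sketch of the first (implied-pair) idea (hypothetical helper; Python just for illustration):

```python
# The operand names only the base register; the index register is
# mandated to be the next register, so it costs no encoding bits.
def effective_addr(regs, base_reg, offset, scale):
    index_reg = base_reg + 1        # implied, not encoded
    return regs[base_reg] + offset + regs[index_reg] * scale

regs = [0] * 32
regs[5] = 0x1000    # base in r5
regs[6] = 3         # index automatically taken from r6
assert effective_addr(regs, base_reg=5, offset=8, scale=4) == 0x1000 + 8 + 12
```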

---

maybe we should go back to the goal of "wasm, but with gotos, and fixed-length encoding" or "risc-v, but slightly simpler".

maybe BootS is close to perfect, even though it wastes a bunch of bits with 8-bit register operand encoding. Also, BootS doesn't have 2 format bits; if it did, there would only be 6 bits for the opcode, which isn't enough.

---

if opcode is 8 bits and operands are 5 bits then each op has three bits left -- use the three bits for the destination operand for global stuff: two format bits and one immediate bit. There are now six addr mode bits left. One of them is the immediate bit for the other input operand. Now there are five left for actual address modes and/or augmenting the number of bits in the input operands in some of these modes.
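the arithmetic here, spelled out:

```python
# 32-bit budget for the layout above
opcode_bits  = 8
operand_bits = 3 * 5     # three 5-bit operand fields
spare_bits   = 3 * 3     # three leftover bits beside each operand
assert opcode_bits + operand_bits + spare_bits == 32

# the destination operand's 3 spare bits: 2 format bits + 1 immediate bit;
# of the remaining 6: 1 immediate bit for the other input operand,
# leaving 5 for actual addr modes
assert spare_bits - 3 == 1 + 5
```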

---

how long can RISC-V opcodes get?

riscv r type instructions have effectively 15-bit opcodes (seven bits in the "opcode" field minus two format bits, because the 2 lsb bits in 32-bit instructions must be 11, plus 3 bits in the "funct3" field plus 7 bits in the "funct7" field)

---

we probably want to avoid such 'variable width opcodes', but i'm not sure; mb have the trick where if the instruction needs less operands, one operand can be an additional opcode

---

we want to leave half the opcodes for custom extensions; so the msb opcode bit should always be 0 in the defined instructions. So we only have 128 instructions, max, to specify. Should custom instructions be able to have their own custom instruction encoding too (e.g. to leave room for stuff like variable width opcodes)? Maybe; but the instruction length should be unchangeable/specified (a custom extension can use the custom instruction length if it wants longer instructions, or something else weird).

---

all immediates are sign-extended

---

so we are looking at:

out of the 21 remaining bits, if the global immediate bit is set:

out of the 21 remaining bits, if the global immediate bit is not set:

the question marks are because we might want to let the local immediate flags indicate just a 5-bit immediate, in which case we have more bits remaining; or we might not want local immediate flags at all

note that risc-v has 12-bit immediates for a variety of instructions.. so maybe if the global immediate bit is set for a 3-operand instruction, we want: 11-bit immediate, two 5-bit operand data fields

all immediates are sign-extended

---

so how about:

out of the 21 remaining bits, if the immediate bit is set:

out of the 21 remaining bits, if the immediate bit is not set:

when there are bits remaining, they are used for addr modes, which might include:

(note: immediate mode is already covered via the immediate bit; note that only one operand per instruction can have a non-default addr mode, so we don't need to provide a way for one operand to be immediate and another one to be in a non-default addr mode; may want to revisit this however, as stuff like add(immediate_constant, register_indirect) would be useful; could provide that sort of thing only for smaller immediates, in which case we would burn one bit to indicate whether the other operand is immediate when the 'immediate' bit is not set, as we'd have 10 and 5 bits remaining, respectively)

ideas for ?? are:

notes:

32 regs:
4 caller-saved GPRs
4 callee-saved GPRs
4 caller-saved smallstack regs
4 callee-saved smallstack regs
8 magic regs (zero reg, memstack (stack ptr), 2 push/pop to smallstacks regs, pc, mb some of: global ptr, thread ptr, frame ptr; ?todo look at what we used to have for special regs, look at what risc-v, x86, amd32, amd64 has?)
8 unused (the idea here is to allow the implementation to use these internally and to be able to use this same instruction format for an intermediate representation in which it references its internal registers; in addition, this eases the memory burden on machines without enough registers to represent all the regs that we demand)

---

arm64: "Has dedicated zero or stack pointer (SP) register (depending on instruction)."

--

should mb have a way to encode LEA (base + scale*index + immediate_offset)

hmm... i can encode LEA with an 8-bit displacement if there is no destination register specified

---

mb the reason riscv has that weird encoding with split immediates is that it wants one register operand data field in each byte? almost; rd and rs1 are both at the beginning of a byte, but rs2 is sometimes next to rs1, at the end of that byte

---

should have an addressing mode that is indexed address mode but uses pre-specified registers for base and index and scale, and where the operand data is used for an immediate displacement; also one with no displacement but index in the operand; etc

---

should we have a few prefix bits in the opcode that specify the instruction format?

actually i think our old system was fine; have the instruction format be determined by contiguous ranges in the instruction encoding (possibly still constrained to fall on even boundaries so that you can determine the instruction format from a prefix of the full opcode bits). That way, formats that don't have many instructions (like the one-big-immediate format that relative jumps use) don't take up much space.

---

and we might want to just do the usual thing and wrap the addr mode into the opcode

and in that case we can have various instruction formats; various indexed & displacement formats; a four operand format; a format with a long opcode; etc

mb this time we should just start with sticking to the basics (eg addr modes/immediates similar to what everyone else is doing, instead of immediately jumping in with my own silly ideas)

---

are we giving up on sz 1 int32s?

what about the idea of defining structs? that seems too advanced for assembly language tho.

maybe back to using 5 bits for: int8, int16, int32, int64, ptr

and when we have 4 bits (eg a scale constant), we need to include ptrs. Ideas:

2,4,8,ptr
1,4,8,ptr
1,2,4,ptr

eh, mb only int8 and ptr. Let implementations over HLLs suffer. Also, even a Python implementation could use bytestrings for memory.

---

also need register (not immediate) displacement in index addr mode for computed struct size (must be computed bc pointer size agnostic)

what i mean is, we need an addressing mode like displacement addressing but with a register holding the displacement rather than an immediate. altho, i guess that's the same as indexed addr mode, so mb don't bother

---

even wasm already uses more than 128, and almost 256, opcodes

---

i guess the reason for the popularity of stack machines is not really the efficiency (although the theoretically higher code density is sometimes cited), but rather the simplicity of not having registers (-> not having to have operands in your instruction encoding; not having to fix a number of registers; letting different instructions have different numbers of inputs and outputs). If you're just trying to formally define a very simple VM this seems simpler. Also, when you use registers you do sometimes feel like you are doing something that could be better expressed with a stack (e.g. call instruction F with arguments A,B,C,D).

---

so maybe:

the caller-saved ("temporary") stack can't be popped (it's just a circular buffer that goes in one direction only); the idea is that this is just used for temporaries, including passing inputs to/receiving outputs from instructions when you don't have enough operands, either because the instruction takes more than 2 inputs or yields more than 1 output, or because you are using a compact encoding.

by contrast, the callee-saved stack is poppable and even has a way to push something onto the bottom of the stack at the same time that you are popping something off the top of it (this allows for the callee to restore the state of this stack before exiting). This stack is more suitable for stuff like expression evaluation, or for calling subroutines up to a fixed level (eg special assembler macros)
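a sketch of that pop-with-bottom-push behavior (hypothetical 4-deep stack; a deque stands in for the circular buffer):

```python
from collections import deque

class CalleeStack:
    """Poppable fixed-depth smallstack where a pop can simultaneously
    push a value onto the *bottom*, letting a callee rotate its saved
    state back into place as it unwinds."""
    def __init__(self, depth=4):
        self.d = deque([0] * depth, maxlen=depth)   # d[0] is the top

    def push(self, v):
        self.d.appendleft(v)        # the deepest item falls off

    def pop(self, refill=0):
        top = self.d.popleft()
        self.d.append(refill)       # new bottom entry
        return top

s = CalleeStack()
s.push(1); s.push(2)
assert s.pop(refill=9) == 2
assert list(s.d) == [1, 0, 0, 9]    # 9 entered at the bottom
```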

---

"You can’t copy code with memcpy; code is more complicated than that" https://devblogs.microsoft.com/oldnewthing/20211229-00/?p=106061

mostly talks about code not being position-independent

---

4 archs that a linux kernel contributor (Ingo Molnar) built for first are:

ARM64, X86 and then MIPS, Sparc [10]

---

" My roommate had an assembler program for the computer, along with a manual. I pored over them and because BASIC has such a paucity of features, it was not too difficult to convert the lexer to assembler, though it took me a few days.

But I had to call this routine from BASIC. That’s an odd combination of CLEAR (reserve memory), DATA (define your program), POKE (put your program in memory), DEFUSR... (define the user routine used to call your program) and USR... (call the user routine) commands. Here’s a small program to print out the ASCII value of whatever character the user typed (from the user manual ) (warning: pdf): " [11]

---

" Since its initial release, Go has used a stack-based calling convention based on the Plan 9 ABI, in which arguments and result values are passed via memory on the stack. This has significant simplicity benefits: the rules of the calling convention are simple and build on existing struct layout rules; all platforms can use essentially the same conventions, leading to shared, portable compiler and runtime code; and call frames have an obvious first-class representation, which simplifies the implementation of the go and defer statements and reflection calls. Furthermore, the current Go ABI has no callee-save registers, meaning that no register contents live across a function call (any live state in a function must be flushed to the stack before a call). This simplifies stack tracing for garbage collection and stack growth and stack unwinding during panic recovery. " -- https://go.googlesource.com/proposal/+/refs/changes/78/248178/1/design/40724-register-calling.md

" > A recent improvement was in version 1.17, which passes function arguments and results in registers instead of on the stack. This was a significant improvement for GoAWK’s? old tree-walking interpreter, where I saw a 38% increase in speed on one microbenchmark, and a 17% average speedup across all microbenchmarks" -- https://benhoyt.com/writings/go-version-performance/ via https://news.ycombinator.com/item?id=30201969

nappy-doo 3 hours ago

We published a 5% win, but it was really closer to 10% (for amd64). We were conservative in our marketing. The arm64 register ABI (which will be in 1.18 [March]) is showing performance improvements of ≥ 15%. (Rob Pike saw 20% on some of his stuff.)

EdiX 11 hours ago

> I find that surprising that Go did not use registers for its (internal) calling convention until 1.17. And I also find it surprising that they only expected a 5% boost from it.

It's because there's no difference for a normal non-leaf function between a stack calling convention and a register calling convention. If 'f' is a non-leaf function either the caller of 'f' will have to put f's arguments onto the stack (stack calling convention) or 'f' will have to save its argument registers onto the stack before doing its calls (register calling convention).

Register calling conventions end up being a big win if you have lots of non-leaf functions that can forward their arguments (rare) or lots of leaf functions that can't be inlined (also rare). Some programs will get a big boost out of it but most will only get a small one.

---

stack guard pages

https://devblogs.microsoft.com/oldnewthing/20220203-00/?p=106215 https://lobste.rs/s/ssdxwt/closer_look_at_stack_guard_page

---

ok if we really just want to be a simple, target-independent assembly (simpler and slightly lower-level than LLVM) then instead of having 32 registers, or instead of LLVM's infinity registers with an encoding using arbitrary-length integers, we could just use 65536 registers. This means that an instruction with 3 register operands would need at least 6 bytes to encode (round up to 8). So, crazy inefficient encoding, but simple, and more platform-agnostic than 32 regs.

somewhere in the middle, we could have 256 registers, so at least 4 bytes per instruction. In terms of encoding size, it's no better than other 32-bit encodings that we've explored, and there's no room left for addressing modes, and if we have instruction format bits then there's not much room left for opcodes.

the downside of having an overwhelming number of registers is that even on targets that provide lots of registers, you have to do a register allocation pass, so the fastest possible interpreter (or compiler) becomes a little slower.

another similar goal is for the representation of instructions in the toolchain to be large enough so that every component of every instruction can be naturally represented (rather than having to be unpacked), and so that all instructions are the same size.

For this goal, even if you have 2 bytes per operand, that's enough to specify a register, but not enough for the >16-bit constants and jump offsets found in many real-world ISAs. 32-bit jump offsets are probably enough for most purposes, which suggests a 128-bit instruction encoding, but if you really want to be completely general you may want 64-bits for each operand, which suggests a 256-bit encoding. This also allows you to encode 64-bit immediate constants, which is pretty general. If most instructions could be encoded in one byte then this is approximately a 32x bloat!

In a 256-bit encoding, if you have 3 64-bit operands, that leaves 64 bits (8 bytes) for other stuff like instruction encoding format bits, opcode, addressing modes. It's easy to imagine that you might want more than 256 opcodes but hard to imagine that you would want more than 65536 opcodes so let's give that 2 bytes. We want at least some instruction encoding format bits, so let's give that 1 byte. We want at least some addressing mode bits for each of 3 operands, so let's give that 3 bytes. We have 2 bytes left to reserve for future use, or to reserve for the toolchain or to reserve as implementation-dependent.

Even this format would still require multiple subformats if you want some instructions to support four operands (for example, you could have a CAS (compare-and-swap) instruction with one output operand and three input operands).
An obvious choice for a subformat would be to break one of the 64-bit operands into two 32-bit operands, and to break the corresponding addressing mode byte into two 4-bit addressing modes, too (presumably in the case of CAS you'd want the output operand and the pointer operand to be the small ones, so that you could have 64-bit constants for the two old/new value operands). Generalizing this by adding two more subformats (for a total of 4 subformats), you could allow up to 6 32-bit operands, each with a 4-bit addressing mode. This skirts the goal of having each component be 'naturally' represented, though, because now multiple addressing modes are packed into one byte.
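one way to lay that budget out concretely (the field order is my own hypothetical choice; only the byte counts come from the paragraphs above):

```python
import struct

# Hypothetical fixed 256-bit interchange record: 2-byte opcode,
# 1 format byte, 3 addressing-mode bytes (one per operand),
# 2 reserved bytes, then three 64-bit operands.
FMT = "<H B 3B 2x 3Q"            # little-endian, 32 bytes total
assert struct.calcsize(FMT) == 32

def encode(opcode, fmt, modes, ops):
    return struct.pack(FMT, opcode, fmt, *modes, *ops)

def decode(rec):
    f = struct.unpack(FMT, rec)
    return {"opcode": f[0], "fmt": f[1], "modes": f[2:5], "ops": f[5:8]}

rec = encode(0x0123, 0, (1, 2, 3), (7, 8, 2**40))
assert decode(rec) == {"opcode": 0x0123, "fmt": 0,
                       "modes": (1, 2, 3), "ops": (7, 8, 2**40)}
```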

(note that i also like my other solution, though, of having smallstacks, and using those to represent inputs and outputs to instructions that need more than 3 operands)

as a bonus, with the bits in 3 64-bit operands, we could support up to 12 16-bit operands, 24 8-bit operands, or 32 6-bit operands -- enough for a CALL instruction that can take a lot of inputs and outputs in one instruction. This would allow us to encapsulate platform-specific calling conventions in a simple way, without a multi-instruction 'calling sequence' involving things like pushing onto a stack (at least, not unless the function being called has more arguments than our limit). If we use 32 6-bit operands or some huge number like that, we may even want to segregate the operands into input-only, output-only, and in-out, to make things clearer for analysis tools and to enable more generic interop between platform calling conventions that distinguish between these things. But if we use just 12 16-bit register operands, then we could have 1 or 2 output operands and the rest in-out (input) operands. One issue with this is that there are no addressing modes -- so if the actual platform calling convention has arguments on the stack, then here we would be pulling arguments off the stack into registers for the sake of our CALL instruction semantics, just to put them back on the stack in the implementation of the platform-specific calling convention. This suggests that maybe we should have some addressing modes within our CALL instruction, which would reduce the number of operands supported. For example, with 4-bit addressing modes, we could support only 9 16-bit operands, not 12 (9*16+9*4 = 180). That's still pretty useful, though. If we had only one output operand, we have 8 input operands here.

also, we have more than enough bits to have more complex addressing modes that split their operands, like base_register+immediate_displacement+index_register*immediate_scale.

this is too bloated to be the actual instruction set for a performant language implementation or tooling, but it could be great as a platform-independent baseline 'interchange language' for communicating sequences of instructions between various tools. Such tooling wouldn't have to implement, or link to libraries implementing, more complex formats such as LLVM's, which utilizes variable-length integers. Nor would the code being worked on have to go through transformations which lose 'high-level intent' or which unnecessarily muddy the code by, for example, doing a superfluous register allocation just to satisfy the interchange language's limit on number of registers, or unpacking a single complex addressing mode calculation, or large immediate constant, or function call, into multiple instructions, just to have something later on in a tooling pipeline try to reconstruct these things to take advantage of platform-specific features. With ~65536 registers, it allows high-level languages to translate functions (without unreasonably many variables) directly without doing register allocation or SSA, or it can represent (reasonably short) functions or basic blocks in SSA form. More compact assembly formats could specify their semantics just by giving their translation into this bloated interchange format.

You could have most of these advantages with a less-bloated format if you were willing to accept variable-length instructions. The usual rationale for fixed-length instructions, performant hardware decoding, probably doesn't apply here because the bloatedness probably swamps those gains. But perhaps with huge amounts of instruction-level parallelism (ILP) it would still be worth it, since fixed-length instructions can be decoded in parallel. And i suspect there are further advantages for tooling implementation simplicity in having fixed-length instructions; for some tooling you would ultimately want a table that lists all instructions in sequential order and indexes them by their place in the sequence ("instruction #"); in which case, without fixed-length instructions, you would have to either maintain and pass in as an input, or build inside the tooling, a lookup table that maps instruction # to where that instruction starts (or some more complex thing, like a table that tracks the location of every 1024th instruction, from which you dynamically find the locations of other instructions as needed by going to the previous 1024th instruction and decoding forward) -- a fixed-length instruction format removes the need for that complication. In some applications, the time/memory saved by not having to construct/maintain that other stuff might even make the bloating almost worthwhile, making this relatively performant after all.
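The 'every 1024th instruction' lookup scheme can be sketched like this (instruction lengths are given as a plain list here; a real decoder would derive each length from the format byte):

```python
def build_anchors(lengths, stride=1024):
    """Record the byte offset of every stride-th instruction.
    `lengths` is the byte length of each instruction, in order."""
    anchors, off = [], 0
    for i, n in enumerate(lengths):
        if i % stride == 0:
            anchors.append(off)
        off += n
    return anchors

def offset_of(lengths, anchors, idx, stride=1024):
    """Find instruction # idx: jump to the nearest earlier anchor,
    then decode forward to the requested instruction."""
    off = anchors[idx // stride]
    for i in range((idx // stride) * stride, idx):
        off += lengths[i]
    return off

lengths = [1, 2, 4, 8, 2, 2]          # toy variable-length stream
anchors = build_anchors(lengths, stride=4)
assert anchors == [0, 15]              # 1+2+4+8 = 15
assert offset_of(lengths, anchors, 5, stride=4) == 17
```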

for applications in which fixed-length is really not worthwhile, though, you could have a variant of this format in which the instruction encoding format bits also specify how many operands there are and how large they are -- this 'self-describing' format makes it easier for generic tooling to do stuff with it without knowing how many operands each opcode takes and how each addressing mode works. Note that the fixed-length instruction encoding is then just a special case of this variable-length variant, where the numbers and lengths of operands are all maxxed out. Hmm, i like that. Not sure if you could/want to fit that into the same byte that we gave to instruction encoding, above; we want at least 4 bits for my prefix-encoding-of-(fixed-length-8-16-32-64)-or-larger format choice, which would leave only one bit for each operand plus one more bit; and for each operand, we probably actually want to select between each of 0/4/8/16/32/64 bits, or similar. But 3 bits for each of 3 operands is 9 bits, which is unfortunate, so i guess we don't actually want to support that much choice. Anyways, it looks like we want a whole byte for this. So maybe that consumes one of the 2 bytes we had left over, above. OR, we could use the addr mode bytes for this, assuming we want to burn a whole addr mode byte even for operands that aren't actually present.

so what we are looking at is:

format encoding:

first byte: format encoding
  4 bits: all 1s (zeros here indicate a fixed-length encoding of 8, 16, 32, or 64 bits using a prefix encoding)
  4 bits: all 0s (other values RESERVED for now; some of these we want to make implementation-dependent, though; and probably want all 1s to be ILLEGAL)

second byte: variable-length encoding? or is that in the addr mode bytes? TBD

third byte, 4th byte: opcode

are the following always present, or not, if variable length?
fifth byte: addr mode for operand 0
sixth byte: addr mode for operand 1
seventh byte: addr mode for operand 2
eighth byte: RESERVED

next 8 bytes: operand 0
next 8 bytes: operand 1
last 8 bytes: operand 2
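Laid out as bytes, that 256-bit format packs with Python's struct module (this takes 'first 4 bits' to be the low nibble, so the format byte is 0x0F, and leaves the TBD second byte and RESERVED byte zero -- all assumptions, not settled spec):

```python
import struct

def encode_bloated(opcode, modes, operands):
    """Pack the 32-byte fixed-length instruction sketched above:
    format byte 0x0F, a TBD second byte, a 16-bit opcode, three
    addr-mode bytes, a RESERVED byte, then three 64-bit operands."""
    assert len(modes) == 3 and len(operands) == 3
    return struct.pack('<BBH3BB3Q', 0x0F, 0x00, opcode,
                       *modes, 0x00, *operands)

insn = encode_bloated(0x1234, [1, 2, 3], [10, 20, 30])
assert len(insn) == 32  # 256 bits
```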

---

if we want to save by not having addr modes for non-existent operands, there's a simple solution:

i think that's better.

actually, we can do better; if only one operand is absent, it has to be operand 2, right? so actually we just have a choice between 0,1,2. So two bits are enuf, we don't need three. well, no, what if we want MORE operands, as in our CALL proposal? We want up to 8 or even 9, right? Well.. if we have 9, and we have to specify the sizes of all those, then we can't fit those sizes here. So, we have some choices here. Also, if we are allowed to have 8 64-bit operands, then the 256-bit fixed-size encoding is not actually maxxed out. That's probably fine, though.

so what we are looking at is:

byte 0 (format encoding): all zeros or all ones in this byte are RESERVED
  bits 0 thru 3: all 1s (zeros here will indicate a fixed-length encoding of 8,16,32,64 using a TBD/RESERVED prefix encoding)
  bit 4: 0
  bits 5 thru 7: operand count

bytes 1, 2: opcode

bytes 3, 4, 5: for the first three operands, if present:
  4 bits?: addr mode
  4 bits?: their size

byte 6: for the rest of the operands, if present:
  4 bits?: addr mode
  4 bits?: their size

byte 7: RESERVED

i don't actually like this, though, because there's no need to fix the number of variable-sized operands at 3 if there is a variable number of operands; in that case the header isn't necessarily 64-bit aligned anyhow, b/c we are already considering leaving out some of those addr mode/size bytes.

we could have two counts (the number of operands with variable size, then the number of operands with a given size). And/or we could even have an arbitrarily large number of operands, with a sentinel terminator.

we're kinda getting into the weeds here tho, given that we don't actually want to support this anytime soon, just the fixed length one. I guess it's good to think it out, though.

---

ok so in the simple variable length format considered above, i guess if we are having choices between operand lengths under 64 then we're not having 64-bit alignment anyways. So we can do:

byte 0: all zeros or all ones in this byte are RESERVED
  bits 0 thru 3: all 1s (zeros here will indicate a fixed-length encoding of 8,16,32,64 using a TBD/RESERVED prefix encoding)
  bits 4, 5: 0
  bits 6, 7: operand count (0,1,2,3)

bytes 1, 2: opcode

now one byte for each operand:
  4 bits?: addr mode
  4 bits?: their size

a problem with this is that when the operand count is 3, the header ends up being 6 bytes, not 8, so even though we have 3 64-bit operands, we are not 64-bit aligned at all, which isn't what we want for the fixed-length 'bloated' representation.
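The alignment complaint is easy to see numerically (header = 1 format byte + 2 opcode bytes + one addr-mode/size byte per operand):

```python
def insn_length(operand_sizes_bits):
    """Total bytes for the variable-length format sketched above:
    byte 0 (format), bytes 1-2 (opcode), one addr-mode/size byte
    per operand, then the operands themselves."""
    header = 3 + len(operand_sizes_bits)
    return header + sum(operand_sizes_bits) // 8

# three 64-bit operands: 6-byte header + 24 bytes of operands = 30 bytes,
# not the 32 of the fixed-length 256-bit format, and not 64-bit aligned
assert insn_length([64, 64, 64]) == 30
assert insn_length([64, 64, 64]) % 8 != 0
```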

yeah, i don't think the idea of having the fixed-length 256-bit format be a special case of the variable-length format is going to work. There's just no reason that a variable-length format would need to add those two extra bytes -- it doesn't need to be aligned. Unless you put an alignment descriptor in the variable-length format... 8/16/32/64, two bits. Could use bits 4 and 5 of the format byte for that? but then the all-1s byte would be valid. Also, want some room to encode other formats... Could drop the 64-bit fixed length format. Or could force other formats to use these padding bytes (bytes 6 and 7 in the header)

---

also, for the fixed-length 256-bit 'bloated' format, should look at what a C struct layout in memory would look like for this stuff on a typical platform, and make it that, to enable zero-copy interop with C on at least one platform.
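As a quick check of zero-copy viability, a ctypes mock-up of the 256-bit format (field names invented here) shows that the natural C layout on typical platforms has no padding and keeps the operands 64-bit aligned:

```python
import ctypes

class BloatedInsn(ctypes.Structure):
    """Mock-up of the fixed 256-bit format as a C struct; field names
    are invented. With natural alignment there is no padding: the
    header bytes sum to exactly 8, so the operands start 8-byte aligned."""
    _fields_ = [
        ('format', ctypes.c_uint8),
        ('encoding', ctypes.c_uint8),
        ('opcode', ctypes.c_uint16),
        ('addr_mode', ctypes.c_uint8 * 3),
        ('reserved', ctypes.c_uint8),
        ('operand', ctypes.c_uint64 * 3),
    ]

assert ctypes.sizeof(BloatedInsn) == 32          # 256 bits, no padding
assert BloatedInsn.operand.offset == 8           # operands 64-bit aligned
```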

---

https://wiki.xxiivv.com/site/varvara.html

---

""There are alternative floating point formats out there. Last year we did ... half-width IEEE floating point. But another one that's really popular especially in embedded is bfloat16 for machine learning. We couldn't get to it last year. We're working on getting to it this year," Himelstein said." -- [12]

---

i guess another way to deal with the stack cache would be to simply declare the stack area to be outside of normal memory. That seems a little weird though; eg if you have a string on the stack and you want to call an external function that takes a pointer to a string, you'd have to copy it onto the heap first. So i don't think that's a good idea for a low-level language like this that is meant to run mostly on top of platforms that keep the stack in general-purpose RAM.

---

Ask HN: Recommendation for general purpose JIT compiler mentions high-end or more than once:

one comment says "There are a lot of "simple JIT" libraries for the" (low-latency compilation) "case (you just want to feed in a simple IR and get machine code out, do an okay job at register allocation, but nothing fancy). None of them has "won" and most only have C bindings (to my knowledge).". This rings true to me.

i think one thing to keep in mind is that, esp. to start with, for oot we just want something that makes it easy to use any/all of those as a backend (except i don't care to use graal/truffle/sulong) -- so that whichever one "wins" later on, we'll be able to use it as our backend without fuss.

So especially initially, we just want the sorta-intersection of the instruction sets inside those things. And by now that sorta-intersection is getting pretty clear to me: addition, multiplication, bit-shifting, etc.

But it's not quite that easy. There are two sorts of complications:

Most likely each of the above supports some way for the user to do most of these things (things like: spawn, communicate between, and synchronize threads or processes; call and be called from external processes; emit output, accept input, and open and read and write files). But they most likely have different ways of doing these (eg some of them expose an assembly-like language, and others expose an HLL-like language, with details about calling conventions abstracted (or not) into the HLL), and in most cases they probably expose platform architectural details -- whereas, ultimately, in Oot we want to abstract away from the underlying platform and have a portable VM at some low level. To make this worse, I kind of feel like these 'complications' may be the sorts of things that are sometimes less documented and kinda fiddly.

So this leaves us with lots of questions. The first is probably: at what level do we hide (encapsulate) platform details and have a truly portable VM: Boots, Boot, or OVM? And then, for each complication above (and any others that i didn't list), should we encapsulate it or expose it, and how exactly should we encapsulate it?

?: is boots just for bootstrapping a toolchain, or is all code compilable to boots as a target language?
A: i guess all code should be compilable to boots as a target language -- that way we can truly say that to port Oot to a new machine, you just have to write a Boots implementation

---

https://lobste.rs/s/fvaycc/passerine_programming_language#c_arsq74

---

https://opensource.googleblog.com/2022/06/Vectorized%20and%20performance%20portable%20Quicksort.html

---

RISC-V embedded (RV32E, so works with 16 instead of 32 registers) ABI

https://github.com/riscv-non-isa/riscv-eabi-spec/blob/master/EABI.adoc

"

The number of argument registers is reduced from 8 to 4.

The upper 16 x registers (x16-x31), where present, are all callee-saved. (In UABI, 2 are argument registers and 4 are temporaries.)

The number of temporary registers is reduced from 3 to 2.

" " EABI Name Description Saver x0 zero Hard-wired zero value - x1 ra Return address Caller x2 sp Stack pointer Callee x3 gp Global pointer - x4 tp Thread pointer - x5 t0 Temporary/link register Caller x6 s3 Saved register Callee x7 s4 Saved register Callee x8 s0/fp Saved register/frame pointer Callee x9 s1 Saved register Callee x10 a0 Argument/return value Caller x11 a1 Argument/return value Caller x12 a2 Argument Caller x13 a3 Argument Caller x14 s2 Saved register Callee x15 t1 Temporary Caller x16-x31 s5-s20 Saved registers Callee "

--- 2022-06-06

recall that on most systems (those without type tags in memory), integers and pointers can be coerced by writing and reading main memory. However, a remaining motivation for typed values is to allow portability across systems with different sized pointers, while still allowing code to precisely specify integer sizes. This latter motivation does not by itself justify also having the types of codepointer (label) and handle (opaque), and indeed on contemporary machines codepointers and pointers are the same. So perhaps there should only be integers and pointers, and we should collapse the types of labels and handles into pointers. I sort of recall that the motivation for handles was to abstract over file identifiers, which in some systems are integers and in some are presumably pointers. On the other hand, all four types are useful for the purpose of defining undefined behavior. Also recall there was an earlier effort to have separate register banks for each type. I wonder if we should go back to having separate integer and pointer registers. This is inefficient in that on most platforms these are not actually separate; it does allow us to save a bit in each register specifier in the instruction encoding, however. I'm imagining having 8 integer registers, 8 general-purpose pointer registers, 4 special pointer registers, and two smallstacks of size 4 each, one caller-saved and one callee-saved, with the smallstacks holding both integers and pointers? But for Boots we don't have smallstacks. Also, if an int64 takes 2 int registers then only 4 int64s can fit in registers; but if it only takes one register, then on 32-bit machines we would need 16 platform registers for the eight integer registers, and 16 platform registers for the eight smallstack registers, so the above scheme would take 44 platform registers in total, which is more than the 28 that we want to use.
On the other hand, I think contemporary platforms mostly have separate floating point registers, so maybe we should too, and maybe we don't care if there are only 4 int64s in that case. Could just specify Boots as eight uniform registers of either integers or pointers, where the integers are either 32 or 64 bits. This brings up the question of whether Boot can easily emulate a uniform register machine, and if not, do we care? Can C emulate a uniform register machine in a natural way without undefined behavior? I guess give up on register banking.
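The register arithmetic above, spelled out (assuming int64 values take two platform registers on a 32-bit machine):

```python
# proposed Boots register banks
int_regs, ptr_regs, special_ptr_regs = 8, 8, 4
smallstacks, stack_depth = 2, 4           # one caller-save, one callee-save

# on a 32-bit platform, if every int slot must be able to hold an int64:
platform_regs = (int_regs * 2                      # 16 for integers
                 + ptr_regs + special_ptr_regs     # 12 for pointers
                 + smallstacks * stack_depth * 2)  # 16 for smallstack slots
assert platform_regs == 44                         # over the ~28 budget
```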

so are we giving up on making Boots incrementally expandable into Boot? are we giving up on Boots/Boot ABI compatibility?

consider dynasm syntax for assembly and especially macros

if boots can use only 12 registers, that makes it even easier to implement. Or even 14. Yes, that means we have to do register alloc in between Boot and Boots. If we can get it down to 16 instructions (which would require giving up separate instructions for integers and pointers), then we could have a 16-bit encoding

22-06-07 some conflicting potential design goals:

--- 220607

i think we should:

should we keep code pointers separate, so as to support harvard architectures?

are there really many harvard archs around that we would use?

AVR is modified harvard in that separate instructions are provided for reading/(and writing?) program memory

http://ithare.com/modified-harvard-architecture-clarifying-confusion/ discusses various different things that are called 'modified harvard'. Many things are 'modified harvard' in that there are separate code and data busses, yet still a unified memory space.

it seems like only the tiniest microcontrollers don't have a unified memory space (eg AVR8, PIC18; not ARM Cortex M). And, unless our humble ISA takes over the world, we're just a VM running instructions out of data memory anyways. For the purpose of defining UB we can still require that only pointers to code are treated as code, but i don't think we need to add a new pointer size for code pointers, and i don't think we need to add a new datatype (other than for the purpose of defining UB).

like our decision to go with 32-bit registers because they are common in contemporary stuff, and our decision to require multiply instructions because on things small enuf not to need them you probably aren't using Boot anyways, i think we can say that on most contemporary machines there is a unified program/data address space, so we don't have to worry about supporting differently-sized code pointers or special instructions to read/write code memory. In general, the ARM Cortex is a very popular line of microcontrollers, and the Cortex M0 is the lowest of the line; anything it can afford, we can afford too.

---

some of the sorts of syscalls we want are in here:

https://www.csie.ntu.edu.tw/~cyy/courses/introCS/14fall/lectures/handouts/lec16_OS_4up.pdf

---

yknow, the minimalistic assembly languages with only 16 opcodes (such as nand2tetris's Toy) tend to just have one word length (eg Toy is 16 bits). We want to be more usable than that. I think what we want is like WASM, except with GOTO control flow. So it's okay if we have more than 16 opcodes; which means we will have a 32-bit encoding, because in a 16-bit encoding, if we want 3 operands of 4 bits each, only 4 bits would be left for the opcode.

otoh.. i guess many opcodes don't need all three operands...

but we don't want to do that compress-y thing for Boots, it's too complicated to encode/decode; so that's only maybe for Boot. So for Boot, we'll have a 4-byte instruction encoding, with one byte for opcodes (so 256 opcodes).
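A 4-byte encoding with a one-byte opcode and three one-byte operands packs and unpacks trivially (the byte order here is a guess, not a spec, and the ADD opcode value is hypothetical):

```python
def encode(opcode, a, b, c):
    """One byte each for opcode and three operands -> one 32-bit word."""
    assert all(0 <= x < 256 for x in (opcode, a, b, c))
    return bytes([opcode, a, b, c])

def decode(word):
    return word[0], word[1], word[2], word[3]

w = encode(0x01, 3, 4, 5)   # e.g. ADD r3, r4, r5 with a hypothetical opcode 0x01
assert len(w) == 4
assert decode(w) == (0x01, 3, 4, 5)
```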

---

[13]

---

ok so it looks like that 'multiply step' in Chuck Moore's Green Arrays F18 is just iterated addition:

https://github.com/AshleyF/Color/blob/master/Docs/multiply_step.md
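Iterated addition via a shift-and-add 'multiply step' -- one conditional add plus shifts per bit, in the spirit of the F18 instruction, but not its exact register semantics -- can be sketched as:

```python
def multiply(a, b, bits=18):
    """Multiply by repeating a one-bit 'multiply step': if the low bit
    of b is set, add a into the accumulator, then shift a up and b down.
    A sketch of the idea, not the exact F18 behavior."""
    acc = 0
    for _ in range(bits):
        if b & 1:
            acc += a
        a <<= 1
        b >>= 1
    return acc

assert multiply(1234, 567) == 1234 * 567
```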

---

"A large proportion of microcontroller models available on the market still have less than 64 kB of flash and less than 2 kB of RAM...All the microcontrollers I’ve worked with in my career as a firmware engineer have had ≤ 16kB RAM...Microvium compiled with the same settings compiles to just 8.5kB....Microvium takes 54 bytes for the kernel state...All of the sizes quoted in this post are when targetting the 32-bit ARM Cortex M0 using GCC with optimization for size." -- [14]

---

a comment on QBE; we dont have this problem, but its good to keep in mind:

---

tail calls in wasm:

---

wasm syscalls(?) in golang-on-wasm implementation: wasmMove wasmZero wasmDiv wasmTruncS wasmTruncU exitThread osyield usleep currentMemory growMemory wasmExit wasmWrite nanotime walltime scheduleCallback clearScheduledCallback getRandomData -- https://github.com/neelance/go/blob/13b0931dc3fa8c8a6ab403dbdae348978a53c014/src/runtime/sys_wasm.s

the golang-on-wasm implementation appears to have 16 integer registers, 32 float registers, plus an "SP" register, a "g" register, and an "SB" pseudo-register: https://github.com/golang/go/blob/9012996a9a127fd566c72baac5e7d8ba45b4865e/src/cmd/compile/internal/ssa/gen/WasmOps.go

SP is presumably the stack pointer, according to https://go.dev/doc/asm g is the goroutine structure, and "the SB pseudo-register can be thought of as the origin of memory, so the symbol foo(SB) is the name foo as an address in memory. This form is used to name global functions and data."

that document also says "The set of pseudo-registers is the same for all architectures:

    FP: Frame pointer: arguments and locals.
    PC: Program counter: jumps and branches.
    SB: Static base pointer: global symbols.
    SP: Stack pointer: the highest address within the local stack frame."

---

[16] [17] [18] discuss a fear about RISC-V's vector ISA: the mask bits for predication of the vector operations are stored in v0 (and if that's not enough bits, v1 etc), which are ordinary vector registers, and the poster is concerned that having dual uses for vector registers (that is, as normal vector registers, or as holding mask bits) may cause a perf problem for high-end implementations. Various contrary replies from the RISC-V folks say that it's not clear how much that actually hurts perf, and that they have a strong preference for less architectural state, which outweighs this.

one takeaway for me is that predication (via mask registers) is pretty important for these vector ops and we should support that if we ever have a vector extension. I didn't realize that was so important.

my guess is that, if we ever get around to vector ops in the ISA, we should have predication, and we should have a separate mask register (all agree that one mask reg would be sufficient, although there are ISAs with multiple mask regs), as it seems more easily optimized (for similar reasons that i don't think the PC should be a GPR).
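Predication via a mask register just means each lane's write is gated by a mask bit; a minimal model:

```python
def masked_add(dst, a, b, mask):
    """Per-lane predicated add: lanes whose mask bit is 0 keep their
    old destination value (one common masking policy; other ISAs
    instead zero the inactive lanes)."""
    return [x + y if m else d
            for d, x, y, m in zip(dst, a, b, mask)]

assert masked_add([9, 9, 9, 9], [1, 2, 3, 4], [10, 20, 30, 40],
                  [1, 0, 1, 0]) == [11, 9, 33, 9]
```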

---

i guess the 4 main pseudo-registers in the golang assembly suggests that exposing 12 registers in Boots is a good choice. I am still worried about maybe having 2 more hidden regs to allow the implementation to have some scratch registers (so exposing 10 instead of 12) -- also golang also has a 'g' register which stores the goroutine struct, whatever that is, but i'm guessing it's probably local thread state/TCB. But i also remember that other study that said that about 12 registers are usually needed for efficient general-purpose computation.

so maybe i should lean towards 10, not 12, under the idea that the point of Boots is to make it dead easy to (efficiently) implement, and that this is a higher value than making the whole system performant.

i dunno. I'm leaning towards 12. Inside Boots the Boots program is probably going to have to have its own versions of all that stuff (PC, stack pointer, frame pointer, base pointer, local thread TCB, scratch regs for the compiler), which could leave very few regs for actual variables in the program being executed.

This also makes me think that having 32 regs in Boot (not Boots) is a good idea, so that after all this overhead there are still at least 16 GPRs available to program variables.

---

" The arm64 and x86 architectures enable coarse-grain protection of the forward edges using a dedicated BTI (arm64) and ENDBR{32,64} (x86) instructions, which create a "landing pad" for indirect branches. When an indirect branch is executed, the CPU will check for the landing pad at the target address; if a landing pad is not found, the CPU will trap. "

" To protect the on-stack copy of the return address, the arm64 and powerpc architectures allow storing a cryptographically calculated hash alongside the address...There are differences in where that hash is stored; on powerpc systems it's saved on the stack next to the pointer it protects, while on arm64 systems the hash occupies unused high bits in the pointer itself. "

"Another way to ensure backward-edge integrity is to use a shadow stack to supplement the regular stack. The shadow stack stores the return address for each function that is called; whenever a function returns, the return address on the normal stack is compared to the address on the shadow stack. Execution is only allowed to continue if the two addresses match."

"Some applications, like GDB and CRIU, need the ability to control the execution of other processes, meaning that they require ways to deal with the shadow stack in a nonstandard way. The GDB debugger, for example, often needs to jump between the stack frames of the program being debugged; it needs to keep the shadow stack in sync with the normal stack as it moves up and down the call chain. "a

---

halt and mwait instructions (for idling): we probably don't need those [19]

---

summary of https://lobste.rs/s/d61s5c/code_density_compared_between_way_too

---

" Every instruction that writes to a register consumes a new rename register. Rename registers are one of the most expensive things on a large core.

This is why complex addressing modes are such a big win: they are effectively one or more arithmetic instructions followed by a load, but the ALUs for the address calculation can sit in the load-store pipeline and the intermediate results are just wires between pipeline stages, they don’t need any control logic in register rename. If you do this as two instructions then you need to allocate a rename register and burn scheduler power to dispatch the second instruction once the first is scheduled. A rename register remains live until an instruction that writes to the same architectural register is either executed in the same basic block or retires out of speculative execution (so that the value is no longer architecturally visible on any path). If the compiler is not able to clobber the register in the second instruction (e.g. if it’s a floating-point load and so ends up in a different register bank) then the two-instruction sequence will increase rename register pressure and hurt performance (it will hurt power either way).

This is part of the reason why server-class x86 chips do better that you might expect on power-performance numbers. " -- [20]

---


AnIdiotOnTheNet on May 17, 2021


> We want to produce lasting versions of our tools and games, and by using simpler systems(Uxn has only 32 instructions) we can build more resilient software due to their lack of dependencies, and support older hardware when possible.

Lately I've been thinking a lot along similar lines, but have come to a wildly different conclusion about how to do it. Of course, they are explicitly limiting their use-cases so that's to be expected. I generally agree with the idea of a VM that is simple to implement, but I think the result shouldn't look like real hardware with memory addresses and registers and a stack and all that, it should be more abstract and leave more implementation details up to the interpreter/jit.

Regardless, it makes me slightly less gloomy about the future of computing whenever I see more people coming to similar conclusions.


megameter on May 18, 2021


My current "VM" inspirations include...regular expressions, constraint logic, modular synthesizers, three-ring binders.

You can go really, really far away from a computing system being a Turing machine with arithmetic, if you work at it - like, that's the thing you already have in every PL. But I think getting away from that also entails getting down and dirty with your chosen benchmark of expressiveness, and accepting that the system does not do some things well, and if you want to have multiple things coordinated, then you need to be careful about expressing that, too, to avoid just making a Turing tarpit.

justanothersys on May 18, 2021


Yeah this is exactly why I came up with whistlegraphs. I was doing a lot of thinking about how to define a VM for dynamic graphics and drawing and then realized I could start to specify things at a much higher level of abstraction before ever needing the computer. (https://www.youtube.com/watch?v=YXUUCkqv2LY)

---

i like uxn but (a) we want to address more than 64k of memory because we want to be able to translate arbitrary conventional programs (to the extent that they are portable), which often need more memory than that, and (b) i think we should aim for a little more complexity than 32 opcodes -- i found it hard to include everything i wanted with that few

---

2208 (aug 2022)

a change: Boots is no longer a completely separate 32-bit encoding split cleanly into 4 bytes; it's just the same encoding but with fewer opcodes and with the addr mode bits always set to zero

---

i removed much of the flexibility/portability relating to integer representation in memory, because i thought it would be rarely used, was adding complexity, and reducing efficiency. Now:

Pointers to data memory are still an implementation-dependent size and have an opaque representation.

Labels are the same size as pointers, but are not interchangeable with pointers to data memory.

---

on platform with tagged pointers, how do you copy memory and what happens if you read a pointer as an int?

https://discourse.llvm.org/t/pointers-are-complicated-iii-or-pointer-integer-casts-exposed/61683/4

---

text removed from bootS reference:

  1. ## Label lines (@) Some lines may begin with the character '@' (a 'label line'). This is followed (without an intervening space) by a 'label name', which begins with an alphabetic character and is followed by one or more alphanumeric characters and/or underscores. The total number of characters in a label name must be 31 or less.

---