proj-oot-ootAssemblyNotes32

so it's been some months. Without quite rereading what i have above, how about, for the 32-bit encoding:

the addr modes would be:

when the output operand has immediate constant mode, that means that:

copied to boot_reference.md

---

result of popping the stack above the frame pointer (stack underflow) is undefined? That helps with specifying the cache behavior, but it makes Forth-style stuff, where there are no boundaries between the data written by different functions, impossible. But I guess you don't want to overwrite the return address on the stack! So Forth-style stuff requires a separate return stack anyway (which is where the frame pointer would go)

what else does Forth do with the return stack?

so, in Forth, there are typically two stacks, one called either just 'stack' or 'parameter stack', and the other called 'return stack'. The parameter stack tends to be much larger than the return stack. The return stack is the one organized with traditional 'stack frames', with return addresses, loop counters, and maybe local variables, so it looks the most like a C stack. Function calling parameters go on Forth's larger 'parameter stack'.
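To make the two-stack picture concrete, here's a toy sketch (class and method names are invented for illustration; this is not any real Forth) of how calls use the return stack while data flows on the parameter stack:

```python
# Toy model of Forth's two stacks (illustrative only).
class ForthVM:
    def __init__(self):
        self.data = []   # parameter ("data") stack: arguments, results
        self.ret = []    # return stack: return addresses, loop counters

    def call(self, target, pc):
        self.ret.append(pc)   # return address goes on the RETURN stack
        return target

    def do_return(self):
        return self.ret.pop()

    def add(self):
        b, a = self.data.pop(), self.data.pop()
        self.data.append(a + b)

vm = ForthVM()
vm.data += [2, 3]       # caller leaves arguments on the parameter stack
vm.add()                # callee consumes them and leaves a result
print(vm.data)          # [5]
```

Because return addresses never sit on the parameter stack, data written by one function can be freely consumed by another with no frame boundaries in between, which is exactly the property the note above says a single C-style stack can't give you.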

---

if, instead of putting the stack at the top of a chunk of memory (growing downward) and the heap at the bottom (growing upward), we allocate a separate chunk for the stack, then we can put the parameter stack at the top and the return stack at the bottom of that chunk

---

link register instead of read-only segment pointer? RISC-V has one but ForwardCom does not. But no one has a read-only segment pointer. Should we require that "on a control flow join the statically determined stack depth has to be the same on all joining control flows", like http://www.complang.tuwien.ac.at/anton/euroforth/ef13/papers/ertl-paf.pdf ? They have tagged GOTOs and tagged labels for this purpose; this allows them to keep the stack in registers and compile away some stack manipulation. i think this is more appropriate for the next level up (LOVM) -- it seems like it would limit efficiently writing something very low-level, like an emulator, but i'm not sure

there seems to be no need to have a callee-save small stack, because: one of the main reasons that the small stack has an advantage over an in-memory stack is that you don't have to pop if it's a ring. But if it's callee-save, then when a subroutine is not a leaf it will have to pop in order to save all that stuff

but... i guess a ring buffer is harder to implement on today's machines than just using an in-memory parameter stack. The special cache can be applied to offsets using either stack or frame pointer as base, and can be applied only to the stack and frame addressing modes, making it simpler to detect

---

I still wonder if we couldn't just somehow use the 16-bit encoding for BootS. I don't really see how we could do it, because if you have 3x 4-bit operands and even just 1 format bit, you only have 3 bits left, enough to specify 8 opcodes. So I guess if you were going this route you would have to reduce the bits in the operands and use special copy instructions to reach most of the registers, and only allow normal instructions to use a few of the registers
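For illustration, here's the bit budget as a pack/unpack sketch: 1 format bit + 3 opcode bits + 3x 4-bit operands = 16 bits exactly. The field order is my assumption, not anything decided above:

```python
# Hypothetical 16-bit layout: [fmt:1][opcode:3][op0:4][op1:4][op2:4]
def encode16(fmt, opcode, op0, op1, op2):
    assert fmt < 2 and opcode < 8 and max(op0, op1, op2) < 16
    return (fmt << 15) | (opcode << 12) | (op0 << 8) | (op1 << 4) | op2

def decode16(word):
    return (word >> 15, (word >> 12) & 0x7,
            (word >> 8) & 0xF, (word >> 4) & 0xF, word & 0xF)

w = encode16(1, 5, 3, 10, 15)
assert decode16(w) == (1, 5, 3, 10, 15)
print(f"{w:#06x}")
```

The point of the sketch is just that with full 4-bit operand fields there is no room left over: 8 opcodes is all you get, which is why the note suggests shrinking the operand fields instead.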

---

i think we want 16 regs, not 32, to minimize architectural state, so that we can have gazillions of green threads?

should we go 16-bit instead of 32-bit? nah.. even popular microcontrollers are 32-bits these days.

for choosing the calling convention/ABI, look at:

risc-v (32 regs, but has 2 embedded ABI proposals for RV32E, which has 16 regs: RISC-V ILP32E, and RISC-V EABI) https://github.com/riscv-non-isa/riscv-elf-psabi-doc/blob/master/riscv-cc.adoc

  16 regs:
    11 GPRs (3 caller-saved temporaries, 2 callee-saved, 6 argument registers)
    zero, return address, stack ptr, global ptr, thread ptr

  "The ILP32E calling convention is not compatible with ISAs that have registers that require load and store alignments of more than 32 bits. In particular, this calling convention must not be used with the D ISA extension." -- why? is this a problem for us?

https://github.com/riscv-non-isa/riscv-eabi-spec : 16 regs: https://github.com/riscv-non-isa/riscv-eabi-spec/blob/master/EABI.adoc

  11 GPRs (4 argument registers (2 of which are also return registers), 2 temporaries/caller-saved, 5 callee-saved), including a suggested link register (in place of a caller-saved/temporary) and frame pointer (in place of a callee-saved register)
  zero, return addr, stack ptr, global ptr, thread ptr

  optional: "If an entire embedded application and its libraries make no use of thread-local storage, the tp register becomes available as a global register or as a temporary register, at the application’s discretion. If the __global_pointer$ symbol is not defined, the gp register becomes available in the same fashion. Using the tp and gp registers in this alternate way is a nonstandard extension to the EABI and might not compose with some EABI libraries."

ARM aarch64 (32 regs)

ARM cortex-m (17 regs: 13 GPRs, stack pointer, link register, PC, Special-purpose Program Status register) https://developer.arm.com/documentation/ddi0439/b/Programmers-Model/Processor-core-register-summary https://en.wikipedia.org/wiki/Calling_convention

  out of the 13 GPRs: 4 argument regs (some reused as return regs), 8 callee-saves, 1 caller-save temporary

forwardcom https://www.agner.org/optimize/forwardcom.pdf

  instruction size of 1, 2, or 3 32-bit words; fully orthogonal; a zillion instruction formats (can each instruction use any format?)
  addressing modes: Address = Base pointer + Index * Scale + Offset
  32 GPRs including a stack ptr, plus IP (pc), Data section pointer (DATAP), Thread data pointer (THREADP), Numeric control register (NUMCONTR)

system v calling convention https://www.agner.org/optimize/calling_conventions.pdf esp. page 10, "6 Register usage Table 4. Register usage"

  not sure, but it looks like we have 16 registers:
    15 GPRs: 6 caller-save argument registers (RDI,RSI,RDX,RCX,R8,R9) (of which 1, RDX, is also a return register), 1 additional caller-save return register (RAX), 2 additional caller-save temporaries (R10, R11), and 6 callee-saves (RBX, RBP, R12, R13, R14, R15)
    plus 1 stack pointer (rsp); rbp is also a suggested frame pointer register
  also, "FS is used for a thread environment block in Windows and for thread specific data in Linux" (by glibc i think?) and GS is used by the kernel for a per-CPU pointer to kernel memory ( https://stackoverflow.com/questions/6611346/how-are-the-fs-gs-registers-used-in-linux-amd64 )

https://refspecs.linuxfoundation.org/elf/x86_64-abi-0.99.pdf https://github.com/hjl-tools/x86-psABI/wiki/X86-psABI

64 bit Windows calling convention https://www.agner.org/optimize/calling_conventions.pdf

  15 GPRs
    4 caller-save argument registers (RCX,RDX,R8,R9), 1 additional caller-save return register (RAX), 2 additional caller-save temporaries (R10, R11), and 8 callee-saves (RBX, RBP, R12, R13, R14, R15, RSI, RDI)
  plus rsp stack pointer
  and i think it uses FS and GS for something too?
    "On 64-bit, GS is used to access the PEB in userland or the KPCR in kernel land in Windows" -- [https://github.com/NationalSecurityAgency/ghidra/issues/1339]
    "The reason Win64 uses GS is that there the FS register is used in the 32 bit compatibility layer (confusingly called Wow64)." -- [https://stackoverflow.com/questions/39137043/what-is-the-gs-register-used-for-on-windows]
    (so, commonalities and differences b/t 64-bit windows and linux:
      commonalities: 4 caller-save argument registers (RCX,RDX,R8,R9), 1 caller-save return register (RAX), 2 additional caller-save temporaries (R10, R11), 6 callee-saves (RBX, RBP, R12, R13, R14, R15) (but RBP is recommended frame pointer), rsp, and FS/GS used by the platform
      differences: RDI, RSI are caller-save argument regs in linux and callee-saves in windows)

so some commonalities, at least over RISC-V EABI, RISC-V ILP32E, 64-bit windows, 64-bit linux:

  - at least 4 caller-save argument registers (in x86-64 there is also a separate return reg but in risc-v an argument reg is reused for this)
  - at least 2 other caller-save temporaries
  - at least 2 other callee-saves (everyone except risc-v EABI has at least 5 tho)
  - at least 1 stack ptr
  - at least 2 other platform ptrs (but risc-v EABI makes these optional) (on risc-v these are GPRs, on windows they are separate special regs)

adding in ARM cortex-M (ARM32):

32 register platforms:

ARM64: https://en.wikipedia.org/wiki/Calling_convention#ARM_(A64)

RISC-V:

forwardcom:

so the intersection of these are:

so i one idea is:

but we dont need argument regs because we can cache the stack?

suggest:

  pc (or zero depending on context, eg the 16- and 8-bit forms?)
  6 caller-save (including err/accumulator)
  6 callee-save
  stack ptr (points within the parameter stack)
  frame ptr (points within the return stack)
  Data section pointer

  parameter stack ptr (callee-save, sorta) (this is the one used by stack addressing mode)
  return stack ptr (callee-save)
  frame ptr (callee-save) (i think this points into the return stack? do we even need a separate return stack pointer?)
  global ptr (callee-save)
  thread ptr (callee-save)
  read-only segment pointer (callee-save)
  result/accumulator/err (caller-save)
  PC (not directly accessible? so really just 15 registers? or, readable but not writable (allows for constant pools in the middle of the code)?)

ok, copied this to a note at the end (currently) of boot_reference.md

---

old idea for addr modes:

the addr modes would be:

new idea:

eh, a problem with this is no predec/postinc. You could use the index reg bit to specify index vs predec/postinc but then no bits for direction of predec/postinc. I guess you could give up the 2 base regs tho. Or actually how about:

i dunno, this is kinda dumb, b/c if the scale is so small then the offset will go into the next array location. But, giving up the 16-bit scale is cool. How about:

eh, this doesn't scale, actually, in the sense that in a larger instruction size you want a linear combination of 8-bit and ptr for your scale. Hmm, if you had at least 4 scale bits it would make sense to have both an index and a displacement. Also i don't think you'll ever want both index and predec/postinc, but you might want neither (ie just the offset). So we have at least:

i don't think this is much better than the original status quo idea. predec/postinc aren't as bit-hungry as indexed, so we don't need to break them out into two separate addr modes. But they may as well have a mode that is separate from indexed, b/c they need a sign and indexed doesn't. For similar reasons, it would be nice if stack and frame were separate, and these are probably common, so it makes sense to give them their own modes (although otoh i doubt stack really needs 4 bits of data). So the remaining difference from the status quo is whether we give an extra bit to 'indexed', or whether we have a separate 'indirect' mode that can take any base reg. I think taking any base reg in indirect is pretty useful.

old:

- the displacement bits are themselves split into two groups, with the first 2 bits indicating a number of bytes (0, 1, 2, 4), and the second indicating a number of ptrs (0 or 1), and these are added together
- the displacement bits signify a choice of: (8-bit, 16-bit, 32-bit, 64-bit, ptr, 8-bit + ptr, 32-bit + ptr, 64-bit + ptr)
- the displacement bits are themselves split into two groups, with the first 2 bits indicating a number of bytes (0, 1, 2, 4), and the third bit indicating a number of ptrs (0 or 1), and these are added together
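As a sketch, here's a decoder for the bytes-(0,1,2,4)-plus-ptr variant of the 3 displacement bits (PTR_SIZE is an assumed platform parameter, not something fixed above):

```python
PTR_SIZE = 8  # assumed pointer width in bytes, e.g. a 64-bit target

def displacement(bits3):
    """Decode 3 displacement bits: low 2 bits pick a byte count from
    (0, 1, 2, 4); bit 2 adds one pointer width. The parts are summed."""
    byte_part = (0, 1, 2, 4)[bits3 & 0b11]
    ptr_part = PTR_SIZE if bits3 & 0b100 else 0
    return byte_part + ptr_part

print([displacement(b) for b in range(8)])  # 0,1,2,4 then each +8
```

So the 8 encodable displacements on this assumed 64-bit target are 0, 1, 2, 4, 8, 9, 10, 12 bytes, which shows both the appeal (pointer-relative slots come for free) and the coarseness that the note below complains about.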

hmm, all these addr modes with so few bits for the offset/scale, is there even any point?

an alternative would be to give up on all of the register choice bits and make the offsets/scales 1 bit larger

old:

old:

old:

old:

---

1*x + 4*y + ptr*z + 1,2,4,ptr (can't express 7, redundant 2)

---

Non-indirect postinc/postdec addr modes are kind of a waste b/c they don't obviate the need for temporary registers, unlike the other addressing modes

---

now i'm not sure BootS is worth it after all.

---

"

z29LiTp?5qUC30n on May 28, 2020


I think current bootstrapping work clearly shows that the Maxwell equations of software are:

   pop rx -> sp--; rx := mem[sp]
   push rx -> mem[sp] := rx; sp++
   sub rx ry -> rx := rx - ry
   jmp rx $I -> if rx is zero jump $I bytes from end of instruction.

One might also argue:

   call rx -> mem[sp] := IP; sp++; IP := rx
   ret -> sp--; IP := mem[sp]
   nand rx ry -> rx := rx nand ry

but those are just optimizations " -- [1]
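A toy interpreter for the four quoted ops, as a sketch: the instruction stream is a list of tuples rather than encoded bytes, and the jump offset counts instructions instead of bytes (both are simplifications of the quoted semantics):

```python
def run(prog, mem, sp, regs):
    """Execute the four 'Maxwell equation' ops from the quote above."""
    pc = 0
    while pc < len(prog):
        op, *args = prog[pc]
        pc += 1
        if op == "pop":                  # sp--; rx := mem[sp]
            sp -= 1; regs[args[0]] = mem[sp]
        elif op == "push":               # mem[sp] := rx; sp++
            mem[sp] = regs[args[0]]; sp += 1
        elif op == "sub":                # rx := rx - ry
            regs[args[0]] -= regs[args[1]]
        elif op == "jz":                 # 'jmp rx $I': jump if rx is zero
            if regs[args[0]] == 0:
                pc += args[1]
    return regs

mem = [7, 3, 0, 0]                       # 7 and 3 pre-pushed on the stack
regs = run([("pop", "r1"), ("pop", "r0"), ("sub", "r0", "r1")],
           mem, sp=2, regs={"r0": 0, "r1": 0})
print(regs["r0"])  # 4
```

Even this tiny sketch shows why the commenter calls call/ret/nand "just optimizations": call and ret are push/pop of the pc, and the rest of arithmetic can be bootstrapped from sub and the zero-test jump.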

---

most of the indirect/index/displacement address modes are pretty useless without more bits. What would it look like if we allowed these to take a 32-bit payload?

how long are other instruction sets instructions?

note that if we allow more than one operand to use indirect addressing, even with no payload, then we're already beyond the pale for a 'real' ISA that would be directly executed by a modern processor (this is "memory-memory" as opposed to "memory-register" and "register-register"; it would be too inefficient/unpredictable/complicated to have instructions which might read and write to up to three different memory locations, depending). So i'm leaning towards saying, whatevs, just consider this a notation for a very-CISC cross-platform assembly language with a relatively simple encoding, and allow up to 3 payloads per instruction (so at most 16 bytes per instruction). In addition, if we relax the requirement that you can figure out the total instruction length from the first byte, and say that we can figure it out from the first 4 bytes (32-bits), then i think we're good to go.

this increases the importance of BootS as an actual RISC-y, simple instruction set, and i think strengthens the argument for BootS to have the same instruction encoding, just with the addr mode bits set to zero or something like that.

---

so, based on the above, my tentative decision is:

---

what size should our stack slots be? probably max(int size, ptr size). But what size int? Should they accommodate at least 32-bit integers, or 64-bit ints?

If 32-bit, then putting a 64-bit quantity on the stack takes two stack slots -- so on a platform with 64 bit ptrs, a stack slot would be 64 bits, and a 64-bit quantity would occupy 128 bits! The benefit of 32-bit is that on a platform with 32-bit ptrs, stack slots are only 32-bits instead of 64-bits
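Quick arithmetic for that trade-off, assuming slot width = max(int size, ptr size) and that a 64-bit value is split into int-size pieces (one piece per slot):

```python
def footprint_of_64bit_value(int_size, ptr_size):
    """Bits occupied on the stack by one 64-bit value."""
    slot = max(int_size, ptr_size)   # slot width in bits
    nslots = -(-64 // int_size)      # ceiling division: pieces needed
    return nslots * slot

for int_size in (32, 64):
    for ptr_size in (32, 64):
        print(f"int={int_size}, ptr={ptr_size}: "
              f"{footprint_of_64bit_value(int_size, ptr_size)} bits")
# int=32, ptr=64 gives 128 bits -- the blowup described above
```

The int=32/ptr=64 cell is the pathological one: two slots of 64 bits each, so a 64-bit quantity costs 128 bits, while choosing 64-bit ints only costs extra on 32-bit-pointer platforms.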

note that some other 64-bit platform ABIs appear to require stack alignment of 64-bits (x86-64?) or even 128-bits (AArch64?):

https://research.csiro.au/tsblog/debugging-stories-stack-alignment-matters/ https://stackoverflow.com/questions/64627897/system-v-abi-amd64-stack-alignment-in-gcc-emitted-assembly https://stackoverflow.com/questions/40305965/does-each-push-instruction-push-a-multiple-of-8-bytes-on-x64 https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/using-the-stack-in-aarch64-implementing-push-and-pop https://github.com/dibyendumajumdar/ravi-ffi/issues/10

a 128-bit alignment restriction is similar to the cost of 128 bit stack slots, i guess.. although not exactly because a smart program/compiler could make use of 128 bits on AArch64's stack to store more than one value

---

old comment that i removed from the BootS ref:

---

yes, little-endian won: https://stackoverflow.com/questions/61701973/does-arm-assume-that-all-cortex-m-microcontrollers-are-little-endian

---

is it better to have stack-offset and frame-offset be different addr modes (allowing access to 16 stack locs and 16 frame locs), or to combine them and have an indirect addr mode (with no displacement, index, post/pre mutation, and no payload)?

having an indirect addr mode would significantly decrease instruction size if vanilla indirect is used a lot (b/c no payload needed), but if stack- and frame- offsets above 8 are used a lot, it would increase instruction size there (b/c would need payload to access 8 thru 16)

RISC vs. CISC from the perspective of compiler/instruction set interaction, Daniel V. Klein (summarized in plChAssemblyFrequentInstructions.txt) suggests that register indirect with displacement is at least as common as register indirect without displacement on many architectures. So maybe skip vanilla indirect.

---

should we make both the argument stack (where arguments and return values are passed, and temporary computation is done) and the return stack (where return addresses, frame pointers, and local variables are held) grow downward, or should one grow down and one grow up?

if one grows down and the other up, that may let us use less memory when you have a zillion green threads, b/c if you allocate too much for one and not enough for the other, they can 'steal' from each other. But if they both grow down, then you can use mmap's MAP_GROWSDOWN on GNU/Linux. Not sure what the situation is on Android, iOS, Windows, macOS.
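A sketch of the 'stealing' idea: one arena, argument stack growing up from the bottom, return stack growing down from the top (class and method names, and the sizes, are all invented for illustration):

```python
class TwoStackArena:
    """One chunk of memory shared by two opposed stacks, so unused
    space in one stack is automatically available to the other."""
    def __init__(self, size):
        self.mem = [0] * size
        self.lo = 0            # argument stack top (grows upward)
        self.hi = size         # return stack top (grows downward)

    def push_arg(self, v):
        if self.lo >= self.hi:
            raise MemoryError("stacks collided")
        self.mem[self.lo] = v; self.lo += 1

    def push_ret(self, v):
        if self.hi <= self.lo:
            raise MemoryError("stacks collided")
        self.hi -= 1; self.mem[self.hi] = v

a = TwoStackArena(8)
for i in range(6):             # one stack may use most of the arena...
    a.push_arg(i)
a.push_ret(99)                 # ...as long as the other stays small
print(a.lo, a.hi)              # 6 7
```

With both stacks growing down instead, each needs its own worst-case allocation (or guard pages / MAP_GROWSDOWN), which is the memory-per-greenthread cost weighed above.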

---

"

The ILP32E calling convention is not compatible with ISAs that have registers that require load and store alignments of more than 32 bits. In particular, this calling convention must not be used with the D ISA extension. "

-- https://github.com/riscv-non-isa/riscv-elf-psabi-doc/blob/master/riscv-cc.adoc

why not? is this a problem for us?

i think it's not. i bet they just decided that the 32-bit and 64-bit RISC-V stuff, including instruction encoding but also ABI, was separate rather than one being an extension of the other.

---

DONE (i modified this a bit, so it's not exactly what i have here; i still have to finish the rewrite from "TODO 220919 REWRITE BOUNDARY HERE" in boot_reference.md; but everything here (as modified) has already been either copied into that file as part of the rewrite, or copied in as notes next to the REWRITE BOUNDARY annotation; so the rest of this section is just for my records. Also, boot_reference.md has already been rewritten to be a subset of this, with register addr mode only and no payloads) so it's been some months. Without quite rereading what i have above, how about, for the 32-bit encoding:

the addr modes would be:

(other ideas i didn't do: ?index into constant table? ?stack offset without push/pop? ?stack but treat any register as stack pointer?)

do we have stack slots of more than 8 bits? (in which case the stack and frame addr modes can be scaled by stack slot size; otherwise they must be in units of bytes)

when the output operand (op0) has immediate constant mode, that means that:

16 registers:

  pc (or zero depending on context, eg the 16- and 8-bit forms? readable, but not writable?)?
  6 caller-save GPRs (including err/accumulator/result reg) (2 of which are eligible 'base pointers' for the complex addr modes)
  6 callee-save GPRs (2 of which are eligible 'base pointers' for the complex addr modes)
  stack ptr (points within the parameter stack)
  frame ptr (points within the return stack)
  data section pointer

(so, decided against: thread ptr; read-only segment pointer; a return-stack pointer in addition to the frame ptr)

the size of slots on the stack is the smallest size needed to fit either of (that is, a union type of):

64-bit integers and 64-bit pointers take up 2 stack slots (even on platforms where pointers, and hence stack slots, are 64 bits)

the operand data in the stack and frame addressing modes ('displacement') are in units of stack slots

reads and writes using the stack and frame addressing modes may access a cached version of the stacks, which is not necessarily in sync with the in-memory version of these same stacks except for at the moment of completion of the appropriate flush command (so i guess there are four flush commands? flush parameter stack from memory to cache, flush parameter stack from cache to memory, flush return stack from memory to cache, flush return stack from cache to memory)
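A toy model of that cache semantics (the class and the flush method names are invented for the sketch; only one of the two stacks is modeled):

```python
class CachedStack:
    """Stack/frame addressing modes hit the cache; the in-memory copy
    only matches after an explicit flush, as specified above."""
    def __init__(self, nslots):
        self.memory = [0] * nslots
        self.cache = [0] * nslots

    def write_slot(self, i, v):      # stack/frame addr mode write
        self.cache[i] = v

    def read_slot(self, i):          # stack/frame addr mode read
        return self.cache[i]

    def flush_to_memory(self):       # cache -> memory
        self.memory = list(self.cache)

    def flush_from_memory(self):     # memory -> cache
        self.cache = list(self.memory)

s = CachedStack(4)
s.write_slot(0, 42)
print(s.memory[0], s.read_slot(0))   # 0 42 -- out of sync until flushed
s.flush_to_memory()
print(s.memory[0])                   # 42
```

The point of the undefined out-of-sync window is that an implementation is then free to keep the hot end of each stack in registers or a dedicated buffer, and only materialize it in memory at the flush points.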

the 2 base registers and 2 index registers are disjoint, in each case with 1 of them caller-save and 1 of them callee-save

note: there's a little redundancy in displacement b/c a zero displacement is the same as indirect on the base reg. But this is worth it for the regularity in the interpretation of the displacement bits, which makes it easier to extend to a 64-bit instruction encoding

---

"Clang seems pretty impressive: given its self-imposed limitation of 2 argument + 8 scratch registers, it manages to use an impressively small number of loads and stores, at a slight penalty to stack space. ... It turns out that Clang only used 10 registers:

    s0 and s1 are function arguments (per calling convention)
    s8 through s15 are local variables
    It refused to use s16 through s32, which are "local variables, caller saved"" -- [2]

---

ideas for more addr modes:

two freedoms that remain to us (eg space to encode additional complexities such as more addr modes) are:

---

" Flags are still there on everything except RISC-V, and even there they exist in the floating-point ISA. They are annoying to implement but they’re a huge win for software. Compilers are really good at if conversion now and you can get the same performance on an architecture with conditional moves as one without if you have about less than half as much branch predictor state. The saving in the branch predictor more than offsets the cost of flags. They’re also essential for any kind of constant-time code to be fast, which is increasingly important with modern systems and side channels. " -- [3]

---

renox 5 hours ago


There's the issue of integer overflow: in Rust, overflows are only detected in debug mode, not in release mode.

The MIPS had 'trap on overflow' integer arithmetic instructions, but sadly RISC-V doesn't have those.


---

"x86 that has seen several major architectural upgrades, without dropping backwards compatibility. For instance the addition of 64-bit integer arithmetic, extra registers (r8-r15) and a new floating-point paradigm (SSE2 replacing x87)" -- [4]

---

"RISC ISA:s have added more powerful instructions. One of the traits of RISC ISA:s is that they have a limited instruction encoding space (whereas variable length CISC ISA:s have an almost infinite encoding space), which incentivizes RISC ISA designers to come up with clever and powerful instructions. While the early RISC ISA:s were pretty bare bones, more recent RISC ISA:s have included some interesting instructions (that never made it into CISC ISA:s for some reason). Among these are bit-field instructions that essentially do the work of 2-4 traditional bit manipulation instructions in a single instruction (as well as remove the need for traditional shift instructions), and integer multiply-and-add instructions (two instructions in one), etc. Another example is clever encoding of immediate values (numeric constants) so that most of the time you do not need to waste four bytes to represent a 32-bit numeric constant, for instance." -- [5]

---

https://ziglang.org/news/goodbye-cpp/ does something somewhat similar to what we want to do for bootstrapping (but not terribly similar)

they compile (a stripped-down for bootstrapping version of) the self-hosting compiler to WASM with WASI. The only WASI calls they need are these:

" The OS interop layer has been completely abstracted into a handful of WASI functions to be implemented in the WASI interpreter:

(import "wasi_snapshot_preview1" "args_sizes_get" (func (;0;) (type 3))) (import "wasi_snapshot_preview1" "args_get" (func (;1;) (type 3))) (import "wasi_snapshot_preview1" "fd_prestat_get" (func (;2;) (type 3))) (import "wasi_snapshot_preview1" "fd_prestat_dir_name" (func (;3;) (type 6))) (import "wasi_snapshot_preview1" "proc_exit" (func (;4;) (type 11))) (import "wasi_snapshot_preview1" "fd_close" (func (;5;) (type 8))) (import "wasi_snapshot_preview1" "path_create_directory" (func (;6;) (type 6))) (import "wasi_snapshot_preview1" "fd_read" (func (;7;) (type 5))) (import "wasi_snapshot_preview1" "fd_filestat_get" (func (;8;) (type 3))) (import "wasi_snapshot_preview1" "path_rename" (func (;9;) (type 9))) (import "wasi_snapshot_preview1" "fd_filestat_set_size" (func (;10;) (type 36))) (import "wasi_snapshot_preview1" "fd_pwrite" (func (;11;) (type 28))) (import "wasi_snapshot_preview1" "random_get" (func (;12;) (type 3))) (import "wasi_snapshot_preview1" "fd_filestat_set_times" (func (;13;) (type 51))) (import "wasi_snapshot_preview1" "path_filestat_get" (func (;14;) (type 12))) (import "wasi_snapshot_preview1" "fd_fdstat_get" (func (;15;) (type 3))) (import "wasi_snapshot_preview1" "fd_readdir" (func (;16;) (type 28))) (import "wasi_snapshot_preview1" "fd_write" (func (;17;) (type 5))) (import "wasi_snapshot_preview1" "path_open" (func (;18;) (type 52))) (import "wasi_snapshot_preview1" "clock_time_get" (func (;19;) (type 53))) (import "wasi_snapshot_preview1" "path_remove_directory" (func (;20;) (type 6))) (import "wasi_snapshot_preview1" "path_unlink_file" (func (;21;) (type 6))) (import "wasi_snapshot_preview1" "fd_pread" (func (;22;) (type 28)))

This is the entire set. In order for the Zig compiler to compile itself to C, these are the only syscalls needed. " -- https://ziglang.org/news/goodbye-cpp/

the variant of WASM they use is "wasm32-wasi with a CPU of generic+bulk_memory". I guess bulk_memory might mean this: https://github.com/WebAssembly/bulk-memory-operations/blob/master/proposals/bulk-memory-operations/Overview.md

discussion: https://news.ycombinator.com/item?id=33913231 https://lobste.rs/s/g55iso/goodbye_c_implementation_zig

---

" As far as I am aware, no high level compiled language has ever done really well on an 8-bit CPU like a 6502. (Forth aside, perhaps.) You can do it but from what I’ve heard you tend to end up writing C or whatever in a dialect that ends up working a lot like the target machine’s assembly language anyway. But life gets a lot better on a 16-bit CPU where you have a bit more register space and probably enough memory for a stack. "

"

(Maybe OT: I realize people have a fondness for the 6502 based on beloved hardware, but IIRC it is considered a particularly hostile (challenging?) target for compilers because of its extreme shortage of registers and 16-bit operations. Even Woz got frustrated enough to write and use a small 16-bit interpreter called “Sweet16” for the Apple ][ ROM. Back in the day I found the Z80 much easier to code for.)

Anyway. FORTH worked really well on 8-bit systems, and there’s been a lot of progress in concatenative languages lately, so I wonder if any of those would work well in that domain. I’m guessing you’d still want a traditional threaded interpreter, not a compiler, because of the above mentioned problems with native codegen, but modern features like static typing and lambdas/quotes would be great " -- [6]

---

https://en.wikipedia.org/wiki/SWEET16

---

" #87 Eliot Miranda on 10.16.15 at 5:59 am

    Hi Yossi,
    compative is not always bad; I like the skepticism in your original post; bravo.
    Earlier in the thread "#41 Peufeu on 09.28.09 at 2:25 am" says it best; JITs can do a good job at executing dynamic oopls like Smalltalk (my love), so the issue should indeed be supporting the writing of JITs, at least in part. But also one should support the execution of the code a JIT would like to produce, and support GC.
    There is a short history of Smalltalk processors. SOAR led to SPARC (register windows) and its tagged arithmetic instructions, but neither ended up being used in Peter Deutsch's HPS, the highest performance commercially available Smalltalk VM for commodity processors (which my Spur VM is beating by about -40%). So two concrete suggestions from SOAR
    1. Tagged arithmetic instructions should neither hardwire the tag values nor the tag width, and should jump on failure (or better, skip next on success) rather than trap. So the instruction should allow one to specify the number of tag bits (forcing them to be least significant is probably fine) and which tag pattern represents a fixnum.
    A key instruction sequence in a Smalltalk JIT VM is the inline cache check which looks like eg
    movq %rdx, %rax
    andq #7, %rax. ; test tag bits of receiver
    jnz Lcmp ; if nonzero they are the cache tag
    movq (%rdx),%rax ; if zero, fetch class (index) from header
    andq #3FFFFF,%rax
    Lcmp:
    cmpq %rax, %rcx ; compare cache with receiver's cache tag
    The x86 provides a handy-dandy conditional move that one would think could eliminate the jump and yield
    movq %rdx, %rax
    andq #7, %rax. ; test tag bits of receiver
    cmoveqq (%rdx),%rax ; if zero, fetch class (index) from header
    andq #3FFFFF,%rax
    Lcmp:
    cmpq %rax, %rcx ; compare cache with receiver's cache tag
    except some cruel tease in Intel decided to make the instruction trap when given an illegal address *whether the instruction made the move or not*, so it's useless :-(. So if you provide a yummy conditional move, make sure it doesn't trap if the condition is false.
    Another tedious operation is the store check. It would be great to have a checked store instruction, again one that used the skip-next-on-success pattern so one can jump to the remembering code, rather than handling traps. A store check would be based on bounds registers, set up infrequently, eg when entering jitted code, that would specify the base and extent of new space and trap if storing an untagged value that lies in new space to an address outside it.
    One *really cool* piece of hardware support, that would essentially reduce my new VM to something trivial to implement, would be the ability to search memory in parallel for aligned word-sized cells that contain a specific value. This means /all/ of the heap, not just the part that is paged in, which probably implies no paging ;-). Smalltalk supports become, upon which lots of things are implemented, including shape-changing instances at runtime. Become is essentially "replace all references to a with references to b" or "exchange all references to a and b", so to add an inst var at runtime, the system collects all instances of a class, creates a new class with the added inst var, creates copies of all instances, with the new inst var having the value of nil, and then "atomically" exchanges the classes and the instances so that all references in the system refer to the new class and instances, leaving the old ones to the GC. The problem is in finding the references. A trawl through the heap is slow. The original Smalltalk-80 added an indirection in the header of each object, and hence one exchanges the indirections, and this design remains in HPS, but the explicit indirection slows down all accesses. My new Spur VM uses "lazy forwarding", turning the original objects into forwarding pointers to copies, and fixing them up when message sends fail (since forwarders don't have valid classes) or when visited by the GC, which is why Spur is exactly twice as fast as HPS in the benchmarks game's binary trees benchmark, but at the cost of /lots/ of complexity. If memory were smart enough to render the search O(1) that would be magnificent.
    There are other such ideas (azul's already been mentioned; ASPLOS is a good source for the design ideas around cache management (avoiding the need to zero unmapped pages etc) and cheap user-level traps (although I think skip-next/conditional instructions a la ARM are way more convenient)). But the above are what are pressing from my own experience." -- comment on https://yosefk.com/blog/the-high-level-cpu-challenge.html

---

" I'm pretty sure I'm hopelessly naive, but RISC processors are still way too complex. Integrate a multi-port main memory on the CPU die, toss out the caches, memory manager, branch prediction, TLB, security modes and supervisor registers, etc. and keep the core busy via barrel processing with equal bandwidth threads intercommunicating in the shared memory space. Stick a bunch of these together (a miracle occurs here) and deal with security at a higher level. " -- comment on https://yosefk.com/blog/the-high-level-cpu-challenge.html

---

https://www.bdti.com/InsideDSP/2013/10/23/SingularComputing

"Imperfect Processing: A Functionally Feasible (and Fiscally Attractive) Option, Says Singular Computing ... Around a decade ago, Bates had a breakthrough realization that the human brain's neurons don't do exact arithmetic; they were only about 99 percent right on average. What, he wondered, would happen if he tried to build hardware that wasn’t neural in design, but which implemented approximate arithmetic? What Bates discovered was that he could shrink the silicon area consumed by each arithmetic unit by approximately 100x versus the DSP-, FPGA- or GPU-based alternatives, an especially attractive outcome in AI and other highly parallelizable applications. ... Imperfect computing, Bates freely admits, is not an idea that's unique to his startup company, Singular Computing. Less than two weeks ago, for example, Joel Hruska at ExtremeTech published the informative article "Probabilistic computing: Imprecise chips save power, improve performance," which covers the research being done by Christian Enz, the Director of the Institute of Microengineering at the École Polytechnique Fédérale de Lausanne. Hruska's article references similar work being done at Rice University, which he explored in greater detail in a writeup published in May of last year. And Hruska also mentions Intel, which among other things "has explored the idea of a variable FPU that can drop to 6-bit computation when a full eight bits isn’t required." https://www.extremetech.com/computing/168348-probabilistic-computing-imprecise-chips-save-power-improve-performance https://www.extremetech.com/computing/129665-can-probabilistic-computing-save-moores-law ... In Singular Computing's case, the core arithmetic unit does "floating point-like" operations (add, subtract, divide, multiply, and square root) in a single cycle and pairs with 256 words' worth of high-speed memory and multiple local registers to form an APE (approximate processing element) (Figure 1). 
APEs communicate with each other over a neural network-reminiscent massively parallel grid interconnect scheme; Bates estimates that modern process lithographies could enable the cost-effective integration of several hundred thousand APEs on a single chip, alongside an ARM or other host processor."

---

brucehoult (1 year ago):

    Many of the applications for a CPU like this don’t need any state outside of the CPU registers – especially as RISC-V lets you do multiple levels of subroutine call without touching RAM if you manually allocate different registers and a different return address register for each function (which means programming in asm, not C). A lot of 8051 / PIC / AVR have been sold without any RAM (or with RAM == memory-mapped registers). -- https://lobste.rs/s/nqxfoc/serv_is_award_winning_bit_serial_risc_v

---

(talking about casting in C) kornel:

float is a special kind of fun, because it’s also UB if a float to int cast overflows. Rust had to add a range check to float casts to close this loophole in LLVM.
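
Concretely, the behavior Rust ended up defining (since 1.45) is a saturating cast, in contrast to C, where an overflowing float-to-int conversion is undefined:

```rust
fn main() {
    // Rust defines float-to-int `as` casts to saturate at the target
    // type's bounds; in C the same overflowing conversion is UB.
    assert_eq!(1e10_f32 as i32, i32::MAX); // clamps to 2147483647
    assert_eq!(-1e10_f32 as i32, i32::MIN);
    assert_eq!(f32::NAN as i32, 0); // NaN maps to zero
    assert_eq!(-1.9_f32 as i32, -1); // in range: truncate toward zero
    println!("all casts saturated as expected");
}
```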

---

" Code size reduction extensions

A family of extensions referred to as the RISC-V code size reduction extensions https://github.com/riscv/riscv-code-size-reduction/releases/download/v1.0.4-2/Zc_1.0.4-2.pdf was ratified earlier this year. One aspect of this is providing ways of referring to subsets of the standard compressed 'C' (16-bit instructions) extension that don't include floating point loads/stores, as well as other variants. But the more meaningful additions are the Zcmp and Zcmt extensions, in both cases targeted at embedded rather than application cores, reusing encodings for double-precision FP store.

Zcmp provides instructions that implement common stack frame manipulation operations that would typically require a sequence of instructions, as well as instructions for moving pairs of registers. The RISCVMoveMerger pass performs the necessary peephole optimisation to produce cm.mva01s or cm.mvsa01 instructions for moving to/from registers a0-a1 and s0-s7 when possible. It iterates over generated machine instructions, looking for pairs of c.mv instructions that can be replaced. cm.push and cm.pop instructions are generated by appropriate modifications to the RISC-V function frame lowering code, while the RISCVPushPopOptimizer pass looks for opportunities to convert a cm.pop into a cm.popretz (pop registers, deallocate stack frame, and return zero) or cm.popret (pop registers, deallocate stack frame, and return).

Zcmt provides the cm.jt and cm.jalt instructions to reduce the code size needed for implementing a jump table. Although support is present in the assembler, the patch to modify the linker to select these instructions is still under review, so we can hope to see full support in LLVM 18.

The RISC-V code size reduction working group have estimates of the code size impact of these extensions https://docs.google.com/spreadsheets/d/1bFMyGkuuulBXuIaMsjBINoCWoLwObr1l9h5TAWN8s7k/edit#gid=1837831327 ((my note: avg 11%)) produced using this analysis script. I'm not aware of whether a comparison has been made to the real-world results of implementing support for the extensions in LLVM, but that would certainly be interesting. "

---