proj-oot-ootAssemblyNotes32

so it's been some months. Without quite rereading what i have above, how about, for the 32-bit encoding:

the addr modes would be:

when the output operand has immediate constant mode, that means that:

copied to boot_reference.md

---

result of popping the stack above the frame pointer (stack underflow) is undefined? That helps with specifying the cache behavior, but it makes Forth-style stuff, where there are no boundaries between the data written by different functions, impossible. But I guess you don't want to overwrite the return address on the stack! So Forth-style stuff requires a separate return stack anyway (which is where the frame pointer would go)

what else does Forth do with the return stack?

so, in Forth, there are typically two stacks, one called either just 'stack' or 'parameter stack', and the other called 'return stack'. The parameter stack tends to be much larger than the return stack. The return stack is the one organized with traditional 'stack frames', with return addresses, loop counters, and maybe local variables, so it looks the most like a C stack. Function calling parameters go on Forth's larger 'parameter stack'.
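To make the two-stack picture concrete, here's a toy sketch (class and method names are invented for illustration; this is not any real Forth) of how calls use the return stack while data flows on the parameter stack:

```python
# Toy model of Forth's two stacks (illustrative only).
class ForthVM:
    def __init__(self):
        self.data = []   # parameter ("data") stack: arguments, results
        self.ret = []    # return stack: return addresses, loop counters

    def call(self, target, pc):
        self.ret.append(pc)   # return address goes on the RETURN stack
        return target

    def do_return(self):
        return self.ret.pop()

    def add(self):
        b, a = self.data.pop(), self.data.pop()
        self.data.append(a + b)

vm = ForthVM()
vm.data += [2, 3]       # caller leaves arguments on the parameter stack
vm.add()                # callee consumes them and leaves a result
print(vm.data)          # [5]
```

Because return addresses never sit on the parameter stack, data written by one function can be freely consumed by another with no frame boundaries in between, which is exactly the property the note above says a single C-style stack can't give you.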

---

if, instead of putting the stack at the top of a chunk of memory (growing downward) and the heap at the bottom (growing upward), we allocate a separate chunk for the stack, then we can put the parameter stack at the top and the return stack at the bottom of that chunk

---

link register instead of read-only segment pointer? RISC-V has one but ForwardCom does not. But no one has a read-only segment pointer. Should we require that "on a control flow join the statically determined stack depth has to be the same on all joining control flows", like http://www.complang.tuwien.ac.at/anton/euroforth/ef13/papers/ertl-paf.pdf ? They have tagged GOTOs and tagged labels for this purpose; this allows them to keep the stack in registers and compile away some stack manipulation. i think this is more appropriate for the next level up (LOVM) -- it seems like it would limit efficiently writing something very low-level, like an emulator, but i'm not sure

there seems to be no need to have a callee-save small stack, because: one of the main reasons that the small stack has an advantage over an in-memory stack is that you don't have to pop if it's a ring. But if it's callee-save, then when a subroutine is not a leaf it will have to pop in order to save all that stuff

but... i guess a ring buffer is harder to implement on today's machines than just using an in-memory parameter stack. The special cache can be applied to offsets using either stack or frame pointer as base, and can be applied only to the stack and frame addressing modes, making it simpler to detect

---

I still wonder if we couldn't just somehow use the 16-bit encoding for BootS. I don't really see how we could do it, because if you have 3x 4-bit operands and even just 1 format bit, you only have 3 bits left, enough to specify 8 opcodes. So I guess if you were going this route you would have to reduce the bits in the operands and use special copy instructions to reach most of the registers, and only allow normal instructions to use a few of the registers
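For illustration, here's the bit budget as a pack/unpack sketch: 1 format bit + 3 opcode bits + 3x 4-bit operands = 16 bits exactly. The field order is my assumption, not anything decided above:

```python
# Hypothetical 16-bit layout: [fmt:1][opcode:3][op0:4][op1:4][op2:4]
def encode16(fmt, opcode, op0, op1, op2):
    assert fmt < 2 and opcode < 8 and max(op0, op1, op2) < 16
    return (fmt << 15) | (opcode << 12) | (op0 << 8) | (op1 << 4) | op2

def decode16(word):
    return (word >> 15, (word >> 12) & 0x7,
            (word >> 8) & 0xF, (word >> 4) & 0xF, word & 0xF)

w = encode16(1, 5, 3, 10, 15)
assert decode16(w) == (1, 5, 3, 10, 15)
print(f"{w:#06x}")
```

The point of the sketch is just that with full 4-bit operand fields there is no room left over: 8 opcodes is all you get, which is why the note suggests shrinking the operand fields instead.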

---

i think we want 16 regs, not 32, to minimize architectural state, so that we can have gazillions of green threads?

should we go 16-bit instead of 32-bit? nah.. even popular microcontrollers are 32-bits these days.

for choosing the calling convention/ABI, look at:

risc-v (32 regs, but has 2 embedded ABI proposals for RV32E, which has 16 regs: RISC-V ILP32E, and RISC-V EABI) https://github.com/riscv-non-isa/riscv-elf-psabi-doc/blob/master/riscv-cc.adoc

  16 regs:
    11 GPRs (3 caller-saved temporaries, 2 callee-saved, 6 argument registers)
    zero, return address, stack ptr, global ptr, thread ptr

  "The ILP32E calling convention is not compatible with ISAs that have registers that require load and store alignments of more than 32 bits. In particular, this calling convention must not be used with the D ISA extension." -- why? is this a problem for us?

https://github.com/riscv-non-isa/riscv-eabi-spec : 16 regs: https://github.com/riscv-non-isa/riscv-eabi-spec/blob/master/EABI.adoc

  11 GPRs (4 argument registers (2 of which are also return registers), 2 temporaries/caller-saved, 5 callee-saved), including a suggested link register (in place of a caller-saved/temporary) and frame pointer (in place of a callee-saved register)
  zero, return addr, stack ptr, global ptr, thread ptr

  optional: "If an entire embedded application and its libraries make no use of thread-local storage, the tp register becomes available as a global register or as a temporary register, at the application’s discretion. If the __global_pointer$ symbol is not defined, the gp register becomes available in the same fashion. Using the tp and gp registers in this alternate way is a nonstandard extension to the EABI and might not compose with some EABI libraries."

ARM aarch64 (32 regs)

ARM cortex-m (17 regs: 13 GPRs, stack pointer, link register, PC, Special-purpose Program Status register) https://developer.arm.com/documentation/ddi0439/b/Programmers-Model/Processor-core-register-summary https://en.wikipedia.org/wiki/Calling_convention

  out of the 13 GPRs: 4 argument regs (some reused as return regs), 8 callee-saves, 1 caller-save temporary

forwardcom https://www.agner.org/optimize/forwardcom.pdf

  instruction size of 1, 2, or 3 32-bit words; fully orthogonal; a zillion instruction formats (can each instruction use any format?)
  addressing modes: Address = Base pointer + Index * Scale + Offset
  32 GPRs including a stack ptr, plus IP (pc), Data section pointer (DATAP), Thread data pointer (THREADP), Numeric control register (NUMCONTR)

system v calling convention https://www.agner.org/optimize/calling_conventions.pdf esp. page 10, "6 Register usage Table 4. Register usage"

  not sure, but it looks like we have 16 registers:
    15 GPRs: 6 caller-save argument registers (RDI,RSI,RDX,RCX,R8,R9) (of which 1, RDX, is also a return register), 1 additional caller-save return register (RAX), 2 additional caller-save temporaries (R10, R11), and 6 callee-saves (RBX, RBP, R12, R13, R14, R15)
    plus 1 stack pointer (rsp); rbp is also a suggested frame pointer register
  also, "FS is used for a thread environment block in Windows and for thread specific data in Linux" (by glibc i think?) and GS is used by the kernel for a per-CPU pointer to kernel memory ( https://stackoverflow.com/questions/6611346/how-are-the-fs-gs-registers-used-in-linux-amd64 )

https://refspecs.linuxfoundation.org/elf/x86_64-abi-0.99.pdf https://github.com/hjl-tools/x86-psABI/wiki/X86-psABI

64 bit Windows calling convention https://www.agner.org/optimize/calling_conventions.pdf

  15 GPRs
    4 caller-save argument registers (RCX,RDX,R8,R9), 1 additional caller-save return register (RAX), 2 additional caller-save temporaries (R10, R11), and 8 callee-saves (RBX, RBP, R12, R13, R14, R15, RSI, RDI)
  plus rsp stack pointer
  and i think it uses FS and GS for something too?
    "On 64-bit, GS is used to access the PEB in userland or the KPCR in kernel land in Windows" -- [https://github.com/NationalSecurityAgency/ghidra/issues/1339]
    "The reason Win64 uses GS is that there the FS register is used in the 32 bit compatibility layer (confusingly called Wow64)." -- [https://stackoverflow.com/questions/39137043/what-is-the-gs-register-used-for-on-windows]
    (so, commonalities and differences b/t 64-bit windows and linux:
      commonalities: 4 caller-save argument registers (RCX,RDX,R8,R9), 1 caller-save return register (RAX), 2 additional caller-save temporaries (R10, R11), 6 callee-saves (RBX, RBP, R12, R13, R14, R15) (but RBP is recommended frame pointer), rsp, and FS/GS used by the platform
      differences: RDI, RSI are caller-save argument regs in linux and callee-saves in windows)

so some commonalities, at least over RISC-V EABI, RISC-V ILP32E, 64-bit windows, 64-bit linux:

  - at least 4 caller-save argument registers (in x86-64 there is also a separate return reg but in risc-v an argument reg is reused for this)
  - at least 2 other caller-save temporaries
  - at least 2 other callee-saves (everyone except risc-v EABI has at least 5 tho)
  - at least 1 stack ptr
  - at least 2 other platform ptrs (but risc-v EABI makes these optional) (on risc-v these are GPRs, on windows they are separate special regs)

adding in ARM cortex-M (ARM32):

32 register platforms:

ARM64: https://en.wikipedia.org/wiki/Calling_convention#ARM_(A64)

RISC-V:

forwardcom:

so the intersection of these are:

so i one idea is:

but we dont need argument regs because we can cache the stack?

suggest:

  pc (or zero depending on context, eg the 16- and 8-bit forms?)
  6 caller-save (including err/accumulator)
  6 callee-save
  stack ptr (points within the parameter stack)
  frame ptr (points within the return stack)
  Data section pointer

  parameter stack ptr (callee-save, sorta) (this is the one used by stack addressing mode)
  return stack ptr (callee-save)
  frame ptr (callee-save) (i think this points into the return stack? do we even need a separate return stack pointer?)
  global ptr (callee-save)
  thread ptr (callee-save)
  read-only segment pointer (callee-save)
  result/accumulator/err (caller-save)
  PC (not directly accessible? so really just 15 registers? or, readable but not writable (allows for constant pools in the middle of the code)?)

ok, copied this to a note at the end (currently) of boot_reference.md

---

old idea for addr modes:

the addr modes would be:

new idea:

eh, a problem with this is no predec/postinc. You could use the index reg bit to specify index vs predec/postinc but then no bits for direction of predec/postinc. I guess you could give up the 2 base regs tho. Or actually how about:

i dunno, this is kinda dumb, b/c if the scale is so small then the offset will go into the next array location. But, giving up the 16-bit scale is cool. How about:

eh, this doesn't scale, actually, in the sense that in a larger instruction size you want a linear combination of 8-bit and ptr for your scale. Hmm, if you had at least 4 scale bits it would make sense to have both an index and a displacement. Also i don't think you'll ever want both index and predec/postinc, but you might want neither (ie just the offset). So we have at least:

i don't think this is much better than the original status quo idea. predec/postinc aren't as bit-hungry as indexed, so we don't need to break them out into two separate addr modes. But they may as well have a mode that is separate from indexed, b/c they need a sign and indexed doesn't. For similar reasons, it would be nice if stack and frame were separate, and these are probably common, so it makes sense to give them their own modes (although otoh i doubt stack really needs 4 bits of data). So the remaining difference from the status quo is whether we give an extra bit to 'indexed', or whether we have a separate 'indirect' mode that can take any base reg. I think taking any base reg in indirect is pretty useful.

old:

- the displacement bits are themselves split into two groups, with the first 2 bits indicating a number of bytes (0, 1, 2, 4), and the second indicating a number of ptrs (0 or 1), and these are added together
- the displacement bits signify a choice of: (8-bit, 16-bit, 32-bit, 64-bit, ptr, 8-bit + ptr, 32-bit + ptr, 64-bit + ptr)
- the displacement bits are themselves split into two groups, with the first 2 bits indicating a number of bytes (0, 1, 2, 4), and the third bit indicating a number of ptrs (0 or 1), and these are added together
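As a sketch, here's a decoder for the bytes-(0,1,2,4)-plus-ptr variant of the 3 displacement bits (PTR_SIZE is an assumed platform parameter, not something fixed above):

```python
PTR_SIZE = 8  # assumed pointer width in bytes, e.g. a 64-bit target

def displacement(bits3):
    """Decode 3 displacement bits: low 2 bits pick a byte count from
    (0, 1, 2, 4); bit 2 adds one pointer width. The parts are summed."""
    byte_part = (0, 1, 2, 4)[bits3 & 0b11]
    ptr_part = PTR_SIZE if bits3 & 0b100 else 0
    return byte_part + ptr_part

print([displacement(b) for b in range(8)])  # 0,1,2,4 then each +8
```

So the 8 encodable displacements on this assumed 64-bit target are 0, 1, 2, 4, 8, 9, 10, 12 bytes, which shows both the appeal (pointer-relative slots come for free) and the coarseness that the note below complains about.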

hmm, all these addr modes with so few bits for the offset/scale, is there even any point?

an alternative would be to give up on all of the register choice bits and make the offsets/scales 1 bit larger

old:

old:

old:

old:

---

1*x + 4*y + ptr*z + 1,2,4,ptr (can't express 7, redundant 2)

---

Non-indirect postinc/postdec addr modes are kind of a waste b/c they don't obviate the need for temporary registers, unlike the other addressing modes

---

now i'm not sure BootS is worth it after all.

---

"

z29LiTp?5qUC30n on May 28, 2020


I think current bootstrapping work clearly shows that the Maxwell equations of software are:

   pop rx -> sp--; rx := mem[sp]
   push rx -> mem[sp] := rx; sp++
   sub rx ry -> rx := rx - ry
   jmp rx $I -> if rx is zero jump $I bytes from end of instruction.

One might also argue:

   call rx -> mem[sp] := IP; sp++; IP := rx
   ret -> sp--; IP := mem[sp]
   nand rx ry -> rx := rx nand ry

but those are just optimizations " -- [1]
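A toy interpreter for the four quoted ops, as a sketch: the instruction stream is a list of tuples rather than encoded bytes, and the jump offset counts instructions instead of bytes (both are simplifications of the quoted semantics):

```python
def run(prog, mem, sp, regs):
    """Execute the four 'Maxwell equation' ops from the quote above."""
    pc = 0
    while pc < len(prog):
        op, *args = prog[pc]
        pc += 1
        if op == "pop":                  # sp--; rx := mem[sp]
            sp -= 1; regs[args[0]] = mem[sp]
        elif op == "push":               # mem[sp] := rx; sp++
            mem[sp] = regs[args[0]]; sp += 1
        elif op == "sub":                # rx := rx - ry
            regs[args[0]] -= regs[args[1]]
        elif op == "jz":                 # 'jmp rx $I': jump if rx is zero
            if regs[args[0]] == 0:
                pc += args[1]
    return regs

mem = [7, 3, 0, 0]                       # 7 and 3 pre-pushed on the stack
regs = run([("pop", "r1"), ("pop", "r0"), ("sub", "r0", "r1")],
           mem, sp=2, regs={"r0": 0, "r1": 0})
print(regs["r0"])  # 4
```

Even this tiny sketch shows why the commenter calls call/ret/nand "just optimizations": call and ret are push/pop of the pc, and the rest of arithmetic can be bootstrapped from sub and the zero-test jump.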

---

most of the indirect/index/displacement address modes are pretty useless without more bits. What would it look like if we allowed these to take a 32-bit payload?

how long are other instruction sets instructions?

note that if we allow more than one operand to use indirect addressing, even with no payload, then we're already beyond the pale for a 'real' ISA that would be directly executed by a modern processor (this is "memory-memory" as opposed to "memory-register" and "register-register"; it would be too inefficient/unpredictable/complicated to have instructions which might read and write to up to three different memory locations, depending). So i'm leaning towards saying, whatevs, just consider this a notation for a very-CISC cross-platform assembly language with a relatively simple encoding, and allow up to 3 payloads per instruction (so at most 16 bytes per instruction). In addition, if we relax the requirement that you can figure out the total instruction length from the first byte, and say that we can figure it out from the first 4 bytes (32-bits), then i think we're good to go.

this increases the importance of BootS as an actual RISC-y, simple instruction set, and i think strengthens the argument for BootS to have the same instruction encoding, just with the addr mode bits set to zero or something like that.

---

so, based on the above, my tentative decision is:

---

what size should our stack slots be? probably max(int size, ptr size). But what size int? Should they accommodate at least 32-bit integers, or 64-bit ints?

If 32-bit, then putting a 64-bit quantity on the stack takes two stack slots -- so on a platform with 64 bit ptrs, a stack slot would be 64 bits, and a 64-bit quantity would occupy 128 bits! The benefit of 32-bit is that on a platform with 32-bit ptrs, stack slots are only 32-bits instead of 64-bits
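Quick arithmetic for that trade-off, assuming slot width = max(int size, ptr size) and that a 64-bit value is split into int-size pieces (one piece per slot):

```python
def footprint_of_64bit_value(int_size, ptr_size):
    """Bits occupied on the stack by one 64-bit value."""
    slot = max(int_size, ptr_size)   # slot width in bits
    nslots = -(-64 // int_size)      # ceiling division: pieces needed
    return nslots * slot

for int_size in (32, 64):
    for ptr_size in (32, 64):
        print(f"int={int_size}, ptr={ptr_size}: "
              f"{footprint_of_64bit_value(int_size, ptr_size)} bits")
# int=32, ptr=64 gives 128 bits -- the blowup described above
```

The int=32/ptr=64 cell is the pathological one: two slots of 64 bits each, so a 64-bit quantity costs 128 bits, while choosing 64-bit ints only costs extra on 32-bit-pointer platforms.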

note that some other 64-bit platform ABIs appear to require stack alignment of 64-bits (x86-64?) or even 128-bits (AArch64?):

https://research.csiro.au/tsblog/debugging-stories-stack-alignment-matters/ https://stackoverflow.com/questions/64627897/system-v-abi-amd64-stack-alignment-in-gcc-emitted-assembly https://stackoverflow.com/questions/40305965/does-each-push-instruction-push-a-multiple-of-8-bytes-on-x64 https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/using-the-stack-in-aarch64-implementing-push-and-pop https://github.com/dibyendumajumdar/ravi-ffi/issues/10

a 128-bit alignment restriction is similar to the cost of 128 bit stack slots, i guess.. although not exactly because a smart program/compiler could make use of 128 bits on AArch64's stack to store more than one value

---

old comment that i removed from the BootS ref:

---

yes, little-endian won: https://stackoverflow.com/questions/61701973/does-arm-assume-that-all-cortex-m-microcontrollers-are-little-endian

---

is it better to have stack-offset and frame-offset be different addr modes (allowing access to 16 stack locs and 16 frame locs), or to combine them and have an indirect addr mode (with no displacement, index, post/pre mutation, and no payload)?

having an indirect addr mode would significantly decrease instruction size if vanilla indirect is used a lot (b/c no payload needed), but if stack- and frame- offsets above 8 are used a lot, it would increase instruction size there (b/c would need payload to access 8 thru 16)

RISC vs. CISC from the perspective of compiler/instruction set interaction, Daniel V. Klein (summarized in plChAssemblyFrequentInstructions.txt) suggests that register indirect with displacement is at least as common as register indirect without displacement on many architectures. So maybe skip vanilla indirect.

---

should we make both the argument stack (where arguments and return values are passed, and temporary computation is done) and the return stack (where return addresses, frame pointers, and local variables are held) grow downward, or should one grow down and one grow up?

if one grows down and the other up, that may let us use less memory when you have a zillion green threads, b/c if you allocate too much for one and not enough for the other, they can 'steal' from each other. But if they both grow down, then you can use mmap's MAP_GROWSDOWN on GNU/Linux. Not sure what the situation is on Android, iOS, Windows, macOS.
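A sketch of the 'stealing' idea: one arena, argument stack growing up from the bottom, return stack growing down from the top (class and method names, and the sizes, are all invented for illustration):

```python
class TwoStackArena:
    """One chunk of memory shared by two opposed stacks, so unused
    space in one stack is automatically available to the other."""
    def __init__(self, size):
        self.mem = [0] * size
        self.lo = 0            # argument stack top (grows upward)
        self.hi = size         # return stack top (grows downward)

    def push_arg(self, v):
        if self.lo >= self.hi:
            raise MemoryError("stacks collided")
        self.mem[self.lo] = v; self.lo += 1

    def push_ret(self, v):
        if self.hi <= self.lo:
            raise MemoryError("stacks collided")
        self.hi -= 1; self.mem[self.hi] = v

a = TwoStackArena(8)
for i in range(6):             # one stack may use most of the arena...
    a.push_arg(i)
a.push_ret(99)                 # ...as long as the other stays small
print(a.lo, a.hi)              # 6 7
```

With both stacks growing down instead, each needs its own worst-case allocation (or guard pages / MAP_GROWSDOWN), which is the memory-per-greenthread cost weighed above.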

---

"

The ILP32E calling convention is not compatible with ISAs that have registers that require load and store alignments of more than 32 bits. In particular, this calling convention must not be used with the D ISA extension. "

-- https://github.com/riscv-non-isa/riscv-elf-psabi-doc/blob/master/riscv-cc.adoc

why not? is this a problem for us?

i think it's not. i bet they just decided that the 32-bit and 64-bit RISC-V stuff, including instruction encoding but also ABI, was separate rather than one being an extension of the other.

---

DONE (i modified this a bit, so it's not exactly what i have here; i still have to finish the rewrite from "TODO 220919 REWRITE BOUNDARY HERE" in boot_reference.md; but everything here (as modified) has already been either copied into that file as part of the rewrite, or copied in as notes next to the REWRITE BOUNDARY annotation; so the rest of this section is just for my records. Also, boot_reference.md has already been rewritten to be a subset of this, with register addr mode only and no payloads) so it's been some months. Without quite rereading what i have above, how about, for the 32-bit encoding:

the addr modes would be:

(other ideas i didn't do: ?index into constant table? ?stack offset without push/pop? ?stack but treat any register as stack pointer?)

do we have stack slots of more than 8 bits? (in which case the stack and frame addr modes can be scaled by stack slot size; otherwise they must be in units of bytes)

when the output operand (op0) has immediate constant mode, that means that:

16 registers:

  pc (or zero depending on context, eg the 16- and 8-bit forms? readable, but not writable?)?
  6 caller-save GPRs (including err/accumulator/result reg) (2 of which are eligible 'base pointers' for the complex addr modes)
  6 callee-save GPRs (2 of which are eligible 'base pointers' for the complex addr modes)
  stack ptr (points within the parameter stack)
  frame ptr (points within the return stack)
  data section pointer

(so, decided against: thread ptr; read-only segment pointer; a return-stack pointer in addition to the frame ptr)

the size of slots on the stack is the smallest size needed to fit either of (that is, a union type of):

64-bit integers and 64-bit pointers take up 2 stack slots (even on platforms where pointers, and hence stack slots, are 64 bits)

the operand data in the stack and frame addressing modes ('displacement') are in units of stack slots

reads and writes using the stack and frame addressing modes may access a cached version of the stacks, which is not necessarily in sync with the in-memory version of these same stacks except for at the moment of completion of the appropriate flush command (so i guess there are four flush commands? flush parameter stack from memory to cache, flush parameter stack from cache to memory, flush return stack from memory to cache, flush return stack from cache to memory)
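A toy model of that cache semantics (the class and the flush method names are invented for the sketch; only one of the two stacks is modeled):

```python
class CachedStack:
    """Stack/frame addressing modes hit the cache; the in-memory copy
    only matches after an explicit flush, as specified above."""
    def __init__(self, nslots):
        self.memory = [0] * nslots
        self.cache = [0] * nslots

    def write_slot(self, i, v):      # stack/frame addr mode write
        self.cache[i] = v

    def read_slot(self, i):          # stack/frame addr mode read
        return self.cache[i]

    def flush_to_memory(self):       # cache -> memory
        self.memory = list(self.cache)

    def flush_from_memory(self):     # memory -> cache
        self.cache = list(self.memory)

s = CachedStack(4)
s.write_slot(0, 42)
print(s.memory[0], s.read_slot(0))   # 0 42 -- out of sync until flushed
s.flush_to_memory()
print(s.memory[0])                   # 42
```

The point of the undefined out-of-sync window is that an implementation is then free to keep the hot end of each stack in registers or a dedicated buffer, and only materialize it in memory at the flush points.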

the 2 base registers and 2 index registers are disjoint, in each case with 1 of them caller-save and 1 of them callee-save

note: there's a little redundancy in displacement b/c a zero displacement is the same as indirect on the base reg. But this is worth it for the regularity in the interpretation of the displacement bits, which makes it easier to extend to a 64-bit instruction encoding

---

"Clang seems pretty impressive: given its self-imposed limitation of 2 argument + 8 scratch registers, it manages to use an impressively small number of loads and stores, at a slight penalty to stack space. ... It turns out that Clang only used 10 registers:

    s0 and s1 are function arguments (per calling convention)
    s8 through s15 are local variables
    It refused to use s16 through s32, which are "local variables, caller saved"" -- [2]

---

ideas for more addr modes:

two freedoms that remain to us (eg space to encode additional complexities such as more addr modes) are:

---

" Flags are still there on everything except RISC-V, and even there they exist in the floating-point ISA. They are annoying to implement but they’re a huge win for software. Compilers are really good at if conversion now and you can get the same performance on an architecture with conditional moves as one without if you have about less than half as much branch predictor state. The saving in the branch predictor more than offsets the cost of flags. They’re also essential for any kind of constant-time code to be fast, which is increasingly important with modern systems and side channels. " -- [3]

---

renox 5 hours ago


There's the issue of integer overflow: in Rust, overflows are only detected in debug mode, not in release mode.

The MIPS had 'trap on overflow' integer arithmetic instructions, but sadly RISC-V doesn't have those.


---

"x86 that has seen several major architectural upgrades, without dropping backwards compatibility. For instance the addition of 64-bit integer arithmetic, extra registers (r8-r15) and a new floating-point paradigm (SSE2 replacing x87)" -- [4]

---

"RISC ISA:s have added more powerful instructions. One of the traits of RISC ISA:s is that they have a limited instruction encoding space (whereas variable length CISC ISA:s have an almost infinite encoding space), which incentivizes RISC ISA designers to come up with clever and powerful instructions. While the early RISC ISA:s were pretty bare bones, more recent RISC ISA:s have included some interesting instructions (that never made it into CISC ISA:s for some reason). Among these are bit-field instructions that essentially do the work of 2-4 traditional bit manipulation instructions in a single instruction (as well as remove the need for traditional shift instructions), and integer multiply-and-add instructions (two instructions in one), etc. Another example is clever encoding of immediate values (numeric constants) so that most of the time you do not need to waste four bytes to represent a 32-bit numeric constant, for instance." -- [5]

---

https://ziglang.org/news/goodbye-cpp/ does something somewhat similar to what we want to do for bootstrapping (but not terribly similar)

they compile (a stripped-down for bootstrapping version of) the self-hosting compiler to WASM with WASI. The only WASI calls they need are these:

" The OS interop layer has been completely abstracted into a handful of WASI functions to be implemented in the WASI interpreter:

(import "wasi_snapshot_preview1" "args_sizes_get" (func (;0;) (type 3))) (import "wasi_snapshot_preview1" "args_get" (func (;1;) (type 3))) (import "wasi_snapshot_preview1" "fd_prestat_get" (func (;2;) (type 3))) (import "wasi_snapshot_preview1" "fd_prestat_dir_name" (func (;3;) (type 6))) (import "wasi_snapshot_preview1" "proc_exit" (func (;4;) (type 11))) (import "wasi_snapshot_preview1" "fd_close" (func (;5;) (type 8))) (import "wasi_snapshot_preview1" "path_create_directory" (func (;6;) (type 6))) (import "wasi_snapshot_preview1" "fd_read" (func (;7;) (type 5))) (import "wasi_snapshot_preview1" "fd_filestat_get" (func (;8;) (type 3))) (import "wasi_snapshot_preview1" "path_rename" (func (;9;) (type 9))) (import "wasi_snapshot_preview1" "fd_filestat_set_size" (func (;10;) (type 36))) (import "wasi_snapshot_preview1" "fd_pwrite" (func (;11;) (type 28))) (import "wasi_snapshot_preview1" "random_get" (func (;12;) (type 3))) (import "wasi_snapshot_preview1" "fd_filestat_set_times" (func (;13;) (type 51))) (import "wasi_snapshot_preview1" "path_filestat_get" (func (;14;) (type 12))) (import "wasi_snapshot_preview1" "fd_fdstat_get" (func (;15;) (type 3))) (import "wasi_snapshot_preview1" "fd_readdir" (func (;16;) (type 28))) (import "wasi_snapshot_preview1" "fd_write" (func (;17;) (type 5))) (import "wasi_snapshot_preview1" "path_open" (func (;18;) (type 52))) (import "wasi_snapshot_preview1" "clock_time_get" (func (;19;) (type 53))) (import "wasi_snapshot_preview1" "path_remove_directory" (func (;20;) (type 6))) (import "wasi_snapshot_preview1" "path_unlink_file" (func (;21;) (type 6))) (import "wasi_snapshot_preview1" "fd_pread" (func (;22;) (type 28)))

This is the entire set. In order for the Zig compiler to compile itself to C, these are the only syscalls needed. " -- https://ziglang.org/news/goodbye-cpp/

the variant of WASM they use is "wasm32-wasi with a CPU of generic+bulk_memory". I guess bulk_memory might mean this: https://github.com/WebAssembly/bulk-memory-operations/blob/master/proposals/bulk-memory-operations/Overview.md

discussion: https://news.ycombinator.com/item?id=33913231 https://lobste.rs/s/g55iso/goodbye_c_implementation_zig

---

" As far as I am aware, no high level compiled language has ever done really well on an 8-bit CPU like a 6502. (Forth aside, perhaps.) You can do it but from what I’ve heard you tend to end up writing C or whatever in a dialect that ends up working a lot like the target machine’s assembly language anyway. But life gets a lot better on a 16-bit CPU where you have a bit more register space and probably enough memory for a stack. "

"

(Maybe OT: I realize people have a fondness for the 6502 based on beloved hardware, but IIRC it is considered a particularly hostile (challenging?) target for compilers because of its extreme shortage of registers and 16-bit operations. Even Woz got frustrated enough to write and use a small 16-bit interpreter called “Sweet16” for the Apple ][ ROM. Back in the day I found the Z80 much easier to code for.)

Anyway. FORTH worked really well on 8-bit systems, and there’s been a lot of progress in concatenative languages lately, so I wonder if any of those would work well in that domain. I’m guessing you’d still want a traditional threaded interpreter, not a compiler, because of the above mentioned problems with native codegen, but modern features like static typing and lambdas/quotes would be great " -- [6]

---

https://en.wikipedia.org/wiki/SWEET16

---

" #87 Eliot Miranda on 10.16.15 at 5:59 am

    Hi Yossi,
    compative is not always bad; I like the skepticism in your original post; bravo.
    Earlier in the thread "#41 Peufeu on 09.28.09 at 2:25 am" says it best; JITs can do a good job at executing dynamic oopls like Smalltalk (my love), so the issue should indeed be supporting the writing of JITs, at least in part. But also one should support the execution of the code a JIT would like to produce, and support GC.
    There is a short history of Smalltalk processors. SOAR led to SPARC (register windows) and its tagged arithmetic instructions, but neither ended up being used in Peter Deutsch's HPS, the highest performance commercially available Smalltalk VM for commodity processors (which my Spur VM is beating by about -40%). So two concrete suggestions from SOAR
    1. Tagged arithmetic instructions should neither hardwire the tag values nor the tag width, and should jump on failure (or better, skip next on success) rather than trap. So the instruction should allow one to specify the number of tag bits (forcing them to be least significant is probably fine) and which tag pattern represents a fixnum.
    A key instruction sequence in a Smalltalk JIT VM is the inline cache check which looks like eg
    movq %rdx, %rax
    andq #7, %rax. ; test tag bits of receiver
    jnz Lcmp ; if nonzero they are the cache tag
    movq (%rdx),%rax ; if zero, fetch class (index) from header
    andq #3FFFFF,%rax
    Lcmp:
    cmpq %rax, %rcx ; compare cache with receiver's cache tag
    The x86 provides a handy-dandy conditional move that one would think could eliminate the jump and yield
    movq %rdx, %rax
    andq #7, %rax. ; test tag bits of receiver
    cmoveqq (%rdx),%rax ; if zero, fetch class (index) from header
    andq #3FFFFF,%rax
    Lcmp:
    cmpq %rax, %rcx ; compare cache with receiver's cache tag
    except some cruel tease in Intel decided to make the instruction trap when given an illegal address *whether the instruction made the move or not*, so it's useless :-(. So if you provide a yummy conditional move, make sure it doesn't trap if the condition is false.
    Another tedious operation is the store check. It would be great to have a checked store instruction, again one that used the skip-next-on-success pattern so one can jump to the remembering code, rather than handling traps. A store check would be based on bounds registers, set up infrequently, eg when entering jitted code, that would specify the base and extent of new space and trap if storing an untagged value that lies in new space to an address outside it.
    One *really cool* piece of hardware support, that would essentially reduce my new VM to something trivial to implement, would be the ability to search memory in parallel for aligned word-sized cells that contain a specific value. This means /all/ of the heap, not just the part that is paged in, which probably implies no paging ;-). Smalltalk supports become, upon which lots of things are implemented, including shape-changing instances at runtime. Become is essentially "replace all references to a with references to b" or "exchange all references to a and b", so to add an inst var at runtime, the system collects all instances of a class, creates a new class with the added inst var, creates copies of all instances, with the new inst var having the value of nil, and then "atomically" exchanges the classes and the instances so that all references in the system refer to the new class and instances, leaving the old ones to the GC. The problem is in finding the references. A trawl through the heap is slow. The original Smalltalk-80 added an indirection in the header of each object, and hence one exchanges the indirections, and this design remains in HPS, but the explicit indirection slows down all accesses. My new Spur VM uses "lazy forwarding", turning the original objects into forwarding pointers to copies, and fixing them up when message sends fail (since forwarders don't have valid classes) or when visited by the GC, which is why Spur is exactly twice as fast as HPS in the benchmarks game's binary trees benchmark, but at the cost of /lots/ of complexity. If memory were smart enough to render the search O(1) that would be magnificent.
    There are other such ideas (azul's already been mentioned; ASPLOS is a good source for the design ideas around cache management (avoiding the need to zero unmapped pages etc) and cheap user-level traps (although I think skip-next/conditional instructions a la ARM are way more convenient)). But the above are what are pressing from my own experience." -- comment on https://yosefk.com/blog/the-high-level-cpu-challenge.html

---

" I'm pretty sure I'm hopelessly naive, but RISC processors are still way too complex. Integrate a multi-port main memory on the CPU die, toss out the caches, memory manager, branch prediction, TLB, security modes and supervisor registers, etc. and keep the core busy via barrel processing with equal bandwidth threads intercommunicating in the shared memory space. Stick a bunch of these together (a miracle occurs here) and deal with security at a higher level. " -- comment on https://yosefk.com/blog/the-high-level-cpu-challenge.html

---

https://www.bdti.com/InsideDSP/2013/10/23/SingularComputing

"Imperfect Processing: A Functionally Feasible (and Fiscally Attractive) Option, Says Singular Computing ... Around a decade ago, Bates had a breakthrough realization that the human brain's neurons don't do exact arithmetic; they were only about 99 percent right on average. What, he wondered, would happen if he tried to build hardware that wasn’t neural in design, but which implemented approximate arithmetic? What Bates discovered was that he could shrink the silicon area consumed by each arithmetic unit by approximately 100x versus the DSP-, FPGA- or GPU-based alternatives, an especially attractive outcome in AI and other highly parallelizable applications. ... Imperfect computing, Bates freely admits, is not an idea that's unique to his startup company, Singular Computing. Less than two weeks ago, for example, Joel Hruska at ExtremeTech published the informative article "Probabilistic computing: Imprecise chips save power, improve performance," which covers the research being done by Christian Enz, the Director of the Institute of Microengineering at the École Polytechnique Fédérale de Lausanne. Hruska's article references similar work being done at Rice University, which he explored in greater detail in a writeup published in May of last year. And Hruska also mentions Intel, which among other things "has explored the idea of a variable FPU that can drop to 6-bit computation when a full eight bits isn’t required." https://www.extremetech.com/computing/168348-probabilistic-computing-imprecise-chips-save-power-improve-performance https://www.extremetech.com/computing/129665-can-probabilistic-computing-save-moores-law ... In Singular Computing's case, the core arithmetic unit does "floating point-like" operations (add, subtract, divide, multiply, and square root) in a single cycle and pairs with 256 words' worth of high-speed memory and multiple local registers to form an APE (approximate processing element) (Figure 1). 
APEs communicate with each other over a neural network-reminiscent massively parallel grid interconnect scheme; Bates estimates that modern process lithographies could enable the cost-effective integration of several hundred thousand APEs on a single chip, alongside an ARM or other host processor."

---

brucehoult (1 year ago):

    Many of the applications for a CPU like this don’t need any state outside of the CPU registers – especially as RISC-V lets you do multiple levels of subroutine call without touching RAM if you manually allocate different registers and a different return address register for each function (which means programming in asm, not C). A lot of 8051 / PIC / AVR have been sold without any RAM (or with RAM == memory-mapped registers). -- https://lobste.rs/s/nqxfoc/serv_is_award_winning_bit_serial_risc_v

---

(talking about casting in C) kornel:

float is a special kind of fun, because it’s also UB if a float to int cast overflows. Rust had to add a range check to float casts to close this loophole in LLVM.
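
Concretely, the behavior Rust ended up defining (since 1.45) is a saturating cast, in contrast to C, where an overflowing float-to-int conversion is undefined:

```rust
fn main() {
    // Rust defines float-to-int `as` casts to saturate at the target
    // type's bounds; in C the same overflowing conversion is UB.
    assert_eq!(1e10_f32 as i32, i32::MAX); // clamps to 2147483647
    assert_eq!(-1e10_f32 as i32, i32::MIN);
    assert_eq!(f32::NAN as i32, 0); // NaN maps to zero
    assert_eq!(-1.9_f32 as i32, -1); // in range: truncate toward zero
    println!("all casts saturated as expected");
}
```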

---

" Code size reduction extensions

A family of extensions referred to as the RISC-V code size reduction extensions https://github.com/riscv/riscv-code-size-reduction/releases/download/v1.0.4-2/Zc_1.0.4-2.pdf was ratified earlier this year. One aspect of this is providing ways of referring to subsets of the standard compressed 'C' (16-bit instructions) extension that don't include floating point loads/stores, as well as other variants. But the more meaningful additions are the Zcmp and Zcmt extensions, in both cases targeted at embedded rather than application cores, reusing encodings for double-precision FP store.

Zcmp provides instructions that implement common stack frame manipulation operations that would typically require a sequence of instructions, as well as instructions for moving pairs of registers. The RISCVMoveMerger pass performs the necessary peephole optimisation to produce cm.mva01s or cm.mvsa01 instructions for moving to/from registers a0-a1 and s0-s7 when possible. It iterates over generated machine instructions, looking for pairs of c.mv instructions that can be replaced. cm.push and cm.pop instructions are generated by appropriate modifications to the RISC-V function frame lowering code, while the RISCVPushPopOptimizer pass looks for opportunities to convert a cm.pop into a cm.popretz (pop registers, deallocate stack frame, and return zero) or cm.popret (pop registers, deallocate stack frame, and return).

Zcmt provides the cm.jt and cm.jalt instructions to reduce the code size needed for implementing a jump table. Although support is present in the assembler, the patch to modify the linker to select these instructions is still under review, so we can hope to see full support in LLVM 18.

The RISC-V code size reduction working group have estimates of the code size impact of these extensions https://docs.google.com/spreadsheets/d/1bFMyGkuuulBXuIaMsjBINoCWoLwObr1l9h5TAWN8s7k/edit#gid=1837831327 ((my note: avg 11%)) produced using this analysis script. I'm not aware of whether a comparison has been made to the real-world results of implementing support for the extensions in LLVM, but that would certainly be interesting. "

---