proj-oot-ootAssemblyNotes30

--- old

removed from boots_extension_reference:

A big issue with this is that it's just not really linear:
- you can have floats and filesystems and UIs and multiprocessing without 64-bits or even 32-bits (old computers)
- you can have filesystems and UIs without multiprocessing (serial computers)
- you can have multiprocessing without UIs and filesystems (embedded OSs)

so either we give up on linearity, or we look at linearity as our specific way of organizing the world particular to our use-case and/or aesthetic.

One thing to focus on is emulatable vs I/O. E.g. 32-bit, 64-bit, and concurrency are all emulatable. Filesystems, TUIs, and graphics are I/O. It's not such a big deal to get the linearization wrong for the emulatable part b/c it can always be emulated. But if a profile insists that graphics are present and they're not, the program simply can't be run.

Ideas:
- let the I/Os indicate unavailability if the desired device type doesn't exist on this system
- focus on our use-cases and don't worry about whether the profiles are useful for others; others can always use the feature indicator bits directly
- by the way, do we really need to include profile #s in the VM at all? Mb the VM should only have feature indicator bits. Profiles could just be a thing defined in the spec (sketched below).
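
a sketch of the feature-bits-only shape (the feature names and masks are made up, just to show the idea that a profile is only a mask written down in the spec, not a number the VM knows about):

    /* hypothetical sketch: profiles as nothing but spec-defined masks over VM feature bits */
    #include <stdbool.h>
    #include <stdint.h>

    enum {
        FEAT_32BIT     = 1u << 0,
        FEAT_64BIT     = 1u << 1,
        FEAT_FLOAT     = 1u << 2,
        FEAT_FILESYS   = 1u << 3,
        FEAT_TUI       = 1u << 4,
        FEAT_MULTIPROC = 1u << 5,
    };

    /* a "profile" is just a required-feature mask defined in the spec */
    #define PROFILE3_MASK (FEAT_32BIT | FEAT_FLOAT)

    bool profile_ok(uint32_t vm_feature_bits, uint32_t profile_mask) {
        return (vm_feature_bits & profile_mask) == profile_mask;
    }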

  1. # Profile 12 TODO: not sure if this level and up should even be part of the Boot project. Higher-level languages and frameworks can add this stuff. It is not needed for bootstrapping. It is not in the C standard library, and was not available on 1980s 8-bit consumer desktop systems. Suggest moving to LOVM or higher.

This profile adds more I/O modalities. It is suitable for running modern desktop programs.

Boots12 (Boots Profile 12) includes everything in Profile 11 and also:

syscall extensions:

  1. # Profile 13 TODO: not sure if this level and up should even be part of the Boot project. Higher-level languages and frameworks can add this stuff. It is not needed for bootstrapping. It is not in the C standard library, and was not available on 1980s 8-bit consumer desktop systems. Suggest moving to LOVM or higher.

This profile adds more I/O modalities.

Boots13 (Boots Profile 13) includes everything in Profile 12 and also:

syscall extensions:

  1. # Profile 11 TODO: not sure if this level and up should even be part of the Boot project. Higher-level languages and frameworks can add this stuff. It is not needed for bootstrapping. It is not in the C standard library, although it was available on relatively primitive BASIC interfaces to 1980s 8-bit consumer desktop systems eg https://en.wikipedia.org/wiki/Applesoft_BASIC#Sound_and_graphics , which did not support networking (which we placed at a lower profile level). The latter suggests that maybe we should move it before profile 6; but otoh probably not b/c what about headless servers, and what about the C library.

This profile adds other I/O modalities. It is suitable for running desktop programs with simple keyboard-based user interfaces.

Boots11 (Boots Profile 11) includes everything in Profile 10 and also:

0: base Boots

1: 16-bit integer static control flow functions

2: 32-bit integer static control flow functions

3: 32-bit static control flow programs

must be present but might not have resources:

    1. Profile 0: base Boots

The base profile, standard Boots without extensions, is called "Boots0" and is defined in boots_reference.md. This profile is suitable for 15-bit unsigned integer pure calculations that don't access I/O and take place wholly in registers (without accessing memory).

  1. # Profile 1: 16-bit integer static control flow functions

This profile adds signed arithmetic and memory access. Suitable for 16-bit integer computation without multiplication, dynamic control flow, I/O, or memory growth. Profile 1 includes everything in profile 0 and adds:

instruction extensions:

syscall extensions:

    1. Profile 2: 32-bit integer static control flow

Profile 2 adds 32-bit arithmetic and integer multiply. Profile 2 is suitable for small-size executables doing 32-bit integer computation without division or dynamic control flow. Profile 2 includes everything in profile 1 and adds:

instruction extensions:

OUT-OF-DATE NOTES: Note: Profile 2 is what pyboot3 implements, except without 64-bits or floats or math

i did it in that order b/c static control flow is easier to implement in JVM and i think WASM, but JVM at least supports triglog, and WASM supports everything in 4 and before. And triglog can be implemented in a library, whereas dynamic_control_flow requires a new interpreter loop if the platform doesn't support it.
Our Lo toolchain will require dynamic control flow but not floats or 64-bits.
  1. # Profile 3: 32-bit static control flow programs

Profile 3 adds I/O, floating-point of unspecified bitwidth, embedded immediates, integer divide, exit, and memgrow. Profile 3 is suitable for 32-bit programs without dynamic control flow. Profile 3 includes everything in profile 2 and adds:

instruction extensions:

syscall extensions:

Note: Profile D is what pyboot3 implements, except need to add in,out (io syscall extension), and change malloc/free to memsz, memgrow

Note: When console I/O (device #0) is present, the compilers and interpreters in the Boot toolchain can run on Profile 3.

Note: Profile 3 is what pyboot3 implements, except need to add in,out (io syscall extension)

i did it in that order b/c static control flow is easier to implement in JVM and i think WASM, but JVM at least supports triglog, and WASM supports everything in 4 and before. And triglog can be implemented in a library, whereas dynamic_control_flow requires a new interpreter loop if the platform doesn't support it.

--- done

i took a look at the WASI spec. There are a lot of 1-byte and 2-byte fields in the defined structs. So maybe we should support 8-bit and 16-bit values in the Boot offset fields. So maybe 1 bit for each of:

that's not too useful, but then no scheme is very useful with just 4 bits and portable sizing. In LOVM if 8 or 12 bits are provided, it would have 2 or 3 bits per field, which would then become much more useful.

done.

---

hacknat 10 hours ago [–]

Normally I agree with anti-NIH sentiments, but why are we all supposed to just have to be happy with a language ABI that's been around since the 1970s?

reply

simias 6 hours ago [–]

If it ain't broke...

Having a common entrypoint into the kernel we can hook into is fairly valuable IMO, especially when said interface is for the most part stable between all unx (and to a somewhat lesser extent even on Windows).

Having this lowest common denominator ABI makes interop relatively* painless. But clearly Go wants to eat the world and doesn't seem to care very deeply about this. I suppose it's a valid approach, even if I don't necessarily agree with it.

reply

convolvatron 6 hours ago [–]

  If it ain't broke...

user supplied read addresses

getpwent() - actually really the whole notion of users

file permissions

async storage event completion (all asynchrony is pretty broken)

system metadata control (sysctl, /proc, interface socket messages...)

signals

ttys

errno

affinity interfaces ...

reply

---

"As an aside, I’ve found PPC hacking to be much more pleasant than the little bit of 6502 and x86 hacking I’ve done. PPC is big-endian, so it’s easy to read and write the raw bytes. Also every instruction is exactly four bytes long, so you don’t have to fret about adding NOPs or starting your decode in the middle of an instruction. Super nice to work with."

---

" There are some low-level features that Rust doesn't have a proper replacement for:

    computed goto. "Boring" uses of goto can be replaced with other constructs in Rust, like loop {break}. In C many uses of goto are for cleanup, which Rust doesn't need thanks to RAII/destructors. However, there's a non-standard goto *addr extension that's very useful for interpreters. Rust can't do it directly (you can write a match and hope it'll optimize), but OTOH if I needed an interpreter, I'd try to leverage Cranelift JIT instead.
    alloca and C99 variable-length arrays. These are controversial even in C, so Rust stays away from them.

It's worth noting that Rust currently supports only one 16-bit architecture. The tier 1 support is focused on 32-bit and 64-bit platforms. " [1]
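
for reference, the non-standard "goto *addr" dispatch pattern that comment is talking about looks roughly like this in GNU C (a toy sketch, not from any particular interpreter):

    /* threaded dispatch via GNU C "labels as values" (goto *addr); a toy accumulator machine
       where code[] is a sequence of opcodes: 0 = halt, 1 = inc, 2 = dec */
    int run(const unsigned char *code) {
        static void *dispatch[] = { &&op_halt, &&op_inc, &&op_dec };
        const unsigned char *pc = code;
        int acc = 0;
        goto *dispatch[*pc++];
    op_inc:  acc++; goto *dispatch[*pc++];
    op_dec:  acc--; goto *dispatch[*pc++];
    op_halt: return acc;
    }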

---

" AssemblyScript? TurboFan? 298.13ms 2.9x JavaScript? TurboFan? 104.35ms 1.0x ... it didn’t sit well with me that the optimized WebAssembly? module takes about 3 times as long as JavaScript?. ...

Digging in

After quickly consulting with some folks from the V8 team and some folks from the AssemblyScript team (thanks Daniel and Max!), it turns out that one big difference here are “bounds checks” — or the lack thereof.

V8 has the luxury of having access to your original JavaScript code and knowledge about the semantics of the language. It can use that information to apply additional optimizations. For example: It can tell you are not just randomly reading values from memory, but you are iterating over an ArrayBuffer using a for ... of loop. What’s the difference? Well with a for ... of loop, the language semantics guarantee that you will never try to read values outside of the ArrayBuffer. You will never end up accidentally reading byte 11 when the buffer is only 10 bytes long, or: You never go out of bounds. This means TurboFan does not need to emit bounds checks, which you can think of as if statements making sure you are not accessing memory you are not supposed to. This kind of information is lost once compiled to WebAssembly, and since ASC’s optimization only happens at WebAssembly VM level, it can’t necessarily apply the same optimization.

Luckily, AssemblyScript provides a magic unchecked() annotation to indicate that we are taking responsibility for staying in-bounds.

But there’s more: The Typed Arrays in AssemblyScript (Uint8Array, Float32Array, ...) offer the same API as they do on the platform, meaning they are merely a view onto an underlying ArrayBuffer. This is good in that the API design is familiar and battle-tested, but due to the lack of high-level optimizations, this means that every access to a field on the Typed Array (like myFloatArray[23]) needs to access memory twice: Once to load the pointer to the underlying ArrayBuffer of this specific array, and another to load the value at the right offset. V8, as it can tell that you are accessing the Typed Array but never the underlying buffer, is most likely able to optimize the entire data structure so that you can read values with a single memory access.

For that reason, AssemblyScript provides StaticArray<T>, which is mostly equivalent to an Array<T> except that it can’t grow. With a fixed length, there is no need to keep the Array entity separate from the memory the values are stored in, removing that indirection.

...

Sneaky defaults

Another thing that the AssemblyScript folks pointed out to me is that the --optimize flag is equivalent to -O3s which aggressively optimizes for speed, but makes tradeoffs to reduce binary size. -O3 optimizes for speed and speed only. Having -O3s as a default is good in spirit — binary size matters on the web — but is it worth it? At least in this specific example the answer is no: -O3s ends up trading the laughable amount of ~30 bytes for a huge performance penalty:

    Language        Variant    Optimizer  Average   vs JS
    AssemblyScript  optimized  O3         89.60ms   0.9x
    JavaScript                            104.35ms  1.0x
    AssemblyScript  optimized  O3s        162.63ms  1.6x
    AssemblyScript  naive      O3s        298.13ms  2.9x

One single optimizer flag makes a night-and-day difference, letting AssemblyScript overtake JavaScript (on this specific test case!). We made AssemblyScript faster than JavaScript!

But do me a favor: Don’t stop reading now.

...

Some of you may have noticed that both these examples have very few or no allocations. V8 takes care of all memory management (and garbage collection) in JavaScript for you and I won’t pretend that I know much about it. In WebAssembly, on the other hand, you get a chunk of linear memory and you have to decide how to use it (or rather: the language does). How much do these rankings change if we make heavy use of dynamic memory?

To measure this, I chose to benchmark an implementation of a binary heap. ... 80x slower than JavaScript?! Even slower than Ignition? Surely, there is something else going wrong here. ...

Runtimes

All data that we create in AssemblyScript needs to be stored in memory. To make sure we don’t overwrite anything else that is already in memory, there is memory management. As AssemblyScript aims to provide a familiar environment, mirroring the behavior of JavaScript, it adds a fully managed garbage collector to your WebAssembly module so that you don’t have to worry about when to allocate and when to free up memory.

By default, AssemblyScript ships with a Two-Level Segregated Fit memory allocator and an Incremental Tri-Color Mark & Sweep (ITCMS) garbage collector. It’s not actually relevant for this article what kind of allocator and garbage collector they use, I just found it interesting that you can go look at them.

This default runtime, called incremental, is surprisingly small, adding only about 2KB of gzip’d WebAssembly to your module. AssemblyScript also offers alternative runtimes, namely minimal and stub that can be chosen using the --runtime flag. minimal uses the same allocator, but a more lightweight GC that does not run automatically but must be manually invoked. This could be interesting for high-performance use-cases like games where you want to be in control when the GC will pause your program. stub is extremely small (~400B gzip’d) and fast, as it’s just a bump allocator.

    My lovely memory bump: Bump allocators are extremely fast, but lack the ability to free up memory. While that sounds stupid, it can be extremely useful for single-purpose modules, where instead of freeing memory, you delete the entire WebAssembly instance and rather create a new one. If you are curious, I actually wrote a bump allocator in my article Compiling C to WebAssembly without Emscripten.

How much faster does that make our binary heap experiment? Quite significantly!

    Language        Variant    Runtime      Average     vs JS
    JavaScript                              233.68ms    1.0x
    AssemblyScript  optimized  stub         345.35ms    1.5x
    AssemblyScript  optimized  minimal      354.60ms    1.5x
    AssemblyScript  optimized  incremental  18758.50ms  80.3x

Both minimal and stub get us significantly closer to JavaScript’s performance. But why are these two so much faster? As mentioned above, minimal and incremental share the same allocator, so that can’t be it. Both also have a garbage collector, but minimal doesn’t run it unless explicitly invoked (and we ain’t invoking it). That means the differentiating quality is that incremental runs garbage collection, while minimal and stub do not. I don’t see why the garbage collector should make this big of a difference, considering it has to keep track of one array.

...

" [2]

---

" It’s common to branch to a basic block that exits. It’s common to add a constant to an integer, convert the integer to a pointer, and use the pointer for loading or storing. It’s common to mask the shift amount before shifting. It’s common to do integer math with an overflow check. B3 IR and Air make common operations use just one Value or Inst in the common case. The branch-and-exit operation doesn’t require multiple basic blocks because B3 encapsulates it in the Check opcode and Air encapsulates it in a special Branch instruction. Pointers in B3 IR are just integers, and all load/store opcodes take an optional offset immediate, so that a typical FTL memory access only requires one object. Arithmetic operations in B3 IR carefully balance the semantics of modern hardware and modern languages. Masking the shift amount is a vestige of C semantics and ancient disagreements over what it means to shift an N-bit integer by more than N bits; none of this makes sense anymore since X86 and ARM both mask the amount for you and modern languages all assume that this is how shifting works. B3 IR saves memory by not requiring you to emit those pointless masks. Math with overflow checks is represented using CheckAdd?, CheckSub?, and CheckMul? opcodes in B3 and BranchAdd?, BranchSub?, and BranchMul? opcodes in Air. Those opcodes all understand that the slow path is a sudden termination of the optimized code. This obviates the need for creating control flow every time the JavaScript? program does math. " -- [3]

---

lots of good assembler directives (and similar) from Jonesforth:

http://git.annexia.org/?p=jonesforth.git;a=blob;f=jonesforth.S;h=45e6e854a5d2a4c3f26af264dfce56379d401425;hb=HEAD

e.g.

macros with arguments, e.g.:

    .macro PUSHRSP reg
    lea -4(%ebp),%ebp       // push reg on to return stack
    movl \reg,(%ebp)
    .endm

default arguments, e.g.:

    .macro defword name, namelen, flags=0, label

.text, .align 4, LABEL:, .globl _start, .section .rodata, .set

.int .byte .ascii

e.g.

        .macro defcode name, namelen, flags=0, label
        .section .rodata
        .align 4
        .globl name_\label
name_\label :
        .int link               // link
        .set link,name_\label
        .byte \flags+\namelen   // flags + length byte
        .ascii "\name"          // the name
        .align 4                // padding to next 4 byte boundary
        .globl \label
\label :
        .int code_\label        // codeword
        .text
        //.align 4
        .globl code_\label
code_\label :                   // assembler code follows
        .endm

---

this came up with some interesting search results last time. I've already noted them, but maybe in future years it will come up with further interesting stuff:

https://www.google.com/search?client=ubuntu&channel=fs&q=wasm+compile+coroutine&ie=utf-8&oe=utf-8

---

https://despairlabs.com/posts/2021-06-16-io-uring-is-not-an-event-system/

example comparing poll, epoll, io_uring

also mentions kqueue in passing

---

on 6502 addressing modes:

Regarding the most important addressing modes, the book Apple Machine Language for Beginners by Richard Mansfield says "...there are only 6 that you should focus on and practice with until you understand their uses: immediate, absolute (plus absolute,X and , Y), zero page, and indirect Y.". Going thru each of those in turn:

In some ways, the zero page memory locations in the 6502 can be used analogously to registers in other processors with more registers. Applying this analogy, we find that:

so, according to this analogy, Mansfield's suggested most important addressing modes correspond to:

---

syscalls in jonesforth:

EXIT OPEN CLOSE READ WRITE CREAT BRK

---

my assembly language should be regex parsable

---

 cperciva on April 14, 2020 [–]

When will C gain a mechanism for "do not leave this sensitive information laying around after this function returns"? We have memset_s but that doesn't help when the compiler copies data into registers or onto the stack.

pascal_cuoq on April 14, 2020 [–]

This is an entire language extension, as you note. The last time various people interested in this were in the same room (it was in January 2020 in a workgroup called HACS), what emerged was that the Rust people would try to add the “secret” keyword to the language first, since their language is still more agile than C, while the LLVM people would prepare LLVM for the arrival of at least one front-end that understand secret data.

Is this enough to answer your question? I can look up the names of the people that were involved and communicate them privately if you are further interested.

---

jcranmer on March 31, 2020 [–]

One of the blog posts I keep meaning to write is in the vein of "Why C is not portable assembly." Ironically, most of my points about C failures aren't related to UB at all, but rather the fact that C's internal model doesn't comport well to important details about machines:

Animats on March 31, 2020 [–]

No support for multiple return values (in registers). There's structs, but that means returning via memory because, well, see point 1.

Return values in registers are a machine level optimization. Whether something is in a "register" is up to the CPU today on most CPUs. If you set a value in the last few instruction cycles, and are now using it, it probably hasn't even made it to the L1 cache yet. Even if it was pushed on the stack. That optimization made sense on SPARC, with all those registers. Maybe. Returning a small struct is fine.

There's an argument for multiple return values being unpacked into multiple variables, like Go and Python, but that has little to do with how it works at the machine level.

Missing operations such as popcount, cttz, simultaneous div/rem, or rotates.

Now that most CPUs have hardware for those functions, perhaps more of those functions should be visible at the language level. Here's all the things current x86 descendants can do.[1] Sometimes the compiler can notice that you need both dividend and remainder, or sine and cosine, but some of those are complicated enough it's not going to recognize it.

Traps are UB instead of having more predictable semantics.

That's very CPU-dependent. X86 and successors have precise exceptions, but most other architectures do not. It complicates the CPU considerably and can slow it down, because you can't commit anything to memory until you're sure there's no exception which could require backing out a completed store.
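
re the popcount/cttz/div-rem/rotate point above: GCC and Clang already expose most of these as builtins or recognizable idioms; the complaint is that ISO C itself has no spelling for them. Roughly:

    /* these exist as GCC/Clang builtins and idioms today; ISO C has no portable spelling for them */
    #include <stdint.h>

    uint64_t bits_demo(uint64_t x, uint64_t a, uint64_t b, uint64_t *rem) {
        int pop = __builtin_popcountll(x);       /* population count */
        int tz  = __builtin_ctzll(x | 1);        /* count trailing zeros (dodge the undefined x==0 case) */
        uint64_t rot = (x << 13) | (x >> 51);    /* rotate-left by 13: compilers pattern-match this to ROL */
        *rem = a % b;                            /* a/b and a%b written together usually fuse into one
                                                    divide, but only if the optimizer notices */
        return (uint64_t)pop + (uint64_t)tz + rot + a / b;
    }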

---

i can't find it now (looked for about half an hour!), but there's that guy's comment on agner's early forum discussion of forwardcom that i recently read saying e thought RISC-V (or was it MIPS?) would be a good ISA to learn how to implement something simple, but aarch64 would be good to learn a well-designed/"good" ISA (did he also mention m68k?). Found it; adrian bocaniciu, see next section below

--- 200717

there was a comment noted in ootAssemblyNotes28.txt that i want to recall attention to:

https://www.agner.org/optimize/blog/read.php?i=424#424 by Adrian Bocaniciu

he says "There are only 2 possible choices for a set of addressing modes that would allow writing a loop with a minimum number of instructions":

the first of these takes a lot of operand bits, especially if you want to use any register or if you want big offsets (agner's forwardcom allows 8- or 32-bit sign-extended offsets). And we'd still need addr mode bits on top of that! Eg with:

- 32 registers for base or index
- scaling to 1,2,4,8
- 8-bit offsets

we have:

if we really have 3 operand fields (plus opcode plus format bits), and we want to be able to use this addressing mode in all of those at once, then even with 64-bit instructions we can't fit that (for 64 bits and 3 ortho operands, we probably don't want more than 14 operand bits per operand, more likely 12)
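
spelling out that tally (assuming 5 bits each for a base and an index register chosen from all 32, 2 scale bits, and an 8-bit offset):

    base reg:   5 bits (any of 32)
    index reg:  5 bits (any of 32)
    scale:      2 bits (1,2,4,8)
    offset:     8 bits
    total:     20 bits for one operand, vs a budget of ~12-14 bits per operand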

one thing we can do is say that having a large offset consumes a second operand. If we only have a base reg and a scaled index reg, that's only 12 bits.

Alternately/in addition, we can reduce the number of registers usable in this addressing mode. If we can only use 8, then we save 4 bits, which allows us to have a 4-bit offset. Which still isn't much (2 64-bit floats is the max size of struct it can fully address into)

an 8-bit offset would be cool. But i guess i think a 6-bit struct offset is sort of the minimum (but even 5 would be better than 4). However, if you multiply it by the scale factor, that's a different story; a 2-bit scale factor gives you 1,2,4,8. So here a 4-bit offset lets you index up to the 15th 8-byte field in a struct. In this case i guess a 4-bit offset is (barely) sufficient.

i'd think you might also want post- and pre- inc/dec on the index reg (and/or on the base reg, where instead of inc/dec it gets replaced by the computed address)

What if you are in a 32-bit instruction (with a 3-operand encoding)? Now your operand fields are probably at most 8 bits, probably less (let's say 6 bits). If you have 2 base reg bits, 2 index reg bits, and 2 scale bits, that's it. No space for an offset at all, and you have only 4 regs which can be base regs, and 4 regs which can be index regs. It's not so orthogonal (in terms of register choice), but honestly 4 index regs is probably enough for most cases, so you'd probably still get some speedup out of these. You can now trade a second operand for a 6- or 8-bit offset.

another thing we can do is use a scale register instead of putting the scale in the instruction. This has the added advantage of making some very simple types of loops generic over scale factor.

glancing at boot_reference.md's address modes section, i can't immediately tell if i have either of these in there. In fact, what i do there is kind of weird and i can't quickly read it. But i recall i had thought about it a bit, so maybe it's good.

---

i also see that i had given up on the idea of the BootS instruction encoding being a subset of the Boot instruction encoding. I guess that's ok, as long as the actual instructions (semantically) are a subset.

after a moment's more consideration, i guess the way i divided it up by number is okay; that just means that some addr modes have more operand bits and others have more addr mode bits. But it would be clearer if i just listed that table, with the addr mode name, at the top, and then did the long descriptions below.

i didn't look thru the modes in detail but they also look okay, except that i don't see why we need both stack ptr mode and frame ptr mode.

---

josefx 8 hours ago [–]

What the hell is systemd doing that an 8MB long file path can exhaust its stack? Is it doing some recursive parsing or is it just doing something plain stupid like using a VLA to store user provided data?

reply

megous 7 hours ago [–]

Probably unbounded alloca() as always.

reply

megous 7 hours ago [–]

Yep: https://github.com/systemd/systemd/commit/b34a4f0e6729de292c...

strdupa(input) without any length check

Fix is to replace it with unbounded malloc() instead of checking for sane length first.

reply

thayne 1 hour ago [–]

In rust you can't currently dynamically allocate on the stack, although that's probably something that will be added in the future. And as others have pointed out, allocating on the stack is a fairly reasonable optimization here.

I don't think you could even call strdupa through libc in rust. I would guess that strdupa is either a macro that uses the alloca compiler intrinsic or is itself a compiler intrinsic. Even if it isn't, it will break assumptions the rust compiler makes about the size of the stack frame.

reply

alerighi 5 hours ago [–]

It's not a mistake. Allocating that string on the stack is not a bad idea. Most of the time the string will be short, and thus an allocation on the stack is faster.

Consider that in Linux a path is defined to be a maximum length of PATH_MAX, that is defined to 4096 bytes, and a filename (and directory name) shouldn't be longer than FILE_MAX that is 255 bytes. These limits are defined in the headers and I use them always in writing my C programs (if it crashes... you are doing something really wrong!).

So how the hell do you have a directory that is more than 8Mb? You shouldn't! The filesystem doesn't support it. It's a matter of the filesystem driver that should reject a path that long in my opinion.

Systemd should be fast. It's at the base of the operating system. Also it should consume little memory. You can say, who cares about allocating dynamically a string, or allocating a static buffer of 16Mb, yes we should care, I use Linux computer with 16Mb of RAM, total. Of course they don't run systemd nowadays since it's too big, but in my opinion systemd is good, and I would like to see it more in the embedded world.

reply
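
---

the "check for sane length first" option that keeps coming up in this thread would look something like this (my sketch, not the actual systemd patch):

    /* sketch: duplicate onto the stack only when the input is known small; the copy dies with the frame */
    #include <alloca.h>
    #include <limits.h>    /* PATH_MAX */
    #include <stdbool.h>
    #include <string.h>

    bool handle_path(const char *input) {
        size_t len = strlen(input);
        if (len >= PATH_MAX)
            return false;                /* reject (or fall back to malloc) instead of blowing the stack */
        char *copy = alloca(len + 1);    /* bounded by PATH_MAX, so stack allocation stays cheap and safe */
        memcpy(copy, input, len + 1);
        /* ... munge the copy in place and use it before this frame goes away ... */
        return true;
    }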

---

stjohnswarts 7 hours ago [–]

Good find thanks for sharing. And everyone at work gripes about me carrying the size around with a lot of my variables in the form of a struct. It's strictly a reminder to always be checking the size since I'm juggling with shotguns.

reply

thayne 1 hour ago [–]

The fact that c doesn't have a native concept of an array with length and strings usually use a null byte to determine the end is, IMO c's biggest failing, and its worst legacy on the wider software world.

reply

---

sinsterizme 5 hours ago [–]

Shouldn't this PR also use `free` on the duped string before returning? (I never use C so probably missing something but just based on the docs of strdupa...)

reply

rtldg 5 hours ago [–]

The variable p is now declared with "_cleanup_free_" which is using some compiler cleanup/destructor attribute stuff to run free

reply

sinsterizme 4 hours ago [–]

ah okay, thank you :)

reply
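
the attribute behind "_cleanup_free_" is GCC/Clang's cleanup attribute; systemd wraps it in its own macros, but the bare pattern is roughly (a sketch, not systemd's exact code):

    /* roughly the pattern behind _cleanup_free_: run free() on the pointer when it leaves scope */
    #include <stdlib.h>
    #include <string.h>

    static void freep(void *p) { free(*(void **)p); }
    #define cleanup_free __attribute__((cleanup(freep)))

    int use_path(const char *input) {
        cleanup_free char *p = strdup(input);    /* heap copy instead of strdupa/alloca */
        if (!p)
            return -1;
        /* ... work with p ... */
        return 0;
    }   /* freep(&p) runs here automatically, on every return path */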

---

should check out algol 68 and pl/0: http://www.jemarch.net/back-to-the-future-a68.pdf

---

picoJava had a 64 entry stack cache [4]

some other stack computers are described in https://www.cpushack.com/CPU/cpu7.html

---

" The Mill feels a lot like VLIW in origin: leave difficult things out of the architecture and punt them to the compiler. This generally gives you great benchmark numbers and terrible real-world performance. The compiler can spend a lot more time optimising things than the CPU can (it’s fine for the compiler to spend a few tens of milliseconds on a function, that latency in the CPU would make performance completely unacceptable) so in theory it can do more, but in practice it can’t take advantage of any dynamic behaviour. The Mill doesn’t do register rename at all, so can be almost entirely execution logic, but it punts a lot to the compiler and it’s not clear to me that it will do any better in the real world than EDGE architectures.

Compiler ISAs are now really dataflow encodings. Registers aren’t really registers, they’re just locally scoped names for data flow. This is something that really bugs me because a typical compiler IR is an SSA form, which it then maps into some fixed-number-of-registers encoding and the register rename engine then tries to re-infer an SSA representation. It feels as if there really ought to be a better intermediate between two SSA forms but so far every attempt that I’m aware of has failed.

The rule of thumb for C/C++ code is that you get, on average, one branch per 7 instructions. This was one of the observations that motivated some of the RISC I and RISC II design and I’ve had students verify experimentally that it’s more or less still true today. This means that the most parallelism a compiler can easily extract on common workloads is among those 7 instructions. A lot of those have data dependencies, which makes it even harder. In contrast, the CPU sees predicted and real branch targets and so can try to extract parallelism from more instructions. nVidia’s Project Denver cores are descendants of TransMeta’s? designs and try to get the best of both worlds. They have a simple single-issue Arm decoder that emits one VLIW instruction per Arm instruction (it might do some trivial folding) and some software that watches the in instruction stream and generates optimised traces from hot code paths. Their VLIW design is also interesting because each instruction is offset by one cycle and so you could put a bunch of instructions with data dependencies between them into the same bundle. This makes it something more like a VLIW / EDGE hybrid but the software layer means that you can avoid all of the complicated static hyperblock formation problems that come from EDGE architectures and speculatively generate good structures for common paths that you then roll back if you hit a slow path. " -- https://lobste.rs/s/zgqhjl/what_are_some_differences_between_x86_arm#c_pzorre

---

" 2. Go’s assembler documentation is barebones

Writing instructions so that Go’s assembler is happy with the instructions you’re giving it is a bit frustrating.

There’s a good gloss at golang.org/pkg/cmd/internal/obj/arm64 which gives an overview of many of the differences. For example, all vector instructions start with V, different than what ARM64 switched to (they used to commonly start with V on 32-bit ARM) – so while I understand the desire for continuity (and even subtly like it, knowing it’s a vector op) it’s just makes another little difference to remember vs. the original documentation.

But more frustrating is, even if your instruction is supported (and most of them are) knowing how to use the instruction in Go assembler boils down to “assume data goes left-to-right and hope there’s an example in the test suite”

I’m a big Go fan, yet Go’s history into Plan 9 and accompanying assembler (and, relatedly, odd calling conventions) is one of my gripes about Go, even more than lack of generics (which is a topic for another day). Sure, there were some good ideas in Plan 9 that influenced the design of Go – from a design level, it’s great! – but on the implementation level, this is one place where I kinda wish it had followed precedent. Take whichever side you want in the Intel vs GNU syntax debate, creating a third option means relearning all the quirks from scratch, and ignoring any documentation that already exists. " [5]

---

"Arm does not have subregisters. On x86, you have AL, AH, AX, EAX, and RAX all update the same registers. For anything shorter than RAX, this means you need a read-modify-write operation on the rename register. This adds complexity in the rename engine. Arm briefly had a floating-point mode that let you treat the FPU registers as either 32 64-bit or 16 64-bit floating point registers. This caused similar pain and was abandoned (32-bit FPU ops needed to track the enclosing register pair so that they correctly half-overwrote or were overwritten by 64-bit operations). " [6]

---

"Imagine you need to clear the upper word of a variable (var &= 0xFFFFFFFF). This is very common when doing arithmetic on size smaller than the word size (e.g. 32-bit on a 64-bit computer). On RISC-V it is done with two instructions: a shift left followed by a shift right...As a point of comparison, on x86 this would be ADD ax, bx, which is encoded as 3 bytes."

---

https://incoherency.co.uk/collapseos/hal.txt.html

---

in some ways BASIC with its GOTO is a better high level assembly than Pascal, Oberon, or C with their structured programming

26 letters is pretty close to 32 registers -- if there were six special registers that used two letters or symbols, for example, accumulator, top-of-stack, top-of-return stack, second-to-top-of-stack, 0, PC, then we have a more concise notation than eg R12.

---

"

SwiftForth’s CPU register usage:

    Register  Purpose
    EBX       top of stack
    ESI       user area pointer
    EDI       executable base address
    EBP       data stack pointer
    ESP       return stack pointer

All other CPU registers are available for use without saving and restoring. " [7]

---

so... special regs could be:

that's already 8.. so if there are 32 regs total, there are now 24 left

and of course i also have my ERR register

and one tactic for making a compressed instruction set is to designate a set of 'preferred' registers that function analogously to 6502 registers such as accumulator and indexing registers (so the compressed instruction set then only provides many instructions for the preferred regs)

e.g. RVC (RISC-V compressed ISA) uses the 8 most popular regs: " RVC uses a simple compression scheme that offers shorter 16-bit versions of common 32-bit RISC-V instructions when:
• the immediate or address offset is small, or
• one of the registers is the zero register (x0), the ABI link register (x1), or the ABI stack pointer (x2), or
• the destination register and the first source register are identical, or
• the registers used are the 8 most popular ones.
" -- https://github.com/riscv/riscv-isa-manual/releases/download/Ratified-IMAFDQC/riscv-spec-20191213.pdf

---

Is it better to have two stacks?
Author: Freddie Witherden  Date: 2016-06-22 11:40

With regards to exposing a second stack it is worth considering a recent proposal from Intel termed Control-flow Enforcement Technology: https://software.intel.com/sites/default/files/managed/4d/2a/control-flow-enforcement-technology-preview.pdf

The principle is simple: require that the instruction immediately following a jump is an ENDBRANCH instruction (a NOP on current systems). There are also a series of extensions to enable the shadow stack to be manipulated by debuggers and the like. It seems like a reasonable model of how someone would go about exposing and managing a second return call stack.

---

https://en.wikipedia.org/wiki/CHIP-8

PICO-8 etc inspired by that

ran on the https://en.wikipedia.org/wiki/RCA_1802

but CHIP-8 probably more useful for our purposes

---

that 64-bit format has a lot of unused space for most instructions (where we don't need 8-bit or 12-bit operands).

maybe have a VLIW format? How much ILP (instruction-level parallelism) is available in typical programs?

Limits of Instruction-Level Parallelism by David W. Wall finds a median of no more than about 5.

What did Itanium do?

https://en.wikipedia.org/wiki/IA-64 says "Ideally, the compiler can often group instructions into sets of six that can execute at the same time"

" Each 128-bit instruction word is called a bundle, and contains three slots each holding a 41-bit instruction, plus a 5-bit template indicating which type of instruction is in each slot. Those types are M-unit (memory instructions), I-unit (integer ALU, non-ALU integer, or long immediate extended instructions), F-unit (floating-point instructions), or B-unit (branch or long branch extended instructions). The template also encodes stops which indicate that a data dependency exists between data before and after the stop. All instructions between a pair of stops constitute an instruction group, regardless of their bundling, and must be free of many types of data dependencies; this knowledge allows the processor to execute instructions in parallel without having to perform its own complicated data analysis, since that analysis was already done when the instructions were written.

Within each slot, all but a few instructions are predicated, specifying a predicate register, the value of which (true or false) will determine whether the instruction is executed. Predicated instructions which should always execute are predicated on pr0, which always reads as true. ... The fetch mechanism can read up to two bundles per clock from the L1 cache into the pipeline. When the compiler can take maximum advantage of this, the processor can execute six instructions per clock cycle. ... The execution unit groups include:

    Six general-purpose ALUs, two integer units, one shift unit
    Four data cache units
    Six multimedia units, two parallel shift units, one parallel multiply, one population count
    Two 82-bit floating-point multiply–accumulate units, two SIMD floating-point multiply–accumulate units (two 32-bit operations each)[39]
    Three branch units"

What does Xtensa do?

"Xtensa processors range from small, low-power cache-less microcontroller to high-performance 16-way SIMD processors, 3-issue VLIW DSP cores, or 1 TMAC/sec neural network processors."

https://news.ycombinator.com/item?id=14331789 "Right now we can take any code that compiles through LLVM and run it on a single one of our cores while finding a good chunk of ILP (we are quad issue, averaging close to 2.5 IPC, sometimes higher), and be beating out the performance of some ARM and DSP cores that are 10x the size" (presumably he is talking about REX computing)

Given that static ILP has a median of 5, but this assumes excellent compilers, it may not be worth the complexity of supporting all of it -- ppl say that a downfall of Itanium was the effective need for smart compilers (see eg https://news.ycombinator.com/item?id=14331342 ).

So maybe 2-way or 4-way VLIW would be worth it in some cases. Itanium's 3-instruction bundles may be about right.

So, i'd say, if possible, aim for 2-way or 3-way VLIW. 2-instruction bundles sound pretty simple; then you just have to reserve one bit in each bundle to say either "these two instructions are independent and can be executed in parallel" or "these two instructions must be executed in serial". But more generally, the idea of "stops" is probably better; after each instruction, have one bit that is like a fence which means either (if 1) "everything before this instruction must be executed before everything after this instruction" or (if 0) "this instruction and everything prior to it, until the previous stop, may be executed in parallel with everything following this instruction, until the next stop".

The idea of only using 'stops', and using them after every instruction, decouples the amount of ILP encoded from the amount of instructions per bundle, and in fact removes the need for bundles at all. So that's a pretty strong motivation to include them, if possible.
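
a sketch of what consuming such stop bits could look like (instruction format and bit position made up; this is just the grouping logic, not any real decoder):

    /* sketch: split an instruction stream into issue groups delimited by a per-instruction stop bit */
    #include <stddef.h>
    #include <stdint.h>

    #define STOP_BIT (1u << 31)   /* made-up position: 1 = "everything before must finish first" */

    /* returns the index just past the group that starts at 'start'; *group_len gets its length */
    size_t next_group(const uint32_t *code, size_t n, size_t start, size_t *group_len) {
        size_t i = start;
        while (i < n && !(code[i] & STOP_BIT))
            i++;                          /* no stop yet: keep gathering parallel-issueable instructions */
        size_t end = (i < n) ? i + 1 : n; /* include the instruction carrying the stop bit */
        *group_len = end - start;
        return end;
    }
    /* a simple interpreter just ignores STOP_BIT; a wide implementation issues each group at once */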

---

another wart is the 3 source operands required by a compare-and-swap.

---

so should we add both a 'stop bit' and a 'predication bit' to every 32-bit instruction in Boot?

Cons:

see https://www.reddit.com/r/programming/comments/1n991v/arm64_and_you/

"There is, apparently, a select-by-CC instruction in there, which combined with a larger register file and (presumably) more load/store slots sounds like an adequate substitute for a rarely-utilized feature."

"Predication for "most" instructions is gone, but there is a new class of instructions called "conditional select", that replace predication for if/else style control flow (where predication is most powerful). It is NOT due to branch prediction though.

Main reasons for removing the predication for most instructions are (as I see it): (1) Predication is great for in-order cores, but in out of order (OOO) execution, it's a source of complexity and a performance drain (or, in other words, a massive pain) for the microarchitecture. And higher end Cortex-A series have been OOO for a while now. (2) (As you mentioned) freeing up bits in instruction encodings.

Note: predication avoids the dependency on the branch predictor (and in turn does not pollute its state), which is very useful for simple if/else control flow. Note2: What conditional select instructions let you do is execute both paths of the if/else without any dependencies on flag state, selecting the correct result to retire at the end. This is much more friendly to the OOO architectures, with the usual heap of benefits such as increased performance, less complexity (and thus very slightly easier verification of the design)."

"In ARMv8, while there are not general conditionally executed instructions there are a few instructions that perform the specific task I am describing, namely allowing you to avoid branching for short dumb jumps; CSEL is the most obvious example, though there are other cases (e.g. conditional setting of conditions) to handle other common patterns (in that case the pattern of C short-circuited expression evaluation). "

https://modexp.wordpress.com/2018/10/30/arm64-assembly/#conditional :

"CSEL is essentially like the ternary operator in C. Probably my favorite instruction of ARM64 since it can be used to replace two or more opcodes. Mnemonic Operands Instruction CCMN (immediate) Rn, #imm, #nzcv, cond Conditional Compare Negative (immediate) sets the value of the condition flags to the result of the comparison of a register value and a negated immediate value if the condition is TRUE, and an immediate value otherwise. CCMN (register) Rn, Rm, #nzcv, cond Conditional Compare Negative (register) sets the value of the condition flags to the result of the comparison of a register value and the inverse of another register value if the condition is TRUE, and an immediate value otherwise. CCMP (immediate) Rn, #imm, #nzcv, cond Conditional Compare (immediate) sets the value of the condition flags to the result of the comparison of a register value and an immediate value if the condition is TRUE, and an immediate value otherwise. CCMP (register) Rn, Rm, #nzcv, cond Conditional Compare (register) sets the value of the condition flags to the result of the comparison of two registers if the condition is TRUE, and an immediate value otherwise. CSEL Rd, Rn, Rm, cond Conditional Select returns, in the destination register, the value of the first source register if the condition is TRUE, and otherwise returns the value of the second source register. CSINC Rd, Rn, Rm, cond Conditional Select Increment returns, in the destination register, the value of the first source register if the condition is TRUE, and otherwise returns the value of the second source register incremented by 1. Used by CINC and CSET. CSINV Rd, Rn, Rm, cond Conditional Select Invert returns, in the destination register, the value of the first source register if the condition is TRUE, and otherwise returns the bitwise inversion value of the second source register. Used by CINV and CSETM. CSNEG Rd, Rn, Rm, cond Conditional Select Negation returns, in the destination register, the value of the first source register if the condition is TRUE, and otherwise returns the negated value of the second source register. Used by CNEG. CSET Rd, cond Conditional Set sets the destination register to 1 if the condition is TRUE, and otherwise sets it to 0. CSETM Rd, cond Conditional Set Mask sets all bits of the destination register to 1 if the condition is TRUE, and otherwise sets all bits to 0. CINC Rd, Rn, cond Conditional Increment returns, in the destination register, the value of the source register incremented by 1 if the condition is TRUE, and otherwise returns the value of the source register. CINV Rd, Rn, cond Conditional Invert returns, in the destination register, the bitwise inversion of the value of the source register if the condition is TRUE, and otherwise returns the value of the source register. CNEG Rd, Rn, cond Conditional Negate returns, in the destination register, the negated value of the source register if the condition is TRUE, and otherwise returns the value of the source register. "

so, sounds like a stop bit might be useful, but we can have conditional instructions instead of a predication bit.
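
in source terms conditional select is just a ternary; whether it becomes a branch or a CSEL/CMOV is the compiler's call, e.g.:

    /* branchless select: typically lowered to CMP+CSEL on ARM64 (CMOV on x86-64), nothing to mispredict */
    #include <stdint.h>

    int64_t max64(int64_t a, int64_t b) {
        return a > b ? a : b;    /* roughly: cmp x0, x1 ; csel x0, x0, x1, gt */
    }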

---

So mb we should have all of:

simplest might be to have 8 (register-like immediately accessible locations within) each of those. So:

---

(i copied this to the todos in boot_reference):

---

like the 4stack, may want to partition some of the regs so that it is only accessible by one parallel (ILP) "thread". 4stack has 4 threads, each with its own stack, each stack of length 8, with the top 4 stack locs of each one writable only by its own thread but addressable by all threads, and the bottom 4 only readable or writable by its own thread

i don't really see how to do that using our 'stops' model, tho, where we don't specify how many threads are actually available, and don't give different threads local registers.

---

why have separate integer and fp reg banks again?

" That makes it a good idea to put registers close to the arithmetic unit that is going to use their contents. With separate integer and floating point registers the processor can have integer registers close to the general ALU, and floating point registers close to the floating point unit.

There are also issues of limited numbers of paths for reading and writing registers. With separate register banks, the ALU and the floating point unit have independent register access paths, allowing for more things to happen at the same time. Cycle times are no longer dropping rapidly, and one of the other sources of processor speed improvement is doing more in parallel. "

couldn't the same argument be made for separate pointer and integer registers?

---

" Yes. The memory management system of ForwardCom? divides a program's memory into two blocks: (1) Code and constants, (2) read/write data including static data, stack and heap. Block (1) is addressed relative to the instruction pointer, while block (2) is addressed relative to a special register called data section pointer (DATAP) and the stack pointer. Nothing in one block is addressed relative to something in the other block. Each of the two blocks can be placed at any address independently of the other block. We can easily place block (1) in non-volatile memory if only it has fast read access. It doesn't need fast write access. Block (2) should be placed in volatile RAM. Microcontrollers with Harvard architecture are like this anyway. "

---

Matrix multiplication
Author: Hubert Lamontagne  Date: 2016-07-29 09:33

ARM's simd has instructions that can help matrix multiply and are also useful for a whole bunch of other simd tasks:

VMUL, VMLA, VMLS with broadcasted operand - Vector multiply by broadcasted scalar extracted from a vector register (+accumulating and subtracting versions) infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489e/CIHJDBDJ.html

VTRN - Matrix transpose infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489e/CIHDJAEA.html

VPADD - Pairwise add infocenter.arm.com/help/topic/com.arm.doc.dui0489e/CIHGDEGD.html

VZIP, VUZP - Interleave/deinterleave vectors infocenter.arm.com/help/topic/com.arm.doc.dui0489e/CIHCACAB.html

VDUP - Broadcast scalar extracted from a vector register to whole vector register infocenter.arm.com/help/topic/com.arm.doc.dui0489e/CIHDIGCC.html

-- [8]
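
to make the matrix-multiply connection concrete, here is a rough 4x4 column-major multiply in C with NEON intrinsics, using the multiply-accumulate-by-broadcast-lane style described above (my sketch, not from the linked docs):

    /* 4x4 column-major float matrix multiply, c = a*b: each column of c is a's columns scaled by
       broadcast lanes of b's column -- i.e. the VMUL/VMLA-by-scalar style in the quote above */
    #include <arm_neon.h>

    void mat4_mul(float *c, const float *a, const float *b) {
        for (int j = 0; j < 4; j++) {
            float32x4_t bcol = vld1q_f32(b + 4 * j);
            float32x4_t acc  = vmulq_lane_f32(vld1q_f32(a + 0),  vget_low_f32(bcol), 0);
            acc = vmlaq_lane_f32(acc, vld1q_f32(a + 4),  vget_low_f32(bcol), 1);
            acc = vmlaq_lane_f32(acc, vld1q_f32(a + 8),  vget_high_f32(bcol), 0);
            acc = vmlaq_lane_f32(acc, vld1q_f32(a + 12), vget_high_f32(bcol), 1);
            vst1q_f32(c + 4 * j, acc);
        }
    }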

---

in forwardcom, "All code is addressed relative to the instruction pointer and all read/write memory is addressed relative to the data pointer and the stack pointer. This means that everything is addressed relative to just three registers which can quickly be changed." -- [9]

---

" llvm-mos compiles a considerable subset of C++ to 6502 machine code out of the box. IIRC it's just missing "runtime support"; things like RTTI, exceptions, and the runtime layout of VTables. The last one is kinda important, but I don't think it's too likely that there are any compiler changes necessary to get that working; just "standard library" stuff. Should be roughly the equivalent of writing crt0.s for C.

Oh, and static constructors/destructors running before/after main. That would also take some _start work.

Still, quite a lot does work out of the box. We've even implemented our C multiplication libcalls in C++ for our standard library. C++ templates really reduce code duplication, now even for the 6502!

http://llvm-mos.org " [10]

---

https://www.agner.org/optimize/blog/read.php?i=795#918

Number of register file ports in implementations
Author: Hubert Lamontagne  Date: 2017-08-22 12:46

I've taken another quick look today at the specs to figure out how feasible a Verilog implementation running on a FPGA would be, and I'm finding that register file read port counts are on the higher end... 2 ports per AGU (address generating unit), plus 4 ports per ALU (3 operands + 1 mask) and 4 ports per vector unit (3 operands + 1 mask)... This also affects the size of the input dependency tracking mechanism (and, I guess, the register renamer).

So I'm wondering, how many ports can modern register files manage, how does it scale with chip size and frequency, and how many ports are realistic on a modern mid-sized FPGA? (like, a 30$ FPGA or something?) Reply To This Message

Number of register file ports in implementations
Author: Agner  Date: 2017-08-23 00:37

Hubert Lamontagne wrote:

    register file read port counts are on the higher end... 2 ports per AGU (address generating unit), plus 4 ports per ALU (3 operands + 1 mask) and 4 ports per vector unit (3 operands + 1 mask)... This also affects the size of the input dependency tracking mechanism (and, I guess, the register renamer). 

The idea of ForwardCom is that it should be a high-end vector processor comparable to x86, ARM, etc. but with a better design. The AGU needs addressing modes with base and index as well as the special addressing mode for vector loops. The ALU needs 3 operands for fused multiply-and-add which is required for a state-of-the-art processor, and I think that masking is also required for a high-performance vector processor.

I have limited the number of output registers to one in order to simplify dependency tracking and register renaming, following our previous discussion. This means no flags register, no conventional add-with-carry instruction, and no single-instruction stack POP.

Current x86 processors have at least the same number of register inputs, plus segment register, and at least two register write ports.

My idea is to combine the best from RISC and CISC. The weakness of CISC is that the decoding is complicated, and it is difficult to decode multiple instructions in one clock cycle. They are caching decoded instructions to deal with this, which is a waste of silicon space. ForwardCom deals with this by making decoding simple. The weakness of RISC is that you need more instructions to do the same. For example, RISC-V needs a long sequence of instructions just to calculate the address of a memory operand. If you can do the same with a single instruction, you get a much higher overall throughput.

Why do you think register read ports are so expensive? Reply To This Message

Number of register file ports in implementations
Author: Hubert Lamontagne  Date: 2017-08-27 10:12

Agner wrote:

    The idea of ForwardCom is that it should be a high-end vector processor comparable to x86, ARM, etc. but with a better design. The AGU needs addressing modes with base and index as well as the special addressing mode for vector loops. The ALU needs 3 operands for fused multiply-and-add which is required for a state-of-the-art processor, and I think that masking is also required for a high-performance vector processor.
    I have limited the number of output registers to one in order to simplify dependency tracking and register renaming, following our previous discussion. This means no flags register, no conventional add-with-carry instruction, and no single-instruction stack POP.
    Current x86 processors have at least the same number of register inputs, plus segment register, and at least two register write ports.
    My idea is to combine the best from RISC and CISC. The weakness of CISC is that the decoding is complicated, and it is difficult to decode multiple instructions in one clock cycle. They are caching decoded instructions to deal with this, which is a waste of silicon space. ForwardCom deals with this by making decoding simple. The weakness of RISC is that you need more instructions to do the same. For example, RISC-V needs a long sequence of instructions just to calculate the address of a memory operand. If you can do the same with a single instruction, you get a much higher overall throughput.
    Why do you think register read ports are so expensive? 

I think this is one of those kind of things where it really depends on which tech you're aiming... Intel and AMD have huge register files with tons of ports on their modern cores. That probably takes up tons of silicon and suck out tons of power (and is probably hard to get running at very high clock rates). But that kind of chip is your target, so it's appropriate. Come to think of it, you're reminding me of the whole segment thing, and that's making me quite curious as to how x86 cores track that - it's the kind of thing you'd possibly want to keep in a separate register file and give it its own register renamer to reduce the level of contention with the rest of the registers.

On FPGAs, which I think would make sense for a test implementation, register files with lots of ports are an issue... especially with multiple write ports with no limitations on which registers are concurrently written to. Here's a paper on this.

As for RISC-V, they're obviously targeting smaller implementations (kindof like ARM), which are often in-order with 1 or 2 instructions per cycle. At that performance point (and in small out-of-order CPUs), this MIPS-inspired design is pretty much the optimal design (along with ARM-style), as far as I can tell. Though you do have a point, that they might be overdoing simplicity by not having R+R*n addressing modes - although the performance hit of that probably varies a lot per application. I think a very large proportion of memory reads in non vector code are stack pointer or object pointer + offset, which probably reduces the impact of lacking more complex addressing modes.

One thing ARM has done is to keep the number of input dependencies per uop low at 2 for the integer part of the core (these ops HAVE to run at low latency), and let vector ops have 4 inputs (you can see the cost of this if you look at result latency for NEON ops - often 3+ cycles!).

---

Forth-style metaprogramming is appropriate for any language that is linear, for example an assembly language or a virtual machine. So mb Oot Assembly, or mb OVM

---

do we even need Boot once we have BootS?

the point of BootS is to be dead simple to implement. Boot is supposed to be an awesome assembly language. Why do we even need an awesome assembly language? The OVM/OVMlow implementation source code will be written in an HLL that compiles to Boot anyhow, so who even cares if the target is awesome? Yes, having Boot around gives me the discipline to keep anything not dead-simple out of BootS, b/c i can put it in Boot. Yes, Boot will run much more efficiently than BootS b/c it is more expressive, but really, if you care about efficiency, compile OVM to native code. Yes, x86 is too complicated to be portable, but we could just target RISC-V instead. Yes, Boot will be super dense, but does that matter enough to justify creating a new assembly language? Yes, RISC-V is slightly complicated in instruction formatting, but the addressing mode complexity of Boot outweighs that, right? Yes, even MIR has addressing modes, in that memory operands have displacement, base, index, and scale, but do we really need that, and also, why not just target MIR if that's the only gain?

---

regarding SSA registers vs stack SSA registers:

actually i guess stack ops are immutable in stack machines anyhow, the idea of writing to them directly is our innovation, so yeah maybe consider just having immutable stack regs. Also, recall that one of my original motivations for having stack regs in Boot was so that i could implement more complicated Boot instructions in terms of macros that expand to simpler Boot instructions -- and having a stack allows one stateful macro to call another stateful macro while using the stack to keep track of its state across the call. Macro expansion is like inlining. I think this property (the ability to costlessly/easily refactor between structured programming and inlining) may actually be the same as the 'concatenative' in concatenative programming languages, so yeah there's a reason for the stack. And i also had some hand-wavy idea about advanced compilers further optimizing away some register moves etc in such code if it kept track of stack maps -- but this (allow the compiler to track stuff more easily) is very similar to why SSA form is desirable so yes that's another reason to have immutable stacks.

otoh is it really so bad to mutate the stack in this context? As long as a macro never mutates the stack outside of its own scope (that is, the part of the stack that it placed there itself), we are still in structured programming land, and the compiler can still do the SSA renumbering, just creating a new SSA variable after the mutation. It seems like you might lose the ability to see past false data dependencies when parallelizing, which is one reason why CPUs do register renaming, but i'm not sure, b/c don't you have to do renaming anyways with stacks? (and also you lose the ability for the implementation to blindly speculate; if the stack is immutable then the implementation can speculatively execute both sides of a branch in parallel without making two copies of anything deep in the stack -- but if the stack is mutable then any deep stack locs which are mutated by either branch must be copied in that scenario (like register renaming in CPUs)).

i guess the idea of immutable registers in hardware, stacks or no, is that it encourages or totally prevents the program from creating false data dependencies. E.g. if a compiler needs to do c=a+b and also z=x+y, where a,b,x,y need to be loaded from memory, it might do these operations in sequence and reuse the same register to temporarily hold a and x. This creates a false data dependency between these operations, preventing parallelization, unless you rename (which is like converting into SSA form).
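A minimal sketch of that false-dependency point in plain C (reusing one temporary stands in for reusing a register; a renaming CPU or an SSA-building compiler would make this disappear):

    /* Toy illustration of a false (write-after-read) dependency created by
       reusing one temporary "register" t for two independent additions. */
    #include <stdio.h>

    int main(void) {
        int a = 1, b = 2, x = 3, y = 4;

        /* Reuse version: the second write to t cannot move ahead of the read
           of t in c = t + b, so the two independent sums become ordered. */
        int t;
        t = a; int c = t + b;   /* c = a + b */
        t = x; int z = t + y;   /* z = x + y, needlessly serialized after c */

        /* SSA-style renaming: fresh names t1/t2 remove the false dependency,
           so a wide machine could compute c2 and z2 in parallel. */
        int t1 = a, t2 = x;
        int c2 = t1 + b;
        int z2 = t2 + y;

        printf("%d %d %d %d\n", c, z, c2, z2);
        return 0;
    }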

another way to look at it is that it allows you to convert into SSA while providing an upper bound on the number of renamed registers that you get at the end. This is important in CPUs, where these registers become physical registers actually holding state, but also if one wanted an encoding/storage format for your SSA-ified instructions with fixed-length fields. But this isn't actually true: in actual SSA, as the program gets longer, stack-based or not, you need more and more variables. i guess the question is, are there variants of SSA optimizations that could still be used on this sort of 'locally SSA' representation without globally renaming? I suspect yes...

i guess since i don't know much about compiler optimizations, my ideas about the value of SSA-like things are very speculative (no pun intended). I do understand how stacks are valuable for concatenative programming, which is valuable for (easily) inlining macros. And there the value is as much about the instruction encoding as it is about state; if some function eats two things from the stack and puts two new things there instead, creating a bunch of temporaries for its own use and consuming them in the meantime, then with a stack-based ISA you can inline it by just plopping that function in wherever it's called, whereas with registers you have to save the caller-saved stuff to memory first (which will include the caller's arguments).
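A toy C sketch of that concatenative/inlining property (this is a made-up mini stack machine, not Boot; the point is only that "inlining" a stack-based macro is literal splicing of its ops):

    /* A macro like "square" is just the op sequence { DUP, MUL }; inlining it
       at a call site is pure concatenation, with no caller-saved registers
       to spill around the splice. */
    #include <stdio.h>

    enum op { PUSH, ADD, MUL, DUP, PRINT, END };
    struct insn { enum op op; int imm; };

    static void run(const struct insn *p) {
        int stack[64], sp = 0;
        for (; p->op != END; p++) {
            switch (p->op) {
            case PUSH:  stack[sp++] = p->imm; break;
            case ADD:   sp--; stack[sp-1] += stack[sp]; break;
            case MUL:   sp--; stack[sp-1] *= stack[sp]; break;
            case DUP:   stack[sp] = stack[sp-1]; sp++; break;
            case PRINT: printf("%d\n", stack[sp-1]); break;
            case END:   break;
            }
        }
    }

    int main(void) {
        struct insn prog[] = {
            { PUSH, 3 }, { PUSH, 4 }, { ADD, 0 },   /* 3 + 4 */
            { DUP, 0 },  { MUL, 0 },                /* inlined "square" macro */
            { PRINT, 0 }, { END, 0 }
        };
        run(prog);   /* prints 49 */
        return 0;
    }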

so are stack registers valuable for anything besides a way to set aside some storage space for use in temporaries as part of allowing the implementation to easily inline nested macros? Well yes, they are also valuable for compressed encodings.

neither of these are disrupted by allowing mutation deep in the stack, at least, not any more than allowing reads deep in the stack.

so the value of immutable registers, if any, may be orthogonal.

---

in a way, the value of a register representation is orthogonal to a stack representation. The stack representation is more compressed for expression evaluation, and is amenable to easy inlining. But a register representation is probably more amenable to easily doing various static optimizations, since the label (number) that identifies a storage location doesn't change over time. Seen that way, an SSAish immutable register representation is the worst of both worlds, since the number identifying a register does change over time.

---

also, when permuting registers in a single instruction, you have to watch how many bits it takes to encode the permutation. Reordering 6 items, you have 6! = 720 permutations, which is more than can be represented in one byte (8 bits = 256 possibilities). 5! = 120. In Boot, we have 7 bits per operand, so we can fit that. However, i guess we actually have 3 operands, for 21 bits total, which can fit the 9! permutations of 9 items. Otoh if we make this instruction available in a compressed encoding we might have fewer operand bits than that. Also, in hardware it might be expensive to support permuting too many items. Also, sticking with powers of 2 might be nice for some reason.

So i guess it's safe to say that we could support permuting up to 4 items. There are only 24 such permutations, i.e. 5 bits. And specifying a permutation of 3 items takes only 3 bits.
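The arithmetic above as a quick check, counting the bits needed to index all n! orderings of n items:

    /* Prints, for n = 2..9, the number of permutations and the number of
       bits needed to encode one of them as an index (ceil(log2(n!))). */
    #include <stdio.h>

    int main(void) {
        unsigned long long perms = 1;
        for (unsigned n = 2; n <= 9; n++) {
            perms *= n;                              /* n! */
            unsigned bits = 0;
            while ((1ULL << bits) < perms) bits++;   /* ceil(log2(n!)) */
            printf("%u items: %llu permutations -> %u bits\n", n, perms, bits);
        }
        return 0;
    }
    /* 3 items -> 3 bits, 4 items -> 5 bits, 6 items -> 10 bits (> 1 byte),
       9 items -> 19 bits (fits in 21 operand bits). */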

---

see also Queue Machines:

https://dl.acm.org/doi/10.1007/s11227-005-0160-z
https://web.archive.org/web/20160610142459/https://ir.lib.uec.ac.jp/infolib/user_contents/9000000330/9000000330.pdf
https://web-ext.u-aizu.ac.jp/~benab/research/projects/QueueCore/

---

more queue machines (are these the same?)

https://es.cs.uni-kl.de/publications/datarsg/Huse16.pdf

these guys apply the consuming operation at the TAIL of the queue (that could be wrong, i just glanced at it)

---

and these guys:

https://users.ece.cmu.edu/~blevine/documents/fccm02_queue.pdf

Queue Machines: Hardware Compilation in Hardware Herman Schmit, Benjamin Levine and Benjamin Ylvisaker

seems at first glance different from the previous? seems like operands are taken from the head of the queue, but it's not exactly FIFO or LIFO, b/c the results of operations are pushed onto the TAIL of the queue rather than the head? (that could be wrong, i just glanced at it)

actually i guess that is FIFO; what i am thinking of is something else, something cache-like

---

there are a few ways of representing reads from arbitrary queue SSA-like registers in instruction encoding

---

i guess the real difference between SSA-like immutable registers/stacks/queues and mutable registers is that, with an immutable register, a later read of that register corresponds in some way to the single instruction that wrote to it, and this mapping is statically known; but with a mutable register, if you assign to the register and then later read from it, maybe some other instruction wrote to it in the middle, and this could be conditional, so the source of the data read by the later instruction may depend on which way the conditional goes at runtime. You can see how this could relate to basic blocks etc; eg within a basic block, since you don't have branches, the only way to conditionally write to a mutable register is with predicated instructions like CMOV. Some of these (like CMOV) would clearly be disallowed with immutable registers (because if the MOV is not done, then what is the value of the immutable register if it is later read? Instead, you'd replace CMOV with something that assigns 0 on one pseudobranch and MOVs on the other pseudobranch).
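A small interpreter-level sketch of that distinction (the register file, opcodes, and names here are made up for illustration, not Boot's):

    #include <stdint.h>

    uint32_t reg[32];

    /* Classic CMOV: conditionally overwrites an existing register, so the
       producer of rd's value is not statically known -- bad for a
       write-once ("immutable") register model. */
    void cmov(int rd, int rs, int cond) {
        if (reg[cond]) reg[rd] = reg[rs];
    }

    /* SSA-friendly SELECT: always writes rd (a fresh, never-before-written
       register), choosing between two explicit inputs, so the single
       producer of rd is statically known. */
    void select_(int rd, int cond, int rs_true, int rs_false) {
        reg[rd] = reg[cond] ? reg[rs_true] : reg[rs_false];
    }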

which gets at the concern that mapping the SSA-like immutable registers to actual SSA is easy within a basic block but more complicated across basic blocks.

with stack machines, you have a similar problem with things that conditionally push to the stack. For the SSA stuff to work, you want a single statically known stack map at every point in the program.

also with both, an issue related to basic blocks is that when returning from a call you want to demolish all the temporaries except for the return argument(s), and leave the registers/stacks/queues etc in the state that you found them, except that now the return arguments are there.

---

so, keeping in mind the previous entry, i'm beginning to think that an immutable stack machine might be the way to go. The concatenative property is big for easy inlining of macros. The benefit of FIFO is probably that when you are done with something in a FIFO queue you can just ignore it and eventually it gets pushed off the end of the queue, rather than having to POP it explicitly as in a stack machine. But the only time POPing would be laborious is if you wanted to selectively POP something deep in the stack (in which case you'd have to ROT stuff, etc), and the only time that would happen is if you are reusing that value many times, and if you're doing that you'd probably just use the GPRs for it.

mb not POPing could also be useful for ILP (parallelization)? but it seems like the compiler/interpreter has to keep track of renaming as new things are pushed into the queue anyhow, so mb not

---

a disadvantage of a strict stack machine is the need to explicitly DUP. But if you allow non-destructive reads from deep in the stack, then that's taken care of

a bigger disadvantage is the need to actively delete things after you are done with them

---

straight CMOV should be prohibited from writing to immutable registers

if needed, there should be another CMOV for that (so that the implementation can know that the SSA is broken at that point, like with a branch instruction)

similarly, conditional stack pushes should be easily identified with special instructions

---

mb only a special ERR instruction can write to ERR (in addition to insns that use it for carry, etc)

---

ok i'm leaning back towards having all of: GPRs, stack regs, and cache-like immutable regs

why not only stack regs, not cache-like immutable regs? b/c you have to actively consume the stack regs, while preserving the things you want to save; in some cases using the stack regs you would end up with some long-lived temporaries deep in the stack that you want to clear (upon a fn return, eg) while preserving some return arguments shallow in the stack. In such cases the cache-like immutable regs will be more efficient because they will be automatically destroyed. Also, the stack is sort of callee-saved (upon return to the caller, the stuff they left on the stack is still there), whereas the cache-like regs are caller-saved.

but yeah, let's not allow writes to deep in the stack, except possibly with special instructions (b/c you can always do a bunch of stack ops that are equivalent to a deep-stack write; but the point is that doing this screws up the SSA so it should be easy to detect). But even better is probably to stick with only affecting the top 3 elements (like ROT).

this prohibition of writes to the immutable regs (both stack and cache-like) frees up the instruction encoding a little for more addr mode-like semantics, btw (such as the 'copy to the front of the cache' mode that i described above).

---

I placed an extended quote of the most relevant parts of Second-Generation Stack Computer Architecture by Charles Eric LaForest into [[plChStackMachines?]].

Here is a summary of my conclusions from that for these purposes:

More recent work by Charles Eric LaForest on stack machines is at:

http://fpgacpu.ca/stack/index.html

He also does a lot of work on FPGAs (hence his website name, http://fpgacpu.ca/ ); he works as an FPGA designer for hire.

---

"One last idea I toyed with in a graduate OS design course, was to add support for system calls and memory range protection to a stack machine. Needs 3 extra registers and one mode bit, uses the stacks for system calls and traps, and should grant full machine virtualization support for simple stack machines." [12]

---

so some things we want to encode in operands:

read:

write:

both:

---

another perspective on Oot Assembly (Boot) is that maybe we should ditch all the 'cool stuff' (eg stack regs, SSA cache regs) and just stick to something like QBE or MIR, except with a defined encoding as 32-bit instructions instead of a textual format (MIR has a binary format but afaict it's undocumented). Why not just use QBE or MIR themselves? I feel like it would be too much work to reimplement them on a new platform; a simple 32-bit instruction format would make things easier for people making new implementations.
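For concreteness, a hedged sketch of what "a defined encoding as 32-bit instructions" could look like, reusing the 8-bit opcode plus three 7-bit operands mentioned elsewhere in these notes; this layout is an assumption for illustration, not the actual Boot, QBE, or MIR encoding:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical layout: bits 0-7 opcode, 8-14 operand a, 15-21 operand b,
       22-28 operand c, 29-31 spare. */
    struct insn { uint8_t op, a, b, c; };

    static uint32_t encode(struct insn i) {
        return (uint32_t)i.op
             | ((uint32_t)(i.a & 0x7F) << 8)
             | ((uint32_t)(i.b & 0x7F) << 15)
             | ((uint32_t)(i.c & 0x7F) << 22);
    }

    static struct insn decode(uint32_t w) {
        struct insn i = {
            (uint8_t)(w & 0xFF),
            (uint8_t)((w >> 8)  & 0x7F),
            (uint8_t)((w >> 15) & 0x7F),
            (uint8_t)((w >> 22) & 0x7F),
        };
        return i;
    }

    int main(void) {
        struct insn add = { 1, 3, 4, 5 };   /* e.g. "r3 = r4 + r5" with opcode 1 */
        struct insn back = decode(encode(add));
        printf("op=%u a=%u b=%u c=%u\n", back.op, back.a, back.b, back.c);
        return 0;
    }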

---

"stack size is invisible in C, and especially it's not part of the portable C API and generally not part of either the platform API or ABI. Unlike malloc(), which can at least officially fail, the stack is magic; your code can neither guard against hitting its size limit nor check its limits in any portable way. Nor can you portably measure how much stack size you're using or determine how much stack size it may require to call library functions (this is part of how the C library API is under-specified)."

---

~ david_chisnall 1 hour ago

I don’t understand the musl decision to have such small stacks. FreeBSD uses 1 MiB * sizeof(void*) as the default stack size (I think thread stacks are smaller by default, but they’re controllable via the pthread APIs). On 32-bit platforms, a 4 MiB stack for every thread can be a bit too much overhead. On a real 32-bit system (as opposed to a 32-bit userspace on a 64-bit kernel) you typically have only 2 GiB of virtual address space and so a 4 MiB stack per thread would limit you to at most 512 threads (assuming TLS is allocated in the stack and you have no other consumers of memory such as code or the heap), which is a bit constrained. On a modern CPU with a 48-bit address space, you have space for 2^15 8 MiB stacks, which is plenty (if you have anything like that many kernel threads, you have other problems).

On a system that can allocate physical memory on demand, there’s no reason to not reserve quite a lot of virtual address space for stacks.
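A small POSIX-only sketch of "controllable via the pthread APIs" (not portable C; the 512 KiB figure is just an example value):

    #include <pthread.h>
    #include <stdio.h>

    static void *worker(void *arg) { (void)arg; return NULL; }

    int main(void) {
        pthread_attr_t attr;
        pthread_t tid;
        pthread_attr_init(&attr);
        /* Ask for an explicit 512 KiB stack instead of the platform default;
           must be at least PTHREAD_STACK_MIN. */
        pthread_attr_setstacksize(&attr, 512 * 1024);
        if (pthread_create(&tid, &attr, worker, NULL) != 0) {
            perror("pthread_create");
            return 1;
        }
        pthread_join(tid, NULL);
        pthread_attr_destroy(&attr);
        return 0;
    }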

---

~ drobilla 53 minutes ago

GCC has -Wstack-usage=BYTES and -Wframe-larger-than=BYTES warnings which can be used to put some guardrails around stack usage at compile time.

They aren’t perfect, but I’ve started setting them (usually to the lowest power of 2 that still works) in new projects, and found it handy that they will say “hey, that thing you just did increased the stack usage significantly” when necessary, while being easy to adjust or turn off and not requiring any elaborate mechanism in the code to attempt to do this dynamically.
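For example (the limits and file name below are made up; -Wstack-usage= and -Wframe-larger-than= are the real GCC options):

    /* stack_demo.c -- compile with:
           gcc -c -Wstack-usage=256 -Wframe-larger-than=256 stack_demo.c
       GCC should warn that big_frame() exceeds both limits. */
    static int sum(const int *p, int n) {
        int s = 0;
        for (int i = 0; i < n; i++) s += p[i];
        return s;
    }

    int big_frame(void) {
        int buf[1024];   /* ~4 KiB frame, well over the 256-byte guardrail */
        for (int i = 0; i < 1024; i++) buf[i] = i;
        return sum(buf, 1024);
    }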

~ jado 1 hour ago

It looks like clang outputs this in an ELF metadata section (and this change was originally made by the PS4 team) which could be read by whoever is trying to run the program. I wonder if the best way for you to make your app the most portable™️ would be to provide many different versions using different max stack sizes.

---

~ malxau 58 minutes ago

The other unspecified thing is if you have a thread, it’s impossible to know how much stack it uses unless it doesn’t call into any external library. The moment it calls into libc, there’s now an ABI contract that the function you call can complete in N bytes of stack or less, except libc never promised as much. A fully specified system would need each function to indicate how much stack it uses on entry, with compile time checking that each function it calls still has sufficient remaining stack after the compiler determines that function’s stack usage.

---

" Here is a table of common x86_64 platforms and their default stack sizes for the main thread (process) and child threads: OS Process Stack Size Thread Stack Size Darwin (macOS, iOS, etc) 8 MiB? 512 KiB? FreeBSD? 8 MiB? 2 MiB? OpenBSD? (before 4.6) 8 MiB? 64 KiB? OpenBSD? (4.6 and later) 8 MiB? 512 KiB? Windows 1 MiB? 1 MiB? Alpine 3.10 and older 8 MiB? 80 KiB? Alpine 3.11 and newer 8 MiB? 128 KiB? GNU/Linux 8 MiB? 8 MiB? " [13]

" The article highlights “OpenBSD before 4.6” as having the smallest stack space, but OpenBSD? 4.6 is from 2009 and essentially not really worth considering IMHO. I certainly wouldn’t call a 12-year old system a “common x86_64 platform”.

So Alpine actually has the smallest stack space with 128K in that table, followed by OpenBSD? and Darwin/macOS with 512K.

I also looked up the stack size for some other platforms (quick search, may be outdated for some):

    Solaris: 2M (64-bit) or 1M (32-bit)
    z/OS: 1M
    QNX: 256K (amd64), 512K (arm64) or 128K (32bit)
    HP-UX: 256K (but can be changed easily with PTHREAD_DEFAULT_STACK_SIZE env var).
    Minix: 132K
    AIX: 96K"

---

~ classichasclass 46 minutes ago

On architectures with lots of callee-save registers, stack space requirements also increase. For the 32-bit PowerPC JavaScript JIT in TenFourFox, because IonMonkey likes non-volatile registers, we have to save nearly all of them as part of every generated function prologue. This really bloats stack frames and means a fair bit of recursion can run you out of stack space in a hurry. It ran with a 1GB stack space for a very long time, much more than the default on OS X. Sure, we could use less registers, or use more volatile ones, but then it's still spilling them to the stack with the added performance issues that entails besides fewer registers in flight.

---

"On the other hand Rust’s std::thread exposes stack size, and the Rayon crate exposes stack size too. 😁"

---

jazelle notes:

https://hackspire.org/index.php?title=Jazelle

" Register usage

    r0-r3: Holds up to 4 words of the stack
    r4: Copy of local variable 0. Only valid when local variable 0 is a single word (i.e. int, float, or Object; not long or double)
    r5:
        Bits 0-1: If any of the stack is held in registers, this specifies which of the registers in r0-r3 is currently the top of stack, otherwise this is 0
        Bits 2-4: Number of stack words held in registers
        Bit 5: Seems to get cleared upon entry to Jazelle state, unless a prefetch abort or data abort happens before any bytecode is read
        Bits 6-9: Cleared immediately upon entry to Jazelle state
        Bits 10-31: Pointer to software handler table (must be 1024-byte aligned)
    r6: Pointer to top of stack (Note that the Java stack grows up, not down)
    r7: Pointer to current method's local variables
    r8: Pointer to constant pool? (haven't checked this yet)
    r9-r11: Do not appear to be used; available for the JVM's own use
    r12: Temporary
    sp: Probably not used (the JVM's ARM code needs this for its own stack, after all)
    lr: The bxj instruction uses this as the address of the bytecode to jump to, and when an exception handler is called the address of the faulting instruction is stored here. (During actual Jazelle execution this is used as another temporary, but this is not visible to the JVM, so from the JVM's point of view this can be thought of as the Java program counter)
    "

" Software handler table

This contains routines that Jazelle calls when it encounters a bytecode it doesn't recognize, as well as in some exceptional conditions.

    +000-3FF: Unhandled bytecodes
        The stack is flushed to memory before calling any of these handlers, so they may modify r0-r3 freely
    +400: Null pointer exception
    +404: Array index out of bounds exception
    +40C: Jazelle mode entered with JE = 0
    +410: Configuration invalid (Jazelle mode entered with CV = 0)
        CV is automatically set to 1 on entering this handler
    +414: Prefetch abort occurred in middle of instruction
    "

---

is there any use in assigning the last register (R31) to the PC, if you aren't allowed to directly read and write the PC? (besides the fact that it reduces the number of available registers and so leaves some room for the implementation; because if that's all you wanted, just reduce the # of registers)

yes, maybe; it could still be useful if there are addressing modes. Eg if you have an indexed addressing mode that expresses "effective_address = base + offset*scale", where base is a register containing a pointer, offset is a register containing an integer, and scale is an integer, then sometimes it will be useful to set base=R31 to indicate "effective_address = PC + offset*scale" -- in other words, saying that R31 holds the PC gives us a consistent way to represent PC-relative addressing without spending extra encoding space on it.
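A hedged interpreter-level sketch of that idea (the scaled-index mode and all names are assumptions for illustration):

    #include <stdint.h>

    #define NUM_GPRS 31            /* R31 is not backed by a real GPR here */

    struct vm { uint32_t reg[NUM_GPRS]; uint32_t pc; };

    /* Reading "register 31" as a base yields the PC, so the ordinary
       indexed mode doubles as PC-relative addressing. */
    static uint32_t read_base(const struct vm *vm, unsigned r) {
        return (r == 31) ? vm->pc : vm->reg[r];
    }

    /* effective_address = base + offset * scale */
    static uint32_t effective_address(const struct vm *vm, unsigned base_reg,
                                      unsigned offset_reg, uint32_t scale) {
        return read_base(vm, base_reg) + vm->reg[offset_reg] * scale;
    }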