proj-oot-ootAssemblyNotes30

--- old

removed from boots_extension_reference:

A big issue with this is that it's just not really linear:

- you can have floats and filesystems and UIs and multiprocessing without 64 bits or even 32 bits (old computers)
- you can have filesystems and UIs without multiprocessing (serial computers)
- you can have multiprocessing without UIs and filesystems (embedded OSs)

so either we give up on linearity, or we look at linearity as our specific way of organizing the world particular to our use-case and/or aesthetic.

One thing to focus on is emulatable vs I/O. E.g. 32-bit, 64-bit, and concurrency are all emulatable. Filesystems, TUIs, and graphics are I/O. It's not such a big deal to get the linearization wrong for the emulatable part b/c it can always be emulated. But if a profile insists that graphics are present and they're not, the program simply can't be run.

Ideas:

- let the I/Os indicate unavailability if the desired device type doesn't exist on this system
- focus on our use-cases and don't worry about whether the profiles are useful for others; others can always use the feature indicator bits directly
- by the way, do we really need to include profile #s in the VM at all? Maybe the VM should only have feature indicator bits. Profiles could just be a thing defined in the spec.
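To make that last idea concrete, here's a minimal sketch (all names hypothetical) of the "feature indicator bits only" scheme: the VM exposes a bitmask, and a profile is just a named constant over those bits, defined in the spec rather than known to the VM:

    #include <stdbool.h>
    #include <stdint.h>

    /* hypothetical feature bits reported by the VM */
    enum {
        FEAT_32BIT = 1u << 0,
        FEAT_64BIT = 1u << 1,
        FEAT_FLOAT = 1u << 2,
        FEAT_MULTI = 1u << 3,   /* multiprocessing */
        FEAT_FS    = 1u << 4,   /* filesystem I/O  */
        FEAT_UI    = 1u << 5,   /* TUI/graphics    */
    };

    /* a profile is just a conjunction of feature bits, defined in the spec */
    #define PROFILE_11 (FEAT_32BIT | FEAT_FLOAT | FEAT_MULTI | FEAT_FS | FEAT_UI)

    bool satisfies_profile(uint32_t vm_features, uint32_t profile) {
        return (vm_features & profile) == profile;
    }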

# Profile 12

TODO: not sure if this level and up should even be part of the Boot project. Higher-level languages and frameworks can add this stuff. It is not needed for bootstrapping. It is not in the C standard library, and was not available on 1980s 8-bit consumer desktop systems. Suggest moving to LOVM or higher.

This profile adds more I/O modalities. It is suitable for running modern desktop programs.

Boots12 (Boots Profile 12) includes everything in Profile 11 and also:

syscall extensions:

# Profile 13

TODO: not sure if this level and up should even be part of the Boot project. Higher-level languages and frameworks can add this stuff. It is not needed for bootstrapping. It is not in the C standard library, and was not available on 1980s 8-bit consumer desktop systems. Suggest moving to LOVM or higher.

This profile adds more I/O modalities.

Boots13 (Boots Profile 13) includes everything in Profile 12 and also:

syscall extensions:

# Profile 11

TODO: not sure if this level and up should even be part of the Boot project. Higher-level languages and frameworks can add this stuff. It is not needed for bootstrapping. It is not in the C standard library, although it was available via relatively primitive BASIC interfaces on 1980s 8-bit consumer desktop systems (e.g. https://en.wikipedia.org/wiki/Applesoft_BASIC#Sound_and_graphics ), which did not support networking (which we placed at a lower profile level). The latter suggests that maybe we should move it before profile 6; but otoh probably not, b/c what about headless servers, and what about the C library.

This profile adds other I/O modalities. It is suitable for running desktop programs with simple keyboard-based user interfaces.

Boots11 (Boots Profile 11) includes everything in Profile 10 and also:

0: base Boots

1: 16-bit integer static control flow functions

2: 32-bit integer static control flow functions

3: 32-bit static control flow programs

must be present but might not have resources:

# Profile 0: base Boots

The base profile, standard Boots without extensions, is called "Boots0" and is defined in boots_reference.md. This profile is suitable for 15-bit unsigned integer pure calculations that don't access I/O and take place wholly in registers (without accessing memory).

# Profile 1: 16-bit integer static control flow functions

This profile adds signed arithmetic and memory access. Suitable for 16-bit integer computation without multiplication, dynamic control flow, I/O, or memory growth. Profile 1 includes everything in profile 0 and adds:

instruction extensions:

syscall extensions:

# Profile 2: 32-bit integer static control flow

Profile 2 adds 32-bit arithmetic and integer multiply. Profile 2 is suitable for small-size executables doing 32-bit integer computation without division or dynamic control flow. Profile 2 includes everything in profile 1 and adds:

instruction extensions:

OUT-OF-DATE NOTE: Profile 2 is what pyboot3 implements, except without 64-bits or floats or math

i did it in that order b/c static control flow is easier to implement in JVM and i think WASM, but JVM at least supports triglog, and WASM supports everything in 4 and before. And triglog can be implemented in a library, whereas dynamic_control_flow requires a new interpreter loop if the platform doesn't support it.
Our Lo toolchain will require dynamic control flow but not floats or 64-bits.
# Profile 3: 32-bit static control flow programs

Profile 3 adds I/O, floating-point of unspecified bitwidth, embedded immediates, integer divide, exit, and memgrow. Profile 3 is suitable for 32-bit programs without dynamic control flow. Profile 3 includes everything in profile 2 and adds:

instruction extensions:

syscall extensions:

Note: Profile 3 is what pyboot3 implements, except need to add in,out (io syscall extension), and change malloc/free to memsz, memgrow

Note: When console I/O (device #0) is present, the compilers and interpreters in the Boot toolchain can run on Profile 3.

--- done

i took a look at the WASI spec. There are a lot of 1-byte and 2-byte fields in the defined structs. So maybe we should support 8-bit and 16-bit values in the Boot offset fields. So maybe 1 bit for each of:

that's not too useful, but then no scheme is very useful with just 4 bits and portable sizing. In LOVM if 8 or 12 bits are provided, it would have 2 or 3 bits per field, which would then become much more useful.

done.

---

hacknat 10 hours ago [–]

Normally I agree with anti-NIH sentiments, but why are we all supposed to just have to be happy with a language ABI that's been around since the 1970s?

reply

simias 6 hours ago [–]

If it ain't broke...

Having a common entrypoint into the kernel we can hook into is fairly valuable IMO, especially when said interface is for the most part stable between all unixes (and to a somewhat lesser extent even on Windows).

Having this lowest common denominator ABI makes interop relatively painless. But clearly Go wants to eat the world and doesn't seem to care very deeply about this. I suppose it's a valid approach, even if I don't necessarily agree with it.

reply

convolvatron 6 hours ago [–]

  If it ain't broke...

user supplied read addresses

getpwent() - actually really the whole notion of users

file permissions

async storage event completion (all asynchrony is pretty broken)

system metadata control (sysctl, /proc, interface socket messages...)

signals

ttys

errno

affinity interfaces ...

reply

---

"As an aside, I’ve found PPC hacking to be much more pleasant than the little bit of 6502 and x86 hacking I’ve done. PPC is big-endian, so it’s easy to read and write the raw bytes. Also every instruction is exactly four bytes long, so you don’t have to fret about adding NOPs or starting your decode in the middle of an instruction. Super nice to work with."

---

" There are some low-level features that Rust doesn't have a proper replacement for:

    computed goto. "Boring" uses of goto can be replaced with other constructs in Rust, like loop {break}. In C many uses of goto are for cleanup, which Rust doesn't need thanks to RAII/destructors. However, there's a non-standard goto *addr extension that's very useful for interpreters. Rust can't do it directly (you can write a match and hope it'll optimize), but OTOH if I needed an interpreter, I'd try to leverage Cranelift JIT instead.
    alloca and C99 variable-length arrays. These are controversial even in C, so Rust stays away from them.

It's worth noting that Rust currently supports only one 16-bit architecture. The tier 1 support is focused on 32-bit and 64-bit platforms. " [1]
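For reference, a minimal sketch of the computed-goto dispatch the quote refers to, using the GNU C labels-as-values extension (goto *addr); this is a GCC/Clang extension, not ISO C, and the tiny bytecode format here is made up:

    #include <stdio.h>

    /* hypothetical bytecode: 0 = halt, 1 = inc, 2 = dec */
    int run(const unsigned char *pc) {
        static const void *dispatch[] = { &&op_halt, &&op_inc, &&op_dec };
        int acc = 0;
        goto *dispatch[*pc];                /* computed goto: jump via table */
    op_inc:  acc++; goto *dispatch[*++pc];
    op_dec:  acc--; goto *dispatch[*++pc];
    op_halt: return acc;
    }

    int main(void) {
        unsigned char prog[] = { 1, 1, 2, 0 };  /* inc, inc, dec, halt */
        printf("%d\n", run(prog));              /* prints 1 */
        return 0;
    }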

---

" AssemblyScript? TurboFan? 298.13ms 2.9x JavaScript? TurboFan? 104.35ms 1.0x ... it didn’t sit well with me that the optimized WebAssembly? module takes about 3 times as long as JavaScript?. ...

Digging in

After quickly consulting with some folks from the V8 team and some folks from the AssemblyScript team (thanks Daniel and Max!), it turns out that one big difference here are “bounds checks” — or the lack thereof.

V8 has the luxury of having access to your original JavaScript code and knowledge about the semantics of the language. It can use that information to apply additional optimizations. For example: It can tell you are not just randomly reading values from memory, but you are iterating over an ArrayBuffer using a for ... of loop. What’s the difference? Well with a for ... of loop, the language semantics guarantee that you will never try to read values outside of the ArrayBuffer. You will never end up accidentally reading byte 11 when the buffer is only 10 bytes long, or: You never go out of bounds. This means TurboFan does not need to emit bounds checks, which you can think of as if statements making sure you are not accessing memory you are not supposed to. This kind of information is lost once compiled to WebAssembly, and since ASC’s optimization only happens at WebAssembly VM level, it can’t necessarily apply the same optimization.

Luckily, AssemblyScript provides a magic unchecked() annotation to indicate that we are taking responsibility for staying in-bounds.

But there’s more: The Typed Arrays in AssemblyScript (Uint8Array, Float32Array, ...) offer the same API as they do on the platform, meaning they are merely a view onto an underlying ArrayBuffer. This is good in that the API design is familiar and battle-tested, but due to the lack of high-level optimizations, this means that every access to a field on the Typed Array (like myFloatArray[23]) needs to access memory twice: Once to load the pointer to the underlying ArrayBuffer of this specific array, and another to load the value at the right offset. V8, as it can tell that you are accessing the Typed Array but never the underlying buffer, is most likely able to optimize the entire data structure so that you can read values with a single memory access.

For that reason, AssemblyScript provides StaticArray<T>, which is mostly equivalent to an Array<T> except that it can’t grow. With a fixed length, there is no need to keep the Array entity separate from the memory the values are stored in, removing that indirection.

...

Sneaky defaults

Another thing that the AssemblyScript folks pointed out to me is that the --optimize flag is equivalent to -O3s which aggressively optimizes for speed, but makes tradeoffs to reduce binary size. -O3 optimizes for speed and speed only. Having -O3s as a default is good in spirit — binary size matters on the web — but is it worth it? At least in this specific example the answer is no: -O3s ends up trading the laughable amount of ~30 bytes for a huge performance penalty:

    Language        Variant    Optimizer  Average   vs JS
    AssemblyScript  optimized  O3         89.60ms   0.9x
    JavaScript                            104.35ms  1.0x
    AssemblyScript  optimized  O3s        162.63ms  1.6x
    AssemblyScript  naive      O3s        298.13ms  2.9x

One single optimizer flag makes a night-and-day difference, letting AssemblyScript overtake JavaScript (on this specific test case!). We made AssemblyScript faster than JavaScript!

But do me a favor: Don’t stop reading now.

...

Some of you may have noticed that both these examples have very few or no allocations. V8 takes care of all memory management (and garbage collection) in JavaScript for you and I won’t pretend that I know much about it. In WebAssembly, on the other hand, you get a chunk of linear memory and you have to decide how to use it (or rather: the language does). How much do these rankings change if we make heavy use of dynamic memory?

To measure this, I chose to benchmark an implementation of a binary heap. ... 80x slower than JavaScript?! Even slower than Ignition? Surely, there is something else going wrong here. ...

Runtimes

All data that we create in AssemblyScript needs to be stored in memory. To make sure we don’t overwrite anything else that is already in memory, there is memory management. As AssemblyScript aims to provide a familiar environment, mirroring the behavior of JavaScript, it adds a fully managed garbage collector to your WebAssembly module so that you don’t have to worry about when to allocate and when to free up memory.

By default, AssemblyScript ships with a Two-Level Segregated Fit memory allocator and an Incremental Tri-Color Mark & Sweep (ITCMS) garbage collector. It’s not actually relevant for this article what kind of allocator and garbage collector they use, I just found it interesting that you can go look at them.

This default runtime, called incremental, is surprisingly small, adding only about 2KB of gzip’d WebAssembly to your module. AssemblyScript also offers alternative runtimes, namely minimal and stub that can be chosen using the --runtime flag. minimal uses the same allocator, but a more lightweight GC that does not run automatically but must be manually invoked. This could be interesting for high-performance use-cases like games where you want to be in control when the GC will pause your program. stub is extremely small (~400B gzip’d) and fast, as it’s just a bump allocator.

    My lovely memory bump: Bump allocators are extremely fast, but lack the ability to free up memory. While that sounds stupid, it can be extremely useful for single-purpose modules, where instead of freeing memory, you delete the entire WebAssembly instance and rather create a new one. If you are curious, I actually wrote a bump allocator in my article Compiling C to WebAssembly without Emscripten.

How much faster does that make our binary heap experiment? Quite significantly!

    Language        Variant    Runtime      Average     vs JS
    JavaScript                              233.68ms    1.0x
    AssemblyScript  optimized  stub         345.35ms    1.5x
    AssemblyScript  optimized  minimal      354.60ms    1.5x
    AssemblyScript  optimized  incremental  18758.50ms  80.3x

Both minimal and stub get us significantly closer to JavaScript’s performance. But why are these two so much faster? As mentioned above, minimal and incremental share the same allocator, so that can’t be it. Both also have a garbage collector, but minimal doesn’t run it unless explicitly invoked (and we ain’t invoking it). That means the differentiating quality is that incremental runs garbage collection, while minimal and stub do not. I don’t see why the garbage collector should make this big of a difference, considering it has to keep track of one array.

...

" [2]

---

" It’s common to branch to a basic block that exits. It’s common to add a constant to an integer, convert the integer to a pointer, and use the pointer for loading or storing. It’s common to mask the shift amount before shifting. It’s common to do integer math with an overflow check. B3 IR and Air make common operations use just one Value or Inst in the common case. The branch-and-exit operation doesn’t require multiple basic blocks because B3 encapsulates it in the Check opcode and Air encapsulates it in a special Branch instruction. Pointers in B3 IR are just integers, and all load/store opcodes take an optional offset immediate, so that a typical FTL memory access only requires one object. Arithmetic operations in B3 IR carefully balance the semantics of modern hardware and modern languages. Masking the shift amount is a vestige of C semantics and ancient disagreements over what it means to shift an N-bit integer by more than N bits; none of this makes sense anymore since X86 and ARM both mask the amount for you and modern languages all assume that this is how shifting works. B3 IR saves memory by not requiring you to emit those pointless masks. Math with overflow checks is represented using CheckAdd?, CheckSub?, and CheckMul? opcodes in B3 and BranchAdd?, BranchSub?, and BranchMul? opcodes in Air. Those opcodes all understand that the slow path is a sudden termination of the optimized code. This obviates the need for creating control flow every time the JavaScript? program does math. " -- [3]

---

lots of good assembler directives (and similar) from Jonesforth:

http://git.annexia.org/?p=jonesforth.git;a=blob;f=jonesforth.S;h=45e6e854a5d2a4c3f26af264dfce56379d401425;hb=HEAD

e.g.

macros with arguments, e.g.:

    .macro PUSHRSP reg
    lea -4(%ebp),%ebp       // push reg on to return stack
    movl \reg,(%ebp)
    .endm

default arguments, e.g.:

    .macro defword name, namelen, flags=0, label

directives: .text, .align 4, LABEL:, .globl _start, .section .rodata, .set

data definition: .int, .byte, .ascii

e.g.

        .macro defcode name, namelen, flags=0, label
        .section .rodata
        .align 4
        .globl name_\label
name_\label :
        .int link               // link
        .set link,name_\label
        .byte \flags+\namelen   // flags + length byte
        .ascii "\name"          // the name
        .align 4                // padding to next 4 byte boundary
        .globl \label
\label :
        .int code_\label        // codeword
        .text
        //.align 4
        .globl code_\label
code_\label :                   // assembler code follows
        .endm

---

this came up with some interesting search results last time. I've already noted them, but maybe in future years it will come up with further interesting stuff:

https://www.google.com/search?client=ubuntu&channel=fs&q=wasm+compile+coroutine&ie=utf-8&oe=utf-8

---

https://despairlabs.com/posts/2021-06-16-io-uring-is-not-an-event-system/

example comparing poll, epoll, io_uring

also mentions kqueue in passing

---

on 6502 addressing modes:

Regarding the most important addressing modes, the book Apple Machine Language for Beginners by Richard Mansfield says "...there are only 6 that you should focus on and practice with until you understand their uses: immediate, absolute (plus absolute,X and absolute,Y), zero page, and indirect Y.". Going thru each of those in turn:

In some ways, the zero page memory locations in the 6502 can be used analogously to registers in other processors with more registers. Applying this analogy, we find that:

so, according to this analogy, Mansfield's suggested most important addressing modes correspond to:

---

syscalls in jonesforth:

EXIT OPEN CLOSE READ WRITE CREAT BRK

---

my assembly language should be regex parsable
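e.g. something like the following should be enough for a line-oriented syntax; the line format here ("label: op operands ; comment") is hypothetical, using POSIX regex.h:

    #include <regex.h>
    #include <stdio.h>

    int main(void) {
        regex_t re;
        /* one extended regex per line: optional label, mnemonic,
           optional operands, optional comment */
        const char *pat =
            "^[[:space:]]*(([A-Za-z_][A-Za-z0-9_]*):)?"   /* 2: label    */
            "[[:space:]]*([a-z][a-z0-9.]*)"                /* 3: mnemonic */
            "([[:space:]]+[^;]*)?"                          /* 4: operands */
            "[[:space:]]*(;.*)?$";                          /* 5: comment  */
        if (regcomp(&re, pat, REG_EXTENDED)) return 1;
        const char *line = "loop: add r1, r2 ; accumulate";
        regmatch_t m[6];
        if (regexec(&re, line, 6, m, 0) == 0)
            printf("mnemonic: %.*s\n",
                   (int)(m[3].rm_eo - m[3].rm_so), line + m[3].rm_so);
        regfree(&re);
        return 0;
    }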

---

 cperciva on April 14, 2020 [–]

When will C gain a mechanism for "do not leave this sensitive information laying around after this function returns"? We have memset_s but that doesn't help when the compiler copies data into registers or onto the stack.

pascal_cuoq on April 14, 2020 [–]

This is an entire language extension, as you note. The last time various people interested in this were in the same room (it was in January 2020 in a workgroup called HACS), what emerged was that the Rust people would try to add the “secret” keyword to the language first, since their language is still more agile than C, while the LLVM people would prepare LLVM for the arrival of at least one front-end that understand secret data.

Is this enough to answer your question? I can look up the names of the people that were involved and communicate them privately if you are further interested.
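The core hazard being discussed, sketched (explicit_bzero is a glibc/BSD extension, not ISO C):

    #define _DEFAULT_SOURCE
    #include <string.h>

    /* A plain memset of a buffer that's about to die is a dead store, so
       the optimizer may delete it. memset_s and explicit_bzero forbid that
       elision, but neither stops the compiler from having already spilled
       copies of the secret into registers or other stack slots. */
    void handle_secret(void) {
        char key[32];
        /* ... derive and use key ... */
        memset(key, 0, sizeof key);       /* may be optimized away */
        explicit_bzero(key, sizeof key);  /* survives optimization */
    }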

---

jcranmer on March 31, 2020 [–]

One of the blog posts I keep meaning to write is in the vein of "Why C is not portable assembly." Ironically, most of my points about C failures aren't related to UB at all, but rather the fact that C's internal model doesn't comport well to important details about machines:

Animats on March 31, 2020 [–]

No support for multiple return values (in registers). There's structs, but that means returning via memory because, well, see point 1.

Return values in registers are a machine level optimization. Whether something is in a "register" is up to the CPU today on most CPUs. If you set a value in the last few instruction cycles, and are now using it, it probably hasn't even made it to the L1 cache yet. Even if it was pushed on the stack. That optimization made sense on SPARC, with all those registers. Maybe. Returning a small struct is fine.

There's an argument for multiple return values being unpacked into multiple variables, like Go and Python, but that has little to do with how it works at the machine level.

Missing operations such as popcount, cttz, simultaneous div/rem, or rotates.

Now that most CPUs have hardware for those functions, perhaps more of those functions should be visible at the language level. Here's all the things current x86 descendants can do.[1] Sometimes the compiler can notice that you need both dividend and remainder, or sine and cosine, but some of those are complicated enough it's not going to recognize it.

Traps are UB instead of having more predictable semantics.

That's very CPU-dependent. X86 and successors have precise exceptions, but most other architectures do not. It complicates the CPU considerably and can slow it down, because you can't commit anything to memory until you're sure there's no exception which could require backing out a completed store.
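For reference, here's how those "missing" operations surface in C today; the builtins are GCC/Clang extensions, and both compilers recognize the masked-rotate idiom below and emit a single rotate instruction on x86/ARM:

    #include <stdint.h>
    #include <stdlib.h>

    /* ISO C has no popcount/cttz/rotate operators, only these escape hatches. */
    unsigned popcount32(uint32_t x) { return (unsigned)__builtin_popcount(x); }
    unsigned cttz32(uint32_t x)     { return x ? (unsigned)__builtin_ctz(x) : 32; }

    uint32_t rotl32(uint32_t x, unsigned r) {
        r &= 31;                                /* mask, as the hardware does */
        return (x << r) | (x >> ((32 - r) & 31));
    }

    /* div() from <stdlib.h> returns quotient and remainder together; most
       compilers also fuse an adjacent / and % pair into one divide. */
    div_t quotrem(int a, int b) { return div(a, b); }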

---

i can't find it now (looked for about half an hour!), but there's that comment on agner's early forum discussion of forwardcom that i recently read, saying he thought RISC-V (or was it MIPS?) would be a good ISA for learning how to implement something simple, but aarch64 would be good for learning a well-designed/"good" ISA (did he also mention m68k?). Found it: Adrian Bocaniciu; see next section below.

--- 200717

there was a comment noted in ootAssemblyNotes28.txt that i want to recall attention to:

https://www.agner.org/optimize/blog/read.php?i=424#424 by Adrian Bocaniciu

he says "There are only 2 possible choices for a set of addressing modes that would allow writing a loop with a minimum number of instructions":

the first of these takes a lot of operand bits, especially if you want to allow any register or big offsets (agner's forwardcom allows 8- or 32-bit sign-extended offsets). And we'd still need addr mode bits on top of that! Eg with:

- 32 registers for base or index
- scaling to 1,2,4,8
- 8-bit offsets

we have: 5 bits (base reg) + 5 bits (index reg) + 2 bits (scale) + 8 bits (offset) = 20 bits per operand.

if we really have 3 operand fields (plus opcode plus format bits), and we want to be able to use this addressing mode in all of those at once, then even with 64-bit instructions we can't fit that (for 64 bits and 3 ortho operands, we probably don't want more than 14 operand bits per operand, more likely 12)

one thing we can do is say that having a large offset consumes a second operand. If we only have a base reg and a scaled index reg, that's only 12 bits.

Alternately/in addition, we can reduce the number of registers usable in this addressing mode. If we can only use 8, then we save 4 bits (base and index each drop from 5 bits to 3), which allows us to have a 4-bit offset within a 12-bit operand. Which still isn't much (a 16-byte struct, e.g. two 64-bit floats, is the max size of struct it can fully address into).

an 8-bit offset would be cool. But i guess i think a 6-bit offset is sort of the minimum (though even 5 would be better than 4). However, if you multiply the offset by the scale factor, that's a different story: a 2-bit scale factor gives you 1,2,4,8, so a scaled 4-bit offset lets you index up to the 15th 8-byte field in a struct. In this case i guess a 4-bit offset is (barely) sufficient.
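a sketch of the encoding being costed out here, under the assumed 12-bit operand budget (3-bit base reg, 3-bit index reg, 2-bit scale, 4-bit scaled offset); all field names are made up:

    #include <stdint.h>

    /* 12-bit operand: 3 (base) + 3 (index) + 2 (scale) + 4 (offset) */
    typedef struct {
        unsigned base   : 3;  /* one of 8 base registers   */
        unsigned index  : 3;  /* one of 8 index registers  */
        unsigned scale  : 2;  /* log2 scale: 1,2,4,8       */
        unsigned offset : 4;  /* displacement, also scaled */
    } operand;

    uint64_t effective_addr(const uint64_t regs[8], operand op) {
        uint64_t scale = 1u << op.scale;
        /* with scale=8, offset=15 reaches byte 120: the 15th 8-byte field */
        return regs[op.base] + (regs[op.index] + op.offset) * scale;
    }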

i'd think you might also want post- and pre- inc/dec on the index reg (and/or on the base reg, where instead of inc/dec it gets replaced by the computed address)

What if you are in a 32-bit instruction (with a 3-operand encoding)? Now your operand fields are probably at most 8 bits, probably less (let's say 6 bits). If you have 2 base reg bits, 2 index reg bits, and 2 scale bits, that's it. No space for an offset at all, and you have only 4 regs which can be base regs, and 4 regs which can be index regs. It's not so orthogonal (in terms of register choice), but honestly 4 index regs is probably enough for most cases, so you'd probably still get some speedup out of these. You can now trade a second operand for a 6- or 8-bit offset.

another thing we can do is use a scale register instead of putting the scale in the instruction. This has the added advantage of making some very simple types of loops generic over scale factor.

glancing at boot_reference.md's address modes section, i can't immediately tell if i have either of these in there. In fact, what i do there is kind of weird and i can't quickly read it. But i recall i had thought about it a bit, so maybe it's good.

---

i also see that i had given up on the idea of the BootS instruction encoding being a subset of the Boot instruction encoding. I guess that's ok, as long as the actual instructions (semantically) are a subset.

after a moment's more consideration, i guess the way i divided it up by number is okay; that just means that some addr modes have more operand bits and others have more addr mode bits. But it would be clearer if i just listed that table, with the addr mode names, at the top, and then gave the long descriptions below.

i didn't look thru the modes in detail but they also look okay, except that i don't see why we need both stack ptr mode and frame ptr mode.

---

josefx 8 hours ago [–]

What the hell is systemd doing that an 8MB-long file path can exhaust its stack? Is it doing some recursive parsing or is it just doing something plain stupid like using a VLA to store user-provided data?

reply

megous 7 hours ago [–]

Probably unbounded alloca() as always.

reply

megous 7 hours ago [–]

Yep: https://github.com/systemd/systemd/commit/b34a4f0e6729de292c...

strdupa(input) without any length check

Fix is to replace it with unbounded malloc() instead of checking for sane length first.

reply

thayne 1 hour ago [–]

In rust you can't currently dynamically allocate on the stack, although that's probably something that will be added in the future. And as others have pointed out, allocating on the stack is a fairly reasonable optimization here.

I don't think you could even call strdupa through libc in rust. I would guess that strdupa is either a macro that uses the alloca compiler intrinsic or is itself a compiler intrinsic. Even if it isn't, it will break assumptions the rust compiler makes about the size of the stack frame.

reply

alerighi 5 hours ago [–]

It's not a mistake. Allocating that string on the stack is not a bad idea. Most of the time the string will be short, and thus an allocation on the stack is faster.

Consider that in Linux a path is defined to have a maximum length of PATH_MAX, which is defined as 4096 bytes, and a filename (or directory name) shouldn't be longer than FILE_MAX, which is 255 bytes. These limits are defined in the headers and I always use them when writing my C programs (if it crashes... you are doing something really wrong!).

So how the hell do you have a directory that is more than 8Mb? You shouldn't! The filesystem doesn't support it. It's a matter of the filesystem driver that should reject a path that long in my opinion.

Systemd should be fast. It's at the base of the operating system. Also it should consume little memory. You can say, who cares about allocating dynamically a string, or allocating a static buffer of 16Mb, yes we should care, I use Linux computer with 16Mb of RAM, total. Of course they don't run systemd nowadays since it's too big, but in my opinion systemd is good, and I would like to see it more in the embedded world.

reply
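The pattern at issue, sketched (strdupa is a GNU extension that expands to an alloca, so the copy lands on the stack with no size check):

    #define _GNU_SOURCE
    #include <string.h>

    /* strdupa() expands to an alloca() of strlen(input)+1 bytes; with an
       attacker-controlled 8MB path this one line moves the stack pointer
       far past the guard page with no size check. */
    void parse_path(const char *user_path) {
        char *p = strdupa(user_path);   /* stack copy, unbounded */
        /* ... */
        (void)p;
    }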

---

stjohnswarts 7 hours ago [–]

Good find thanks for sharing. And everyone at work gripes about me carrying the size around with a lot of my variables in the form of a struct. It's strictly a reminder to always be checking the size since I'm juggling with shotguns.

reply

thayne 1 hour ago [–]

The fact that C doesn't have a native concept of an array with length, and that strings usually use a null byte to mark the end, is IMO C's biggest failing, and its worst legacy on the wider software world.

reply
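The counted-buffer discipline mentioned above looks something like this (a sketch; names made up):

    #include <stddef.h>

    /* carry the size with the pointer so every consumer can bounds-check */
    typedef struct {
        char  *data;
        size_t len;
    } sized_buf;

    int sized_get(const sized_buf *b, size_t i, char *out) {
        if (i >= b->len)
            return -1;      /* the check a bare NUL-terminated char* can't support */
        *out = b->data[i];
        return 0;
    }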

---

sinsterizme 5 hours ago [–]

Shouldn't this PR also use `free` on the duped string before returning? (I never use C so probably missing something but just based on the docs of strdupa...)

reply

rtldg 5 hours ago [–]

The variable p is now declared with "_cleanup_free_" which is using some compiler cleanup/destructor attribute stuff to run free

reply

sinsterizme 4 hours ago [–]

ah okay, thank you :)

reply
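Roughly what _cleanup_free_ does, sketched; the mechanism is GCC/Clang's cleanup attribute, and the names mirror systemd's macros, but this is a sketch, not systemd's actual code:

    #include <stdlib.h>
    #include <string.h>

    /* freep() runs when the annotated variable leaves scope, giving heap
       memory the same freed-on-return ergonomics as strdupa without the
       unbounded stack growth. */
    static void freep(void *p) { free(*(void **)p); }
    #define _cleanup_free_ __attribute__((cleanup(freep)))

    int uses_heap_copy(const char *input) {
        _cleanup_free_ char *p = strdup(input);
        if (!p)
            return -1;
        /* ... use p; free runs automatically on every return path ... */
        return 0;
    }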

---

should check out algol 68 and pl/0: http://www.jemarch.net/back-to-the-future-a68.pdf

---

picoJava had a 64-entry stack cache [4]