I'm annoyed by the waste of having the SMALLSTACK, DATASTACK and CALLSTACK register operands usable by only four instructions (POP, PUSH, LOAD, STORE). It would be nice to find a semantics for them in the other instructions.

How about either:

i guess i prefer the first of these, because it frees up a register.

To make implementation easier, we could also reserve memory locations 0x0000, 0x0001, and 0x0002 for internal use, so that you would never need to LOAD 0x0000, for example. This would allow the PUSH, POP, LOAD, STORE instructions to just check their argument, rather than having to know which register their operand was. (It would also mean that placing 0x0000 in a different register and then POPing that register would have the effect of popping SMALLSTACK; we probably don't want to specify this behavior, though.)
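A minimal sketch of what this buys a naive interpreter: POP can resolve its operand like any other register and then just compare the resolved value against the reserved addresses. The register names, stack representation, and fallthrough case here are illustrative, not part of any spec -- note how a register that merely happens to hold 0x0000 also hits SMALLSTACK, the side effect mentioned above:

```python
# reserved low addresses standing in for the built-in stacks (illustrative)
SMALLSTACK_ADDR, DATASTACK_ADDR, CALLSTACK_ADDR = 0x0000, 0x0001, 0x0002

def pop(regs, stacks, operand_reg):
    """POP resolves its operand first, then checks the resolved value."""
    addr = regs[operand_reg]          # resolve like any other register operand
    if addr == SMALLSTACK_ADDR:
        return stacks['small'].pop()  # built-in stack, no memory access
    if addr == DATASTACK_ADDR:
        return stacks['data'].pop()
    if addr == CALLSTACK_ADDR:
        return stacks['call'].pop()
    raise NotImplementedError("pop via a memory address: %#06x" % addr)

regs = {'R1': 0x0000, 'R7': 0x0000}   # R7 just happens to hold 0x0000 too
stacks = {'small': [10, 20], 'data': [], 'call': []}
assert pop(regs, stacks, 'R1') == 20  # pops SMALLSTACK
assert pop(regs, stacks, 'R7') == 10  # ...and so does any register holding 0x0000
```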

but another thing we could do is:

I don't think we want to do implicit stack-swapping. That's not very RISCy at all.

So we're down to two alternatives:

One problem with the second one is that we don't want to impose upon implementations the requirement of offering registers with the TOSs of DATASTACK and CALLSTACK. Why? Because it requires them to either do something non-RISCy (to do an implicit main memory access when these registers are accessed in order to load/store the TOSs), or to have a non-naive implementation of these stacks (where the top item is treated differently than the rest). By contrast, we don't mind forcing implementations to have a non-naive implementation of SMALLSTACK, because that stack is size-limited to 15 (or 16?) items.

we could kind of mix these:

The disadvantage here is that stack-aware instructions can't directly access the TOS of SMALLSTACK themselves; worse, this hurts orthogonality: since the stack-aware instructions do something different here, compilers and analyzers have to remember that special case when they see stack-aware instructions accessing it. Somehow it seems that multiplexing constant registers and stacks is more acceptable (less liable to careless error) here than multiplexing stacks and TOS aliases.

(a minor nice thing here is that now we only have to reserve memory location 0x0000, which should be reserved anyways as the 'null pointer'; although a not-nice thing is that now, if we want a LOAD or STORE of 0x0000 to cause a null pointer exception, then upon seeing LOAD R0 we can't just do the naive thing and resolve the contents of R0 to 0 and then pass that to our LOAD implementation -- this may be bad enough to motivate us not to multiplex the zero register and any stack at all).

hmm... as nice as it would be to free up a register here, i think we just can't:

What we can do is multiplex SMALLSTACK, DATASTACK, CALLSTACK to constants 1,2,3.


another annoyance is, if we have TOS of SMALLSTACK, then what happens when the stack is empty?

Some alternatives:

I'm leaning towards the first of these options. The second option deprives other instructions of a useful temporary register (and since a lot of instructions use that, this forces the compiler to be aware of when the stack is empty), and the third option forces the implementation (and anyone analyzing the code) to not assume that a stack push transfers TOS onto the second item on the stack, which seems irregular.

This means that the 'current stack size' of SMALLSTACK should be reported as 0 through 15, but 0 really means 1, 1 really means 2, and 15 really means 16 (if TOS is counted as part of SMALLSTACK).
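The bias can be stated in two lines (assuming a 4-bit size field and that T always exists):

```python
def reported_size(actual_items):
    """Map the actual SMALLSTACK item count (including T) to the 4-bit report."""
    # actual count including T ranges 1..16; the reported value ranges 0..15
    assert 1 <= actual_items <= 16
    return actual_items - 1

assert reported_size(1) == 0    # an 'empty' stack still has T
assert reported_size(16) == 15  # full
```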

we should maybe call the 'SMALLSTACK TOS' register 'T' instead of 'TOS' because otherwise it will be confusing when people write about 'TOS of DATASTACK'.


how many process IDs can Linux have? how many open files?

cat /proc/sys/kernel/pid_max
32768

cat /proc/sys/fs/file-max
1617296

(i think maybe you can raise one or both of these, but this is what mine says right now)

ulimit -a

...
pending signals                 (-i) 63332
...
open files                      (-n) 1024
...
pipe size            (512 bytes, -p) 8
...
stack size              (kbytes, -s) 8192
...
max user processes              (-u) 63332

so i think putting process IDs and channel IDs (which are also open file IDs) in the same 'device ID' namespace is fine. This gives a max of around 64k for all of them on 16-bit systems, and more on systems with wider UINTs.


hmm, in writing in my Boot Design Choices document about why we couldn't combine SMALLSTACK, DATASTACK, CALLSTACK with one of the other registers (the reason is that we can't combine with a register that might hold an address that we might want to LOAD or STORE to), i realized that this reason doesn't hold for ERR, since ERR will rarely be holding an address. And one of the reasons for having a RESERVED register was so that we could have a CAP register later, but now we can do that with devices. And one reason that wasting a register on RESERVED wasn't so bad is that otherwise we'd have an odd number of special registers, but if we can combine SMALLSTACK and ERR, that doesn't hold either.

So let's combine SMALLSTACK with ERR, and eliminate RESERVED, and then have 10 GPRs instead of 8. This means that we lose constant 3, also, so we may want to have special system device 2 be SYSMODE instead of STDERR.

wait a minute, one issue: now naive implementations can't just pass a constant 1, 2, or 3 to the LOAD/STORE/PUSH/POP implementation to indicate that it's supposed to be a built-in stack; instead they must branch within the instruction implementation based on the operand itself, rather than resolving the register first. I guess that's not a huge deal, because we do have immediate-mode instructions, so it's not as if the interpreter loop can just blindly resolve every operand to every instruction as if it were a register operand and then call the instruction implementations. Still, it's a little annoying. But probably adding one GPR is worth it; 8 GPRs is too few.
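A sketch of the resulting dispatch, with the operand-to-stack assignment purely illustrative: the instruction implementation now keys on the operand's identity first, and only resolves non-stack operands to values (popping via a memory pointer is reduced to a plain load here, just for demonstration):

```python
# hypothetical assignment of register names to built-in stacks (not the spec)
STACK_OPERANDS = {'ERR': 'small', 'R2': 'data', 'R3': 'call'}

def pop(regs, mem, stacks, operand_reg):
    # branch on the operand *itself*, before resolving it to a value...
    if operand_reg in STACK_OPERANDS:
        return stacks[STACK_OPERANDS[operand_reg]].pop()
    # ...only non-stack operands get resolved to an address as usual
    return mem[regs[operand_reg]]

regs = {'ERR': 7, 'R4': 0x0100}       # ERR's stored value is irrelevant to POP
mem = {0x0100: 42}
stacks = {'small': [1, 2], 'data': [], 'call': []}
assert pop(regs, mem, stacks, 'ERR') == 2   # ERR names SMALLSTACK here
assert pop(regs, mem, stacks, 'R4') == 42   # other registers resolve normally
```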

This does also prevent us from claiming that our registers are fully regular, because now we can't use ERR from LOAD/STORE/PUSH/POP. Oh well, i think adding a GPR is worth it here. It's not like ERR was a GPR anyways; the GPRs are fully regular/general purpose.

So maybe now we can remove that stuff about memory locations 1,2,3 being verboten. Nice because that was ugly, but too bad because i liked the idea of having 4 sentinel values instead of just 1. It's worth it for one more GPR...

ooo and wait now that we've given up letting naive implementations pass constants to the stack-aware instructions, we can multiplex R0 too. So 5 special registers, and 11 GPRs. This is looking downright reasonable.

i removed the following obsolete verbiage:

" Do not invoke a stack-aware instruction with a stack operand of a register other than R1, R2, R3 which contains the values 0x0001, 0x0002, or 0x0003; this is an invalid operation. Implementations may treat this invalid operation in any of the following ways:


Memory locations 0 through 3

Access/manipulation of memory locations 0 through 3 is an invalid operation. Implementations may treat this invalid operation by:


Why are STDIN, STDOUT, STDERR at devices 1,2,3 instead of 0,1,2 like POSIX?

Since registers R0, R1, R2 give us immediate constants 0,1,2, we can most efficiently access devices 0,1,2. We want SYSMODE to be one of these devices. "

about R0: " We don't combine it with SMALLSTACK because some implementations might implement LOADs and STOREs into SMALLSTACK by passing the LOAD/STORE instruction implementation a 1 for the address (because 1 is the constant that R1 holds when it is not being used to indicate SMALLSTACK to stack-aware instructions) -- if we did the same thing with passing 0, we'd remove the opportunity for these implementations to notice that LOAD/STORE is accessing memory location 0 and raise a null-pointer exception. "

" And we reserve one special register for future use; which isn't a huge waste because otherwise the number of special registers is odd and wouldn't fall on a power-of-two boundary, which on some implementations might make quickly checking whether a register number is special or GPR (something the implementation will be doing a lot in the inner loop) slightly slower.

Why can't we combine SMALLSTACK, DATASTACK, CALLSTACK with other registers, eliminating the not-really-needed constant 1,2,3s and having some more GPRs instead? We explained above why we don't combine these with R0. Since the semantics of RESERVED aren't chosen yet, we should probably leave that one open. This leaves PC, T, and ERR as alternatives. Since LOAD and STORE are stack-aware, we can't combine with a register that might hold an address that we might like to LOAD or STORE. Clearly we might want to LOAD and STORE on an offset from an address in PC, for PC-relative addressing, so that's out. Since T is used a lot we'll probably often have addresses in there that we want to access, so that's out.

This leaves combining ERR and one of SMALLSTACK, DATASTACK, CALLSTACK as the remaining possibility. Maybe we could get rid of RESERVED, too. hmm i kinda like that, todo. "


i think we can dispense with 'walk' -- 'walk' was from plan9 and i think it was an alternative to using a separator like '/' to separate directory components. We don't mind using a separator.

i removed it.


i modified the semantics of SUB-UINT in case of a negative result:

the old text was:

" op0 := op2 - op1

If op2 is strictly greater than op1, op0 will be set to op2 - op1, and ERR will be set to 1. Otherwise ERR will be set to 0. "


I'm having some second thoughts about boxed addressing mode (in OVM). Currently, in OVM, we only have 3 addressing mode bits, so 8 addressing modes (this allows us to have immediate constants up to 4k (or 2k signed) instead of 2k (or 1k signed)).

I'm currently thinking that yes, we do want stack addressing in OVM, and also i think some other complex addressing modes would be useful. We already have 5: immediate, register-direct, register-indirect, stack, constant. And some others that might be useful are indirect-indexed and post-increment. 'Constant' can be thought of as a 'boxed' version of 'immediate', but indirect-indexed and post-increment can't be thought of that way.

indirect-indexed, in which you split the operand in half and both halves refer to registers (the first register holds a pointer, the second holds an integer), with the effective address being the sum of that pointer and that int, is really useful for arrays: the pointer is the base address of the array and the int is the index.

note that indirect-indexed could also give us PC-relative addressing, by selecting the PC as the pointer.
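As a sketch (nibble-sized register fields and the register numbering are assumptions for illustration), the effective-address computation and its PC-relative special case look like:

```python
def effective_address(regs, operand):
    """Indirect-indexed: split the operand in half, sum the two registers."""
    ptr_reg = operand >> 4    # high nibble: register holding the base pointer
    idx_reg = operand & 0xF   # low nibble: register holding the index
    return regs[ptr_reg] + regs[idx_reg]

PC = 0                        # suppose register 0 is the program counter
regs = [0] * 16
regs[PC] = 0x1000
regs[5] = 0x2000              # array base address
regs[6] = 3                   # array index
assert effective_address(regs, (5 << 4) | 6) == 0x2003   # base + index
assert effective_address(regs, (PC << 4) | 6) == 0x1003  # PC-relative
```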

and in [1] figure 4 the Epiphany guy says that he thinks that the inclusion of postmodify addressing mode was a 'very important optimization'.

on the other hand, since our LOAD and STORE instructions have an extra operand that, in Boot, we are using as an immediate constant offset, in Oot, where it could be an immediate or register direct depending on addressing mode, we sort of already have indirect-indexed functionality. However, this wouldn't give us PC-relative addressing, because for that we probably want the 'JD' (jump dynamic) instruction to be able to use this too. Of course, OVM could just introduce a JD-PCREL instruction if that's the only hindrance.

also, boxed addressing is beginning to seem a little bit of a clunky intrusion of a higher-level concern into our assembly-like language. But not necessarily; part of the point of OVM is that sort of extensibility.

(what other modes do other ISAs have? scaled offsets come to mind)

so, we have three options:


i had a little indecision (now resolved) regarding whether the 'systemy' complexity of BootX should be moved into Boot (or maybe BootX should just be dispensed with and all of it moved into Boot). For reference, i'm writing down my thoughts here.

Should BootX only be about efficiency: having more instructions, all of which can be thought of as macros that could be expanded to Boot commands?

Or could BootX have functionality for interfacing with the OS that cannot be gotten at all in Boot?

If the latter, then Boot could be thought of as a subset of BootX that can be used in limited circumstances (such as when only pure computation is needed), but will those circumstances really come up that often?

i guess we could make the dividing line be:

Seen this way, maybe there is a place for Boot. Not every Oot program needs filesystem I/O; some will just use console I/O. In fact, one could imagine that some will be on console-less embedded systems (or, hey, console-less GUI systems) and not even use read/write, instead using implementation-specific devices for all I/O. Similarly, although Oot will support concurrency, if the underlying platform is a uniprocessor without an OS exposing threads, the Oot implementation could still do cooperative multitasking (this might actually be pretty common, recalling that Oot is going to target other HLLs as well as hardware platforms; many HLLs probably won't provide multiprocesses or signals). So, none of filesystem I/O, spawn/fork, exec, and signals are really necessary for the Oot toolchain (read/write might be, because the Oot compiler/interpreter probably wants to use READ to read in the Oot source code). Of course, we could still mandate that these operations are implemented, and on unsupported systems just have stub implementations that always fail ('file not found' and the like), but is this really any better than allowing them to fail with 'Unimplemented instruction', which is what happens if they aren't in Boot at all?

so, my decision is that the status quo stands:


we may want to move MALLOC and DEALLOC to BootX.

this makes sense because in some contexts (eg embedded systems programming) you don't want to dynamically MALLOC and DEALLOC; you want to allocate once at program start and deallocate once at program end (or, in embedded systems, maybe not allocate at all, just hardcode which memory you are supposed to be using). For target platforms like this, it seems annoying to force the Boot implementor to provide a MALLOC implementation (even if just a stub that always says 'Out of Memory' when called).

instead we could have a notion of a single block of preassigned 'core memory' starting at 0x0001 whose size is given by a SYSINFO call (or maybe starting wherever, with start and end addresses given by a SYSINFO call; 'start' would be the address one before the actual start, and 'end' the actual last address, so that start = 0x0000 and end = 0xffff makes sense, because 0x0000 can't be used but 0xffff can).
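Under that convention (a hypothetical SYSINFO returning such a start/end pair), the bounds arithmetic would be:

```python
def core_memory_bounds(sysinfo_start, sysinfo_end):
    """'start' is one before the first usable word; 'end' is the last usable word."""
    first_usable = sysinfo_start + 1
    size = sysinfo_end - sysinfo_start   # usable words: start+1 .. end inclusive
    return first_usable, size

# the whole 16-bit space minus the unusable null address
assert core_memory_bounds(0x0000, 0xFFFF) == (0x0001, 0xFFFF)
```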

otoh, on platforms where the linker or even the programmer preassigns a memory block, the program will just implicitly assume it has access to that block, it won't be calling SYSINFO at runtime.

and also otoh, we said that Boot is the minimum for the Oot toolchain, and it's hard to imagine that the Oot compiler won't be using MALLOC. Although i suppose it could include its own MALLOC implementation.

and also otoh, it's not that much effort to provide the stub that says 'Out of Memory' if called.

maybe instead look into SBRK or similar.

Copied from the 'data segment' article: "In computing, a data segment (often denoted .data) is a portion of an object file or the corresponding virtual address space of a program that contains initialized static variables, that is, global variables and static local variables. The size of this segment is determined by the size of the values in the program's source code, and does not change at run time. The data segment is read-write, since the values of variables can be altered at run time. This is in contrast to the read-only data segment (rodata segment or .rodata), which contains static constants rather than variables; it also contrasts to the code segment, also known as the text segment, which is read-only on many architectures. Zero-initialized data, both variables and constants, is instead in the BSS segment."

" Historically, to be able to support memory address spaces larger than the native size of the internal address register would allow, early CPUs implemented a system of segmentation whereby they would store a small set of indexes to use as offsets to certain areas. The Intel 8086 family of CPUs provided four segments: the code segment, the data segment, the stack segment and the extra segment. Each segment was placed at a specific location in memory by the software being executed and all instructions that operated on the data within those segments were performed relative to the start of that segment. This allowed a 16-bit address register, which would normally provide 64KiB? (65536 bytes) of memory space, to access a 1MiB? (1048576 bytes) address space.

This segmenting of the memory space into discrete blocks with specific tasks carried over into the programming languages of the day and the concept is still widely in use within modern programming languages. "

for future reference, here is a detailed look at Linux application memory allocation and layout:

what does GCC do? GCC is (surprisingly) written in C++. It apparently uses garbage collection internally. It uses metaprogramming to reflect on data structure declarations in its source code to inform the garbage collection machinery [2]. [3].

probably a good article:

mb a useful search:

(obstack seems to be an alternative to malloc-style management in libc; [4] talks about it) ( says that in the context of gcc, xmalloc is "a thin wrapper around standard malloc ensuring that it has not failed") ( says that GCC-melt has its own garbage collector on top of ggc)

some notes:

" Why is the older brk/sbrk getting replaced by newer mmap, when an allocation is equal or greater than the MMAP_THRESHOLD? Well the brk/sbrk calls only allow increasing the size of the heap contiguously. If you're just using malloc for small things, all of it should be able to be allocated contiguously in the heap, and when it reaches the heap end, the heap can be extended in size without any problems. But for larger allocatiosn, mmap is used, and this heap space does need to be contiguously joined with the brk/sbrk heap space. So it's more flexible. Memory fragmentation is reduced for small objects in this situation. Also the mmap call is more flexible, so that brk/sbrk can be implemented with mmap, whereas mmap cannot be implemented with brk/sbrk. One limitation of brk/sbrk, is that if the top byte in the brk/sbrk heap space is not freed, the heap size cannot be reduced. " -- [5]

/proc/$PID/maps shows memory layout in Linux. [6] gives a lot of example and explanation about this, and similar things. One type of memory segment listed here is "[heap]". The basic situation is (from low memory to high):

" 0 Program Text (.text) Initialised Data (.data) Uninitialised Data (.bss) Heap

    vMemory Mapped Region for Shared Libraries or Anything Else ^ 
User Stack " [7]

"After reviewing TLPI and check man page (with help from author of TLPI), now I understand how malloc() decide to use brk() or mmap(), as following:

mallopt() could set parameters to control behavior of malloc(), and there is a parameter named M_MMAP_THRESHOLD, in general:

    If requested memory is less than it, brk() will be used;
    If requested memory is larger than or equals to it, mmap() will be used;" [8]

"Well as we said before, anything smaller than MMAP_THRESHOLD uses brk syscall. It appears that the brk/sbrk is called with a padded size in order to reduce the number of syscalls made and the number of context switches...But remember our page size getconf PAGESIZE gives us 4096, meaning each page is 4 KiB?. This means when allocating 1000 Bytes in our program, a full page is allocated being 4 KiB?. And 4 KiB? + 128 KiB? is 132 KiB?, which is the size of our heap. This padding is not a padding to a fixed size, but padding that is always added to the amount allocated via brk/sbrk. This means that 128 KiB? by default is always added to how ever much memory you're trying allocate. However this padding only applies to brk/sbrk, and not mmap, remember that the past the MMAP_THRESHOLD, mmap takes over from brk/sbrk. Which means the padding will no longer be applied. However I'm not sure whether the MMAP_THRESHOLD is checked prior to the padding, or after the padding" -- [9]

"As you can see, when malloc switches to using mmap, the regions acquired are not part of the so called [heap] region, which is only provided by the brk/sbrk calls. It's unlabelled! We can also see that this kind of heap is not put in the same area as the brk/sbrk heap, which we understood to start at the low end and grow upwards in address space. Instead this mmapped heap is located in the same area as the shared libraries, putting it at the high end of the reserved user space address range" -- [10]

PDclib on POSIX uses mmap and munmap and on win32 uses VirtualAlloc for allocation. [11] [12] [13]

"The Linux equivalent of VirtualAlloc?() is mmap(), which provides the same behaviours." [14]

"The brk and sbrk functions are historical curiosities left over from earlier days before the advent of virtual memory management" [15]


note: the brk initial heap size is 0 [17]


ok i haven't read all of those things above, but i briefly skimmed the most important of them.

some analysis:

current decision:


i almost added this text to the Boot spec:


Static memory segments

Boot does not provide facilities for dealing with static memory segments -- only the dynamic instructions MALLOC and DEALLOC are provided. However, this does not mean that static memory segments are forbidden, only that using them is non-portable (or must be specified by a higher-level specification on top of Boot). Implementations may or may not specify:

but then i thought -- if it's not portable and not forbidden, we don't really need to say this, it's implied. Redundantly saying it outright might sort of be encouraging implementations to offer non-portable features, and for programs to use them. Instead, although we don't want to forbid this sort of thing (because (a) OVM is going to specify static memory segment facilities, and (b) some embedded system programming may require the programmer to use special locations in main memory), we don't want to encourage it either.

so, no change. Spent an hour or two thinking about this stuff (to see if we could mb remove MALLOC and DEALLOC) but in the end i like where we are -- MALLOC and DEALLOC but nothing else.

i looked over the current Boot ISA and it seems pretty minimalistic. I don't see anything else that i might want to take away. The only unusual functionality is MALLOC, DEALLOC, READ, WRITE, LOG, and i think we have good justifications for all of those (MALLOC, DEALLOC, READ, WRITE are needed for our toolchain, and LOG because it's easier to efficiently strip log calls out at link/runtime this way).


ok i thought of another important problem.

right now, we (intentionally) handicap pointer arithmetic. Although we allow adding a uint to a pointer, or subtracting a uint from a pointer,

1) we don't provide any way to coerce integers into pointers

2) we don't provide any way to specify immediate or constant pointers (like 0x0000)

3) we don't provide any way to subtract two pointers

The justifications for these are as follows:

Another set of motivations derives from thinking about replacing pointers with 'fat pointers' or 'descriptors' that have a lower bound and an upper bound to the memory region/object that the pointer is within (these are also pointers) attached (to allow a 'safe' runtime or a capability-based 'secure' runtime to prevent out-of-bounds errors/accesses). This gives us the requirements:

The latter two restrictions are necessary in order for the former one (no coercion) to have teeth:

In fact, in general, if you have any countable set with a successor operation and a zero object, such that the set is inductively generated by repeatedly applying successor to the zero object, then you can bijectively map it to/from the uints using the operations of (a) the zero constant, (b) adding 1 to an object (taking the successor), and (c) subtracting zero from an object (mapping it to the corresponding uint).
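A quick illustration of that bijection on a toy countable set (strings of 'x', with the empty string as the zero object) -- which is exactly why forbidding pointer constants and pointer subtraction has to go together with forbidding coercion:

```python
def to_uint(obj, zero, pred):
    """(c) 'subtract the zero object': count successor steps back to zero."""
    n = 0
    while obj != zero:
        obj = pred(obj)
        n += 1
    return n

def from_uint(n, zero, succ):
    """(a) zero constant plus (b) repeated successor."""
    obj = zero
    for _ in range(n):
        obj = succ(obj)
    return obj

# toy countable set: strings of 'x'; zero = '', successor appends an 'x'
assert from_uint(3, '', lambda s: s + 'x') == 'xxx'
assert to_uint('xxx', '', lambda s: s[:-1]) == 3
```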

So, that all sounds good; we've successfully identified a useful proper subset of pointer arithmetic that is transitively closed. But, the problem is that one can still imagine situations in which you'd miss the forbidden operations:


note that 'subtracting pointers that aren't derived from one another produces nonsensical results' sounds like the complicated derivation criteria in the rules of pointer arithmetic validity in C/C++.


if we stick with the restrictions, we may as well get rid of the 'null pointer exception' from accessing 0x0000 since we can't produce 0x0000 anyhow

OK, i did that.


should the type of R0 and R1, when used as constants 0 and 1, be UINT, or should they also refer to pointers to 0x0000 and 0x0001?

according to the above, they should be UINT only.

OK, i did that.


sounds like one possible answer would be to allow subtracting two pointers while still not allowing coercion of ints to pointers (or immediate or constant pointers). When two pointers which are not derived from one another are subtracted, the implementation is allowed to raise an error, to give UINT_MAX, or to give any non-zero integer such that, if this integer were added to the pointer that was subtracted, the result would be either the pointer that was subtracted from or an out-of-bounds error.

a pointer is 'derived from another pointer' if the derived pointer was produced by a series of additions and subtractions of uints from the other pointer (or if it was provided from primitive library functions that produce derived pointers).

we don't have opcode space for a 3-operand version of this, but that's okay, it should be uncommon -- we can do a 2-operand version that puts the result into T.

OK, i did that.
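One way a 'safe' runtime might track derivation, sketched with a hypothetical fat-pointer representation (16-bit uints assumed; returning UINT_MAX is just one of the permitted behaviors for subtracting underived pointers):

```python
UINT_MAX = 0xFFFF   # assuming 16-bit uints

class Ptr:
    """Fat pointer tagging each value with the allocation it derives from."""
    def __init__(self, base_id, offset):
        self.base_id = base_id   # identifies the original allocation
        self.offset = offset
    def add_uint(self, n):
        # additions/subtractions of uints preserve derivation
        return Ptr(self.base_id, self.offset + n)

def psub(p, q):
    """2-operand pointer subtraction; in Boot the result would land in T."""
    if p.base_id != q.base_id:
        return UINT_MAX          # permitted behavior for underived pointers
    return p.offset - q.offset

a = Ptr('alloc1', 0)
b = a.add_uint(5)
assert psub(b, a) == 5                        # derived: meaningful difference
assert psub(b, Ptr('alloc2', 9)) == UINT_MAX  # underived: sentinel result
```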


" How much RAM does FreeRTOS? use?

This depends on your application. Below is a guide based on:

    IAR STR71x ARM7 port.
    Full optimisation.
    Minimum configuration
    Four priorities.

Scheduler itself: 236 bytes (can easily be reduced by using smaller data types).
For each queue you create, add 76 bytes + queue storage area (see FAQ: Why do queues use that much RAM?).
For each task you create, add 64 bytes (includes 4 characters for the task name) + the task stack size. "

(note: ARM is 32-bit so 64 bytes = 16 words)


we probably want to start with a small stack and then resize it (using realloc) as it grows. How big should each stack be initially? Since each greenthread will need its own stack, this ties into how much memory each greenthread requires. Three other systems to look at for examples of systems that use little memory per thread are Erlang, FreeRTOS, and Guile.

FreeRTOS uses about 16 words per 'task' [20], which i think is mostly stack size. Erlang uses about 309 words per process [21], of which 233 words is heap area, which in Erlang includes the stack. [22], in the first figure in the section 'Listing Processes from the Shell', shows that many processes' stacks are very small, eg as small as 2 or 3. Presumably these can just grow downwards as far as they can within the process's heap. Guile starts by allocating one memory page; memory pages are typically 4k [23].

However in Oot i expect that our first naive implementations will just separately MALLOC each stack (and REALLOC to grow it when it gets full), esp. because we have 2 growing stacks per process, not just one, which means that standard code for dealing with a downward-growing stack at the top of a heap which is allocated from the bottom up won't quite work -- we'd probably want one stack at the top growing downward and one at the bottom growing upward, and allocate heap items from the middle out -- but for now we'll just use MALLOC. So how much should we initially MALLOC for each stack? The examples of FreeRTOS and Erlang suggest that we should allocate between 16 and 256 words. And, since each of our threads needs 2 growing stacks, not one (plus a 15-word SMALLSTACK, plus 16 words of registers, plus a little extra for overhead), we should probably shoot for a max of around 128 words, not 256. So, each stack should probably have an initially allocated size between 16 and 128 items.

Let's choose 64 words as the initial stack size. This means that each thread will require about 64*2 + 15 + 16 ~= 160 words plus a little more overhead -- since our words are at least 16 bits, that's a little more than 320 bytes. So a little less than 512 bytes per process. That sounds reasonable. That gives us about two million threads per gigabyte of memory -- decent but not quite up to 'human brain' numbers (100 billion neurons -- for a 16GB computer that's about 3000x the number of processes we'd have -- i figure one process can do the work of many neurons, but 3000x? but that's not surprising; to have even one byte per 100 billion neurons would require about 100G (93GiB) of memory, and of course what you'd really want is one nibble per synapse, which is more bytes than a contemporary 8-terabyte hard drive) -- but it's a good start.
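The arithmetic, spelled out (512 bytes is the per-thread budget after rounding up for overhead):

```python
INITIAL_STACK_WORDS = 64   # per growing stack; two growing stacks per thread
SMALLSTACK_WORDS = 15
REGISTER_WORDS = 16
BYTES_PER_WORD = 2         # words are at least 16 bits

words = 2 * INITIAL_STACK_WORDS + SMALLSTACK_WORDS + REGISTER_WORDS
assert words == 159                   # 'about 160 words' before overhead
assert words * BYTES_PER_WORD == 318  # a bit over 320 bytes once overhead is added

threads_per_gib = (1 << 30) // 512    # budgeting 512 bytes per thread
assert threads_per_gib == 2_097_152   # about two million threads per GiB
```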


Guile points out something interesting: if there is a stack overflow (the stack is full and MALLOC won't give us any more memory to expand it), since there is no stack available, 'unwinders'/deferred functions/destructors/context-managers can't be executed. So we probably want to have a 'soft' stack overflow when the stack is 80% full, or something like that.
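A sketch of such a soft limit (the 80% figure from above; the returned flag stands in for whatever mechanism would trigger soft-overflow handling while stack space remains for unwinders):

```python
SOFT_LIMIT = 0.8   # signal 'soft overflow' at 80% full

def push(stack, capacity, value):
    """Push, signalling soft overflow early so unwinders still have room."""
    if len(stack) >= capacity:
        raise MemoryError('hard stack overflow')   # no room left even to unwind
    stack.append(value)
    return len(stack) >= SOFT_LIMIT * capacity     # soft-overflow flag

stack = []
flags = [push(stack, 10, i) for i in range(9)]
assert flags == [False] * 7 + [True, True]   # flag fires at the 8th of 10 slots
```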


i read a great, relevant paper on L4 and took a bunch of notes in notes-computer-systems-greenfield.

should probably review those notes, and also look at the seL4 manual (see below), and also look at FreeRTOS.

sel4 stuff toread:

section 2.1 of

section 2.1 of

seL4 manual:

freertos tutorials and manual toread:

perhaps also look at other prominent L4 kernels: L4Ka::Pistachio, Fiasco.OC


more movement on RISC-V memory model:

" In order to find a compromise, defined RVTSO (strong) and RVWMO (weak).

    Both are multi-copy atomic. This means cores are allowed to peek at stores they have issued as long as they haven’t been observed by anyone else, and is much simpler to reason about than the Power and ARMv7 memory models.

The base RISC-V memory model is RVWMO (software must assume this if it wants to be portable). It’s important to note that a hardware implementation meeting the RVWMO specification could be more conservative (stronger). Additionally, there is the Ztso extension for RVTSO, which software might target.

RVWMO and RVTSO differ in the degree of memory access reordering they permit at the point of global visibility. In RVTSO, only store-to-load reordering can be observed. In RVWMO, most memory accesses can be reordered freely unless synchronized via .aq, .rl and/or fences.

Dan has some handy diagrams that explain the RVWMO and RVTSO rules in a nutshell, which you should be able to study once the slides become available.

There have been a number of other ISA changes. ld.rl and are deprecated. ld.aqrl and sd.aqrl mean RCsc. Also clarified other subtleties. May have future extensions to the fence instruction, and .aq/.rl variants for byte and halfword-size loads/stores. "


" I'm a coauthor of a new cryptographic library, NaCl?, that systematically avoids leaking secret data into load addresses, store addresses, and branch conditions. But this library is relying critically on observations of CPU behavior. We would much rather rely on promises made by the CPU manufacturer.

This brings me to my first suggestion: Please specify that various instructions keep various inputs secret, avoiding all data flow from those inputs to instruction timing, cache addresses, the branch unit, etc. Right now this isn't part of the specified architecture; it's a microarchitectural decision that could change in the next CPU.

For example, CMOV seems to keep its condition bit secret on various Intel CPUs today. Are you willing to commit to this for CMOV from a register? What about CMOV from memory? Or do you want to keep open the option of having CMOV from memory look at its condition bit and skip the load? What about multiplications: are you willing to commit to multiplication taking constant time, or do you want the option of PowerPC-style early aborts for multiplications by integers with many top 0 bits? Surely you're at least willing to promise secrecy for vector operations within registers, and for vector loads and stores; we can build safe cryptographic software from those operations even if you aren't willing to commit to anything else. " [24]
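For reference, the software fallback when no such promise exists is branchless selection: compute with masks so that neither branches nor memory addresses depend on the secret. A minimal sketch in Python, modeling 64-bit words (the function name is mine, not from NaCl or any particular library):

```python
MASK64 = (1 << 64) - 1

def ct_select(cond, a, b):
    """Return a if cond == 1 else b, without branching on cond.

    -cond is all-ones (cond == 1) or all-zeros (cond == 0), so the same
    operations run regardless of the secret condition bit -- the pattern a
    register-to-register CMOV is hoped to implement in hardware.
    """
    mask = (-cond) & MASK64
    return ((a & mask) | (b & ~mask)) & MASK64

assert ct_select(1, 5, 9) == 5
assert ct_select(0, 5, 9) == 9
```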


" the re- mote store programming (RSP) model ... processes have private address spaces by default, but they can give other processes write access to their local memory. Once a producer has write access to a consumer’s memory, it communicates directly with the consumer using the standard store instruction to target remote memory, hence the name “remote store program- ming.” Communication in the RSP model is one-sided and easy to schedule and consumer processes are guaranteed to read physically close, or local, memory. This locality of re f- erence is important for performance on non-uniform mem- ory access (NUMA) architectures. The RSP model is similar to both the partitioned global address space (PGAS) [6, 8, 28, 16] and virtual memory mapped communication (VMMC) [11] models... the RSP model is distin- guished in the following two ways. First, RSP is defined by including only mechanisms that require incremental hard- ware support in multicores. For example, the RSP model does not include features that require hardware support for cache coherence or direct memory access (DMA), which is commonly used in PGAS and VMMC implementations 1 . Second, RSP programs are characterized by extremely fine grain communication that streams from a source processor’s registers to a destination processor’s cache. In compariso n, multichip programs require data to be buffered in memory on the producing core and then transferred to the consuming core in bulk. "

used in Celerity [25]


some of the things that the RISC-V guys wanted but didn't find in other ISAs, according to Table 2.1 at the end of Chapter 2 of Design of the RISC-V Instruction Set Architecture are:

in the rest of that chapter, they have other complaints about various extant ISAs:




SPARC: good:


Alpha: good:


ARMv7: todo


" We all respect elegant lasting structures (fork / exec / stdin / stdout is one such structure). a creative field with huge output will generate more elegant lasting structures. We just can't tell which ones will survive from where we sit in the middle of it all. "


flatline3 1933 days ago [-]

That's a false dichotomy -- a straw-man -- that you're using to discredit the idea of planning ahead.

Compare FreeBSD kernel design versus Linux.

kqueue vs. dnotify/inotify/???

Mach-descended VM vs. a string of linux-vms

BSD scheduler, ULE scheduler vs. how many different schedulers?

Linux churns through ill-conceived solutions to problems until they find one acceptable enough. FreeBSD grinds on one until it definitively works.

FreeBSD almost invariably winds up with the better solution, in less time. See kqueue, for example -- the foundation upon which Apple's GCD is built.

jballanc 1933 days ago [-]

It's very interesting that you bring up kqueue. I was about to bring up kqueue, but for a different reason.

Planning ahead can lead to great things. Notre Dame, and the dozens of other cathedrals throughout Europe, are positively stunning...

I love and I hate kqueue. I mean, I love kqueue. I love the way it can be used from C, I love the way it's integrated in MacRuby...I spent the weekend studying Clojure's reducers and was dying to have some time to work on ClojureC just so I could implement reducers with kqueue...

I hate kqueue because, in all likelihood, I'll never get to use it in a production system, because it's not in Linux.


flatline3 1933 days ago [-]

An example:

libkse was an attempt to implement M:N thread scheduling. This is something that can, in theory, provide significant benefits over 1:1 thread scheduling, but in practice, is extremely complicated to implement and has fallen out of favor across the board: Solaris abandoned the approach in Solaris 9, and FreeBSD abandoned it in (IIRC) FreeBSD 7.

Concurrent to libkse, libthr was developed by David Xu, and was also included in the FreeBSD base system. It implemented 1:1 threading, is far simpler than libkse, and has replaced libkse as the default threading library.

I would argue that in this case, FreeBSD's cathedral model failed; M:N threading ideally would have never been attempted, and it was wasteful to attempt to implement two distinct threading libraries. However, considerable thought and expertise went into the work, and the work (and decisions around it) were not made flippantly or taken lightly.

It was simply a case where despite best intentions and best effort, the wrong choice was made. At the same time, in what some might call "bazaar-like", David Xu maintained libthr as an alternative. It remained there, ready for adoption as the default threading library, until such time as it became apparent that M:N via libkse was a dead-end.

This was a mistake, but no entity is perfect, and the success rate of FreeBSD's considered decision-making remains statistically high. In this case where something never worked, it was replaced with something equally well-considered and far more successful.

Moreover, compared to Linux's gross missteps with their threading implementations (including such gems as setuid() being thread-local as an implementation side-effect), libkse was a minor fender-bender.


rektide 1933 days ago [-]

Lambasting libtool for providing a consistent experience across star-NIX is, imo, not the wisest move for a FreeBSDer.

Article: This is a horribly bad idea, already much criticized back in the 1980s when it appeared, as it allows source code to pretend to be portable behind the veneer of the configure script, rather than actually having the quality of portability to begin with. It is a travesty that the configure idea survived.

Good high-minded notions here. But configure, with its standardized parameters for how to do stuff, is near irreplaceable at this point. Certainly a more stripped down version, one not built on M4, would be wise, but libtool/autoconf itself is used too broadly & with trepid familiarity by developers & upstream maintainers: in spite of so much of it being indeed old deprecated no longer useful cruft, the best we can hope for is duplicating a wide amount of the existing functionality in a cleaner manner.

But at what cost would reimplementation come? How many weird build targets would for months or years go unnoticedly broken?

The place where we escape these painful histories is where we leave the old systems programming languages behind. Node's npm I'd call out as a shining beacon of sanity, enabled by the best most popular code distribution format ever conceived: source distribution, coupled with a well defined not-completely-totally-batshit algorithm for looking for said sources when a program runs & at runtime goes off to find its dependencies:

phkamp 1933 days ago [-]

Why should I as a FreeBSD person not be allowed to lambast the worst hack-on-hack-on-hack-on-hack I have to suffer?

And trust me, libtool is replaceable; all it takes is an agreement about a compiler and loader flag for producing shared libraries and you suddenly don't need it at all.


phkamp, sorry for missing the reply. M4 is ugly, libtool takes forever, everything in it is a hack. It's proven to be an at least adequately flexible hack that has kept *nix unified, more or less. I respect that, but I'm not a systems programmer that gets burned by it on a regular basis either, and I don't mind that my OpenWRT compiles of all packages take 32 hours and it's all libtool's fault. ...

justincormack 1933 days ago [-]

I think FreeBSD, and the Linux distributions, do try to cater to too many different people, and quality and coherence suffer a lot from this. I think we can get past this though. The culture of testing and good code is on the ascendant again in many quarters. You need more people to understand build, packaging and distribution better, sure. You also need autotools to die, as the use cases for it are mainly dead. You can generally write portable code to the systems that matter if you want to now, and it just works.

A lot of the problems are due to poor integration between languages, so for example the JVM people have reimplemented almost everything, as have the C++ people.

phkamp 1933 days ago [-]

nodding violently.

tedunangst 1933 days ago

parent favorite on: A Generation Lost in the Bazaar

When OpenBSD replaced GNU libtool with a home grown perl version, it was so much faster I believe it literally cut days of machine time off a full ports build. For smaller packages, with tiny C files, running configure takes longer than running the compiler does. The majority of build time for some of those packages is still running configure, testing for things like <stdio.h>, for which the package provides no workaround when missing. The OpenBSD project alone has spent years of machine time running configure and libtool.

As for doing its job well, the failure mode of configure "you can't build this" is abysmal. Just give me a fucking Makefile, I'll fix it myself. I love packages that come with Makefiles that don't work. I pop "-I/usr/local/include" into CFLAGS, run make again, and boom. Done. Trying to do the same with configure? Forget about it. --with-include-dir or whatever doesn't work, because it's really running some bogus test in the background which expects /bin/bash to exist and so on and so forth.

phkamp 1933 days ago [-]

I'm nodding so hard my neck hurts.


" A logical left shift by n is equivalent to multiplication by 2 n . A logical right shift by n is equivalent to unsigned division by 2 n , rounding towards zero, whereas an arithmetic right shift by n is equivalent to signed division by 2 n , rounding towards −∞ .

There is no SUBI instruction, because ADDI with a negative immediate is almost equivalent. The one exception arises from the asymmetry of the two’s complement representation: SUBI with an immediate of −2^11 would add 2^11 to a register, which ADDI is incapable of. "
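The quoted identities are easy to check concretely; a quick Python sanity check (Python's >> on negative ints is already an arithmetic shift, i.e. floor division):

```python
# left shift = multiply by 2**n (as long as the result fits the register)
assert (3 << 4) == 3 * 2**4

# logical right shift of an unsigned value = divide by 2**n toward zero
assert (100 >> 3) == 100 // 2**3

# arithmetic right shift = signed divide by 2**n rounding toward -infinity
assert (-7 >> 1) == -7 // 2 == -4      # toward-zero division would give -3

# ADDI's 12-bit signed immediate spans [-2**11, 2**11 - 1]; SUBI with an
# immediate of -2**11 would need to add +2048, one past ADDI's maximum 2047
imm_min, imm_max = -(2**11), 2**11 - 1
assert imm_max == 2047 and -imm_min == 2048
```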


" The other is AUIPC, short for add upper immediate to pc , which adds a 20-bit upper immediate to the pc and writes the result to register rd . AUIPC forms the basis for RISC- V’s pc -relative addressing scheme: it is essential for reasonable code size and performance in position-independent code. To demonstrate this point, Figure 3.3 shows code sequences to load a variable that resides 0x1234 bytes away from the start of the code block, with and without AUIPC

WITH AUIPC:

    auipc x4, 0x1
    lw x4, 0x234(x4)

WITHOUT AUIPC:

    jal x4, 0x4
    lui x5, 0x1
    add x4, x4, x5
    lw x4, 0x230(x4)

... 5 Position-independent code without AUIPC could be implemented with a different ABI, rather than using the jal instruction to read the PC. The MIPS PIC ABI, for example, guarantees that r25 always contains the address of the current function’s entry point. But the effect is to move the extra instructions to the call site, since r25 needs to be loaded with the callee’s address. " [26]
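The auipc/lw pair works because the assembler splits the 32-bit pc-relative offset into a 20-bit upper part and a 12-bit signed lower part, adding 0x800 first to compensate for the lower part being sign-extended. A sketch of that split (the helper is mine, not from the RISC-V sources):

```python
def split_pc_relative(offset):
    """Split a pc-relative offset into (hi20, lo12) such that
    (hi20 << 12) + lo12 == offset, with lo12 in [-2048, 2047].
    hi20 goes into auipc's immediate, lo12 into the load's displacement."""
    hi20 = (offset + 0x800) >> 12
    lo12 = offset - (hi20 << 12)
    return hi20, lo12

# the quoted example: a variable 0x1234 bytes away
assert split_pc_relative(0x1234) == (0x1, 0x234)  # auipc x4,0x1; lw x4,0x234(x4)
```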


removed from BootX:

12. addi-uint: ioi k (add immediate constant to uint, in-place)
13. addi-ptr: iop k (add immediate constant to ptr, in-place)

because we're adding addi-uint to the 3-operand instructions (replacing LOADI and LOADLO, in conjunction with SLL), and addi-ptr isn't too useful if it can only add up to 16.

removed from Boot for the same reason:

3. loadi: oi k k (load immediate 8-bit uint)
4. loadlo: oi k k (load immediate low 8-bit)

-- TODO maybe just have LOADI and let streams of contiguous LOADIs represent loading larger values? Or even make programs compute 256 by 255 + 1 and then multiply by it manually when they want to load constants above 255?

and add addi-uint: io k k ().

no, wait a minute; in RISC-V, ADDI could substitute for LOADI because it had 3 different operands -- but if our addi only has io k k, then we can't say 'reg = r0 + constant', we can only say 'reg = reg + constant'. So now we don't have any instruction to make even 8-bit constants in a single instruction (we'd have to do 'cpy reg r0; addi reg k k'). And we don't want to make addi have 3 distinct operands instead of 2, because then k is only 4 bits (range of 16) instead of 8 (range of 256).

so let's put LOADI back. but i guess LOADLO is kinda funky, we can substitute that with chains of LOADI interspersed with SLL, as planned
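As a sanity check that LOADI-plus-SLL chains suffice, here's a sketch of building a 16-bit constant out of 8-bit immediates (the instruction names are the ones from these notes; the exact semantics modeled here are my assumption):

```python
def build_const16(k):
    """Build a 16-bit constant from two 8-bit immediates:
    LOADI r, hi; SLL r, r, 8; then add in the low byte
    (standing in for the removed LOADLO)."""
    hi, lo = (k >> 8) & 0xFF, k & 0xFF
    r = hi                   # loadi r, hi
    r = (r << 8) & 0xFFFF    # sll r, r, 8
    r = (r + lo) & 0xFFFF    # fill the low byte; no carry since it was zero
    return r

assert build_const16(0x1234) == 0x1234
```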


here's 8 addressing modes for OVM:

immediate
direct
indirect
stack
constant
indirect + immediate ('displacement', i think) (6 + 6 bits, that is, 64 registers and a 6-bit immediate)
indirect + index (6 + 6 bits, that is, 64 registers and 64 registers)
postincrement
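For the 6 + 6-bit modes, operand decoding might look like the following sketch (the field order and the unsigned displacement are my assumptions):

```python
def decode_displacement(operand12):
    """Split a 12-bit displacement-mode operand into (reg, disp):
    high 6 bits pick one of 64 registers, low 6 bits are the immediate."""
    return (operand12 >> 6) & 0x3F, operand12 & 0x3F

def effective_address(regs, operand12):
    reg, disp = decode_displacement(operand12)
    return regs[reg] + disp

regs = [0] * 64
regs[5] = 0x1000
assert decode_displacement((5 << 6) | 9) == (5, 9)
assert effective_address(regs, (5 << 6) | 9) == 0x1009
```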


i'm considering just merging Boot and BootX.

The main reason is that i realized that, if the BootX computational instructions are macros, then they'll need some space for temporaries. We could use DATASTACK but that's slow. An alternative is to reserve a register or some space in SMALLSTACK but both of those reduce the amount of registers or SMALLSTACK space available to ordinary BootX programs.

In addition, the extra complexity of having a BootX may not be worth it. It's too many layers.

We could always have optional computational instructions, even within Boot -- then we could say that SMALLSTACK has to be bigger only if the optional instructions aren't implemented. But that also seems like needless complexity.


i took a quick look at Rigetti's QUIL quantum assembly language [27].

Seems to me that it's like traditional assembly, with a program counter, branch instructions, and (bit) registers. The difference is just that you have some 'quantum registers' too, each of which holds one qubit. To act on the quantum registers you can only do certain operations, eg 'apply Hadamard gate to qubit 3'. The classical and quantum bit registers can communicate, eg "measure qubit 3 and place the result into classical bit 4".

You could also think of the quantum registers as I/O devices.


i'm thinking of taking out read, write, malloc, demalloc from Boot, and moving them to BootX. This has a few benefits:


i also moved LDX and STX (indirect index loads and stores) to BootX.


the first commercial microprocessor, the 4-bit Intel 4004, from 1971, had:

so, our design doesn't sound crazy:

compared to the Intel 4004, we don't do worse on any dimension, so we keep:

and we do some minimal upgrades:

and some not-so-minimal upgrades:


do we want to have/allow a 'frame pointer'/'base pointer' to the top of the DATASTACK frame? This would seem difficult unless we require DATASTACK to be located in main memory. Currently we say that it's not necessary that the stacks be directly accessible in memory. And what about a pointer to the caller's frame?

i kind of feel like we shouldn't necessarily bake this in; eg in a Forth-like dual-stack situation you


i'm kind of thinking that SMALLSTACK would be much more useful if you also had a second-to-top-of-stack alias register

but... that might make analysis more complicated, because PUSHing to the stack affects this register too. Well, i guess we already have that problem even if we just have TOS!


also, i think that in OVM, we should consider having 4 bits for addressing locations WITHIN the stack for stack addressing mode. This would mean that stack addressing mode could only access the first 16 registers. So we are not fully orthogonal, but we aren't if we have a displacement addressing mode either

which reminds me, so displacement addressing mode in OVM uses 4 bits for displacement and 4 to select the register. So again, the first 16 regs are special.

so again, just to reiterate, i guess our 8 addr modes in OVM are shaping up to be:

immediate
register
indirect
stack
constant
displacement
indirect indexed
?

where ? is probably postincrement.


since the CALLSTACK and DATASTACK registers can't be written to anyways, maybe writing to them should PUSH onto the stack? This would allow us to do half of stack-addressing in Boot without violating the expectation that reading from a register doesn't change anything (you could do SMALLSTACK too, but then TOS would also change at the same time, which might be weird)

--- has an interesting proposal to eliminate indirect branches in the EVM. I recall that the JVM also started out with an indirect branch instruction that was later eliminated.


" void *dlopen(const char *filename, int flags) opens a binary (usually a shared object... usually), and returns a handle. Giving dlopen NULL as the file name gives us a handle to the main program.

void *dlsym(void *handle, const char *symbol) returns a pointer to a symbol (for example a function) in the binary which handle refers to. "


i still like that idea i had about a hexagonal array of 3 types of nodes (cpu, memory, network/fabricServices), in the "p3m1, (*333)" tiling:

note that although the connectivity is hexagonal, this doesn't mean that the circuits have to be hexagonal; you could have square chips which go together in this pattern:

aa bb cc
aa bb cc
cc aa bb
cc aa bb
aa bb cc
aa bb cc
cc aa bb
cc aa bb

(where

aa
aa

is one chip)

provided that each chip which is touching in the above diagram is connected (eg the chip of type 'a' in the middle of the second row (third and fourth ASCII-art row) is touching a 'b' and a 'c' above it, and a 'c' to its left, a 'b' to its right, and a 'b' and a 'c' below it, for 6 neighbors in total)

each node is next to 3 others of each other type (and none of its own type). Each pair of adjacent nodes shares two neighbors of the 3rd type. Each CPU node is 2 hops away from 6 other CPU nodes, with two paths between each such pair of 'quasiadjacent' nodes, one with the intermediate node in the path being each of the other two types.

This is a 2-D design, but one could imagine that layers of this could be stacked, and network nodes could connect to the network nodes above and below them.

which means that for each CPU node:

one could give each CPU node an address space of 64k words, partitioned in some way between internal memory and the memory of its 3 memory node neighbors.

for example, each memory node could have 16k words, and the CPU node could itself have 4k words of RAM and 8k words of ROM and 4k of special memory-mapped stuff (in addition to the CPU node's registers). Or, we could have more ROM but only 8k of the ROM is memory mapped (semi-harvard architecture).
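The partition arithmetic checks out; a sketch of the example layout (all numbers in words, straight from the paragraph above):

```python
K = 1024  # one 'k' of words

layout = {
    "neighbor memory (3 memory nodes x 16k)": 3 * 16 * K,
    "local RAM":                              4 * K,
    "memory-mapped ROM":                      8 * K,
    "memory-mapped special":                  4 * K,
}
# exactly fills the CPU node's 64k-word address space
assert sum(layout.values()) == 64 * K
```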

the memory nodes could serve as raw memory, or they could also do stuff like memory allocation and garbage collection or simple vector/matrix operations over memory.

when i say the network nodes also do 'fabric services', i mean stuff like:

since the memory in the memory nodes is shared, either CPU nodes must send 'malloc's to the memory nodes, or we must statically preallocate the memory to the CPU nodes. If we do the latter, here's how it might turn out:

if each memory node has 16k words of memory and serves 3 CPU nodes and 3 network nodes:

alternate memory node memory layout:

with that same memory layout, but 1k message length, we have:

to review:

i like it. For an ordinary computer, 4k free RAM per CPU is way too small, but for an extremely massively parallel machine it's way too large (contrast to eg the 1-bit or so memory envisioned for the original Connection Machine (how much did it end up with?)). For a neuron it's probably very large; for a cortical column it may be a little small (but maybe not very; after all, aren't cortical columns sorta like pixels in V1?). 8k is enough to implement a BASIC interpreter. So we're probably in the right ballpark. Since this is a massively parallel machine with high bandwidth and low latency relative to its computing power, it makes sense that half of the computational nodes (the network nodes) and most of the memory (depending on how you count; after all the CPU nodes probably operate on incoming and outgoing messages in-place so these buffers serve a dual purpose) is dedicated to moving data around.

we should have each CPU node and network node be its own clock domain, so that they can each 'go to sleep' and not consume power when they are just waiting for incoming messages. Can we do this if we implement it as an FPGA? Maybe:

so yeah they can deal with multiple clocks. Not sure if you can get the power savings from having asynchrony with FPGAs or not (and FPGAs are power-inefficient anyways i think) but at least this means you could prototype the asynchrony and then be all ready to go to an IC and get the power savings there.

so: yeah, i like it.

one concern: we may want to reduce message size 4x to 256 words (512 bytes) in order to fit into the 'safe size' for UDP messages (the minimum MTU size that a host can set is 576, and IP header max size can be 60 bytes, and then computing 576 MTU - 60 IP - 8 UDP = 508; 512 is pretty close). Or at least to 512 words (1k bytes) to fit in the next-most-safe category (unfragmented UDP packets assuming a 1500 MTU; i think a maxed size ethernet packet payload is around 1500 bytes?). This is only a 2x reduction; if we reinterpret the above numbers as bytes instead of 16-bit words (eg each node can address 64k bytes; message sizes are 1k bytes; nodes have 8k bytes internal RAM), then we're already there. But i don't think it would be crazy to instead adjust message sizes to 512 words, which would end up (i think) reducing the total size of addressable message buffers from 52k to 26k, freeing up 26k words (=52k bytes!) for more local addressable ROM, special memory, or RAM.
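The payload arithmetic for the various MTU tiers, as a small helper (a maximal 60-byte IP header is the worst case; typical headers are 20 bytes, so real headroom is usually larger):

```python
IP_HEADER_MAX = 60   # IPv4 header with maximum options
UDP_HEADER = 8

def safe_udp_payload(mtu):
    """Largest UDP payload guaranteed to fit unfragmented at this MTU,
    assuming a worst-case IP header."""
    return mtu - IP_HEADER_MAX - UDP_HEADER

assert safe_udp_payload(576) == 508     # minimum MTU a host may assume
assert safe_udp_payload(1500) == 1432   # typical Ethernet MTU
```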

another idea: an alternate topology, and one that generalizes to 3-D, would be to use a square tiling instead of hexagonal. This presents the problem that neighbors in a square tiling don't share neighbors (so a CPU node and its neighboring network node wouldn't share a memory node), but this can be solved by saying that diagonals are also connected (so two squares are connected if they share any vertex; so each square has 8 neighbors, not 6 (i think the most obvious generalization of this to 3-D means that in 3-D, each cube would have 26 neighbors, instead of 6!)). These numbers are not divisible by 3, so if there are 3 types of nodes, choices must be made. One way you could do this is: half of the nodes are memory nodes, offset by 1 each row so that they touch at diagonals; out of the empty spaces, half are CPU nodes and half are network, alternating as you go both by row and by column. This means that you get diagonals of nodes of the same type:


so this is not quite as useful as the hexagonal arrangement (because of the waste of connecting nodes of the same type at diagonals), so i like it less. One change could be to say that only the 4 edge-sharing neighbors are connected, instead of 8 vertex-sharing ones; now each network and each CPU node is surrounded only by memory nodes; but this is ok because the only difference is that CPU nodes can no longer directly notify network nodes when a packet is ready, they must only use the memory nodes as intermediaries -- so perhaps the network node can tell the memory node to 'wake it up' when certain memory locations are modified (which is useful general functionality anyways).

Or, one could decrease the number of CPU nodes to provide for each CPU node to be surrounded by network nodes, and connect at vertices again; this allows the network nodes to form vertical-diagonal 'columns' (but not vertical-diagonal rows at the same time):


Note that, using only network-network connections plus network-memory-network connections, all network nodes can be connected. Or one could alternate between network vertical-diagonal columns and actual network rows:


In this figure, we see 7*7 = 49 squares, with 25 memory nodes, 8 CPU nodes, and 16 network nodes (~50%, 17%, 33%). Here, unlike in the hexagonal proposal, no horizontal CPU nodes share memory nodes with each other (but vertical ones do); communication between CPUs is always mediated by network nodes. I guess this pattern still has vertical/horizontal asymmetry then; so we can probably do better (?), maybe by adding some CPU rows to match our CPU columns, replacing every other network row:


But let's replace the nM rows with nn rows to allow the network nodes to connect without n-M connections:


and maybe reduce the prevalence of CMnMnMC rows:


still strange that there are a lot of MnMCMnM rows. Also, we've disconnected the 'n's again with the CMC row. Reconnect by eliminating one C:


now we have: 28 Ms, 21 ns, 15 Cs (44%, 33%, 24%). But this geometry is getting pretty complicated. Otoh in the hexagonal geometry network nodes were never directly connected.

Perhaps if we want network nodes to be connected it makes more sense to just have 50% network nodes instead of 50% memory nodes:

CnMnC
nMnCn
MnCnM

This geometry still has 'wasted neighbors' of Ms next to Ms and Cs next to Cs, and it has asymmetry (there are diagonal stripes of Ms and Cs going from up-right to down-left) but it's much simpler. One change could be to break up the M diagonal stripes by adding more Cs:

CnMnC
nCnCn
MnCnM
nCnCn

and we could then eliminate that C without any M neighbors by replacing it with an M:

CnMn
nCnC
MnMn
nCnC

much better, i think. In this 4x4 patch we have 8 ns, 5 Cs, 3 Ms (50%, 31%, 19%). The downside here is that we'd probably prefer more Ms than Cs. But the upside is that we don't waste any adjacencies on 'M-M's; each M is surrounded by 4 ns and 4 Cs. The above is the basic repeating unit, but to more easily see the pattern, double it:

CnMnCnMn
nCnCnCnC
MnMnMnMn
nCnCnCnC
CnMnCnMn
nCnCnCnC
MnMnMnMn
nCnCnCnC

As you can see, there are little 'stars' of 5 Cs, with 1 center C and 1 C at each of its diagonals (in the repeating unit figure, the upper-left square was a center C). We can break these, without adding M-M adjacencies, by replacing the center of the star with M, which also gains us a smaller repeating unit:

Mn
nC

so now we have 50% n, 25% M, and 25% C (and no C-C connections). To more easily see what's happening, repeat it:

MnMn
nCnC
MnMn
nCnC

well i feel pretty dumb for going thru all of that just to get here. It's good though.

With this topology we can just say that each memory node has 64k of memory and this is divided into 8 parts for its 8 neighbors, so 8k for each of its neighbors. A C node has 4 M neighbors, so it 'owns' 32k of memory there. If the memory nodes had double the memory then the CPU nodes would have their entire address space taken up by that, which isn't what we want (we want them to have some local memory too -- i think?), so this is probably good. We can also give each CPU node direct access to a small amount of memory in each neighboring n node; let's say 4k, which means that each n node must host 4k for 6 of its 8 neighbors (the C and n neighbors) for a total of 24k. So the C nodes now have 16k left in their 64k address space, which we can call their local memory. Or, we could say that each CPU node can DMA 1k in net nodes, so each net node must host 6k; and we can say that each CPU also hosts 1k per net neighbor in itself, for a total of 4k, leaving 8k local memory for CPU nodes. Each net node now hosts 6k for neighboring n and C nodes, owns 6k at neighboring n and C nodes, own 16k at neighboring M nodes, leaving 36k of 64k addressing space free; allocate some of that (16k?) for two buses, one for each diagonal line. So:

memory node: 64k memory, hosting 8k for each neighbor

cpu node: 8k local memory, owns 32k at neighboring M nodes (at least 4k of this is probably reserved for communicating with the C nodes on the other sides of those M nodes), and 4k at neighboring n nodes, and hosts 4k for neighboring n nodes

net node: 8k local memory, hosts 6k for neighboring n and C nodes, owns 6k at neighboring n and C nodes, owns 16k at neighboring M nodes, 16k for two diagonal n-line buses (8k for each bus), (12k addressing space free; one imagines that the diagonal line buses can only get so large, after which point we have to break the repeating tessellation into larger squares; perhaps 8k of the remaining can be used for 'global'-intra-square or for inter-square communication, and 4k for booting or debugging or 'global' or something like that; or perhaps that 8k could instead be used for two more busses, so that we have a vertical and a horizontal and two diagonal busses intersecting each network node)

so this seems good too, and maybe easier to think about than the hexagonal geometry (and more easily generalizable to 3-D than hexagonal), and also more things are powers of 2.

so this suggests that a message size of about 1k is appropriate. It actually has to be slightly less because out of the 1k DMA zones, we need a few bits for locking/synchronization (so that the recipient knows when the sender is done writing a new packet into memory, and so that the sender knows when the recipient is done reading it). If we had an internet MTU of 1024 (as suggested by various sources based on RFC 4821), then we'd need to subtract 60 bytes for a max-size IP header plus 8 bytes for a UDP header, so we'd have room for 956 bytes of payload. So let's say that our max message size should be <= 956 bytes.

if each node had its own address, then with 16-bit addresses we could have up to 64k nodes (32k net nodes + 16k CPUs and 16k memory nodes); alternately we could say that whether we are talking about net nodes or CPUs or memory nodes is clear from context, in which case we could have 64k net nodes and 32k CPUs and 32k memory nodes. But let's stick with 64k total nodes to be safe. Now, each memory node has 64k, so that's 16k*64k = 64k^2/4 = 2^32/4 = about 1 GiB total memory in the memory nodes. So, we'd probably group these things into chips of 16k CPUs each with 1 GiB globally addressable (sorta; at least in debugging) memory per chip; however only half of this memory is managed by the fabric (the net nodes), so half a gigabyte (so 2^15 bytes per CPU = 32k per CPU; note that this is different from the 32k that each CPU node 'owns' at each memory node; here we are talking about memory in the memory nodes 'owned' by the network nodes; another way to see this is that there are the same number of memory nodes as CPU nodes, and each memory node has 4 CPU neighbors and 4 network neighbors, and its memory is owned equally by its neighbors, so out of its 64k of memory, 32k of that must be owned by network nodes). So each chip has 16k CPUs and half a gig of (fabric-managed) RAM.

So, to have the same amount of RAM as my current computer (16 GiB), i'd have 32 chips, which would have 512k CPUs.

With 64k nodes per chip, you'd arrange this as 256 rows by 256 columns. So there would be 256 diagonals, each with 256 net nodes in them. So there are 256 things contending for each bus.

We could consider breaking each diagonal up in a fractal/power-law-y way; there are a bunch of small local busses, interrupted by a few 'highway' nodes which participate in global busses (are the highway nodes also on the local bus?). The 'highways' are interrupted by 'superhighways', etc. Maybe each network node could just have one bus connection per diagonal, so either 'local' or 'highway'. If there are 8k network nodes per diagonal, that's 2^13. We could have 12 levels of highway, with power-of-two distances. To make routing easier, we should have the canonical address of each node be based on its 'diagonal coordinates' relative to the network nodes, rather than the horizontal-vertical x-y coordinates. To make routing easier, which highway level a net node is on should be determined by how many trailing (least-significant-bit) zeros there are in the component of its address corresponding to the diagonal direction in question; eg 1101100 is more 'local' and 1100000 is more 'highway' (there are many 'highways' at each level, of one length per level, so as to cover the whole diagonal). ... but i dont think we need that complexity, at least not initially. 256 things contending for a bus is probably workable, even if not efficient.
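The trailing-zeros rule is cheap to compute; a sketch (the function name is mine):

```python
def highway_level(diag_coord):
    """Number of trailing zero bits in the node's coordinate along one
    diagonal: more trailing zeros means a more 'highway' node."""
    assert diag_coord > 0
    level = 0
    while diag_coord & 1 == 0:
        diag_coord >>= 1
        level += 1
    return level

assert highway_level(0b1101100) == 2   # more 'local'
assert highway_level(0b1100000) == 5   # more 'highway'
```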

Note that 64k bits is 8k bytes. Which is the amount that each CPU or network node owns in a single memory node. So it would be (just barely) convenient to work with a bitmask with one bit for each node in a chip (this is also the amount in local memory but i'm betting most algorithms can't stand to have every bit of local memory used for just one data structure, even if there are also registers).

also i should note that, at least in early development, the network and CPU nodes will probably have similar hardware, with different software. The network nodes just run the fabric OS, whereas the CPU nodes run the user program; this will allow us to modify the fabric OS easily during development. The busses are different hardware for the network nodes, but maybe in early development we won't have them.

anyway the key takeaways are:

...wait that's wrong. The point of the memory nodes is that all of the memory in them is supposed to be addressable by each neighbor, even the memory 'owned' by others. So they can't be 64k, because then other nodes that neighbor more than one memory node can't address it all.

okay, let's try again. If each memory node has 16k, and each CPU node has 4 neighbors, then the entire addressable 64k is taken up by memory nodes; there is no room for local memory. However if each memory node only has 8k, that's not much; and in that case each address space would then have 32k free, which would be divided up the same as above. In that case, the amount of memory in a memory node is the same as the local memory in a CPU or network node. Hmm, that might make early versions easier because you could just use the same node type in all of them and implement the 'memory node' behavior in software, at least initially. So:

memory node: 8k memory, hosting 1k for each neighbor

cpu node: 8k local memory, can access 32k at neighboring memory nodes (out of which, it 'owns' 4k) (this may be completely used up as a buffer for incoming messages from the C nodes on the other sides of those M nodes), and 4k at neighboring n nodes, and hosts 4k for neighboring n nodes

net node: 8k local memory, hosts 6k for neighboring n and C nodes, owns 6k at neighboring n and C nodes, owns 16k at neighboring M nodes, 16k for two diagonal n-line buses (8k for each bus), (12k of addressing space free; one imagines that the diagonal line buses can only get so large, after which point we have to break the repeating tessellation into larger squares; perhaps 8k of the remaining can be used for 'global'-intra-square or for inter-square communication, and 4k for booting or debugging or 'global' or something like that; or perhaps that 8k could instead be used for two more busses, so that we have a vertical and a horizontal and two diagonal busses intersecting each network node)

wait, that's still not right; 'cpu node' memory only adds up to 8+32+8 = 48, and 'net node' adds up to 8+6+6+32+16+12 = 80. this mistake was left over from before i corrected the memory nodes from 64k to 8k.

okay let's try again:

memory node: 8k memory, hosting 1k for each neighbor

cpu node: 24k local memory, can access 32k at neighboring memory nodes (out of which, it 'owns' 4k) (this may be completely used up as a buffer for incoming messages from the C nodes on the other sides of those M nodes), and 4k at neighboring n nodes, and hosts 4k for neighboring n nodes. 24+32+4+4 = 64.

net node: 24k local memory, addresses (and 'owns') 16k of memory at neighboring memory nodes (8k at each of 2), hosts 6k for neighboring n and C nodes, owns 6k at neighboring n and C nodes, 12k for four diagonal n-line buses (3k for each bus). 24+16+6+6+12 = 64.
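As a check that each node type's 64k address space is now fully accounted for, here are the two budgets as Python dicts (the labels are mine, and i've assumed the 16k the net node 'owns' at its M nodes is the same 16k window it addresses there):

```python
# Address-space budgets for the cpu and net node types described above;
# each must sum to exactly the 64k that 16-bit addresses allow.
K = 1024

cpu_node = {
    'local memory':                          24 * K,
    'window into neighboring M nodes':       32 * K,  # 4k of it 'owned'
    'window into neighboring n nodes':        4 * K,
    'hosted for neighboring n nodes':         4 * K,
}
assert sum(cpu_node.values()) == 64 * K

net_node = {
    'local memory':                          24 * K,
    'window into 2 neighboring M nodes':     16 * K,  # 8k each, 'owned'
    'hosted for neighboring n and C nodes':   6 * K,
    'owned at neighboring n and C nodes':     6 * K,
    'four diagonal n-line buses':            12 * K,  # 3k per bus
}
assert sum(net_node.values()) == 64 * K
```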

ok this is dumb the memory nodes have less memory than the cpu nodes or the net nodes.

let's just put all of the memory in the memory nodes and not have any local memory in the cpu and net nodes at all. Then the cpu and net nodes don't need any DMA circuitry because that's all in the memory nodes. To recall:

Mn
nC

so now we have 50% n, 25% M, and 25% C (and no C-C connections). To more easily see what's happening, repeat it:

MnMn
nCnC
MnMn
nCnC

In this configuration we can have congestion because each network node has 6 neighbors that might want to send packets to it, but it only has 4k buffers for these incoming packets. This suggests that either we reduce the packet size, or we use the 32k free addressable memory space at the network nodes for DMA. We could also use the 32k at the network nodes for busses. Let's do both:

The nice thing about this is that for early development, we can leave out the busses and the network node DMA connections. But it bugs me that the network node has local memory and the cpu node does not.

so takeaways:

y'know, we have too many CPUs on a chip for the amount of memory we have, and this idea of having special DMA channels between adjacent network nodes makes things more complicated because the network nodes have special hardware, and also those DMA channels probably aren't even that useful because a lot of the routing will be done via the busses. So let's try having 50% memory nodes instead of 50% network nodes. Now we have:




note that this way, each node only needs 4-way connectivity (up,down,left,right) instead of 8!


so takeaways:

Yeah, let's take some 'non-local working memory' away from the nodes, and have more fabric memory instead:


ok that sounds pretty good.

one remaining reservation is that, since the 64k address-space limit is the major bottleneck, it seems like a waste to expose, to every CPU node, the four 1k buffers belonging to the other CPU nodes that neighbor its neighboring memory nodes (that is, a given CPU node has access to 15k from each of 4 neighboring memory nodes; but 4k of this 15k is spent on buffers owned by the 4 other CPU nodes neighboring the same memory nodes). The advantage of this is that it allows the CPU nodes to communicate directly with neighboring CPU nodes, without involving the net nodes; but if we eliminated it we could increase each CPU's local RAM by 4k, to a total of 8k instead of 4k. But i like leaving open the possibility of direct CPU-CPU communication. Each CPU can also see some of its neighboring CPUs' 'working memory', allowing for the possibility of collaboration.

the convention for the buffers could be that the buffers are for INCOMING packets for the nodes that 'own' them.


are there other clues to how small message sizes should be from IPC in various systems?

not surprisingly socket packet sizes are larger than ethernet MTUs [28], so that doesn't change it.

interestingly, some systems have mailbox queue sizes of 4k [29] and 'Maximum number of shared memory IDs' of 4k [30].

the seL4 microkernel uses a structure called seL4_IPCBuffer for IPC [31] [32]. A comment near the end of [33] says that the IPC buffer is 1024 bytes on x86_64.

the Mach kernel, used in Apple systems, has an IPC system ('mach ports'). A send routine is CFMessagePortSendRequest [34]. The file [35] has __CFMessagePortMaxDataSize = 0x60000000L, which is about 1.5 GiB.

libcsp, designed for cubesats, speaks of MTUs in the range of 100 to 256 bytes [36].

linux message queues have a default size (in bytes) of 8k, and messages have a default max size of 8k (and the minimum you can set the max value to is 128). The max value that you can set the max-size limit to is 16M, but it used to be 1M [37].

the linux pipe buffer is 64k, with a max of 1M, but 4k is the max amount that can be transferred atomically [38] [39]. On macOS or FreeBSD, the pipe buffer starts at 16k with a max of 64k [40].

android bundles seem to have limits around 80k or 500k or 1M. android binder transactions have limits of 1M (the size of the binder transaction buffer). [41].

POSIX signals have no payload besides the signal number [42]. But there are also POSIX realtime signals which may contain 4 bytes of data.

The smallest IP packet size maximum allowed is 68 bytes (60 byte header and 8 byte payload) [43]. Note however that UDP demands at least 576.

only tangentially related, but "some DB systems like MongoDB has rather specific unique ID generator, we're assuming a 20 byte length string ID as some "median" value which fits to almost any DB system with a little change. Mongo's ID length is 24 byte" [44]

(note: in ipv6, the minimum MTU is 1280, not 576 like in ipv4. ipv6 has a 'fixed header' of 40 bytes plus optional 'extension headers', of which one presumably common one is the 'fragment' header, which is 8 bytes; so i might guess that 1232 bytes or so would be the amount left for the payload; so this is comfortably above 1024)

so the choice of 1k for the message buffer (and hence a max message size slightly smaller than this) is near the low end, as intended, but there is at least one important IPC mechanism (seL4_IPCBuffer) with the same size limit. The networking package for CubeSats suggests the possibility of even lower limits, but that's pretty far afield. POSIX realtime signals are lower (4-byte payloads) but that's obviously lower than we want here (although probably useful for other stuff). Since we're already at 1k, it's tempting to go just a little further down to 256, which can be indexed by a single byte, but that would be a 4x reduction. Also, 1k allows us to have a 256-word (if words are two bytes) data structure plus a header. Also, these are max packet sizes, not minimums, and their main function is to set the size of packet buffers; if most packets are less than half of the max size, then the buffers, even if sized to 1k to hold only one max-sized packet, could still hold multiple small packets; in this case the buffers could operate as ring buffers (the index of the 'TOS' within the ring could be put within the 1k buffer memory, at the beginning, and could be part of the overhead that makes the max message size less than 1024; for compatibility we may as well reserve 68 bytes, because IP + UDP max headers are 60 + 8 bytes [45]).

Note that, to allow locking the buffers (or parts of them), the memory nodes should support not just reading and writing memory, but also CAS.
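To make the locking point concrete, here's a toy model (in Python; names are mine) of the CAS primitive a memory node might expose alongside plain read and write, used here as a simple buffer lock:

```python
# Toy model of a memory node's compare-and-swap on a 16-bit word.
# A real node would perform this atomically with respect to all of
# its neighbors; this just models the semantics.
def cas(mem: bytearray, addr: int, expected: int, new: int) -> bool:
    """If the 16-bit word at addr == expected, store new and succeed."""
    cur = int.from_bytes(mem[addr:addr + 2], 'little')
    if cur != expected:
        return False
    mem[addr:addr + 2] = new.to_bytes(2, 'little')
    return True

# Using the word at offset 0 of a buffer as a lock (0 = free, 1 = held):
mem = bytearray(1024)
assert cas(mem, 0, 0, 1)        # acquire succeeds
assert not cas(mem, 0, 0, 1)    # a second acquirer must retry
mem[0:2] = (0).to_bytes(2, 'little')  # release
```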

i'm outta time to lookup this stuff, but more links:


so a spectrum of small sizes:

8 bytes: 64-bit
32 bytes: 256 bits (size of various hashes/crypto data structures)
64 bytes: typical cache line, also size of an Ed25519 signature
~512 bytes: safe UDP packet payload size (508 bytes to be exact)
~1 KiB: mostly safe UDP packet payload size, in practice
4 KiB: typical page size, also max amount of RAM available in some MCUs
8 KiB: smallest L1 cache typically seen; about the size of an old-school BASIC interpreter
32 KiB: typical L1 cache (actually 64 KiB, but that's split into two: a data cache (dcache) and an instruction cache (icache))
64 KiB: addressable memory with 16-bit pointers (2^16)
4 GiB: addressable memory with 32-bit pointers (2^32)


" Essential components and minimality

As a microkernel must allow building arbitrary operating system services on top, it must provide some core functionality. At a minimum, this includes:

    some mechanisms for dealing with address spaces, required for managing memory protection
    some execution abstraction to manage CPU allocation, typically threads or scheduler activations
    inter-process communication, required to invoke servers running in their own address spaces... For efficiency, most microkernels contain schedulers and manage timers, in violation of the minimality principle and the principle of policy-mechanism separation. ... The design of the IPC system makes or breaks a microkernel. To be effective, the IPC system must not only have low overhead, but also interact well with CPU scheduling. ... Start up (booting) of a microkernel-based system requires device drivers, which are not part of the kernel. Typically this means that they are packaged with the kernel in the boot image, and the kernel supports a bootstrap protocol that defines how the drivers are located and started; this is the traditional bootstrap procedure of L4 microkernels. Some microkernels simplify this by placing some key drivers inside the kernel (in violation of the minimality principle), LynxOS and the original Minix are examples. Some even include a file system in the kernel to simplify booting. ... what is referred to as third-generation microkernels,[31] characterised by a security-oriented API with resource access controlled by capabilities, virtualization as a first-class concern, novel approaches to kernel resource management,[32] and a design goal of suitability for formal analysis, besides the usual goal of high performance. " [46]


[47] (actually [48]) has a diagram labeled 'figure 2.16 Traditional UNIX kernel' (BACH86) containing these parts under 'syscall interface' and above 'hardware control' and 'device drivers':

there is also some text (reformatted slightly):

" There are about 64 systems calls in System V, of which fewer than 32 are used frequently... ... Files are organized into file systems, which are treated as logical devices; a physical device such as a disk can contain several logical devices (file systems). Each file system has a superblock that describes the structure and contents of the file system, and each file in a file system is described by an inode that gives the attributes of the file. System calls that manipulate files do so via inodes. (and the buffer pool)...There are two versions of the inode: the disk copy that stores the inode information when the file is not in use and the in-core copy that records information about active files. ... The execution of user processes on UNIX systems is divided into two levels: user and kernel. When a process executes a system call, the execution mode of the process changes from user mode to kernel mode: the operating system executes and attempts to service the user request...

... the philosophy of the UNIX system is to provide operating system primitives that enable users to write small, modular programs that can be used as building blocks to build more complex programs. One such primitive visible to shell users is the capability to redirect I/O.

... In addition to servicing system calls, the kernel does general bookkeeping for the user community, controlling process scheduling, managing the storage and protection of processes in main memory, fielding interrupts, managing files and devices and taking care of system error conditions. "

not related but something cool that i happened across while researching this, a map of the linux kernel:


[49] lists the core functions of a microkernel as:

[50] says the same


" The disadvantage to a microkernel is that asynchronous IPC messaging can become very difficult to debug, especially if fibrils are implemented. Additionally, just tracking down a FS/write issue means examining the user space process, the block device service, VFS service, file system service and (possibly) the PCI service. If you get a blank on that, its time to look at the IPC service. This is often easier in a monolithic kernel. GNU Hurd suffers from these debugging problems (reference). I'm not even going to go into checkpointing when dealing with complex message queues. Microkernels are not for the faint of heart. " [51]


"The Carnegie Mellon University Mach operating system, a contemporary example of a microkernel architecture, implements a minimal kernel that comprises thread scheduling, message passing, virtual memory, and device drivers. Everything else, including various APIs, file systems, and networking, runs in user mode. However, commercial implementations of the Mach microkernel operating system typically run at least all file system, networking, and memory management code in kernel mode." [52]


maybe what we want, considering the idea of a bunch of tiny nodes with only 4k, is to have processes but not multiple threads within each process? also, with capability-based memory management, you could have memory regions whose 'ownership' is essentially shared between processes, which allows you to do the things you wanted threads-within-processes for anyhow.

so, there are a bunch of processes (or you could call them threads), and a global opaque store of memory, and each process can only access or share memory using capabilities.
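As a sketch of what that might look like (all names hypothetical; a 'capability' here is just an unforgeable token granting access to one region of the global store):

```python
# Sketch of capability-gated access to a global opaque memory store.
# A process can only touch a region if it holds that region's token,
# and sharing memory amounts to handing the token to another process.
import secrets

class Store:
    def __init__(self):
        self.regions = {}   # capability token -> (bytearray, writable)

    def alloc(self, size: int, writable: bool = True) -> str:
        cap = secrets.token_hex(8)          # unguessable token
        self.regions[cap] = (bytearray(size), writable)
        return cap

    def read(self, cap: str, off: int, n: int) -> bytes:
        region, _ = self.regions[cap]       # KeyError if cap is bogus
        return bytes(region[off:off + n])

    def write(self, cap: str, off: int, data: bytes) -> None:
        region, writable = self.regions[cap]
        if not writable:
            raise PermissionError('read-only capability')
        region[off:off + len(data)] = data

store = Store()
cap = store.alloc(64)        # process A allocates and holds the capability
store.write(cap, 0, b'hi')   # any process handed `cap` may access the region
assert store.read(cap, 0, 2) == b'hi'
```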


anyhow i should probably note that my purpose in going on this microkernel kick is to look at functionality so 'core' that it should maybe be designed into Oot at a low level. So far i see things like:

most of these have POSIX standards, but we want to be simpler than POSIX