"Thumb-2 with its unequaled code density" -- https://news.ycombinator.com/item?id=21988614

---

"Interesting that the B5000 didn't make this list. Berkeley CS252 has been reading the Design of the B5000 System paper for years....

 Aloha 1 day ago [-]

I was also surprised - but I wonder thats the computer architectures language designers like, not computer architects.

--- https://news.ycombinator.com/item?id=21988096

---

https://people.cs.clemson.edu/~mark/admired_designs.html

" ... CDC-6600 and 7600 - listed by Fisher, Sites, Smith, Worley ... Cray-1 - listed by Hill, Patterson, Sites, Smith, Sohi, Wallach (also Bell, sorta) ... IBM S/360 and S/370 - listed by Alpert, Hill, Patterson, Sites (also Bell) ... MIPS - listed by Alpert, Hill, Patterson, Sohi, Worley ... "

" Related Lists

Perspectives from Eckert-Mauchly Award Winners

Bruce Shriver and Bennett Smith asked six Eckert-Maychly Awards winners what were the 5 or 6 most important books or articles that affected the way they approached the central issues of computer architecture and what 5 or 6 books or articles they would recommend for others to read because of likely impact on future architectures. See pp. 52-61 in Shriver and Bennett, The Anatomy of a High-Performance Microprocessor: A Systems Perspective, IEEE Computer Society Press, 1998. The six award winners are: John Cocke, Harvey Cragon, Mike Flynn, Yale Patt, Dan Siewiorek, and Robert Tomasulo.

Processor design pitfalls

Grant Martin and Steve Leibson listed thirteen failed processor design styles in "Beyond the Valley of the Lost Processors: Problems, Fallacies, and Pitfalls in Processor Design." See Chapter 3 inJari Nurmi (ed.), Processor Design: System-On-Chip Computing for ASICs and FPGAs, Springer, 2007.

    Designing a high-level ISA to support a specific language of language domain
    Use of intermediate ISAs to allow a simple machine to emulate its betters
    Stack machines
    Extreme CISC and extreme RISC
    VLIW
    Overly aggressive pipelining
    Unbalanced processor design
    Omitting pipeline interlocks
    Non-power-of-2 data-word widths for general-purpose computing
    Too small an address space
    Memory segmentation
    Multithreading
    Symmetric multiprocessing"

https://news.ycombinator.com/item?id=21988096

 tachyonbeam 2 days ago [-]

I think one of the most influential designs of recent times has been the DEC Alpha lineage of 64-bit RISC processors[1].

1 point by bshanks 0 minutes ago

parent

edit

delete

on: Which Machines Do Computer Architects Admire? (201...

The ones listed by 4 or more people (not including Bell) were:

CDC-6600 and 7600 - listed by Fisher, Sites, Smith, Worley
Cray-1 - listed by Hill, Patterson, Sites, Smith, Sohi, Wallach (also Bell, sorta)
IBM S/360 and S/370 - listed by Alpert, Hill, Patterson, Sites (also Bell)
MIPS - listed by Alpert, Hill, Patterson, Sohi, Worley

Special mention:

6502 - only listed by Wilson, but she was the chief architect of ARM so i think her choice is important to note

---

some notes on Forth implementations:

kazinator 23 hours ago [-]

> Now onto the new instruction:

Adding a special VM instruction for such operations as inheriting from a class is a strikingly bad design decision.

You want to minimize the proliferation of VM instructions as much as possible.

A rule of thumb is: is it unreasonable to compile into a function call? Or else, is it likely to be heavily used in inner loops, requiring the fastest possible dispatch? If no, don't add an instruction for it. Compile it into an ordinary call of a run-time support function.

You're not going to inherit from a class hundreds of millions of times in some hot loop where the application spends 80% of its time; and if someone contrives such a thing, they don't necessarily deserve the language level support for cutting their run-time down.

munificent 22 hours ago [-]

This is a good point. One of the things I have to balance with the book is teaching the optimal way to do something without dragging through reader through a large amount of verbose, grungy code. So sometimes (but not too often) I take a simpler approach even if it's not the best one.

With this VM, we have plenty of opcode space left, so it's natural to just add another instruction for inheritance, even if it does mean that the bytecode dispatch loop doesn't fit in the instruction cache quite as well.

I'll think about this some more. I'm very hesitant to make sweeping changes to the chapter (I want nothing more in life than for the book to be done), but this would be a good place to teach readers this fundamental technique of compiling language constructs to runtime function calls.

---

" The Classical Forth Registers

The classical Forth model has five "virtual registers." These are abstract entities which are used in the primitive operations of Forth. NEXT, ENTER, and EXIT were defined earlier in terms of these abstract registers.

Each of these is one cell wide -- i.e., in a 16-bit Forth, these are 16-bit registers. (There are exceptions to this rule, as you will see later.) These may not all be CPU registers. If your CPU doesn't have enough registers, some of these can be kept in memory. I'll describe them in the order of their importance; i.e., the bottom of this list are the best candidates to be stored in memory.

W is the Working register. It is used for many things, including memory reference, so it should be an address register; i.e., you must be able to fetch and store memory using the contents of W as the address. You also need to be able to do arithmetic on W. (In DTC Forths, you must also be able to jump indirect using W.) W is used by the interpreter in every Forth word. In a CPU having only one register, you would use it for W and keep everything else in memory (and the system would be incredibly slow).

IP is the Interpreter Pointer. This is used by every Forth word (through NEXT, ENTER, or EXIT). IP must be an address register. You also need to be able to increment IP. Subroutine threaded Forths don't need this register.

PSP is the Parameter Stack (or "data stack") Pointer, sometimes called simply SP. I prefer PSP because SP is frequently the name of a CPU register, and they shouldn't be confused. Most CODE words use this. PSP must be a stack pointer, or an address register which can be incremented and decremented. It's also a plus if you can do indexed addressing from PSP.

RSP is the Return Stack Pointer, sometimes called simply RP. This is used by colon definitions in ITC and DTC Forths, and by all words in STC Forths. RSP must be a stack pointer, or an address register which can be incremented and decremented.

If at all possible, put W, IP, PSP, and RSP in registers. The virtual registers that follow can be kept in memory, but there is usually a speed advantage to keeping them in CPU registers.

X is a working register, not considered one of the "classical" Forth registers, even though the classical ITC Forths need it for the second indirection. In ITC you must be able to jump indirect using X. X may also be used by a few CODE words to do arithmetic and such. This is particularly important on processors that cannot use memory as an operand. For example, ADD on a Z80 might be (in pseudo-code)

   POP W   POP X   X+W -> W   PUSH W

Sometimes another working register, Y, is also defined.

UP is the User Pointer, holding the base address of the task's user area. UP is usually added to an offset, and used by high-level Forth code, so it can be just stored somewhere. But if the CPU can do indexed addressing from the UP register, CODE words can more easily and quickly access user variables. If you have a surplus of address registers, use one for UP. Single-task Forths don't need UP.

...

Use of the Hardware Stack

Most CPUs have a stack pointer as part of their hardware, used by interrupts and subroutine calls. How does this map into the Forth registers? Should it be the PSP or the RSP?

The short answer is, it depends. It is said that the PSP is used more than the RSP in ITC and DTC Forths. If your CPU has few address registers, and PUSH and POP are faster than explicit reference, use the hardware stack as the Parameter Stack.

On the other hand, if your CPU is rich in addressing modes -- and allows indexed addressing -- there's a plus in having the PSP as a general-purpose address register. In this case, use the hardware stack as the Return Stack.

Sometimes you do neither! The TMS320C25's hardware stack is only eight cells deep -- all but useless for Forth. So its hardware stack is used only for interrupts, and both PSP and RSP are general-purpose address registers. (ANS Forth specifies a minimum of 32 cells of Parameter Stack and 24 cells of Return Stack; I prefer 64 cells of each.)

...

Top-Of-Stack in Register

Forth's performance can be improved considerably by keeping the top element of the Parameter Stack in a register!

...

If you have at least six cell-size CPU registers, I recommend keeping the TOS in a register. I consider TOS more important than UP to have in register, but less important than W, IP, PSP, and RSP. (TOS in register performs many of the functions of the X register.) It's useful if this register can perform memory addressing. PDP-11s, Z8s, and 68000s are good candidates.

Nine of the 19 IBM PC Forths studied by Guy Kelly [KEL92] keep TOS in register.

...

What about buffering two stack elements in registers? When you keep the top of stack in a register, the total number of operations performed remains essentially the same. A push remains a push, regardless of whether it is before or after the operation you're performing. On the other hand, buffering two stack elements in registers adds a large number of instructions -- a push becomes a push followed by a move. Only dedicated Forth processors like the RTX2000 and fantastically clever optimizing compilers can benefit from buffering two stack elements in registers.

Some examples

Here are the register assignments made by Forths for a number of different CPUs. Try to deduce the design decisions of the authors from this list.

             Figure 5. Register Assignments

            W     IP    PSP   RSP   UP     TOS

8086[1] BX SI SP BP memory memory [LAX84] 8086[2] AX SI SP BP none BX [SER90] 68000 A5 A4 A3 A7=SP A6 memory [CUR86] PDP-11 R2 R4 R5 R6=SP R3 memory [JAM80] 6809 X Y U S memory memory [TAL80] 6502 Zpage Zpage X SP Zpage memory [KUN81] Z80 DE BC SP IX none memory [LOE81] Z8 RR6 RR12 RR14 SP RR10 RR8 [MPE92] 8051 R0,1 R2,3 R4,5 R6,7 fixed memory [PAY90]

[1] F83. [2] Pygmy Forth.

"SP" refers to the hardware stack pointer. "Zpage" refers to values kept in the 6502's memory page zero, which are almost as useful as -- sometimes more useful than -- values kept in registers; e.g., they can be used for memory addressing. "Fixed" means that Payne's 8051 Forth has a single, immovable user area, and UP is a hard-coded constant.

Narrow Registers

Notice anything odd in the previous list? The 6502 Forth -- a 16-bit model -- uses 8-bit stack pointers!

It is possible to make PSP, RSP, and UP smaller than the cell size of the Forth. This is because the stacks and user area are both relatively small areas of memory. Each stack may be as small as 64 cells in length, and the user area rarely exceeds 128 cells. You simply need to ensure that either a) these data areas are confined to a small area of memory, so a short address can be used, or b) the high address bits are provided in some other way, e.g., a memory page select.

In the 6502, the hardware stack is confined to page one of RAM (addresses 01xxh) by the design of the CPU. The 8-bit stack pointer can be used for the Return Stack. The Parameter Stack is kept in page zero of RAM, which can be indirectly accessed by the 8-bit index register X. (Question for the advanced student: why use the 6502's X, and not Y? Hint: look at the addressing modes available.) " -- https://www.bradrodriguez.com/papers/moving1.htm

eForth (a (popular?) portable Forth with 31 primitives):

" The following registers are required for a virtual Forth computer:

Forth Register 8086 Register Function IP SI Interpreter Pointer SP SP Data Stack Pointer RP RP Return Stack Pointer WP AX Word or Work Pointer UP (in memory) User Area Pointer

...

eForth Kernel

System interface: BYE, ?rx, tx!, !io Inner interpreters: doLIT, doLIST, next, ?branch, branch, EXECUTE, EXIT Memory access: ! , @, C!, C@ Return stack: RP@, RP!, R>, R@, R> Data stack: SP@, SP!, DROP, DUP, SWAP, OVER Logic: 0<, AND, OR, XOR Arithmetic: UM+ " -- http://www.exemark.com/FORTH/eForthOverviewv5.pdf

---

here are CamelForth?'s ~70 Primitives:

http://www.bradrodriguez.com/papers/glosslo.txt

design rationale for this choice:

" 1. Fundamental arithmetic, logic, and memory operators are CODE. 2. If a word can't be easily or efficiently written (or written at all) in terms of other Forth words, it should be CODE (e.g., U<, RSHIFT). 3. If a simple word is used frequently, CODE may be worthwhile (e.g., NIP, TUCK). 4. If a word requires fewer bytes when written in CODE, do so (a rule I learned from Charles Curley). 5. If the processor includes instruction support for a word's function, put it in CODE (e.g. CMOVE or SCAN on a Z80 or 8086). 6. If a word juggles many parameters on the stack, but has relatively simple logic, it may be better in CODE, where the parameters can be kept in registers. 7. If the logic or control flow of a word is complex, it's probably better in high-level Forth. " -- http://www.bradrodriguez.com/papers/moving5.htm

               ANS Forth Core wordsThese are required words whose definitions are  specified by the ANS Forth document.

! x a-addr -- store cell in memory + n1/u1 n2/u2 -- n3/u3 add n1+n2 +! n/u a-addr -- add cell to memory

n1/u1 n2/u2 -- n3/u3 subtract n1-n2 < n1 n2 -- flag test n1<n2, signed

x1 x2 -- flag test x1=x2

> n1 n2 -- flag test n1>n2, signed >R x -- R: -- x push to return stack ?DUP x -- 0

@ a-addr -- x fetch cell from memory 0< n -- flag true if TOS negative 0= n/u -- flag return true if TOS=0 1+ n1/u1 -- n2/u2 add 1 to TOS 1- n1/u1 -- n2/u2 subtract 1 from TOS 2* x1 -- x2 arithmetic left shift 2/ x1 -- x2 arithmetic right shift AND x1 x2 -- x3 logical AND CONSTANT n -- define a Forth constant C! c c-addr -- store char in memory C@ c-addr -- c fetch char from memory DROP x -- drop top of stack DUP x -- x x duplicate top of stack EMIT c -- output character to console EXECUTE i*x xt -- j*x execute Forth word 'xt' EXIT -- exit a colon definition FILL c-addr u c -- fill memory with char I -- n R: sys1 sys2 -- sys1 sys2 get the innermost loop index INVERT x1 -- x2 bitwise inversion J -- n R: 4*sys -- 4*sys get the second loop index KEY -- c get character from keyboard LSHIFT x1 u -- x2 logical L shift u places NEGATE x1 -- x2 two's complement OR x1 x2 -- x3 logical OR OVER x1 x2 -- x1 x2 x1 per stack diagram ROT x1 x2 x3 -- x2 x3 x1 per stack diagram RSHIFT x1 u -- x2 logical R shift u places R> -- x R: x -- pop from return stack R@ -- x R: x -- x fetch from rtn stk SWAP x1 x2 -- x2 x1 swap top two items UM* u1 u2 -- ud unsigned 16x16->32 mult. UM/MOD ud u1 -- u2 u3 unsigned 32/16->16 div. UNLOOP -- R: sys1 sys2 -- drop loop parms U< u1 u2 -- flag test u1<n2, unsigned VARIABLE -- define a Forth variable XOR x1 x2 -- x3 logical XOR

x x DUP if nonzero

               ANS Forth ExtensionsThese are optional words whose definitions are specified by the ANS Forth document.

<> x1 x2 -- flag test not equal BYE i*x -- return to CP/M CMOVE c-addr1 c-addr2 u -- move from bottom CMOVE> c-addr1 c-addr2 u -- move from top KEY? -- flag return true if char waiting M+ d1 n -- d2 add single to double NIP x1 x2 -- x2 per stack diagram TUCK x1 x2 -- x2 x1 x2 per stack diagram U> u1 u2 -- flag test u1>u2, unsigned

               Private ExtensionsThese are words which are unique to CamelForth?. Many of these are necessary to implement ANS Forth words, but are not specified by the ANS document. Others are functions I find useful.

(do) n1

u1 n2

u2 -- R: -- sys1 sys2

                             run-time code for DO(loop) R: sys1 sys2 --

sys1 sys2

                           run-time code for LOOP(+loop) n -- R: sys1 sys2 --

sys1 sys2

                          run-time code for +LOOP>< x1 -- x2 swap bytes  ?branch  x -- branch if TOS zero BDOS   DE C -- A                   call CP/M BDOS branch -- branch always lit    -- x         fetch inline literal to stack PC! c p-addr -- output char to port PC@ p-addr -- c           input char from port RP! a-addr -- set return stack pointer RP@ -- a-addr         get return stack pointer SCAN   c-addr1 u1 c -- c-addr2 u2 find matching char SKIP   c-addr1 u1 c -- c-addr2 u2 skip matching chars SP! a-addr -- set data stack pointer SP@ -- a-addr           get data stack pointer S= c-addr1 c-addr2 u -- n      string compare n<0: s1<s2, n=0: s1=s2, n>0: s1>s2 USER   n -- define user variable 'n' "

---

CoolGuySteve? 5 hours ago [–]

Systems programming in any language would benefit immensely from better hardware accelerated bounds checking.

While Intel/AMD have put incredible effort into hyperthreading, virtualization, predictive loading, etc, page faults have more or less stayed the same since the NX bit was introduced.

The world would be a lot safer if we had hardware features like cacheline faults, poison words, bounded MOV instructions, hardware ASLR, and auto-encrypted cachelines.

Per-process hardware encrypted memory would also mitigate spectre and friends as reading another process's memory would become meaningless.

---

incidentally, here's the 'passing arguments' part of the osdev wiki page on syscalls:

https://wiki.osdev.org/System_Calls#Passing_Arguments

---

just to review, since it's been many months since i was able to work on this and i forgot a lot:

the current proposal for the Oot layer cake:

1. Boot/BootX? 2. 'Oot Assembly' or 'BootX?'; Loot 3. OVM 4. Oot Core 5. Pre-Oot 6. Oot

1. Boot/BootX?

Boot: An assembly language which is dead easy to implement. Competes with 'toy' assembly languages. Limited functionality. Sufficient for self-hosting compiler and interpreter. Non-concurrent. Does this support ASCII consoles (STDIN, STDOUT)? Does this support console interactivity, or only batch computation? Are the interop xcall etc instructions in here?

The goal of Boot is to be the easiest-to-implement target assembly language, which can support self-hosting and which can be incrementally extended to BootX?.

I'm leaning towards including interactivity in Boot. This way you can think of Boot as the minimum necessary for bootstrapping a (C64/Apple II BASIC)-like Oot REPL on an embedded system (rather than just bootstrapping a batch compiler, which isn't that useful b/c you may prefer to cross-compile when bootstrapping a port to a new system).

BootX?: Adds instructions and syscalls to Boot so that it can serve as a target language for full Oot implementations. Maybe also directly supports: Everything needed for a TUI (e.g. cursors)? PICO-8 style interactivity?

The goal of Boot is to be the perfect simple portable target assembly language for bootstrapping the Oot toolchain on a new platform, and for general-purpose single-threaded (pure/batch? or interactive?) computation, which can be extended to BootX?. That goal of BootX? is to be the perfect portable target assembly language incrementally extended from Boot.

2. LoVM? (and Loot)

A portable assembly language which is pretty high-level. Easier to read and understand than Boot. More pleasant to write in than Boot. Preserves more high-level semantics so can be more efficiently translated to most target platforms than Boot. Trivially compiles to Boot. Abstraction level of LLVM, C--, and maybe of Pillar and AliceVM? and Hashlink? LLVM-like but easier to implement or port to a new backend, and supports the features that we need for an efficient Oot. Less configurable than LLVM. Many implementations port this layer, because its higher-level nature allows more efficient use of platform primitives than porting Boot. Because it's higher-level than some assembly languages, a code fragment without weird control flow can be efficiently transpiled to a HLL.

Loot ('low level oot') is an HLL that compiles to this layer. Competes with C, Zig, BASIC, Oberon. Oot's garbage collector, scheduler, etc are written in this layer, as are some things that benefit from optimization.

The goal of LoVM? is to be the perfect VM for writing portable language services like garbage collectors, schedulers, and to also be a great target language.

The goal of Loot is to be the perfect simple language for writing portable language services like garbage collectors, schedulers.

3. OVM

OVM is a portable safe virtual machine. Abstraction level of JVM, CLR, BEAM, and maybe of Pillar and AliceVM? and Hashlink? Easier to port than JVM? Definitely easier than CLR, BEAM. Unlike the previous layer, is safe, in the sense that buggy code will cause an OVM panic or a single-thread hang, not a lower-level crash or a hang that affects other threads. Has a sandboxed mode for untrusted code. Code fragments can be efficiently and readably transpiled to HLLs (provided they don't use features of OVM that aren't compatible with the target HLL). More mature ports of Oot port this layer, so that they can use platform-provided services (such as GC and scheduling) rather than using Oot's implementation of these on top of a platform that already provides them.

Has garbage collection, scheduling, bounds-checking, tail-recursion. Restricted control flow (no GOTOs)? Has a HLL that compiles to this layer?

The goal of OVM is be a simple safe portable VM that meets Oot's needs.

4. Oot Core A small core Oot that has 'Oot-like' semantics. More explicit than Oot; probably too verbose for easy human reading and writing. May have a simpler, incompatible syntax than Oot but semantics-wise, is a subset of Oot. Even more mature ports of Oot port this layer, so that they can implement or transpile Oot core code in the most efficient and readable way for each platform.

Abstraction level of Haskell Core, or Scheme.

5. Pre-Oot A subset of Oot out of which the rest of Oot is built using metaprogramming.

6. Oot Oot proper. Has at least three 'profiles':

nostdlib (e.g. doesn't support a filesystem; good for expressing sandboxed pure computation in a stripped-down way)
is this what can target Boot?
small (has a small stdlib with basic I/O primitives but not stuff like Unicode that creates large staticly-linked binaries)
is this what can target Boot?
std

So compilation/interpretation in the reference implementation goes:

parse Oot
Oot lowered to Pre-oot (apply metaprogramming)
Pre-oot compiled to Oot Core (perform desugaring and inference)
Oot Core compiled to OVM (general compilation)
OVM compiled to LoOVM? (not sure what this looks like yet, hopefully pretty simple/quick)
LoOVM? compiled to BootX? (mostly like macroassembly i guess; apply macros, expand addressing modes to multiple instructions, convert stack instructions with runtime facilities, inline implementation of various complex instructions, recompute branch targets, etc. So probably very quick and local/parallelizable except for nonrelative branch targets)

A mature implementation on a specific platform might do something like:

parse Oot
Oot compiled directly to Oot Core (skipping Pre-oot)
Oot Core compiled directly to native platform (skipping OVM, LoVM?, BootX?)

Note that even near the bottom levels (e.g. LoVM?) you might be able to get decent transpilation into a HLL for simple algorithms, which serves the design goal of being able to use Oot as an embedded 'scripting language' within various HLL platforms with as little porting work as possible.

Note that compared to the previous proposals, i think i've added BootX? back in, and pushed even more of the unusual or complicated or even scary-looking stuff (eg atomics) out of Boot. Smallstacks and addressing modes and compact encodings remain pushed up into LoVM?. Now i'm also pushing all concurrency, and all 'optional instructions', into BootX?.

You can kind of divide this into 3 macro-layers:

Portability layers: Boot/BootX?, LoVM?(Loot). These are non-Oot-specific technologies for implementing portable programming languages. Why not just LLVM? LLVM isn't easily portable enough (remember, we want to port Oot into embeddings within a bunch of HLLs so it's gotta be real easy to port).
OVM: This layer contains all of the 'messy stuff' like garbage collection, scheduling.
Oot layers: Oot Core, Pre-Oot, Oot. These layers contain Oot syntax and semantics.

The layer most likely to be a boondoggle is LoVM?. Boot is constrained to be as simple/easy to implement as possible (given the constraint that BootX? is incrementally added onto Boot and BootX? must be a reasonably good target platform), BootX? is only adding necessary instructions to be a complete target platform. LoVM?'s motivation has a bit of 'i want an awesome assembly language', and all of my probably-stupid 'innovations' such as the stacks, as well as some ill-defined aspirations (making a higher-level assembly language for better code density and transpilation) (as well as just for ease of reading/writing; which is silly because there's really no need to read/write this, you'd read/write the corresponding HLL, Loot). OVM may be a similar danger, although at least there we're going to be somewhat driven by Oot's needs.

---

i just realized something great:

i have been avoiding putting instruction data in the instruction stream because that makes it hard to parse the instruction stream in parallel. However, this is less of a problem for instruction data for JMP (not branch) instructions, because in that case, you can mis-parse the data as instructions and it doesn't matter because you aren't going to execute past the JMP in any case.

I guess it's still a little tiny bit of a problem because the JMP data could itself contain a JMP; which makes it hard to know, without prior knowledge of which instructions are real, whether the next instruction after the data (which in reality is an actual instruction) is itself jump data (for the false jump in the previous jump's data) or a real instruction.

At least, if the instructions are all 32 bits and the jump data are multiples of 32 bits, you don't have the problem of splitting the instructions in the wrong place. So this one tiny problem might make you think that some data is actually instructions, but it won't make you confused about decoding the instructions. And, any binary could arrange to put data in the middle of the binary in a place that is unreachable, so thinking some data is actually instructions is unavoidable (if binaries do that).

And, not only can you decode, but you can also do speculative execution easily.

If a JMP had 24 bits of data within itself plus 32 bits of data in the next slot, that's 56 bits in total, which is 16 petabytes. Which i think is enough (and much more than the 16 MB provided by 24 bits of data).

Otoh it will make arithmetic about figuring jump offsets slightly annoying if different instructions have different total sizes, rather than each one being 32 bits. Also, the compiler/interpreter data structures now have to allocate at most 64 bits to store each instruction in total, rather than 32 bits. So there's still some reasons not to do this, if ultimate simplicity is desired. Otoh having to store jump labels (a form of constant table) in the binary is complex also, and is possibly worse for performance.

Hmm but didn't i have some reason involving compilation targets other than assembly why jk (jump table jmp) is preferred? Why full indirect jumps ae bad? No, i just looked that up, it's in [[ootAssemblyNotes24?]], the issue was with jumping to DYNAMICALLY COMPUTED addresses; the issue is that if you try to compile code with dynamically computed addresses to a target with different sizes for (the compilation of) different instructions, then since the original program didn't know about those different sizes, it's going to be hard for the the compiler to, in general, compute the transformation from the original dynamically computed jump target to its image in the transformed program; the compiler would basically have to embed into the compiled binary a complete map from each location in the source program to where that ends up in the compiled program, which of course bloats the binary a ton. But we don't have that problem here because here we have static addresses, not dynamically computed ones; this problem had nothing to do with needing to force a jump table/labels on everything, it just had to do with dynamic/static computation of jump targets.

so i think we should probably do this. It allows executable size to reach 16 petabytes instead of 16 MB, and gets rid of the requirement to have jump tables/labels at a special place in the binary. To really realize that simplicity, though, we must get rid of constant tables as well and have only inline constants. Which means that either we only allow small constants, or we allow them in the data stream. This doesn't have the same 'but we'll never reach that place in the instruction stream' property as JMPs, however, though. But one thing we could do with inline constants is force them to have a special opcode, taking advantage of the fact that we already have 24 bits of data in the leading constant instruction. So it would take 3 slots, 3x32 bits to get 24*3 = 72 bits of data, which is what we'd need for a 64-bit constant. That's fine because even without a special prefix, two slots only gives us 24+32 bits, so 3 slots is needed in this case regardless. If we did the same with JMP instructions, then 2-slot JMPs can only have 48 bits of data instead of 54. So instead of 16 petabyte executables, we would only get 256 terabytes. That sounds just fine to me! So let's do that. The data can have a variant of the ANN prefix, perhaps.

Another benefit of allowing this reduces the pressure for non-uniform instruction types.

We can bring back constant tables in LoVM?.

later: you know, there are also some benefits of just having large constants, and absolute jump targets, in their own bytes e.g. not taking advantage of the 24 data bits in the instruction at all. Specifically, in many platforms this allows the implementation to just use fast, ordinary methods for reading the constants out of the instruction stream; and in many platforms, specifically those where jump targets are ultimately translated into ordinary native 64-bit pointers, it allows a linker to just translate 64-bit jump targets into 64-bit pointers (but we'll also have 32-bit jump targets which won't have the latter property). So it makes the implementation more simple in this way, and less simple in terms of making it easy to definitively see, in every case, if you are just looking at some random place in the middle of the instruction stream, where data ends and instructions begin. And for the latter advantage, it still isn't total, because programs could still put data in unreachable areas of the instruction stream. My guess is that the simplicity of having the constants and jump targets freestanding is more valuable than the simplicity of being 100% certain in every case where instructions might begin. So forget about forcing constants and jump targets to have the ANN prefix.

If we do it this way, we free up those 24 data bits inside the longjump32 and longjump64 instructions. Should we use that to make them long BRANCH instructions instead? Or, should we leave this as implementation-dependent free space? I'm leaning towards the latter; one could imagine, for example, that JIT interpreter/compilers could use this to count how many times the jump is taken and to note whether or not the target has been compiled yet, and that ordinary compilers could use these to identify the jump target via a 'label' which would be an index into a table of at most 2^24 jump targets which is being maintained outside of the instruction stream (we could even make this 2^24 unique longjump target thing a standardized limitation of Oot; that's a limit of 16 MB count of jump targets; if each jump target corresponds to a 64-bit pointer, then it would take 16 MB x 8 = 128 MB bytes to store this table).

For example, the gcc executable on my system, /usr/bin/x86_64-linux-gnu-gcc-9, is currently 1.2M in size. The Python executable, /usr/bin/python3.8, is 5.3M in size.

There are game executables that are larger than 128MB (e.g. https://www.reddit.com/r/pcgaming/comments/3ta7ur/whats_the_largest_game_exe_youve_ever_seen/ ), but i bet a lot of that size is embedded data, rather than code with unique jumps.

In LoVM? we could add instructions that refer to module jump tables and only provide the index into that table, and don't have following data.

---

i forgot why we had separate dual everything (which is why i was thinking of naming it 'bicycle') for integers and pointers?

was it so that we could standardize the integer bitwidth while leaving pointer bitwidth implementation-dependent?

i remember now; it's not just bitwith, it's because ptrs need to be totally opaque. Imagine an implementation on top of and interoperating with a HLL like Python; you might need to place references to arbitrary Python objects in pointers; in which case a pointer is not just an integer with a special interpretation, it's a wholly different, opaque object type.

it also has the bonus advantage that, since most instructions naturally specify a type int/ptr of each operand, you can have twice as many total registers given the number of bits in each operand. Not so useful when operands are each 8 bits but very useful when they are only 4 bits (here you can have 16 int registers and also 16 ptr registers, instead of just 16 general-purpose registers).

---

i'm considering adding another layers which is an 'intermediate language optimization layer', or maybe we should make Oot Core that, or make OVM that. This layer has no associated HLL (i guess if you wrote it out it would look like a lisp s-expr, or something similarly 'no extraneous syntax'-y). A CFG and/or CPS or SSA or STG or ANF or something like that.

it would also be cool if sometimes that IL-for-optimization layer could 'skip layers', that is, some instances of pre-oot could be compiled directly to it, skipping the usual step in between, and also some instances of it could be compiled directly to LoVM?, skipping OVM, or even compiled directly to BootX?. Alternately, maybe it could not skip LoVM?, but it could skip the layers above it, all the way up to Oot (or maybe just up to pre-oot, if there are other layers in between it and pre-oot).

another thing to note: some of these layers also have associated HLLs, such as LoVM? and Loot. If we implement OVM in Loot but then ultimately implement everything in Oot, then we won't have any Loot implementation any longer. So what we should actually do is have an alternate compilation pipeline that goes through Loot instead of LoVM?. This also makes sense because we wanted to be able to transpile to another HLL at the end anyhow. The same goes for OVM; there should be a (Lua-like? C#-like? OCaml-like?) HLL corresponding to OVM. Maybe we'll have a third layer like that too ('like that' meaning with both a 'bytecode' representation and also a corresponding HLL, and alternate compilation paths for transpilation (using the HLL) or targeting machine-readability only (using the bytecode)); it could go on the other side of the optimization layer.

I'm not sure if Boot counts as the same sort of thing; the Boot 'HLL' is just assembly code, which isn't very high level. We could make an awesome macroassembly for Boot, but currently i was thinking of saving that for the LoVM? layer. Otoh, Loot is much more than a macroassembler so maybe it belongs here.

If one of the layers is on the other side of the optimization layer, and the Boot layer does count, then i guess the optimization layer goes below OVM. Otherwise, if there is going to be another layer, then the optimization layer goes above OVM.

To summarize this section:

we probably need an optimization-oriented IL other than Oot Core at some point. GNU has GIMPLE, Haskell has STG, other languages have CFGs possibly with SSA or CPS or ANF
where there is an HLL associated with a bytecode-ish IL (like Loot and LoVM?), transpilers will probably target the HLL not the IL, and in cases where two adjacent compilation passes are like this, the bytecode-ish IL may get skipped totally. If done well, this assists in bootstrapping, because it means that we can write, for example, Oot code and transpile it to Loot, which means that we implement OVM in Oot yet have this transpiled to a Loot implementation of OVM, which is nice for bootstrapping (and also useful as a guide for someone looking to reimplement OVM on another platform in a deeper way).
we may want an associated HLL for compilation stages other than LoVM?

---

ptrace(2) - Linux manual page - man7.org www.man7.org › man-pages › man2 › ptrace.2.html The ptrace() system call provides a means by which one process (the "tracer") may observe and control the execution of another process (the "tracee"),

PTRACE_SINGLESTEP

---

previously, i think i had merged BootX? into Boot, and had a bunch of 'optional instructions' in Boot, but now, i feel that that will make the Boot reference document too scary, so, as per the above plan, i'm going to split off BootX? again. i'm going to update boot_reference.md to remove the BootX? instructions, or at least to instruct the implementor to regard them as stubs that either return immediately or alias to other Boot instructions.

are the instruction opcode constant lists in the pyboot implementation currently more up to date than boot_reference.md? maybe

note that there is also boot_reference.adoc, which has the memory order / memory consistency model, and bootReferenceOld191229.txt, which is also old, and also has some old todos. and boot_extended_reference.adoc. There's a much older version on ootAssemblyNotes16.txt.

a few instructions that were taken out are in ootAssemblyNotes23.txt. Some other text that was rewritten (probably dont need to reread) is in ootAssemblyNotes26.txt.

cp ~/prog/oot/boot/doc/boot_reference.md ~/oot/bootReferenceOld200622.txt cp ~/prog/oot/boot/doc/boot_reference.adoc ~/oot/bootReferenceAdocOld200622.txt cp ~/prog/oot/boot/doc/boot_extended_reference.adoc ~/oot/bootExtendedReferenceOld200622.txt cp ~/prog/oot/boot/doc/boot_design.adoc ~/oot/bootDesignOld200622.txt

cp -r ~/prog/oot/boot ~/prog/oot/boot_old_200622

as of now i havent actually changed boot_reference.md, i just made those copies so that it's ready to be changed.

---

jy? lpc? or load jrel ptr? (CONSTANT offset; because it's constant, there is no need for the runtime to understand how boot instructions mapped to compiled instructions; and because it's relative, it can also do lpc when the offset is zero; )

how does the jvm handle callbacks, esp. callbacks passed in from native code? how do harvard architectures handle callbacks? do they have indirect jumps? how does LLVM deal with this stuff? ok i looked that stuff up (i might be wrong about some of this tho):

wondering how jy and lpc would be implemented in a JVM implementation.

lpc in Boot would be used to create callbacks, so how are those done in the JVM? in JNI?

...(i went and looked this up)...looks like in JNI, from C you call registerNatives to register your C functions as Java methods, then in Java you call them as methods.

There are JVM instructions for calling methods:

https://www.jrebel.com/blog/using-objects-and-calling-methods-in-java-bytecode

so, in C code interfacing with JVM via JNI, the way you get a function pointer, and pass it as a callback, call call the callback: is to just get the function pointer in C normally, then use registerNatives to register it as a JVM method, and then to use something like invokeVirtual (invokeDynamic?) to call the method.

so, i guess that in Java, you would do 'lpc' by getting the 'this' pointer, and then you'd pass that like any other value, and then you'd call it by calling some predetermined method on 'this'; OR you'd use Method References along with 'this' to capture a reference to a particular method. This wont' get you to the exact line where lpc is called however, only to the beginning of a method.

i guess that means that it would be very hard to express lpc in the JVM without a full interpreter runtime layer. an 'lpc-JREL' could be expressed if and only if it only captured function entry points. jy however sounds necessary to be able to implement JVM's invokedynamic/invokevirtual etc, and actually i think an lpc-JREL is necessary too in order to implement method references (without a full interpreter runtime layer).

harvard architectures just mean that code and data are separate address spaces (and maybe code can't be written to), it doesn't mean there are no indirect jumps. For example, AVR has indirect jumps.

on AVR, a harvard (or is it modified harvard?) architecture, there is no 'LPC' instruction to directly load the PC, but you can do an RCALL (relative call) to the next instruction, and then pop the stack to get the return address, which effectively accomplishes the same thing.

LLVM's %indirectbr instruction is only for local (within the same function) jumps, but its %call instruction can take an arbitrary (e.g. externally-given, i believe) function pointer. LLVM allows you to capture a pointer to some location in LLVM code by using an llvm::object::SymbolRef?

i'm not sure how indirect jumps/calls to arbitrary function pointers are represented in CFGs?

so yeah all of these things seem to allow both:

some way of passing callbacks, at least to predefined functions, to external code
some way of jumping to an arbitrary function passed in from external code

---

so i think we do need both

jy (indirect jump to arbitrary code pointers)
a way to load a pointer to locations in Boot code i think the best way is 'load immediate code pointer', which embeds a 32-bit absolute in the instruction stream, which indicates the i-th instruction in the code stream, starting from the beginning

---

in Boot, programs can be assumed to be only 2^32 long; longer ones can only be permitted in BootX?

---

yknow, you could still have 'constant tables' without any specific binary structure just by having the 'constant number' be the location of the constant within the instruction stream... but that kind isn't very helpful because you can't have arbitrarily large native ptrs in there

---

otoh, as much as jy and lci are needed for a full system, they are going to cause some complications on some targets, and i'm not sure that they are really needed for the self-hosting interpreter. Are they?

And what about xcall, xentry, etc?

i guess the question is, what's the point of the self-hosting interpreter in the Boot stage? If we are leaving out floating-point ops, then it's clearly not so that we can run all normal Oot programs. Potential applications of a Boot-only system include:

run all normal Oot programs, but not efficiently If we eventually implement floating-point in software, this could happen. Otherwise, need floating point.
run Oot programs on embedded systems probably do need stuff like jy, xcall, xentry
run an Oot interpreter but for what purpose? As an integer calculator? To play "guess the number"? Without interaction with the external system, can't do much.
to hack together an Oot toolchain for a new system Now this one sounds quite likely, but it definitely requires external system interaction.
to compile/execute simple pure code very efficiently on the target platform This one may require no jy on some targets.

So it seems that the really necessary uses of Boot (without BootX?) are external system interaction. Which suggests that maybe xcall, xlibcall, xentry, xret are more important than jy and lci. lci could still be useful to create callbacks to be passed, but we could require that it targets an xentry (in Boot; it could be more general in BootX?). Then instead of a general jy we could reuse xcall. This allows us to impose structural constraints like 'every code path starting in xentry ends in an xret, with no more xentrys in between' which would ensure that the xcall stuff can be compiled into ordinary structured programming language targets.

so this route would see Boot not only having less instructions than BootX?, but also more constrained in control flow, which would allow it to be more easily compiled to more targets.

It also suggests that leaving out the I/O functionality is not the way to go, as it impedes the really key use case of hacking together a new toolchain on a new platform.

We can have a 'pure profile'. Also a 'user code profile' which uses less registers.

---

hmm so no jy = no computed goto! terrible for interpreter perf.

but we said that Boot omits some high-perf things, so that it can be easy to implement everywhere, so i guess that's fair (although annoying).

also, we still have lentry (was: lci, but only for xentry labeled things), isn't that hard to implement (without a full interpreter loop)? Only on languages that don't have function/method references, or any other sort of high-level indirection. Even pre-8 Java could implement each xentry as its own class, with a distinguished method 'call'.

but yes, on some especially primitive languages, lentry will force the whole thing to be interpreted. Even Oberon-2 has procedure types.

supporting jy and arbitrary labels for callbacks is much harder because many language have some sort of first-class functions or high-level indirection but do not have any way to refer to individual lines of code. This would lead to each line of code being its own function, which is very inefficient, or to giving up on compiliation and only implementing a Boot interpreter in these languages.

https://en.wikipedia.org/wiki/First-class_function#Language_support shows that a broad variety of languages support this.

otoh, we already have (relative) branches in code, which means that it's already pretty easy to construct something which is hard to transpile to an HLL without either an interpreter, or putting each line into a separate function.

---

can conditional moves ever be worse than branches? yes:

https://kristerw.blogspot.com/2017/03/the-cost-of-conditional-moves-and.html

---

right now we don't have indirect control flow, except when the target is an xcall. I believe that this means that computed goto is not possible.
i'm having second thoughts about no indirect control, because if you were going to try to compile Boot to a HLL, even though indirect control flow can only target xentrys (meaning that function pointers can be used, which most HLLs support in one way or another), you still have branches and jumps targeting arbitrary locations in code, so doesn't this already force you to compute basic blocks within each function and compile it to a big switch statement for the jumps between the basic blocks, at which point you are only a little better than an interpreter (you ARE still better though, because each basic block would execute without dispatch overhead until it reaches the end of the basic block) -- you are not using the HLL's loop constructs?
well, i'm not sure that you can only compile to a switch statement and can't use the HLL's loop constructs. First, you could try to recognize loops within the Boot code. Second, i think that possibly you can represent this arbitrary-except-non-indirect control flow by a bunch of nested while loops, one for each backwards branch/jump target (forward jumps use the simpler idea of enclosing blocks of code in 'if' statements that aren't triggered until the instruction counter reaches that part of the code).
this idea of nested loops reminds me of WASM's weird control flow. Maybe they were onto something there...
ok i looked that up, apparently WASM's 'relooper' algorithm is described here: https://github.com/kripken/emscripten/raw/master/docs/paper.pdf and according to [1] it's just what i was afraid of: "a greedy algorithm that tries to find some common patterns in the CFG, and falls back to our naive approach with the label variable when it can’t match any pattern." where the naive approach with label is "According to the structured program theorem, we only need a loop and a branching construct in order to solve the problem (so we don’t even need break or continue statements!), but at the cost of using an additional helper variable (we will call it label) that encodes the control flow state.".

and that same link mentioned a purportedly better approach, that they call Stackifier, that applies apparently an old algorithm when the CFG is 'reducible', and otherwise uses a dispatcher (which is still like falling back to the naive approach). A reducible CFG is defined by https://en.wikipedia.org/wiki/Control-flow_graph#Reducibility as one such that:

"Forward edges form a directed acyclic graph with all nodes reachable from the entry node."
"For all back edges (A, B), node B dominates node A." where "A block M dominates a block N if every path from the entry that reaches block N has to pass through block M."

https://medium.com/leaningtech/solving-the-structured-control-flow-problem-once-and-for-all-5123117b1ee2 explains that this means "all loops need to be single entry".

In other words, the key constraint is that you can't jump into the middle of a loop from outside of that loop, where a loop is something that begins at the target of a backwards jump.

we could consider adding jy but maintaining the lentr restriction. No, we already have xcall so we're already good there (except for the performance penalty of function entry and of needlessly switching between the Boot and platform calling conventions); but this doesn't help us with while loops internal to a function (xentr segment).
also, what about JVM? ok i just looked up JVM, and jvm DOES have arbitrary branches and jumps (presumably they don't jump between functions). But it doesn't have indirect jumps. So, the way we have things now, it would be simple to efficiently compile to JVM, but difficult if we added jy.

so to summarize the above:

right now, we allow arbitrary branches and jumps but only within the current 'function' (xentr segment (an xentr is an externally-reachable entry point))
right now, we can't do computed goto, so we can't efficiently implement interpreter loops (on some platforms; on others, computed goto is less important for efficiency, i think i heard)
arbitrary intra-function branches and jumps already make it hard to compile to a HLL
when the CFG is 'reducible' it is apparently possible to compile to an HLL using its structured control flow constructs (that is, without an inefficient 'interpreter loop')
however we currently have no requirement that the CFG be reducible
one alternative is to disallow branches and jumps and only provide structured control flow constructs. This is what WASM does. But this makes it more complex to compile arbitrary HLLs to Boot.
another alternative is to add indirect jumps. This makes computed goto possible, but prevents simple compilation to JVM
we are planning to add indirect jumps, jy, in BootX?
what we currently have seems perfect for easy compilation to JVM -- like us, JVM has intrafunction branches and jumps, but disallows indirect jumps
what we currently have does not permit the source language to have computed goto, and hence does not permit efficient interpreter loops
what we currently have is simple for other source languages to compile into (unless they have computed goto, in which case we'd have to compute CFGs for them)

In summary, the status quo (local arbitrary direct branches and jumps permitted, indirection via function pointers permitted, local indirect jumps prohibited) has the following consequences:

it's easy to compile Boot to other platforms, including JVM
it's possible but difficult to transpile Boot to HLLs (without just compiling it to an interpreter)
it's easy to compile other HLLs to Boot, if they don't have computed goto
it's difficult to compile other HLLs to Boot if they have computed goto
interpreter loops can be implemented in Boot (using switch constructs) but not efficiently (because no computed goto)

So i think i'll stick with the status quo, because:

there are 3 things that are missing:
computed goto functionality
easy transpilation of Boot to HLLs
efficient interpreter loops
the lack of computed goto functionality bothers me, but it's not present in many HLLs anyways
good transpilation should occur at a higher layer anyways, it's a not a design goal of Boot
efficency is not a design goal of Boot

The lack of computed goto may be okay because having Boot support everything needed for full 'desktop' Oot is not a design goal of Boot -- after all, it's also missing floating point. BootX? is required for all standard Oot functionality to be unlocked.

The lack of computed goto is bothersome because it means we have to write interpreter loops in Oot twice; once without computed goto (for bootstrapping with Boot), and once with. Ultimately though, that's just manageable cruft, and less complex than, say, implementing Boot on JVM if it had indirect branches. Maybe LoVM? should provide instructions specifically abstracting over interpreter dispatch.

For the Boot implementation, aside from the aforementioned cruft/duplication of code, the main effect of lacking computed goto is reduced efficiency. But that's not a design goal; efficient execution isn't a goal until BootX?.

To summarize that decision:

for Boot, stick with: local arbitrary direct branches and jumps permitted, indirection via function pointers permitted, local indirect jumps prohibited
the only big loss is lack of computed goto functionality
there is an efficiency loss, but that's not a design goal. There is a complexity penalty in the self-hosting bootstrap Oot toolchain, but that's managable, and more desirable than making Boot harder to port to JVM-like environments without adding another layer of interpreter dispatch

So, this choice maximizes the ease of implementing a simple compiler from Boot to other platforms, while keeping the Boot implementation easy to compile to in most cases, at the expense of loss of efficiency and difficulty of transpilation. This is in line with the design goals of Boot.

---

okay over about one weekend's worth of time i sort of implemented a Boot VM from scratch (i say sort of because i didn't really test it more than on one tiny program, so it's not really done). So, not quite at my target of being a one-weekend project for others, but close enough.

some things i learned during the process:

the biggest things i realized are about control flow, especially indirect control flow:
if we don't have any way to take the PC, then we need CALL/RET instructions
(see below; this could be wrong) i realized that if we really want this to be easy to compile to JVM without a runtime interpreter loop, we can't have indirect control flow at the sub-function level. This kind of makes it 'not assembly-like'. On the other hand, it's not that bad, because most of the time you can just associate (in your program/in your mind) an integer value with each place you want to jump to, and then do a 'switch'-like construct to get to them using ordinary conditional branches.
otoh the fact that you can do this anyways is an argument FOR allowing indirect control flow
first-class functions are a form of indirect control flow, with the limitation that it only applies at the procedural, not the intraprocedural, level, and something like this is much more widely supported among HLLs than i had thought. xcall can call first-class functions, because it takes the return address in a register
if you are going to compile to something like the JVM, you can't allow interprocedural jumping (where 'procedure' is whatever grouping will be compiled to separate methods/functions in the target language), because e.g. JVM doesn't allow it, because they want to (easily) verify at load-time(i think) that the typing of things on the stack is correct at all times.
i realized there was no way to blindly copy memory, so i added the anytype register bank
i realized that in-instruction-stream data payloads aren't actually that bad, because even without them JMPs could already hop over stuff in the instruction stream anyways and then so the program author could put data in there (although they can't read that data using Boot because codepointers are opaque). However i removed them anyways for implementation simplicity; they can come back in BootX?
i removed the switch statement, because it requires an in-instruction-stream data payload and it can be replicated with an if-then chain (which can be accomplished via conditional branching)
i specified the undefined-value-ness of mixing arithmetic bitwidths when reading/writing
code pointers, although opaque like anytype in a way that data pointers aren't, still aren't anytypes because you read/write them in chunks of multiple bytes, whereas anytypes are read/written in units of single bytes (hey, i should rename anytype to byte)
i unified a few of the register-based calling conventions i had come up with earlier
we dont really need inp, outp

otoh, i just realized that the fact that you can do intraprocedure indirect branching anyways is an argument FOR allowing an indirect branch instruction, because in the JVM it can be compiled using a switch construct as above, without a runtime interpreter loop ('runloop', i guess that's what those Perl guys that i read so long ago were talking about). I guess the real key there is prohibiting INTERprocedure indirect branching; the literals being passed around (that represent code locations) can be passed to external procedures and back, but they must never be USED except in their procedure of origin. Also, there must not be arithmetic on code pointers, which we already disallow. Hmm, do any of the computed goto-using dispatch methods do arithmetic on code pointers? I doubt it, because computed goto is something that you do in (some dialects of) C. also, on3h (on the third hand; i just coined that, but i mean a rebuttal to an otoh), in order to do this, the implementation needs to be able to statically know all possible targets of the indirect jump. Which means that, as with 'lentry' and xcall (which is a first-class-function kind of thing), you need to make it clear (by having a special instruction type, perhaps) where all the places are that are creating a code pointer, you need to have the value of the code pointer being created be an immediate constant, not something computed at runtime, and you need to have some concept of module scope such that the compiler can know what are all of the possibilities that it must account for. But... on4h, we have to do something like this anyways for 'lentry/xcall', and in fact we already have the groupings of 'xentr-subroutines' that cannot be jumped between. I guess we would have to specify one extra thing, that any possible target must be available at compile-time or load-time, but that doesn't sound like a huge deal. We could rename lentr back to lcm, like it used to be (well i guess it used to be lci, that was before we decided to use 'm' instead of 'i' to indicate 'immediate'), because after all valid 'lentr's can still be distinguished by a compiler because the target location has an 'xentr'.

in summary, i'm really excited that this time (unlike last time), it seems to be as simple to implement as i wanted. I think i can push this to github now, or soon. After that, there are still some important questions to resolve before moving on; most pressingly, are we going to allow indirect control flow, or just a very restricted CALL/RET, and if the latter, how (i'm leaning towards just CALL/RET, but not sure about some of the details). After we resolve that stuff, though, i just have to write a compiler to JVM and a compiler to LLVM and if i can do that, i think there will be good evidence that this is good enough, and then i can move on to lovm, and work on tests for Boot simultaneously.

---

next steps:

read through the old pyboot, absorb what's needed, replace it with pyboot 2, save the old pyboot in an 'old' folder somewhere
push to github
decide if we will allow irreducible control flow
decide if we will separate codeptr from ptrs
decide if we will have CALL and RET and TAIL, and if so, decide if we will define CALL-subroutines and restrict jumps between them; if not, decide how and when to make the PC accessible, and if so, decide if we need an ENTR instruction to mark the targets of CALLs, and also a TAIL instruction for tail calls, and also if the stack will be opaque or accessible
decide if we will rename the anytype registers to byte registers
an argument AGAINST is so that mb they can hold other HLL constructs too..
someday:
compiler to jvm (a compiler to Python, without a runloop, would require making a CFG, which is more work than we want to do at this early stage)
compiler to llvm
at that point we'll have 3 (toy?) implementations, 2 compilers and 1 interpreter. At that point the spec (or at least, the reference document, which isn't as formal as a spec) is probably good enuf, at least until we can more feedback from using it as a target language
should we write a backend to llvm to use this as a TARGET language? this is really more of a design goal for BootX?, but maybe it can be done for Boot, i'm not sure (does LLVM allow it's backends to just say "i don't support ___", where ___ is floating-point, etc?)
then write tests (remembering to copy test ideas from the old pyboot)
now we can work on LoVM? and Lo/Loot (actually i'll probably be writing down ideas for LoVM? the whole time) (and create BootX? incrementally as needed for lovm, after deleting the old draft and starting with the principle that all BootX? does is add instructions to Boot, and loosen specified limitations from Boot).

---

hmm.. the clean thing to do is separate codeptrs from ptrs. but this adds state! i don't like having too much data in each process, because i want zillions of processes.

otoh, we are still small compared to Erlang BEAM: "A newly spawned Erlang process uses 309 words of memory in the non-SMP emulator without HiPE? support. (SMP support and HiPE? support both add to this size.)...The size includes 233 words for the heap area (which includes the stack)." -- http://erlang.org/doc/efficiency_guide/processes.html

309 - 233 = 76

So far we have 16 int registers and 16 pointer registers and the PC, so 33. If we added 16 codeptr registers and 16 byte registers (assuming 32-bit words), that's another 20 words, so a total of 53. Later on in lovm we'll have a SMALLSTACK with another 16, so 69, so we'd have 7 words left to store process control block stuff (which isn't much space for that, we'd probably want more). So, yeah, we'd be pushing it a bit here, but nothing crazy. But i don't like how now we have no chance of staying in registrers even on 32 register RISC machines.

if anytype regs are really anytype, not bytes, then that's another 12 (b/c we had given them 4 words, instead of 16, above, by assuming each one was only a byte and that we had 32-bit words), so we're already slightly over.

---

"Some architectures, such as GPUs, do not support jumping to an arbitrary program location and implement branching using masked execution and loop using special instructions around the loop body. In order to avoid CFG modifications that introduce irreducible control flow not handled by such hardware, a target must call setRequiresStructuredCFG(true) when being initialized." -- https://llvm.org/docs/WritingAnLLVMBackend.html

---

one important choice is whether Boot will permit irreducible control flow (irreducible CFGs) ( https://en.wikipedia.org/wiki/Control-flow_graph#Reducibility ). I've come to believe that this may be the 'real' issue with allowing indirect control, that is, the thing that would make it difficult to compile to e.g. JVM or GPUs (other than 'compiling' to an interpreter runtime).

However, note that even with only DIRECT jumps, only within a function, you can have irreducible control flow, so i may be focusing on the wrong thing here.

There are two questions here. 1) What do we lose by prohibiting irreducibility? Is it just a little efficiency, like with computed goto? Or is do we lose essential expressivity, e.g. something like exceptions? 2) How do we prohibit irreducibility? I'm looking for something easier to understand than the above-linked definition. One thing i've heard is a prohibition on multiple entry points into a loop, if that is synonymous, that's good.

for (1) i've asked a question:

https://cs.stackexchange.com/questions/127891/uses-of-irreducible-control-flow

---

oo this might be one important use:

 jasode on July 3, 2015 [–]

>In C++ labeled breaks would make 'goto' completely obsolete.

In 20 years of using C/C++, I've found one use for "goto" that's hard to substitute: simulating coroutines that yield to an outer context. (Similar to C# yield return).

A "break label" gets you out of a loop. I needed to use "goto" to jump back into the middle of a loop to resume where the "coroutine" previously left off. The keywords "break/setjmp/longjmp" wouldn't have been substitutes for this particular use case.

-- https://news.ycombinator.com/item?id=9827594

---

on uses of GOTO:

https://news.ycombinator.com/item?id=9827437 (already read) https://news.ycombinator.com/item?id=9830174 (already read) https://news.ycombinator.com/item?id=9829816 (already read) https://news.ycombinator.com/item?id=9829476 (already read) https://news.ycombinator.com/item?id=9828914 (already read)

adynatos on July 4, 2015

parent

favorite

on: Things Rust shipped without

C's goto has a bad rep it only partially deserves, however I can't agree that it should be in every language. I use it fairly frequently, but virtually all the uses are about cleanup after error. I like that it requires unique label for identification, as it allows for language-enforced documentation. But my gotos always go "down" the source code, never "up. Through goto I can basically setup an error-condition code flow for every function, with multiple conditional entry points into that flow and if there was another language feature that allowed just this particular usage in cleanup, without multiple nested scopes and with clear labelling, then I would see no reason for keeping goto. Throughout the years we developed a way to neatly use an unsafe feature, but it's always better to have such things enforced by language, not by tradition.

so i think everything that was talked about was:

goto for interpreter (and other state machine) switches
goto for RAII-style stuff, and exception handling
goto for labeled breaks
goto for coroutine re-entry

tobemined: https://www.google.com/search?client=ubuntu&channel=fs&q=resumable+exceptions+irreducible+control+flow&ie=utf-8&oe=utf-8 https://cs.stackexchange.com/questions/102984/why-irreducibility-is-an-important-concept-in-flow-graphs http://infolab.stanford.edu/~ullman/dragon/w06/lectures/dfa3.pdf https://stackoverflow.com/questions/22039125/can-all-control-flow-graphs-be-translated-back-using-if-and-while https://stackoverflow.com/questions/10386152/effect-of-goto-on-c-compiler-optimization https://stackoverflow.com/questions/48638653/can-anyone-help-explain-the-one-pass-verification-process-shows-in-webassembly https://stackoverflow.com/tags/control-flow-graph/hot?filter=all https://www.google.com/search?safe=active&client=ubuntu&hs=Q7h&channel=fs&sxsrf=ALeKk02iksYzYBfko45r346FIYLVBOFaaA%3A1593617485003&ei=TKz8XvfQPMLm-gSwpbuACg&q=applications+of+irreducible+control+flow&oq=applications+of+irreducible+control+flow&gs_lcp=CgZwc3ktYWIQAzoECAAQR1CA0wFY2uABYNrhAWgAcAJ4AIAB1gGIAYEJkgEFOC4yLjGYAQCgAQGqAQdnd3Mtd2l6&sclient=psy-ab&ved=0ahUKEwj3pufgr6zqAhVCs54KHbDSDqAQ4dUDCAs&uact=5 https://www.researchgate.net/publication/260321293_Semantics_of_an_Intermediate_Language_for_Program_Transformation

http://sro.sussex.ac.uk/id/eprint/7576/1/Stanier,_James.pdf notes that "Acompileroptimisationknowntocauseirreducibilityistailcallelimina-tion....Ifafunctionisrecursivethentailcalleliminationreplacestherecursivecallwithabranchtotheentryofthefunction,creating a loop. Whencombinedwithinliningthereisa possibility ofcreatingmultiple loop entryp oints,thuscreatingirreducibility"

"Itisalsop ossible for switch statements to cause irreducibilityinCandC++,withthemostfamousexam-pleb eing Duff's Device, whichwascreatedinorder to manually unroll loops. Itinterlacesaswitchstatementwithawhilelo opandactsasacomputedgoto1.Javado esnotallowthisinterlacingofconstructs.Itisworthmention-ing,however,thatJavaallowstheuseoflab elledstatementsincombinationwithbreakandcontinue,allowingcontroltoreturntotheenclosinglab elledstatement.Thesearemorewell-b ehavedthanconventionalgotostatements,andwillnotcauseirreducibility.

"Twosurveysareoftencitedintextstoprovethatirreducibilityisrare.Knuth[79]sampledatotalof33FORTRAN66programswhichspannedover20,000punchcards,whichisroughlyequivalenttothesamenumb eroflinesofco de...ckeandAllen[13]p erformedintervalanalysistechniqueson72randomlyselectedFORTRANIVprogramswhichwerecurrentlyrunningintheT.J.WatsonResearchCenter?"

"CHAPTER3.ASTUDYOFIRREDUCIBILITYINCPROGRAMS473.5Metho dSincethesurveysmentionedinthelastsectionwerecarriedout,thep opularityofFORTRANhasdropp edandsoforthisstudyweturnedtoC.AccordingtotheTIOBEindex?,Cstillrankssigni cantlyhigherinp opularitythanC++.Manycoresystemutilitiesandto olsarewritteninthislanguage,aswellasalotofemb eddedco deandama jorityoflargeop ensourcepro jects.Inthissectionwepresentourmetho dologyforp erformingourstudyofirreducibility.Thetechniquethatweusetodetermineirreducibilityinaprogramiscalledstructuralanalysis.Structuralanalysisisa ne-grainformofintervalanal-ysiswhichisp erformedontheCFG."

"Wechosethesourceco deof15GNUpro jectsasoursampledata "Intotal,wefound5irreduciblefunctionsinatotalof10427,givingatotalaverageirreducibilityforthissetofcurrentprogramsof0.048%.Weseefromtheresultsthatallirreducibilityo ccursonlywhengotosareused,agreeingwitha ndingfromHechtandUllman[65].Itcanb eseenthatalargenum-b erofgotostatementsarestillusedinmostpro jects.Afterobtainingresultsforthecurrentversionsofthesoftware,weranthesametestsontheoldestversionsavailabletous.TheseresultsareshowninFigure?3.6.Herewefound20irreduciblefunctionsinatotalof4772,givinganaverageirreducibilityof0.42%forolderversionsofthesamesoftware...Imp ortantly,the5irreduciblefunctionsinthelatestversionsareallpresentintheolderversions.Nonewirreduciblefunctionshaveb eenintro duced"

" 3.7PatternsofirreducibilityWevisuallyinsp? ectedthefunctionswhichwere aggedasirreducibleaswewerecuriousastowhethertherewereanyrecurringpatternsinthesourceco de.Inthissectionwewillexploretheseandsuggesthowasimilarfunctionalitycouldb eachievedbyusingmorestructuredprogramming.Wediscoveredthatthestyleofsourceco depro ducingirreducibilityisnotfarremovedfromthestyleofthecanonicalexamplegiveninFigure3.2.

Figure3.2:Somepseudo co dewhichcontainsthecanonicalthree-no deirre-ducibleCFG. bool main(bool x) { a: if(x) goto b; else goto c; b: if(x) goto end; else goto c; c: if(!x) goto b; else goto end; end: return x;}

Com-monly,anirreduciblefunctionispartitionedintodi erentsectionsbyusinglab els.Eachlab elcontainsco dep erformingsomesp eci ctask.gotostate-mentstransfercontrolb etweenthesesectionsinsuchawaythatanirreducibleCFGispro duced. Eachlab elcontainsco dep erformingsomesp eci ctask.gotostate-mentstransfercontrolb etweenthesesectionsinsuchawaythatanirreducibleCFGispro duced.Ago o dexampleofthisb ehaviourcanb eseeninirreduciblefunctionsthatareparsingtext,suchasprint_comment()fromindent-1.9.1inthe lecomments.c.Here,theco deisiteratingovercharactersrepresentingacommentinco de,andthenusinggototojumptodi erentsectionsdep endingonthecharacterthathasb eenread.Thisallowsadi erentactiontob etakenwhenreadingtheb eginningandendofaline.Similarb ehaviourcanb eseeninanotherirreducibletextpro cessingfunction,fch_get()fromless-290inch.c.Wegiveasimpli edexampleinFigure3.7.Hereweseethreelab elledsectionsrepresentingdi erenttasks,andcontrol owjumpsb etweenthem.Ap ossiblesolutionistotranslateeachlab elledtaskasaseparatefunction,andthenusefunctioncallsinsteadofgotos,asthisresemblestheoriginalpresentationoftheco de.

Figure3.7:Unstructuredlab elledsectionsofco deandtheequivalentstruc-turedco de.opnrepresentsnon-branchingco deandcnrepresentsBo oleanvariables a: if(c1) {op1; goto b; }op2; goto c;b: if(c2) {op3; goto end; }op4; goto c;c: if(c3) {op5; goto b; }op6; goto end;end: op7; -> a() { if(c1) {op1; b(); } else {op2; c(); }op7;}b() { if(c2) { op3; return; }op4; c();}c() { if(c3) {op5; b(); return; }op6;}

Furtherexamplesofthisb ehaviourcanb eseeninfourfunctionsinguile-1.0,inthe leunif.c.Thesefunctionsare: •scm_vector_set_length_x() •scm_array_prototype() •scm_uniform_array_read_x() •scm_uniform_array_write()

Here,afteranumb erofmacroshaveb eenexpandedbythecompiler,anla-b elledlo opp erformspro cessingthathasb eenorganisedintolab elledsections.Thesesectionscontaingotoswhichjumptodi erentlab elledsectionsrepre-sentingdi erentstagesofthispro cessing.Weseenodi cultyinaprogrammerrefactoringthisco deinastructuredmanner.

Itmayb ep ossibletoautomaticallydetectirreduciblepatternsinsourceco deduringcompilationbypatternmatchingandthenemitadiagnosticwarn-ingtotheprogrammer.Alternatively,adiagnosticwarningcouldb eemittedduringaCFGanalysisphase.Thiswouldallowtheprogrammertoreconsiderthewaytheyhaveprogrammedaparticularpieceofco deandtoadjustitaccordingly,thusgivingthecompilermorechancetooptimiseit.

Most,ifnotall,oftheirreduciblefunctionswerecomp osedofthepatternab ove.Ideally,alliterativeb ehaviourshouldb ehandledbystructuredlo opingconstructsorbyfunctioncalls

-- http://sro.sussex.ac.uk/id/eprint/7576/1/Stanier,_James.pdf

"We use Clang’s representation of the control flow graph to identify so-callednatural loops.These are subgraphs whose edges form a cycle with exactly one entry point. As an im-plementation detail, we choose this definition of loops to simplify the definition and im-plementation of data-flow analysis. In general,irreducible flow graphsmay contain loopswith multiple entry points. However, a recent study [SW12] “found 5 irreducible functionsin a total of10427, giving a total average irreducibility for this set of current programs of0.048%”. The authors conclude that irreducibility “is an extremely rare occurrence, and itwill probably be even rarer in the future”." -- https://pdfs.semanticscholar.org/0bfd/3460e2718ce1c6ad9f0253f76ed0c3da6280.pdf J. Stanier, D. Watson. A study of irreducibility in C programs.Softw., Pract. Exper.42(1):117–130, 2012.

"A new struc-turing algorithm was presented that converts arbitrary unstructured CFGs into struc-tured ones. The algorithm repeatedly applies two novel transformations that removeunstructuredness by redirecting the control flow and inserting additional predicatevariables. The time complexity of this algorithm was studied along with the causedcode expansion. The chapter proved that the algorithm limits code expansion to apolynomial function of the number of nodes in the CFG, and presented experimentalevidence. This result is important since it overcomes the issue of exponential codeblowup of irreducible CFGs in previous Node Splitting-based techniques, which resortto code duplication. The chapter showed that the irreducibility insertion techniqueemployed by software obfuscators is rendered ineffective by the proposed structuringalgorithm. Furthermore, this structuring algorithm is directly applicable to variouscompiler passes [91–97] that either give up and resort to Node Splitting when facedwith irreducibility, or need special implementations [124–126] to address it." -- https://docs.lib.purdue.edu/cgi/viewcontent.cgi?article=2612&context=open_access_dissertations , thesis, January 2016Programming models, compilers, and runtimesystems for accelerator computingAmit SabnePurdue? University

---

gsg on July 3, 2015 [–]

Yeah. Even basic things like building SSA can be done more simply on "structured" CFGs.

The benefits are enough that data structure and algorithm designs in the JVM compiler world often take advantage of assuming reducible control flow even though Java bytecode can express irreducible CFGs. Instead, such programs are left to the interpreter.

cwzwarich on July 3, 2015 [–]

Rust's CFGs are reducible with unbounded treewidth, not structured, because Rust has named exits from loops that nest arbitrarily.

Most problems on reducible graphs are not easier than for general graphs, because you can always compute a loop forest (for generalized loops, not natural loops with a single entry point) and just consider a derived acyclic graph.

gsg on July 4, 2015 [–]

The structured SSA building algorithm I had in mind has a fairly simple extension that covers break, continue, and early return. It's more complicated, but still way simpler than iterated dominance frontiers.

> you can always compute a loop forest

It's easier to not have to.

cwzwarich on July 4, 2015 [–]

Technically speaking, break / continue / early return are unstructured by the classical definition. The algorithm you are likely thinking of can easily be extended to handle arbitrary control-flow:

https://pp.info.uni-karlsruhe.de/uploads/publikationen/braun...

In the case of irreducible control-flow it may produce redundant cyclic phis, but those can be eliminated with a pretty simple post-pass.

https://news.ycombinator.com/item?id=9827305 has some other stuff on irreducibility but i've already read it

---

this control flow question is proving trickier than i thought.

on the one hand, we want Boot to be easy to implement.

Now, a Boot interpreter is easy to implement whether or not we have irreducible control flow, and whether or not we have indirect control. But a Boot compiler, to a target platform that doesn't support these things (like JVM or GPUs), is not. Do we care? Maybe we should just say, yeah, Boot is for an easy, inefficient bootstrap of Oot, and BootX? is for an easy, inefficient port of Boot, and sometimes, like Python bytecode, this method of running Oot uses an interpreter runtime even if you are 'compiling'; if you want efficient compilation, you're going to have to directly implement the more complicated machines higher up the stack, which only provide structured control flow, albeit of a powerful sort. Until this week, that was the plan; but the idea that we could have an easy way to COMPILE, not just interpret, Boot to JVM, WASM, GPUs etc persuaded me to look at more restrictive alternatives.

but, we could also say, hey, we want Boot to be easy to implement, which includes being easy to compile to diverse platforms including JVM and GPU. Boot is supposed to be a stripped down target language that can even be used in performant Oot implementations, even on diverse platforms, for code that happens to meet its prerequisites. And furthermore, indirect control and irreducible control flow seem to be rare. So when the Oot toolchain needs to use those rare things, and if you want that part to be efficient, you'll again have to directly the more complicated machines higher up the stack, but otherwise, implementing Boot and BootX? is really all you need to do. Or, maybe this stuff is in BootX? but not Boot.

but on the other hand, we want Boot to be an easy, general-purpose, simple compile target. Something that WASM fails to do b/c they only provide structured control flow, not GOTO. And maybe irreducible control flow, or computed goto, is found in the HLL we may want to implement, so let's just have arbitrary GOTO and make it easy to do so.

so those are the styles to choose between. in more detail, the choices seem to be:

1. ban 'wacky' control flow in Boot, where 'wacky' is some subset of indirect jumps, irreducible CFGs, access to the PC, access to the callstack 1a. and ban it in BootX? too 1b. ban 'wacky' control flow in Boot but allow it in BootX? 1bi. and, allow Oot to run on Boot but in some sort of restricted 'bootstrapping' mode 1bii. and, do not allow Oot or its implementation to run on Boot, but instead provide an implementation of a self-hosting Lo/Loot->Boot compiler that runs on Boot; this compiler is actually a BootX? compiler but it only outputs BootX? if you use certain constructs, which it doesn't use in its own selfhosting implementation. The rest of the Oot toolchain requires BootX?, but many things turn out to output plain Boot, which is nice for easy and efficient compilation/optimization of those parts on targets like JVM and WASM. 1biii. allow Oot to run on Boot but only as an interpreter, not as a compiler. The Lo/Loot compiler is as in 1bii. 2. allow 'wacky' control flow in Boot

last night, i was leaning towards forbidding wacky control flow in Boot (#1), because it seems rare/unnecessary. This morning, i was leaning towards allowing it (#2), because that was the original idea (Boot is inefficient on many platforms b/c it requires a runtime interpreter, so on those platforms it's much more efficient to directly implement something higher up the stack). Right now, i'm leaning towards #1bii.

A problem with #1a. is that you're probably going to end up having another interpreter loop on top of Boot if you do this, to support stuff like delimited continuations. A problem with #2 is that this means we can't use Boot as the 'easy to efficiently execute' subset of BootX? down the line; Boot's only function is bootstrapping. A problem with #1bi is that we either we may have to sort of implement Oot twice, once using BootX? wacky control for indirection for stuff like delimited continuations, and once using another interpreter-ish loop on top of Boot to do the wacky control manually; but, if we already want to use Boot as the 'efficiently executable subset' then we probably already need the functionality of recognizing which code can be separated into a component with no wacky control, so maybe that's not a problem. 1biii may solve that problem.

---

so i guess i'm really curious if compilations of coroutines and resumable exceptions require irreducible control flow.

i think a good way to get at this is to ask a question on stackoverflow: how do they compile languages with coroutines to WASM?

---

btw i think JVM DOES support irreducible control flow

https://stackoverflow.com/questions/22039125/can-all-control-flow-graphs-be-translated-back-using-if-and-while

https://stackoverflow.com/questions/6827363/bytecode-features-not-available-in-the-java-language

---

" 4

There are several reasons why bytecode control flow may not be translatable back into Java without extreme measures.

    JSR/RET - this instruction pairs has no equivalent in Java. The best you can do is inline it. However, this will lead to an exponential increase in code size if they are nested.

    Irreducible loops - In Java, every loop has a single entry point which dominates the rest of the loop. An "irreducible" loop is one that has multiple distinct entry points, and hence no direct Java equivalent. There are several approaches. My preferred solution is to duplicate part of the loop body, though this can lead to exponential blow up in pathological cases as well. The other approach is to turn the method into a while-switch state machine, but this obscures the original control flow.

An example instruction sequence is

ifnull L3 L2: nop L3: goto L2

This is the simplest possible irreducible loop. It is impossible to turn into Java without changing the structure or duplicating part of the code (though in this case, there are no actual statements so duplicating wouldn't be so bad).

    The last part is exception handling. Java requires all exception handling to be done through structured try/catch blocks and it's variations while bytecode doesn't. At the bytecode level, exception handlers are basically another form of goto. In pathological cases, the best you can do is create a seperate try catch for every instruction that throws and repeat the process above."

btw http://cliffhacks.blogspot.com/2008/02/java-6-tryfinally-compilation-without.html says that b/c of jsr/ret deprecation, 'finally' blocks are just duplicated and inlined at the end of both 'try' and 'catch'!

---

some discussion on WASM and irreducible control flow and GOTO:

https://www.reddit.com/r/programming/comments/amdjhr/webassemblys_problematic_control_flow/ http://troubles.md/posts/why-do-we-need-the-relooper-algorithm-again/

that article and the reddit comments on it think the irreducibility restrictions are bonkers, leading to complex stuff like relooper that is easy to have bugs, and they think it doesn't really help make the compiler to the target simpler (except for 'old GPUs' that don't support irreducible control). Futhermore, the article thinks that the right thing to do is to have "a list of blocks like WebAssembly? has now, each taking some number of arguments and returning some number of values onto the stack, and that you can call like function. Now you can have arbitrary flow between these blocks, like a typed version of goto.

Now I’m not some visionary genius and I didn’t come up with this idea, nor was I the first to apply this to WebAssembly?. The concept has been implemented in compilers for a very long time, where it is known as a control flow graph or CFG. A CFG is usually combined with a code representation in SSA form, and if you’ve read my previous article you probably know where this is going by now but let’s work through it together anyway. "

---

note that if CALL and RET use an opaque stack, then you can't walk the stack for tracebacks, and i'm guessing some delimited continuation stuff will be impossible, too. ---

1biv) although BootX? is an extension to Boot, part of the Oot toolchain provides a 'compiler' from BootX? to Boot, which rewrites the BootX? program as an interpreter loop so that indirect control flow and irreducible control flow are simulated, and which simply panics if truly unsupported instructions (like file I/O) are used.

at first glance, this solves everything:

porters to weird platforms like GPU, Java, WASM dont have to implement the transformations of control flow to interpret wacky control flow; we do that for them in the canonical BootX?->Boot compiler
higher levels of the Oot stack can only target BootX?, and assume that wacky control flow is available there; they don't have to make a second, restricted implementation for bootstrapping the self-hosting toolchain
people who want fast code can directly implement BootX? on their platform, rather than using the provided BootX?->Boot compiler

but at second glance, there's not really a usecase for this:

people who want to port to a new platform easily will just write an interpreter for Boot. Since it's an interpreter anyway it would be easy for them to add the wacky control flow
people who want perf will write a compiler for BootX?, not wanting the extra interpreter runloop generated by the BootX?->Boot compiler

no one is going to write a compiler for Boot->platform, knowing that BootX?->Boot is just going to generate an interpreter on top of that; either they'll take on the more difficult task of writing a compiler for BootX?->platform, which is more performant but harder, or they'll just write an interpreter for BootX?, which is easier but less performant. So this would be kind of pointless.

although... the one remaining usecase would be that since someone wrote a compiler for Boot->native, within an Oot compiler, that can be used for simple code so that it runs fast, and then the fallback is the BootX? interpreter.

But if we're going to do that, why not just ban irreducible control flow by only provided structured control flow primitives? and if we do that, why not just use WASM? after all, the reason we didn't use wasm originally was that i thought it was too complicated to compile to its restricted structural control flow. (now, however, it seems that Boot has managed to achieve even more simplicity than WASM, and also its easily extensible to BootX?, so i'd probably stick with Boot anyways).

so at the moment, i'm actually leaning towards 1biv. Ban wacky control flow, making Boot simple to compile to almost anything, put the wacky control flow in BootX?, and supply a BootX?->Boot interpreter.

i admit, part of this is just to make Boot an even more perfect 'extremely simple' bytecode.

---

i think the way to specify the reducibility property is this:

any backwards jump from B->A defines a "loop" consisting of A, B, and all of the instructions between them. There can be no jumps from outside of the loop targeting any instruction in the loop except for A.

this neglects the 'and every node is reached from the entry node' property but that probably isn't so important for us. The reason this is less abstract than the defn of reducibility (as found on Wikipedia) is that here the instructions are in a linear ordering, making it easier the DAG requirement on the forward jump part automatically satisfied (i think), and also making it easier to simply and concretely define 'loop' in my formulation above.

---

next morning, still going back and forth on this. when i woke up i thought, yeah, 1biv, make Boot super restrictive so that it works on all platforms, the provide a BootX? compiler, so that even on platforms that need an interpreter for BootX?, when code is identified as pure Boot they can skip that.

and then a little while later i thought (and still think at this moment), no that's stupid, we should just add back indirect control into Boot, if people need that for some weird platform they can implement a higher-level layer, Boot is just supposed to make it easy to write a small interpreter, and BootX? is just supposed to be a few more instructions, not a whole other layer to understand -- we'd be adding another layer for not much gain.

a short while later i still think the latter (choice #2 above; wacky control flow allowed in Boot) is best. Here's some more detailed reasoning:

Boot and BootX? were always supposed to be ASSEMBLY-like, of which one key characteristic is indirect/wacky/free control flow. Boot and BootX? were not supposed to be separate layers, they are one thing but then Boot was spun out just so that the spec wouldn't seem too scary with all the verbiage needed for things like memory ordering consistency models, and with things that embedded systems may not have or need like floating point or division.

Boot is supposed to be easy to write an interpreter for for any platform, not necessarily a compiler. Hence the BOOTstrap. It's not particularly supposed to be efficient. LoVM? is the efficient one.

For native stuff running on actual assembly, eg x86, Boot/BootX? should be an efficient choice. But it doesn't have to be the right choice for JVM, WASM, GPUs, etc, those guys can write a simple interpreter in Boot, or they can go to a higher level.

if some targets, such as GPUs or JVM, demand a non-assembly-like paradigm for compilation, or Python etc for transpilation, they should port a layer further up. It's silly to compile the structured control flow in higher layers into assembly just to compile that assembly back into structured control flow when going from BootX?->Boot.

---

a few days later:

so, on the one hand, if we allow wacky control in Boot, it seems dumb to give up the chance to have a nice restricted language that can compile to a wider variety of platforms (like WASM), in favor of something permitting arbitrary indirect control, that can only be interpreted or compiled to truly low-level assembly-like platforms.

on the other hand, different platforms would require different control restrictions. JVM requires no indirect jumps and opaque and probably no non-interop call/returns. WASM has call/return (and even has a call_indirect, although this is more like a switch), but requires no non-reducible control flow (hence its restriction to structured programming primitives). And if jumps are provided, defining non-reducible control flow, while not too complex, is certainly one more complicated-ish thing that users (Boot programmers) need to worry about.

so overall i'm still leaning towards adding all wacky control to Boot, and telling platforms that can't compile that to either just interpret Boot, or port a higher-level layer.

a day later:

my opinion hasn't changed; still leaning towards adding all wacking control flow to boot

a few days later:

my opinion hasn't changed; still leaning towards adding all wacking control flow to boot. I think this decision has been made, at least for now.

---

if Boot has wacky control flow, including lpc, then is there any need to support xentr? Maybe so, but it could just be a NOP i guess.

---

still need to decide about if separate register banks for anytype and for ptrc.

---

is it truly efficient to do wrapping 32-bit unsigned arithmetic on present-day 64-bit architectures? I guess so: on ARM64 i tried the following using the VIXL simulator package (using the 'getting started' example):

...

void GenerateDemoFunction?(MacroAssembler? *masm) { uint64_t demo_function(uint64_t x) __ Ldr(w1, 0xFFFFFFFF); __ Add(w0, w1, 0x1); __ Ret(); }

...

this should do a 32-bit ADD of 1 to the highest 32-bit quantity, 0xFFFFFFFF (on ARM64, the 'w' registers are 32-bit). The result was 0. I verified that if i loaded 21 with 0xFFFFFFFE instead of 0xFFFFFFFF, the result was ffffffff, as it should be.

so, this suggests that yes, there is value is specifying Boot arithmetic as 32-bit wraparound, rather than just saying the behavior is undefined when the result is above 0xFFFFFFFF. The 32-bit wraparound appears to be easy to implement even on 64-bit systems like ARM64.

What about Rust? Although arithmetic is apparently non-wrapping and overflow is an error, the standard library provides wrapping_* variants of arithmetic http://huonw.github.io/blog/2016/04/myths-and-legends-about-integer-overflow-in-rust/ https://doc.rust-lang.org/std/primitive.i32.html#method.wrapping_add

similarly on x86_64:

https://www.onlinegdb.com/online_gcc_assembler

.data .text .global main main: # your code goes here # xor %eax, %eax mov $0xffffffff, %rax add $1, %eax ret

yields 0. I think on x86_64 using the e registers (i think the E registers are the 32-bit subsets of the 64-bit R registers), the instruction maybe just does the computation in 64 bits and then sets the high 32 bits to zero. I think that would work for addition and subtraction, at least. And our MUL instruction is only the low 32 bits anyways.

So i think it is not too much of a burden on implementors on any major architecture if we specify 32-bit wraparound unsigned addition.

---

200726

removed the following from Boot spec:

halt $ncond $result #imm8: if $ncond == 0, end program
halt: if $ncond is equal to zero, then end program, returning result code ($result + #imm8). Otherwise, if $ncond is nonzero, do nothing. The int32 $result is interpreted as signed.

memory allocation:

mallo &dest $size _: Memory ALLOcate a new region of size $size, return ptr
mfree &ptr _ _: Memory FREE a region

and @n refers to the n-th anytype register, for example the first and last registers in each bank are: $0, $15, &0, &15, @0, @15.

la @dest &addr #imm8: Load Anytype from memory addr (&addr + #imm8)
sa @addr &src #imm8: Store Anytype to memory addr (&addr + #imm8)
cpa @dest @src $cond: if $cond == 0, then CoPy? Anytype between registers

Blind copying with anytype registers ##

Anytype registers allow you to blindly copy the contents of memory without knowing if it holds integers or pointers. There are only three instructions provided for anytypes, la, sa, cpa (load, store, and copy).

To blindly traverse memory, use ap and apm.

Do not load memory locations into integer registers or pointer registers if you do not know that they contain integers or pointers respectively.

xcall &target_address #nargsi_imm8u #nargsp_imm8u: eXternal CALL subroutine
xentr _ #nargsi_imm8u #nargsp_imm8u: eXternal ENTRypoint to Boot function
xaftr _ #nretsi_imm8u #nretsp_imm8u: place immediately AFTeR? xcall or xlib
xret0 &return_address _ _: RETurn void to external platform
xreti &return_address $return_val #imm8: return int32 ($result + #imm8)
xretp &return_address &return_val #imm8: return ptr (&return_val + #imm8)
xcall: call external subroutine with #nargsi_imm8 integer arguments and #nargsv_imm8 pointer arguments. Before executing this instruction, the arguments must be placed as per the Boot Calling Convention
xentr: This instruction should be placed at each entry point that may be called from foreign code. #nretsi_imm8 and #nretsp_imm8 indicate the number of integer and pointer arguments expected. Every code path starting in xentr must end in an xret or xtail, with no other 'xentr's in between. The xentr most immediately previous to an instruction, if any, is considered to begin an 'xentr subroutine' containing that instruction. No source instruction shall branch or jump to a target location within any xentr subroutine, unless either both source and target are within the same xentr subroutine, or the jump is by way of one of the instructions: xcall, xtail, xlib, xret0, xreti, or xretp.
xaftr: reenter Boot function after returning from external subroutine call. This instruction should immediately follow each xcall. #nretsi_imm8 and #nretsp_imm8 indicate the number of int32s and ptrs being returned, respectively.
xret0, xreti, xretp: return from Boot function to external platform at &return_address. xreti's int32 $return_val is interpreted as signed. &return_address must be the value that was in pointer register 4 upon the corresponding xentr.
xretp: if &return_value holds a ptrc, then $imm8 must be 0
xtail: xtail cannot be used to call any function taking more than 3 integer arguments or more than 3 pointer arguments, unless it is within an xentr routine.

If the function being called takes a variable number of arguments, then the total number of integer arguments is passed in register $11 and the total number of pointer arguments is passed in register $12.

If more than 3 integer arguments or more than 3 pointer arguments need to be passed, then a pointer to the remaining integer arguments is passed in &11 and/or a pointer to the remaining pointer arguments is passed in &12. The contents of the memory holding the additional arguments may be overwritten by the callee, just as with registers 5,6,7. However, the registers 11,12 (both banks) themselves are still callee-saved and, if modified, must be restored before return. The callee must not deallocate the memory pointed to by pointer registers 11 or 12 (that is, the memory holding the additional arguments).

On platforms which pass values which are neither integers nor pointers, when arguments are passed which are neither 32-bit integers nor pointers, if the value is guaranteed to fit within 32-bits, it is passed as an integer, otherwise the value is stored in memory and a pointer to the value is passed.

memory allocation	mallo mfree
interop	xcall xentr xaftr xret0 xreti xretp xtail

loads and stores and copies

la sa cpa

sam $dest #imm16u: Shift left 16-bits then Add 16-bit iMmediate #imm16u
sam: In C notation, $dest = ($dest<<16) + #imm16
lentr &dest #imm16: &dest = ptrc to the specified xentr instruction
lentr: target must be an xentr instruction. The #imm16 specifies an code location in terms of bytes relative to the next instruction after the lentr instruction.

I/O

out outp

inp &dest $device #imm8u: read IN one Pointer from device ($device + #imm8)
outp $device &src #imm8u: write OUT one Pointer to device ($device + #imm8)

Anytype arguments are not part of the Boot Calling Convention.

# Interoperation instructions ## To enable Boot code to call native platform code, the instructions xcall, xlib, xaftr, xtail are provided:

xcall calls a native function pointer
xlib calls a statically specified platform-specific library function
xaftr must be placed immediately after any xcall or xlib instruction, to prepare a return into Boot code from native code.
xtail is like xcall but without any expectation of a return. Exactly how this is implemented is platform-specific, but typically a tailcall means that no new stack frame will be created and the location after the call will never be return-ed to. Since no new stack frame is created, this can be used for iteration without overflowing the stack. Because the caller will never be returned to, in some ways it's like 'exec', although typically the tailcalled-code may still return to other callers higher up on the stack.

To enable Boot code to call and be called by native platform code, the instructions xentr, xret0, xreti, xretp are provided:

xentr is placed at the beginning of any Boot subroutine which can be called from native code
xret0, xreti, xretp return from a Boot subroutine back to native code (returning void, one integer, or one pointer, respectively)

Boot code may xcall or xtail into other Boot code if the target is marked with xentr. (and of course, a platform may choose to implement one of the xlib library functions using Boot, in which case xlib would also be from Boot code to other Boot code).

These instructions cause the implementation to translate between the Boot Calling Convention and the platform calling convention, as well as to do whatever other else is needed to transition between Boot and native code.

# The Boot Calling Convention ##

Up to 3 integer arguments and up to 3 pointer arguments are passed in registers. Anytype arguments are not part of the Boot Calling Convention.

Registers 4, 5, 6, 7, 8, 9, 10 (all banks) are caller-saved. Registers 3, 11, 12, 13, 14, 15 (all banks) are callee-saved. TODO: what about regs 1 and 2 -- caller saved?

Pointer register 3 is used as a memory stack pointer when applicable (TODO what does 'when applicable' mean?), otherwise it is callee-saved.

Pointer register 4 is used as a return address pointer/link register. When using xlib, there is no need to set this register, these instructions will set it if needed.

Registers 5, 6, 7 (both integer bank and pointer bank) are used to pass arguments and return values (from lower to higher number get arguments from left to right). Upon making a call, up to 3 integer arguments are in integer registers 5, 6, 7, and up to 3 pointer arguments are in pointer registers 5, 6, 7, and the return address is found in pointer register 4.

Registers 8,9,10 (both banks) are caller-saved scratch registers and may be overwritten and used for any purpose by the callee.

Upon returning from a call, up to 3 integer and up to 3 pointer return values will be found in registers 5,6,7 using the same convention as for calling.

other control flow

halt

# Profiles ##

200726

i just made the following changes:

reduced the number of registers in each bank from 16 to 8; this should hopefully allow Boot to compile to x86_64 straightforwardly without a runtime. As a bonus, it also allows Boot to be represented with a (straightforward) 16-bit encoding, although i haven't yet decided if we should switch to that or not: 1 encoding bit, 6 opcode bits, 3x3 operand bits.
the argument for switching to 16 bits is that we can (and 32-bits really wastes a lot of space if we can do 16; also, it seems ugly to have 8-bit operand fields for registers if we only use 3 bits); the argument against is that it means that in BootX?, where we'll probably have more than 64 instructions (because of all those floating points and concurrency moded loads and stores, if nothing else), we'd have to introduce the 32-bit encoding, so now we'd have a variable-length encoding already (or we'd have two distinct encodings for Boot and BootX?)
adjusted the Boot Calling Convention for 8 regs in each bank
defined syscall ('xlib') memcpy
removed the anytype bank and related stuff; memcpy can be used instead
moved malloc, mfree into syscalls: implementations which compile to native code would like to just call syscalls for these, and that means that Boot code needs to assume that caller-saved registers will not persist across the call
moved halt into syscalls; this is kind of an unusual instruction that relates to the lower-level environment
moved the interop stuff into bootX, except for xlib; this is not really required for a bare-bones self-hosting implementation, and so should be moved into boot_extended
moved the interop Boot calling convention stuff into bootX for many arguments and for variable-length calls and for values that are not integers or pointers; this could be relevant for xlib i guess but it isn't strictly necessary and it really goes more with the interop stuff
made the immediate offset operands for load/store to scale with the type of data there (because these are immediate values and so can't be computed at run-time, and in portable code we don't know the target platform and so can't compute byte offsets for mixed structs anyways)
removed the lm and sam instructions and replaced with an l32m instruction that reads the immediate data from the following word in the instruction stream
added jy instruction
removed lentr instruction and replaced with lpc (because we decided to permit weird control flow)
removed jmp, because with only 9 immediate bits, an unconditional immediate jump isn't that useful

---

i decided that, like IN, it's possible for OUT to return 0 characters written, because i looked at the man page for the posix/linux write syscall and it said this can happen there in some cases, so i'm just matching that behavior.

---

actually if we wanted to support a 'user profile', we may need to reserve ptr registers 3 and 4 and 5 for smallstack, stack, and link ptr.

and think about including smallstack in bootx

i think it's kind of confusing to reserve registers now b/c what if, in LOVM, we want to say "in the 16-bit encoding register 2 is TOS of SMALLSTACK"? Then register 2 is usable but has different semantics than is defined in Boot. In which case, may as well just talk about it in LOVM rather than here.

---

ok so it would be pretty awesome for Boot to be a 16-bit encoding with 8 regs per bank, and BootX? to be a 32-bit encoding with 255 regs per bank. But i do think it would make it more complicated for implementors. Let's just leave that to LOVM, and keep Boot as a 32-bit encoding, and BootX? as extensions to Boot.

Or, some of the BootX? 'extensions' could be a 16-bit encoding for Boot, and 255 regs.

---

for easy reference, here's what GNU Lightning says about regs:

" There are at least seven integer registers, of which six are general-purpose, while the last is used to contain the frame pointer (FP). The frame pointer can be used to allocate and access local variables on the stack, using the allocai or allocar instruction.

Of the general-purpose registers, at least three are guaranteed to be preserved across function calls (V0, V1 and V2) and at least three are not (R0, R1 and R2). Six registers are not very much, but this restriction was forced by the need to target CISC architectures which, like the x86, are poor of registers; anyway, backends can specify the actual number of available registers with the calls JIT_R_NUM (for caller-save registers) and JIT_V_NUM (for callee-save registers).

There are at least six floating-point registers, named F0 to F5. These are usually caller-save and are separate from the integer registers on the supported architectures; on Intel architectures, in 32 bit mode if SSE2 is not available or use of X87 is forced, the register stack is mapped to a flat register file. As for the integer registers, the macro JIT_F_NUM yields the number of floating-point registers. " -- https://www.gnu.org/software/lightning/manual/html_node/The-instruction-set.html

---

bootx can have profiles too

---

Maybe think about register ordering again so that caller saved and callee saved can be continuous Especially when we move to 255 registers. Maybe move small stack register to register 1. So maybe callee-saved at lower numbers and caller-saved at higher numbers so that when we move to 255 regs and there are more caller-saved above that, both caller and callee can specify a single contiguous range of things they need to save (this is assuming that we need more caller-saveds before we need more callee-saveds, because in this paradigm if the callee needs more saveds then they would use a second discontiguous range)

---

It would make programs much more efficient if we could standardize integer and pointer sizes because then you could include memory offsets into mixed (int and ptr) memory data structures (for example offsets into a traditional stack using a frame pointer) rather than computing at run time. The main issue with this is that on some architectures (fat) pointers may be larger than eight memory locations.

Although on the other hand I suppose If you were compiling to native instructions you could pre-compute the constant offset arithmetic instructions and propagate Constants, So the chance for efficiency is not entirely lostin the current setup.

---

another way to achieve more efficiency with offsets Is to complicate the instruction including when there is an Immediate offset; Split the bits in half and use Half of the bits to encode an int offset and then the other half to include a pointer offset with a total offset being a sum. Doesn't really work with three bits though. With BootX? and 8-bit offsets, however, this makes a lot of sense (i already added this to the bootx_reference which is currently just a list of todos).

---

i've been thinking about microkernel operating systems as inspiration for minimalist feature sets.

i have some notes on this in greenField.txt. I think the thing to do is look at the following:

freertos, zephyr, riot, rt-thread nano, L4 family

and find their intersection (which is probably a subset of their 'kernels'). I'll take notes on that in greenField.txt.

However, something else to consider is that application-level high-level language (HLL) libraries might have LESS features in some ways than these. For example, the Python 'threading' library doesn't seem to have task priorities. So maybe also need to take a further intersection with HLL libraries after the microkernel analysis is done.

also, i have some notes already elsewhere on stuff i learned from microkernels, and also notes on minimum message sizes found there. i forgot where, TODO.

also, microkernels provide some stuff that would be in HLL libraries too, like networking. So at some point, beyond their kernels, it would be useful to take the intersection of that with HLL libraries to get a useful 'peripheral' library set.

also, microkernels provide inspiration for other kinds of base I/O besides console, eg maybe GPIO, I2C, SPI, UART

--- GPIO, I2C, SPI, UART:

UART:

I/O connection between two devices, using two pins on each device (called TX and RX)
transmits one character at a time (5-9 bits, so for our purposes assume it's an 8-bit byte)
can be thought of as a converter from parallel I/O to serial, because the hosts (eg. the 2 CPUs connected to the 2 UART peripherals which are talking to each other) on each end send/receive data in 'parallel' (8 bits at a time), and the UARTs talk to each other in serial (one bit at a time)
sometimes provides a FIFO buffer for incoming data so the host CPU doesn't have to check it every character
has a flag to tell the host when new incoming data is waiting to be read (and may generate an interrupt in this case)
sometimes provides a FIFO buffer for outgoing data so the host CPU doesn't have to wait for every single character to be sent before sending the next one
has a flag to tell the host when it is still busy sending outgoing data
can be simplex (one-directional), half-duplex, or full-duplex
async (sender and recipient don't have to share a clock signal; the UART protocol contains synchronization within itself. Devices do have to know what the bitrate will be (at least at the beginning?) and to both be calibrated to within 10% of that)
https://en.wikipedia.org/wiki/Universal_asynchronous_receiver-transmitter#UART_models has some examples of configurations. I see mentions of FIFO buffers of sizes 1,2,4,16,32,64,128. https://en.wikipedia.org/wiki/Universal_asynchronous_receiver-transmitter#UART_in_modems recommends FIFO buffers of sizes 16,32,64, or 128.
for our purposes i'm leaning towards a buffer size of 16, 64, or 256.

I2C: TODO

SPI: TODO

GPIO: TODO (mb see https://sneklang.org/doc/snek.html#_gpio_functions )

links:

https://www.seeedstudio.com/blog/2019/09/25/uart-vs-i2c-vs-spi-communication-protocols-and-uses/

---

regarding my still considering 16 bits for Boot:

if you make Boot ints 16 bits, you can have a 32-bit extension in BootX? and then require that extension for everything else in Oot, which means that Boot proper is still open for 16-bit usage elsewhere, but doesn't impact Oot. Otoh that still might be more complicated for implementors trying to read the spec that if we have a 16-bit PROFILE (restriction) in BootX? instead.
if Boot ints were actually only 16 bits, then even if you provided 32-bit instructions, they would have to use two registers to store each 32-bit int. This is infeasible.

also, not as important, but i see 4 bytes crop up as minimal message sizes/alignments in some random microkernel stuff that i happened to look at today. And many small systems (such as Micropython) require more than 64k ROM. So more evidence that the world has moved on from 16 bits.

So as of this moment i'm still thinking that Boot will be 32 bits (with a 16 bit profile, that i won't be using, available in BootX?). This is also strengthened by noting that (a) the human genome is much more than 64k; in fact closer to 2^32 (b) even some users of mid-size embedded OSs like NuttX? say they need more than 64k to be comfortable (c) even some embedded programming languages like Micropython need more than 64k (d) the number of LGN neurons in the human brain is about 140e6/40 = 3.5e6 [2] and this may be proportional to the number of 'pixels' [3]; although this model doesn't say how many LGN neurons per pixel, if it's less than about 50 then we've exceeded 64k, and in any case, if we want our system to make it easy to write programs that model/simulate the brain, we need to have some headroom by which i mean we need a little more precision in our system than the thing we are trying to model actually has.

proj-oot-ootAssemblyNotes27

x1 x2 -- flag test x1=x2