"Thumb-2 with its unequaled code density" -- https://news.ycombinator.com/item?id=21988614
---
"Interesting that the B5000 didn't make this list. Berkeley CS252 has been reading the Design of the B5000 System paper for years....
Aloha 1 day ago [-]
I was also surprised - but I wonder if that's the computer architecture language designers like, not computer architects.
reply
"
--- https://news.ycombinator.com/item?id=21988096
---
https://people.cs.clemson.edu/~mark/admired_designs.html
" ... CDC-6600 and 7600 - listed by Fisher, Sites, Smith, Worley ... Cray-1 - listed by Hill, Patterson, Sites, Smith, Sohi, Wallach (also Bell, sorta) ... IBM S/360 and S/370 - listed by Alpert, Hill, Patterson, Sites (also Bell) ... MIPS - listed by Alpert, Hill, Patterson, Sohi, Worley ... "
" Related Lists
Perspectives from Eckert-Mauchly Award Winners
Bruce Shriver and Bennett Smith asked six Eckert-Mauchly Award winners what were the 5 or 6 most important books or articles that affected the way they approached the central issues of computer architecture and what 5 or 6 books or articles they would recommend for others to read because of likely impact on future architectures. See pp. 52-61 in Shriver and Smith, The Anatomy of a High-Performance Microprocessor: A Systems Perspective, IEEE Computer Society Press, 1998. The six award winners are: John Cocke, Harvey Cragon, Mike Flynn, Yale Patt, Dan Siewiorek, and Robert Tomasulo.
Processor design pitfalls
Grant Martin and Steve Leibson listed thirteen failed processor design styles in "Beyond the Valley of the Lost Processors: Problems, Fallacies, and Pitfalls in Processor Design." See Chapter 3 in Jari Nurmi (ed.), Processor Design: System-On-Chip Computing for ASICs and FPGAs, Springer, 2007.
Designing a high-level ISA to support a specific language or language domain
Use of intermediate ISAs to allow a simple machine to emulate its betters
Stack machines
Extreme CISC and extreme RISC
VLIW
Overly aggressive pipelining
Unbalanced processor design
Omitting pipeline interlocks
Non-power-of-2 data-word widths for general-purpose computing
Too small an address space
Memory segmentation
Multithreading
Symmetric multiprocessing" -- https://news.ycombinator.com/item?id=21988096
tachyonbeam 2 days ago [-]
I think one of the most influential designs of recent times has been the DEC Alpha lineage of 64-bit RISC processors[1].
The ones listed by 4 or more people (not including Bell) were the four machines quoted above (CDC-6600/7600, Cray-1, IBM S/360 and S/370, MIPS).
Special mention:
---
some notes on Forth implementations:
kazinator 23 hours ago [-]
> Now onto the new instruction:
Adding a special VM instruction for such operations as inheriting from a class is a strikingly bad design decision.
You want to minimize the proliferation of VM instructions as much as possible.
A rule of thumb is: is it unreasonable to compile into a function call? Or else, is it likely to be heavily used in inner loops, requiring the fastest possible dispatch? If no, don't add an instruction for it. Compile it into an ordinary call of a run-time support function.
You're not going to inherit from a class hundreds of millions of times in some hot loop where the application spends 80% of its time; and if someone contrives such a thing, they don't necessarily deserve the language level support for cutting their run-time down.
reply
munificent 22 hours ago [-]
This is a good point. One of the things I have to balance with the book is teaching the optimal way to do something without dragging the reader through a large amount of verbose, grungy code. So sometimes (but not too often) I take a simpler approach even if it's not the best one.
With this VM, we have plenty of opcode space left, so it's natural to just add another instruction for inheritance, even if it does mean that the bytecode dispatch loop doesn't fit in the instruction cache quite as well.
I'll think about this some more. I'm very hesitant to make sweeping changes to the chapter (I want nothing more in life than for the book to be done), but this would be a good place to teach readers this fundamental technique of compiling language constructs to runtime function calls.
reply
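kazinator's rule of thumb can be sketched concretely: instead of a dedicated inherit opcode, the compiler emits an ordinary call of a runtime support function through whatever native-call machinery the VM already has. This is a hypothetical sketch, not code from the book; the opcode names and the `inherit_runtime` helper are made up for illustration.

```python
# Sketch: compiling a rare operation ("inherit from a class") into an
# ordinary call of a runtime support function, instead of adding a
# dedicated VM instruction.  All names here are hypothetical.

def inherit_runtime(subclass, superclass):
    """Runtime support function: copy inherited methods down, keeping
    the subclass's own overrides."""
    for name, method in superclass["methods"].items():
        subclass["methods"].setdefault(name, method)
    return subclass

OP_PUSH, OP_CALL_NATIVE = 0, 1   # the VM already needs these anyway

def run(bytecode, stack):
    pc = 0
    while pc < len(bytecode):
        op, arg = bytecode[pc]
        if op == OP_PUSH:
            stack.append(arg)
        elif op == OP_CALL_NATIVE:          # arg = (function, arity)
            fn, arity = arg
            args = [stack.pop() for _ in range(arity)][::-1]
            stack.append(fn(*args))
        pc += 1
    return stack

sub = {"methods": {"init": "sub.init"}}
sup = {"methods": {"init": "sup.init", "show": "sup.show"}}

# "class Sub < Sup" compiles to: push sub, push sup, call inherit_runtime/2
run([(OP_PUSH, sub), (OP_PUSH, sup),
     (OP_CALL_NATIVE, (inherit_runtime, 2))], [])
print(sorted(sub["methods"]))   # inherited "show", kept own "init"
```

The dispatch loop gains nothing: inheritance runs once per class definition, so routing it through the existing native-call opcode costs essentially nothing at runtime.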
---
" The Classical Forth Registers
The classical Forth model has five "virtual registers." These are abstract entities which are used in the primitive operations of Forth. NEXT, ENTER, and EXIT were defined earlier in terms of these abstract registers.
Each of these is one cell wide -- i.e., in a 16-bit Forth, these are 16-bit registers. (There are exceptions to this rule, as you will see later.) These may not all be CPU registers. If your CPU doesn't have enough registers, some of these can be kept in memory. I'll describe them in the order of their importance; i.e., the bottom of this list are the best candidates to be stored in memory.
W is the Working register. It is used for many things, including memory reference, so it should be an address register; i.e., you must be able to fetch and store memory using the contents of W as the address. You also need to be able to do arithmetic on W. (In DTC Forths, you must also be able to jump indirect using W.) W is used by the interpreter in every Forth word. In a CPU having only one register, you would use it for W and keep everything else in memory (and the system would be incredibly slow).
IP is the Interpreter Pointer. This is used by every Forth word (through NEXT, ENTER, or EXIT). IP must be an address register. You also need to be able to increment IP. Subroutine threaded Forths don't need this register.
PSP is the Parameter Stack (or "data stack") Pointer, sometimes called simply SP. I prefer PSP because SP is frequently the name of a CPU register, and they shouldn't be confused. Most CODE words use this. PSP must be a stack pointer, or an address register which can be incremented and decremented. It's also a plus if you can do indexed addressing from PSP.
RSP is the Return Stack Pointer, sometimes called simply RP. This is used by colon definitions in ITC and DTC Forths, and by all words in STC Forths. RSP must be a stack pointer, or an address register which can be incremented and decremented.
If at all possible, put W, IP, PSP, and RSP in registers. The virtual registers that follow can be kept in memory, but there is usually a speed advantage to keeping them in CPU registers.
X is a working register, not considered one of the "classical" Forth registers, even though the classical ITC Forths need it for the second indirection. In ITC you must be able to jump indirect using X. X may also be used by a few CODE words to do arithmetic and such. This is particularly important on processors that cannot use memory as an operand. For example, ADD on a Z80 might be (in pseudo-code)
POP W
POP X
X+W -> W
PUSH W
Sometimes another working register, Y, is also defined.
UP is the User Pointer, holding the base address of the task's user area. UP is usually added to an offset, and used by high-level Forth code, so it can be just stored somewhere. But if the CPU can do indexed addressing from the UP register, CODE words can more easily and quickly access user variables. If you have a surplus of address registers, use one for UP. Single-task Forths don't need UP.
...
Use of the Hardware Stack
Most CPUs have a stack pointer as part of their hardware, used by interrupts and subroutine calls. How does this map into the Forth registers? Should it be the PSP or the RSP?
The short answer is, it depends. It is said that the PSP is used more than the RSP in ITC and DTC Forths. If your CPU has few address registers, and PUSH and POP are faster than explicit reference, use the hardware stack as the Parameter Stack.
On the other hand, if your CPU is rich in addressing modes -- and allows indexed addressing -- there's a plus in having the PSP as a general-purpose address register. In this case, use the hardware stack as the Return Stack.
Sometimes you do neither! The TMS320C25's hardware stack is only eight cells deep -- all but useless for Forth. So its hardware stack is used only for interrupts, and both PSP and RSP are general-purpose address registers. (ANS Forth specifies a minimum of 32 cells of Parameter Stack and 24 cells of Return Stack; I prefer 64 cells of each.)
...
Top-Of-Stack in Register
Forth's performance can be improved considerably by keeping the top element of the Parameter Stack in a register!
...
If you have at least six cell-size CPU registers, I recommend keeping the TOS in a register. I consider TOS more important than UP to have in register, but less important than W, IP, PSP, and RSP. (TOS in register performs many of the functions of the X register.) It's useful if this register can perform memory addressing. PDP-11s, Z8s, and 68000s are good candidates.
Nine of the 19 IBM PC Forths studied by Guy Kelly [KEL92] keep TOS in register.
...
What about buffering two stack elements in registers? When you keep the top of stack in a register, the total number of operations performed remains essentially the same. A push remains a push, regardless of whether it is before or after the operation you're performing. On the other hand, buffering two stack elements in registers adds a large number of instructions -- a push becomes a push followed by a move. Only dedicated Forth processors like the RTX2000 and fantastically clever optimizing compilers can benefit from buffering two stack elements in registers.
Some examples
Here are the register assignments made by Forths for a number of different CPUs. Try to deduce the design decisions of the authors from this list.
Figure 5. Register Assignments
          W      IP     PSP   RSP    UP      TOS
8086[1]   BX     SI     SP    BP     memory  memory  [LAX84]
8086[2]   AX     SI     SP    BP     none    BX      [SER90]
68000     A5     A4     A3    A7=SP  A6      memory  [CUR86]
PDP-11    R2     R4     R5    R6=SP  R3      memory  [JAM80]
6809      X      Y      U     S      memory  memory  [TAL80]
6502      Zpage  Zpage  X     SP     Zpage   memory  [KUN81]
Z80       DE     BC     SP    IX     none    memory  [LOE81]
Z8        RR6    RR12   RR14  SP     RR10    RR8     [MPE92]
8051      R0,1   R2,3   R4,5  R6,7   fixed   memory  [PAY90]
[1] F83. [2] Pygmy Forth.
"SP" refers to the hardware stack pointer. "Zpage" refers to values kept in the 6502's memory page zero, which are almost as useful as -- sometimes more useful than -- values kept in registers; e.g., they can be used for memory addressing. "Fixed" means that Payne's 8051 Forth has a single, immovable user area, and UP is a hard-coded constant.
Narrow Registers
Notice anything odd in the previous list? The 6502 Forth -- a 16-bit model -- uses 8-bit stack pointers!
It is possible to make PSP, RSP, and UP smaller than the cell size of the Forth. This is because the stacks and user area are both relatively small areas of memory. Each stack may be as small as 64 cells in length, and the user area rarely exceeds 128 cells. You simply need to ensure that either a) these data areas are confined to a small area of memory, so a short address can be used, or b) the high address bits are provided in some other way, e.g., a memory page select.
In the 6502, the hardware stack is confined to page one of RAM (addresses 01xxh) by the design of the CPU. The 8-bit stack pointer can be used for the Return Stack. The Parameter Stack is kept in page zero of RAM, which can be indirectly accessed by the 8-bit index register X. (Question for the advanced student: why use the 6502's X, and not Y? Hint: look at the addressing modes available.) " -- https://www.bradrodriguez.com/papers/moving1.htm
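The classical registers above can be made concrete with a toy indirect-threaded (ITC) inner interpreter. This is a sketch under simplifying assumptions (cell-addressed memory, primitives as Python functions), not any real Forth: W and IP are plain variables, while PSP and RSP are implicit in two Python lists.

```python
# Toy ITC Forth inner interpreter.  A word's code field holds the index
# of a primitive routine; a colon definition's code field holds ENTER
# and its body (the "thread") is a list of word addresses.

mem = [0] * 32
pstack, rstack = [], []          # parameter stack, return stack
IP = 0                           # Interpreter Pointer
W = 0                            # Working register
running = True

ENTER, EXIT, LIT, PLUS, BYE = range(5)

def enter():                     # ENTER: nest into a colon definition
    global IP
    rstack.append(IP)            # save caller's IP
    IP = W + 1                   # body starts just past the code field

def exit_():                     # EXIT: unnest
    global IP
    IP = rstack.pop()

def lit():                       # literal is inline in the thread
    global IP
    pstack.append(mem[IP]); IP += 1

def plus():
    b, a = pstack.pop(), pstack.pop()
    pstack.append(a + b)

def bye():
    global running
    running = False

prims = {ENTER: enter, EXIT: exit_, LIT: lit, PLUS: plus, BYE: bye}

# dictionary layout: code fields of the primitive words
PLUS_W, LIT_W, EXIT_W, BYE_W = 2, 4, 6, 8
mem[PLUS_W], mem[LIT_W], mem[EXIT_W], mem[BYE_W] = PLUS, LIT, EXIT, BYE
# colon definition  : ADD5  5 + ;  at address 10
ADD5 = 10
mem[10:15] = [ENTER, LIT_W, 5, PLUS_W, EXIT_W]
# top-level thread: ADD5 BYE
mem[20:22] = [ADD5, BYE_W]

def next_loop():                 # NEXT: W <- mem[IP]; IP++; run code at mem[W]
    global IP, W
    while running:
        W = mem[IP]; IP += 1     # first indirection: fetch word address
        prims[mem[W]]()          # second indirection: run its code routine

pstack.append(37)
IP = 20
next_loop()
print(pstack)                    # [42]
```

Note how every primitive ends by falling back into NEXT (the `while` loop), and how ENTER is the only place W is consumed: this is why W must survive from NEXT into the word's code routine.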
eForth (a (popular?) portable Forth with 31 primitives):
" The following registers are required for a virtual Forth computer:
Forth Register   8086 Register   Function
IP               SI              Interpreter Pointer
SP               SP              Data Stack Pointer
RP               BP              Return Stack Pointer
WP               AX              Word or Work Pointer
UP               (in memory)     User Area Pointer
...
eForth Kernel
System interface: BYE, ?rx, tx!, !io
Inner interpreters: doLIT, doLIST, next, ?branch, branch, EXECUTE, EXIT
Memory access: !, @, C!, C@
Return stack: RP@, RP!, R>, R@, >R
Data stack: SP@, SP!, DROP, DUP, SWAP, OVER
Logic: 0<, AND, OR, XOR
Arithmetic: UM+ " -- http://www.exemark.com/FORTH/eForthOverviewv5.pdf
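A quick illustration of how small that kernel can be: UM+ is the only arithmetic primitive, and eForth builds the rest in high-level code -- for example, + is roughly `: + UM+ DROP ;`. Here is a sketch modeling this with 16-bit cells in Python (the cell width is an assumption for illustration):

```python
# Sketch: deriving "+" from eForth's single arithmetic primitive UM+.

CELL_MASK = 0xFFFF               # assume 16-bit cells

def um_plus(a, b):
    """UM+ ( u u -- u carry ): unsigned add, returning sum and carry."""
    s = a + b
    return s & CELL_MASK, s >> 16

def plus(a, b):                  # roughly  : + UM+ DROP ;
    s, _carry = um_plus(a, b)    # high-level code just drops the carry
    return s

print(um_plus(0xFFFF, 1))        # (0, 1): wraps around with carry out
print(plus(0xFFFE, 3))           # 1
```

Keeping the carry visible in the primitive is what lets double-cell arithmetic (M+, D+, etc.) also be written in high-level code.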
---
here are CamelForth?'s ~70 Primitives:
http://www.bradrodriguez.com/papers/glosslo.txt
design rationale for this choice:
" 1. Fundamental arithmetic, logic, and memory operators are CODE.
2. If a word can't be easily or efficiently written (or written at all) in terms of other Forth words, it should be CODE (e.g., U<, RSHIFT).
3. If a simple word is used frequently, CODE may be worthwhile (e.g., NIP, TUCK).
4. If a word requires fewer bytes when written in CODE, do so (a rule I learned from Charles Curley).
5. If the processor includes instruction support for a word's function, put it in CODE (e.g. CMOVE or SCAN on a Z80 or 8086).
6. If a word juggles many parameters on the stack, but has relatively simple logic, it may be better in CODE, where the parameters can be kept in registers.
7. If the logic or control flow of a word is complex, it's probably better in high-level Forth. " -- http://www.bradrodriguez.com/papers/moving5.htm
ANS Forth Core words
These are required words whose definitions are specified by the ANS Forth document.
!     x a-addr --            store cell in memory
+     n1/u1 n2/u2 -- n3/u3   add n1+n2
+!    n/u a-addr --          add cell to memory
>     n1 n2 -- flag          test n1>n2, signed
>R    x -- R: -- x           push to return stack
?DUP  x -- 0 | x x           DUP if nonzero
ANS Forth Extensions
These are optional words whose definitions are specified by the ANS Forth document.
<>     x1 x2 -- flag             test not equal
BYE    i*x --                    return to CP/M
CMOVE  c-addr1 c-addr2 u --      move from bottom
CMOVE> c-addr1 c-addr2 u --      move from top
KEY?   -- flag                   return true if char waiting
M+     d1 n -- d2                add single to double
NIP    x1 x2 -- x2               per stack diagram
TUCK   x1 x2 -- x2 x1 x2         per stack diagram
U>     u1 u2 -- flag             test u1>u2, unsigned
Private Extensions
These are words which are unique to CamelForth?. Many of these are necessary to implement ANS Forth words, but are not specified by the ANS document. Others are functions I find useful.
(do)    n1|u1 n2|u2 -- R: -- sys1 sys2     run-time code for DO
(loop)  R: sys1 sys2 -- | sys1 sys2        run-time code for LOOP
(+loop) n -- R: sys1 sys2 -- | sys1 sys2   run-time code for +LOOP
><      x1 -- x2                 swap bytes
?branch x --                     branch if TOS zero
BDOS    DE C -- A                call CP/M BDOS
branch  --                       branch always
lit     -- x                     fetch inline literal to stack
PC!     c p-addr --              output char to port
PC@     p-addr -- c              input char from port
RP!     a-addr --                set return stack pointer
RP@     -- a-addr                get return stack pointer
SCAN    c-addr1 u1 c -- c-addr2 u2   find matching char
SKIP    c-addr1 u1 c -- c-addr2 u2   skip matching chars
SP!     a-addr --                set data stack pointer
SP@     -- a-addr                get data stack pointer
S=      c-addr1 c-addr2 u -- n   string compare; n<0: s1<s2, n=0: s1=s2, n>0: s1>s2
USER    n --                     define user variable 'n' "
---
CoolGuySteve? 5 hours ago [–]
Systems programming in any language would benefit immensely from better hardware accelerated bounds checking.
While Intel/AMD have put incredible effort into hyperthreading, virtualization, predictive loading, etc, page faults have more or less stayed the same since the NX bit was introduced.
The world would be a lot safer if we had hardware features like cacheline faults, poison words, bounded MOV instructions, hardware ASLR, and auto-encrypted cachelines.
Per-process hardware encrypted memory would also mitigate spectre and friends as reading another process's memory would become meaningless.
reply
---
incidentally, here's the 'passing arguments' part of the osdev wiki page on syscalls:
https://wiki.osdev.org/System_Calls#Passing_Arguments
---
just to review, since it's been many months since i was able to work on this and i forgot a lot:
the current proposal for the Oot layer cake:
1. Boot/BootX?
2. 'Oot Assembly' or 'BootX?'; Loot
3. OVM
4. Oot Core
5. Pre-Oot
6. Oot
1. Boot/BootX?
Boot: An assembly language which is dead easy to implement. Competes with 'toy' assembly languages. Limited functionality. Sufficient for self-hosting compiler and interpreter. Non-concurrent. Does this support ASCII consoles (STDIN, STDOUT)? Does this support console interactivity, or only batch computation? Are the interop xcall etc instructions in here?
The goal of Boot is to be the easiest-to-implement target assembly language, which can support self-hosting and which can be incrementally extended to BootX?.
I'm leaning towards including interactivity in Boot. This way you can think of Boot as the minimum necessary for bootstrapping a (C64/Apple II BASIC)-like Oot REPL on an embedded system (rather than just bootstrapping a batch compiler, which isn't that useful b/c you may prefer to cross-compile when bootstrapping a port to a new system).
BootX?: Adds instructions and syscalls to Boot so that it can serve as a target language for full Oot implementations. Maybe also directly supports: Everything needed for a TUI (e.g. cursors)? PICO-8 style interactivity?
The goal of Boot is to be the perfect simple portable target assembly language for bootstrapping the Oot toolchain on a new platform, and for general-purpose single-threaded (pure/batch? or interactive?) computation, which can be extended to BootX?. The goal of BootX? is to be the perfect portable target assembly language incrementally extended from Boot.
2. LoVM? (and Loot)
A portable assembly language which is pretty high-level. Easier to read and understand than Boot. More pleasant to write in than Boot. Preserves more high-level semantics, so it can be translated to most target platforms more efficiently than Boot can. Trivially compiles to Boot. Abstraction level of LLVM, C--, and maybe of Pillar and AliceVM? and Hashlink? LLVM-like but easier to implement or port to a new backend, and supports the features that we need for an efficient Oot. Less configurable than LLVM. Many implementations port this layer, because its higher-level nature allows more efficient use of platform primitives than a port of Boot does. Because it's higher-level than some assembly languages, a code fragment without weird control flow can be efficiently transpiled to a HLL.
Loot ('low level oot') is an HLL that compiles to this layer. Competes with C, Zig, BASIC, Oberon. Oot's garbage collector, scheduler, etc are written in this layer, as are some things that benefit from optimization.
The goal of LoVM? is to be the perfect VM for writing portable language services like garbage collectors, schedulers, and to also be a great target language.
The goal of Loot is to be the perfect simple language for writing portable language services like garbage collectors, schedulers.
3. OVM
OVM is a portable safe virtual machine. Abstraction level of JVM, CLR, BEAM, and maybe of Pillar and AliceVM? and Hashlink? Easier to port than JVM? Definitely easier than CLR, BEAM. Unlike the previous layer, is safe, in the sense that buggy code will cause an OVM panic or a single-thread hang, not a lower-level crash or a hang that affects other threads. Has a sandboxed mode for untrusted code. Code fragments can be efficiently and readably transpiled to HLLs (provided they don't use features of OVM that aren't compatible with the target HLL). More mature ports of Oot port this layer, so that they can use platform-provided services (such as GC and scheduling) rather than using Oot's implementation of these on top of a platform that already provides them.
Has garbage collection, scheduling, bounds-checking, tail-recursion. Restricted control flow (no GOTOs)? Has a HLL that compiles to this layer?
The goal of OVM is to be a simple safe portable VM that meets Oot's needs.
4. Oot Core
A small core Oot that has 'Oot-like' semantics. More explicit than Oot; probably too verbose for easy human reading and writing. May have a simpler syntax, incompatible with Oot's, but semantics-wise it is a subset of Oot. Even more mature ports of Oot port this layer, so that they can implement or transpile Oot Core code in the most efficient and readable way for each platform.
Abstraction level of Haskell Core, or Scheme.
5. Pre-Oot
A subset of Oot out of which the rest of Oot is built using metaprogramming.
6. Oot
Oot proper. Has at least three 'profiles':
So compilation/interpretation in the reference implementation goes:
A mature implementation on a specific platform might do something like:
Note that even near the bottom levels (e.g. LoVM?) you might be able to get decent transpilation into a HLL for simple algorithms, which serves the design goal of being able to use Oot as an embedded 'scripting language' within various HLL platforms with as little porting work as possible.
Note that compared to the previous proposals, i think i've added BootX? back in, and pushed even more of the unusual or complicated or even scary-looking stuff (eg atomics) out of Boot. Smallstacks and addressing modes and compact encodings remain pushed up into LoVM?. Now i'm also pushing all concurrency, and all 'optional instructions', into BootX?.
You can kind of divide this into 3 macro-layers:
The layer most likely to be a boondoggle is LoVM?. Boot is constrained to be as simple/easy to implement as possible (given the constraint that BootX? is incrementally added onto Boot and BootX? must be a reasonably good target platform), BootX? is only adding necessary instructions to be a complete target platform. LoVM?'s motivation has a bit of 'i want an awesome assembly language', and all of my probably-stupid 'innovations' such as the stacks, as well as some ill-defined aspirations (making a higher-level assembly language for better code density and transpilation) (as well as just for ease of reading/writing; which is silly because there's really no need to read/write this, you'd read/write the corresponding HLL, Loot). OVM may be a similar danger, although at least there we're going to be somewhat driven by Oot's needs.
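The layer-cake plan above can be sketched as data. This is a hypothetical model, not part of any Oot implementation: the layer names come from these notes, the lowering steps are placeholders, and the point is just that more mature ports implement a higher layer natively (e.g. using platform GC/scheduling at the OVM level) and so need fewer lowerings.

```python
# Sketch: the proposed Oot layer cake as data, top to bottom.
LAYERS = ["Oot", "Pre-Oot", "Oot Core", "OVM", "LoVM", "Boot"]

def lowering_path(port_implements):
    """Return the chain of lowerings a port needs, given the highest
    layer that the port implements natively."""
    stop = LAYERS.index(port_implements)
    return [LAYERS[i] + " -> " + LAYERS[i + 1] for i in range(stop)]

# A brand-new port implements only Boot, so it needs the whole chain:
print(lowering_path("Boot"))
# A mature port implements OVM natively, so lowering stops there:
print(lowering_path("OVM"))
```

The same structure also expresses the transpilation shortcut mentioned above: a port that "implements" LoVM by transpiling it to a host HLL simply declares LoVM as its stopping layer.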
---
i just realized something great:
i have been avoiding putting instruction data in the instruction stream because that makes it hard to parse the instruction stream in parallel. However, this is less of a problem for instruction data for JMP (not branch) instructions, because in that case, you can mis-parse the data as instructions and it doesn't matter because you aren't going to execute past the JMP in any case.
I guess it's still a little tiny bit of a problem because the JMP data could itself contain a JMP; which makes it hard to know, without prior knowledge of which instructions are real, whether the next instruction after the data (which in reality is an actual instruction) is itself jump data (for the false jump in the previous jump's data) or a real instruction.
At least, if the instructions are all 32 bits and the jump data are multiples of 32 bits, you don't have the problem of splitting the instructions in the wrong place. So this one tiny problem might make you think that some data is actually instructions, but it won't make you confused about decoding the instructions. And, any binary could arrange to put data in the middle of the binary in a place that is unreachable, so thinking some data is actually instructions is unavoidable (if binaries do that).
And, not only can you decode, but you can also do speculative execution easily.
If a JMP had 24 bits of data within itself plus 32 bits of data in the next slot, that's 56 bits in total, which addresses 64 petabytes. Which i think is enough (and much more than the 16 MB provided by 24 bits of data).
Otoh it will make arithmetic about figuring jump offsets slightly annoying if different instructions have different total sizes, rather than each one being 32 bits. Also, the compiler/interpreter data structures now have to allocate at most 64 bits to store each instruction in total, rather than 32 bits. So there's still some reasons not to do this, if ultimate simplicity is desired. Otoh having to store jump labels (a form of constant table) in the binary is complex also, and is possibly worse for performance.
Hmm, but didn't i have some reason involving compilation targets other than assembly why jk (jump table jmp) is preferred? Why full indirect jumps are bad? No, i just looked that up, it's in [[ootAssemblyNotes24?]]; the issue was with jumping to DYNAMICALLY COMPUTED addresses. The issue is that if you try to compile code with dynamically computed addresses to a target with different sizes for (the compilation of) different instructions, then since the original program didn't know about those different sizes, it's going to be hard for the compiler to, in general, compute the transformation from the original dynamically computed jump target to its image in the transformed program; the compiler would basically have to embed into the compiled binary a complete map from each location in the source program to where that ends up in the compiled program, which of course bloats the binary a ton. But we don't have that problem here, because here we have static addresses, not dynamically computed ones; that problem had nothing to do with needing to force a jump table/labels on everything, it just had to do with dynamic/static computation of jump targets.
so i think we should probably do this. It allows executable size to reach 64 petabytes instead of 16 MB, and gets rid of the requirement to have jump tables/labels at a special place in the binary. To really realize that simplicity, though, we must get rid of constant tables as well and have only inline constants. Which means that either we only allow small constants, or we allow them in the data stream. This doesn't have the same 'but we'll never reach that place in the instruction stream' property as JMPs, however. But one thing we could do with inline constants is force them to have a special opcode, taking advantage of the fact that we already have 24 bits of data in the leading constant instruction. So it would take 3 slots, 3x32 bits, to get 24*3 = 72 bits of data, which is more than the 64 we'd need for a 64-bit constant. That's fine because even without a special prefix, two slots only gives us 24+32 = 56 bits, so 3 slots are needed in this case regardless. If we did the same with JMP instructions, then 2-slot JMPs can only have 24+24 = 48 bits of data instead of 56. So instead of 64-petabyte executables, we would only get 256 terabytes. That sounds just fine to me! So let's do that. The data can have a variant of the ANN prefix, perhaps.
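A minimal sketch of the uniform-slot scheme being described here, with made-up opcode numbers: every slot is 32 bits; a long jump keeps 24 bits of target in its own slot and the rest in the following raw data slot. Because slots can never be mis-split, a parallel decoder can decode every slot independently; it may mistake the (unreachable) data slot for an instruction, but it can't get misaligned.

```python
OP_NOP, OP_LONGJMP, OP_HALT = 0, 1, 2   # hypothetical opcode numbers

def encode(op, data24=0):
    """Pack an opcode and 24 bits of inline data into one 32-bit slot."""
    return (op << 24) | (data24 & 0xFFFFFF)

def decode(word):
    return word >> 24, word & 0xFFFFFF

# Program: slot 0 is a long jump to slot 3.  Its 56-bit target is split:
# low 24 bits inline, high 32 bits in the following raw data slot, which
# is never executed (we jump over it) but decodes harmlessly.
target = 3
prog = [
    encode(OP_LONGJMP, target & 0xFFFFFF),  # low 24 bits of target
    (target >> 24) & 0xFFFFFFFF,            # high 32 bits, raw data slot
    encode(OP_NOP),                         # unreachable slot
    encode(OP_HALT),
]

# Uniform 32-bit slots: every slot decodes independently, in parallel.
all_slots = [decode(w) for w in prog]

def run(prog):
    pc, trace = 0, []
    while True:
        op, data = decode(prog[pc])
        if op == OP_LONGJMP:
            pc = (prog[pc + 1] << 24) | data    # reassemble 56-bit target
        elif op == OP_HALT:
            trace.append("halt")
            return trace
        else:
            trace.append("op%d@%d" % (op, pc))
            pc += 1

print(run(prog))   # ['halt']
```

Note that the data slot (index 1) decodes as some instruction in `all_slots`, but `run` never reaches it, which is exactly the "mis-parse but never mis-execute" property argued for above.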
Another benefit: allowing this reduces the pressure for non-uniform instruction types.
We can bring back constant tables in LoVM?.
later: you know, there are also some benefits of just having large constants, and absolute jump targets, in their own slots, i.e. not taking advantage of the 24 data bits in the instruction at all. Specifically, on many platforms this allows the implementation to just use fast, ordinary methods for reading the constants out of the instruction stream; and on many platforms, specifically those where jump targets are ultimately translated into ordinary native 64-bit pointers, it allows a linker to just translate 64-bit jump targets into 64-bit pointers (but we'll also have 32-bit jump targets, which won't have the latter property). So it makes the implementation more simple in this way, and less simple in terms of making it easy to definitively see, in every case, if you are just looking at some random place in the middle of the instruction stream, where data ends and instructions begin. And the latter advantage still isn't total, because programs could still put data in unreachable areas of the instruction stream. My guess is that the simplicity of having the constants and jump targets freestanding is more valuable than the simplicity of being 100% certain in every case where instructions might begin. So forget about forcing constants and jump targets to have the ANN prefix.
If we do it this way, we free up those 24 data bits inside the longjump32 and longjump64 instructions. Should we use that to make them long BRANCH instructions instead? Or, should we leave this as implementation-dependent free space? I'm leaning towards the latter; one could imagine, for example, that JIT interpreter/compilers could use this to count how many times the jump is taken and to note whether or not the target has been compiled yet, and that ordinary compilers could use these to identify the jump target via a 'label' which would be an index into a table of at most 2^24 jump targets maintained outside of the instruction stream (we could even make this 2^24-unique-longjump-targets thing a standardized limitation of Oot; that's a limit of 2^24 = ~16 million jump targets; if each jump target corresponds to a 64-bit pointer, then it would take 2^24 x 8 bytes = 128 MB to store this table).
For example, the gcc executable on my system, /usr/bin/x86_64-linux-gnu-gcc-9, is currently 1.2M in size. The Python executable, /usr/bin/python3.8, is 5.3M in size.
There are game executables that are larger than 128MB (e.g. https://www.reddit.com/r/pcgaming/comments/3ta7ur/whats_the_largest_game_exe_youve_ever_seen/ ), but i bet a lot of that size is embedded data, rather than code with unique jumps.
In LoVM? we could add instructions that refer to module jump tables and only provide the index into that table, and don't have following data.
---
i forgot why we had separate dual everything (which is why i was thinking of naming it 'bicycle') for integers and pointers?
was it so that we could standardize the integer bitwidth while leaving pointer bitwidth implementation-dependent?
i remember now; it's not just bitwidth, it's because ptrs need to be totally opaque. Imagine an implementation on top of and interoperating with a HLL like Python; you might need to place references to arbitrary Python objects in pointers; in which case a pointer is not just an integer with a special interpretation, it's a wholly different, opaque object type.
it also has the bonus advantage that, since most instructions naturally specify a type int/ptr of each operand, you can have twice as many total registers given the number of bits in each operand. Not so useful when operands are each 8 bits but very useful when they are only 4 bits (here you can have 16 int registers and also 16 ptr registers, instead of just 16 general-purpose registers).
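A sketch of that dual-register-file idea (hypothetical, in Python): the instruction's operand type selects which 16-entry file a 4-bit operand indexes, so the same 4-bit encoding gives 16 int registers plus 16 pointer registers, and pointer registers hold opaque host objects rather than integers.

```python
# Sketch: dual int/ptr register files selected by operand type.
int_regs = [0] * 16       # 16 integer registers
ptr_regs = [None] * 16    # 16 pointer registers, contents opaque

def write(kind, n, value):
    assert 0 <= n < 16            # only 4 bits of operand needed
    if kind == "int":
        assert isinstance(value, int)
        int_regs[n] = value
    else:
        ptr_regs[n] = value       # opaque: any host object is fine

# register number 5 exists in both files, with independent contents:
write("int", 5, 42)
write("ptr", 5, {"a": "python object"})   # e.g. a host-HLL reference
print(int_regs[5], ptr_regs[5])
```

This also illustrates the interop point: on a Python-hosted implementation, `ptr_regs` can hold arbitrary Python objects directly, with no requirement that a pointer fit in, or behave like, any integer.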
---
i'm considering adding another layer, which is an 'intermediate language optimization layer', or maybe we should make Oot Core that, or make OVM that. This layer has no associated HLL (i guess if you wrote it out it would look like a lisp s-expr, or something similarly 'no extraneous syntax'-y). A CFG and/or CPS or SSA or STG or ANF or something like that.
it would also be cool if sometimes that IL-for-optimization layer could 'skip layers', that is, some instances of pre-oot could be compiled directly to it, skipping the usual step in between, and also some instances of it could be compiled directly to LoVM?, skipping OVM, or even compiled directly to BootX?. Alternately, maybe it could not skip LoVM?, but it could skip the layers above it, all the way up to Oot (or maybe just up to pre-oot, if there are other layers in between it and pre-oot).
another thing to note: some of these layers also have associated HLLs, such as LoVM? and Loot. If we implement OVM in Loot but then ultimately implement everything in Oot, then we won't have any Loot implementation any longer. So what we should actually do is have an alternate compilation pipeline that goes through Loot instead of LoVM?. This also makes sense because we wanted to be able to transpile to another HLL at the end anyhow. The same goes for OVM; there should be a (Lua-like? C#-like? OCaml-like?) HLL corresponding to OVM. Maybe we'll have a third layer like that too ('like that' meaning with both a 'bytecode' representation and also a corresponding HLL, and alternate compilation paths for transpilation (using the HLL) or targeting machine-readability only (using the bytecode)); it could go on the other side of the optimization layer.
I'm not sure if Boot counts as the same sort of thing; the Boot 'HLL' is just assembly code, which isn't very high level. We could make an awesome macroassembly for Boot, but currently i was thinking of saving that for the LoVM? layer. Otoh, Loot is much more than a macroassembler so maybe it belongs here.
If one of the layers is on the other side of the optimization layer, and the Boot layer does count, then i guess the optimization layer goes below OVM. Otherwise, if there is going to be another layer, then the optimization layer goes above OVM.
To summarize this section:
---
ptrace(2) - Linux manual page (man7.org/man-pages/man2/ptrace.2.html): "The ptrace() system call provides a means by which one process (the "tracer") may observe and control the execution of another process (the "tracee")..."
PTRACE_SINGLESTEP
---
previously, i think i had merged BootX? into Boot, and had a bunch of 'optional instructions' in Boot, but now, i feel that that will make the Boot reference document too scary, so, as per the above plan, i'm going to split off BootX? again. i'm going to update boot_reference.md to remove the BootX? instructions, or at least to instruct the implementor to regard them as stubs that either return immediately or alias to other Boot instructions.
are the instruction opcode constant lists in the pyboot implementation currently more up to date than boot_reference.md? maybe
note that there is also boot_reference.adoc, which has the memory order / memory consistency model; bootReferenceOld191229.txt, which is old and has some old todos; and boot_extended_reference.adoc. There's a much older version in ootAssemblyNotes16.txt.
a few instructions that were taken out are in ootAssemblyNotes23.txt. Some other text that was rewritten (probably don't need to reread it) is in ootAssemblyNotes26.txt.
cp ~/prog/oot/boot/doc/boot_reference.md ~/oot/bootReferenceOld200622.txt
cp ~/prog/oot/boot/doc/boot_reference.adoc ~/oot/bootReferenceAdocOld200622.txt
cp ~/prog/oot/boot/doc/boot_extended_reference.adoc ~/oot/bootExtendedReferenceOld200622.txt
cp ~/prog/oot/boot/doc/boot_design.adoc ~/oot/bootDesignOld200622.txt
cp -r ~/prog/oot/boot ~/prog/oot/boot_old_200622
as of now i havent actually changed boot_reference.md, i just made those copies so that it's ready to be changed.
---
jy? lpc? or 'load jrel ptr'? (with a CONSTANT offset; because the offset is constant, there is no need for the runtime to understand how Boot instructions map to compiled instructions; and because it's relative, it can also serve as lpc when the offset is zero)
how does the jvm handle callbacks, esp. callbacks passed in from native code? how do harvard architectures handle callbacks? do they have indirect jumps? how does LLVM deal with this stuff? ok i looked that stuff up (i might be wrong about some of this tho):
wondering how jy and lpc would be implemented in a JVM implementation.
lpc in Boot would be used to create callbacks, so how are those done in the JVM? in JNI?
...(i went and looked this up)... looks like in JNI, from C you call RegisterNatives to register your C functions as Java methods, then in Java you call them as methods.
There are JVM instructions for calling methods:
https://www.jrebel.com/blog/using-objects-and-calling-methods-in-java-bytecode
so, in C code interfacing with the JVM via JNI, the way you get a function pointer, pass it as a callback, and call the callback is to just get the function pointer in C normally, then use RegisterNatives to register it as a JVM method, and then use something like invokevirtual (invokedynamic?) to call the method.
so, i guess that in Java, you would do 'lpc' by getting the 'this' pointer, and then you'd pass that like any other value, and then you'd call it by calling some predetermined method on 'this'; OR you'd use Method References along with 'this' to capture a reference to a particular method. This won't get you to the exact line where lpc is called, however, only to the beginning of a method.
i guess that means that it would be very hard to express lpc in the JVM without a full interpreter runtime layer. an 'lpc-JREL' could be expressed if and only if it only captured function entry points. jy however sounds necessary to be able to implement JVM's invokedynamic/invokevirtual etc, and actually i think an lpc-JREL is necessary too in order to implement method references (without a full interpreter runtime layer).
harvard architectures just mean that code and data are separate address spaces (and maybe code can't be written to); it doesn't mean there are no indirect jumps. For example, AVR has indirect jumps.
on AVR, a harvard (or is it modified harvard?) architecture, there is no 'LPC' instruction to directly load the PC, but you can do an RCALL (relative call) to the next instruction, and then pop the stack to get the return address, which effectively accomplishes the same thing.
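That relative-call trick can be modeled in a few lines of Python. This is a toy VM invented purely for illustration (not real AVR semantics): a relative call to the very next instruction pushes a return address that equals that instruction's own address, giving you the PC without any 'load PC' instruction.

```python
# Toy VM invented for illustration (not real AVR semantics): getting
# the program counter on a machine with no 'load PC' instruction by
# relative-calling the next instruction, then popping the pushed
# return address off the stack.

def run(program):
    """Each instruction is an (opcode, operand) pair."""
    pc, stack, acc = 0, [], None
    while pc < len(program):
        op, arg = program[pc]
        if op == "rcall":        # relative call: push return address, jump
            stack.append(pc + 1)
            pc = pc + 1 + arg
            continue
        elif op == "pop":        # pop top of stack into the accumulator
            acc = stack.pop()
        elif op == "halt":
            return acc
        pc += 1
    return acc

# 'rcall 0' targets the next instruction, so the pushed return address
# equals that instruction's own address -- an effective lpc.
prog = [("rcall", 0),   # address 0: pushes 1, falls through to 1
        ("pop", None),  # address 1: acc = 1, this instruction's address
        ("halt", None)]
print(run(prog))  # -> 1
```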
LLVM's indirectbr instruction is only for local (within the same function) jumps, but its call instruction can take an arbitrary (e.g. externally-given, i believe) function pointer. LLVM allows you to capture a pointer to some location in LLVM code by using an llvm::object::SymbolRef?
i'm not sure how indirect jumps/calls to arbitrary function pointers are represented in CFGs, though.
so yeah all of these things seem to allow both:
---
so i think we do need both
---
in Boot, programs can be assumed to be at most 2^32 long; longer ones can only be permitted in BootX?
---
yknow, you could still have 'constant tables' without any specific binary structure just by having the 'constant number' be the location of the constant within the instruction stream... but that kind of table isn't very helpful, because you can't put arbitrarily large native ptrs in there
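For illustration, here is a sketch (encoding entirely invented) of reading such an in-stream constant by its offset; the fixed-width inline word is exactly why arbitrarily large native ptrs don't fit:

```python
import struct

# Invented sketch: a 'constant table' with no separate binary section.
# The 'constant number' is simply the constant's byte offset within
# the instruction stream itself.

def read_const(stream: bytes, offset: int) -> int:
    # constants are stored inline as little-endian 32-bit words,
    # so anything wider than 32 bits (e.g. a 64-bit native ptr)
    # cannot be stored this way
    return struct.unpack_from("<I", stream, offset)[0]

# a fake instruction stream: two opcode bytes, then a constant at offset 2
stream = bytes([0x01, 0x02]) + struct.pack("<I", 0xDEADBEEF)
print(hex(read_const(stream, 2)))  # -> 0xdeadbeef
```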
---
otoh, as much as jy and lci are needed for a full system, they are going to cause some complications on some targets, and i'm not sure that they are really needed for the self-hosting interpreter. Are they?
And what about xcall, xentry, etc?
i guess the question is, what's the point of the self-hosting interpreter in the Boot stage? If we are leaving out floating-point ops, then it's clearly not so that we can run all normal Oot programs. Potential applications of a Boot-only system include:
So it seems that the really necessary uses of Boot (without BootX?) are external system interaction. Which suggests that maybe xcall, xlibcall, xentry, xret are more important than jy and lci. lci could still be useful to create callbacks to be passed, but we could require that it targets an xentry (in Boot; it could be more general in BootX?). Then instead of a general jy we could reuse xcall. This allows us to impose structural constraints like 'every code path starting in xentry ends in an xret, with no more xentrys in between' which would ensure that the xcall stuff can be compiled into ordinary structured programming language targets.
so this route would see Boot not only having fewer instructions than BootX?, but also more constrained control flow, which would allow it to be more easily compiled to more targets.
It also suggests that leaving out the I/O functionality is not the way to go, as it impedes the really key use case of hacking together a new toolchain on a new platform.
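The structural constraint floated above ('every code path starting in xentry ends in an xret, with no more xentrys in between') could be linted with a simple scan. A minimal Python sketch, assuming a straight-line instruction list (a real checker would have to follow branches); all names are invented:

```python
# Invented sketch: check that xentry regions are properly bracketed,
# with no nested xentry before the closing xret. Assumes straight-line
# fall-through only, just to illustrate the invariant.

def check_xentry_structure(instrs):
    inside = False
    for i, op in enumerate(instrs):
        if op == "xentry":
            if inside:
                return False, f"nested xentry at {i}"
            inside = True
        elif op == "xret":
            if not inside:
                return False, f"xret without xentry at {i}"
            inside = False
    if inside:
        return False, "xentry never reached an xret"
    return True, "ok"

print(check_xentry_structure(["xentry", "add", "xret"]))     # -> (True, 'ok')
print(check_xentry_structure(["xentry", "xentry", "xret"]))  # -> (False, 'nested xentry at 1')
```

This kind of bracketing is what lets each xentry...xret region be emitted as one ordinary function in a structured target language.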
We can have a 'pure profile'. Also a 'user code profile' which uses fewer registers.
---
hmm so no jy = no computed goto! terrible for interpreter perf.
but we said that Boot omits some high-perf things, so that it can be easy to implement everywhere, so i guess that's fair (although annoying).
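For comparison, here is what the two dispatch styles look like sketched in Python (opcodes invented): the if/elif chain is roughly what a no-computed-goto Boot interpreter is stuck with, while the handler-table lookup is the analog of computed goto (one indexed dispatch instead of a branch chain):

```python
# Invented sketch of the two interpreter-dispatch styles.

def run_branch_chain(code, acc=0):
    # no-computed-goto style: test opcodes one by one
    for op, arg in code:
        if op == "add":
            acc += arg
        elif op == "mul":
            acc *= arg
        elif op == "neg":
            acc = -acc
    return acc

def run_table(code, acc=0):
    # computed-goto analog: index straight into a handler table
    handlers = {
        "add": lambda a, x: a + x,
        "mul": lambda a, x: a * x,
        "neg": lambda a, x: -a,
    }
    for op, arg in code:
        acc = handlers[op](acc, arg)
    return acc

code = [("add", 3), ("mul", 4), ("neg", None)]
print(run_branch_chain(code), run_table(code))  # -> -12 -12
```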
also, we still have lentry (was: lci, but only for xentry-labeled things); isn't that hard to implement (without a full interpreter loop)? Only in languages that don't have function/method references, or any other sort of high-level indirection. Even pre-8 Java could implement each xentry as its own class with a distinguished method 'call'.
but yes, on some especially primitive languages, lentry will force the whole thing to be interpreted. Even Oberon-2 has procedure types.
supporting jy and arbitrary labels for callbacks is much harder, because many languages have some sort of first-class functions or high-level indirection but do not have any way to refer to individual lines of code. This would lead to each line of code being its own function, which is very inefficient, or to giving up on compilation and only implementing a Boot interpreter in these languages.
https://en.wikipedia.org/wiki/First-class_function#Language_support shows that a broad variety of languages support this.
otoh, we already have (relative) branches in code, which means that it's already pretty easy to construct something which is hard to transpile to an HLL without either an interpreter, or putting each line into a separate function.
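The 'each line becomes its own function' fallback mentioned above can be sketched with a trampoline. This is purely illustrative (all names and the toy program are invented): every instruction compiles to a small function that returns the index of the next instruction, so arbitrary (even computed) jump targets work in an HLL with no goto, at the cost of one call per instruction:

```python
# Invented sketch: line-per-function transpilation plus a trampoline.
# Each 'line' function mutates shared state and returns the next pc
# (or None to halt), so jumps are just returned indices.

def compile_program():
    state = {"acc": 0}

    def line0():              # acc = 2
        state["acc"] = 2
        return 1
    def line1():              # acc *= 5
        state["acc"] *= 5
        return 2
    def line2():              # 'jump' back to line1 until acc >= 50
        return None if state["acc"] >= 50 else 1

    return [line0, line1, line2], state

def trampoline(lines, state):
    pc = 0
    while pc is not None:     # each step: call the line, get the next pc
        pc = lines[pc]()
    return state["acc"]

lines, state = compile_program()
print(trampoline(lines, state))  # -> 50
```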
---
can conditional moves ever be worse than branches? yes:
https://kristerw.blogspot.com/2017/03/the-cost-of-conditional-moves-and.html
---
and that same link mentioned a purportedly better approach, which they call Stackifier, that apparently applies an old algorithm when the CFG is 'reducible', and otherwise uses a dispatcher (which is still like falling back to the naive approach). A reducible CFG is defined by https://en.wikipedia.org/wiki/Control-flow_graph#Reducibility as one such that:
https://medium.com/leaningtech/solving-the-structured-control-flow-problem-once-and-for-all-5123117b1ee2 explains that this means "all loops need to be single entry".
In other words, the key constraint is that you can't jump into the middle of a loop from outside of that loop, where a loop is something that begins at the target of a backwards jump.
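A minimal sketch of testing reducibility via the classic T1/T2 transformations (T1 removes a self-loop, T2 merges a node that has exactly one predecessor into that predecessor); the graph encoding here is invented, and this is illustrative rather than efficient. The CFG is reducible iff these transformations collapse it to a single node, which matches the 'all loops single-entry' characterization above:

```python
# Invented sketch: reducibility test by repeatedly applying
# T1 (delete self-loop) and T2 (merge a node with a unique
# predecessor into that predecessor) until no change.

def is_reducible(succ, entry):
    succ = {n: set(s) for n, s in succ.items()}
    changed = True
    while changed:
        changed = False
        for n in list(succ):
            if n in succ[n]:                  # T1: drop self-loop
                succ[n].discard(n)
                changed = True
        for n in list(succ):
            if n == entry:
                continue
            preds = [p for p in succ if n in succ[p]]
            if len(preds) == 1:               # T2: merge n into its pred
                p = preds[0]
                succ[p].discard(n)
                succ[p] |= succ.pop(n)
                succ[p].discard(p)  # merging may create a self-loop; T1 it
                changed = True
                break
    return len(succ) == 1

# single-entry loop (b<->c entered only at b): reducible
loop = {"a": {"b"}, "b": {"c"}, "c": {"b", "d"}, "d": set()}
# two-entry loop (a jumps into both b and c): irreducible
irr = {"a": {"b", "c"}, "b": {"c"}, "c": {"b"}}
print(is_reducible(loop, "a"), is_reducible(irr, "a"))  # -> True False
```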
so to summarize the above:
In summary, the status quo (local arbitrary direct branches and jumps permitted, indirection via function pointers permitted, local indirect jumps prohibited) has the following consequences:
So i think i'll stick with the status quo, because:
The lack of computed goto may be okay because having Boot support everything needed for full 'desktop' Oot is not a design goal of Boot -- after all, it's also missing floating point. BootX? is required for all standard Oot functionality to be unlocked.
The lack of computed goto is bothersome because it means we have to write interpreter loops in Oot twice; once without computed goto (for bootstrapping with Boot), and once with. Ultimately though, that's just manageable cruft, and less complex than, say, implementing Boot on JVM if it had indirect branches. Maybe LoVM? should provide instructions specifically abstracting over interpreter dispatch.
For the Boot implementation, aside from the aforementioned cruft/duplication of code, the main effect of lacking computed goto is reduced efficiency. But that's not a design goal; efficient execution isn't a goal until BootX?.
To summarize that decision:
So, this choice maximizes the ease of implementing a simple compiler from Boot to other platforms, while keeping the Boot implementation easy to compile to in most cases, at the expense of loss of efficiency and difficulty of transpilation. This is in line with the design goals of Boot.
---
okay over about one weekend's worth of time i sort of implemented a Boot VM from scratch (i say sort of because i didn't really test it more than on one tiny program, so it's not really done). So, not quite at my target of being a one-weekend project for others, but close enough.
some things i learned during the process: