proj-oot-ootAssemblyNotes27

Difference between revision 4 and current revision

No diff available.

"Thumb-2 with its unequaled code density" -- https://news.ycombinator.com/item?id=21988614

---

"Interesting that the B5000 didn't make this list. Berkeley CS252 has been reading the Design of the B5000 System paper for years....

 Aloha 1 day ago [-]

I was also surprised - but I wonder thats the computer architectures language designers like, not computer architects.

reply

"

--- https://news.ycombinator.com/item?id=21988096

---

https://people.cs.clemson.edu/~mark/admired_designs.html

" ... CDC-6600 and 7600 - listed by Fisher, Sites, Smith, Worley ... Cray-1 - listed by Hill, Patterson, Sites, Smith, Sohi, Wallach (also Bell, sorta) ... IBM S/360 and S/370 - listed by Alpert, Hill, Patterson, Sites (also Bell) ... MIPS - listed by Alpert, Hill, Patterson, Sohi, Worley ... "

" Related Lists

Perspectives from Eckert-Mauchly Award Winners

Bruce Shriver and Bennett Smith asked six Eckert-Maychly Awards winners what were the 5 or 6 most important books or articles that affected the way they approached the central issues of computer architecture and what 5 or 6 books or articles they would recommend for others to read because of likely impact on future architectures. See pp. 52-61 in Shriver and Bennett, The Anatomy of a High-Performance Microprocessor: A Systems Perspective, IEEE Computer Society Press, 1998. The six award winners are: John Cocke, Harvey Cragon, Mike Flynn, Yale Patt, Dan Siewiorek, and Robert Tomasulo.

Processor design pitfalls

Grant Martin and Steve Leibson listed thirteen failed processor design styles in "Beyond the Valley of the Lost Processors: Problems, Fallacies, and Pitfalls in Processor Design." See Chapter 3 inJari Nurmi (ed.), Processor Design: System-On-Chip Computing for ASICs and FPGAs, Springer, 2007.

    Designing a high-level ISA to support a specific language of language domain
    Use of intermediate ISAs to allow a simple machine to emulate its betters
    Stack machines
    Extreme CISC and extreme RISC
    VLIW
    Overly aggressive pipelining
    Unbalanced processor design
    Omitting pipeline interlocks
    Non-power-of-2 data-word widths for general-purpose computing
    Too small an address space
    Memory segmentation
    Multithreading
    Symmetric multiprocessing"

https://news.ycombinator.com/item?id=21988096

 tachyonbeam 2 days ago [-]

I think one of the most influential designs of recent times has been the DEC Alpha lineage of 64-bit RISC processors[1].

1 point by bshanks 0 minutes ago

parent edit delete on: Which Machines Do Computer Architects Admire? (201...

The ones listed by 4 or more people (not including Bell) were:

Special mention:

---

some notes on Forth implementations:

kazinator 23 hours ago [-]

> Now onto the new instruction:

Adding a special VM instruction for such operations as inheriting from a class is a strikingly bad design decision.

You want to minimize the proliferation of VM instructions as much as possible.

A rule of thumb is: is it unreasonable to compile into a function call? Or else, is it likely to be heavily used in inner loops, requiring the fastest possible dispatch? If no, don't add an instruction for it. Compile it into an ordinary call of a run-time support function.

You're not going to inherit from a class hundreds of millions of times in some hot loop where the application spends 80% of its time; and if someone contrives such a thing, they don't necessarily deserve the language level support for cutting their run-time down.

reply

munificent 22 hours ago [-]

This is a good point. One of the things I have to balance with the book is teaching the optimal way to do something without dragging through reader through a large amount of verbose, grungy code. So sometimes (but not too often) I take a simpler approach even if it's not the best one.

With this VM, we have plenty of opcode space left, so it's natural to just add another instruction for inheritance, even if it does mean that the bytecode dispatch loop doesn't fit in the instruction cache quite as well.

I'll think about this some more. I'm very hesitant to make sweeping changes to the chapter (I want nothing more in life than for the book to be done), but this would be a good place to teach readers this fundamental technique of compiling language constructs to runtime function calls.

reply

---

" The Classical Forth Registers

The classical Forth model has five "virtual registers." These are abstract entities which are used in the primitive operations of Forth. NEXT, ENTER, and EXIT were defined earlier in terms of these abstract registers.

Each of these is one cell wide -- i.e., in a 16-bit Forth, these are 16-bit registers. (There are exceptions to this rule, as you will see later.) These may not all be CPU registers. If your CPU doesn't have enough registers, some of these can be kept in memory. I'll describe them in the order of their importance; i.e., the bottom of this list are the best candidates to be stored in memory.

W is the Working register. It is used for many things, including memory reference, so it should be an address register; i.e., you must be able to fetch and store memory using the contents of W as the address. You also need to be able to do arithmetic on W. (In DTC Forths, you must also be able to jump indirect using W.) W is used by the interpreter in every Forth word. In a CPU having only one register, you would use it for W and keep everything else in memory (and the system would be incredibly slow).

IP is the Interpreter Pointer. This is used by every Forth word (through NEXT, ENTER, or EXIT). IP must be an address register. You also need to be able to increment IP. Subroutine threaded Forths don't need this register.

PSP is the Parameter Stack (or "data stack") Pointer, sometimes called simply SP. I prefer PSP because SP is frequently the name of a CPU register, and they shouldn't be confused. Most CODE words use this. PSP must be a stack pointer, or an address register which can be incremented and decremented. It's also a plus if you can do indexed addressing from PSP.

RSP is the Return Stack Pointer, sometimes called simply RP. This is used by colon definitions in ITC and DTC Forths, and by all words in STC Forths. RSP must be a stack pointer, or an address register which can be incremented and decremented.

If at all possible, put W, IP, PSP, and RSP in registers. The virtual registers that follow can be kept in memory, but there is usually a speed advantage to keeping them in CPU registers.

X is a working register, not considered one of the "classical" Forth registers, even though the classical ITC Forths need it for the second indirection. In ITC you must be able to jump indirect using X. X may also be used by a few CODE words to do arithmetic and such. This is particularly important on processors that cannot use memory as an operand. For example, ADD on a Z80 might be (in pseudo-code)

   POP W   POP X   X+W -> W   PUSH W 

Sometimes another working register, Y, is also defined.

UP is the User Pointer, holding the base address of the task's user area. UP is usually added to an offset, and used by high-level Forth code, so it can be just stored somewhere. But if the CPU can do indexed addressing from the UP register, CODE words can more easily and quickly access user variables. If you have a surplus of address registers, use one for UP. Single-task Forths don't need UP.

...

Use of the Hardware Stack

Most CPUs have a stack pointer as part of their hardware, used by interrupts and subroutine calls. How does this map into the Forth registers? Should it be the PSP or the RSP?

The short answer is, it depends. It is said that the PSP is used more than the RSP in ITC and DTC Forths. If your CPU has few address registers, and PUSH and POP are faster than explicit reference, use the hardware stack as the Parameter Stack.

On the other hand, if your CPU is rich in addressing modes -- and allows indexed addressing -- there's a plus in having the PSP as a general-purpose address register. In this case, use the hardware stack as the Return Stack.

Sometimes you do neither! The TMS320C25's hardware stack is only eight cells deep -- all but useless for Forth. So its hardware stack is used only for interrupts, and both PSP and RSP are general-purpose address registers. (ANS Forth specifies a minimum of 32 cells of Parameter Stack and 24 cells of Return Stack; I prefer 64 cells of each.)

...

Top-Of-Stack in Register

Forth's performance can be improved considerably by keeping the top element of the Parameter Stack in a register!

...

If you have at least six cell-size CPU registers, I recommend keeping the TOS in a register. I consider TOS more important than UP to have in register, but less important than W, IP, PSP, and RSP. (TOS in register performs many of the functions of the X register.) It's useful if this register can perform memory addressing. PDP-11s, Z8s, and 68000s are good candidates.

Nine of the 19 IBM PC Forths studied by Guy Kelly [KEL92] keep TOS in register.

...

What about buffering two stack elements in registers? When you keep the top of stack in a register, the total number of operations performed remains essentially the same. A push remains a push, regardless of whether it is before or after the operation you're performing. On the other hand, buffering two stack elements in registers adds a large number of instructions -- a push becomes a push followed by a move. Only dedicated Forth processors like the RTX2000 and fantastically clever optimizing compilers can benefit from buffering two stack elements in registers.

Some examples

Here are the register assignments made by Forths for a number of different CPUs. Try to deduce the design decisions of the authors from this list.

             Figure 5. Register Assignments
            W     IP    PSP   RSP   UP     TOS   

8086[1] BX SI SP BP memory memory [LAX84] 8086[2] AX SI SP BP none BX [SER90] 68000 A5 A4 A3 A7=SP A6 memory [CUR86] PDP-11 R2 R4 R5 R6=SP R3 memory [JAM80] 6809 X Y U S memory memory [TAL80] 6502 Zpage Zpage X SP Zpage memory [KUN81] Z80 DE BC SP IX none memory [LOE81] Z8 RR6 RR12 RR14 SP RR10 RR8 [MPE92] 8051 R0,1 R2,3 R4,5 R6,7 fixed memory [PAY90]

[1] F83. [2] Pygmy Forth.

"SP" refers to the hardware stack pointer. "Zpage" refers to values kept in the 6502's memory page zero, which are almost as useful as -- sometimes more useful than -- values kept in registers; e.g., they can be used for memory addressing. "Fixed" means that Payne's 8051 Forth has a single, immovable user area, and UP is a hard-coded constant.

Narrow Registers

Notice anything odd in the previous list? The 6502 Forth -- a 16-bit model -- uses 8-bit stack pointers!

It is possible to make PSP, RSP, and UP smaller than the cell size of the Forth. This is because the stacks and user area are both relatively small areas of memory. Each stack may be as small as 64 cells in length, and the user area rarely exceeds 128 cells. You simply need to ensure that either a) these data areas are confined to a small area of memory, so a short address can be used, or b) the high address bits are provided in some other way, e.g., a memory page select.

In the 6502, the hardware stack is confined to page one of RAM (addresses 01xxh) by the design of the CPU. The 8-bit stack pointer can be used for the Return Stack. The Parameter Stack is kept in page zero of RAM, which can be indirectly accessed by the 8-bit index register X. (Question for the advanced student: why use the 6502's X, and not Y? Hint: look at the addressing modes available.) " -- https://www.bradrodriguez.com/papers/moving1.htm


eForth (a (popular?) portable Forth with 31 primitives):

" The following registers are required for a virtual Forth computer:

Forth Register 8086 Register Function IP SI Interpreter Pointer SP SP Data Stack Pointer RP RP Return Stack Pointer WP AX Word or Work Pointer UP (in memory) User Area Pointer

...

eForth Kernel

System interface: BYE, ?rx, tx!, !io Inner interpreters: doLIT, doLIST, next, ?branch, branch, EXECUTE, EXIT Memory access: ! , @, C!, C@ Return stack: RP@, RP!, R>, R@, R> Data stack: SP@, SP!, DROP, DUP, SWAP, OVER Logic: 0<, AND, OR, XOR Arithmetic: UM+ " -- http://www.exemark.com/FORTH/eForthOverviewv5.pdf

---

here are CamelForth?'s ~70 Primitives:

http://www.bradrodriguez.com/papers/glosslo.txt

design rationale for this choice:

" 1. Fundamental arithmetic, logic, and memory operators are CODE. 2. If a word can't be easily or efficiently written (or written at all) in terms of other Forth words, it should be CODE (e.g., U<, RSHIFT). 3. If a simple word is used frequently, CODE may be worthwhile (e.g., NIP, TUCK). 4. If a word requires fewer bytes when written in CODE, do so (a rule I learned from Charles Curley). 5. If the processor includes instruction support for a word's function, put it in CODE (e.g. CMOVE or SCAN on a Z80 or 8086). 6. If a word juggles many parameters on the stack, but has relatively simple logic, it may be better in CODE, where the parameters can be kept in registers. 7. If the logic or control flow of a word is complex, it's probably better in high-level Forth. " -- http://www.bradrodriguez.com/papers/moving5.htm

               ANS Forth Core wordsThese are required words whose definitions are  specified by the ANS Forth document.

! x a-addr -- store cell in memory + n1/u1 n2/u2 -- n3/u3 add n1+n2 +! n/u a-addr -- add cell to memory

x1 x2 -- flag test x1=x2

> n1 n2 -- flag test n1>n2, signed >R x -- R: -- x push to return stack ?DUP x -- 0

@ a-addr -- x fetch cell from memory 0< n -- flag true if TOS negative 0= n/u -- flag return true if TOS=0 1+ n1/u1 -- n2/u2 add 1 to TOS 1- n1/u1 -- n2/u2 subtract 1 from TOS 2* x1 -- x2 arithmetic left shift 2/ x1 -- x2 arithmetic right shift AND x1 x2 -- x3 logical AND CONSTANT n -- define a Forth constant C! c c-addr -- store char in memory C@ c-addr -- c fetch char from memory DROP x -- drop top of stack DUP x -- x x duplicate top of stack EMIT c -- output character to console EXECUTE i*x xt -- j*x execute Forth word 'xt' EXIT -- exit a colon definition FILL c-addr u c -- fill memory with char I -- n R: sys1 sys2 -- sys1 sys2 get the innermost loop index INVERT x1 -- x2 bitwise inversion J -- n R: 4*sys -- 4*sys get the second loop index KEY -- c get character from keyboard LSHIFT x1 u -- x2 logical L shift u places NEGATE x1 -- x2 two's complement OR x1 x2 -- x3 logical OR OVER x1 x2 -- x1 x2 x1 per stack diagram ROT x1 x2 x3 -- x2 x3 x1 per stack diagram RSHIFT x1 u -- x2 logical R shift u places R> -- x R: x -- pop from return stack R@ -- x R: x -- x fetch from rtn stk SWAP x1 x2 -- x2 x1 swap top two items UM* u1 u2 -- ud unsigned 16x16->32 mult. UM/MOD ud u1 -- u2 u3 unsigned 32/16->16 div. UNLOOP -- R: sys1 sys2 -- drop loop parms U< u1 u2 -- flag test u1<n2, unsigned VARIABLE -- define a Forth variable XOR x1 x2 -- x3 logical XOR
x x DUP if nonzero
               ANS Forth ExtensionsThese are optional words whose definitions are specified by the ANS Forth document.

<> x1 x2 -- flag test not equal BYE i*x -- return to CP/M CMOVE c-addr1 c-addr2 u -- move from bottom CMOVE> c-addr1 c-addr2 u -- move from top KEY? -- flag return true if char waiting M+ d1 n -- d2 add single to double NIP x1 x2 -- x2 per stack diagram TUCK x1 x2 -- x2 x1 x2 per stack diagram U> u1 u2 -- flag test u1>u2, unsigned

               Private ExtensionsThese are words which are unique to CamelForth?. Many of these are necessary to implement ANS Forth words, but are not specified by the ANS document. Others are functions I find useful.

(do) n1

u1 n2u2 -- R: -- sys1 sys2
                             run-time code for DO(loop) R: sys1 sys2 --
sys1 sys2
                           run-time code for LOOP(+loop) n -- R: sys1 sys2 --
sys1 sys2
                          run-time code for +LOOP>< x1 -- x2 swap bytes  ?branch  x -- branch if TOS zero BDOS   DE C -- A                   call CP/M BDOS branch -- branch always lit    -- x         fetch inline literal to stack PC! c p-addr -- output char to port PC@ p-addr -- c           input char from port RP! a-addr -- set return stack pointer RP@ -- a-addr         get return stack pointer SCAN   c-addr1 u1 c -- c-addr2 u2 find matching char SKIP   c-addr1 u1 c -- c-addr2 u2 skip matching chars SP! a-addr -- set data stack pointer SP@ -- a-addr           get data stack pointer S= c-addr1 c-addr2 u -- n      string compare n<0: s1<s2, n=0: s1=s2, n>0: s1>s2 USER   n -- define user variable 'n' "

---

CoolGuySteve? 5 hours ago [–]

Systems programming in any language would benefit immensely from better hardware accelerated bounds checking.

While Intel/AMD have put incredible effort into hyperthreading, virtualization, predictive loading, etc, page faults have more or less stayed the same since the NX bit was introduced.

The world would be a lot safer if we had hardware features like cacheline faults, poison words, bounded MOV instructions, hardware ASLR, and auto-encrypted cachelines.

Per-process hardware encrypted memory would also mitigate spectre and friends as reading another process's memory would become meaningless.

reply

---

incidentally, here's the 'passing arguments' part of the osdev wiki page on syscalls:

https://wiki.osdev.org/System_Calls#Passing_Arguments

---

just to review, since it's been many months since i was able to work on this and i forgot a lot:

the current proposal for the Oot layer cake:

1. Boot/BootX? 2. 'Oot Assembly' or 'BootX?'; Loot 3. OVM 4. Oot Core 5. Pre-Oot 6. Oot

1. Boot/BootX?

Boot: An assembly language which is dead easy to implement. Competes with 'toy' assembly languages. Limited functionality. Sufficient for self-hosting compiler and interpreter. Non-concurrent. Does this support ASCII consoles (STDIN, STDOUT)? Does this support console interactivity, or only batch computation? Are the interop xcall etc instructions in here?

The goal of Boot is to be the easiest-to-implement target assembly language, which can support self-hosting and which can be incrementally extended to BootX?