proj-oot-ootAssemblyNotes20

see also ootNotes30, where i start working on an ISA that might be suitable for genetic algorithms and the like.

that gave me some ideas that should maybe be applied to Boot or to OVM:

also this makes me question if we shouldn't:

hmm.. it doesn't seem possible to add addressing modes to Boot without either:

a variable-length encoding doesn't have to be irregular; some instructions (like LDI and absolute jumps) can just take some extra words at the end. Which is probably fine in most contexts. But:

if we're willing to partially give up on program analysis, however, and restrict each function to ~25 local variables, and encode most constants programmatically, then it's perfectly possible to make a 32-bit version of OVM: 4 fields (opcode, op2, op1, op0), each field has 3 addr mode bits and 5 value bits. Alternately, the opcode field has 3 addr mode bits and 5 value bits, and the other 3 fields have 3 addr mode bits and 4 value bits, and we have 3 bits left over; two of them are used for instruction encoding format choice, and 1 bit is reserved for future use. Using the same scheme as with Boot, this gives us 31 3-operand instructions, plus 31 2-operand instructions, plus 31 1-operand instructions, plus 31 0-operand instructions; of course the 1-operand and 0-operand instructions aren't as useful, so that's basically 64 2-operand+ instructions. With 8 addressing modes and on the order of 64 major instructions it may not be a 1-weekend project, and it's still lacking some stuff from OVM, most notably generics, address space types, custom instructions, and hooks, but still it seems like a suitable place to start, maybe even more suitable than Boot.
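to make the first variant concrete, here's a toy packer/unpacker for that 32-bit encoding: four 8-bit fields (opcode, op2, op1, op0), each split into a 3-bit addressing-mode part and a 5-bit value part. The field order and bit order within the word are my assumptions, not a spec:

```python
def pack(fields):
    """fields: list of four (mode, value) pairs, opcode field first."""
    word = 0
    for mode, value in fields:
        assert 0 <= mode < 8 and 0 <= value < 32  # 3 mode bits, 5 value bits
        word = (word << 8) | (mode << 5) | value
    return word

def unpack(word):
    fields = []
    for shift in (24, 16, 8, 0):  # opcode, op2, op1, op0
        byte = (word >> shift) & 0xFF
        fields.append((byte >> 5, byte & 0x1F))
    return fields

insn = pack([(0, 17), (1, 3), (0, 4), (2, 9)])
assert unpack(insn) == [(0, 17), (1, 3), (0, 4), (2, 9)]  # round-trips
```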

otoh for the 'algorithmic compression' application we may not need main memory at all; a few registers are probably enough to make lots of interesting patterns. In which case we can get rid of the register indirect addressing mode and the load/stores, and we can probably even find some way to get down to 1 addr mode bit instead of 2 (although i'm not sure how yet).

but even if the primary encoding were 32 bits, 16 bits probably still makes some sense, as many instructions won't need the addressing modes, won't access registers above 16, and won't be using the extra opcodes.

does 8 bits make sense? Well, if we use the current version of boot, extended format instructions must have 3 bits set a certain way, and we certainly couldn't afford to lose 3 of 8 bits, so these may not (currently) be compatible. But ignoring that:

so it sounds to me like it would be good to retain the option to have 8 bits, but not to actually work on it (yet). It's probably not good for anything except for code density.

but 16 bits makes sense. And 32 bits makes sense (because, unless you want fewer than 8 registers, 16 bits is probably too small to have both 3 operands and good addr modes). And 64 bits makes sense (at least for the program analysis application, because you can have 256 registers, and for OVM, because you can have the generics type field). And reserving the option for something even longer makes sense (for my weird natural language inspired encoding ideas).

if you had the format encoding bits for all of these, however, then it might look like this:

working backwards, if we want the 64-bit format to use 4 format encoding bits, and we want 8 bits to use 1 format encoding bit, then 16 bits must be at least 2 format encoding bits, and 32 must be at most 3 format encoding bits. So, we'd want to reserve 4 3-operand opcodes in Boot (for 2 format-encoding bits). With the current Boot proposal, there aren't enough free opcodes for that.
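the constraint here is really that the format-encoding bits form a prefix code. One concrete assignment consistent with the counts above (the actual bit values are illustrative, not from any Boot/OVM spec):

```python
# leading bits -> instruction length; '1111' left reserved for extension
formats = {
    '0':    8,    # 1 format bit
    '10':   16,   # 2 format bits
    '110':  32,   # 3 format bits
    '1110': 64,   # 4 format bits
}

# prefix-freeness: no code may be a prefix of another, or the decoder
# can't distinguish formats from the leading bits alone
codes = list(formats)
for a in codes:
    for b in codes:
        assert a == b or not b.startswith(a)

# Kraft inequality: the codes fit in the available encoding space
assert sum(2.0 ** -len(c) for c in codes) <= 1.0
```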

We could do it if we demoted ANNOTATE to a 2-operand instruction, or if we demoted ld and st to 2-operand instructions. However, if you think about it, that isn't really enough; the 8-bit encoding demands an entire bit, not just 2 combinations out of 4. Also, if you think about it, why are we taking bits out of the opcode field anyway? That's the field that needs them the most. If we are short on bits, we really should demote the operand fields to 3-bits before we remove any bits from the opcode field. This would give us the first option above, 2 format encoding bits, 5 opcode bits, 3*3 operand bits; it would, of course, be nice to double our supply of 3-operand instructions. Of course, the issue with that is that since we're reserving 5 special registers, with 8 registers total that means we'd only have 3 GPRs.

Also, in the above, it seems strange to have 5 bits for the opcode for the 16-bit variant but only 4 bits for the opcode for the 32-bit variant. We could add that 'reserved bit' in the 32-bit variant to the opcode field, though -- that's nice but it doesn't do much when the addressing mode is not immediate or split, so it's a little bit of a waste.

Or we could just give up on the 8-bit encoding and the 32-bit encoding (which is the current plan). That's probably better because it's simpler -- but it misses out on 32-bits, the central 'sweet spot' that many other processors hit.

But if we have 5 instruction formats (well, 4 fixed-length ones), does the 'fixed-length' aspect even do us any good any more? Of course, we can always 'up-project' everything to the 64-bit form for program analysis, but if we're going to do that, why not allow JVM-style variable-length in the smaller bitwidths?

otoh i gotta say that the nice arithmetic progression in the number of format bits, and the idea of having a format for each power-of-two doubling, attracts me. The 16-bit and 32-bit ISA would still be easy to build (although now, with 5 bits, the 16-bit ISA could easily accommodate 'way too many' instructions -- a good problem to have, i guess).

also, note, instead of my 'natural-languagy-ISA' idea for the 'extended' language, we could just have something like SEXPRs (of course, SEXPRs could be easily encoded using my 'natural-languagy-ISA' so that's not an argument against it, except for simplicity).

and how does having 4 different instruction lengths with fixed field locations (or at least 3, since the 8-bit format might not follow that pattern) compare to JVM-style opcode-in-the-first-byte-plus-more-bytes-of-data-depending-on-opcode?

Adding the 2 format bits to the 16-bit instructions would make them almost unusable (at least, in terms of efficiency) for general computation, though; with only 3 GPRs it's very inefficient. This means that with this system, 16-bits is pretty much only good for compressing 32-bit code; so if we do this, maybe Boot should be moved to 32-bits (even though that means 8 addressing modes).

Also, it should be noted that this won't help with the thing that originally got me thinking about this today; for the purpose of compressing strings using Kolmogorov algorithmic complexity (finding short programs that make interesting patterns similar to the patterns in the string you are trying to compress), we'd prefer to have 16-bit instructions with fewer registers and addressing modes, instead of more registers and no addressing mode; and we don't want 32 opcodes (the format bits themselves are okay because we can just ignore them and not even try to mutate them). The bespoke 16-bit format i made up earlier, with 4-bit opcodes and 3 fields each with 2 addr mode bits and 2 value bits, would probably be better for this application. I guess the remaining contribution to the argument for more formats from that line of thinking is just that addressing modes are useful, and since 32-bit instruction encodings can have addressing modes, it's a shame to just skip over 32-bit right to 64-bit.

other objections are:

Perhaps my biggest objection is that this destroys the beauty/usefulness of the 16-bit format, by cutting it from 11 GPRs to 4, and for what? More complexity.

hmm. i can't decide.

an alternative would be to reduce the operand length in the 64-bit format, freeing up more format encoding bits up there. That allows us to have one more format encoding bit in the 32-bit format (using the reserved bit; so we'd have 4 operand bits instead of 5 in the 32-bit format). Then if we eliminate the 8-bit format, we can go back to only reserving two opcodes (instead of any full bits) for the 16-bit format to indicate a format change.

OK i went back and looked at the old pymovm source to see how bad it actually is to add addressing modes. In terms of interpreter complexity, it's bad but not horrible. However, it also adds significant complexity to the assembler because you have to parse the addressing modes. Another issue with addressing modes is that now none of our formats are good RISC formats (no implicit load/store, and many registers); instead, we have one good 32-bit CISC format, and two subencodings which could be mixed in for code density (16-bit and 8-bit).

i guess the real issue is: the 16-bit current proposal has a sort of beauty. Most full, fixed-length, simple, orthogonal, 3-operand instruction sets are at least 32 bits, but here was one that was just 16 bits -- twice as good code density. And it really felt like it could be both simple to implement even in a small amount of ROM, and sufficient for all sorts of stuff. But this new proposal robs it of its beauty; instead we get a 16-bit mode that is not sufficient for stuff, and a futuristic mixed ISA with so many parts that the implementation would get larger.

and the new proposal still doesn't give us what i was thinking about when i came up with the new proposal; a 16-bit ISA that would be great for genetic algorithms (because for that we want addressing modes, and even fewer registers).

So it seems we have three design points in front of us; a 32-bit ISA with extensions up and down from 8- to 64-bits+, suitable for general computation but not terribly easy to implement; a beautiful 16-bit ISA with room to extend to 64-bits+ but not to 8- or 32, suitable for Boot; and a beautiful 16-bit ISA with addressing modes but only 2 registers, suitable for genetic algorithms.

Can we gain anything by giving up the extension down to 8-bits? This frees up one bit from the 16-bit mode. We can then give up the 32 opcodes (go back from 5-bit to 4-bit opcode field) to get another bit. We still need the third bit to switch into 32-bit mode, so we can extend 2 operands to 4-bits, but not all 3. This is not as elegant but it is sufficient; we extend op0 and op1, which lets us do CPYs (MOVs) and any other 2-operand stuff with all 16 registers; the restriction is only for 3-operand instructions, and only in the destination operand, which is restricted to the first 8 regs. So there will be some extra CPYs here sometimes, but that's it. This has the additional benefit of freeing up 1 bit in 32-bit mode, and 1 bit in 64-bit mode. But it completely sacrifices 8-bit mode, which may be useful for compression; and 16-bit mode, while now passable, is maybe not beautiful like it was.

Can we do better? Yes; if we wanted 8-bit mode to be a real fixed-length simple ISA then we need all the bits we can get there. But if we only wanted to use 8-bit mode for compressed instructions (a predefined lookup table of common instructions, giving extra weight to instructions that require 64-bit or 32-bit mode, and/or if we need alignment, then common components of common bigrams of instructions), then we can give up more than one bit for formatting there. If we give up two bits for formatting in 8-bit mode (allowing the remaining 6 bits to specify one of 64 common instructions, instead of 128 for 7 bits), then we can give back the extra reserved bits in 32-bit mode and 64-bit mode, and have 1 formatting bit for 16-bit mode, 2 for 8-bit mode, 3 for 32-bit mode, and 4 for 64-bit mode. So we can go back to having 4 modes, and 16-bit mode can again access 16 registers, except not in the output operand. Its beauty is marred but it's again passable as a bootstrap language.
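the allocation just described (1 format bit for 16-bit mode, 2 for 8-bit, 3 for 32-bit, 4 for 64-bit) can be sketched as a decoder; only the bit *counts* come from the text above, the actual bit values are assumptions:

```python
def decode_format(bits):
    """bits: string of leading instruction bits, most significant first."""
    if bits.startswith('0'):
        return 16           # 1 format bit: the bootstrap workhorse
    if bits.startswith('10'):
        return 8            # 2 format bits: 64 compressed/common instructions
    if bits.startswith('110'):
        return 32           # 3 format bits
    if bits.startswith('1110'):
        return 64           # 4 format bits
    raise ValueError('reserved encoding')

assert decode_format('0101') == 16
assert decode_format('1011') == 8
assert decode_format('1100') == 32
assert decode_format('1110') == 64
```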

I like this.

There is another way, though. We could keep 16-bit mode how it is, with 2 adjacent instructions reserved for instruction encoding. This means that any other mode burns 3 bits to say that it isn't 16-bit mode. Instead of adding a fourth or fifth instruction selection bit to 32-bits, we can do the same thing in 32-bits (this time reserve 4 instructions), meaning that 6 bits are needed for an instruction to say that it is neither 16-bit nor 32-bit. Now in 64-bit mode, we reduce the operand value size from 8 to 7 bits, and get 5 bits to spare; since we already had 4 reserved, now we have 9, of which we consume 6, and consume 1 more for the extended format, leaving 2 bits reserved for 64 bits. This is probably better in a practical sense, because the 16-bit mode probably needs its third operand more than 64-bits needs to include 1 more bit of values. But it's aesthetically annoying, because now 64-bit mode can't load any 16-bit immediate in one instruction, and can't use 8-bit immediate constants.

Is there another way? Yes -- we could demote ANNOTATE to 2-operands, remove LDX and STX (load indirect indexed) from BootX?, then reserve 4 3-operand instructions in the 16-bit mode. Now other modes need 2 bits to say that they aren't 16-bit mode. So there is no 8-bit mode, and 32-bit mode has 3 format bits, and 64-bit mode has 4 format bits.

With all of these reserved 3-operand instructions, though, we aren't really getting the full benefit out of the 3-operand format; we end up making operations like MUL, SUB, SLL, AND, OR, XOR in-place. In fact, the only 'normal, computational' 3-operand instruction is ADD!

---

i figured out how to do CALL and RET, where CALL takes not only an address to CALL, but arguments to pass in, and an operand with the effective address of where to put the return value:

just push the effective address for the return value onto the stack along with the return address. Then RET uses this to determine where to put the returned value.
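a toy interpreter fragment sketching that scheme: CALL pushes both the return-value destination (here modelled as a register index) and the return address; RET pops both and writes the return value through the saved destination. All names are made up for illustration:

```python
regs = [0] * 8
stack = []

def call(pc, target, dest_reg):
    stack.append(dest_reg)      # effective address for the return value
    stack.append(pc + 1)        # return address
    return target               # new pc

def ret(return_value):
    return_addr = stack.pop()
    dest_reg = stack.pop()
    regs[dest_reg] = return_value
    return return_addr          # resume in the caller

pc = call(pc=10, target=100, dest_reg=3)   # jump into the callee
pc = ret(42)                               # callee returns 42
assert pc == 11 and regs[3] == 42
```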

one issue with this is that it is weird; it may hinder interop with other runtimes; for example it sort of prevents the CALL from being in tail position; even if we have a separate TAILCALL form of call, now we are expecting RET to know whether the caller called it as tail or not. Better would be just to consider CALL a compile-time macro that desugars to a real CALL and then a CPY; but that doesn't quite work either because we need to be able to execute dynamically-generated CALLs.

---

this guy making a Forth assembly language layer feels that Forth's separate stacks can be a detriment (which suggests that maybe we should just unify the DATASTACK and CODESTACK like most other languages out there):

" 2) The entire point of PAF is to get rid of separate stacks and separate stack pointers, to allow to put all stack items in registers that fit, and use a common stack for the spilled stack items. Implementing PAF on top of a conventional Forth would not achieve that objective. "

---

this guy (Hugh Aguilar) comments on the PAF project saying that you can't really make a portable assembly layer below a HLL unless the processors share a reasonable minimum number of registers and a reasonably orthogonal instruction set (i don't really see how this could be true; perhaps he means 'assuming you want fairly good performance'?).

he opines that today, you can assume 16 registers, which is enough for all popular modern processors:

" The idea of a portable assembler is good, as this would allow Forth to be quickly ported to any processor that fits the basic profile (there is not much difference between the MIPS, ARM and ColdFire?, for example) --- all that would be needed would be to rewrite the portable-assembler, and then the Forth system program would assemble for a new processor --- but the target processors have to be similar in that they all have the minimum number of registers and a reasonably orthogonal instruction set (there is a lot of difference between the 6502 and the Z80, or between the 6812 and the 8086, so a portable assembler wasn't realistic for them).

This is what I said in the thread mentioned previously: On Tuesday, January 20, 2015 at 9:40:31 PM UTC-7, hughag...@gmail.com wrote: > This is a good idea. This idea has been around since the 1980s though, so you shouldn't take credit for it. > > You can't abstract away the register set size --- you have to assume a certain number of registers. This is why the idea is feasible now, and wasn't feasible in the 1980s --- because you can assume 16 registers now (this leaves out the 32-bit x86 though, but it is obsolete anyway) --- in the old days, processors had very few registers, and juggling these few registers was different on every processor (the 6812 and the 8086, for example). > > Your PAF won't work on the MiniForth? though --- it will only work on mainstream processors. > > Anyway, what I'm doing right now is targeting the 8-bit AVR. This is the smallest processor that I can imagine my Forth running on (16 registers with 3 that can be used as pointers) --- porting it to a larger more powerful processor should be easy (although going in the other direction, from a powerful processor to a more limited one, would be difficult). I don't actually consider the AVR to be very useful, because it is not powerful enough for most applications and it is well on its way to being obsolete --- I am just interested in the AVR as a proof-of-concept platform. "

[2]

he claims that if you assume 16 registers, you can target x86-64 and ARM, which cover most of desktop computing and most microcontrollers -- and he claims that not making that commitment is worse than nothing, because then you will effectively be targeting zero CPUs.

---

a Forth guy says:

"

I'm extremely frustrated because I think that the Forth leadership has lost their way. Your Gforth is not about performance; two decades after the MiniForth?, you aren't even in the ballpark yet. Gforth was written in GCC to make it portable, but this assumes that GCC is portable everywhere, which is not true (GCC is only portable to processors with a C-style locals' frame). Also, portability for Gforth is meaningless anyway because nobody is using Gforth (too slow) and so no programs are getting ported because there aren't any programs. Now you have this PAF that seems highly speculative as to whether it will actually work, or if the language will still be Forth after it has been contorted enough to make it work --- you believe that the point is portability --- but portability was solved years ago when everybody standardized on the 64-bit x86 (for desktop-computers) and ARM (for micro-controllers), so you are solving a problem that has already been solved.

The point is performance. "

[3]

---

the PAF guy notes:

" The parameter of PICK has to be a constant or otherwise its value must be statically determined in order to keep the stack accesses statically determinable. "

that's a good point...

---

another comments on PAF:

" >You are going to have one common stack to handle spill-over for multiple stacks? WTF??? This doesn't sound trivial, or even possible.

Right, it's not trivial. It took me many years to find out which features of Forth are incompatible with this idea; these features are absent from PAF, and that makes it possible. "

---

6502 only has 3 registers (A, X, Y) (excluding the stack pointer, the status flags, and the program counter (PC)) [4]. Which suggests that having 3 GPRs, in addition to a stack pointer (we also have a TOS alias register, and actually we have multiple stack pointers), is (barely) doable. However, the 6502 also has the 'zero page', and "code for the 6502 uses the zero page much as code for other processors would use registers" [5] -- TODO i'll have to look into this a little to see if the zero-page is adding expressivity or if it's just making 'load'-style instructions shorter by "saving the cycle normally required to fetch the high-order byte of the address" [6]

also, the 6502 instruction set might be a good place to start to look for things for the 8-bit mode, although it would probably be better to just profile real code later and find commonly-used instructions from the longer modes.

---

otoh GNU lightning has 6 GPR regs (3 caller-saved and 3 callee-saved), which would suggest that 3 isn't nearly enough. Also, the PAF critic (Hugh Aguilar) quoted above draws the line somewhere around 8 registers (he doesn't say 8, but he gives the examples of the 8086 and the 6812, which have 8 more-or-less GPRs (according to Wikipedia) and 4 registers (2 accumulators A and B and two others X and Y) respectively). 3 regs is fewer than either of those.

This makes me think maybe we should just bite the (addressing mode) bullet and move Boot to 32-bit.

---

also, todo i should look at the m68k series since everyone loves how orthogonal it is.

---

[7] gives some examples of older processors with few registers and how that makes the ISAs very different:

"there is a lot of difference between the 6502 and the Z80, or between the 6812 and the 8086, so a portable assembler wasn't realistic for them"

Let's see how many registers these had:

6502:

" one 8-bit accumulator register (A), two 8-bit index registers (X and Y), 7 processor status flag bits (P), an 8-bit stack pointer (S), and a 16-bit program counter (PC). The stack's address space is hardwired to memory page $01, i.e. the address range $0100–$01FF (256–511). Software access to the stack is done via four implied addressing mode instructions, whose functions are to push or pop (pull) the accumulator or the processor status register. The same stack is also used for subroutine calls via the JSR (Jump to Subroutine) and RTS (Return from Subroutine) instructions and for interrupt handling." [8]

Z80:

8080 registers:

"

The new registers introduced with the Z80 are:

" There is no direct access to the alternate registers; instead, two special instructions, EX AF,AF' and EXX,[32] each toggles one of two multiplexer flip-flops. This enables fast context switches for interrupt service routines: EX AF, AF' may be used alone, for really simple and fast interrupt routines, or together with EXX to swap the whole BC, DE, HL set. This is still several times as fast as pushing the same registers on the stack. Slower, lower priority, or multi level interrupts normally use the stack to store registers, however. "

"No multiply instruction is available in the original Z80."

[9]

so the normalish registers for the 8080 are (by 'normalish' i mean to exclude the stack pointer):

so the normalish registers for the Z80 are:

6812:

"two 8-bit accumulators A and B (referred to as a single 16-bit accumulator, D, when A & B are cascaded so as to allow for operations involving 16 bits), two 16-bit registers X and Y, a 16-bit program counter, a 16-bit stack pointer and an 8-bit Condition Code Register." [10]

so, the normalish registers for the 6812 are:

8086: "The 8086 has eight more or less general 16-bit registers (including the stack pointer but excluding the instruction pointer, flag register and segment registers). Four of them, AX, BX, CX, DX, can also be accessed as twice as many 8-bit registers (see figure) while the other four, BP, SI, DI, SP, are 16-bit only." [11]

so, the normalish registers for the 8086 are:

---

done

done

---

so, reflecting on the previous, my 16-bit proposal does seem at first a little register-poor, in that the 3-operand instructions have one operand that can only access 4 GPRs. However, the 2-operand and less instructions can access all 16 registers, and 2 of the 3 operands for the 3-operand ones can too (which in many cases is all the register operands anyways), and many (or even all?) of the 4 above-listed ISAs have 2-operand or less instruction encodings anyhow. So far it's really just the 3-operand ADD that suffers a lack of generality in my proposal. Also, i'm not counting the T (TOS) register, but in many cases it probably helps with register pressure too.

so i think we're still good. Might want to add an alternate in-place ADD instruction.

---

done

---

making op2 3 bits instead of 4 bits will make it impossible to fit the following important 2-operand instructions into BootX?: bit(or|and|xor)-int, picki, rolli (although we could demote to 1-operand, make the bitwise guys in-place, and make picki and rolli only apply to SMALLSTACK) -- is that worth having the 8-bit encoding??? Yeah it probably is... This will also put pressure on the other stuff currently in BootX? 1-operand, but again it's probably worth that, esp. since now we're adding a new 32-bit representation with more opcode space.

done

---

" important RISC-V (ISA ) 2 posts by 2 authors Awais Ahmed 12/5/17

I have to implement RISC-V architecture (ISA) in C++. As all ISA can not be implemented, can some one tell me about most important approx. 40 Instructions set ? please help me in this matter. Bruce Hoult 12/6/17 Other recipients: awaan....@gmail.com ADDI, SLTI, SLTIU, ANDI, ORI, XORI, SLLI, SRLI, SRAI, LUI, AUIPC, ADD, SLT, SLTU, AND, OR, XOR, SLL, SRL, SUB, SRA, JAL, JALR, BNE, BEQ, BLT, BLTU, BGE, BGEU, LOAD, STORE.

That's 31, as listed in the User-Level ISA manual for the 32 bit integer instruction set, minus only FENCE, CSR transfer, ECALL, EBREAK. I guess you're still within your 40 instructions limit even with them, if you don't break out LOAD and STORE into LB/LH/LW/LBU/LWU/SB/SH/SW when counting them.

"

---

" Let's say you're working on code for which speed is paramount, perhaps a new video codec or a library to process tensors. There are individual instructions that will dramatically improve performance, like fused multiply-add, as well as entire instruction sets like SSE2 and AVX, that can give the critical portions of your code a speed boost. "

---

"Some applications use different/narrower definitions of float. For example GPUs have 16 bit floats and TPUs (ie. AI co-processors) also use narrower floats which are specialized for neural networks."

---

pros and cons of eliminating the unspecified bitwidth of ints and fixing it to 16-bits:

hmm.. the transcribing pointers to integers case is a doozy... maybe we need to keep it unspecified, because we do want to allow implementations the freedom to address regions larger than 64k or to use native pointers

---

actually we don't need 'cast-uint-ptr', because given a set of pointers and their distances from each other, you can just take their distances from the earliest one, and then given some other base pointer, you can recreate that structure using add-ptr-int.

similarly, we don't need a 'cast-ptr-uint' because we have pointer subtraction.
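the argument above can be sketched with pointers modelled as plain integers: record each pointer's distance from the earliest one (using only pointer subtraction), then rebuild the same relative structure from any new base pointer (using only add-ptr-int), with no uint-to-pointer cast anywhere:

```python
ptrs = [1040, 1000, 1024]
base = min(ptrs)
offsets = [p - base for p in ptrs]       # only ptr - ptr subtraction used

new_base = 5000
rebuilt = [new_base + off for off in offsets]  # only add-ptr-int used

# the relative structure is preserved
assert [p - min(rebuilt) for p in rebuilt] == offsets
```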

---

one idea for how it can come together now:

Boot can express pure serial computations as a streaming iterator, that is, a one-way stream of input and a one-way stream of output. BootX adds the ability to express concurrency and I/O. OVM adds the 32-bit, 8-bit, and 64-bit encodings.

another idea:

Boot has the base I/O (syscalls) already. BootX? adds concurrency and 32-bit, 8-bit. OVM adds 64-bit (and all the complicated OVM stuff).

---


AspenCore? 2017 Embedded Markets Study: top answers for:

Which of the following 8-bit chip families would you consider for your next embedded project?

Microchip PIC: 46%
Atmel AVR: 43%
STMicroelectronics ST6, ST7, ST8: 18%
Freescale HC: 13%

Which of the following 16-bit chip families would you consider for your next embedded project?

Microchip PIC24 / dsPIC: 45%
TI MSP430: 42%
STMicroelectronics ST9, ST10: 22%
Freescale HC16: 15%

Which of the following 32-bit chip families would you consider for your next embedded project?

STMicro STM32 (ARM): 30%
Microchip PIC 32-bit (MIPS): 20%
Xilinx Zynq (with dual ARM Cortex-A9): 17%
Freescale i.MX (ARM): 17%

Which vendor has the best ecosystem for your needs?

Microchip or Atmel (Microchip): 14%
Texas Instruments (TI): 14%
ST Microelectronics: 11%
NXP/Freescale/Qualcomm: 11%
Xilinx: 5%

-- [18]

(already copied to plChIsaComparisons)

this suggests that the guys for us to look at are:

8-bit: PIC, AVR, mb ST6,7,8
16-bit: PIC24/dsPIC, MSP430, mb ST9,ST10
32-bit: ARM, MIPS

top vendors: microchip, atmel, TI, ST, NXP/Freescale/Qualcomm

---

openasocket 1 day ago

I'm not an expert on these things, but it seems to me if you're implementing a database in Java you wouldn't want to keep your data on the JVM Heap, as this seems to indicate. My understanding is that in most applications (like servers) the average object lives for a very short period of time, and most GC implementations are built from that idea. But, in a database, especially an in-memory database, the majority of the objects are going to live for a very long time. That makes the mark phase of GC a lot more expensive, puts more pressure on the generations, etc.

Is my guess here correct, or are there things I'm missing or mistaken on?


jakewins 1 day ago

This is correct; the standard approach here is to use regular c-style memory management for the data the system is managing, and the JVM heap only for the database "infrastructure".

This hybrid approach gives the benefit of a managed runtime and safety of GC for most of your code, but allows the performance of raw pointers/malloc for key code paths.

Some examples of this pattern on the JVM:


ndesaulniers 16 hours ago

> but allows the performance of ... malloc for key code paths.

Everything is relative, I guess.


---

interestingly, in http://www.ultratechnology.com/moore4th.htm , Chuck Moore gives his opinion that CPUs should have 2 stacks, but that:

" A Forth word should not have more than one or two arguments. This stack which people have so much trouble manipulating should never be more than three or four deep...On the i21 we have an on-chip stack 18 deep. This size was chosen as a number effectively infinite. "

so our SMALLSTACK would seem to meet his requirements.

My only concern is that in OVM we're using most of SMALLSTACK for custom instructions, so that most of it isn't really available for OVM programs (maybe we could reserve the top 4 locations to be available, which is all Moore says is needed, but i like to have a little leeway just in case); if we have 3 layers of custom instructions, and each layer consumes 4 spots of stack space, then we only have 4 spots left. This discrepancy with Moore is probably because, for custom instructions, we are using SMALLSTACK as a call stack/result stack, not just a temporaries/data/operand stack. So maybe we should expand SMALLSTACK from 16 to 32. In OVM, we'll require that the custom instructions can cumulatively use only 16 of those 32 positions, leaving 16 spots for OVM user code.

If each spot is 64-bits (8 bytes), then 32 spots is 256 bytes. If each spot is 16-bits (2 bytes), then 32 spots is 64 bytes. This still seems reasonable, if not as nice.
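a quick check of the budget arithmetic above (the numbers are the ones proposed in the text, not from any spec):

```python
SMALLSTACK_DEPTH = 32
custom_layers, spots_per_layer = 3, 4

used_by_custom = custom_layers * spots_per_layer   # 12 spots in the worst case
custom_budget = 16                                 # cap on cumulative custom use
user_spots = SMALLSTACK_DEPTH - custom_budget      # left over for OVM user code

assert used_by_custom <= custom_budget
assert user_spots == 16
assert SMALLSTACK_DEPTH * 8 == 256   # bytes if each spot is 64 bits
assert SMALLSTACK_DEPTH * 2 == 64    # bytes if each spot is 16 bits
```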

With this, you can't reach the entire SMALLSTACK with a single displaced LOAD or STORE in 16-bit mode -- but you can't anyways, since we narrowed the displacement operand to 3 bits.

---

what about the JVM's (operand) stack? The stack ops appear to be just DUP (and DUP variants) and SWAP. No ROT, ROLL, OVER as far as i can see.

The operand stack is statically allocated per-frame and the parameter for this is typed as 'u2', i.e. a 16-bit unsigned integer, so it seems that the operand stack size could go up to 64k. Way too big for us.

how about Python (CPython)? The relevant stack is called the 'evaluation stack' or 'valuestack'. The relevant size appears to be (?) "co_stacksize", which is an "int" in C.

a comment in https://github.com/python/cpython/blob/520b7ae27e39d1c77ea74ccd1b184d7cb43f9dcb/Python/ceval.c says "The stack can grow at most MAXINT deep, as co_nlocals and co_stacksize are ints"

(interestingly, CPython has a test, https://github.com/python/cpython/blob/b21d155f57d284aecf9092a9bd24258293965c2f/Lib/test/test_compile.py 'check_stack_size', that appears to check that the stack size grows as O(log(N)) with bytecode length)

so again, way too big for us.

---

If you had asked me to guess what the fundamental stack ops are, i would have said DUP SWAP ROT.

DROP is also fundamental but we already got that (POP to the register DISCARD)

Some dynamic frequency profiling finds that DUP and SWAP and OVER seem to be the most used guys (in one case OVER is even more frequent than SWAP) (eg figure 5 in The Common Case in Forth Programs). OVER is sometimes mentioned before ROT [19]. ROT isn't used much.

The Gforth 'core' stack ops are drop dup over swap rot ?dup (and some '2's, eg 2dup) [20]

The JVM has DUP and SWAP but not OVER or ROT.

Moore's colorForth has DUP and OVER but no SWAP or rot (in colorForth, SWAP can be synthesized from OVER and PUSH and POP and OR).

http://wiki.laptop.org/go/Forth_stack_operators mentions dup ?dup drop swap over rot -rot nip tuck

https://www.forth.com/starting-forth/2-stack-manipulation-operators-arithmetic/ mentions swap dup over rot drop

http://www.murphywong.net/hello/simple.htm mentions DUP ?DUP SWAP OVER ROT NIP TUCK , in addition to the 2s

https://softwareengineering.stackexchange.com/questions/179501/which-are-the-fundamental-stack-manipulation-operations the question asker mentions drop dup swap rot, and then later nip -rot tuck over. One of the answers, [21], mentions dup swap drop.

http://useless-factor.blogspot.com/2007/09/survey-of-stack-shufflers.html mentions drop, dup, swap, and then later, swapd, rot, nip, dupd, over, tuck, pick. At the bottom, e mentions drop swap dup over tuck rot dip keep, and then drop swap dup.

(i'm omitting mentions of PICK and ROLL btw, since they are more generic, and since we dont have opcode space for them in 16-bit Boot anyhow; also not PUSH and POP b/c these are obviously fundamental)

But people talk about ROT a lot (eg in intro-to-Forth stuff like [22]) so maybe i'll leave it in. Otoh ?dup (which does nothing if the TOS is 0, and otherwise executes a DUP [23]) seems to be more frequent than rot, and is also 'core' in Gforth. So maybe leave out ROT. However, https://users.ece.cmu.edu/~koopman/stack_computers/sec6_3.html "Table 6.1. Dynamic Instruction Execution Frequencies for important Forth primitives" includes ROT in the top 20.

hmm.. i dont think ROT belongs in there. but i'll leave it there for now.
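
For concreteness, here's a minimal Python sketch of the semantics of these shuffle words on a list-backed stack (top of stack is the end of the list); ?dup is spelled qdup since '?' isn't a valid Python identifier:

```python
# Forth-style stack shuffle words on a Python list; TOS is s[-1].

def dup(s):  s.append(s[-1])                # ( a -- a a )
def drop(s): s.pop()                        # ( a -- )
def swap(s): s[-1], s[-2] = s[-2], s[-1]    # ( a b -- b a )
def over(s): s.append(s[-2])                # ( a b -- a b a )
def rot(s):  s.append(s.pop(-3))            # ( a b c -- b c a )

def qdup(s):                                # ?dup: DUP only if TOS is nonzero
    if s[-1] != 0:
        s.append(s[-1])

s = [1, 2, 3]
rot(s)       # now [2, 3, 1]
over(s)      # now [2, 3, 1, 3]
```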

---

i didn't add this to my PL book notes b/c it looks like this is just a side project for learning by one guy but for my own future reference:

[24]

the ISA is:

    VALUE       OPCODE  EXPLANATION
    0x00000000  NOP     do nothing
    0x00000001  ADD     pop a, pop b, push a + b
    0x00000002  SUB     pop a, pop b, push a - b
    0x00000003  AND     pop a, pop b, push a & b
    0x00000004  OR      pop a, pop b, push a | b
    0x00000005  XOR     pop a, pop b, push a ^ b
    0x00000006  NOT     pop a, push !a
    0x00000007  IN      read one byte from stdin, push as word on stack
    0x00000008  OUT     pop one word and write to stream as one byte
    0x00000009  LOAD    pop a, push word read from address a
    0x0000000A  STOR    pop a, pop b, write b to address a
    0x0000000B  JMP     pop a, goto a
    0x0000000C  JZ      pop a, pop b, if a == 0 goto b
    0x0000000D  PUSH    push next word
    0x0000000E  DUP     duplicate word on stack
    0x0000000F  SWAP    swap top two words on stack
    0x00000010  ROL3    rotate top three words on stack once left, (a b c) -> (b c a)
    0x00000011  OUTNUM  pop one word and write to stream as number
    0x00000012  JNZ     pop a, pop b, if a != 0 goto b
    0x00000013  DROP    remove top of stack
    0x00000014  PUSHIP  push a in IP stack
    0x00000015  POPIP   pop IP stack to current IP, effectively performing a jump
    0x00000016  DROPIP  pop IP, but do not jump
    0x00000017  COMPL   pop a, push the complement of a

looks like i got all that already in Boot..

notably, the stack ops are DUP SWAP ROL3

a blog post by the same guy is: https://csl.name/post/vm/

the operations in the blog post are:

% cast_int cast_str drop dup if jmp over print println read stack (dump stack) swap

again, looks like i got all those, except 'dump stack' and div and mod and cast_str (but i have read and write in bytes so cast_str sort of isn't needed -- although of course it would be needed to print out a number, unless you wrote sprintf in the language)
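
As a sanity check on how little machinery an ISA like [24]'s needs, here's a toy dispatch loop in Python covering a handful of the ops (PUSH taking its operand from the instruction stream, ROL3, JZ). The opcode values match the table above; the halt-by-running-off-the-end behavior is my assumption, since the listed ISA has no explicit HALT:

```python
# Toy interpreter for a few ops of the stack ISA above.
ADD, JZ, PUSH, DUP, SWAP, ROL3 = 0x01, 0x0C, 0x0D, 0x0E, 0x0F, 0x10

def run(prog):
    stack, pc = [], 0
    while pc < len(prog):                  # running off the end halts
        op = prog[pc]; pc += 1
        if op == PUSH:                     # PUSH: push next word
            stack.append(prog[pc]); pc += 1
        elif op == ADD:                    # pop a, pop b, push a + b
            a, b = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == DUP:
            stack.append(stack[-1])
        elif op == SWAP:
            stack[-1], stack[-2] = stack[-2], stack[-1]
        elif op == ROL3:                   # (a b c) -> (b c a)
            stack.append(stack.pop(-3))
        elif op == JZ:                     # pop a, pop b, if a == 0 goto b
            a, b = stack.pop(), stack.pop()
            if a == 0:
                pc = b
        else:
            raise ValueError(f"unimplemented opcode {op:#x}")
    return stack

# e.g. run([PUSH, 2, PUSH, 3, ADD]) leaves [5] on the stack
```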

---

i read this post and the comments down to https://yosefk.com/blog/my-history-with-forth-stack-machines.html

---

[25] a person asks for the most important 40 or so RISC-V instructions and someone answers:

" ADDI, SLTI, SLTIU, ANDI, ORI, XORI, SLLI, SRLI, SRAI, LUI, AUIPC, ADD, SLT, SLTU, AND, OR, XOR, SLL, SRL, SUB, SRA, JAL, JALR, BNE, BEQ, BLT, BLTU, BGE, BGEU, LOAD, STORE.

That's 31, as listed in the User-Level ISA manual for the 32 bit integer instruction set, minus only FENCE, CSR transfer, ECALL, EBREAK. I guess you're still within your 40 instructions limit even with them, if you don't break out LOAD and STORE into LB/LH/LW/LBU/LWU/SB/SH/SW when counting them. "

i got many of those, except:

---

what is shift right arithmetic good for? It's good for not losing the sign bit:

http://chortle.ccsu.edu/assemblytutorial/Chapter-14/ass14_13.html
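
A quick Python illustration of the difference (srl16/sra16 are made-up helper names, assuming 16-bit two's-complement values): logical shift right shifts in zeros, destroying the sign, while arithmetic shift right replicates the sign bit.

```python
def srl16(x, n):
    """Logical shift right on a 16-bit value: zeros shift in from the left."""
    return (x & 0xFFFF) >> n

def sra16(x, n):
    """Arithmetic shift right on a 16-bit value: the sign bit is replicated."""
    x &= 0xFFFF
    if x & 0x8000:              # negative in two's complement
        x -= 0x10000            # reinterpret as a signed Python int
    return (x >> n) & 0xFFFF    # Python's >> is arithmetic on negative ints

# -16 is 0xFFF0 in 16-bit two's complement:
#   srl16(0xFFF0, 2) == 0x3FFC  (a large positive number: sign lost)
#   sra16(0xFFF0, 2) == 0xFFFC  (still -4: sign preserved)
```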

---

8-bit instruction ideas:

0. SMALLSTACK DUP 1. SMALLSTACK SWAP 2. SMALLSTACK OVER 3. SMALLSTACK DROP 4. SMALLSTACK ADD-int

5. POP SMALLSTACK and PUSH to MEMSTACK 6. POP MEMSTACK and PUSH to SMALLSTACK

7. POP SMALLSTACK into ERR 8. POP SMALLSTACK into top of MEMSTACK 9. POP SMALLSTACK into R4 10. POP SMALLSTACK into R5 11. POP SMALLSTACK into R6 12. POP SMALLSTACK into R7

13. PUSH ERR onto SMALLSTACK 14. PUSH T onto SMALLSTACK 15. PUSH R4 onto SMALLSTACK 16. PUSH R5 onto SMALLSTACK 17. PUSH R6 onto SMALLSTACK 18. PUSH R7 onto SMALLSTACK

19. PUSHPC onto MEMSTACK 20. POP MEMSTACK and JD

21. POP MEMSTACK into ERR 22. POP MEMSTACK into T 23. POP MEMSTACK into R4 24. POP MEMSTACK into R5 25. POP MEMSTACK into R6 26. POP MEMSTACK into R7

27. PUSH ERR onto MEMSTACK 28. PUSH T onto MEMSTACK 29. PUSH R4 onto MEMSTACK 30. PUSH R5 onto MEMSTACK 31. PUSH R6 onto MEMSTACK 32. PUSH R7 onto MEMSTACK

33. BNZ-int SMALLSTACK 34. BZ-int SMALLSTACK 35. BNZ-int MEMSTACK 36. BZ-int MEMSTACK

37. BNE-int SMALLSTACK 38. BNE-ptr SMALLSTACK 39. BEQ-int SMALLSTACK 40. BEQ-ptr SMALLSTACK 41. BNE-int MEMSTACK 42. BNE-ptr MEMSTACK 43. BEQ-int MEMSTACK 44. BEQ-ptr MEMSTACK

45. BLE-int SMALLSTACK 46. BLE-ptr SMALLSTACK 47. BLT-int SMALLSTACK 48. BLT-ptr SMALLSTACK 49. BGE-int SMALLSTACK 50. BGE-ptr SMALLSTACK 51. BGT-int SMALLSTACK 52. BGT-ptr SMALLSTACK 53. BLE-int MEMSTACK 54. BLE-ptr MEMSTACK 55. BLT-int MEMSTACK 56. BLT-ptr MEMSTACK 57. BGE-int MEMSTACK 58. BGE-ptr MEMSTACK 59. BGT-int MEMSTACK 60. BGT-ptr MEMSTACK

61. LD SMALLSTACK 62. LD MEMSTACK 63. ST SMALLSTACK 64. ST MEMSTACK

ok that was quick... so yeah i guess it's easy to fill up 6 bits if you allow many combinations of targets between MEMSTACK, SMALLSTACK, the first 8 registers, ints, and ptrs.

of course for real it would be better just to leave this reserved and profile later, but i'm still interested. So let's try again.

so what if we only allow most compute ops on SMALLSTACK, not MEMSTACK? also, we'll prohibit stack JREL and branches, because we don't want to be able to jump to fully-dynamically-computed PC addresses (which is what would happen if we could jump to any offset on the stack):

8-bit instruction ideas: 0. ANNOTATE

1. SMALLSTACK SWAP 2. SMALLSTACK OVER 3. SMALLSTACK DROP 4. SMALLSTACK DUP

5. SMALLSTACK ROT

6. POP SMALLSTACK and PUSH to MEMSTACK 7. POP MEMSTACK and PUSH to SMALLSTACK

8. POP SMALLSTACK into ERR 9. POP SMALLSTACK into R4 10. POP SMALLSTACK into R5 11. POP SMALLSTACK into R6 12. POP SMALLSTACK into R7 13. PUSH ERR onto SMALLSTACK 14. PUSH R4 onto SMALLSTACK 15. PUSH R5 onto SMALLSTACK 16. PUSH R6 onto SMALLSTACK 17. PUSH R7 onto SMALLSTACK

18. PUSHPC onto MEMSTACK 19. POP MEMSTACK and JD

20. SMALLSTACK ADD-ptr-int 21. SMALLSTACK ADD-int 22. SMALLSTACK NEG 23. sub-ptr SMALLSTACK 24. MUL SMALLSTACK 25. SLL SMALLSTACK 26. SLR SMALLSTACK

27. SKIPNZ-int SMALLSTACK 28. SKIPZ-int SMALLSTACK 29. SKIPNE-int SMALLSTACK 30. SKIPNE-ptr SMALLSTACK 31. SKIPEQ-int SMALLSTACK 32. SKIPEQ-ptr SMALLSTACK 33. SKIPLE-int SMALLSTACK 34. SKIPLE-ptr SMALLSTACK 35. SKIPLT-int SMALLSTACK 36. SKIPLT-ptr SMALLSTACK 37. SKIPGE-int SMALLSTACK 38. SKIPGE-ptr SMALLSTACK

39. LD SMALLSTACK 40. ST SMALLSTACK

41. PUSH 0 onto SMALLSTACK 42. PUSH 1 onto SMALLSTACK

43. PUSH addr of MEMSTACK onto SMALLSTACK 44. POP from SMALLSTACK to addr of MEMSTACK

45. JREL +1 46. JREL +2 47. JREL +3 48. JREL +4 49. JREL +5 50. JREL +6 51. JREL +7 52. JREL -2 53. JREL -3 54. JREL -4 55. JREL -5 56. JREL -6 57. JREL -7 58. JREL -8

59. BITAND SMALLSTACK 60. BITOR SMALLSTACK 61. BITXOR SMALLSTACK 62. BITNOT SMALLSTACK

63. CAS SMALLSTACK 64. FENCE

now we're getting somewhere. Seems like we can do all of our compute operations, except SRA, and except we can only access the first 8 registers; and no syscalls, not even malloc or sysinfo; and no way to load immediate constants; and JREL is restricted to +-7 (JREL is +-1024 in the 16-bit ISA). And we even managed to fit in EQ, LT, and GE.

i copied this to the BootX? reference. I know i should profile, but this is nice.

---

copied this out of Boot reference:

(todo: no, we want to be able to load PC-relative constants; maybe CODEPTRs can be treated as PTRs, but not vice versa? eg CODEPTR is a subclass of PTR? furthermore, the output of pointer arithmetic on CODEPTRs is an ordinary PTR, not a CODEPTR; this would mean that something loaded from an offset of the PC could not be jumped to)

TODO: ok so the current code can only have gotos via 'JREL' (or the external platform LIBRARY), and JREL is limited to +-1k. That's pretty unacceptable. Of course, a compiler COULD make a long chain of JRELs threading through the whole executable but i bet that existing toolchains are unable to do that. In the 32-bit ISA we're good, because we have +-20 bits in JREL (and if even that's not enough we also have the constant table there). We might have to swallow our pride and either (a) allow jumping to pointers loaded relative to the PC, or (b) allow some 'data' to come after JREL, or (c) introduce a constant table, or (d) have pairs of JRELs that are read together, or something like that, or (e) reintroduce LOADCODEPTR to coerce a constant to a CODEPTR. For now i got rid of the 'log' instruction and introduced two 0-operand instructions, jrel-16 and jmp-16, each followed by 16 bits of constant data.

TODO: really, though, if Boot is to be capable of interpreting OVM and we ever want to do JIT, we probably need a way to jump into code that we ourselves have written -- and since BootX? can have programs bigger than 64k, we need a way to compute addresses at runtime that are more than 64k away and jump to them. So maybe we need to give up the CODEPTR distinction, or at least have something like LOADCODEPTR to coerce into it.

i decided to (a) relax the CODEPTR restriction, and (b) add instruction coerce-int-ptr.
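
A sketch of what decoding such an instruction-plus-trailing-constant might look like. Everything here is an assumption for illustration, not the actual Boot encoding: code is a list of 16-bit words, the opcode value for jrel-16 is made up, and the jump base (the word after the constant) is one plausible choice among several:

```python
# Hypothetical decode step for a 0-operand instruction followed by
# 16 bits of constant data, like the jrel-16 described above.
JREL16 = 0x01   # made-up opcode value

def step(code, pc):
    op = code[pc]
    if op == JREL16:
        off = code[pc + 1]
        if off >= 0x8000:        # sign-extend the 16-bit constant
            off -= 0x10000
        return pc + 2 + off      # base = word after the constant (assumed)
    return pc + 1                # everything else: fall through

# step([JREL16, 0xFFFE], 0) == 0  (offset -2: loops back to the jrel-16)
```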

---

WASM's floating point looks very simple and usable (more instructions than RISC-V but also easier to understand; otoh it seems like WASM isn't even attempting to be fully IEEE 754 -- they don't support various exception and rounding config options)

http://webassembly.org/docs/semantics/ section 'Floating point operators' looks informative.

https://web.archive.org/web/20180119092244/http://webassembly.org/docs/semantics/

---

notes on

[26]

(i recommend it, it's a good read btw)

    fib: n
            n < 2 ifTrue: [^1] 
                  ifFalse: [^(self fib: n - 1) + (self fib: n - 2)]

in squeak:

    9 <10> pushTemp: 0
    10 <77> pushConstant: 2
    11 <B2> send: <
    12 <99> jumpFalse: 15
    13 <76> pushConstant: 1
    14 <7C> returnTop
    15 <70> self
    16 <10> pushTemp: 0
    17 <76> pushConstant: 1
    18 <B1> send: -
    19 <E0> send: fib:
    20 <70> self
    21 <10> pushTemp: 0
    22 <77> pushConstant: 2
    23 <B1> send: -
    24 <E0> send: fib:
    25 <B0> send: +
    26 <7C> returnTop

in pseudo-FORTH:

n 2 < if 1 return then self n 1 - recurse self n 2 - recurse + return

not sure i correctly understood what he meant about large register sets for each procedure; his actual words are: "

    ('temps' -> a Dictionary( 0->18952 1->13665 2->6366 3->3697 4->2301 5->1492 6->939 7->676 8->426 9->346 10->196 11->193 12->139 13->99 14->60 15->47 16->46 17->30 18->15 19->20 20->12 21->11 22->15 23->6 24->3 25->5 26->4 27->3 28->5 32->1 33->2 37->1 39->1 50->1 )
     'args' -> a Dictionary( 0->26114 1->15903 2->4717 3->1712 4->756 5->309 6->138 7->64 8->37 9->13 10->8 11->2 12->1 13->1 )
     'total' -> a Dictionary( 0->18952 1->3240 2->11976 3->3128 4->4290 5->1760 6->1947 7->999 8->983 9->558 10->521 11->293 12->276 13->155 14->196 15->115 16->81 17->57 18->52 19->31 20->43 21->24 22->20 23->11 24->17 25->9 26->6 27->9 28->7 29->5 30->6 33->1 35->1 36->1 38->1 42->1 44->1 46->1 62->1 ) )

"roughly 95% of these methods have 8 or fewer arguments and temporaries, 90% have 6 or fewer, 75% have 3 or fewer, and 69% have 2 or fewer."

Then he counted how many bytecodes were in methods that required various numbers of (temps + args):

"

    'total' -> a Dictionary(
        0->171811  1->95663  2->182923  3->117718  4->125591
         5->92320  6->92671   7->72908  8->61526    9->49807
        10->46568 11->36943  12->27096 13->21759   14->27821
        15->17379 16->13309  17->10390  18->9528    19->7659
        20->10162  21->5969   22->5238  23->3087    24->5804
         25->5229  26->2915   27->3747  28->2551    29->2217
         30->3014    33->12    35->405   36->505     38->341
         42->357    44->235    46->336  62->1028 )

50% of them are defined in a context with 4 or fewer locals and args; 60% with 6 or fewer; 70% with 7 or fewer; 80% with 10 or fewer; 90% with 14 or fewer; 95% with 27 or fewer. "

here's the code:

" Indirect Threaded 16-bit FORTH: Not Great For Density


I think the definition looks something like this in FORTH:

    : FIB DUP 2 <  IF DROP 1  
      ELSE DUP 1- RECURSE  SWAP 2 - RECURSE + THEN ;

which I think, in an indirect-threaded FORTH, compiles to a dictionary entry containing something like this:

    DUP (2) < (IF) #3 DROP (1) (ELSE) #8 DUP 1- FIB SWAP (2) - FIB + ;

...

Naive Lisp: Terrible for Density


What if we interpret fib with a simple Lisp interpreter that walks tree structures? We could define it as follows:

    (labels fib (n) (if (< 2 n) 1 (+ (fib (- n 1)) (fib (- n 2)))))

...

Lua's VM: Four-Byte Register-Based Instructions


They mention that it has 35 instructions, which would almost fit in 5 bits of opcode: MOVE, LOADK, LOADBOOL (converts to boolean and conditionally skips an instruction), LOADNIL (clears a bunch of registers), GETUPVAL, GETGLOBAL, GETTABLE, SETGLOBAL, SETUPVAL, SETTABLE, NEWTABLE, SELF, ADD, SUB, MUL, DIV, POW, UNM (unary minus), NOT, CONCAT (string concatenation of a bunch of registers), JMP, EQ, LT, LE, TEST, CALL, TAILCALL, RETURN, FORLOOP, TFORLOOP, TFORPREP, SETLIST, SETLISTO, CLOSE, and CLOSURE.

CALL passes a range of registers to a function and stores its result in a range of registers; this implies that the virtual machine handles saving and restoring of the stack frame. The paper uses the term "register window" to compare it to what the SPARC does.

...

Here's their example code to show how much better the register machine is:

    local a, t, i   LOADNIL 0 2 0
    a = a + i       ADD     0 0 2
    a = a + 1       ADD     0 0 250
    a = t[i]        GETTABLE 0 1 2

The old stack machine compiled this as follows:

    PUSHNIL 3
    GETLOCAL 0
    GETLOCAL 2
    ADD
    SETLOCAL 0
    GETLOCAL 0
    ADDI 1
    SETLOCAL 0
    GETINDEXED 2
    SETLOCAL 0

It seems that you should be able to compile this on a two-stack machine as NIL NIL DUP >R + 1+ R> NIL GETTABLE, which is 9 instructions instead of 11, and also clearly stupid, since nil is neither a table nor a number.

...

The MuP?21 and F21 instruction sets


((note: these are Chuck Moore Forth CPUs)

MuP?21..instruction set:

 Transfer Instructions:  JUMP, CALL, RET, JZ, JCZ
 Memory Instructions:    LOAD, STORE, LOADP, STOREP, LIT
 ALU Instructions:       COM, XOR, AND, ADD, SHL, SHR, ADDNZ
 Register Instructions:  LOADA, STOREA, DUP, DROP, OVER, NOP

The F21 had 27 instructions to the MuP?21's 24. (Only 23 are listed above, hmm.) They were renamed:

    Code  Name  Description                            Forth (with a variable named A)
    00    else  unconditional jump                     ELSE
    01    T0    jump if T0-19 is false w/ no drop      DUP IF
    02    call  push PC+1 to R, jump                   :
    03    C0    jump if T20 is false                   CARRY? IF
    06    RET   pop PC from R (subroutine return)      ;
    08    @R+   fetch from address in R, increment R   R@ @ R> 1+ >R
    09    @A+   fetch from address in A, increment A   A @ @ 1 A +!
    0A    #     fetch from PC+1, increment PC          LIT
    0B    @A    fetch from address in A                A @ @
    0C    !R+   store to address in R, increment R     R@ ! R> 1+ >R
    0D    !A+   store to address in A, increment A     A @ ! 1 A +!
    0F    !A    store to address in A                  A @ !
    10    com   complement T                           -1 XOR
    11    2*    left shift T, 0 to T0                  2*
    12    2/    right shift T, T20 to T19              2/
    13    +*    add S to T if T0 is true               DUP 1 AND IF OVER + THEN
    14    -or   exclusive-or S to T                    XOR
    15    and   and S to T                             AND
    17    +     add S to T                             +
    18    pop   pop R, push to T                       R>
    19    A     push A to T                            A @
    1A    dup   push T to T                            DUP
    1B    over  push S to T                            OVER
    1C    push  pop T, push to R                       >R
    1D    A!    pop T to A                             A !
    1E    nop   delay 2ns                              NOP
    1F    drop  pop T                                  DROP

T is top-of-stack; R is top-of-return-stack; S is the element right under the top of stack. I think @R+ and !R+ are two of the three new instructions; push and pop are probably the other one, since they don't seem to be listed in the MuP?21 list.

I'm not sure what +* is for, but I'm guessing it was ADDNZ.
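
On +*: it's generally described as a multiply-step primitive -- pair each +* with a 2/, and n repetitions compute an n-bit shift-and-add multiply, with the multiplier in T's low bits and the multiplicand pre-shifted into the high half via A or the stack. A Python sketch of that reading (my reconstruction, not from the post):

```python
# Model of +* ("add S to T if T0 is true") as a multiply step.
# To multiply a * b with n-bit operands: T starts as the multiplier b,
# S holds the multiplicand shifted up into the high half; each round
# is one +* followed by one 2/ (right shift of T).
def multiply(a, b, n=8):
    t = b            # multiplier occupies the low n bits of T
    s = a << n       # multiplicand aligned to the high half
    for _ in range(n):
        if t & 1:    # +*: add S to T if T0 is set
            t += s
        t >>= 1      # 2/: shift the partial product down
    return t

# multiply(3, 5) == 15
```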

I'm not sure where the else, T0, and C0 instructions jump to; maybe the next address on the operand stack.

"

it's notable how similar the MuP?21, F21, and ColorForth? (not mentioned in the post) instruction sets are. Some things i notice about them:

---

for JIT, the runtime wants to note how many times a given piece of code has been run (maybe just 4 states: once, twice, three times, more than three times) so that it can consider compiling it (or optimizing it more). We used to have space in the 64-bit OVM encoding for that, but now we don't, since we're now using 4 whole bits just for instruction format. But i had an idea on what you can do instead: every now and then, put in an annotation. This annotation is just empty space for the runtime to write how many times it has visited. This actually solves two problems at once, because in addition to providing space, it also tells the runtime when to record, so it doesn't have to record every instruction, nor does it have to count instructions.
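
A minimal sketch of the idea (all names and encodings are assumptions for illustration): the code stream contains ANNOTATE words, each followed by a scratch word, and the interpreter bumps a saturating counter in that scratch word each time it passes:

```python
# Annotation words in the code stream carry their own visit counters,
# so hot code identifies itself without per-instruction bookkeeping.
ANNOTATE = 0xFF     # made-up opcode value
SATURATE = 3        # counter values: 1 = once, 2 = twice, 3 = three or more

def visit(code, pc):
    """If code[pc] is an annotation, bump its counter; return the count."""
    if code[pc] == ANNOTATE:
        if code[pc + 1] < SATURATE:
            code[pc + 1] += 1   # write the count back into the code stream
        return code[pc + 1]
    return 0
```

After three visits the counter pins at 3, which the runtime can read as "hot enough to consider compiling".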

---