proj-oot-ootAssemblyNotes20

see also ootNotes30, where i start working on an ISA that might be suitable for genetic algorithms and the like.

that gave me some ideas that should maybe be applied to Boot or to OVM:

also this makes me question if we shouldn't:

hmm.. it doesn't seem possible to add addressing modes to Boot without either:

a variable-length encoding doesn't have to be irregular; some instructions (like LDI and absolute jumps) can just take some extra words at the end. Which is probably fine in most contexts. But:

if we're willing to partially give up on program analysis, however, and restrict each function to ~25 local variables, and encode most constants programmatically, then it's perfectly possible to make a 32-bit version of OVM: 4 fields (opcode, op2, op1, op0), each field has 3 addr mode bits and 5 value bits. Alternately, the opcode field has 3 addr mode bits and 5 value bits, and the other 3 fields have 3 addr mode bits and 4 value bits, and we have 3 bits left over; two of them are used for instruction encoding format choice, and 1 bit is reserved for future use. Using the same scheme as with Boot, this gives us 31 3-operand instructions, plus 31 2-operand instructions, plus 31 1-operand instructions, plus 31 0-operand instructions; of course the 1-operand and 0-operand instructions aren't as useful, so that's basically 64 2-operand+ instructions. With 8 addressing modes and on the order of 64 major instructions it may not be a 1-weekend project, and it's still lacking some stuff from OVM, most notably generics, address space types, custom instructions, and hooks, but still it seems like a suitable place to start, maybe even more suitable than Boot.
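to make the first variant concrete, here's a toy packer/unpacker for that 32-bit encoding: four 8-bit fields (opcode, op2, op1, op0), each split into a 3-bit addressing-mode part and a 5-bit value part. The field order and bit order within the word are my assumptions, not a spec:

```python
def pack(fields):
    """fields: list of four (mode, value) pairs, opcode field first."""
    word = 0
    for mode, value in fields:
        assert 0 <= mode < 8 and 0 <= value < 32  # 3 mode bits, 5 value bits
        word = (word << 8) | (mode << 5) | value
    return word

def unpack(word):
    fields = []
    for shift in (24, 16, 8, 0):  # opcode, op2, op1, op0
        byte = (word >> shift) & 0xFF
        fields.append((byte >> 5, byte & 0x1F))
    return fields

insn = pack([(0, 17), (1, 3), (0, 4), (2, 9)])
assert unpack(insn) == [(0, 17), (1, 3), (0, 4), (2, 9)]  # round-trips
```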

otoh for the 'algorithmic compression' application we may not need main memory at all; a few registers are probably enough to make lots of interesting patterns. In which case we can get rid of the register indirect addressing mode and the load/stores, and we can probably even find some way to get down to 1 addr mode bit instead of 2 (although i'm not sure how yet).

but even if the primary encoding were 32 bits, 16 bits probably still makes some sense, as many instructions won't need the addressing modes, won't access registers above 16, and won't be using the extra opcodes.

does 8 bits make sense? Well, if we use the current version of boot, extended format instructions must have 3 bits set a certain way, and we certainly couldn't afford to lose 3 of 8 bits, so these may not (currently) be compatible. But ignoring that:

so it sounds to me like it would be good to retain the option to have 8 bits, but not to actually work on it (yet). It's probably not good for anything except for code density.

but 16 bits makes sense. And 32 bits makes sense (because, unless you want fewer than 8 registers, 16 bits is probably too small to have both 3 operands and good addr modes). And 64 bits makes sense (at least for the program analysis application, because you can have 256 registers, and for OVM, because you can have the generics type field). And reserving the option for something even longer makes sense (for my weird natural language inspired encoding ideas).

if you had the format encoding bits for all of these, however, then it might look like this:

working backwards, if we want the 64-bit format to use 4 format encoding bits, and we want 8 bits to use 1 format encoding bit, then 16 bits must be at least 2 format encoding bits, and 32 must be at most 3 format encoding bits. So, we'd want to reserve 4 3-operand opcodes in Boot (for 2 format-encoding bits). With the current Boot proposal, there aren't enough free opcodes for that.
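the constraint here is really that the format-encoding bits form a prefix code. One concrete assignment consistent with the counts above (the actual bit values are illustrative, not from any Boot/OVM spec):

```python
# leading bits -> instruction length; '1111' left reserved for extension
formats = {
    '0':    8,    # 1 format bit
    '10':   16,   # 2 format bits
    '110':  32,   # 3 format bits
    '1110': 64,   # 4 format bits
}

# prefix-freeness: no code may be a prefix of another, or the decoder
# can't distinguish formats from the leading bits alone
codes = list(formats)
for a in codes:
    for b in codes:
        assert a == b or not b.startswith(a)

# Kraft inequality: the codes fit in the available encoding space
assert sum(2.0 ** -len(c) for c in codes) <= 1.0
```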

We could do it if we demoted ANNOTATE to a 2-operand instruction, or if we demoted ld and st to 2-operand instructions. However, if you think about it, that isn't really enough; the 8-bit encoding demands an entire bit, not just 2 combinations out of 4. Also, if you think about it, why are we taking bits out of the opcode field anyway? That's the field that needs them the most. If we are short on bits, we really should demote the operand fields to 3-bits before we remove any bits from the opcode field. This would give us the first option above, 2 format encoding bits, 5 opcode bits, 3*3 operand bits; it would, of course, be nice to double our supply of 3-operand instructions. Of course, the issue with that is that since we're reserving 5 special registers, with 8 registers total that means we'd only have 3 GPRs.

Also, in the above, it seems strange to have 5 bits for the opcode for the 16-bit variant but only 4 bits for the opcode for the 32-bit variant. We could add that 'reserved bit' in the 32-bit variant to the opcode field, though -- that's nice but it doesn't do much when the addressing mode is not immediate or split, so it's a little bit of a waste.

Or we could just give up on the 8-bit encoding and the 32-bit encoding (which is the current plan). That's probably better because it's simpler -- but it misses out on 32-bits, the central 'sweet spot' that many other processors hit.

But if we have 5 instruction formats (well, 4 fixed-length ones), does the 'fixed-length' aspect even do us any good any more? Of course, we can always 'up-project' everything to the 64-bit form for program analysis, but if we're going to do that, why not allow JVM-style variable-length in the smaller bitwidths?

otoh i gotta say that the nice arithmetic progression in the number of format bits, and the idea of having a format for each power-of-two doubling, attracts me. The 16-bit and 32-bit ISA would still be easy to build (although now, with 5 bits, the 16-bit ISA could easily accommodate 'way too many' instructions -- a good problem to have, i guess).

also, note, instead of my 'natural-languagy-ISA' idea for the 'extended' language, we could just have something like SEXPRs (of course, SEXPRs could be easily encoded using my 'natural-languagy-ISA' so that's not an argument against it, except for simplicity).

and how does having 4 different instruction lengths with fixed field locations (or at least 3, since the 8-bit format might not follow that pattern) compare to JVM-style opcode-in-the-first-byte-plus-more-bytes-of-data-depending-on-opcode?

Adding the 2 format bits to the 16-bit instructions would make them almost unusable (at least, in terms of efficiency) for general computation, though; with only 3 GPRs it's very inefficient. This means that with this system, 16-bits is pretty much only good for compressing 32-bit code; so if we do this, maybe Boot should be moved to 32-bits (even though that means 8 addressing modes).

Also, it should be noted that this won't help with the thing that originally got me thinking about this today; for the purpose of compressing strings using Kolmogorov algorithmic complexity (finding short programs that make interesting patterns similar to the patterns in the string you are trying to compress), we'd prefer to have 16-bit instructions with fewer registers and addressing modes, instead of more registers and no addressing mode; and we don't want 32 opcodes (the format bits themselves are okay because we can just ignore them and not even try to mutate them). The bespoke 16-bit format i made up earlier, with 4-bit opcodes and 3 fields each with 2 addr mode bits and 2 value bits, would probably be better for this application. I guess the remaining contribution to the argument for more formats from that line of thinking is just that addressing modes are useful, and since 32-bit instruction encodings can have addressing modes, it's a shame to just skip over 32-bit right to 64-bit.

other objections are:

Perhaps my biggest objection is that this destroys the beauty/usefulness of the 16-bit format, by cutting it from 11 GPRs to 4, and for what? More complexity.

hmm. i can't decide.

an alternative would be to reduce the operand length in the 64-bit format, freeing up more format encoding bits up there. That allows us to have one more format encoding bit in the 32-bit format (using the reserved bit; so we'd have 4 operand bits instead of 5 in the 32-bit format). Then if we eliminate the 8-bit format, we can go back to only reserving two opcodes (instead of any full bits) for the 16-bit format to indicate a format change.

OK i went back and looked at the old pymovm source to see how bad it actually is to add addressing modes. In terms of interpreter complexity, it's bad but not horrible. However, it also adds significant complexity to the assembler because you have to parse the addressing modes. Another issue with addressing modes is that now none of our formats are good RISC formats (no implicit load/store, and many registers); instead, we have one good 32-bit CISC format, and two subencodings which could be mixed in for code density (16-bit and 8-bit).

i guess the real issue is: the 16-bit current proposal has a sort of beauty. Most full, fixed-length, simple, orthogonal, 3-operand instruction sets are at least 32 bits, but here was one that was just 16 bits -- twice as good code density. And it really felt like it could be both simple to implement even in a small amount of ROM, and sufficient for all sorts of stuff. But this new proposal robs it of its beauty; instead we get a 16-bit mode that is not sufficient for stuff, and a futuristic mixed ISA with so many parts that the implementation would get larger.

and the new proposal still doesn't give us what i was thinking about when i came up with the new proposal; a 16-bit ISA that would be great for genetic algorithms (because for that we want addressing modes, and even fewer registers).

So it seems we have three design points in front of us; a 32-bit ISA with extensions up and down from 8- to 64-bits+, suitable for general computation but not terribly easy to implement; a beautiful 16-bit ISA with room to extend to 64-bits+ but not to 8- or 32, suitable for Boot; and a beautiful 16-bit ISA with addressing modes but only 2 registers, suitable for genetic algorithms.

Can we gain anything by giving up the extension down to 8-bits? This frees up one bit from the 16-bit mode. We can then give up the 32 opcodes (go back from 5-bit to 4-bit opcode field) to get another bit. We still need the third bit to switch into 32-bit mode, so we can extend 2 operands to 4-bits, but not all 3. This is not as elegant but it is sufficient; we extend op0 and op1, which lets us do CPYs (MOVs) and any other 2-operand stuff with all 16 registers; the restriction is only for 3-operand instructions, and only in the destination operand, which is restricted to the first 8 regs. So there will be some extra CPYs here sometimes, but that's it. This has the additional benefit of freeing up 1 bit in 32-bit mode, and 1 bit in 64-bit mode. But it completely sacrifices 8-bit mode, which may be useful for compression; and 16-bit mode, while now passable, is maybe not beautiful like it was.

Can we do better? Yes; if we wanted 8-bit mode to be a real fixed-length simple ISA then we need all the bits we can get there. But if we only wanted to use 8-bit mode for compressed instructions (a predefined lookup table of common instructions, giving extra weight to instructions that require 64-bit or 32-bit mode, and/or if we need alignment, then common components of common bigrams of instructions), then we can give up more than one bit for formatting there. If we give up two bits for formatting in 8-bit mode (allowing the remaining 6 bits to specify one of 64 common instructions, instead of 128 for 7 bits), then we can give back the extra reserved bits in 32-bit mode and 64-bit mode, and have 1 formatting bit for 16-bit mode, 2 for 8-bit mode, 3 for 32-bit mode, and 4 for 64-bit mode. So we can go back to having 4 modes, and 16-bit mode can again access 16 registers, except not in the output operand. Its beauty is marred but it's again passable as a bootstrap language.
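the allocation just described (1 format bit for 16-bit mode, 2 for 8-bit, 3 for 32-bit, 4 for 64-bit) can be sketched as a decoder; only the bit *counts* come from the text above, the actual bit values are assumptions:

```python
def decode_format(bits):
    """bits: string of leading instruction bits, most significant first."""
    if bits.startswith('0'):
        return 16           # 1 format bit: the bootstrap workhorse
    if bits.startswith('10'):
        return 8            # 2 format bits: 64 compressed/common instructions
    if bits.startswith('110'):
        return 32           # 3 format bits
    if bits.startswith('1110'):
        return 64           # 4 format bits
    raise ValueError('reserved encoding')

assert decode_format('0101') == 16
assert decode_format('1011') == 8
assert decode_format('1100') == 32
assert decode_format('1110') == 64
```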

I like this.

There is another way, though. We could keep 16-bit mode how it is, with 2 adjacent instructions reserved for instruction encoding. This means that any other mode burns 3 bits to say that it isn't 16-bit mode. Instead of adding a fourth or fifth instruction selection bit to 32-bits, we can do the same thing in 32-bits (this time reserve 4 instructions), meaning that 6 bits are needed for an instruction to say that it is neither 16-bit nor 32-bit. Now in 64-bit mode, we reduce the operand value size from 8 to 7 bits, and get 5 bits to spare; since we already had 4 reserved, now we have 9, of which we consume 6, and consume 1 more for the extended format, leaving 2 bits reserved for 64 bits. This is probably better in a practical sense, because the 16-bit mode probably needs its third operand more than 64-bits needs to include 1 more bit of values. But it's aesthetically annoying, because now 64-bit mode can't load any 16-bit immediate in one instruction, and can't use 8-bit immediate constants.

Is there another way? Yes -- we could demote ANNOTATE to 2-operands, remove LDX and STX (load indirect indexed) from BootX?, then reserve 4 3-operand instructions in the 16-bit mode. Now other modes need 2 bits to say that they aren't 16-bit mode. So there is no 8-bit mode, and 32-bit mode has 3 format bits, and 64-bit mode has 4 format bits.

With all of these reserved 3-operand instructions, though, we aren't really getting the full benefit out of the 3-operand format; we end up making operations like MUL, SUB, SLL, AND, OR, XOR in-place. In fact, the only 'normal, computational' 3-operand instruction is ADD!

---

i figured out how to do CALL and RET, where CALL takes not only an address to CALL, but arguments to pass in, and an operand with the effective address of where to put the return value:

just push the effective address for the return value onto the stack along with the return address. Then RET uses this to determine where to put the returned value.
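a toy interpreter fragment sketching that scheme: CALL pushes both the return-value destination (here modelled as a register index) and the return address; RET pops both and writes the return value through the saved destination. All names are made up for illustration:

```python
regs = [0] * 8
stack = []

def call(pc, target, dest_reg):
    stack.append(dest_reg)      # effective address for the return value
    stack.append(pc + 1)        # return address
    return target               # new pc

def ret(return_value):
    return_addr = stack.pop()
    dest_reg = stack.pop()
    regs[dest_reg] = return_value
    return return_addr          # resume in the caller

pc = call(pc=10, target=100, dest_reg=3)   # jump into the callee
pc = ret(42)                               # callee returns 42
assert pc == 11 and regs[3] == 42
```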

one issue with this is that it is weird; it may hinder interop with other runtimes; for example it sort of prevents the CALL from being in tail position; even if we have a separate TAILCALL form of call, now we are expecting RET to know whether the caller called it as tail or not. Better would be just to consider CALL a compile-time macro that desugars to a real CALL and then a CPY; but that doesn't quite work either because we need to be able to execute dynamically-generated CALLs.

---

this guy making a Forth assembly language layer feels that Forth's separate stacks can be a detriment (which suggests that maybe we should just unify the DATASTACK and CODESTACK like most other languages out there):

" 2) The entire point of PAF is to get rid of separate stacks and separate stack pointers, to allow to put all stack items in registers that fit, and use a common stack for the spilled stack items. Implementing PAF on top of a conventional Forth would not achieve that objective. "

---

this guy (Hugh Aguilar) comments on the PAF project saying that you can't really make a portable assembly layer below a HLL unless the processors share a reasonable minimum number of registers and a reasonably orthogonal instruction set (i don't really see how this could be true; perhaps he means 'assuming you want fairly good performance'?).

he opines that today, you can assume 16 registers, which is enough for all popular modern processors:

" The idea of a portable assembler is good, as this would allow Forth to be quickly ported to any processor that fits the basic profile (there is not much difference between the MIPS, ARM and ColdFire?, for example) --- all that would be needed would be to rewrite the portable-assembler, and then the Forth system program would assemble for a new processor --- but the target processors have to be similar in that they all have the minimum number of registers and a reasonably orthogonal instruction set (there is a lot of difference between the 6502 and the Z80, or between the 6812 and the 8086, so a portable assembler wasn't realistic for them).

This is what I said in the thread mentioned previously: On Tuesday, January 20, 2015 at 9:40:31 PM UTC-7, hughag...@gmail.com wrote: > This is a good idea. This idea has been around since the 1980s though, so you shouldn't take credit for it. > > You can't abstract away the register set size --- you have to assume a certain number of registers. This is why the idea is feasible now, and wasn't feasible in the 1980s --- because you can assume 16 registers now (this leaves out the 32-bit x86 though, but it is obsolete anyway) --- in the old days, processors had very few registers, and juggling these few registers was different on every processor (the 6812 and the 8086, for example). > > Your PAF won't work on the MiniForth? though --- it will only work on mainstream processors. > > Anyway, what I'm doing right now is targeting the 8-bit AVR. This is the smallest processor that I can imagine my Forth running on (16 registers with 3 that can be used as pointers) --- porting it to a larger more powerful processor should be easy (although going in the other direction, from a powerful processor to a more limited one, would be difficult). I don't actually consider the AVR to be very useful, because it is not powerful enough for most applications and it is well on its way to being obsolete --- I am just interested in the AVR as a proof-of-concept platform. "

[2]

he claims that if you assume 16 registers, you can target x86-64 and ARM, which cover most of desktop computing and most microcontrollers -- and he claims that not making that commitment is worse than nothing, because then you will effectively be targeting zero CPUs.

---

a Forth guy says:

"

I'm extremely frustrated because I think that the Forth leadership has lost their way. Your Gforth is not about performance; two decades after the MiniForth?, you aren't even in the ballpark yet. Gforth was written in GCC to make it portable, but this assumes that GCC is portable everywhere, which is not true (GCC is only portable to processors with a C-style locals' frame). Also, portability for Gforth is meaningless anyway because nobody is using Gforth (too slow) and so no programs are getting ported because there aren't any programs. Now you have this PAF that seems highly speculative as to whether it will actually work, or if the language will still be Forth after it has been contorted enough to make it work --- you believe that the point is portability --- but portability was solved years ago when everybody standardized on the 64-bit x86 (for desktop-computers) and ARM (for micro-controllers), so you are solving a problem that has already been solved.

The point is performance. "

[3]

---

the PAF guy notes:

" The parameter of PICK has to be a constant or otherwise its value must be statically determined in order to keep the stack accesses statically determinable. "

that's a good point...

---

another comments on PAF:

" >You are going to have one common stack to handle spill-over for multiple stacks? WTF??? This doesn't sound trivial, or even possible.

Right, it's not trivial. It took me many years to find out which features of Forth are incompatible with this idea; these features are absent from PAF, and that makes it possible. "

---

6502 only has 3 registers (A, X, Y) (excluding the stack pointer, the status flags, and the program counter (PC)) [4]. Which suggests that having 3 GPRs, in addition to a stack pointer (we also have a TOS alias register, and actually we have multiple stack pointers), is (barely) doable. However, the 6502 also has the 'zero page', and "code for the 6502 uses the zero page much as code for other processors would use registers" [5] -- TODO i'll have to look into this a little to see if the zero-page is adding expressivity or if it's just making 'load'-style instructions shorter by "saving the cycle normally required to fetch the high-order byte of the address" [6]

also, the 6502 instruction set might be a good place to start to look for things for the 8-bit mode, although it would probably be better to just profile real code later and find commonly-used instructions from the longer modes.

---

otoh GNU lightning has 6 GPR regs (3 caller-saved and 3 callee-saved), which would suggest that 3 isn't nearly enough. Also, the PAF critic (Hugh Aguilar) quoted above draws the line somewhere around 8 registers (he doesn't say 8, but he gives the examples of the 8086 and the 6812, which have 8 more-or-less GPRs (according to Wikipedia) and 4 registers (2 accumulators A and B and two others X and Y) respectively). 3 regs is fewer than either of those.

This makes me think maybe we should just bite the (addressing mode) bullet and move Boot to 32-bit.

---

also, todo i should look at the m68k series since everyone loves how orthogonal it is.

---

[7] gives some examples of older processors with few registers and how that makes the ISAs very different:

"there is a lot of difference between the 6502 and the Z80, or between the 6812 and the 8086, so a portable assembler wasn't realistic for them"

Let's see how many registers these had:

6502:

" one 8-bit accumulator register (A), two 8-bit index registers (X and Y), 7 processor status flag bits (P), an 8-bit stack pointer (S), and a 16-bit program counter (PC). The stack's address space is hardwired to memory page $01, i.e. the address range $0100–$01FF (256–511). Software access to the stack is done via four implied addressing mode instructions, whose functions are to push or pop (pull) the accumulator or the processor status register. The same stack is also used for subroutine calls via the JSR (Jump to Subroutine) and RTS (Return from Subroutine) instructions and for interrupt handling." [8]

Z80:

8080 registers:

"

The new registers introduced with the Z80 are:

" There is no direct access to the alternate registers; instead, two special instructions, EX AF,AF' and EXX,[32] each toggles one of two multiplexer flip-flops. This enables fast context switches for interrupt service routines: EX AF, AF' may be used alone, for really simple and fast interrupt routines, or together with EXX to swap the whole BC, DE, HL set. This is still several times as fast as pushing the same registers on the stack. Slower, lower priority, or multi level interrupts normally use the stack to store registers, however. "

"No multiply instruction is available in the original Z80."

[9]

so the normalish registers for the 8080 are (by 'normalish' i mean to exclude the stack pointer):

so the normalish registers for the Z80 are:

6812:

"two 8-bit accumulators A and B (referred to as a single 16-bit accumulator, D, when A & B are cascaded so as to allow for operations involving 16 bits), two 16-bit registers X and Y, a 16-bit program counter, a 16-bit stack pointer and an 8-bit Condition Code Register." [10]

so, the normalish registers for the 6812 are:

8086: "The 8086 has eight more or less general 16-bit registers (including the stack pointer but excluding the instruction pointer, flag register and segment registers). Four of them, AX, BX, CX, DX, can also be accessed as twice as many 8-bit registers (see figure) while the other four, BP, SI, DI, SP, are 16-bit only." [11]

so, the normalish registers for the 8086 are:

---

done

done

---

so, reflecting on the previous, my 16-bit proposal does seem at first a little register-poor, in that the 3-operand instructions have one operand that can only access 4 GPRs. However, the 2-operand and less instructions can access all 16 registers, and 2 of the 3 operands for the 3-operand ones can too (which in many cases is all the register operands anyways), and many (or even all?) of the 4 above-listed ISAs have 2-operand or less instruction encodings anyhow. So far it's really just the 3-operand ADD that suffers a lack of generality in my proposal. Also, i'm not counting the T (TOS) register, but in many cases it probably helps with register pressure too.

so i think we're still good. Might want to add an alternate in-place ADD instruction.

---

done

---

making op2 3 bits instead of 4 bits will make it impossible to fit the following important 2-operand instructions into BootX?: bit(or|and|xor)-int, picki, rolli (although we could demote to 1-operand, make the bitwise guys in-place, and make picki and rolli only apply to SMALLSTACK) -- is that worth having the 8-bit encoding??? Yeah it probably is... This will also put pressure on the other stuff currently in BootX? 1-operand, but again it's probably worth that, esp. since now we're adding a new 32-bit representation with more opcode space.

done

---

" important RISC-V (ISA ) 2 posts by 2 authors Awais Ahmed 12/5/17

I have to implement RISC-V architecture (ISA) in C++. As all ISA can not be implemented, can some one tell me about most important approx. 40 Instructions set ? please help me in this matter. Bruce Hoult 12/6/17 Other recipients: awaan....@gmail.com ADDI, SLTI, SLTIU, ANDI, ORI, XORI, SLLI, SRLI, SRAI, LUI, AUIPC, ADD, SLT, SLTU, AND, OR, XOR, SLL, SRL, SUB, SRA, JAL, JALR, BNE, BEQ, BLT, BLTU, BGE, BGEU, LOAD, STORE.

That's 31, as listed in the User-Level ISA manual for the 32 bit integer instruction set, minus only FENCE, CSR transfer, ECALL, EBREAK. I guess you're still within your 40 instructions limit even with them, if you don't break out LOAD and STORE into LB/LH/LW/LBU/LWU/SB/SH/SW when counting them.

"

---

" Let's say you're working on code for which speed is paramount, perhaps a new video codec or a library to process tensors. There are individual instructions that will dramatically improve performance, like fused multiply-add, as well as entire instruction sets like SSE2 and AVX, that can give the critical portions of your code a speed boost. "

---

"Some applications use different/narrower definitions of float. For example GPUs have 16 bit floats and TPUs (ie. AI co-processors) also use narrower floats which are specialized for neural networks."

---

pros and cons of eliminating the unspecified bitwidth of ints and fixing it to 16-bits:

hmm.. the transcribing pointers to integers case is a doozy... maybe we need to keep it unspecified, because we do want to allow implementations the freedom to address regions larger than 64k or to use native pointers

---

actually we don't need 'cast-uint-ptr', because given a set of pointers and their distances from each other, you can just take their distances from the earliest one, and then given some other base pointer, you can recreate that structure using add-ptr-int.

similarly, we don't need a 'cast-ptr-uint' because we have pointer subtraction.
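the argument above can be sketched with pointers modelled as plain integers: record each pointer's distance from the earliest one (using only pointer subtraction), then rebuild the same relative structure from any new base pointer (using only add-ptr-int), with no uint-to-pointer cast anywhere:

```python
ptrs = [1040, 1000, 1024]
base = min(ptrs)
offsets = [p - base for p in ptrs]       # only ptr - ptr subtraction used

new_base = 5000
rebuilt = [new_base + off for off in offsets]  # only add-ptr-int used

# the relative structure is preserved
assert [p - min(rebuilt) for p in rebuilt] == offsets
```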

---

one idea for how it can come together now:

Boot can express pure serial computations as a streaming iterator, that is, a one-way stream of input and a one-way stream of output. BootX adds the ability to express concurrency and I/O. OVM adds the 32-bit, 8-bit, and 64-bit encodings.

another idea:

Boot has the base I/O (syscalls) already. BootX? adds concurrency and 32-bit, 8-bit. OVM adds 64-bit (and all the complicated OVM stuff).

---


AspenCore? 2017 Embedded Markets Study: top answers for:

Which of the following 8-bit chip families would you consider for your next embedded project?

Microchip PIC: 46%
Atmel AVR: 43%
STMicroelectronics ST6, ST7, ST8: 18%
Freescale HC: 13%

Which of the following 16-bit chip families would you consider for your next embedded project?

Microchip PIC24 / dsPIC: 45%
TI MSP430: 42%
STMicroelectronics ST9, ST10: 22%
Freescale HC16: 15%

Which of the following 32-bit chip families would you consider for your next embedded project?

STMicro STM32 (ARM): 30%
Microchip PIC 32-bit (MIPS): 20%
Xilinx Zynq (with dual ARM Cortex-A9): 17%
Freescale i.MX (ARM): 17%

Which vendor has the best ecosystem for your needs?

Microchip or Atmel (Microchip): 14%
Texas Instruments (TI): 14%
ST Microelectronics: 11%
NXP/Freescale/Qualcomm: 11%
Xilinx: 5%

-- [18]

(already copied to plChIsaComparisons)

this suggests that the guys for us to look at are:

8-bit: PIC, AVR, mb ST6,7,8
16-bit: PIC24/dsPIC, MSP430, mb ST9,ST10
32-bit: ARM, MIPS

top vendors: microchip, atmel, TI, ST, NXP/Freescale/Qualcomm

---

openasocket 1 day ago

I'm not an expert on these things, but it seems to me if you're implementing a database in Java you wouldn't want to keep your data on the JVM Heap, as this seems to indicate. My understanding is that in most applications (like servers) the average object lives for a very short period of time, and most GC implementations are built from that idea. But, in a database, especially an in-memory database, the majority of the objects are going to live for a very long time. That makes the mark phase of GC a lot more expensive, puts more pressure on the generations, etc.

Is my guess here correct, or are there things I'm missing or mistaken on?


jakewins 1 day ago

This is correct; the standard approach here is to use regular c-style memory management for the data the system is managing, and the JVM heap only for the database "infrastructure".

This hybrid approach gives the benefit of a managed runtime and safety of GC for most of your code, but allows the performance of raw pointers/malloc for key code paths.

Some examples of this pattern on the JVM:


ndesaulniers 16 hours ago

> but allows the performance of ... malloc for key code paths.

Everything is relative, I guess.


---

interestingly, in http://www.ultratechnology.com/moore4th.htm , Chuck Moore gives his opinion that CPUs should have 2 stacks, but that:

" A Forth word should not have more than one or two arguments. This stack which people have so much trouble manipulating should never be more than three or four deep...On the i21 we have an on-chip stack 18 deep. This size was chosen as a number effectively infinite. "

so our SMALLSTACK would seem to meet his requirements.

My only concern is that in OVM we're using most of SMALLSTACK for custom instructions, so that most of it isn't really available for OVM programs (maybe we could reserve the top 4 locations to be available, which is all Moore says is needed, but i like to have a little leeway just in case); if we have 3 layers of custom instructions, and each layer consumes 4 spots of stack space, then we only have 4 spots left. This discrepancy with Moore is probably because, for custom instructions, we are using SMALLSTACK as a call stack/result stack, not just a temporaries/data/operand stack. So maybe we should expand SMALLSTACK from 16 to 32. In OVM, we'll require that the custom instructions can cumulatively use only 16 of those 32 positions, leaving 16 spots for OVM user code.

If each spot is 64-bits (8 bytes), then 32 spots is 256 bytes. If each spot is 16-bits (2 bytes), then 32 spots is 64 bytes. This still seems reasonable, if not as nice.
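a quick check of the budget arithmetic above (the numbers are the ones proposed in the text, not from any spec):

```python
SMALLSTACK_DEPTH = 32
custom_layers, spots_per_layer = 3, 4

used_by_custom = custom_layers * spots_per_layer   # 12 spots in the worst case
custom_budget = 16                                 # cap on cumulative custom use
user_spots = SMALLSTACK_DEPTH - custom_budget      # left over for OVM user code

assert used_by_custom <= custom_budget
assert user_spots == 16
assert SMALLSTACK_DEPTH * 8 == 256   # bytes if each spot is 64 bits
assert SMALLSTACK_DEPTH * 2 == 64    # bytes if each spot is 16 bits
```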

With this, you can't reach the entire SMALLSTACK with a single displaced LOAD or STORE in 16-bit mode -- but you can't anyways, since we narrowed the displacement operand to 3 bits.

---

what about the JVM's (operand) stack? The stack ops appear to be just DUP (and DUP variants) and SWAP. No ROT, ROLL, OVER as far as i can see.

The operand stack is statically allocated per-frame and the parameter for this is typed as 'u2', i.e. a 16-bit unsigned integer, so it seems that the operand stack size could go up to 64k. Way too big for us.

how about Python (CPython)? The relevant stack is called the 'evaluation stack' or 'valuestack'. The relevant size appears to be (?) "co_stacksize", which is an "int" in C.

a comment in https://github.com/python/cpython/blob/520b7ae27e39d1c77ea74ccd1b184d7cb43f9dcb/Python/ceval.c says "The stack can grow at most MAXINT deep, as co_nlocals and co_stacksize are ints"

(interestingly, CPython has a test, https://github.com/python/cpython/blob/b21d155f57d284aecf9092a9bd24258293965c2f/Lib/test/test_compile.py 'check_stack_size', that appears to check that the stack size grows as O(log(N)) with bytecode length)

so again, way too big for us.

---

If you had asked me to guess what the fundamental stack ops are, i would have said DUP SWAP ROT.

DROP is also fundamental but we already got that (POP to the register DISCARD)

Some dynamic frequency profiling finds that DUP and SWAP and OVER seem to be the most used guys (in one case OVER is even more frequent than SWAP) (eg figure 5 in The Common Case in Forth Programs). OVER is sometimes mentioned before ROT [19]. ROT isn't used much.

The Gforth 'core' stack ops are drop dup over swap rot ?dup (and some '2's, eg 2dup) [20]

The JVM has DUP and SWAP but not OVER or ROT.

Moore's colorForth has DUP and OVER but no SWAP or rot (in colorForth, SWAP can be synthesized from OVER and PUSH and POP and OR).

http://wiki.laptop.org/go/Forth_stack_operators mentions dup ?dup drop swap over rot -rot nip tuck

https://www.forth.com/starting-forth/2-stack-manipulation-operators-arithmetic/ mentions swap dup over rot drop

http://www.murphywong.net/hello/simple.htm mentions DUP ?DUP SWAP OVER ROT NIP TUCK , in addition to the 2s

https://softwareengineering.stackexchange.com/questions/179501/which-are-the-fundamental-stack-manipulation-operations the question asker mentions drop dup swap rot, and then later nip -rot tuck over. One of the answers, [21], mentions dup swap drop.

http://useless-factor.blogspot.com/2007/09/survey-of-stack-shufflers.html mentions drop, dup, swap, and then later, swapd, rot, nip, dupd, over, tuck, pick. At the bottom, e mentions drop swap dup over tuck rot dip keep, and then drop swap dup.

(i'm omitting mentions of PICK and ROLL btw, since they are more generic, and since we dont have opcode space for them in 16-bit Boot anyhow; also not PUSH and POP b/c these are obviously fundamental)

But people talk about ROT a lot (eg in intro-to-Forth stuff like [22]) so maybe i'll leave it in. Otoh ?dup (which does nothing if the TOS is 0, and otherwise executes a DUP [23]) seems to be more frequent than rot, and is also 'core' in Gforth. So maybe leave out ROT. However, https://users.ece.cmu.edu/~koopman/stack_computers/sec6_3.html "Table 6.1. Dynamic Instruction Execution Frequencies for important Forth primitives" includes ROT in the top 20.

hmm.. i dont think ROT belongs in there. but i'll leave it there for now.
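
For concreteness, here's a minimal Python sketch of the semantics of these shuffle words on a list-backed stack (top of stack is the end of the list); ?dup is spelled qdup since '?' isn't a valid Python identifier:

```python
# Forth-style stack shuffle words on a Python list; TOS is s[-1].

def dup(s):  s.append(s[-1])                # ( a -- a a )
def drop(s): s.pop()                        # ( a -- )
def swap(s): s[-1], s[-2] = s[-2], s[-1]    # ( a b -- b a )
def over(s): s.append(s[-2])                # ( a b -- a b a )
def rot(s):  s.append(s.pop(-3))            # ( a b c -- b c a )

def qdup(s):                                # ?dup: DUP only if TOS is nonzero
    if s[-1] != 0:
        s.append(s[-1])

s = [1, 2, 3]
rot(s)       # now [2, 3, 1]
over(s)      # now [2, 3, 1, 3]
```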

---

i didn't add this to my PL book notes b/c it looks like this is just a side project for learning by one guy but for my own future reference:

[24]

the ISA is:

    VALUE       OPCODE  EXPLANATION
    0x00000000  NOP     do nothing
    0x00000001  ADD     pop a, pop b, push a + b
    0x00000002  SUB     pop a, pop b, push a - b
    0x00000003  AND     pop a, pop b, push a & b
    0x00000004  OR      pop a, pop b, push a | b
    0x00000005  XOR     pop a, pop b, push a ^ b
    0x00000006  NOT     pop a, push !a
    0x00000007  IN      read one byte from stdin, push as word on stack
    0x00000008  OUT     pop one word and write to stream as one byte
    0x00000009  LOAD    pop a, push word read from address a
    0x0000000A  STOR    pop a, pop b, write b to address a
    0x0000000B  JMP     pop a, goto a
    0x0000000C  JZ      pop a, pop b, if a == 0 goto b
    0x0000000D  PUSH    push next word
    0x0000000E  DUP     duplicate word on stack
    0x0000000F  SWAP    swap top two words on stack
    0x00000010  ROL3    rotate top three words on stack once left, (a b c) -> (b c a)
    0x00000011  OUTNUM  pop one word and write to stream as number
    0x00000012  JNZ     pop a, pop b, if a != 0 goto b
    0x00000013  DROP    remove top of stack
    0x00000014  PUSHIP  push a in IP stack
    0x00000015  POPIP   pop IP stack to current IP, effectively performing a jump
    0x00000016  DROPIP  pop IP, but do not jump
    0x00000017  COMPL   pop a, push the complement of a

looks like i got all that already in Boot..

notably, the stack ops are DUP SWAP ROL3

a blog post by the same guy is: https://csl.name/post/vm/

the operations in the blog post are:

% cast_int cast_str drop dup if jmp over print println read stack (dump stack) swap

again, looks like i got all those, except 'dump stack' and div and mod and cast_str (but i have read and write in bytes so cast_str sort of isn't needed -- although of course it would be needed to print out a number, unless you wrote sprintf in the language)
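
As a sanity check on how little machinery an ISA like [24]'s needs, here's a toy dispatch loop in Python covering a handful of the ops (PUSH taking its operand from the instruction stream, ROL3, JZ). The opcode values match the table above; the halt-by-running-off-the-end behavior is my assumption, since the listed ISA has no explicit HALT:

```python
# Toy interpreter for a few ops of the stack ISA above.
ADD, JZ, PUSH, DUP, SWAP, ROL3 = 0x01, 0x0C, 0x0D, 0x0E, 0x0F, 0x10

def run(prog):
    stack, pc = [], 0
    while pc < len(prog):                  # running off the end halts
        op = prog[pc]; pc += 1
        if op == PUSH:                     # PUSH: push next word
            stack.append(prog[pc]); pc += 1
        elif op == ADD:                    # pop a, pop b, push a + b
            a, b = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == DUP:
            stack.append(stack[-1])
        elif op == SWAP:
            stack[-1], stack[-2] = stack[-2], stack[-1]
        elif op == ROL3:                   # (a b c) -> (b c a)
            stack.append(stack.pop(-3))
        elif op == JZ:                     # pop a, pop b, if a == 0 goto b
            a, b = stack.pop(), stack.pop()
            if a == 0:
                pc = b
        else:
            raise ValueError(f"unimplemented opcode {op:#x}")
    return stack

# e.g. run([PUSH, 2, PUSH, 3, ADD]) leaves [5] on the stack
```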

---

i read this post and the comments down to https://yosefk.com/blog/my-history-with-forth-stack-machines.html

---

[25] a person asks for the most important 40 or so RISC-V instructions and someone answers:

" ADDI, SLTI, SLTIU, ANDI, ORI, XORI, SLLI, SRLI, SRAI, LUI, AUIPC, ADD, SLT, SLTU, AND, OR, XOR, SLL, SRL, SUB, SRA, JAL, JALR, BNE, BEQ, BLT, BLTU, BGE, BGEU, LOAD, STORE.

That's 31, as listed in the User-Level ISA manual for the 32 bit integer instruction set, minus only FENCE, CSR transfer, ECALL, EBREAK. I guess you're still within your 40 instructions limit even with them, if you don't break out LOAD and STORE into LB/LH/LW/LBU/LWU/SB/SH/SW when counting them. "

i got many of those, except:

---

what is shift right arithmetic good for? It's good for not losing the sign bit:

http://chortle.ccsu.edu/assemblytutorial/Chapter-14/ass14_13.html
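
A quick Python illustration of the difference (srl16/sra16 are made-up helper names, assuming 16-bit two's-complement values): logical shift right shifts in zeros, destroying the sign, while arithmetic shift right replicates the sign bit.

```python
def srl16(x, n):
    """Logical shift right on a 16-bit value: zeros shift in from the left."""
    return (x & 0xFFFF) >> n

def sra16(x, n):
    """Arithmetic shift right on a 16-bit value: the sign bit is replicated."""
    x &= 0xFFFF
    if x & 0x8000:              # negative in two's complement
        x -= 0x10000            # reinterpret as a signed Python int
    return (x >> n) & 0xFFFF    # Python's >> is arithmetic on negative ints

# -16 is 0xFFF0 in 16-bit two's complement:
#   srl16(0xFFF0, 2) == 0x3FFC  (a large positive number: sign lost)
#   sra16(0xFFF0, 2) == 0xFFFC  (still -4: sign preserved)
```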

---

8-bit instruction ideas:

0. SMALLSTACK DUP 1. SMALLSTACK SWAP 2. SMALLSTACK OVER 3. SMALLSTACK DROP 4. SMALLSTACK ADD-int

5. POP SMALLSTACK and PUSH to MEMSTACK 6. POP MEMSTACK and PUSH to SMALLSTACK

7. POP SMALLSTACK into ERR 8. POP SMALLSTACK into top of MEMSTACK 9. POP SMALLSTACK into R4 10. POP SMALLSTACK into R5 11. POP SMALLSTACK into R6 12. POP SMALLSTACK into R7

13. PUSH ERR onto SMALLSTACK 14. PUSH T onto SMALLSTACK 15. PUSH R4 onto SMALLSTACK 16. PUSH R5 onto SMALLSTACK 17. PUSH R6 onto SMALLSTACK 18. PUSH R7 onto SMALLSTACK

19. PUSHPC onto MEMSTACK 20. POP MEMSTACK and JD

21. POP MEMSTACK into ERR 22. POP MEMSTACK into T 23. POP MEMSTACK into R4 24. POP MEMSTACK into R5 25. POP MEMSTACK into R6 26. POP MEMSTACK into R7

27. PUSH ERR onto MEMSTACK 28. PUSH T onto MEMSTACK 29. PUSH R4 onto MEMSTACK 30. PUSH R5 onto MEMSTACK 31. PUSH R6 onto MEMSTACK 32. PUSH R7 onto MEMSTACK

33. BNZ-int SMALLSTACK 34. BZ-int SMALLSTACK 35. BNZ-int MEMSTACK 36. BZ-int MEMSTACK

37. BNE-int SMALLSTACK 38. BNE-ptr SMALLSTACK 39. BEQ-int SMALLSTACK 40. BEQ-ptr SMALLSTACK 41. BNE-int MEMSTACK 42. BNE-ptr MEMSTACK 43. BEQ-int MEMSTACK 44. BEQ-ptr MEMSTACK

45. BLE-int SMALLSTACK 46. BLE-ptr SMALLSTACK 47. BLT-int SMALLSTACK 48. BLT-ptr SMALLSTACK 49. BGE-int SMALLSTACK 50. BGE-ptr SMALLSTACK 51. BGT-int SMALLSTACK 52. BGT-ptr SMALLSTACK 53. BLE-int MEMSTACK 54. BLE-ptr MEMSTACK 55. BLT-int MEMSTACK 56. BLT-ptr MEMSTACK 57. BGE-int MEMSTACK 58. BGE-ptr MEMSTACK 59. BGT-int MEMSTACK 60. BGT-ptr MEMSTACK

61. LD SMALLSTACK 62. LD MEMSTACK 63. ST SMALLSTACK 64. ST MEMSTACK

ok that was quick... so yeah i guess it's easy to fill up 6 bits if you allow many combinations of targets between MEMSTACK, SMALLSTACK, the first 8 registers, ints, and ptrs.

of course for real it would be better just to leave this reserved and profile later, but i'm still interested. So let's try again.

so what if we only allow most compute ops on SMALLSTACK, not MEMSTACK? also, we'll prohibit stack JREL and branches, because we don't want to be able to jump to fully-dynamically-computed PC addresses (which is what would happen if we could jump to any offset on the stack):

8-bit instruction ideas: 0. ANNOTATE

1. SMALLSTACK SWAP 2. SMALLSTACK OVER 3. SMALLSTACK DROP 4. SMALLSTACK DUP

5. SMALLSTACK ROT

6. POP SMALLSTACK and PUSH to MEMSTACK 7. POP MEMSTACK and PUSH to SMALLSTACK

8. POP SMALLSTACK into ERR 9. POP SMALLSTACK into R4 10. POP SMALLSTACK into R5 11. POP SMALLSTACK into R6 12. POP SMALLSTACK into R7 13. PUSH ERR onto SMALLSTACK 14. PUSH R4 onto SMALLSTACK 15. PUSH R5 onto SMALLSTACK 16. PUSH R6 onto SMALLSTACK 17. PUSH R7 onto SMALLSTACK

18. PUSHPC onto MEMSTACK 19. POP MEMSTACK and JD

20. SMALLSTACK ADD-ptr-int 21. SMALLSTACK ADD-int 22. SMALLSTACK NEG 23. sub-ptr SMALLSTACK 24. MUL SMALLSTACK 25. SLL SMALLSTACK 26. SLR SMALLSTACK

27. SKIPNZ-int SMALLSTACK 28. SKIPZ-int SMALLSTACK 29. SKIPNE-int SMALLSTACK 30. SKIPNE-ptr SMALLSTACK 31. SKIPEQ-int SMALLSTACK 32. SKIPEQ-ptr SMALLSTACK 33. SKIPLE-int SMALLSTACK 34. SKIPLE-ptr SMALLSTACK 35. SKIPLT-int SMALLSTACK 36. SKIPLT-ptr SMALLSTACK 37. SKIPGE-int SMALLSTACK 38. SKIPGE-ptr SMALLSTACK

39. LD SMALLSTACK 40. ST SMALLSTACK

41. PUSH 0 onto SMALLSTACK 42. PUSH 1 onto SMALLSTACK

43. PUSH addr of MEMSTACK onto SMALLSTACK 44. POP from SMALLSTACK to addr of MEMSTACK

45. JREL +1 46. JREL +2 47. JREL +3 48. JREL +4 49. JREL +5 50. JREL +6 51. JREL +7 52. JREL -2 53. JREL -3 54. JREL -4 55. JREL -5 56. JREL -6 57. JREL -7 58. JREL -8

59. BITAND SMALLSTACK 60. BITOR SMALLSTACK 61. BITXOR SMALLSTACK 62. BITNOT SMALLSTACK

63. CAS SMALLSTACK 64. FENCE

now we're getting somewhere. Seems like we can do all of our compute operations, except SRA, and except we can only access the first 8 registers; and no syscalls, not even malloc or sysinfo; and no way to load immediate constants; and JREL is restricted to +-7 (JREL is +-1024 in the 16-bit ISA). And we even managed to fit in EQ, LT, and GE.

i copied this to the BootX? reference. I know i should profile, but this is nice.

---

copied this out of Boot reference:

(todo: no, we want to be able to load PC-relative constants; maybe CODEPTRs can be treated as PTRs, but not vice versa? eg CODEPTR is a subclass of PTR? furthermore, the output of pointer arithmetic on CODEPTRs is an ordinary PTR, not a CODEPTR; this would mean that something loaded from an offset of the PC could not be jumped to)

TODO: ok so the current code can only have gotos via 'JREL' (or the external platform LIBRARY), and JREL is limited to +-1k. That's pretty unacceptable. Of course, a compiler COULD make a long chain of JRELs threading through the whole executable but i bet that existing toolchains are unable to do that. In the 32-bit ISA we're good, because we have +-20 bits in JREL (and if even that's not enough we also have the constant table there). We might have to swallow our pride and either (a) allow jumping to pointers loaded relative to the PC, or (b) allow some 'data' to come after JREL, or (c) introduce a constant table, or (d) have pairs of JRELs that are read together, or something like that, or (e) reintroduce LOADCODEPTR to coerce a constant to a CODEPTR. For now i got rid of the 'log' instruction and introduced two 0-operand instructions, jrel-16 and jmp-16, each followed by 16 bits of constant data.

TODO: really, though, if Boot is to be capable of interpreting OVM and we ever want to do JIT, we probably need a way to jump into code that we ourselves have written -- and since BootX? can have programs bigger than 64k, we need a way to compute addresses at runtime that are more than 64k away and jump to them. So maybe we need to give up the CODEPTR distinction, or at least have something like LOADCODEPTR to coerce into it.

i decided to (a) relax the CODEPTR restriction, and (b) add instruction coerce-int-ptr.
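
A sketch of what decoding such an instruction-plus-trailing-constant might look like. Everything here is an assumption for illustration, not the actual Boot encoding: code is a list of 16-bit words, the opcode value for jrel-16 is made up, and the jump base (the word after the constant) is one plausible choice among several:

```python
# Hypothetical decode step for a 0-operand instruction followed by
# 16 bits of constant data, like the jrel-16 described above.
JREL16 = 0x01   # made-up opcode value

def step(code, pc):
    op = code[pc]
    if op == JREL16:
        off = code[pc + 1]
        if off >= 0x8000:        # sign-extend the 16-bit constant
            off -= 0x10000
        return pc + 2 + off      # base = word after the constant (assumed)
    return pc + 1                # everything else: fall through

# step([JREL16, 0xFFFE], 0) == 0  (offset -2: loops back to the jrel-16)
```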

---

WASM's floating point looks very simple and usable (more instructions than RISC-V but also easier to understand; otoh it seems like WASM isn't even attempting to be fully IEEE 754 -- they don't support various exception and rounding config options)

http://webassembly.org/docs/semantics/ section 'Floating point operators' looks informative.

https://web.archive.org/web/20180119092244/http://webassembly.org/docs/semantics/

---

notes on

[26]

(i recommend it, it's a good read btw)

    fib: n
            n < 2 ifTrue: [^1] 
                  ifFalse: [^(self fib: n - 1) + (self fib: n - 2)]

in squeak:

    9 <10> pushTemp: 0
    10 <77> pushConstant: 2
    11 <B2> send: <
    12 <99> jumpFalse: 15
    13 <76> pushConstant: 1
    14 <7C> returnTop
    15 <70> self
    16 <10> pushTemp: 0
    17 <76> pushConstant: 1
    18 <B1> send: -
    19 <E0> send: fib:
    20 <70> self
    21 <10> pushTemp: 0
    22 <77> pushConstant: 2
    23 <B1> send: -
    24 <E0> send: fib:
    25 <B0> send: +
    26 <7C> returnTop

in pseudo-FORTH:

n 2 < if 1 return then self n 1 - recurse self n 2 - recurse + return

not sure i correctly understood what he meant about large register sets for each procedure; his actual words are: "

    ('temps' -> a Dictionary( 0->18952 1->13665 2->6366 3->3697 4->2301 5->1492 6->939 7->676 8->426 9->346 10->196 11->193 12->139 13->99 14->60 15->47 16->46 17->30 18->15 19->20 20->12 21->11 22->15 23->6 24->3 25->5 26->4 27->3 28->5 32->1 33->2 37->1 39->1 50->1 )
     'args' -> a Dictionary( 0->26114 1->15903 2->4717 3->1712 4->756 5->309 6->138 7->64 8->37 9->13 10->8 11->2 12->1 13->1 )
     'total' -> a Dictionary( 0->18952 1->3240 2->11976 3->3128 4->4290 5->1760 6->1947 7->999 8->983 9->558 10->521 11->293 12->276 13->155 14->196 15->115 16->81 17->57 18->52 19->31 20->43 21->24 22->20 23->11 24->17 25->9 26->6 27->9 28->7 29->5 30->6 33->1 35->1 36->1 38->1 42->1 44->1 46->1 62->1 ) )

"roughly 95% of these methods have 8 or fewer arguments and temporaries, 90% have 6 or fewer, 75% have 3 or fewer, and 69% have 2 or fewer."

Then he counted how many bytecodes were in methods that required various numbers of (temps + args):

"

    'total' -> a Dictionary(
        0->171811  1->95663  2->182923  3->117718  4->125591
         5->92320  6->92671   7->72908  8->61526    9->49807
        10->46568 11->36943  12->27096 13->21759   14->27821
        15->17379 16->13309  17->10390  18->9528    19->7659
        20->10162  21->5969   22->5238  23->3087    24->5804
         25->5229  26->2915   27->3747  28->2551    29->2217
         30->3014    33->12    35->405   36->505     38->341
         42->357    44->235    46->336  62->1028 )

50% of them are defined in a context with 4 or fewer locals and args; 60% with 6 or fewer; 70% with 7 or fewer; 80% with 10 or fewer; 90% with 14 or fewer; 95% with 27 or fewer. "

here's the code:

" Indirect Threaded 16-bit FORTH: Not Great For Density


I think the definition looks something like this in FORTH:

    : FIB DUP 2 <  IF DROP 1  
      ELSE DUP 1- RECURSE  SWAP 2 - RECURSE + THEN ;

which I think, in an indirect-threaded FORTH, compiles to a dictionary entry containing something like this:

    DUP (2) < (IF) #3 DROP (1) (ELSE) #8 DUP 1- FIB SWAP (2) - FIB + ;

...

Naive Lisp: Terrible for Density


What if we interpret fib with a simple Lisp interpreter that walks tree structures? We could define it as follows:

    (labels fib (n) (if (< 2 n) 1 (+ (fib (- n 1)) (fib (- n 2)))))

...

Lua's VM: Four-Byte Register-Based Instructions


They mention that it has 35 instructions, which would almost fit in 5 bits of opcode: MOVE, LOADK, LOADBOOL (converts to boolean and conditionally skips an instruction), LOADNIL (clears a bunch of registers), GETUPVAL, GETGLOBAL, GETTABLE, SETGLOBAL, SETUPVAL, SETTABLE, NEWTABLE, SELF, ADD, SUB, MUL, DIV, POW, UNM (unary minus), NOT, CONCAT (string concatenation of a bunch of registers), JMP, EQ, LT, LE, TEST, CALL, TAILCALL, RETURN, FORLOOP, TFORLOOP, TFORPREP, SETLIST, SETLISTO, CLOSE, and CLOSURE.

CALL passes a range of registers to a function and stores its result in a range of registers; this implies that the virtual machine handles saving and restoring of the stack frame. The paper uses the term "register window" to compare it to what the SPARC does.

...

Here's their example code to show how much better the register machine is:

    local a, t, i   LOADNIL 0 2 0
    a = a + i       ADD     0 0 2
    a = a + 1       ADD     0 0 250
    a = t[i]        GETTABLE 0 1 2

The old stack machine compiled this as follows:

    PUSHNIL 3
    GETLOCAL 0
    GETLOCAL 2
    ADD
    SETLOCAL 0
    GETLOCAL 0
    ADDI 1
    SETLOCAL 0
    GETINDEXED 2
    SETLOCAL 0

It seems that you should be able to compile this on a two-stack machine as NIL NIL DUP >R + 1+ R> NIL GETTABLE, which is 9 instructions instead of 11, and also clearly stupid, since nil is neither a table nor a number.

...

The MuP?21 and F21 instruction sets


((note: these are Chuck Moore Forth CPUs)

MuP?21..instruction set:

 Transfer Instructions:  JUMP, CALL, RET, JZ, JCZ
 Memory Instructions:    LOAD, STORE, LOADP, STOREP, LIT
 ALU Instructions:       COM, XOR, AND, ADD, SHL, SHR, ADDNZ
 Register Instructions:  LOADA, STOREA, DUP, DROP, OVER, NOP

The F21 had 27 instructions to the MuP?21's 24. (Only 23 are listed above, hmm.) They were renamed:

    Code  Name  Description                            Forth (with a variable named A)
    00    else  unconditional jump                     ELSE
    01    T0    jump if T0-19 is false w/ no drop      DUP IF
    02    call  push PC+1 to R, jump                   :
    03    C0    jump if T20 is false                   CARRY? IF
    06    RET   pop PC from R (subroutine return)      ;
    08    @R+   fetch from address in R, increment R   R@ @ R> 1+ >R
    09    @A+   fetch from address in A, increment A   A @ @ 1 A +!
    0A    #     fetch from PC+1, increment PC          LIT
    0B    @A    fetch from address in A                A @ @
    0C    !R+   store to address in R, increment R     R@ ! R> 1+ >R
    0D    !A+   store to address in A, increment A     A @ ! 1 A +!
    0F    !A    store to address in A                  A @ !
    10    com   complement T                           -1 XOR
    11    2*    left shift T, 0 to T0                  2*
    12    2/    right shift T, T20 to T19              2/
    13    +*    add S to T if T0 is true               DUP 1 AND IF OVER + THEN
    14    -or   exclusive-or S to T                    XOR
    15    and   and S to T                             AND
    17    +     add S to T                             +
    18    pop   pop R, push to T                       R>
    19    A     push A to T                            A @
    1A    dup   push T to T                            DUP
    1B    over  push S to T                            OVER
    1C    push  pop T, push to R                       >R
    1D    A!    pop T to A                             A !
    1E    nop   delay 2ns                              NOP
    1F    drop  pop T                                  DROP

T is top-of-stack; R is top-of-return-stack; S is the element right under the top of stack. I think @R+ and !R+ are two of the three new instructions; push and pop are probably the other one, since they don't seem to be listed in the MuP?21 list.

I'm not sure what +* is for, but I'm guessing it was ADDNZ.
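
On +*: it's generally described as a multiply-step primitive -- pair each +* with a 2/, and n repetitions compute an n-bit shift-and-add multiply, with the multiplier in T's low bits and the multiplicand pre-shifted into the high half via A or the stack. A Python sketch of that reading (my reconstruction, not from the post):

```python
# Model of +* ("add S to T if T0 is true") as a multiply step.
# To multiply a * b with n-bit operands: T starts as the multiplier b,
# S holds the multiplicand shifted up into the high half; each round
# is one +* followed by one 2/ (right shift of T).
def multiply(a, b, n=8):
    t = b            # multiplier occupies the low n bits of T
    s = a << n       # multiplicand aligned to the high half
    for _ in range(n):
        if t & 1:    # +*: add S to T if T0 is set
            t += s
        t >>= 1      # 2/: shift the partial product down
    return t

# multiply(3, 5) == 15
```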

I'm not sure where the else, T0, and C0 instructions jump to; maybe the next address on the operand stack.

"

it's notable how similar the MuP?21, F21, and ColorForth? (not mentioned in the post) instruction sets are. Some things i notice about them:

---

for JIT, the runtime wants to note how many times a given piece of code has been run (maybe just 4 states: once, twice, three times, more than three times) so that it can consider compiling it (or optimizing it more). We used to have space in the 64-bit OVM encoding for that, but now we don't, since we're now using 4 whole bits just for instruction format. But i had an idea on what you can do instead: every now and then, put in an annotation. This annotation is just empty space for the runtime to write how many times it has visited. This actually solves two problems at once, because in addition to providing space, it also tells the runtime when to record, so it doesn't have to record every instruction, nor does it have to count instructions.
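
A minimal sketch of the idea (all names and encodings are assumptions for illustration): the code stream contains ANNOTATE words, each followed by a scratch word, and the interpreter bumps a saturating counter in that scratch word each time it passes:

```python
# Annotation words in the code stream carry their own visit counters,
# so hot code identifies itself without per-instruction bookkeeping.
ANNOTATE = 0xFF     # made-up opcode value
SATURATE = 3        # counter values: 1 = once, 2 = twice, 3 = three or more

def visit(code, pc):
    """If code[pc] is an annotation, bump its counter; return the count."""
    if code[pc] == ANNOTATE:
        if code[pc + 1] < SATURATE:
            code[pc + 1] += 1   # write the count back into the code stream
        return code[pc + 1]
    return 0
```

After three visits the counter pins at 3, which the runtime can read as "hot enough to consider compiling".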

---