proj-oot-ootLovmNotes2

if we have smallstacks, then i feel like we're not making enough use of the potential for smallstacks to give us zero-operand instructions that can be encoded tightly.

can we use the operands in an ordinary 16-bit Boot instruction to encode this? (i don't want to throw this into BootX though b/c it's purely for perf, and b/c if you throw that in, why not throw in the 8-bit format, etc)

consider:

fixed_opcode op0 op1 op2

op0+op1+op2 = 9 bits

so, yes. as noted elsewhere, instead of zero-operand instructions i think we can do a lot more with 2-bit operands, b/c then each operand can still address something:

if each 'operand' is 2 bits then we have used 6 bits, and we have 3 bits left for the opcode. So we can encode 8 instructions this way.
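A sketch of that split in Python (the field order and exact positions here are my assumption, not anything settled):

```python
def decode_compact(operand_bits):
    """Split the 9 operand bits of a 16-bit Boot instruction into a
    3-bit sub-opcode plus three 2-bit operands. Field order (sub-opcode
    in the high bits) is an assumption for illustration."""
    sub_op = (operand_bits >> 6) & 0b111
    op0 = (operand_bits >> 4) & 0b11
    op1 = (operand_bits >> 2) & 0b11
    op2 = operand_bits & 0b11
    return sub_op, op0, op1, op2
```

The 3-bit sub-opcode gives exactly the 8 instructions mentioned above.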

we already have CP and LD taken care of in the 8-bit format (argh, should that be added in at this point now too? if so then what's the difference between lovm and bootx, really? if we already have smallstacks and multiple instruction formats... i think this should be postponed until LOVM). So how about:

Recall (from the list in lovmNotes1) that the most frequent instructions seemed to be:

10 mv, 11 ld, 2 lm, 2 addi, 2 call; 3 mv, 8 ld, 3 lm, 5 cond branch or compare, 2 add

lm isn't useful here because we'd have to have the immediate in the following 8 bits, and we can already do a lm9 using Boot instructions. Similarly for addi, although maybe inc would be useful? So some suggestions are:

those are popular instructions but should also consider what would be needed for a 'minimal subset'

eh hold on a minute, how is this at all helpful? we can do all that with ordinary 16-bit Boot instructions...

---

ruminations on the 8-bit encoding:

  1. ## 8-bit instruction encoding

The two most-significant bits are always 10.

i feel like this could be a lot more useful -- isn't part of the whole point of the smallstacks so that we can have a bunch of zero-operand instructions that can be encoded tightly?

cp where the src and dest registers are equal is useless, so can define 4 more 0-operand instructions that way. Actually 8, because for most other instructions we don't need the register bank bit. OK, now we're talking. Recall that the most frequent instructions seemed to be:
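A quick sanity check of that count (a sketch, assuming 2-bit register operands and one bank bit, per the discussion above):

```python
# encodings of "cp r, r" with src == dest are useless as copies;
# each of the 4 addressable registers, with and without the bank bit,
# frees one encoding slot for a 0-operand instruction
freed = [(bank, r) for bank in (0, 1) for r in range(4)]
```

4 src==dest pairs times 2 bank-bit values = the 8 reclaimable slots.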

10 mv, 11 ld, 2 lm, 2 addi, 2 call; 3 mv, 8 ld, 3 lm, 5 cond branch or compare, 2 add

lm isn't useful here because we'd have to have the immediate in the following 8 bits, and we can already do a lm9 using Boot instructions. Similarly for addi, although maybe inc would be useful? So some suggestions are:

however how do we make those 2-operand? mb we need to reinterpret the src and dest register bits to make them 3-operand.. but then we have an extra bit...

should we make the last 8-bit accessible register a temporary instead of the stack pointer?

i dunno, maybe so much for regularity. Maybe we should go back to just having a table of 64 common instructions. Maybe making some effort to encourage smallstack usage there. Maybe noting that if we have 4 common registers, then complete cp and ld between them would take up 56 spots.

note that we already have dup: cp smallstack tos. if we make one of the registers 0, then we also have drop: cp 0 smallstack. altho mb just add drop and swap

well i dunno that means we can stick with the original idea for a regular cp and ld and now we have a table of 8 common zero-operand instructions, presumably with a lot of smallstack usage in there. So consider:

hmm maybe regularize by only having fmas (addr mode-ish calcs):

or, maybe have predecrement/postincrement instructions useful for loops:

or:

(see Adrian Bocaniciu comment in ootAssemblyNotes28)

or maybe have:

also note that lp tos tos = lp smallstack smallstack; so this frees up another spot (the same isn't true for l32)

i'm leaning towards something like:

since we have both this addrmode-ish computation in 8 bits, and l32/lp in 8 bits, this means that we can do an incrementing-addr-mode-ish load in 16 bits, which could be helpful for compact loops (see Adrian Bocaniciu comment in ootAssemblyNotes28).

we already have dup via cp smallstack tos. so choosing between swap, drop, over, it looks like swap is broadly common in plChAssemblyFrequent... and https://users.ece.cmu.edu/~koopman/stack_computers/sec6_3.html lists it as common both statically and dynamically.

---

If ENTRY assumes that the memstack pointer is the place to spill to, then there are three operands free, and we have enough space for two bits for each of: {int, ptr, fp} x {caller, callee} x {reg, stack}.

---

removed for now (RESERVED for later)

"

  1. ## 8-bit instruction encoding

The all-zero instruction (8 bits of zeros) is illegal.

scheme for CP and LW

   -- note: register 2 is no longer TOS, but we want TOS here; so when the register bits say 2, instead we mean TOS

except:

      1. 16-bit instruction encoding

The most-significant-bit (the first bit of the first byte) is always 0.

Note that op0 spans two bytes.

The 16-bit instruction encoding represents the Boot instructions. The interpretation of instructions from opcodes and operands is identical to Boot's. Note that Boot code can only access the first 8 registers in each bank, and cannot access the smallstacks.

"

---

old/removed for now

  1. # LOVM calling convention

Register roles:

here's from BootX; do we keep this and then extend? probably..

Registers:

should extend the 8 global regs with:

maybe the first 8 should be global and the last 8 should be output regs, so at least 16 total (0-15)? If no others, then the new output regs = the current 8 input regs; otherwise the new output regs will be the last 8 regs. OR should the # of output regs be specified below (that's what i did for now)?

When a function is called, typically the callee begins with the ENTRY instruction, which takes 3 arguments:

   for each of the first 6 fields, the encoding is:
      00: 0
      01: 2
      10: 4
      11: 8
   for each of the last 6 fields, the encoding is:
      00: 0
      01: half as many as the corresponding callee-save spots/argument spots
      10: twice as many as the corresponding callee-save spots/argument spots
      11: as many as possible (given the 32 registers/smallstack item limit)
        umm this isn't quite good

i guess for each of the 6 regions, you want something like:

   0000: both zero
   0001: callee 0, caller 2
   0010: callee 2, caller 0
   0011: callee 2, caller 2
   0100: callee 0, caller 4
   0101: callee 4, caller 0
   0110: callee 4, caller 4
   0111: callee 2, caller 4
   1000: callee 4, caller 2
   1001: ???
   1010: callee 8, caller 4
   1011: callee 4, caller 8
   1100: callee 8, caller 8
   1101: callee 8, caller 16
   1110: callee 16, caller 8
   1111: callee 16, caller 16 (only applicable to the stacks, b/c the regs have 8 regs spoken for already, and the limit is 32 total)

probably there's some way to regularize the above
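In the meantime, a table-driven decode is simple enough (a sketch built from the 4-bit codes above; `ENTRY_FIELD` is a hypothetical name, and 1001 stays undefined):

```python
# (callee-save spots, caller/argument spots) per 4-bit code,
# taken from the encoding list above; 0b1001 is still undecided
ENTRY_FIELD = {
    0b0000: (0, 0),  0b0001: (0, 2),  0b0010: (2, 0),  0b0011: (2, 2),
    0b0100: (0, 4),  0b0101: (4, 0),  0b0110: (4, 4),  0b0111: (2, 4),
    0b1000: (4, 2),                   0b1010: (8, 4),  0b1011: (4, 8),
    0b1100: (8, 8),  0b1101: (8, 16), 0b1110: (16, 8), 0b1111: (16, 16),
}
```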

If any callee-save registers are requested, ENTRY spills them to the memory stack pointed to by the stack pointer (see section below on memory stack layout).

After ENTRY, the 16 OUTPUT

CALL

RET

---

old

we have a little bit of a problem with branch range.

RISC-V has 12-bit target offsets in their conditional branch instructions (+-4 KiB range, because they are signed offsets in units of 2 bytes) and https://people.eecs.berkeley.edu/~krste/papers/EECS-2016-1.pdf figure 5.6 shows that branch offsets up to 8 bits wide can be used very effectively (i think that 8 bits includes the sign bit, so this would yield a range of +-256). But we only have 7 bits per operand, which means our conditional branches could use at least one more bit.

I suggest consuming an extra set of branch opcodes and representing forward and reverse conditional branches separately. Now we would get a range of +-256.
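The ranges above follow directly from field width and offset unit (a small sketch):

```python
def branch_range(offset_bits, unit_bytes):
    """Byte range reachable by a signed branch offset field of the
    given width, counted in the given units."""
    half = 1 << (offset_bits - 1)
    return (-half * unit_bytes, (half - 1) * unit_bytes)
```

E.g. the RISC-V 12-bit offset in 2-byte units reaches -4096..+4094 bytes, and an 8-bit signed offset in 2-byte units reaches roughly +-256.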

done

---

old

addr modes: +-32 immediate, 32 registers, 16 stack, 8 register indirect by register, 8 register indirect by stack (or should we do 16 register indirect by just register). Also, for ptr operands, we need to do something else for 'immediate' I guess; maybe here we can have offset (by either ptr or int32 scale) or indexed or post/pre inc/dec.

so:

alternatively, mb just have 0 thru 31 smallstack locs so:

this allows us to have up to ~60 program variables of each type, and it's also simpler. The disadvantage is that the large smallstack size probably prohibits keeping the whole smallstack in registers and using register MOV chaining to quickly push/pop (although this could still be done as a special case when a function allocates a small smallstack, which will be the common case).

i guess simpler is better, so let's go with this.

one remaining question is whether it's worth the complexity to avoid having two ways to say immediate zero (and to gain immediate +32). It might be simpler to just say: if the value is >= 64, subtract 64 (or, bitwise-AND with 63) then sign-extend that 6-bit twos-complement representation. That way there's no extra conditional.
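A sketch of that decode (assuming, per the above, a 7-bit operand where values >= 64 denote immediates; the function name is hypothetical):

```python
def decode_imm(v):
    """Decode a 7-bit operand value v (64..127) as a signed immediate:
    drop the 'immediate' tag bit, then interpret the remaining 6 bits
    as two's complement (-32..31)."""
    x = v & 63                      # same effect as subtracting 64 here
    return x - 64 if x >= 32 else x  # 6-bit sign extension
```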

According to a sampling of some of https://www.agner.org/optimize/instruction_tables.pdf , on various x86 archs, MOVSX (move with sign extend) is just as quick as MOV, which is just as quick as ADD.

Perhaps it would be even simpler to treat immediate, rather than register, as the 'default'. So:

i gotta say, that's simpler to specify. Let's do it.

actually i think sign-extending 7 bits could be annoying (and notice how the representation of an unsigned immediate gets messed up). How about specifying everything in terms of unsigned 7 bits:

ok but let's think about how we will process the immediate mode constant, since this will usually be signed.

Sign extending from i6 might not be cheap on some targets (might have to actually do something like: if >= 32 then result = or(result, 0xffffffe0); and branching (if) isn't fast). A sign bit might also require testing for the sign bit and branching. So a bias might be best. So:

well actually i think 'sign extension' is always branchless as long as we have left shift: left shift 2 bits, then AND with 128, then OR with the original value -- actually that's wrong, b/c e.g. x=32; ((x << 2) & 128) | x = 160 (the correct result is 224) [1]. Something tells me that will be even worse, however. I guess could do:

x*(1 - ((x&32) >> 5)) + -(64-x)*((x&32) >> 5)

that works and has no conditionals but it's a bit ugly and it has multiplication instead! In Python a conditional is probably best.
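Comparing the candidate decodes in Python (a sketch; the first two decode the 6-bit two's-complement encoding, the third decodes the bias encoding where stored 32 means 0):

```python
def decode_conditional(x):
    # plain conditional: raw values 32..63 wrap to -32..-1
    return x - 64 if x >= 32 else x

def decode_multiply(x):
    # the branchless multiply version from above
    b = (x & 32) >> 5
    return x * (1 - b) + -(64 - x) * b

def decode_bias(x):
    # bias encoding: stored value = real value + 32
    return x - 32
```

The conditional and multiply versions agree on all 64 raw values; the bias version is a single subtract but changes the encoding itself.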

this encoding would be:

in x86 i verified with https://carlosrafaelgn.com.br/asm86/ that:

    sal edx, 26
    sar edx, 26

seems to work. But yknow what is just as easy? sub edx, 32 (for the bias encoding where 32 is 0 and 0 is -32). And as i noted earlier, on intel the timing of add is the same as shifts.

What about AArch64? On Cortex-A57, https://developer.arm.com/documentation/uan0015/b/ shows it's the same (ADD/SUB and immediate shifts are just as fast as each other).

So, bias takes one less instruction in assembly, is not slower, and is easier in Python. I'm going to go with bias.

no, i changed my mind about this. 0 is an important special case sometimes, and there may be times when it's faster to recognize 0 (e.g. with "bz", "bnz") than to recognize 32. The decode is worse without bias but it's not terrible. So we'll do:

done

---

regarding

"- cmp and branch-on-status register-like instructions (to allow larger branch immediate; mb use R1 (S0) as fixed 'status register' location)"

mb consider this comment: "(Note that this is still better than ISAs which write flags to a GPR and then branch upon the resulting flags)" [1] - mb ask er why/what e meant?

--- old

on an old change from LOVM from a 64-bit encoding with more features, to just a 32-bit encoding which is a 'basic assembly language':

"

---

-- should we require that at any given point in the code, the smallstack sizes (relative to the previously encountered ENTRY) are always fixed (eg you could make 'stack maps')? that might make exception handling/debugging hard..

no.. i think that kind of restriction is for OVM. LOVM offers structured control flow but doesn't rely on it.

---

old

section "%rip-relative addressing" of https://cs61.seas.harvard.edu/site/2018/Asm1/ explains that, yeah, we probably do want a PC-relative LOAD somewhere. Right now in Boot we can do this already; use lpc to load the address, then LOAD from that. But remember that for addr modes

ok i added it to the list of addr-modes-like-instructions to consider

---

old

---

old

yknow i think we should just make the smallstacks conceptually a cache of part of the tip of the stack.

define exactly when they are written to the stack and the stack ptr is updated; BUT the implementation is free to store them below the SP before that; someone (the implementation? the program?) should ensure that the stack has space for that.

this probably means that the program shouldn't read or write the SP in between allocating smallstacks and flushing them. Dunno about that though; what if the program really has so many variables that it wants to store some stuff on the stack? We should accommodate that somehow. If the implementation saves a copy of SP at that time then i guess we're good. How about if we just say that (a) the area below (the stack pointer at the time of smallstack allocation) is reserved for the implementation, (b) until smallstacks are flushed, SP can be read but not written, although it's fine to allocate new smallstack stuff; so upon a function call, if all you need to do is call ENTRY, you're good, but if you need to manually move the stack pointer down to make room for your own manual stuff, then you must flush smallstacks first.

if the implementation wants to change the SP to keep track of stuff in the meantime, it needs to make a copy of the SP at time of allocation so that the program doesn't see the changes, and then it can hide and change the target platform's 'real' SP.

This way all smallstacks are in the main stack, so there's only one. The stack layout of the various caller-save, callee-save, smallstacks is standardized/defined. ENTRY etc aren't (completely) magical. Now since there's only one stack, we could still add a second stack for return addresses, although i guess that should be visible too.

---

---

brandmeyer 1 day ago [–]

rotate-and-xor (and xor-and-rotate) are both common operations in ARX ciphers. They demand 4 macro-ops in RISC-V, but only one in ARMv8.

Bitfield insertion is only one instruction in most RISC ISAs, but 5 or more in RISC-V.

reply

Veedrac 1 day ago [–]

The (unfinalized) bitmanip extension has single-op rotates.

reply

CalChris 1 day ago [–]

I really don't like it. It seems like they copied x86 (bext, bdep) where they should have been plagiarizing armv8 (BFM, ...).

reply

brandmeyer 1 day ago [–]

I've watched that extension's development since it was little more than one smart guy's wish list. I don't think its fair to say that they copied any one architecture. The authors have put in a ton of time researching the tradeoffs and investigating the trade space over the years.

That said, until it gets ratified by the consortium and implemented in silicon its still just a (well-researched) wishlist.

reply

---

pizlonator 1 day ago [–]

The lack of condition codes is a big deal for anyone relying on overflow checked arithmetic, like modern safe languages that do this for all integer math by default, or dynamic languages where it’s necessary for the JIT to speculate that the dynamic “number” type (which in those languages is either like a double or like a bigint semantically) is being used as an integer.

RISC-V means three instructions instead of two in the best case. It requires five or more instead of two in bad cases. That’s extremely annoying since these code sequences will be emitted frequently if that’s how all math in the language works.

---

pizlonator 1 day ago [–]

I mean you will do overflow checks on the following. I’ll use the “s” and “u” prefixes to mean signed and unsigned. Unsigned matters less than signed.

sadd32, sadd64, uadd32, uadd64, ssub32, ssub64, usub32, usub64, smul32, smul64, umul32, umul64

reply

---

lists some more things we should include, toread:

https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-130.pdf

---

" RISC-V has some closely-related sharp corners in indexed address arithmetic as well. Some choices for the type of the index variable perform much worse on rv64.

Consider: an LP64 machine uses 32-bit integers for 'int' and 'unsigned', but 64-bit integers for `long`, `size_t`, `ptrdiff_t` and so on.

If you use an array index variable of type `unsigned`, then the compiler must prove that wraparound doesn't happen. That's pretty weird considering that half the point of using unsigned is to elide such proofs of correctness. If it cannot prove the absence of unsigned wraparound, then it will be forced to emit zero-extension sequences prior to using the index variable to generate the addresses.

ARMv8 side-steps the whole problem by providing indexed memory addressing modes that include the complete suite of zero and sign extension of a narrow-width index in the load or store instruction itself. "

fulafel 23 hours ago [–]

Nitpick: s/LP64 machine/LP64 C implementation/

But isn't size_t (or ptrdiff_t) the preferred indexing type in C for this reason (among others)? Sometimes you of course do want wrap around modulo semantics but that's much rarer, right?

reply

saagarjha 1 day ago [–]

> If you use an array index variable of type `unsigned`

This is usually why your array indexing should be done with an iterator or size_t :)

reply

_chris_ 1 day ago [–]

The problem is that providing extra bits for "sign-extension mode" and "read 32b or 64b" blows through the opcode space very quickly.

reply

---

brandmeyer 1 day ago [–]

(jumping up the thread to try and hop over some confusion...)

The problem isn't with unsigned types generally. Its with subregister unsigned types. So, size_t and uintptr_t are fine. uint32_t, uint16_t, uint8_t (on LP64 ABIs) are pessimized and demand zero-extension instructions (or proofs that they can be safely elided) prior to causing side-effects. uint64_t on a LP128 ABI would also be problematic.

signed 32-bit int is also fine... because RISC-V specifically has a suite of arithmetic instructions that unconditionally sign-extend from bit 31. Even without those, it would still be fine because the carve-out for undefined behavior is wide enough for INT_MAX+1 to remain positive. Same thing for all of the other narrow-width signed integer types. If you increment SHRT_MAX and then use it to index memory, its perfectly legal undefined behavior to access base + SHRT_MAX+1 instead of base + SHRT_MIN.

However, that's not legal for the unsigned types. They are all mandated to wrap in 2's complement. base + UINT_MAX+1 must access base + 0 when the index is `unsigned int`, even on a 64-bit machine.

reply

a1369209993 1 day ago [–]

> base + UINT_MAX+1 must access base + 0 when the index is `unsigned int`, even on a 64-bit machine.

Ironically, given the topic, that's actually not true, because `base + UINT_MAX+1` is `(base + UINT_MAX)+1` (with a pointer, not a unsigned int, as the temporary value). That should probably be `base + (UINT_MAX+1)`.

reply

brandmeyer 8 hours ago [–]

This whole spiel is only relevant when the programmer specifies an array index in a distinct variable.

reply

---

zozbot234 1 day ago [–]

The RISC-V spec includes recommended code sequences to check for overflow, so that the hardware can potentially use insn fusion as an optimization. The "bad" cases you mention can be a bit clunky, but they should also be rare.

reply

pizlonator 1 day ago [–]

I’m aware of those sequences and it’s a myth that they will be rare. For dynamic languages they will be super common.

reply

brandmeyer 1 day ago [–]

We know the origins of that myth by examining the papers that the RISC-V designers wrote. They got a C compiler back-end working and didn't incorporate any other languages in their benchmarking corpus.

reply

---

" I think the only mistake was in finalizing the ISA without any support for checked arithmetic. My belief is that doing it well will not be orthogonal to the rest of the ISA's design, and therefore is a poor candidate for an extension. "

-- brandmeyer [2]

---

the5avage 22 hours ago [–]

At the time C was designed not all machines did represent signed integers in 2 complement. Therefore it was not possible to define behavior for signed overflow. They probably should change that 2020^^

GCC has intrinsics for integer math with overflow checks

reply

ncmncm 13 hours ago [–]

C++ formalized two's complement already. Formalizing power-of-two word size and 8-bit bytes might come.

reply

---

jhallenworld 1 day ago [–]

Maybe overflow checking could included as an ISA extension. If it is included, what is the least impactful design?

Overflow is part of the result, so maybe include extra bits to each register that can be arithmetic destination. These bits are not included in moves, but could be tested with new instructions.

Another way that avoids flags is new arithmetic instructions: add but jump on overflow. Maybe this is reduced to add and skip next instruction except for overflow, but maybe things are simplified if the only allowed next instruction is a jump, so the result is a single longer instruction.

reply

jhallenworld 1 day ago [–]

After thinking about this some more: I think the extension instruction should work like "slt" (set on less than). So we have "sov"- set if add would overflow:

    add t2, t1, t0
    sov t3, t1, t0
    bnez t3, overflow

Why this way? "extra bits on destination registers"- this is really flags. The flags have to be preserved during interrupts, so extending the registers is not so easy (I think it just reduces to classic flags).

"add but jump on overflow" or "add and skip on no overflow"- I don't like this because you can not break it into separate operations without flags. I think you might have to add hidden flags in a real implementation.

An add followed by an sov could be fused, but requires an expensive multi-register write. Fusing maybe could be more likely if the destination is always to a fixed destination register:

    add t2, t1, t0
    sov tflags, t1, t0
    bnez tflags, overflow

reply

wbl 1 day ago [–]

Control bits as in ARM and x86 force serialization of arithmetic due to the RW dependency in every instruction on that bit. There are some tricks but it still needs tracking. For higher order superscalar or out of order processors this gets annoying.

reply

ansible 1 day ago [–]

Yes, the old, old way of having a single condition code register or the like (which dates back 40+ years) doesn't work well these days.

...

wbl 1 day ago [–]

That's one of the tricks. But it doesn't solve the issue of clobbers, which Intel had to introduce new variants of ADD and MUL to solve. Named predicate registers make it all much easier for everyone.

reply

tom_mellior 1 day ago [–]

ARM has separate instruction variants with and without setting of flags. Normally one uses the flag-less versions, so you don't have this problem.

reply

---

so i'm thinking:

---

this guy indep.ly had the same idea as one of those ideas:

spacenick88 1 day ago [–]

I wonder how this interacts with branch prediction. Since overflows should happen very rarely I guess the branch on overflow should almost always predict as non taken. So wouldn't it be possible to have a "branch if add would overflow" instruction or even canonical sequence that a higher end CPU can completely speculate around and just use speculation rollback if it overflows?

...

pizlonator 1 day ago [–]

And yeah, it’s true that the overflow check is well predicted. And yeah, it’s true that what arm and x86 do here isn’t the best thing ever, just better than risc-v.

this guy warns that extra branching should be avoided tho:

brandmeyer 1 day ago [–]

The current world record holder (in the published literature) for branch prediction is TAGE and its derivatives. The G stands for Geometric. It is composed of a family of global predictors that increase in length with a geometric progression. That's somewhat relieving since it means that the storage growth is not unlike that of mipmapping in computer graphics. A small constant k times maximum history length N.

But to a first approximation, if you double the density of conditional branches in the program, then you will need to roughly double the size of the branch prediction tables to get the same performance, even if all of them are correctly predicted 100% of the time.

reply

---

implementation detail:

bertr4nd 1 day ago [–]

I’d be curious to see the instruction sequences for handling overflow without condition codes. I’m not even sure I see how to do it as efficiently as 3 or 5 instructions :-/

reply

pizlonator 1 day ago [–]

One example of 3 is branching on 32-bit add overflow on a 64-bit cpu where you do a 32-bit add, a 64-bit add, and compare/branch on the result.

reply
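That 3-instruction pattern (narrow add, wide add, compare) can be mimicked in Python by making the 32-bit wrap explicit (a sketch; Python ints are unbounded, so the "32-bit add" is simulated with modular arithmetic):

```python
def add32_overflows(a, b):
    """True if the 32-bit signed add of a and b overflows: compare the
    exact sum (the 64-bit add) against the wrapped 32-bit sum."""
    exact = a + b                                # the 64-bit add
    wrapped = ((a + b + 2**31) % 2**32) - 2**31  # the 32-bit add, wrapped
    return exact != wrapped
```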

---

Veedrac 1 day ago [–]

Mostly the concern around the lack of instructions in RISC-V revolves around a few well-known cases (eg. indexed loads) where the instructions to fuse are pretty canonical.

done

---

(about RISC-V)

bonzini 1 day ago [–]

The worst issue, at least for the versions of the ISA that will run a "real" OS, are the lack of conditional move instructions and lack of bitwise rotation instructions. Lack of shift-and-sum instructions or equivalently addresses with shifted indexes is usually mitigated by optimization of induction variables in the compiler. They are nice to have (I have written code where I took advantage of x86's ability to compute a+b*9 with a single instruction) but not particularly common with the massive inlining that is common in C++ or Rust.

The ugly parts are indeed all ugly, though they have now added hint instructions.

---

Maybe give up on the idea of abstracting activation frames in lovm. That could be done in OVM.

later: well, i think we can introduce the abstractions without banning direct access to the stack (because Boot can access the stack directly, and Boot is a subset of LOVM). OVM can then take away the direct access but keep the abstractions.

---

in Zig, you can give #define settings like function arguments when importing a file:

"
    const c = @cImport({
        @cDefine("_NO_CRT_STDIO_INLINE", "1");
        @cInclude("stdio.h");
    });

    pub fn main() void {
        _ = c.printf("hello\n");
    }
"

---

QBE lets users define composite types! cool. Maybe something we should do in Lo?

---

the MIR project looks pretty great.

https://github.com/vnmakarov/mir https://github.com/vnmakarov/mir/blob/master/MIR.md

Some notes:

Although it says it is lightweight, it is probably still more heavyweight than we want. https://github.com/vnmakarov/mir/blob/master/HOW-TO-PORT-MIR.md says that it will probably take 1 month of work for an experienced person to port MIR to a new backend.

It's MIT licensed.

The author also reviews and contrasts QBE, LibJIT, and others at https://github.com/vnmakarov/mir#mir-project-competitors . The MIR compiler is about 16K LOC ( https://github.com/vnmakarov/mir#current-mir-performance-data ). QBE is about 10K LOC. The others have much more LOC.

Conclusions:

--- design motivation doc

LOVM should be:

---

mb 10 bits 'registers' and banking (above 32) is enough for SSA if the original variables are 8 bits (non-banked):

---

design motivation doc

opaque activation records, so that an implementation can choose to:

- implement the BootX 'native' activation record layout (with our link register rather than return addr on the stack at the memory-top of the stack frame, our # of arguments passed on the smallstacks, our saved caller's frame pointer just below the return address, etc), or
- implement some sort of native calling convention and stack frame layout, or
- have our function calls on top of some HLL abstract 'call stack'

--- ops

double-wide CAS?

--- encoding

on how many operands to have: if new LOVM operands are 8 bits then we can fit two extra ones

Could have 64-bit LOVM format that has 1024 or 4096 locals, three or four fields of 12+4 bits or 10+4 bits and 8 bits. Defer a bit in 16 bit format to this (later: i don't understand what i meant by this sentence; probably 'devote' instead of 'defer' (a lot of this was dictated/transcribed by the poor-quality AI in my phone etc)). This can encapsulate the standard library with regards to context switching etc

on encoding types: maybe LOVM polymorphism only specifies types on the two source operands. Now with eight bits we can specify a choice between: source operands distinct, source operands the same, dynamic type, aggregate type, each with six bits except separate type literals (source operands distinct) only have three bits each

---

"...several existing well-understood design families for minimal syntax: Lisp-like, Forth-like, APL-like" [7]

---

What's the purpose of OVMlow again? I had said a more convenient language to implement garbage collection and other language services in, but really that is a property of the high-level language (Oberon-like?) that compiles to that implementation.

But another purpose might be to have something that the "trusted" language implementation can use to write stuff that bypasses the garbage collection and preemptive concurrency stuff which is implicitly enforced in OVMhigh -- sort of an "inline assembly" for OVMhigh (LLVM-like). And yet another purpose could be as a compilation target. For these latter two purposes perhaps we should enforce SSA and also have a CFG? So perhaps OVMlow is LLVM-like. Or mb QBE- or MIR-like. Or mb eliminate OVMlow altogether -- if you want to do fancy stuff that bypasses the conventions of OVMhigh while implementing oot core, perhaps you have to directly implement OVMhigh on your platform.

Another purpose for OVMlow could be as a transpilation IL used when the ultimate target platform doesn't directly support garbage collection, eg Rust or C. Also, OVMlow could provide instructions for anything that in something like C would be done with inline assembly, to remove the temptation to ever use inline oot assembly in the OVMhigh or oot core implementation.

also recall that OVM was supposed to be able to hold higher-level things like objects in its registers. And that it was supposed to be able to do simple operations using the same encoding as oot assembly. And that OVM instructions were supposed to be implemented in LOVM (Forth-like metaprogramming could be useful here).

---

---

forth-style metaprogramming is appropriate for any language that is linear, for example an assembly language or virtual machine. So mb oot assembly, or mb ovm

---

PE executable format variant in .NET Micro Framework PE file Format

" Major differences from ECMA-335

    The number and size of the metadata tables is limited in NETMF to keep the overall memory footprint as low as possible.
    Since NETMF is designed to operate without an OS the Windows PE32/COFF header, tables and information is stripped out
    Switch instruction branch table index is limited to 8 bits
    Table indexes are limited to 12 bits
        This also means that the metadata tokens are 16 bits and not 32 so the actual IL instruction stream is different for NETMF
    Resources are handled in a very different manner with their own special table in the assembly header" -- [8]

---

does it make sense to have a C-like/Forth-like/Lisp-like/Oberon-like low-level HLL that compiles to the Lovm virtual machine?

What would be the purpose of the low-level HLL? It would be a language in which the following could be implemented, at least for the reference implementation:

One question would be, why not just write this stuff in (higher-level language) Oot? Answers might include:

An obvious argument against having a low-level HLL here is that we already have languages like C, Rust, Zig, Oberon, Forth, Hare that other people worked on that are probably better than what we would come up with.

arguments rebutting this depend on some of the decisions above:

some possibilities to keep in mind that might justify a new low-level HLL:

---


Footnotes:

1.

((x << 2) & 128) | x = 160. Recall that we want to do the sign extension operation, that is, copy bit 6 into bits 7 and 8. So y = x & 32; result = x | (y << 1) | (y << 2). I assume that would be easier for electronics than adding a bias, even though it's a little harder in assembly. To get rid of the y introduction, could do x | ((x & 32) << 1) | ((x & 32) << 2). But wait, this doesn't work in Python, it yields 224.

another method is shift left followed by arithmetic shift right. I think that would work on CPUs but not on Python because it has that arbitrary bit-width thing going on.

But in Python we can still do: int.from_bytes(int(x << 2).to_bytes(length=1, byteorder=sys.byteorder, signed=False), byteorder=sys.byteorder, signed=True) >> 2