proj-oot-ootLovmNotes2

if we have smallstacks, then i feel like we're not making enough use of the potential for smallstacks to give us zero-operand instructions that can be encoded tightly.

can we use the operands in an ordinary 16-bit Boot instruction to encode this? (i don't want to throw this into BootX though b/c it's purely for perf, and b/c if you throw that in, why not throw in the 8-bit format, etc)

consider:

fixed_opcode op0 op1 op2

op0+op1+op2 = 9 bits

so, yes. as noted elsewhere, instead of zero-operand instructions i think we can do a lot more with 2-bit operands, b/c then each operand can still address something:

if each 'operand' is 2 bits then we have used 6 bits, and we have 3 bits left for the opcode. So we can encode 8 instructions this way.
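A sketch of that split in Python (the field order and exact positions here are my assumption, not anything settled):

```python
def decode_compact(operand_bits):
    """Split the 9 operand bits of a 16-bit Boot instruction into a
    3-bit sub-opcode plus three 2-bit operands. Field order (sub-opcode
    in the high bits) is an assumption for illustration."""
    sub_op = (operand_bits >> 6) & 0b111
    op0 = (operand_bits >> 4) & 0b11
    op1 = (operand_bits >> 2) & 0b11
    op2 = operand_bits & 0b11
    return sub_op, op0, op1, op2
```

The 3-bit sub-opcode gives exactly the 8 instructions mentioned above.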

we already have CP and LD taken care of in the 8-bit format (argh, should that be added in at this point now too? if so then what's the difference between lovm and bootx, really? if we already have smallstacks and multiple instruction formats... i think this should be postponed until LOVM). So how about:

Recall (from the list in lovmNotes1) that the most frequent instructions seemed to be:

10 mv, 11 ld, 2 lm, 2 addi, 2 call; 3 mv, 8 ld, 3 lm, 5 cond branch or compare, 2 add

lm isn't useful here because we'd have to have the immediate in the following 8 bits, and we can already do a lm9 using Boot instructions. Similarly for addi, although maybe inc would be useful? So some suggestions are:

those are popular instructions but should also consider what would be needed for a 'minimal subset'

eh hold on a minute, how is this at all helpful? we can do all that with ordinary 16-bit Boot instructions...

---

ruminations on the 8-bit encoding:

  1. ## 8-bit instruction encoding

The two most-significant bits are always 10.

i feel like this could be a lot more useful -- isn't part of the whole point of the smallstacks so that we can have a bunch of zero-operand instructions that can be encoded tightly?

cp where the src and dest registers are equal is useless, so can define 4 more 0-operand instructions that way. Actually 8, because for most other instructions we don't need the register bank bit. OK, now we're talking. Recall that the most frequent instructions seemed to be:
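A quick sanity check of that count (a sketch, assuming 2-bit register operands and one bank bit, per the discussion above):

```python
# encodings of "cp r, r" with src == dest are useless as copies;
# each of the 4 addressable registers, with and without the bank bit,
# frees one encoding slot for a 0-operand instruction
freed = [(bank, r) for bank in (0, 1) for r in range(4)]
```

4 src==dest pairs times 2 bank-bit values = the 8 reclaimable slots.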

10 mv, 11 ld, 2 lm, 2 addi, 2 call; 3 mv, 8 ld, 3 lm, 5 cond branch or compare, 2 add

lm isn't useful here because we'd have to have the immediate in the following 8 bits, and we can already do a lm9 using Boot instructions. Similarly for addi, although maybe inc would be useful? So some suggestions are:

however how do we make those 2-operand? mb we need to reinterpret the src and dest register bits to make them 3-operand.. but then we have an extra bit...

should we make the last 8-bit accessible register a temporary instead of the stack pointer?

i dunno, maybe so much for regularity. Maybe we should go back to just having a table of 64 common instructions. Maybe making some effort to encourage smallstack usage there. Maybe noting that if we have 4 common registers, then complete cp and ld between them would take up 56 spots.

note that we already have dup: cp smallstack tos. if we make one of the registers 0, then we also have drop: cp 0 smallstack. altho mb just add drop and swap

well i dunno that means we can stick with the original idea for a regular cp and ld and now we have a table of 8 common zero-operand instructions, presumably with a lot of smallstack usage in there. So consider:

hmm maybe regularize by only having fmas (addr mode-ish calcs):

or, maybe have predecrement/postincrement instructions useful for loops:

or:

(see Adrian Bocaniciu comment in ootAssemblyNotes28)

or maybe have:

also note that lp tos tos = lp smallstack smallstack; so this frees up another spot (the same isn't true for l32)

i'm leaning towards something like:

since we have both this addrmode-ish computation in 8 bits, and l32/lp in 8 bits, this means that we can do an incrementing-addr-mode-ish load in 16 bits, which could be helpful for compact loops (see Adrian Bocaniciu comment in ootAssemblyNotes28).

we already have dup via cp smallstack tos. so choosing between swap, drop, over, it looks like swap is broadly common in plChAssemblyFrequent... and https://users.ece.cmu.edu/~koopman/stack_computers/sec6_3.html lists it as common both statically and dynamically.

---

If ENTRY assumes that the memstack pointer is the place to spill to, then there are three operands free, and we have enough space for two bits for each of: {int, ptr, fp} x {caller, callee} x {reg, stack}.

---

removed for now (RESERVED for later)

"

  1. ## 8-bit instruction encoding

The all-zero instruction (8 bits of zeros) is illegal.

scheme for CP and LW

   -- note: register 2 is no longer TOS, but we want TOS here; so when the register bits say 2, instead we mean TOS

except:

      1. 16-bit instruction encoding

The most-significant-bit (the first bit of the first byte) is always 0.

Note that op0 spans two bytes.

The 16-bit instruction encoding represents the Boot instructions. The interpretation of instructions from opcodes and operands is identical to Boot's. Note that Boot code can only access the first 8 registers in each bank, and cannot access the smallstacks.

"

---

old/removed for now

  1. # LOVM calling convention

Register roles:

here's from BootX; do we keep this and then extend? probably..

Registers:

should extend the 8 global regs with:

maybe the first 8 should be global and the last 8 should be output regs, so at least 16 total (0-15)? If no others, then the new output regs = the current 8 input regs; otherwise the new output regs will be the last 8 regs. OR should the # of output regs be specified below (that's what i did for now)?

When a function is called, typically the callee begins with the ENTRY instruction, which takes 3 arguments:

   for each of the first 6 fields, the encoding is:
      00: 0
      01: 2
      10: 4
      11: 8
   for each of the last 6 fields, the encoding is:
      00: 0
      01: half as many as the corresponding callee-save spots/argument spots
      10: twice as many as the corresponding callee-save spots/argument spots
      11: as many as possible (given the 32 registers/smallstack item limit)
        umm this isn't quite good

i guess for each of the 6 regions, you want something like:

   0000: both zero
   0001: callee 0, caller 2
   0010: callee 2, caller 0
   0011: callee 2, caller 2
   0100: callee 0, caller 4
   0101: callee 4, caller 0
   0110: callee 4, caller 4
   0111: callee 2, caller 4
   1000: callee 4, caller 2
   1001: ???
   1010: callee 8, caller 4
   1011: callee 4, caller 8
   1100: callee 8, caller 8
   1101: callee 8, caller 16
   1110: callee 16, caller 8
   1111: callee 16, caller 16 (only applicable to the stacks, b/c the regs have 8 regs spoken for already, and the limit is 32 total)

probably there's some way to regularize the above
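In the meantime, a table-driven decode is simple enough (a sketch built from the 4-bit codes above; `ENTRY_FIELD` is a hypothetical name, and 1001 stays undefined):

```python
# (callee-save spots, caller/argument spots) per 4-bit code,
# taken from the encoding list above; 0b1001 is still undecided
ENTRY_FIELD = {
    0b0000: (0, 0),  0b0001: (0, 2),  0b0010: (2, 0),  0b0011: (2, 2),
    0b0100: (0, 4),  0b0101: (4, 0),  0b0110: (4, 4),  0b0111: (2, 4),
    0b1000: (4, 2),                   0b1010: (8, 4),  0b1011: (4, 8),
    0b1100: (8, 8),  0b1101: (8, 16), 0b1110: (16, 8), 0b1111: (16, 16),
}
```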

If any callee-save registers are requested, ENTRY spills them to the memory stack pointed to by the stack pointer (see section below on memory stack layout).

After ENTRY, the 16 OUTPUT

CALL

RET

---

old

we have a little bit of a problem with branch range.

RISC-V has 12-bit target offsets in their conditional branch instructions (+-4 KiB range, because they are signed offsets in units of 2 bytes) and https://people.eecs.berkeley.edu/~krste/papers/EECS-2016-1.pdf figure 5.6 shows that branch offsets up to 8 bits wide can be used very effectively (i think that 8 bits includes the sign bit, so this would yield a range of +-256). But we only have 7 bits per operand, which means our conditional branches could use at least one more bit.

I suggest consuming an extra set of branch opcodes and representing forward and reverse conditional branches separately. Now we would get a range of +-256.
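The ranges above follow directly from field width and offset unit (a small sketch):

```python
def branch_range(offset_bits, unit_bytes):
    """Byte range reachable by a signed branch offset field of the
    given width, counted in the given units."""
    half = 1 << (offset_bits - 1)
    return (-half * unit_bytes, (half - 1) * unit_bytes)
```

E.g. the RISC-V 12-bit offset in 2-byte units reaches -4096..+4094 bytes, and an 8-bit signed offset in 2-byte units reaches roughly +-256.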

done

---

old

addr modes: +-32 immediate, 32 registers, 16 stack, 8 register indirect by register, 8 register indirect by stack (or should we do 16 register indirect by just register). Also, for ptr operands, we need to do something else for 'immediate' I guess; maybe here we can have offset (by either ptr or int32 scale) or indexed or post/pre inc/dec.

so:

alternatively, mb just have 0 thru 31 smallstack locs so:

this allows us to have up to ~60 program variables of each type, and it's also simpler. The disadvantage is that the large smallstack size probably prohibits keeping the whole smallstack in registers and using register MOV chaining to quickly push/pop (although this could still be done as a special case when a function allocates a small smallstack, which will be the common case).

i guess simpler is better, so let's go with this.

one remaining question is whether it's worth the complexity to avoid having two ways to say immediate zero (and to gain immediate +32). It might be simpler to just say: if the value is >= 64, subtract 64 (or, bitwise-AND with 63) then sign-extend that 6-bit twos-complement representation. That way there's no extra conditional.
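A sketch of that decode (assuming, per the above, a 7-bit operand where values >= 64 denote immediates; the function name is hypothetical):

```python
def decode_imm(v):
    """Decode a 7-bit operand value v (64..127) as a signed immediate:
    drop the 'immediate' tag bit, then interpret the remaining 6 bits
    as two's complement (-32..31)."""
    x = v & 63                      # same effect as subtracting 64 here
    return x - 64 if x >= 32 else x  # 6-bit sign extension
```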

According to a sampling of some of https://www.agner.org/optimize/instruction_tables.pdf , on various x86 archs, MOVSX (move with sign extend) is just as quick as MOV, which is just as quick as ADD.

Perhaps it would be even simpler to treat immediate, rather than register, as the 'default'. So:

i gotta say, that's simpler to specify. Let's do it.

actually i think sign-extending 7 bits could be annoying (and notice how the representation of an unsigned immediate gets messed up). How about specifying everything in terms of unsigned 7 bits:

ok but let's think about how we will process the immediate mode constant, since this will usually be signed.

Sign extending from i6 might not be cheap on some targets (might have to actually do something like: if >= 32 then result = or(result, 0xffffffe0); and branching (if) isn't fast). A sign bit might also require testing for the sign bit and branching. So a bias might be best. So:

well actually i think 'sign extension' is always branchless as long as we have left shift: left shift 2 bits, then AND with 128, then OR with the original value -- actually that's wrong, b/c e.g. x=32; ((x << 2) & 128) | x = 160 (the correct result is 224) [1]. Something tells me that will be even worse, however. I guess could do:

x*(1 - ((x&32) >> 5)) + -(64-x)*((x&32) >> 5)

that works and has no conditionals but it's a bit ugly and it has multiplication instead! In Python a conditional is probably best.
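Comparing the candidate decodes in Python (a sketch; the first two decode the 6-bit two's-complement encoding, the third decodes the bias encoding where stored 32 means 0):

```python
def decode_conditional(x):
    # plain conditional: raw values 32..63 wrap to -32..-1
    return x - 64 if x >= 32 else x

def decode_multiply(x):
    # the branchless multiply version from above
    b = (x & 32) >> 5
    return x * (1 - b) + -(64 - x) * b

def decode_bias(x):
    # bias encoding: stored value = real value + 32
    return x - 32
```

The conditional and multiply versions agree on all 64 raw values; the bias version is a single subtract but changes the encoding itself.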

this encoding would be:

in x86 i verified with https://carlosrafaelgn.com.br/asm86/ that:

    sal edx, 26
    sar edx, 26

seems to work. But yknow what is just as easy? sub edx, 32 (for the bias encoding where 32 is 0 and 0 is -32). And as i noted earlier, on intel the timing of add is the same as shifts.

What about AArch64? On Cortex-A57, https://developer.arm.com/documentation/uan0015/b/ shows it's the same (ADD/SUB and immediate shifts are just as fast as each other).

So, bias takes one less instruction in assembly, is not slower, and is easier in Python. I'm going to go with bias.

no, i changed my mind about this. 0 is an important special case sometimes, and there may be times when it's faster to recognize 0 (e.g. with "bz", "bnz") than to recognize 32. The decode is worse without bias but it's not terrible. So we'll do:

done

---

regarding

"- cmp and branch-on-status register-like instructions (to allow larger branch immediate; mb use R1 (S0) as fixed 'status register' location)"

mb consider this comment: "(Note that this is still better than ISAs which write flags to a GPR and then branch upon the resulting flags)" [1] - mb ask er why/what e meant?

--- old

on an old change from LOVM from a 64-bit encoding with more features, to just a 32-bit encoding which is a 'basic assembly language':

"

---

-- should we require that at any given point in the code, the smallstack sizes (relative to the previously encountered ENTRY) are always fixed (eg you could make 'stack maps')? that might make exception handling/debugging hard..

no.. i think that kind of restriction is for OVM. LOVM offers structured control flow but doesn't rely on it.

---

old

section "%rip-relative addressing" of https://cs61.seas.harvard.edu/site/2018/Asm1/ explains that, yeah, we probably do want a PC-relative LOAD somewhere. Right now in Boot we can do this already; use lpc to load the address, then LOAD from that. But remember that for addr modes

ok i added it to the list of addr-modes-like-instructions to consider

---

old

---

old

yknow i think we should just make the smallstacks conceptually a cache of part of the tip of the stack.

define exactly when they are written to the stack and the stack ptr is updated; BUT the implementation is free to store them below the SP before that; someone (the implementation? the program?) should ensure that the stack has space for that.

this probably means that the program shouldn't read or write the SP in between allocating smallstacks and flushing them. Dunno about that though; what if the program really has so many variables that it wants to store some stuff on the stack? We should accommodate that somehow. If the implementation saves a copy of SP at that time then i guess we're good. How about if we just say that (a) the area below (the stack pointer at the time of smallstack allocation) is reserved for the implementation, (b) until smallstacks are flushed, SP can be read but not written, although it's fine to allocate new smallstack stuff; so upon a function call, if all you need to do is call ENTRY, you're good, but if you need to manually move the stack pointer down to make room for your own manual stuff, then you must flush smallstacks first.

if the implementation wants to change the SP to keep track of stuff in the meantime, it needs to make a copy of the SP at time of allocation so that the program doesn't see the changes, and then it can hide and change the target platform's 'real' SP.

This way all smallstacks are in the main stack, so there's only one. The stack layout of the various caller-save, callee-save, smallstacks is standardized/defined. ENTRY etc aren't (completely) magical. Now since there's only one stack, we could still add a second stack for return addresses, although i guess that should be visible too.

---

---

brandmeyer 1 day ago [–]

rotate-and-xor (and xor-and-rotate) are both common operations in ARX ciphers. They demand 4 macro-ops in RISC-V, but only one in ARMv8.

Bitfield insertion is only one instruction in most RISC ISAs, but 5 or more in RISC-V.

reply

Veedrac 1 day ago [–]

The (unfinalized) bitmanip extension has single-op rotates.

reply

CalChris 1 day ago [–]

I really don't like it. It seems like they copied x86 (bext, bdep) where they should have been plagiarizing armv8 (BFM, ...).

reply

brandmeyer 1 day ago [–]

I've watched that extension's development since it was little more than one smart guy's wish list. I don't think its fair to say that they copied any one architecture. The authors have put in a ton of time researching the tradeoffs and investigating the trade space over the years.

That said, until it gets ratified by the consortium and implemented in silicon its still just a (well-researched) wishlist.

reply

---

pizlonator 1 day ago [–]

The lack of condition codes is a big deal for anyone relying on overflow checked arithmetic, like modern safe languages that do this for all integer math by default, or dynamic languages where it’s necessary for the JIT to speculate that the dynamic “number” type (which in those languages is either like a double or like a bigint semantically) is being used as an integer.

RISC-V means three instructions instead of two in the best case. It requires five or more instead of two in bad cases. That’s extremely annoying since these code sequences will be emitted frequently if that’s how all math in the language works.

---

pizlonator 1 day ago [–]

I mean you will do overflow checks on the following. I’ll use the “s” and “u” prefixes to mean signed and unsigned. Unsigned matters less than signed.

sadd32, sadd64, uadd32, uadd64, ssub32, ssub64, usub32, usub64, smul32, smul64, umul32, umul64

reply

---

lists some more things we should include, toread:

https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-130.pdf

---

" RISC-V has some closely-related sharp corners in indexed address arithmetic as well. Some choices for the type of the index variable perform much worse on rv64.

Consider: an LP64 machine uses 32-bit integers for 'int' and 'unsigned', but 64-bit integers for `long`, `size_t`, `ptrdiff_t` and so on.

If you use an array index variable of type `unsigned`, then the compiler must prove that wraparound doesn't happen. That's pretty weird considering that half the point of using unsigned is to elide such proofs of correctness. If it cannot prove the absence of unsigned wraparound, then it will be forced to emit zero-extension sequences prior to using the index variable to generate the addresses.

ARMv8 side-steps the whole problem by providing indexed memory addressing modes that include the complete suite of zero and sign extension of a narrow-width index in the load or store instruction itself. "

fulafel 23 hours ago [–]

Nitpick: s/LP64 machine/LP64 C implementation/

But isn't size_t (or ptrdiff_t) the preferred indexing type in C for this reason (among others)? Sometimes you of course do want wrap around modulo semantics but that's much rarer, right?

reply

saagarjha 1 day ago [–]

> If you use an array index variable of type `unsigned`

This is usually why your array indexing should be done with an iterator or size_t :)

reply

_chris_ 1 day ago [–]

The problem is that providing extra bits for "sign-extension mode" and "read 32b or 64b" blows through the opcode space very quickly.

reply

---

brandmeyer 1 day ago [–]

(jumping up the thread to try and hop over some confusion...)

The problem isn't with unsigned types generally. Its with subregister unsigned types. So, size_t and uintptr_t are fine. uint32_t, uint16_t, uint8_t (on LP64 ABIs) are pessimized and demand zero-extension instructions (or proofs that they can be safely elided) prior to causing side-effects. uint64_t on a LP128 ABI would also be problematic.

signed 32-bit int is also fine... because RISC-V specifically has a suite of arithmetic instructions that unconditionally sign-extend from bit 31. Even without those, it would still be fine because the carve-out for undefined behavior is wide enough for INT_MAX+1 to remain positive. Same thing for all of the other narrow-width signed integer types. If you increment SHRT_MAX and then use it to index memory, its perfectly legal undefined behavior to access base + SHRT_MAX+1 instead of base + SHRT_MIN.

However, that's not legal for the unsigned types. They are all mandated to wrap in 2's complement. base + UINT_MAX+1 must access base + 0 when the index is `unsigned int`, even on a 64-bit machine.

reply

a1369209993 1 day ago [–]

> base + UINT_MAX+1 must access base + 0 when the index is `unsigned int`, even on a 64-bit machine.

Ironically, given the topic, that's actually not true, because `base + UINT_MAX+1` is `(base + UINT_MAX)+1` (with a pointer, not a unsigned int, as the temporary value). That should probably be `base + (UINT_MAX+1)`.

reply

brandmeyer 8 hours ago [–]

This whole spiel is only relevant when the programmer specifies an array index in a distinct variable.

reply

---

zozbot234 1 day ago [–]

The RISC-V spec includes recommended code sequences to check for overflow, so that the hardware can potentially use insn fusion as an optimization. The "bad" cases you mention can be a bit clunky, but they should also be rare.

reply

pizlonator 1 day ago [–]

I’m aware of those sequences and it’s a myth that they will be rare. For dynamic languages they will be super common.

reply

brandmeyer 1 day ago [–]

We know the origins of that myth by examining the papers that the RISC-V designers wrote. They got a C compiler back-end working and didn't incorporate any other languages in their benchmarking corpus.

reply

---

" I think the only mistake was in finalizing the ISA without any support for checked arithmetic. My belief is that doing it well will not be orthogonal to the rest of the ISA's design, and therefore is a poor candidate for an extension. "

-- brandmeyer [2]

---

the5avage 22 hours ago [–]

At the time C was designed not all machines did represent signed integers in 2 complement. Therefore it was not possible to define behavior for signed overflow. They probably should change that 2020^^

GCC has intrinsics for integer math with overflow checks

reply

ncmncm 13 hours ago [–]

C++ formalized two's complement already. Formalizing power-of-two word size and 8-bit bytes might come.

reply

---

jhallenworld 1 day ago [–]

Maybe overflow checking could included as an ISA extension. If it is included, what is the least impactful design?

Overflow is part of the result, so maybe include extra bits to each register that can be arithmetic destination. These bits are not included in moves, but could be tested with new instructions.

Another way that avoids flags is new arithmetic instructions: add but jump on overflow. Maybe this is reduced to add and skip next instruction except for overflow, but maybe things are simplified if the only allowed next instruction is a jump, so the result is a single longer instruction.

reply

jhallenworld 1 day ago [–]

After thinking about this some more: I think the extension instruction should work like "slt" (set on less than). So we have "sov"- set if add would overflow:

    add t2, t1, t0
    sov t3, t1, t0
    bnez t3, overflow

Why this way? "extra bits on destination registers"- this is really flags. The flags have to be preserved during interrupts, so extending the registers is not so easy (I think it just reduces to classic flags).

"add but jump on overflow" or "add and skip on no overflow"- I don't like this because you can not break it into separate operations without flags. I think you might have to add hidden flags in a real implementation.

An add followed by an sov could be fused, but requires an expensive multi-register write. Fusing maybe could be more likely if the destination is always to a fixed destination register:

    add t2, t1, t0
    sov tflags, t1, t0
    bnez tflags, overflow

reply

wbl 1 day ago [–]

Control bits as in ARM and x86 force serialization of arithmetic due to the RW dependency in every instruction on that bit. There are some tricks but it still needs tracking. For higher order superscalar or out of order processors this gets annoying.

reply

ansible 1 day ago [–]

Yes, the old, old way of having a single condition code register or the like (which dates back 40+ years) doesn't work well these days.

...

wbl 1 day ago [–]

That's one of the tricks. But it doesn't solve the issue of clobbers, which Intel had to introduce new variants of ADD and MUL to solve. Named predicate registers make it all much easier for everyone.

reply

tom_mellior 1 day ago [–]

ARM has separate instruction variants with and without setting of flags. Normally one uses the flag-less versions, so you don't have this problem.

reply

---

so i'm thinking:

---

this guy indep.ly had the same idea as one of those ideas:

spacenick88 1 day ago [–]

I wonder how this interacts with branch prediction. Since overflows should happen very rarely I guess the branch on overflow should almost always predict as non taken. So wouldn't it be possible to have a "branch if add would overflow" instruction or even canonical sequence that a higher end CPU can completely speculate around and just use speculation rollback if it overflows?

...

pizlonator 1 day ago [–]

And yeah, it’s true that the overflow check is well predicted. And yeah, it’s true that what arm and x86 do here isn’t the best thing ever, just better than risc-v.

this guy warns that extra branching should be avoided tho:

brandmeyer 1 day ago [–]

The current world record holder (in the published literature) for branch prediction is TAGE and its derivatives. The G stands for Geometric. It is composed of a family of global predictors that increase in length with a geometric progression. That's somewhat relieving since it means that the storage growth is not unlike that of mipmapping in computer graphics. A small constant k times maximum history length N.

But to a first approximation, if you double the density of conditional branches in the program, then you will need to roughly double the size of the branch prediction tables to get the same performance, even if all of them are correctly predicted 100% of the time.

reply

---

implementation detail:

bertr4nd 1 day ago [–]

I’d be curious to see the instruction sequences for handling overflow without condition codes. I’m not even sure I see how to do it as efficiently as 3 or 5 instructions :-/

reply

pizlonator 1 day ago [–]

One example of 3 is branching on 32-bit add overflow on a 64-bit cpu where you do a 32-bit add, a 64-bit add, and compare/branch on the result.

reply
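That 3-instruction pattern (narrow add, wide add, compare) can be mimicked in Python by making the 32-bit wrap explicit (a sketch; Python ints are unbounded, so the "32-bit add" is simulated with modular arithmetic):

```python
def add32_overflows(a, b):
    """True if the 32-bit signed add of a and b overflows: compare the
    exact sum (the 64-bit add) against the wrapped 32-bit sum."""
    exact = a + b                                # the 64-bit add
    wrapped = ((a + b + 2**31) % 2**32) - 2**31  # the 32-bit add, wrapped
    return exact != wrapped
```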

---

Veedrac 1 day ago [–]

Mostly the concern around the lack of instructions in RISC-V revolves around a few well-known cases (eg. indexed loads) where the instructions to fuse are pretty canonical.

done

---

(about RISC-V)

bonzini 1 day ago [–]

The worst issue, at least for the versions of the ISA that will run a "real" OS, are the lack of conditional move instructions and lack of bitwise rotation instructions. Lack of shift-and-sum instructions or equivalently addresses with shifted indexes is usually mitigated by optimization of induction variables in the compiler. They are nice to have (I have written code where I took advantage of x86's ability to compute a+b*9 with a single instruction) but not particularly common with the massive inlining that is common in C++ or Rust.

The ugly parts are indeed all ugly, though they have now added hint instructions.

---

Maybe give up on the idea of abstracting activation frames in lovm. That could be done in OVM.

later: well, i think we can introduce the abstractions without banning direct access to the stack (because Boot can access the stack directly, and Boot is a subset of LOVM). OVM can then take away the direct access but keep the abstractions.

---

in Zig, you can give #define settings like function arguments when importing a file:

"
    const c = @cImport({
        @cDefine("_NO_CRT_STDIO_INLINE", "1");
        @cInclude("stdio.h");
    });

    pub fn main() void {
        _ = c.printf("hello\n");
    }
"

---

QBE lets users define composite types! cool. Maybe something we should do in Lo?

---

the MIR project looks pretty great.

https://github.com/vnmakarov/mir https://github.com/vnmakarov/mir/blob/master/MIR.md

Some notes:

Although it says it is lightweight, it is probably still more heavyweight than we want. https://github.com/vnmakarov/mir/blob/master/HOW-TO-PORT-MIR.md says that it will probably take 1 month of work for an experienced person to port MIR to a new backend.

It's MIT licensed.

The author also reviews and contrasts QBE, LibJIT, and others at https://github.com/vnmakarov/mir#mir-project-competitors . The MIR compiler is about 16K LOC ( https://github.com/vnmakarov/mir#current-mir-performance-data ). QBE is about 10K LOC. The others have much more LOC.

Conclusions:

--- design motivation doc

LOVM should be:

---

mb 10 bits 'registers' and banking (above 32) is enough for SSA if the original variables are 8 bits (non-banked):

---

design motivation doc

opaque activation records, so that an implementation can choose to:

- implement the BootX 'native' activation record layout (with our link register rather than return addr on the stack at the memory-top of the stack frame, our # of arguments passed on the smallstacks, our saved caller's frame pointer just below the return address, etc), or
- implement some sort of native calling convention and stack frame layout, or
- have our function calls on top of some HLL abstract 'call stack'

--- ops

double-wide CAS?

--- encoding

on how many operands to have: if new LOVM operands are 8 bits then we can fit two extra ones

Could have 64-bit LOVM format that has 1024 or 4096 locals, three or four fields of 12+4 bits or 10+4 bits and 8 bits. Defer a bit in 16 bit format to this (later: i don't understand what i meant by this sentence; probably 'devote' instead of 'defer' (a lot of this was dictated/transcribed by the poor-quality AI in my phone etc)). This can encapsulate the standard library with regards to context switching etc

on encoding types: maybe LOVM polymorphism only specifies types on the two source operands. Now with eight bits we can specify a choice between: source operands distinct, source operands the same, dynamic type, aggregate type, each with six bits except separate type literals (source operands distinct) only have three bits each

---

"...several existing well-understood design families for minimal syntax: Lisp-like, Forth-like, APL-like" [7]

---

What's the purpose of OVMlow again? I had said a more convenient language to implement garbage collection and other language services in, but really that is a property of the high-level language (Oberon-like?) that compiles to that implementation.

But another purpose might be to have something that the "trusted" language implementation can use to write stuff that bypasses the garbage collection and preemptive concurrency stuff which is implicitly enforced in OVMhigh -- sort of an "inline assembly" for OVMhigh (LLVM-like). And yet another purpose could be as a compilation target. For these latter two purposes perhaps we should enforce SSA and also have a CFG? So perhaps OVMlow is LLVM-like. Or mb QBE- or MIR-like. Or mb eliminate OVMlow altogether -- if you want to do fancy stuff that bypasses the conventions of OVMhigh while implementing oot core, perhaps you have to directly implement OVMhigh on your platform.

Another purpose for OVMlow could be as a transpilation IL used when the ultimate target platform doesn't directly support garbage collection, eg Rust or C. Also, OVMlow could provide instructions for anything that in something like C would be done with inline assembly, to remove the temptation to ever use inline oot assembly in the OVMhigh or oot core implementation.

also recall that OVM was supposed to be able to hold higher-level things like objects in its registers. And that it was supposed to be able to do simple operations using the same encoding as oot assembly. And that OVM instructions were supposed to be implemented in LOVM (Forth-like metaprogramming could be useful here).

---

---

forth-style metaprogramming is appropriate for any language that is linear, for example an assembly language or virtual machine. So mb oot assembly, or mb ovm

---

PE executable format variant in .NET Micro Framework PE file Format

" Major differences from ECMA-335

    The number and size of the metadata tables is limited in NETMF to keep the overall memory footprint as low as possible.
    Since NETMF is designed to operate without an OS the Windows PE32/COFF header, tables and information is stripped out
    Switch instruction branch table index is limited to 8 bits
    Table indexes are limited to 12 bits
        This also means that the metadata tokens are 16 bits and not 32 so the actual IL instruction stream is different for NETMF
    Resources are handled in a very different manner with their own special table in the assembly header" -- [8]

---

does it make sense to have a C-like/Forth-like/Lisp-like/Oberon-like low-level HLL that compiles to the Lovm virtual machine?

What would be the purpose of the low-level HLL? It would be a language in which the following could be implemented, at least for the reference implementation:

One question would be, why not just write this stuff in (higher-level language) Oot? Answers might include:

An obvious argument against having a low-level HLL here is that we already have languages like C, Rust, Zig, Oberon, Forth, Hare that other people worked on that are probably better than what we would come up with.

arguments rebutting this depend on some of the decisions above:

some possibilities to keep in mind that might justify a new low-level HLL:

---


Footnotes:

1.

((x << 2) & 128) | x = 160. Recall that we want to do the sign extension operation, that is, copy bit 6 into bits 7 and 8. So y = x & 32; result = x | (y << 1) | (y << 2). I assume that would be easier for electronics than adding a bias, even though it's a little harder in assembly. To get rid of the y introduction, could do x | ((x & 32) << 1) | ((x & 32) << 2). But wait, this doesn't work in Python, it yields 224.

another method is shift left followed by arithmetic shift right. I think that would work on CPUs but not on Python because it has that arbitrary bit-width thing going on.

But in Python we can still do: int.from_bytes(int(x << 2).to_bytes(length=1, byteorder=sys.byteorder, signed=False), byteorder=sys.byteorder, signed=True) >> 2