Bayle Shanks's website: proj-oot-ootLovmNotes1

can we/should we allow CALL3 to have INOUT operands? 2 of them or all 3? maybe CALL3 via just pushing the 2 'input' operands onto the stack? in which case, upon return, it just POPs the stack into the 3rd operand? which would allow a function that wants 2 INOUT operands to leave results on the stack? but that would mean that CALL3 has differing stack behavior depending on what is called, which i don't like.

so i don't think i'll do that. But it's just something to think about.

i guess you could have two different CALL3 addr mode flags to have this behavior, or the normal behavior.

---

in any case, one possible implementation of CALL3 is for it to have 'arguments passed on stack' semantics, so the two input arguments are pushed onto the stack beforehand, and then afterwards the one output argument is popped from the stack and moved into the destination register.

---

macros are expanded inline; one macro which is invoked multiple times will generate multiple instances in the generated code. A label within a macro will generate a unique label for each instance of the macro.

can nested macro invocations refer to macros from outer macro invocations? If so, the compiler will have to pass in the unique label names when calling a nested macro. Perhaps you could let the user handle this by allowing labels to be assigned to variables.

better idea: if the same label is defined more than once, the most recent definition (in terms of static position in the instruction stream) overwrites the others

---

sounds related to my idea of using the extra operand to pass type info:

Tag-free garbage collection using explict tag parameters http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.53.1106&rep=rep1&type=pdf

---

so, if the number used for passing types has 8 bits, how should an 8-bit immediate 'type' in this field be defined?

8-bits surely won't be enough for all of the types we may ever need, including custom types, struct types, function types, so we can just have a subset, and we'll have to just pass a pointer to a 'type' for others even when the type being passed is statically known (and so could otherwise be an 'immediate').

maybe the largest subset (in terms of bits needed) is function types, so let's consider those first. We'd like to at least handle functions with 2 inputs and one output. The largest number that can be multiplied by 3 and still be 8 or less is 2. So, we have 6-bit immediate function types with 2 bits each for each argument and for the return value. So, for each argument and 4 the return value we can encode 4 types. I think the obvious choices are:

nil
32-bit integer
pointer
dynamic

'nil' can be used when we have functions that have less than 2 inputs or less than one output. ('nil' can be another name for what some languages call 'unit' and represent using the literal '()', and for what some languages call 'void')

These function types occupy 64 out of the 256 possible immediate types.

Now, isn't everything really a function type? Because we'll be calling functions and passing these types for polymorphism, so it would make sense for these types to be the function types.

This suggests that we choose the other (256-64 = 192) types to just be the most commonly needed function types, similarly to the 64 8-bit instructions.

However, when we have a pointer to a constant struct representing a 'type', this must be a recursive tree, and what encoding is the leaf of that tree? May as well make this part of the encoding. Which means that we need to reserve a lot of this space for the base types.

Let's reserve a block of 64 for base types.

Now we have used 64 for the base types, and 64 for the function types, and we have 128 left. What about 3-argument functions with only 2 distinct types? Assuming we can swap the 2 arguments, there are two forms of this:

a <- f(b, b) a <- f(a, b)

Each of a and b can take 3 bits, which means that each of the two forms takes 6 bits, which means that the total takes 7 bits. Which is our remaining 128. So these functions can only access (without pointing to a larger, constant struct) the most common 8 out of the 64 base types.

I might suggest:

bool, int32, int64, uint32, uint64, f32, f64, ptr

But doesn't leave any room for: dyn, custom/obj, string.

Also, since we are now talking about the format for the entire 'type struct', not just for the immediate operand, we need a way to specify 'this is a composite type'. We could repurpose 'dynamic' for that. Alternately, we could leave all that for the 64 'base types' (after all, we also need to specify WHICH composite type we have (union or product type? or, homogenous array? hetero array? linked list? struct? tuple? discriminated union (and then: option? simple enum?)? nondiscriminated union? keyword/interned string?)). But this suggests that we really want at least 'dyn' (or whatever we replace it with) in the functions-with-two-same-argument 8 base types. So, in the 8 base types, we should probably give up one or more of:

bool
having both 32-bit and 64-bit ints
having both signed and unsigned ints

I suggest giving up 64-bit ints (in the 8 common base types; they can still be in the 64 base types).

Also, we would like these 3-bit common types to be a superset of the 2-bit common types above. So we need nil in there.

So, the 2-bit function operand types are:

composite
nil
32-bit integer
pointer

And the 3-bit common types are:

composite
nil
int32
ptr
bool
uint32
f32
f64

We could also consider giving up one of f32 or f64, probably f32, and putting something else in that slot, like dyn or obj (obj could simply be RESERVED or CUSTOM, so as to allow HLLs to efficiently reuse this same system and put their own thing in there, which could be an object type). Or giving up bool. Or giving up both bool and f64.

And the 64 base types must include all composite type constructors as well as actual base types.

for inspiration, maybe see also:

https://en.wikipedia.org/wiki/Comparison_of_programming_languages_(basic_instructions) https://en.wikipedia.org/wiki/Primitive_data_type https://en.wikipedia.org/wiki/Template:Programming_language_comparisons

on the question of whether f32 or f64 is more common, it's contentious: https://softwareengineering.stackexchange.com/questions/305760/why-is-the-most-common-integer-number-32-bits-but-the-most-common-floating-poin

now that i've heard of NaN? tagging, that's more ammo for me to choose f64, given that having a system which is good for implementing an HLL is part of the goal here.

yeah, i like the idea of giving up f32 and adding in either DYNAMIC (which means a type variable to be provided later), ANY (which means a boxed/self-describing type) or CUSTOM (which means the semantics of this are reserved for a HLL building on top of LOVM). Remember, the ones we don't choose here can still be in the base 64 types, we are just choosing which 8 types are available in the shortcut for functions with two operands with the same type. For now let's just say CUSTOM or, maybe better since we haven't quite decided, RESERVED.

i'm tempted to get rid of bool and have two of these choices, but i guess functions with a bool output are pretty common so we want to represent those succinctly. One goal could be for these 8 types to be able to represent every function in Boot, as well as obvious alternatives; dunno if we have any actual bools in Oot, but we do have compare-and-jumps, for which an obvious alternative would be compare (although compare could also be trinary compare, which doesn't output bool; but a compare that was exactly analogous to compare-and-jump would be binary). And we also clearly have both int32 and uint32 in Boot, because i found that i had to describe that in footnotes to some of the instructions.

so the 3-bit common types are:

COMPOSITE
nil
int32
ptr
bool
uint32
f64
RESERVED

Out of the 256 one-byte types, we have:

64 2-argument, 1-result functions where the type of each of the 3 operands is 2 bits (the first 4 of that list)
64 2-argument, 1-result functions where the two operands are the same type and the type of each of the 3 operands is 3 bits (that list)
64 2-argument, 1-result functions where one operand is the same type as the result, and the type of each of the 3 operands is 3 bits (that list)
64 other types, the first 8 of which are that list

so we have 56 other types. Included in that list might be the above, plus: - f32 - int64 - uint64 - int8 - uint8 - int16 - uint16 - VARIABLE/DYNAMIC (type variable) - ANY (boxed) - unicode string (how terminated?) - bytestring (null-terminated) - bytestring (32-bit length prefixed) - bytestring (8-bit length prefixed) - bytestring (16-bit length prefixed) - fixed point - CODEPTR (as distinguished from an ordinary data pointer) - type - some of these? or is that too crazy? i think it's too crazy: https://en.wikipedia.org/wiki/Comparison_of_data-serialization_formats - function

And some composite type constructors: - array (homogenous) - struct (which, here, i guess is the same as tuple/heterogenous array/tree and maybe S-expr if you forget about the linked-list-ness, since any subelement can be COMPOSITE) - linked list (which, here, i guess is the same as S-expr, since any subelement can be COMPOSITE) - discriminated union (which, here, i guess is the same as ADT, since any subelement can be COMPOSITE) - nondiscriminated union - option (yeah this can already be represented but it's probably worth pulling out into its own type) - comptime - annotation

This is only 6 items so far, out of 56 spots, so i think we'll have room for whatever we need (be sure to leave lots of RESERVEDs and CUSTOMs!).

We could have 16 RESERVEDs and 16 CUSTOMs, which means we have 32 slots to fill now, of which the above might be the first 26, which means we'd have 6 left. That's more like it.

i'm sure other language implementations must have stuff like this (internal semi-compact semi-standardized representations of composite data types), what are they? i don't think i've come across it.

i'm worried that this is getting much higher-level than i'd like for LOVM, however. Maybe a lot of this should be bumped up to the OVM level, and just left as CUSTOM here.

for LOVM, maybe all we need are 16 non-composite types, saving spots for COMPOSITE and function types for OVM as above: :

COMPOSITE (RESERVED)
nil
int32
ptr
bool
uint32
f64
RESERVED
f32
int64
uint64
int8
uint8
int16
uint16
???

maybe for ??? we let one composite type slip in there, probably bytestring of some sort. Null-terminated or length-prefixed? If length-prefixed, what is the bitwidth of the prefix (i'm guessing 32 bits, since we are 32-bit oriented here in LOVM)? We could have both null-terminated and length-prefixed if we drop int16 or uint16, and if we drop both int16 and uint16 then we can have one other thing too.

So for now let's just say:

RESERVED (for COMPOSITE)
nil
int32
ptr
bool
uint32
f64
RESERVED
f32
int64
uint64
int8
uint8
RESERVED (uint16?)
RESERVED (int16? null-terminated bytestring?)
RESERVED (32-bit-length-prefixed bytestring?)

yes, i think that's much better. All the composite type stuff is too complicated for the LOVM level.

But now if we can't pass function types, how do we do polymorphism?

One alternative is just to not do polymorphism (in which case we don't have to define types here at all). But another alternative is to do only OOP-style "receiver"-based polymorphism, in which dispatch takes only one primitive type argument. This is enough for what we really want polymorphism for at this level (generic arithmetic ops e.g. addition over all of int32, uint32, f64); and frankly, it's all that some target platforms will support anyhow (and more than many will support, which means that on those platforms we'll have to compile e.g. polymorphic ADD to a dispatcher)

---

how does Wren represent strings? interestingly, it has both a 32-bit length prefix AND a null terminator. Having the null terminator means that it can just directly call C library functions for basic string ops. It also has a hash value:

" A heap-allocated string object. struct sObjString { Obj obj;

  // Number of bytes in the string, not including the null terminator.
  uint32_t length;

  // The hash value of the string's contents.
  uint32_t hash;

  // Inline array of the string's bytes followed by a null terminator.
  char value[FLEXIBLE_ARRAY];}; " -- https://github.com/wren-lang/wren/blob/main/src/vm/wren_value.h

this suggests to me that if we have a bytestring type, that we specify BOTH that:

it must be length less than 2^32
it must not contain any zeros

this lets us use whichever native primitive bytestrings the platform provides (if any).

bytestrings would be just defined as self-describing arrays of bytes (that is, an array of bytes data structure that contains the length of the array in the structure, so that you don't have to pass the length around separately). When you want arrays with zeros in them, the programmer can just use pointers and keep track of the length themself.

---

yknow, regarding the issue of having the 'smallstack' be a register accessible from Boot, necessitating the concept of a Boot 'user code' profile, you could just put the smallstack registers above the 8 registers that Boot can see. My issue with that is that we really would like to do stack manipulation in the compact 16-bit instructions. But:

could just use all (or some) of the 64 8-bit instructions for (the most common) instructions with smallstack operands
or, could forbid Boot from seeing/using the stack registers (so that all Boot code is in the 'user profile'). i sort of don't like that, because i want to simply macrocompile LOVM to Boot, and also store the stack stuff in Boot registers.
i'm not sure if this objection really makes sense though. If we do have a 'user code profile' in Boot that leaves enough registers open for a LOVM implementation, we also have to save registers for the PC and for a pointer to the memory where the rest of the registers are stored. The pointer zero register can handle the PC, but do we want to reserve 3 other regs? (smallstack TOS, smallstack ptr, pointer to other regs)? Otoh i guess it's not quite that bad, as the smallstack and the other regs would be next to each other in memory so you only need one ptr there, and you can have one integer for current smallstack size (so the bottom of the stack is an offset from the one ptr into memory, and then you have the stack size integer to find the TOS in memory; and you also keep the actual TOS in a register; so maybe you are really only reserving one thing, the register for the actual TOS).

so, i'm still leaning towards just having a Boot 'user code' profile, but it's something to think about. The second best alternative seems to be: - use the zero registers to save the current smallstack depth, and a pointer to the memory where you put the smallstack and the other registers. - use a higher register (>7) that Boot can't see to indicate 'if you see this in an operand, pop/push to smallstack instead of storing in a register) - now the 16-bit encoding can't access SMALLSTACK at all, but Boot can use all of the 8 registers it can see (with the zero registers functioning as immutable zeros) - we could even make the register containing the TOS a normal Boot register; this won't violate Boot semantics because Boot can't do anything to cause the stack to be pushed or popped, so within Boot program code this just acts an ordinary register - use some or all of those 64 8-bit instructions as compact ways to encode instructions that use the stack - unfortunately the most obvious encoding (just offer every opcode in a stack-only form) probably isn't good enough b/c we also want to compactly have a mixture of stack and register arguments, and we have 64 potential opcodes and 64 8-bit instruction slots - but 64 is probably still enough for the most commonly used cases, so it's probably fine

hmm after writing that down i like it more and more..

---

probably want opaque calling convention probably want 'entry' or 'prolog' (function prolog) instruction that takes as an argument how many registers are needed (i guess separately for int and ptr regs), and 'epilog' instruction to define the end of each function scope in the linear instruction stream. mb use something like GNU lightning's 'argument management' to access arguments: https://www.gnu.org/software/lightning/manual/lightning.html#The-instruction-set. Or maybe just define a convention there.

between these, and CALL, and RETURN, the calling convention can be implemented opaquely.

remember the old interop instructions: xentry xcall0 xcalli xcallp xcallii xcallmm xcallim xcallip xcallpm xcallpp xcall xcallv xlibcall0 xlibcalli xlibcallm xlibcallp xlibcall xlibcallv xret0 xreti xretp xpostcall

---

i GUESS we could sometimes use the 4th operand to just pass another argument, not always to pass type information

---

maybe CALL takes 4 arguments: the destination register, the address to call, the number of int registers in use (ie to be saved if the calling convention makes them caller-save), the number of ptr registers in use

otoh dont we want to pass type information in CALL, at least sometimes (mb a separate polymorphic call instr)?

hmm could split the # of registers operand into 2 4-bit operands. But that would mean that only 16 regs can be saved in between calls.. i guess the rest could be callee-saved tho..

ENTRY would have a similar problem tho (need to specify how many int regs needed, how many ptrs needed, where to put the type info, where to put the link register). I guess 'where to put the type info' and 'where to put the link register' could be a convention; but specifying them would be nice because mb where you want to put them is 'on the stack'.

do we even want a pseudostack register for the MEMORY stack (MEMSTACK as opposed to the SMALLSTACK)?

also does ENTRY need to specify how many arguments there are? how many int args and also how many ptr args?

---

one thing to consider is not having all of the 255 'registers' mean the same thing, not just as a convention, but as fundamental semantics. Could have some or all of:

argument-passing registers
return registers
SMALLSTACK special register, MEMSTACK special? register, PC special register, LINK register?
SMALLSTACK-aliased registers
MEMSTACK-aliased registers
caller-save registers
callee-save registers
maybe even heap locals and heap globals
RESERVED registers

if there are only up to 16 caller-saved and 16 callee-saved regs, then you can specify how many you need of either one or the other of those classes, both int and ptr, in one 8-bit immediate operand.

---

yknow, first, we could just have a separate REGSAVE instruction rather than making the # of things to save argument(s) to CALL.

Second, this was not wrong above, but remember that the caller (CALL) only needs to specify how many of the caller-saved regs they need saved, and the callee (ENTRY/PROLOG) only needs to specify how many of the callee-saved regs they need saved.

---

if the CALL instructon doesn't take a destination register argument, then it could just be:

the address to call
the number of int caller-saveds to save
the number of ptr caller-saveds to save
the type information being passed along

Where the arguments are, where the link register is, and where the return value(s) go could all be specified by the calling convention.

Although, to be fair, the type information could be specified by the calling convention just as easily.

---

arg regs and return regs are caller-saved

---

does the caller have to alloc shadow space on mstack? MS x86-64 convention has 32 bytes of caller-allocated shadow space. sysv x64 does not have this tho.

i think not. No shadow space, no 'red zone'.

---

if the convention for passing types to polymorphic functions is:

pass the types as the first argument

then you can make the 4th operand just be the 1st argument, and get an additional (code density) benefit from it when calling non-polymorphic functions.

---

one issue with a CALL3 mechanism is: how do you know if the CALL3 operands are to be interpreted as integer registers or pointer registers? Seems like you would need to use the addressing modes (either of the "opcode" ie function pointer, or of the operands) to disambiguate this. There's a lot of possibilities:

zero arguments
one argument, int
one argument, ptr
int int
int ptr
ptr ptr

(note: you could mb just leave out the 0- and 1- argument forms (just pass immediate zeros for the arguments you don't need))

(note: you don't need immediate mode for pointers, do you? but maybe you do, just to pass null pointers? could also interpret that as PC-relative)

i guess we could redefine the operand addr modes for CALL3. Use 1 bit for int-or-ptr, and 2 bits for any 4 out of: immediate, register, smallstack offset, stack choice, register indirect. So this isn't too hard, it's just an additional complexity.

also, note that unlike normal calls, CALL3 doesn't do any caller-saving, so the convention is different (everything is callee-save)

---

yknow, most callers probably won't need many or any of the caller-saved registers to be saved, as they will use the callee-save registers for anything that needs to be saved across calls, especially since there are so many callee-save registers available in this architecture. So:

it's not so bad to not include ANY caller-saving in a CALL opcode, and have that be a separate opcode if needed
if any caller-saving is included, it's not so bad to limit it to 16 regs in each bank

ENTRY/prologue is more likely to need to save stuff (the callee-save registers), and so should probably include two full saving operands.

Similarly, RET will often have to restore the callee-save registers, and so should probably include two full restore operands in addition to the value being returned (or maybe the two values being returned, since we have an extra operand).

---

but doesn't RET usually need to pop off the stack too? which stack? both? so doesn't it need to be told how many to pop off?

no; i think in many conventions, the only thing on the stack are the args (and maybe the return addr), and the callER cleans up the stack. This does, however, suggest that CALL does actually need to know how many arguments there are.

---

So, so far we're looking at:

CALL &fn #argcount(4-bit each for int and ptr; so 16 args max per bank) arg0/type arg1 -- or: one entry for int argcount and one for ptr, and no arg1 -- we dont need it to do anything except store the current address in the link register, but it does whatever the platform calling convention needs it to do (e.g. push args onto the stack)

ENTRY (maybe: #argcount) callee-saves-used-int callee-saves-used-ptr ?? -- pushes the needed numbers of callee-saves registers onto the stack. Note: the first callee saves ptr register is the link register, so if you are going to issue a CALL within this function but dont need any callee-saves registers for computation, then callee-saves-used-ptr should be 1, not 0.

RET (maybe: return_addr) return_value (maybe: #argcount) callee-saves-used-int callee-saves-used-ptr -- restores the callee-saves registers from the stack, puts return_value into the return register, then jumps to the link register (or, could leave the 'link register' on the stack but adjust the stack pointer, and then jump indirectly to the 'link regiter' value past the end of the stack)

---

so what if the platform doesn't provide as many callee-saved regs as we do? eg if we provide 64 callee-saved regs, and the platform provides 3, then we don't really want to save 61 regs upon CALL, because most of those probably aren't needed. So, we kind of do want CALL's operands to tell us how many int and how many ptr callee-saved registers are currently in use/actually need to be saved.

this suggests either: CALL &fn #argcount(4-bit each for int and ptr; so 16 args max per bank) #callee-saves(4-bit each for int and ptr; so 16 max per bank) #caller-saves(4-bit each for int and ptr; so 16 args max per bank) or CALL &fn #argcount(4-bit each for int and ptr; so 16 args max per bank) #callee-savesInt #callee-savesPtr

in option 1, arguments are caller-saveds but they don't count in the #caller-saves count

i'm leaning towards the first variant.

so, we seem to have 48 registers per bank:

16 arguments
16 callee-saveds
16 caller-saveds

we also presumably have some 'specials', such as the stack pointer. If we had 16 of these 'other things', then we'd have 64 registers per bank.

---

actually even in x86-64 there are a lot of situations where only 2 GB of code is allowed; see 13.3 64-bit memory models in Windows and 13.4 64-bit memory models in Linux and BSD and 13.5 64-bit memory models in Intel-based Mac (Darwin) in https://www.agner.org/optimize/calling_conventions.pdf

---

could also use Lua-style CALLs:

" CALLA B CR(A), ... ,R(A+C-2) := R(A)(R(A+1), ... ,R(A+B-1))Performs a function call, with register R(A) holding the reference to thefunction object to be called. Parameters to the function are placed in theregisters following R(A). If B is 1, the function has no parameters. If B is 2or more, there are (B-1) parameters.If B is 0, the function parameters range from R(A+1) to the top of the stack.This form is used when the last expression in the parameter list is afunction call, so the number of actual parameters is indeterminate.Results returned by the function call is placed in a range of registersstarting from R(A). If C is 1, no return results are saved. If C is 2 or more,(C-1) return values are saved. If C is 0, then multiple return results aresaved, depending on the called function. " -- http://luaforge.net/docman/83/98/ANoFrillsIntroToLua51VMInstructions.pdf

actually we could provide both (both Lua-style calls, and calls that use a fixed set of registers for argument and return passing)...

---

yknow, instead of providing how many caller-saved registers to actually save upon each CALL, could store this in special REGISTERS instead (since it will often be set upon ENTRY and stay constant across a function). Then you have less CALL operands and you can exceed the 16 register limit and use more of the potential 255 registers

so we're looking at:

CALL fnptr #argcount(4-bit each for int and ptr; so 16 args max per bank) arg0/type ? -- reads the #s of caller-saves from special registers, and saves them, then calls, then restores them

mb also CALLLUA (same arguments as CALL) -- same as CALL but instead of the arguments being the first 16 registers, they are taken from the registers immediately after the fnptr register

ENTRY caller-save-needed-int caller-save-needed-ptr callee-save-needed-int caller-save-needed-ptr -- puts all 4 args into special registers, and saves the callee-save-neededs, (do we also need #args tho? mb but probably not..)

RET return-value -- reads the #s of callee-saves from special registers, and restores them

---

should we have IP-relative addressing so that position-independent code is ez? Or mb use some other register and do everything relative to that?

makes me go back to thinking about those X,Y registers from apple

http://www.obelisk.me.uk/6502/addressing.html http://www.emulator101.com/6502-addressing-modes.html http://skilldrick.github.io/easy6502/#addressing

could use our wealth of registers as a sort of 'zero page'. Do we want to allow indexing into the set of registers though? Probably not.

---

oo remember how pretty this used to be:

" initial commit Bayle Shanks authored 4 years ago

OVM is a virtual machine platform for programming language implementation. It is distinguished by a small, easy-to-port implementation, yet with the ability to preserve some high-level constructs when used as a target IL. Ovm is a prototype and is incomplete and buggy. There have not been any releases yet. Ovm is open-source (Apache 2.0 license).

Motivation Ovm is being (very slowly) developed for the Oot programming language (which is also under development), to serve as:

the language in which Oot will be implemented in a target IL (intermediate language) for Oot programs

Ovm is motivated by Oot's aspirations to be an easy-to-port 'glue language' that can run under, interoperate with, and transpile to/from a variety of languages and platforms. The reason for Ovm's existence is that many other virtual machines:

are too big: many VMs have large implementations and long specs, making them hard to port constrain control flow: many VMs and ILs are written to support a particular HLL (high-level language) and only support control flow patterns of that HLL (for example, some VMs/ILs have trouble with tail calls) are too low-level: on the other hand, there are some tiny VMs that consist of only a handful of assembly instructions and few or no registers. When HLL constructs are compiled into assembly, program structure can be difficult to recover; such a language would not be suitable as an IL for transpilation between HLLs

Ovm deals with these issues:

Ovm has a small implementation. Much of its functionality is in its standard library, which is itself written in Ovm (so you don't have to port it) Ovm is assembly-like and supports arbitrary control flow patterns via computed GOTO and exposure of the callstack Ovm is slightly higher-level than assembly and provides enough registers for a simple mapping of HLL variables to registers in most cases, but more importantly, it is extensible, which allows the standard library to define some higher-level constructs

The implementation of even relatively simple and common operations as standard library functions makes them inefficient. The motivation of this is just to allow us to have a small core, for portability, and push most of the functionality into the standard library. An Ovm implementation which values efficiency over portability could simply override some of these standard library calls with native implementations. Ovm is structured to make this easy to do.

What's here This repo contains:

ovm: command-line frontend to pymovm pymovm: a minimal Ovm implementation (in Python) oasm: Ovm assembler odis: Ovm disassembler pyooblib: a Python library for working with .oob files (and .ob0 files and raw ovm bytecode files) cinstr.ob0 and syscall.ob0, part of the 'standard library' of Ovm docs: There are no docs yet. For now, see http://www.bayleshanks.com/proj-oot-ootAssemblyThoughts (and possibly the later items in the http://www.bayleshanks.com/proj-oot-ootAssemblyNotes* series eg http://www.bayleshanks.com/proj-oot-ootAssemblyNotes13 ).

Ovm understands three file types (raw and .ob0 formats are understood by the core OVM implementation, .oob format is supported by standard library functions):

raw: raw Ovm bytecode .ob0: a simple module format containing a list of functions, their type signatures, their bytecode implementations .oob: "Oot bytecode", a more elaborate module format also containing imports, constant tables, signatures, optional debugging information, etc (not implemented yet).

Overview of the Ovm language Ovm is a assembly-like 3-operand CISC memory-to-memory register machine with 2 primitive types, 4 primitive address modes, 16 primitive instructions, 256 registers, a data stack, and a call stack. Primitive types:

U16: unsigned 16-bit integer PTR: pointer

Primitive address modes:

immediate register register indirect (the equivalent of LOAD/STORE) stack (post-increment/pre-decrement upon read/write; reading/writing to a stack pointer pops/pushes the stack)

Primitive instructions:

annotate (not a real instruction, but an annotation of this point in the code) loadi (load immediate) cpy (copy) add-u16-u16 (addition on integers) sub-u16-u16 (subtraction on integers) leq-u16-u16 (<= comparision on integers) add-ptr-u16 (add an integer to a pointer) sub-ptr-u16 (subtract an integer from a pointer) call3 (call the function represented by a function pointer, passing in 2 inputs and getting 1 output) lea (load effective address: given an addressing mode and operand data, compute an effective address) syscall (call various language or system services) jzrel (conditional relative branch; branch if input is zero) jnzrel (conditional relative branch; branch if input is non-zero) j (absolute jump (or static, or immediate jump); jump to a fixed label) ji (indirect jump (or dynamic jump); jump to a pointer in memory) (maybe: swap?)

Ovm will provide compile-time type checking (not yet implemented). Ovm provides the following features which are atypical for assembly-like languages:

Generic and ad-hoc polymorphism Single-instruction function calls to first-class-functions: a memory location holding a reference to function is called, passing in 2 inputs, and assigning one output Capability-based security (not yet implemented) Separate data and call stacks; also, stack, and the register file, are first-class and may be swapped in a single instruction to allow concise 'caller saving' of registers, greenthread context switching, etc. Extensibility features:

custom instructions, written in Ovm custom syscalls, written in Ovm user-defined data structures with getters and setters written in Ovm (not yet implemented) language hooks which call Ovm code (not yet implemented)

Oot has three bytecode encodings, SHORT (16-bit fixed length instructions; defined but not implemented yet), MEDIUM (64-bit fixed length instructions), and LONG (under development; variable-length instructions composed of a 64-bit aligned stream of 16-bit words). " -- https://gitlab.com/oot/ovm

---

how many regs to allow the use of? we want to save some for the implementation, right?

since i kind of like 64 anyway, there's a nice justification for that; 64*4=256, so if we only allow 64 regs in each bank and if registers hold 32-bit words then the implementation can convert register number to byte number in a heap-stored register bank

---

i really like the spirit of QBE vs LLVM:

https://c9x.me/compile/doc/llvm.html

" QBE is a much smaller scale project with different goals than LLVM.

    QBE is for amateur language designers.

    It does not address all the problems faced when conceiving an industry-grade language. If you are toying with some language ideas, using LLVM will be like hauling your backpack with a truck, but using QBE will feel more like riding a bicycle.

    QBE is about the first 70%, not the last 30%.

    It attempts to pinpoint, in the extremely vast compilation literature, the optimizations that get you 70% of the performance in 10% of the code of full blown compilers.

    For example, copy propagation on SSA form is implemented in 160 lines of code in QBE!

    QBE is extremely hackable.

    First, it is, and will remain, a small project (less than 8 kloc). Second, it is programmed in non-fancy C99 without any dependencies. Third, it is able to dump the IL and debug information in a uniform format after each pass.

    On my Core 2 Duo machine, QBE compiles in half a second (without optimizations)."

it's probably not for us right now for similar reasons as LLVM, as: (i) it's SSA (and right now we're looking for simplicity, not optimization) (ii) it's only for a few platforms (and right now we're looking for extreme and extremely easy portability)

but still interesting

---

two things to consider about CALLs and callee-save and caller-save variables:

1) the idea i had above of having the number of caller-save variables in use, and the number of callee-save variables in use, both be in special registers, works well in terms of this virtual machine, but if implemented straightforwardly has an implementation cost: - either these numbers are actually stored in platform registers, consuming a bunch of registers, or they are stored on the stack or heap, causing an extra unnecessary memory access upon each CALL/RETURN - this is because putting these things in registers makes them DYNAMIC variables, instead of statically known immediates - they are saved/restored automatically with the other registers, so it's not a 'problem' in that senses, but it is a performance problem - what these really should be are 'lexically scoped'; in this case 'linear lexical scope' would be enough, e.g. "any CALL/RET instruction after a PROLOG instruction and before the next EPILOG instruction take their number-of caller-save/callee-saves from the values given in the PROLOG instruction" - but, lexical scoping, even linear lexical scoping is perhaps slightly more complex than we want this language to be? - i dunno if it's actual a problem, is there really much of a complexity cost from this? If a compiler, you can still be doing single-pass; just have a few more state variables as you read thru the code that stores the last encountered number-of caller-save/callee-saves. If an interpreter, you can just use extra registers, as above, even though there's a perf cost to that. - still, even lexically linear stuff rubs me the wrong way a little. It seems it would be best to just have instruction operands in CALL and RETURN that contains that number of things that may have to be saved.

2) in actual register machines, the calling convention is just that, a convention, and programs have the option of using any register to pass arguments, not just the argument registers. However, in our case, some implementations might put some registers in actual registers and others in the stack or on the heap, so it might be simpler if we just made the caller-save/callee-save/argument register distinction an enforced property rather than a convention. That is, after a call, ONLY the argument registers are guaranteed to be preserved. Note that this strongly suggests that we should encapsulate callee-saving via an ENTRY/PROLOG instruction, so that implementations that wipe those registers anyway don't spend time redundantly saving the already wiped values (if it's left up to the program to do that explicitly).

so so far, i'm thinking:

don't do registers, don't even do lexical linear, just have operands for how many things might have to be saved
maybe don't even offer the opportunity to save caller-saves in the CALL instruction; programs have to really just use those as 'temporaries' that are potentially erased at each CALL (or they have to save them manually with another instruction). But DO encode the number of callee-saves (because the implementation might not really be having register bank shared across CALLs).
but still have ENTRY/PROLOG that must declare, in a 'lexical linear' way, the maximum number of callee-saved registers used anywhere in its scope; this allows an optimizing compiler to allocate less space on the stack or heap for these registers, instead of just allocating 255 (or 64, or however many we allow) registers per bank
but couldn't an optimizing compiler just look thru the CALLs in a given scope and take the max of them?
but also don't guarantee that anything other than the argument registers are shared with a subroutine, or that anything other than the return registers are shared upon returning

---

Another place that extra/special registers might be useful is in defining custom instructions. Again, though, that would allow custom instructions to be redefined dynamically, whereas we probably want them to be defined 'lexically linearly' instead.

---

still have to figure out if we're doing hardware-style CALLs, where the argument register are a fixed group of registers, or Lua-style CALLs, where the argument registers are after the register holding the address to be jumped to. I'm thinking that we should do it hardware-style because maybe there's some reason they all do that, and even if not, there may be some impedence mismatch in translating to that (for example in tail calls), although i haven't thought it thru so i'm not really sure if there's a cost or not.

---

agner likes elf best: https://www.agner.org/optimize/calling_conventions.pdf

---

could still consider stuff like custom instructions, with a special register determining what the custom jnstructions do

---

an example of someone else's simple assembly language and simulator/debugger:

---

points out that labels can be the only thing in a line.

i think labels should be the only thing in a line, and those lines should always start with :

---

the goal of LOVM is to be a portable 'assembly-like' language which is somewhat efficient and a somewhat good IL for transpilation:

assembly-like suggests:
linear instruction stream (e.g. not a textual core language or tree-like AST)
relatively simple encoding (e.g. not variable-length integers)
efficient code (especially in terms of density)
a good IL for transpilation (this implies a sort of 'readability')
e.g. when possible, e.g. just CALLs rather than elaborate preparation for calls
better to have too many local variable-holding places (registers, stack locs) than too few
avoid stuff like status registers which may be tricky to implement on some platforms, and which make automated reasoning about code tricky

---

consider having offset ranges like MIPS (which is greater than RISC-V):

16-bit conditional branches (RISC-V only does 12-bit)
32-bit unconditional jumps (RISC-V only does 20-bit)

---

mb define an optional 'status register' (which CANT be used as a gpr, to ease efficient implementation ) and optional GPR 'accumulator' to (1) allow bigger immediates in special instr forms and (2) ease an orthogonal 8bit instruction set

---

could specify that instruction boundaries msut be consistent throughtout the program, that is u cant jump into the middle of an jnstruction as viewed by the alignment implied by decoding starting from the PC (and going forward or back) at any other time

---

my calculation at the top of lowEndTargetsUnsorted5 regarding the NXP i.MX 6ULZ suggests that we stay under:

128 or maybe 512 bytes of state per thread (or at least for the TCB)

in units of 32-bit words (4 bytes), that's:

32 or maybe 128 words

so if we have 2 banks of 32 registers each, plus 2 smallstacks of 32 items each, that's already at the limit of 128 words, and we don't have room for metadata! So maybe that's too much.

It's still less than Erlang's processes though, which are at least 309 words [1] -- but that includes some non-stack heap as well as the stack (the process heap, which contains the stack, is 233 words).

so this suggests that we reduce the registers in each bank and each smallstack from 32 to 16. If we allow direct access to deep smallstack (indexed to TOS) as if it were another register bank, then a program can ignore the smallstack and effectively have 32 registers in each bank if it wants, which is enough.

So in each bank (before making space for the special registers) we'd have: 8 caller-save (temporary) GPRs 8 callee-save GPRs 16 smallstack GPRs

are arguments passed on the smallstack, or in the caller-save GPRs, or on the memory stack? With this setup we have more stack locs, so maybe on the smallstack.

---

so we're looking at:

16 regs in each bank; 8 caller-save (temporary) and 8 callee-save
special registers taken out of the caller-saves?
2 smallstack banks of 16 items each
so, 64 words of register-like state
the 32-bit and higher encodings have 255 values in each register operand -- since we only have 64 regs this gives us 2 extra address mode-like bits -- or we can reserve some or all of this for the implementation
ENTRY instruction specifies how many registers in each bank, for each of caller/callee-save, is needed, and also how many items in each smallstack is needed. The smallstacks are effectively callee-save. For each callee-save item needed, including each smallstack spot needed, the relevant registers/stack locs are spilled onto MEMSTACK by the ENTRY instruction. The specification of caller-saves needed has no visible effect (assuming the program never exceeds what it asked for) but allows the implementation not to allocate as much memory for register state, if that's how it works.
the smallstacks can be treated like registers with a special addressing mode. However the registers are indexed from the current TOS (should we provide an option for fixed indexing too?).
an instruction is provided to spill as much of the smallstack as necessary to MEMSTACK to free up a given amount of space (but not to spill the top m items, which could be used to preserve function arguments passed on the smallstack). However, on circular buffer implementations which don't remember the stack depth (only the pointer to current TOS), this instruction just spills the entire smallstack

one concern is that this proposal doesn't seem to provide enough caller-saves; in fact you want more caller-saves than callee-saves, not less. I suppose if arguments are passed on smallstack, though, those arguments can be thought of as caller-saves.

so maybe take the special registers out of callee-saves.

also, you probably want to interleave the caller- and callee-saves a little so that the 16-bit encoding, which can only easily access the first 8 regs, can access some of each. Just extend the Boot convention, i guess.

---

Looking ahead to the 64 8-bit encoding shortcuts, probably what we want to do is analyze and find the common things and then rationalize it so that we have a regular ISA subset. Initial thoughts, according to some stuff in my plChAssemblyFrequentInstructions:

MOV is common so we'll probably want all 4^2=16 CP combinations between registers R1 thru R4, inclusive, and also CPP. So that's 32 already (half of our budget).
LD is common so we'll want all 32 combinations of lw with register operands R1 thru R4, similarly to the previous.

Note that R1 thru R4 comprise the accumulator, TOS of SMALLSTACK, TOS-of-SMALLSTACK-with-implicit-push-pop, memstack pointer.

That's already all of our budget! But if it weren't, what else is common? Going thru everything in plChAssemblyFrequentInstructions, and writing down the stuff that appears in spot #1 in the various lists:

MOV (includes ld)
MOV
PUSH
LD
CALL
CALL
LM
LD
LD
LD
LD
conditional branch
LD
mov
lm
ld
mov
mv
addi (includes mov, lm)
addi (includes mov, lm)
mv
ld
mv
ld
mv
ld
mv
ld
mv

totals (top only): 10 mv, 11 ld, 2 lm, 2 addi, 2 call

writing down the stuff that appears in spot #2 in the various lists:

PUSH
SHL
MOV (includes LD)
MOV
RET
LM (load immediate)
LD
conditional branch
LM
conditional branch
conditional branch
LD
conditional branch
lea
lm
callw
ldsp (load from stack with offset)
ld
ld
ld
ld
lea
ld
compare
add
mv
add

totals (top only): 3 mv, 8 ld, 3 lm, 5 cond branch or compare, 2 add

so i think it's clear that mov and ld are the most common, all things considered. After that maybe lm, and then maybe a conditional branch.

HOWEVER we won't have the offsets that these things usually do, so these might not be right for us, we'll have to see.

if we really do just have 32 MOVs between R1-R4 and 32 LDs between R1-R4, then we can completely regularize:

2 encoding format bits
1 opcode bit (CP or LW)
1 register bank bit (int32 or ptr)
2 source register operand bits
2 dest register operand bits

let's go with that for now, as this high degree of regularization may keep the interpreter code smaller

---

i think when making the calling convention we really need to think about making calling not very costly, a la Forth

---

And remember that if we only use 64 registers specifiers then we can use four addressing modes even in the 32-bit encoding

Lovm can have a limited number of registers but ovm can have 255. Lovm can have four addressing modes, or 8 if we don't specify the register bank in the operand.

---

RISC-V says that they didn't use a CAS instruction because it would have 3 source operands (or 5, for double-wide CAS) (and apparently the maximal number of input operands over the entire instruction set determines how many ports are required on the register file, which makes hardware power and mb perf worse).

Here's some things with 3 source operands:

CAS (expected old value, new value, destination address)
STore with indexed offset (value to store, destination address, index)
fused multiply-accumulate (a = a + b*c)

i think we want those in LOVM (and probably even in BootX?). Mb not in Boot tho.

similarly, yeah, addressing modes makes things slightly harder for hardware, so they are in LOVM but not in Boot.

in fact, right now in Boot, i think we're already good; the STore instructions have only an immediate offset. So i don't think any instruction in core Boot reads 3 input registers.

---

i guess that if you pass operands in registers, then basically you could just say that up to all of the temporaries (caller-save) registers are used for arguments, in their are that many (non-vararg) arguments, then the last two are used for #rest-of-args and for ptr to the rest of the args.

should we pass arguments in registers or on memstack? probably in registers; want calling to be efficient
should we use a link registers? forwardcom doesnt. but i think we should, to make calling efficient
should we pass args in the smallstack? i'll have to think about this
what about caller/callee save/restore of smallstack? i'll have to think about this
normally you'd just say "callee can use the stack, and leaves the existing stack as they found it", but the issue here is that the smallstack may need to be spilled if the caller (and ancestors) used a lot of it and the callee is using a lot of it too. I'm thinking that part of what ENTRY does is call a special platform primitive, telling it how much smallstack space the callee needs, and the platform spills the smallstack to memstack as necessary (from the top? from the bottom?). But there are more questions: eg if it could be spilled from the top, then the callee can't read what the caller put on the stack, is that okay?

---

maybe xret0 could take an $ncond?

---

consider a pointer to data storage area(s), and mb thread-local data, as part of register convention

Describe convention including stack pointer. Consider zero page like addressing mode.

---

for assembler:

look at example from https://www.muppetlabs.com/~breadbox/software/tiny/teensy.html. of empty elf file; it has $ and $$ for locations of stuff in the file, as well labels whose location can be subtraced from $, as well as db for data, dv for variable data, and equ to set a variable. i think: variables and labels are the same hthing. $$ is label of beginning of file and $ is label for current location

---

201021

The next big question for Lovm is Whether it is really just an assembly language and therefore has a normal-ish amount of registers, or alternately whether it is a high-level assembly language of some sort whatever that means. hence, Can it have a large number of potential registers without actually taking up space for those if they are unused. this might(?) imply that it always knows Which function it is in and how many registers that function actually needs. We know that it is somewhat aware of calling conventions and memstack, like LLVM. But it is not as high level of Lua, which encapsulates the actual location function locals This question is important right now because it determines how much space in the encoding we need for registers therefore how many addressing modes we have, and Probably more importantly, the amount that the architecture constrains what we do in order to stay aware of how many registers are in use. i guess the real question is, Can I come up with a set of zero-cost conventions That allow the architecture to have a zillion registers in theory and yet never have to actually have them If the program never requests them. Some constraints that might bear on whether we have lots of registers or not: we're going to be doing stuff like tail calls and yielding between coroutines and green threads, so we are going to have to serialize state possibly to places other than a linear stack. We want compilation to be fast, so this might rule out a lot of global stuff in the language because we probably want to emit code as we encounter the first time (is that what is meant by a single pass compiler?). We want compilation to be simple if we target boot or LLVM. We don't want to be doing lots of allocations unnecessarily. we don't want to have mandatory minimal state per greens thread be large. If we have a lot of registers we probably want a single instruction To save / restore state that can be used during green threads. To keep it zero cost we should probably tell it how much state there is. If we have four types of state (two register banks and two small stacks) and each type has a 64 item max, then it takes six bits to describe each of these four types for a total of 6 * 4 = 24. How many data bits do we have in a 32-bit instruction encoding? 8 * 3 = 24. So it's not crazy to think that we could do this.

what are our goals here? well, although we want to be easily compilable to Boot, really we could do without Boot; Boot is just there to make it easier to implement our stack in case someone who doesn't have the time to implement LOVM directly. So Boot is kind of like training wheels (although there's nothing wrong with just implementing Boot on a platform and then never implementing LOVM; performance may be worse but everything else should work). Really we should be thinking of LOVM as our lowest-level 'assembly-like' VM. So we don't want to be climbing the abstraction ladder yet, unless we can do so at (close to) zero cost. For example, we don't want to force the implementation to do things like register allocation just to compile LOVM to Boot; it should be feasible for the extra registers in LOVM to just be straightforwardly mapped to memory. On the other hand, the reason we even bother with LOVM given that we already have Boot is that we don't just want any old assembly-like language, we want a kick-ass assembly-like language (as noted elsewhere, partially so that LOVM can serve as an intermediate language for transpiliation; but also partially just because if we have to make up an assembly language and then use it for writing all of OVM, it may as well be a kick-ass one, so that the OVM implementation is both readable and performant; yet the LOVM implementation still has to be small and simple).

---

so i guess the thing to do is think about: if we have an abundance of registers, but the program doesn't need them, does the implementation end up causing the possibility of using the others to have a significant cost? Since we are choosing between 16-regs-per-bank and 64-regs-per-bank, our goal is to imagine we have 64-regs-per-bank and see if we can find a situation in which that has to be worse than 16-regs-per-bank.

so let's imagine that we have 64 registers in each bank, but a program/function P only uses 16 of them. What happens during:

function calls
tail calls
green thread initialization
green thread task switch
saving and accessing activation records on the heap

---

function calls: P, who only uses 4 registers, nevertheless will be forced to use a lot more when P calls another function who takes a lot of arguments, if the calling convention says that a large number of registers are available for passing arguments and that they should be used first.

so, if we want the 64-reg case to be no worse than the 16-reg case, we should have at most 16 argument-passing registers.

---

if there are some implementations that dont track the smallstack stack depth (using just a circular stack, for example), they will have to spill the entire smallstack if a callee needs to push anything. So if there are many regs, either the implementation must track stack depth, or the program must and must pass that information to callees, who uses or passes that information to the implementation when it is/may be time to spill.

---

i think the possibility of tail calls implies caller-cleanup?

(assuming we dont want a separate calling convention for tail calls)

---

if some greenthreads have less state than others than we need to store the length of state being stored. if maximal state is 16*4, that's 4 bits*4 = 16 bits, and if it's 64*4, that's 6 bits*4 = 24 bits. So, two bytes vs 3 (or maybe 4 if 16-bit alignment is needed). Not a big deal.

---

if LOVM provides special instructions to save and restore state, and if there's a lot of potential state, then you have to pass arguments to these instrs saying how much state there actually is. If this is 16 bits then you still have a third operand for another purpose, but if it's 24 bits then all of your operands are used up, and these instructions must have hardcoded input and output registers that are the only ones they can use. Ugly but not a big deal.

---

in 24 bits there's not enough space to separately specify how many caller-save vs how many callee-save registers are in use for 4 register sets (2 banks, 2 smallstacks). If you need that granularity then you're going to have to have multiple instruction calls to pass along the register usage info.

---

mb see https://cs.stackexchange.com/questions/35760/what-are-the-disadvantages-of-having-many-registers

---

i'm 1/2 way thru reading https://cs.stackexchange.com/questions/35760/what-are-the-disadvantages-of-having-many-registers

so far:

they say the instruction encoding gets larger. yes, i'm aware of that and have been considering it.
they say that b/c the physical space taken up by the register file electronics gets larger, the wires get longer, which makes things slower and hotter. i don't think that's too relevant b/c we are imagining that physically, most of our registers are actually stored in memory anyhow, it's just that the VM exposes them as registers to make the programming environment better/more efficient.
they say that the CPU has more state to save for speculative checkpointing. i hadn't thought of that; that implies that if we want more registers without the cost, then the VM must be informed about how many registers are in use at all times. This implies that anytime you do the sort of 'context switch' that changes that, the VM should be informed of that. So either the VM has to get involved in remembering how many registers each context is using, or up on each context switch, that information must be communicated to the VM. With 64*4 regs it takes 24 bits to communicate this, which means that all 3 operands are taken up, which means you need an entire instruction just for that (and you can't, for example, just put it in one operand of a JMP instruction)

however, if the bits to say how many registers are in use is such a big deal, we can just say that only powers-of-two registers can chosen to be in use. That is, for each bank, you can only choose between:

0 registers
1 registers
2 registers
4 registers
8 registers
16 registers
32 registers
64 registers

that's 8 choices, so it takes 3 bits instead of 6, times 4, is 12 bits total. So there are 12 bits left for the other operands. That's not a multiple of 8, however, so it means that we only have 1.5 other operands to work with.

If we wanted to fit this information in just 1 operand, we'd only have 8 bits total, which is 8/4=2 bits per bank, so we could only have 4 choices. What might those be?

we probably want to preserve the '0 registers' choice, because we may have substantial savings when we can skip saving one bank altogether. And if we are talking about having 64 regs then we'd better save that choice. Which others to get rid of? On the one hand, the cost of going from 32 to 64 registers is substantial, so we may want to save 32. On the other hand, functions that use few registers are probably more common, so distinguishing between them probably saves more.

Design of the RISC-V Instruction Set Architecture by Andrew Waterman Figure 5.2: Cumulative frequency of integer register usage in the SPEC CPU2006 benchmarksuite, sorted in descending order of frequency shows that even as little as 1 register in use occurs 20% of the time, with a very smooth curve up as you get more registers. Looks sort of like an inflection point is 8 registers, however.

we have 4 banks so is 8 registers for them like 2 for us? probably more like 4 for us.

also probably either '8 registers per bank' or '16 registers per bank' is conceptually attractive because would also be good choices for the total number of registers available period.

4 registers per bank is just enough to have:

a smallstack TOS register
a smallstack register
a memstack register
an accumulator

(although if we have addr modes we may not need a smallstack register and/or a smallstack TOS register)

So maybe something like:

0 registers
4 registers
16 registers
64 registers

---

to summarize the previous section: i'm still reading that thing but one thing i though of was:

you may want to keep the VM informed at all times about how many registers are in use. That means that you want to be able to encode that information in an operand or two when jumping. Which means that you want it to fit in either 8 or 16 bits total. If, for each of 4 register banks, you specify exactly how many regs are in use, that's too much (24 bits). But if you constrain to powers of 2, that's only 3 bits per bank = 12 bits. And if you further constrain to just 4 choices (2 bits), that's 8 bits total. The latter two choices are attractive because now you can include this information with every jump instruction that changes 'context'.

for 4 choices, good choices might be:

0 registers
4 registers
16 registers
64 registers

recall that these are each 'per bank'.

---

one issue is that if you are jumping into library code or you are the scheduler jumping into a greenthread, you dont statically know how many registers it needs, so you have to read this from a register or memory or something. But other times you do statically know, and you probably want to surface that to the implementation by using an immediate. So you want addressing modes on the information about how many registers are in use. But with even 8 bits, there's no space for that. I guess you could just use multiple opcodes for different addressing modes. Or, with 8 bits (but not 12 bits), you could use 2 operands.

i'm kind of thinking there might be unknown benefits to having a small menu of choices about how many registers are active, even besides having more operands free (physical registers could have sleep modes, virtual ones could use slab allocation to reduce fragmentation, register save/restore code could be unrolled/specialized/'monomorphized' to these choices, etc). So yeah, if we decide to do >16 regs per bank at all, then i like:

0 registers
4 registers
16 registers
64 registers

---

but now imagine you're jumping back to someone else's code. Now you have to store both (a) the return address, and (b) their register usage configuration. So you can't just use a link register, you have to store 2 things on the stack, etc.

mb the implementation could ease this by storing the return address together with the register usage config using tag bits. How many tag bits are available? on x64 right now memory addrs are only 48 bits but i've heard intel is switching to '5 level paging' which will use 57 bits (leaving 6 or 7, i guess). and i guess from first principals, if a 32-bit address must be 32-bit aligned then you get 4 bits free.

now, 4 bits is only half of the 8 bits needed for the above. Having one bit for each bank isn't too useful (16 vs 64? off vs on? you'd really like to go at least off/16/64, in which case you have 4 bits so you may as well throw in another choice). Maybe restrict the combinations across banks? e.g. obviously you'll probably never want all 4 banks off.

so, with 4 bits, you get 16 choices across all 4 banks.

maybe:

4/0/4/0? (note: we left out 0/4/0/4)
4/4/0/0
0/0/4/4
4/0/0/4
0/4/4/0
4/4/4/4
16/4/4/4
4/16/4/4
4/4/16/4
4/4/4/16
16/16/16/16
64/16/16/16
16/64/16/16
16/16/64/16
16/16/16/64
64/64/64/64

that seems to be working out reasonably, if not great. However merely storing this table adds cost and complexity, so i dunno. but i expect i can find some regular function on 4 bits to come up with that sort of thing. Something like: the sum of the 4 bits determines the two choices of values between the slashes, and then where the bits are determines where the bigger value is. This rule (as with most regular ones, i bet) still leaves an all-zero, however -- whatevs. eh, the table is probably fine.

the table is 4x16 entries, where each entry is 6 bits, so 4*16*6/8 = 48 bytes. Not nothing, but it's only 12 words.

however, one issue i have with this is that some architectures may not have ANY tag bits available. i don't want the program to be able to just assume that any pointer can always hold 4 tag bits, do i? or am i happy with the implementation falling back to saving 256 registers every time on platforms with no tag bits?

wait a minnit i'm dumb: fitting the register state configuration into four tag bits is kind of stupid because the register state itself has to be saved somewhere else, so it can easily have a length header. The only time you need to describe the amount of state is when you're creating that state -- either an ENTRY instruction to allocate the registers before us, and/or before you jump away when you save them to say how many are currently in use

---

so far i see no (good enough) reason not to allow up to 64 registers, while allowing an 8-bit selection of how many of those are active e.g. within each function.

---

if ENTRY selects a # of regs, could store this choice of # of registers in a register itself; then the program can save it on the stack before a call the same way a non-leaf function saves the link register (although a special instruction will be provided to do this stuff; each program doesn't have to implement it manually).

---

so, i'm tired right now and probably making mistakes, but i think we're looking at:

the ENTRY instruction takes information about how many registers are needed of various types (either one or two 8-bit operands, depending on whether or not # of callee-saved and caller-saved are provided separately)
the ENTRY instruction spills any necessary callee-saved registers to the stack, and saves the #registers configuration to a register, after moving the old #registers configuration to a different register
the CALL instruction spills the link register, the old #registers configuration, and any necessary caller-saves to the stack, writes the current address into the link register, and jumps
the RET instruction restores callee-saved registers (looking at the current #registers config register to see how many there are that need to be restored), and the old #registers configuration, remembers the current link register, restores the old link registers, then jumps to the remembered current link register

the cost is:

more state: we store those extra two #register registers (the current one and the old one)
perf: we have to execute ENTRY, which moves the old #register register value into the secondary register, and writes the new value into the #register register; and CALL and RET must save and restore the #registers register

and the benefit is:

the machine knows which registers are in use
because of this, it's easy to let the machine do the spilling/saving of registers

if the machine knows, it might be able to implement something like register windows instead of just spilling to the stack always

if the machine doesn't need to know, or if it can just have hidden state (or special state in a similar manner to the PC) that remembers how many registers are currently active, we might be able to avoid the #register register. Upon RET, we know how many registers to restore because that information can be saved with the saved register states themselves. Upon CALL, we provide the number of caller-saves as an operand. Currently, this sounds better to me.

one reason to mb keep the #registers register, and the secondary #registers register, is that, consider a function that uses no caller-saved registers. There is no reason for this function to need to save anything on the stack before a CALL, but it does have to save its #registers state. So maybe, like a link register, it would prefer to save that state in another register, rather than on the stack.

---

I'm overthinking it; most of this can be done by calling sequences in software without ISA support.

(note: see next section for a more readable summary of the following thoughts)

The real problem is in pre allocating resources for a new thread (green thread or otherwise); Even if the main routine doesn't use that many registers, it could conceivably use a library that calls another library that uses a lot. You could say oh hey I'll just never call libraries that do that, but the language-wide convention as to how many registers are available will probably lead to some libraries using most of them.

And you probably don't want to allocate space for registers and then have to reallocate later, although actually maybe that's okay cuz isn't that what golang and erlang do with stacks? Although i guess the real problem here is that if registers are being used right from the memory that they're allocated in for each greenthread, then this forces a check on every register write to see if it's too large to fit in allocated memory.

So the problem is really just for massive parallelism. But do we care; LOVM is already too beefy for the kind of small chip we expect to find in a massively parallel system; LOVM already supports 64-bit integers and floating point, which may not be available in such a system so we might just say that we expect that LOVM would have to be subseted or compiled down for that application anyway. If you did actually have massively parallel hardware, it would be relatively trivial just to recompile all applications and system libraries register allocating down to less registers. And you probably have to anyway because LOVM is probably too beefy for a small chip like that.

So the remaining problem case is green threads. In green threads, if you store registers in physical registers, you have to load those in on each task switch anyway, so I guess it's okay for different green threads to have different numbers of registers; again if libraries use more you can always compile four versions of each library with various register usages. In this case, though, it may be useful to make the programs declare how many registers they are using at any given time, so that the implementation knows how many to bother copying into the physical registers upon a context switch.

However the more problematic case is what if the implementation wants to execute each green thread by accessing the registers directly out of memory without copying them to one standard location? Here we have more of a problem because now before each register access you have to check if that register is really there in that green thread's TCB, whereas if there are few enough registers that you just always allocate enough space for all of them, you don't have that problem.

So here again we might find use for a special register allocation instruction declaring a change in the number of registers being used, which probably leads to calling conventions with a register containing the number of currently active registers. In this case, if a green thread is executing directly out of memory, and it calls the library using more registers than it uses, the library's register allocation instruction will notify the runtime, which can do the check once at that time to see if there are enough registers in this green threads TCB and if not, to allocate more; then you don't have to consider this possibility when accessing registers, only when you hit those #register instructions. However that's pretty complicated, I think it might actually be simpler just to tell users that if they need a massive number of threads, just recompile everything to use less registers. So the idea is the encoding supports lots of registers, but the implementation (to the extent that "implementation" includes library binaries) has (at least four) different profiles depending on how much concurrency you need.

In other words, for a very concurrent implementation, just compile LOVM code to use less registers at compile time.

Now consider again the more likely case where some registers are actually copied into target platform registers. The target platform probably has fewer registers than LOVM in any case. So LOVM is probably going to have some registers stored in platform registers and others stored in memory. So it's already doing some sort of check (or compile time specialization) on the register number and treating them differently. In the case of green threads where we initially allocated too few registers, we're just adding one more check here.

Also in the green thread case, when you think about the objection that a small program might call a library that needs to use lots of registers and worry about how that means we can use less concurrent threats, that doesn't quite make sense because the actual amount of memory needed to store each thread state is dependent on the program not the ISA; that library had to store that state somewhere whether it's in registers or in heap allocated to that green thread, so using a different ISA with less registers would not have allowed you to run more green threats concurrently in total anyway! So it may not be a big deal; The real cost is offering more registers than the platform offers, which we probably do want to do at this level.

i guess what i'm saying is that if you really care about maximizing concurrency by having little state per thread, then you have to only call library calls that actually don't need much state, which is a property independent of how many registers it uses. If you don't meet that condition, no ISA can help you; you need to store more state than you have memory available. If you do meet that condition, then you can compile LOVM down to LOVM-with-less-registers if you need to.

---

to rewrite the previous section more readably:

the question is, how many registers should the LOVM ISA have?

i've been worrying about how this affects the calling conventions, but i think i've been overthinking that; even if the ISA provides no support for keeping track of how many registers are in use, if the software saves and restores registers itself before/after calls using calling sequences, then it can do whatever is needed. For calling conventions and register save/restore, there is not much benefit to explicitly declaring how many registers are in use in a way that other parts of the system can read that, because even if a caller isn't using all of the callee-saves, it's possible that a grandfather caller is using the other ones, so the callee must still save them if it needs to use them. And symmetrically, even if a callee is not using all of the callee saved, the caller still needs to save them if it needs them to persist across calls, because a grandchild callee might overwrite them. (and, much of the time you will be deep in the call stack so it's likely that SOME grandfather is using each caller-save register, up to some point (64*4 may be way past that point)). So you'd need whole program analysis to take advantage of that sort of thing (which could be implemented with low cost with just declarations in the object file format of how many registers are being used by each function plus anything it might call), but it's not useful if you can only consider the register usage of the current caller and the current callee.

The real problem is, how much space do we allocate for registers for each process in a massively concurrent system?

first, you may think that having too many registers limits the amount of concurrent processes that a system can support (assuming a fixed total amount of memory). This isn't quite true: consider three cases:

A) the code needs to use all those registers for concurrently live variables B) the code doesn't need to use all those registers for concurrently live variables, but it does anyways because it is inefficient C) the code doesn't use all those registers

A: even if there were less registers, the program still needs this much state per process, so the total number of concurrent processes that you can fit into a fixed amount of memory is the same, whether or not you have lots of registers.

B: this can be turned into case C by writing the program to be more frugal in its register usage

C: some of the registers are unused, so it's possible for the implementation to get away with never allocating space for them. So let's explore the consequences of this.

Let's consider a few possibilities:

1) LOVM is running on a massively parallel hardware system; how many registers does each core have? We want few registers per core
1a) subcase: each core's physical ISA is LOVM
1b) subcase: the process running on each core is written in LOVM and compiled to the core's physical ISA
1c) subcase: each core is running a LOVM interpreter
2) LOVM is running on many processes or greenthreads; how much space per thread do we allocate to store register state?
2a) subcase: when executing LOVM, we access the registers directly out of this memory
2b) subcase: when executing LOVM (thinking of greenthreads here with a few 'virtual cores' in separate processes, and each process servicing many greenthreads), upon each context switch we copy all of the registers into some fixed set of memory representing the registers for our virtual core, and then we execute
2c) subcase: when executing LOVM, we copy some prefix subset of the registers into target platform registers, and then execute the rest out of memory, in one of the ways noted above

case 1a: this is unlikely because in this case the cores need to be small and simple and LOVM is probably too heavyweight anyways. So LOVM is probably not the ISA of each of these cores; even if something LOVM-like is there, it's probably a restricted subset (e.g. no 64-bit integer math) so we can also have a restriction on # of regs.

case 1b: we have a compilation step, so we do register allocation during that compilation and we can compile down to however many registers are physically available.

case 1c: here we have a problem. However it's not too terrible to say that these cores only implement a restricted profile of LOVM that supports less registers. Any LOVM program can be compiled (or even JIT-compiled) to one in this restricted profile using register allocation. So, the result is that we are forced to do some compilation (even if only in the form of JIT) even though we wanted to do interpretation. If we insist on JIT-compilation this affects throughput, if we use AOT it affects latency, but otherwise it's fine.

case 2a: here we have a problem because, if we initially allocate less memory for the registers in hopes that it won't need all of them, but then it does, we have to detect that and reallocate memory; which means that we have to do a check before each register write, or we have to think of registers as 'low' registers (for which space is allocated in the default initial allocation) or 'high' ones (which must be treated specially). In the latter case, either we do checks before each acces of high registers or we compile multiple variants of each piece of code for use in the situation before and after we've reallocated, and branch to one of them upon each entry.

in the former case (check before variable accesses), upon each access to a 'high' register (a high reg is one that isn't included in our default initial allocation), we have to first do a check to see if we've done the necessary reallocation yet (or, we compile multiple variants of each piece of code for use in the situation before and after we've reallocated, and branch to one of them upon each entry).

this is the sort of thing golang and erlang have to do to grow their stacks in their greenthreads anyways, so it's not too terrible, but here we're talking about it for registers, rather than stacks!

there's two kinds of costs here:

the cost of treating registers differently, in a staticly-known way (the difference between 'low' and 'high' registers)
the cost of treating high registers differently depending on dynamic state

And again, if we really care about perf, we can do whole-program compilation.

In the latter case (compile multiple variants of each piece of code), this adds the overhead of storing that extra code, and the overhead of that extra branch (which isn't much i guess because it only has to happen once, at reallocation; after that, we stay in the part of the code dealing with the reallocated variant, and this is preserved across context switches b/c the PC is preserved).

case 2b: in this case it's only the smaller number of target platform processes which have their memory requirements increased. This is probably okay; if you have this situation, you probably have many fewer target platform processes than greenthreads, so the memory requirements of the target platform process VM probably don't dominate the memory requirements of the system.

case 2c: this is like 2a, except that here we already are treating 'low' and 'high' regs differently, so we can assume that the cost of treating registers differently in a staticly-known way is small.

So, to summarize all that:

LOVM proper isn't really targeting the tiny chips needed for massive concurrency; you'd have to subset or compile or interpret it anyways
there's only a problem when we have a program that really needs less state than the number of provided registers (or when it has more state but cant put it in regs b/c it needs it to be addressable)
in this case, if we care about perf and maximal concurrency, we could eliminate the problem with whole-program analysis at compiliation or link or load time
so the only problematic case is when we have a program that doesn't need all the regs, AND we car about perf or concurrency, AND we refuse to do whole-program analysis
and this is only a problem if we want to execute greenthreads using their stored register banks directly out of the TCB memory. In which case, either we have (at least) a 2x binary bloat (to store variants of code cover the pre- and post-reallocationg), or we maybe have some implementation bloat to treat 'low' registers differently, and then have a slowdown because we do a check before each write to high registers.

If we care about perf that much we can do whole-program analysis, which would detect that less regs are needed. So only thing many registers really prevents is using full LOVM on a tiny chip without subsetting.

---

So my conclusion is that we can use the larger number of registers. Using LOVM directly on a tiny chip without compilation, interpretation, or subsetting to a restrictive variant of LOVM is out of scope.

It seems like we don't need an architecturally visible register for how many registers are in use by the current function; if the implementation wants that, it can just cache the last-seen value of the #registers allocation instructions. #registers allocation instructions may be useful for this reason, however. So ENTRY should include that, and we should provide instructions (or additions to functionality of instructions like RET) which save and restore groups of regs from memory (which will often just be to the stack).

---

So to summarize, we're looking at:

the ENTRY instruction takes information about how many registers are needed of various types (in one 8-bit immediate operand, so we can't separately given information on numbers of callee-saved and caller-saved for each bank)
the ENTRY instruction spills any necessary callee-saved registers to designated memory (typically the stack)
the CALL instruction spills the link register, and any necessary caller-saves to designated memory, writes the current address into the link register, and jumps
the RET instruction restores callee-saved registers (the number of which is in an immediate operand), remembers the current link register, restores the old link registers, then jumps to the remembered current link register
for coroutines, maybe break these up into subfunctions differently; you want to do the saves and restores of both RET and CALL before transferring control
i think this works for tail calls already

---

removed old text from lovm ref:

the above is out of date. the new plan is for LOVM to go up to 32-bit encoding, and leave the 64-bit encoding for OVM. To keep state small, i'm thinking that maybe instead of 255 regs (locals), there may only be 16 regs in each bank (and each smallstack), so there is plenty of room for addr modes in each operand. In other words, instead of making this be something with more variables that compiles to an assembly language, i'm thinking of making this just an assembly language -- ie if an implementor were more experienced or patient, they would probably skip Boot totally and implement this directly.
So, LOVM would be like Boot, but with 16 regs per bank instead of 8, with smallstack(s), with addr modes, and with enough opcodes to fit all of the basic instructions (including e.g. arrays). call, ret. some higher level control flow like switch, and some loop helpers like lua bytecode. A usable bare bones assembler with labels and macros. Nothing too high level (so probably not 255 variables).
need to decide on number of regs tho. see lovmNotes1.

---

Rather than Computing Boot "reference opcodes" the way that I did it, could consider "expanding opcodes" and just let the bit pattern of the concatenated opcode field and opcode-like operand fields determine the reference opcode. However this generates OP codes larger than 256, which isn't helpful for reuse in LOVM.

---

some of what compilers use many registers for is loop unrolling. However, in our case, since the many registers probably aren't physical but the first few of them probably are, we don't want compilers doing that. So, we should probably provide special loop instructions to make an affordance that suggest that that's the best way to loop. We should also explicitly mention this issue.

---

https://stackoverflow.com/questions/35465373/why-not-have-separate-code-and-data-stacks

no answer is really given except for legacy incompatibility

one thing i can think of: harder to layout stuff in memory/memory fragmentation; you can't just have a heap growing upwards and a stack growing down, now you have two stacks. If you don't want to be segmenting them or moving them around all the time, now you gotta conservatively preallocate enough space for both of them. With our SMALLSTACKS we don't have this problem b/c they are fixed size (but see next section).

http://s3.eurecom.fr/docs/secucode09_francillon.pdf has a good thing to do regarding fragmentation; it puts the data stack on the bottom, growing UP, and puts the return stack (which ONLY holds return addresses) on the top, growing DOWN. This way the sum of both of them is that same that you'd normally allocate to a single stack, and they will never run out of memory unless the traditional single stack would run out of memory.

Of course that only works for two stacks. If we store our smallstacks as ordinary stacks to make it easier to spill things, then we'd have 4 stacks (integer smallstack, pointer smallstack, memory data stack (both data types), memory return stack (only pointers)).

we wouldn't want to unify our pointer smallstack with the memory return stack because the whole point of the memory return stack is that ordinary data processing isn't always mucking around with it, making it easier to secure. Otoh the actual use of a memory return stack in http://s3.eurecom.fr/docs/secucode09_francillon.pdf is to prohibit access to it except by CALL and RET, and we don't want to do that b/c we want to support task switching/coroutines/fibers/greenthreads, which requires swapping out the callstack or at least storing it in memory.

---

"The typical call stack is used for the return address, locals, and parameters (known as a call frame). In some environments there may be more or fewer functions assigned to the call stack. In the Forth programming language, for example, ordinarily only the return address, counted loop parameters and indexes, and possibly local variables are stored on the call stack (which in that environment is named the return stack), although any data can be temporarily placed there using special return-stack handling code so long as the needs of calls and returns are respected; parameters are ordinarily stored on a separate data stack or parameter stack, typically called the stack in Forth terminology even though there is a call stack since it is usually accessed more explicitly. Some Forths also have a third stack for floating-point parameters. " -- wikipedia

so we might decree that the calling convention is that parameters are passed on our smallstack. However that makes it certain that the smallstack will need to be spilled as we get deep in the call chain. Otoh on some implementations the 'small'stack is could just be implemented as an ordinary in-memory stack, then spilling is a no-op. But then with multiple stacks this reintroduces the memory fragmentation issue that our fixed-size stack avoided.

alternately, we might consider not providing facilities to spill the smallstack at all; the calling convention would then be that each function needs a certain amont of smallstack space, and the caller is responsible for making sure that much space is left. Or, more simply, we could make that expectation constant, at half of the stack size; if the caller needs more to persist between calls, they must spill, and if the callee needs more, they must spill.

---

the idea of storing the smallstack in memory and spilling it makes you think however. maybe there doesn't have to be a distinction between registers and the smallstacks at all?

Does there have to be a distinction between smallstacks and memstack? Well, yes, because smallstacks are not addressable. That lets a compiler optimize them better. But smallstacks could be regarded as the 'top of memstack'.

---

if we specify both caller- and callee #registers in use in ENTRY, with 4 choices each (0, 4, 16, 64) that's more than 8 bits; if we only specify that for registers and leave the smallstacks with a single thing, then that's 12 bits. So may as well fully use two 8-bit operands, which gives up 4 extra bits. But now we have 6 things that we are specifying, so i'm not quite sure how to use those 4 bits. I guess you could use them to specify: x2 for all registers, x2 for all stacks, x2 for ints, x2 for ptrs. Or instead of registers vs stack, x2 for caller, x2 for callee. Or we could just specify 4 of those 6 things with a finer grain (3 bits instead of 2, so 8 choices instead of 4: 0,1,2,4,8,16,32,64).

---

if we have a lot of registers all of which are probably not being used, let's make ENTRY be mandatory, so that the implementation can not allocate or not 'turn on' all of the registers until it encounters an ENTRY that requests them.

This also suggests that we do want to specify both caller and callee registers in ENTRY so that we can only turn on the ones we need. The third operand to ENTRY is the place to spill to if needed. If ENTRY should not spill, then this third operand can be immediate zero.

Same convention to not spill in CALL (the register that would ordinarily hold the place to spill to is instead immediate zero).

---

if we have immediates, then we don't need the zero registers. Then we can make those the 'smallstack push/pop' registers instead of #2, which allows us to have Boot compatibility without a 'user profile'.

but wait a minute, we WANT Boot to be able to do smallstack push/pop. So that's no good.

---

ppl say IA-64 failed b/c explicit, instruction-level parallelism (ILP) is too hard for compiler to figure out. But my guess is that (a) even if it was good but backwards incompatibility alone could have killed it, and (b) maybe the way IA-64 worked required it to understand the CPU architectural details too much.

we could put annotations in there that help the implementation know what instruction-level-parallelism opportunities exist. If you say 'here's a packet of instructions to run at once', that . Just annotations that say, hey, if you want to run these at once, you can (or maybe the opposite, assume that you can, and then have required annotations that say, hey, if you want to run these in parallel, you can't).

the simplest way to do this would be to group instructions together into blocks that can all be done in parallel. But the reality is probably more complicated; you might have things like:

A B C

where AB can be done at once, but also BC can be done at once, but A must precede C. You might also have things like:

A B C D

where A must be before B, and C must be before D, but A and B can be in parallel with C and D

However IA-64 just put 'stops' to delineate groups of instructions that could be reordered (in some ways). If it's complicated enough for them, i don't see why we should complicate it any more. One bit per instruction can indicate whether or not a 'stop' is present before that instruction. An 'instruction group' is a group of instructions between stops. If you had more bits you could have multiple 'lanes' for the 'stops' which could encode the more complicated stuff above.

---

wait we don't really have space for 4 addr modes and 64 registers in each bank; we need the encoding format bits. There are 3 of those.

so each operand only has 7 bits, not 8. So we only have 2 addr modes if we have 64 regs (immediate and register).

4 addr modes would be nicer than 2, so that might be enough of a reason to go back to 32 regs per bank.

the opcode could either have 256 instructions, or it could have 128 and we have room for an extra bit (which could be a 'stop bit' for ILP)

256 instructions is pretty nice so we may have to do without those stop bits. Also, is there even any way that a VM in software could really make use of that sort of thing? I guess you could have two physical threads trying to run every one thread in the VM, each one dividing up all of the instructions in the current instruction group between them, and then joining. But something tells me that the cross-thread synchronization there will overwhelm the savings from parallelizations.

best guess:

yes, use separate int and ptr banks
yes, use separate registers and smallstacks
yes, assume that more registers is better, not worse
yes, spill both registers and smallstacks to memory via special instructions -- so the registers are the visible frame within an unbounded stream of 'stacked registers', and the smallstacks are the last n visible items within an unbounded stack
??/todo: should we have a separate in-memory 'call stack' to store just return addresses on (and whatever else the program manually puts there)? or should we combine this with the ptr stack? i'm thinking separate
the smallstack is then like the tip of a data stack. The in-memory call stack is then like a return stack. The backing store for the smallstacks are like data stacks.
yes, this means that each thread has 3 stacks to allocate for; the int data stack, the ptr data stack, and the call/memory stack (if you assume that eg the int registers and int smallstack go 'next to' each other; o/w you have 5). However, it might be possible to interleave the frames of the int data stack and ptr data stack in memory, since only one frame can be accessed at a time. I sort of like that, b/c it keeps the frames relevant to each function next to each other; otoh if a language wants to do something unusual there it may run into trouble.
??/todo: think about this
but if a different return stack and data stack, then what to pass to ENTRY and CALL for the parameters of the pointer to the stack? do we make stuff implicit? use special registers?
I feel like we need some smaller primitives besides ENTRY, CALL, RET; not sure what tho.
yes, 4th register mode is stack-as-registers, but indexed from TOS
pass arguments on smallstack(s) or in registers? i'm thinking registers
yes separate registers into: global, input/locals, output
yes ENTRY needs to configure these
i guess these are like (variable size) register windows -- why are those hated?
mb for reasons that don't apply here (see next section)
yes we need some special instructions to control the interaction between the stacked registers (and "stacked smallstack") and the backing store; need a flush-from-regs-to-store, a flush-from-store-to-regs, and (if those flushes don't cover the current register frame) a 'cover' to push the current register frame back so that it is affected
yes upon calling we need to push the configuration of the previous register frame into a register, like we do with the link register; it needn't be immediately spilled to the in-memory stack b/c perhaps this is a leaf function

---

cons of register windows:

"You also see stack-like register windows (SPARC, i960, Am29k) and rotating register files (Itanium) but the general consensus seems to be that this is overdesigned and MIPS does just as well with 32 ordinary registers. 16 registers is almost as good in typical code (see: ARM, x64) but the "cost" of increasing from 16 to 32 is low enough that architectures tend to go with 32."

https://www.hpl.hp.com/techreports/Compaq-DEC/WRL-87-5.pdf

only looked for 5 secs but i think it says that the extra indirection in the data path (which is critical) is worse for perf that just not having register windows and having more registers

https://www.cse.msu.edu/~enbody/postrisc/postrisc2.htm offhand note "Register windows add some complexity to register renaming. However as the following processor shows, it is not impossible to have out-of-order execution with register windows."

https://people.inf.ethz.ch/wirth/Articles/GoodIdeas_origFig.pdf "The reason why it must be considered as questionable, is that at any moment only the top registers, those in the most recently opened window, are under direct use and are therefore frequently accessed. The others are quasi dormant, yet consume a scare, expensive resource. Thus, register windows imply a poor utilization of the most expensive resource. "

my thoughts:

these don't seem to apply to us. For us 'register windows' are more about the calling convention and way of expressing register access than they are about cacheing caller-saved register storage. And we are a VM not a CPU, so our 'register access' is already at least somewhat abstracted.

wasting scarce registers: we are only having one 'window'
complexity to register renaming: we aren't doing register renaming, at least not in the CPU sense of out-of-order execution tricks
extra indirection in the data path: not much more than we already have in many cases, by (probably) having more virtual registers in our VM than exist on our target platform
overdesigned; a little, but we are just trying to deal with the necessity of saving/restoring registers in between functions, which must happen no matter what

---

speaking of register windows, isn't having a link register (instead of just saving that to the call stack) kinda like having two register windows for the PC? It's the same argument about, a large fraction of calls are from the deepest non-leaf function to a leaf function.

so if that's such a good idea, maybe we should have 2 register windows instead of just 1. And then stuff like the link register is handled transparently.

or, since ordinary code won't need to look at random values from the caller's random registers, but remembering how register windows have an input/locals area and an output area,

we could put this 'architectural stuff' (the previous/caller's PC, the previous/caller's window size, etc) into the output area, which then goes near the beginning of the windowed registers in the callee's perspective.

---

Considering the floating point registers, we have a use for the extra two bits in the number of register specification for entry. Forget about specifying the number of output registers per bank, keep that hard coded at 8

done

---

another option for the 8-bit encoding is to abandon regularity, realize that i only really care about the operands getting 3 different combos, not 4 and add:

all combinations of tos, smallstack, one other register for ld and cp
the 8 others that i have above (from cp and cpp with identical operands)
the one extra spot that we have above (from lp 1 1)
a bunch of other stuff (12 more left i think; reasoning when the last choice of each operand, times the 3 choices that we are keeping of the other one, times 2 for 2 banks)

we have one empty space above already so we get to add 13.

of course we'd like drop and over, both banks, to be in there (x4), so after that there's 8 left.

that way we can fit in more of the more common instructions and/or possibly also get a basis set for smallstack-only operations.