Bayle Shanks's website: proj-oot-ootAssemblyNotes14

the 'meta' bit could quote the operand specifier; kind of like Lisp's quotes or Kernel's "operatives"; a quoted opcode specifier could be like a Kernel operative or like a 3-Lisp reflective procedure.

---

monads ('programmable semicolons')?

---

timers (and monads?) could be via 'jmp'-like hooks of addr subspaces; have an 'exec1' hook too?

---

the idea of 'arraystruct' types; that is, the idea that an 'addr subspace' is an object instance whose elements are addressed by an integer index, and that such a thing could encompass both a fixed-length struct as well as a variable length array; i guess for this it would have to have a 'len' hook as well as a getter and a setter.

---

so some conceptions of 'types' in Oot Assembly:

an algebraic system eg List type constructor, eg List Int saturated type
object classes and instances
getters/setters that map flat memory (integer indices) into things like ListInts? (arraystructs?)
structural pattern mapping, eg a declarative mapping from flat memory into arraystructs

---

Can

each program define new addr subspace types, or
only ob0 files can define new addr subspace types (via syscall hooks)

It should be:

programs

because we want things like addr subspace getters and setters to do GC read/write barriers and refcounting, and we want boxing and unboxing to do stuff like lazy evaluation, and we want to implement this sort of thing at the Oot Assembly program level, not at the implementation level

---

ok after implementing a bit the following make it more complex than it needs to be:

the multiple encodings. This is annoying in itself (having multiple cases in a bunch of places, more places than i'd have thought), but then there are a few interactions that make it even worse:
- the imm2 format (when two operands are combined into one to get more bits); in SHORT mode, each operand has 2 bits of data, and in MEDIUM, it has 8 bits, so when combining the operands, you want to multiple the second one by 4 in SHORT mode but by 256 in MEDIUM mode. So the information about whether a decoded instruction came from SHORT or MEDIUM must be kept around at least until the imm2 squashing is done.
- the program counter. The program counter can sometimes be non-64-bit-aligned, but only when it is inside a SHORT chunk. This means that you can't tell before a JMP whether the destination will be aligned correctly, because you have to fetch the destination chunk first. Also, if you jump into the middle of a chunk, you have to integer divide to get to the beginning of the chunk, which is simple enough but a little annoying.
- the LONG encoding will probably be much worse.

But the SHORT/MEDIUM multiple encodings doesn't seem nearly as bad as:

the combination of signatures and stack addressing
- the order in which you resolve operands matters now
- you can't just tell some subroutine the effective address and let it do the reading/writing from/to that address later; you have to remember whether or not the addressing mode was stack addressing, b/c in that case you must mutate the effective address at some point
stack addressing is annoying by itself because, (a) it combines mutation with addressing, and (b) this means that the result of resolving an operand is more than just an effective address
signatures are annoying in themselves because it just one more source of multiple cases, and one more piece of data to lug around

---

the encoding complexity of SHORT isn't that bad. LONG is a little worse, because it means that the LONG decoder has to return how many guys were read. But neither of these are show-stoppers; their complexity is contained. Also, the actual arrangements in bits in SHORT and MEDIUM is a surprisingly small part of the overall complexity, so my insistence on 5 identical fields may not be wise. Still, the benefit of SHORT is marginal, and while the benefit of LONG may be greater, so is its cost. It may not be a bad idea just to throw these out and only have MEDIUM; they are not terribly costly, but they are unnecessary.

---

part of the complexity introduced by signatures is that now you have to resolve the opcode, and look up its signature, before you can do anything else.

This could be resolved by having four 'signature bits' somewhere in the instruction. For example, we could get rid of the 'meta' or the 'unbox' bits of fields 0 and 1 (context and opcode) and use them for that.

Alternately, we could subdivide the opcode space by signature. If we use the four LSBs instead of the four MSBs, then we can assign whichever cutoffs we want to the four classes for the 16 ops available in SHORT mode. This would not save us from having to resolve the opcode, but it would save us from having to look up the signature by opcode.

One of these two sounds is probably a good idea.

Does having to 'resolve the opcode' really matter? If the opcode is not immediate, then the instruction is a first-class function call. This would seem to make it less important to get the signature; either each first-class function might have its own signature, in which case we have to resolve it later anyways, or they all have the same signature (out in in? or inout inout inout?).

---

the complexity is not just from sigs x stack addressing, it's really sigs x stack addressing x multiple operands. Stack addressing is typically used with 0-operand ISAs. When there are multiple operands, each of which will be involved with mutation, at either or both of input and output, it becomes important to specify the order of the mutations, because some of them might depend on what happens before (eg if there are multiple operands being popped from the same stack, which one is popped first and which one is popped next?).

This complexity is partially just a matter of specification, though; it'll seem simpler once we have a simple rule. And i think the rule should be this:

  Operands are read in from right to left (field4, then field3, then field2). Operations are written out from left to right (field2, then field3, then field4).

This was arrived at by thinking of something which is inout inout inout, where all operands are stack-addressed to the same stack. Let's say you want to do CSWAP. You can do this by popping field4, then popping field3, then popping field2, then pushing field2, then (depending on what field2 was) either pushing field 4 and then pushing field 3, or pushing field 3 and pushing field4. Of course other orderings would work too, but this one feels right (in the sense that things like CSWAP would be easy to implement if all you had for temporary storage while implementing was a hidden stack).

---

so, it's really the inout inout inout case that's most complex. I need to figure out exactly how this should work. A complexity is that sometimes we want the instruction to actually get the effective address, rather than the value, upon input. Consider CAS (compare-and-swap), or, better, LLSC. LL/SC's LL needs to know not just the value read, but the actual memory location that is was loaded from, so that it can tell the underlying platform to 'lock' that location, watching it for writes from some other thread of control.

It makes sense that we'd generalize inout inout inout to take care of this, since inout inout inout is already weird, and since we're already on a power-of-two boundary with 4 signature classes (ANNOTATE doesn't count, it's an even specialer special case, one which is not so hard to implement because it's not available to custom instructions).

But we can't ONLY pass the effective addresses, because we also need to do the mutations in the case of stack addressing mode, and we need to do these mutations in order and interleaved with the reads and writes.

So, how i think it should work is:

before the instruction is evaluated, the operations are read from, just like with signature 'in in in'; first field4, then field3, then field2, with any mutations being applied just after they are read; so if they are all in stack addressing mode, then they are POP'd one by one.
the effective addresses of the operands are also captured at this time. This requires some further thought; clearly there is a difficulty if any inputs are in immediate addressing mode, but what about if they are stack addressing mode? If you take the effective address of something and then POP that thing just afterwards, is the new top-of-the-stack supposed to be the thing pointed to by the effective address? what if you then POP that? In fact, almost all addressing modes share this problem; we can't get obvious, meaningful pointers for immediate, or register, or stack.
one solution could be to say that we don't bother with effective addresses at this level; if you want to do something like LL/SC, then the value read from the resolved operand is itself a pointer, which is then used by LL/SC. Eg 'register mode' just means that the pointer to the memory location that will be 'locked' is found in a register.
what about cases where we want to do something like ROT3 on registers? Simple inout, without any effective address nonsense, would suffice there.
so maybe we can forget about effective addresses. CAS could then just be signature 'in in in', where both the target address and the address it is being swapped with are pointers which are the VALUES which the operands resolve to. CAS returns success or failure in the ERR register.

---

so, mb how it should work is:

before the instruction is evaluated, the operands are read from, just like with signature 'in in in'; first field4, then field3, then field2, with any mutations being applied just after they are read; so if they are all in stack addressing mode, then they are POP'd one by one.
there is no effective address nonsense
now the instruction is evaluated
after the instruction is evaluated, the operands are written out; first field2, then field3, then field4.

---

ok, the next thing we have to work out is first-class function signatures.

there's basically three options:

first-class functions can have one of the 4 custom instruction signatures
first-class functions have a fixed one of the 4 custom instruction signatures (out in in? inout inout inout?)
first-class functions just arbitrarily manipulate the stack (they could still have signatures, just not ones that we have to deal with at execution time)

The third option is the simplest but it doesn't really get to the idea of first-class functions.

The first option is powerful but more complicated; how would we get the signatures in there? Would they be stored as a 'header' at the top of the function? Would we use some opcode bits or some addr mode bits or context bits in the caller to specify it (if only 64 registers can be used for first-class functions then we can take 4 bits out of the opcode, but we don't really want to introduce a #-of-accessabile-registers non-uniformity, do we? we could also just say that all registers can be used, but the low-order bits determine the signature)? The second option is simpler but less powerful. With the second option, i guess we'd choose 'out in in', because it's most common, and because it's what i have in mind when i imagine using the first-class function facility to eg apply maps and folds to other functions. 'inout inout inout' is in a sense more powerful, though.

Storing the signatures with the function code doesn't make as much sense as storing it in the instruction, because surely the caller (and the verifier) will have to statically know what the function's signature is.

hmm so all of these sound reasonable. To break it down, the options seem to be:

first-class functions with signatures in the instruction (in opcode data, or in opcode addr mode, or in context, or in other addr modes)
first-class functions with fixed sig 'out in in'
first-class functions with fixed sig 'inout inout inout'

---

we currently have a (13, 'LOADI', ('out', 'imm2')), which is not one of the 4 permitted signatures

what to do?

imm2 wouldnt even need to be a signature if not for the SHORT/MEDIUM encoding difference.

but... we could just special-case LOADI. Like ANNOTATE, i don't think custom instructions will need to be like this. The purpose for LOADI is rather 'meta'.

---

let's reorder the instructions by signature class:

PRIMITIVE_OPCODES = [ (0, 'ANNOTATE', ()), (13, 'LOADI', ('out', 'imm2')), (1, 'JZ', ('in', 'imm2')), # conditional absolute (via index into module jump table) TODO: how can access the module jump table, except through this instruction? how does LOADMODULE setup the jump table, is there a vmloadjmp hook analogous to vmloadk? or is the jump table part of the constant table? i guess whatever it is, it's the same as the table used by CALL, except that the jump table is only from this module and includes unexported functions (so the CALL table for a module is a subset of the jump table for the same module). (2, 'JZREL', ('in', 'imm2')), # conditional relative jump (branch if 'in' is zero) (3, 'JNZREL', ('in', 'imm2')), # conditional relative jump (branch if 'in' is not zero) (4, 'CALL', ('out', 'in', 'in')), # uses imported function table (how? a vmhook syscall eg GETIMPORTFUNC?) (5, 'MOVZ', ('out', 'in', 'in')), # MOVZ 0 0 1 is NOP. Note that this is a memory-to-memory MOV. Conditional upon third operand; MOV takes place if it is zero. (6, 'ADD-U15', ('out', 'in', 'in')), (7, 'LEQ-U15', ('out', 'in', 'in')), (9, 'SUB-U15', ('out', 'in', 'in')), (10, 'SYSCALL', ('out', 'in', 'in')), # argument is syscall number (what is the other input argument? is it an input to the syscall?). out is where the return value will go. (12, 'CALL4', ('out', 'in', 'in')), # CALL the function held in register 4, passing two inputs and taking one output. This is provided so that SHORT mode can make first-class function calls. (14, 'LEA', ('out', 'in', 'in')), # input arguments are the addr mode (must not be 0) and data of an operand, output is the effective address (15, 'EVAL1', ('out', 'in', 'in')), # the parts of an instruction have been placed on the data stack, EVAL1 consumes them and executes the specified instruction (8, 'JZI', ('in', 'in', 'in')), # conditional indirect branch and also used as RET. The arguments are; condition, target, return value. 'Return value' only acts as a return value if the target is a CALL instruction; otherwise it is moved into the ERR register. (11, 'CSWAP', ('inout', 'inout', 'inout')), # if 'in' is non-zero, then read the second 'inout', then read the first 'inout', then write the second 'inout', then write the first 'inout'. If the inputs are both STACK mode on the same stack, this would have the effect of swapping them. Note that this is (a generalization of) a universal reversible logic gate (Fredkin gate). ]

any thoughts on this? well:

ANNOTATE and LOADI seem good as special cases
Of the rest, most of them are 'out in in', which makes sense, because most things have an output. The two sigs without an output, 'in in in' and 'in imm2', are valued only for their side-effects (control flow, mostly).
there's only one 'in in in' instruction in the list, but we'll need this for other things like CAS. This is a good candidate for merging with inout inout inout.
'in imm2' is needed for control flow because we'll want to immediately specify more than 256 labels.
CSWAP is weird, but we'll also need 'inout inout inout' for stuff like ROT3.
i bet we have more a larger proportion of 'in imm2' in this list than we'll need in the custom instructions. I don't see how that helps us though, if we are trying to come up with a single scheme to partition opcode space into signature classes.
about half the instructions are out in in
how essential is 'in imm2'? not very; JZ could just be J (an imm2 but no in), and JZREL could just be out in in (with relative branches of 16 instead of 256 (4 bits used; leaving 8 bits (4 data bits, 4 addr bits) for inline expansion)), or 'in in in'. Furthermore, 'imm2' is valued as a signature b/c of SHORT but we can't really J very much in SHORT anyways because 4 bits to index into the jump table isn't enough. So we could perhaps get rid of 'in imm2'.
so we could just have two classes: out in in, and inout inout inout (plus the two special cases ANNOTATE and LOADI). There would be more granular signatures for the verifier (which knows the difference between an 'in in in' like JZI and a true 'inout inout inout' like ROT3), but then the dispatcher wouldn't have to worry about this.
otoh cutting down the # of classes from 4 to 2 doesn't reduce complexity that much. The fact that there is more than 1 class (aside from the two special cases) is where the complexity is coming from. May as well distinguish between 'in in in' and 'inout inout inout' here so that the verifier doesn't have to have a signature table anyways, at least not for these.
the idea of relying on side effects, as LL/SC does, isn't too comforting either, from the perspective of a verifier. It would be nice if we could actually capture that 'capturing the effective address' stuff. Note that in CAS there are two operands to which stores (might) be made. This suggests that really, there are two bits of signature info attached to each operand: in, out, inout, or there-is-another-layer-of-indirection-because-we-need-the-effective-address-and-actually-we-might-write-to-the-guy-this-points-to (and in the case of LL/SC we even have one more, because LL will never write, although it does have some spooky additional side effect).
eg everything CAN be inout inout inout, but for the sake of the verifier, it's better if everything isn't.
if we're going to map signatures onto opcodes, we can stop worrying about the partitioning not fitting on the first 16 if we realize that there's really no need for the SHORT opcodes to be the same as the MEDIUM ones.
w/r/t capturing CAS's signature for the verifier, CAS and LLSC could just be SYSCALLs in this system. This makes sense b/c (a) they have to do with interaction with the outside world (other threads), (b) other threading stuff (eg FORK) are SYSCALLs, not ordinary instructions, (c) they will not be available on all platforms. If we stop worrying about those, we don't have to consider the 'effective address' stuff anymore, at least not here.

---

so summary of proposed changes over the previous few sections:

LOADI, like ANNOTATE, is a special case.
was thinking about forgetting about imm2 (other than LOADI), because it could be done by the instruction. But, it's not really that costly in terms of complexity, so i dunno.
SHORT mode can have a different opcode mapping and can have hardcoded signatures.
use the MSBs (or LSBs?) of the opcode data field to indicate signature. More than half will be 'out in in'. We also need 'inout inout inout', and either 'in in in' or 'in imm2' or both.
- first-class functions use the same scheme; the opcode data field is consulted BEFORE operand resolution
make CAS, LLSC, etc into SYSCALLs, and forget about the effective address stuff
before instruction evaluation, input operands are resolved in order from right to left (field4 first); afterwards, output operands are resolved in order from left to right (field2 first)

TODO do these

---

note that if we wanted to really simplify things, we could:

remove stack addressing (and hence the need for signatures in the dispatcher; we can just give everyone an effective address, except for dealing with immediates (fake effective addresses could be used for that, i guess), or we can just always give 3 inputs and take 3 outputs (more advanced implementations might use signatures to make this more efficient, but we dont have to in naive ones))
- note that traditional ISAs have 'signatures', it's just that they can all be implicit non-uniform special cases lumped in with 'addressing modes' and 'instruction formats' b/c they don't have custom instructions
remove SHORT and LONG encodings
remove imm2 (which is unneeded in the dispatcher if SHORT is gone)

---

ok here's how i did the opcode-to-signature stuff:

i was gonna do: SHORT_OPCODES_TO_PRIMITIVE_OPCODES = [primitive_mnemonic_to_opcode_hash[mnemonic] for mnemonic in ['ANNOTATE', 'JZ', 'JZREL', 'JNZREL', 'CALL', 'MOVZ', 'ADD-U15', 'LEQ-U15', 'JZI', 'SUB-U15', 'SYSCALL', 'CSWAP', 'CALL4', 'LOADI', 'LEA', 'EVAL1']]

but i realized that, since SHORT can't use JZ very well anyways, we can move it to opcode 16, and then everything fits for a reasonable partition of the 4 least-significant-bits of the opcode, and we have one extra space for a primitive SHORT opcode to boot. Also, i decided to partition the opcode space based on LSB rather than MSB because, when explaining to someone which parts of opcode space are already in use and which are free for custom instructions, it'll be easier to tell them that everything up to and including some number n is in use and everything above n is free.

def opcode_to_signature(opcode): """ The 4 least-significant-bits of the opcode (the opcode mod 16) determine the signature of the function. Opcodes 0 and 1 (ANNOTATE and LOADI) are special cases. """

    opcode_mod_16 == opcode & 15
    if opcode_mod_16 >= 6:
        return ('out', 'in', 'in')
    if opcode == 0:
        return ()
    if opcode == 1:
        return ('out')
    if opcode_mod_16 <= 3:
        return ('in', 'imm2')
    if opcode_mod_16 == 4:
        return ('in', 'in', 'in')
    if opcode_mod_16 == 5:
        return ('inout', 'inout', 'inout')

PRIMITIVE_OPCODES = [ (0, 'ANNOTATE', ()), (1, 'LOADI', ('out', 'imm2')), (2, 'JZREL', ('in', 'imm2')), # conditional relative jump (branch if 'in' is zero) (3, 'JNZREL', ('in', 'imm2')), # conditional relative jump (branch if 'in' is not zero) (4, 'JZI', ('in', 'in', 'in')), # conditional indirect branch and also used as RET. The arguments are; condition, target, return value. 'Return value' only acts as a return value if the target is a CALL instruction; otherwise it is moved into the ERR register. (5, 'CSWAP', ('inout', 'inout', 'inout')), # if 'in' is non-zero, then read the second 'inout', then read the first 'inout', then write the second 'inout', then write the first 'inout'. If the inputs are both STACK mode on the same stack, this would have the effect of swapping them. Note that this is (a generalization of) a universal reversible logic gate (Fredkin gate). (6, 'CALL', ('out', 'in', 'in')), # uses imported function table (how? a vmhook syscall eg GETIMPORTFUNC?) (7, 'MOVZ', ('out', 'in', 'in')), # MOVZ 0 0 1 is NOP. Note that this is a memory-to-memory MOV. Conditional upon third operand; MOV takes place if it is zero. (8, 'ADD-U15', ('out', 'in', 'in')), (9, 'LEQ-U15', ('out', 'in', 'in')), (10, 'SUB-U15', ('out', 'in', 'in')), (11, 'SYSCALL', ('out', 'in', 'in')), # argument is syscall number (what is the other input argument? is it an input to the syscall?). out is where the return value will go. (12, 'CALL4', ('out', 'in', 'in')), # CALL the function held in register 4, passing two inputs and taking one output. This is provided so that SHORT mode can make first-class function calls. (13, 'LEA', ('out', 'in', 'in')), # input arguments are the addr mode (must not be 0) and data of an operand, output is the effective address (14, 'EVAL1', ('out', 'in', 'in')), # the parts of an instruction have been placed on the data stack, EVAL1 consumes them and executes the specified instruction # what goes here? (16, 'JZ', ('in', 'imm2')), # conditional absolute (via index into module jump table) TODO: how can access the module jump table, except through this instruction? how does LOADMODULE setup the jump table, is there a vmloadjmp hook analogous to vmloadk? or is the jump table part of the constant table? i guess whatever it is, it's the same as the table used by CALL, except that the jump table is only from this module and includes unexported functions (so the CALL table for a module is a subset of the jump table for the same module). ]

Actually, wait, since if it's a first-class function, then the register referenced controls the signature, and since the stack register is a low number, we'd prefer for that to be signature 'out in in'. So:

PRIMITIVE_OPCODES = [ (0, 'ANNOTATE', ()), (1, 'LOADI', ('out', 'imm2')), (2, 'CALL', ('out', 'in', 'in')), # uses imported function table (how? a vmhook syscall eg GETIMPORTFUNC?) (3, 'MOVZ', ('out', 'in', 'in')), # MOVZ 0 0 1 is NOP. Note that this is a memory-to-memory MOV. Conditional upon third operand; MOV takes place if it is zero. (4, 'ADD-U15', ('out', 'in', 'in')), (5, 'LEQ-U15', ('out', 'in', 'in')), (6, 'SUB-U15', ('out', 'in', 'in')), (7, 'SYSCALL', ('out', 'in', 'in')), # argument is syscall number (what is the other input argument? is it an input to the syscall?). out is where the return value will go. (8, 'CALL4', ('out', 'in', 'in')), # CALL the function held in register 4, passing two inputs and taking one output. This is provided so that SHORT mode can make first-class function calls. (9, 'LEA', ('out', 'in', 'in')), # input arguments are the addr mode (must not be 0) and data of an operand, output is the effective address (10, 'EVAL1', ('out', 'in', 'in')), # the parts of an instruction have been placed on the data stack, EVAL1 consumes them and executes the specified instruction (11, 'JZREL', ('in', 'imm2')), # conditional relative jump (branch if 'in' is zero) (12, 'JNZREL', ('in', 'imm2')), # conditional relative jump (branch if 'in' is not zero) (13, 'JZ', ('in', 'imm2')), # conditional absolute (via index into module jump table) TODO: how can access the module jump table, except through this instruction? how does LOADMODULE setup the jump table, is there a vmloadjmp hook analogous to vmloadk? or is the jump table part of the constant table? i guess whatever it is, it's the same as the table used by CALL, except that the jump table is only from this module and includes unexported functions (so the CALL table for a module is a subset of the jump table for the same module). (14, 'JZI', ('in', 'in', 'in')), # conditional indirect branch and also used as RET. The arguments are; condition, target, return value. 'Return value' only acts as a return value if the target is a CALL instruction; otherwise it is moved into the ERR register. (15, 'CSWAP', ('inout', 'inout', 'inout')), # if 'in' is non-zero, then read the second 'inout', then read the first 'inout', then write the second 'inout', then write the first 'inout'. If the inputs are both STACK mode on the same stack, this would have the effect of swapping them. Note that this is (a generalization of) a universal reversible logic gate (Fredkin gate). ]

    opcode_mod_16 = opcode & 15
    if opcode == 0:
        return ()
    if opcode == 1:
        return ('out')
    if opcode_mod_16 <= 10:
        return ('out', 'in', 'in')
    if opcode_mod_16 <= 13:
        return ('in', 'imm2')
    if opcode_mod_16 == 14:
        return ('in', 'in', 'in')
    if opcode_mod_16 == 15:
        return ('inout', 'inout', 'inout')

eh, let's make inputs go from left to right and outputs from right to left. SUB dest x y == y - x is hard to remember (not that we couldn't change that without affecting which inputs were read first, but it seems to me that things will be easier to remember if the conceptual 'which input to look at first' for each instruction matches the order that they are read in), and SYSCALL dest arg SYSCALL_NUMBER is awkward (not that this helps, now its SYSCALL dest SYSCALL_NUMBER arg).

and if the outputs are read in 'reverse order', that's better, b/c the only instruction signature with multiple outputs is inout inout inout, so you only have to think about it when thinking about that signature, and that signature was the motivation for reversing things.

---

i wrote some notes in the implementation code that i'd like to copy here:

in JZREL: # todo: note in docs that this an JNZREL add 1 to positive inputs (b/c 0 is just an infinite loop, not useful)

DONE

---

are functions their own data type? if not, then what are they, memory locations? labels? if so, then how does the program create a reference to an arbitrary function? LOADK is just the constant table, right, is there a LOADFUN for functions (and a LOADJ for jumps?) or are all these just in the constant table? Are jumps the same as functions?

and for EVAL1:

so where do we get the 10 inputs from? the stack? maybe this should be a syscall...

and in JZ: # need representation of addr subspace (and offset into it) for this; i'm thinking a tuple addr_subspace, offset. But we need to turn the addr_subspace into a bytes() object so that we can execute it; how do we do that? Mb the program has to actually load it in like a file as raw, or as ob0?

---

regarding a program which writes instructions into an addr_subspace and then wants to execute them. The program should have to 'load in' the addr_subspace (either as raw, or as ob0), using one or more SYSCALLs. SYSCALLs should be involved b/c these SYSCALLs may not be available on every platform. Eg if you compile Oob to assembly for some embedded platform, then most likely you will not be able to compile dynamically-generated Oob to assembly on the embedded platform at runtime.

copied that to ootAssemblyThoughts.

---

i think EVAL1 should be a syscall, not an instruction, because it takes a bunch of inputs from the stack.

removed this from the implementation:

    (10, 'EVAL1', ('out', 'in', 'in')), # the parts of an instruction have been placed on the data stack, EVAL1 consumes them and executes the specified instruction

and

        if opcode == self.mnemonic_to_opcode['EVAL1']:
            raise NotImplementedError('Instruction %s not implemented yet' % opcode)
            #so where do we get the 10 inputs from? the stack? maybe this should be a syscall...
            #self.evaluate_first_class_function(first_class_function, context, inputs, PC_for_error_messages=PC_for_error_messages)

---

so i think, in my initial implementation, the Python representation of a pointer (that is, an address in OVM's memory) will be:

(addr_subspace_id, offset)

---

should there being any facilities for manipulating the constant table of a currently loaded module? No, for the same reason that executing dynamically-written OOB code is not always available; because the OOB program may have been compiled, and the compiler may have hardwired in the constant table accesses.

So, if you want to dynamically load a module and then read its constant table, you'll have to do so via another method. Perhaps dynamically loading it gives you a pointer to a MODULEMETA-type object from which you can reach objects representing the constant table of that module, or maybe you reach this by passing that MODULEMETA object to another SYSCALL.

---

should the 64k (or 32k) constant table also hold the function references, and should labels be just a special case of function references?

i can't think of any strong arguments either way. regarding merging them: pro: slightly simpler. anti: more space in each table. pro: keeping them separate means more max allocation for the runtime. anti: it's harder to typecheck to make sure you dont try to (ab)use an integer constant as a function reference. pro: you still could have a run-time error if you try to index into the function reference table with an index that exceeds the table size; and also we'll probably want to do typechecking anyways.

i guess i'm leaning towards merging them.

---

so to answer my questions above:

are functions their own data type? no, they are the same as memory locations (pointers), however, in the context of a module there is some metadata associated with them (their signature, requested stack depth, etc)

how does the program create a reference to an arbitrary function? well, if the function is in your module then it's in your constant table, and you use box(imm) addr mode. If the function is in another module then you get it via that module's MODULE object (perhaps via a SYSCALL). If you wrote your own function and dynamically compiled it, then the compiler gives you a MODULE object from which you can access the function.

LOADK is just the constant table, right, is there a LOADFUN for functions (and a LOADJ for jumps?) or are all these just in the constant table? Are jumps the same as functions? they are all in the constant table.

should there being any facilities for manipulating the constant table of a currently loaded module? No

should the 64k (or 32k) constant table also hold the function references, and should labels be just a special case of function references? yes, it holds everything, yes, labels and function refs are both addrs

should EVAL1 be a SYSCALL? yes

regarding a program which writes instructions into an addr_subspace and then wants to execute them. should it have to 'load in' the addr_subspace (either as raw, or as ob0), using one or more SYSCALLs? yes

---

trap: at some point i gotta go through all the code under run() in OVM and make it so that all the exceptions, instead of being Python exceptions, trap inside the VM; this is so that eg code dynamically executing other code in a sandbox can install a trap handler and get an error message, instead of the VM (or sub-VM) just crashing.

---

regarding context in SHORT, maybe things SHOULD be polymorphic by default; b/c very little code will actually be using U15, so this would prevent most code from using SHORT unless polymorphic ADD and SUB were available. So then how to pass along context? A special context register.

in fact, in general, instead of having more fields in instruction words, you can always take out some fields and have special registers instead. For example, instead of a prefix modifier LOCK, you could have a LOCK register. To exactly mimic instructions, these special regs would only accept one write per instruction and would be erased after each instruction, but you can imagine that in some cases there might be benefits to having them act more like normal (persistent) registers. Perhaps this should make me less afraid of ditching LONG.

---

TODO implement these

i'm concerned about the CALL and RET instructions (RET is a use case of JZI). What i'm concerned about it the idea that the CALL instruction has a 'dest' operand that RET is supposed to place its return value into. This necessitates loading and decoding the CALL instruction upon return, which seems like something that shouldn't be happening within the evaluation of one RET instruction. Some ideas:

just decode the CALL upon return, like i said
the CALL records its own dest operand on the callstack, then RET just has to look at that
the CALL pre-resolves its dest operand and records the effective address on the callstack
the RET sets a special hidden 'is_returning' flag in the VM, and also stores the return value in some hidden way in the VM. On the next cycle, because 'is_returning' is set, the VM checks if the current instruction is a CALL, and if so, it resolves the dest operand and writes the return value to there, and then goes to the next instruction (not CALLing again, of course). Problem: what if RET returns to a non-call instruction? I guess we just silently clear is_returning (we certainly don't want to leave it set, and then skip the next CALL instruction that we happen to encounter later). We could also trap, i guess. I guess the address of the RET instruction could be recorded in the VM's hidden state, too, so we could make it so that, to the user, it could appear like we are still in the RET when the trap occurs. note: if RETing to a non-call is an error, then we need to separate RET and JZI.
instead of a CALL with a dest, a CALL is followed by an instruction that MOVs from a special reserved register (or POPs from the hidden stack), and then the RET places the return argument there and then jumps to that instruction. This special 'return register' would be kind of like a 'link register' except that the contents of the 'return register' is supposed to be used on the very next instruction after it's set, so there's no problem with nested calls here.

these are all a little complicated and it seems to me that having CALL assign a return argument to a dest might be 'too cute', and mb we should get rid of it. Otoh, HLLs have it (eg consider a function call within an expression) so if we are to be a good interop language, maybe we need it?

i like the last two the best; the last one is simple, but it breaks one conceptual instruction into two (making it harder for transpilers), whereas the previous one is a little more hacky. I guess i could do either one and then the implementation could compile it to the other one? a problem with the first one and the second-to-last one is that they rely on the RET being to a CALL instruction, which is a problem if the CALL was made from within a custom instruction (in the case that custom instructions are interpreted, not inlined, by this implementation).

so that leaves:

the CALL records its own dest operand on the callstack, then RET just has to look at that
the CALL pre-resolves its dest operand and records the effective address on the callstack
instead of a CALL with a dest, a CALL is followed by an instruction that MOVs from a special reserved register (or POPs from the hidden stack), and then the RET places the return argument there and then jumps to that instruction. This special 'return register' would be kind of like a 'link register' except that the contents of the 'return register' is supposed to be used on the very next instruction after it's set, so there's no problem with nested calls here.

the last one is the simplest to translate to common ISA. We can then wrap the two-instruction CALL,MOV with a custom instruction CALLMOV that HLL transpilers can look for. Mb just call the wrapped one and call this something else.

y'know, in this case, assuming that we put the pushing of the current address to the callstack into the wrapper, then the primitive CALL is just the same as JZ ((conditional) jump to label).

we can also replace JZI, which currently has a RET argument, with JIF (if-then-indirect-branch, else-indirect-branch). I copied the old one that i am replacing to here:

    (14, 'JZI', ('in', 'in', 'in')),    # conditional indirect branch and also used as RET. The arguments are; condition, target, return value. 'Return value' only acts as a return value if the target is a CALL instruction; otherwise it is moved into the ERR register.

---

y'know, i'm glad i had JZI because it made me think about the 'in in in' signature, but really things will be slightly simpler for naive compilers and for branch-predicting interpreters if we don't have the unnecessarily conditional instructions in the primitive set (we can always have custom instructions on top of them that do that stuff).

here's a copy of the old stuff:

PRIMITIVE_OPCODES = [ (0, 'ANNOTATE', ()), (1, 'LOADI', ('out', 'imm2')), (2, 'MOVZ', ('out', 'in', 'in')), # MOVZ 0 0 1 is NOP. Note that this is a memory-to-memory MOV. Conditional upon third operand; MOV takes place if it is zero. (3, 'ADD-U16', ('out', 'in', 'in')), (4, 'LEQ-U16', ('out', 'in', 'in')), (5, 'SUB-U16', ('out', 'in', 'in')), (6, 'SYSCALL', ('out', 'in', 'in')), # argument is syscall number (what is the other input argument? is it an input to the syscall?). out is where the return value will go. (7, 'CALL3', ('out', 'in', 'in')), # CALL the function held in register 3, passing two inputs and taking one output. This is provided so that SHORT mode can make first-class function calls. (8, 'LEA', ('out', 'in', 'in')), # input arguments are the addr mode (must not be 0) and data of an operand, output is the effective address # empty spot here # empty spot here (11, 'JZREL', ('in', 'imm2')), # conditional relative jump (branch if 'in' is zero). Note: imm2 input is signed. 1 is added to non-negative inputs (b/c 0 is just an infinite loop, not useful), so range is (-32768:-1 unioned with 1:32768) (12, 'JNZREL', ('in', 'imm2')), # conditional relative jump (branch if 'in' is not zero) (13, 'JZ', ('in', 'imm2')), # conditional absolute (via index into module jump table) TODO: how can access the module jump table, except through this instruction? how does LOADMODULE setup the jump table, is there a vmloadjmp hook analogous to vmloadk? or is the jump table part of the constant table? i guess whatever it is, it's the same as the table used by CALL, except that the jump table is only from this module and includes unexported functions (so the CALL table for a module is a subset of the jump table for the same module). (14, 'JZI', ('in', 'in', 'in')), # conditional indirect branch. The arguments are; condition, target-if-condition-is-zero, target-if-condition-is-nonzero. (15, 'CSWAP', ('inout', 'inout', 'inout')), # if 'in' is non-zero, then read the second 'inout', then read the first 'inout', then write the second 'inout', then write the first 'inout'. If the inputs are both STACK mode on the same stack, this would have the effect of swapping them. Note that this is (a generalization of) a universal reversible logic gate (Fredkin gate). ]

        if opcode == self.mnemonic_to_opcode['MOVZ']:
            if inputs[0] == 0:
                return True, [inputs[1]]
            else:
                return False, None

        if opcode == self.mnemonic_to_opcode['JZ']:
            if inputs[0] == 0:
                # input[1], the immediate argument, is an index into the constant table. At this index we should find a pointer (todo the verifier should check that). Then we jump to that pointer.
                jump_target_pointer = self.do_syscall(self.syscall_mnemonic_to_opcode['LOADK'], inputs[0], PC_for_error_messages=PC_for_error_messages)
                self.jump_to(jump_target_pointer)
                return False, []
            else:
                return False, []

        if opcode == self.mnemonic_to_opcode['JZI']:
            if inputs[0] == 0:
                self.jump_to(inputs[1])
                return False, []
            else:
                self.jump_to(inputs[2])
                return False, []

        if opcode == self.mnemonic_to_opcode['CSWAP']:
            if inputs[0] == 0:
                return True, [inputs[0], inputs[2], inputs[1]]
            else:
                return True, [inputs[0], inputs[1], inputs[2]]
                # since this is signature inout inout inout, the inputs may have already been
                # consumed, so we should write outputs either way

---

i'm going back and forth on whether the primitive ADD, SUB, LEQ should be polymorphic or -U16-U16 (and PTR-U16). Thing is, if they're polymorphic, then SHORT code will incur a polymorphic dispatch lookup (using the CONTEXT register as an index into the polymorpic dispatch table) each time it hits one of these.

Also, i switched from U15 to U16 (because i think the memory should be an array of spaces that are guaranteed to at least hold U16s; on platforms that have only signed arithmetic, the implementation will just have to emulate U16 arithmetic, i guess). Which makes it more feasible for some code to actually use ADD-U16-U16 etc.

---

maybe make the entry point be index 0 in the constant table; the MODULE object can then be index 1. This way it matches with --raw execution mode in ovm.py (in which the default 'entry_point' is to start executing at address 0)

---

syscall mappings should be per-addr-space. Advantages:

this provides an easy way for syscalls called from within the syscall.ob0 file to trap to the actual implementation syscalls.
this also would allow a second layer of syscalls to be built on top of the syscall.ob0 ones (eg for executing stuff in an instrumented virtual sandbox), without running a whole VM on top of the VM; this could be useful in implementing the HLL in a layered fashion
this allows SYSCALL instructions to be mapped to syscall implementations staticly (at compile time), because the mapping is lexically, not dynamically, scoped.

note that capabilities should still be dynamic, because eg there might be a common library that is used by both trusted and untrusted code, and that contains some functions that call some syscalls that we don't want the untrusted code to have access to.

---

our idea to build RET out of JI, and just do a MOV afterwards if needed to return an argument, should also apply to SYSCALLS, i guess. So no output arg for SYSCALLs. later: i changed my miind about this for syscalls

BUT... this idea is challenged by our idea that in ordinary calls (not SYSCALLs?), the VM would save our registers for us and restore them upon return.

could we have a special 'registers-pointer register' in analogy with the stack pointer register, to allow user-mode code to save and restore regs in one go?

we certainly don't want the implementation's stack to increase in level while executing an Oob function call.

we have unused arguments in J and JI. Maybe we could use that to specify register saving/restoring behavior. Maybe 0 would mean 'leave the registers untouched', some other numbers would mean 'throw out the current registers, and restore a package of registers sitting on the (call? data?) stack', and some other numbers would mean 'save the current registers, and also restore a package of registers sitting on the (call? data?) stack'. But do we have enough numbers? If we stick to only immediates, then we don't, because we have 256 immediates available, but also 256 registers.

---

i guess a 'register register' is the cleanest solution. The other argument in the J and JI could be a location in memory to set the 'register register' to, where immediate 0 means 'dont change it'. Saving the old one to memory could be done manually. We could use the other 'reserved for implementation' register (7) for this.

i wonder if we should put any other context in there as well?

Call it REGREG.

---

a REGREG should also work for different functions using different numbers of registers; REGREG can point to ordinary RAW address subspaces, and the number of registers there is the addr subspace's LEN.

---

some commentary:

OVM (which runs a language called 'Oob Assembly' or mb 'Oot Assembly') is a virtual machine designed to be an IL, a portable substrate for HLL language implementations (including runtime services like garbage collection and scheduling of greenthreads), and a 'transpilable' 'interop language' (part of making it 'transpilable' means trying not to lose HLL intent (eg not compiling a HLL for loop into something that is non-trivial to decompile; Oob Assembly attempts to have code that looks somewhat like what the HLL code was doing); and another part is trying to minimize the need for temporary variables and inessential data movement, via a 3-operand format, CISC addressing modes, and many registers). It is assembly-like, because most HLLs present various structural constraints on control flow compared to assembly-like languages with 'goto'. It is also assembly-like for the purpose of making it easy to implement (compared to other languages with conveniences such as expressions).

Why not Nock? Because most HLLs, and most platforms, are Turing-machine imperative style, so we choose a matching paradigm for this language implementation language.

In an effort to be portable, it has a small core, called 'primitive' Oob Assembly, and other commands are built out of these primitives in a manner similar to macroinstructions. However, most platforms will be able to more efficiently support many of these 'macroinstructions' if they are implemented using platform-native primitives rather than if only the Oob primitives are implemented and then all other commands are built out of those. Therefore, rather than reducing the macroinstructions to Oob primitives before the native implementation sees them, we have a system of 'custom instructions', which allows each implementation to choose which custom instructions to implement natively; the others are then reduced to Oob primitives.

Other efforts to be portable involve choosing very few, very simple primitive data types (16-bit unsigned integers, and pointers), and abstracting away from the size of items in memory (and from the question of which items are so large that they are referenced by pointers rather than being directly held in registers), as well as from the layout of items within structs.

Custom instructions are one form of structuring and metaprogramming available in Oob Assembly. Other forms include syscalls and address subspaces. Whereas custom instructions act like macros that can be inlined, syscalls are similar to ordinary subroutine calls. Syscalls, rather than custom instructions, are used when (a) interaction with the external world is occurring, or (b) the functionality is not available on every implementation, or (c) the routine takes more than 3 input arguments or returns more than 3 output arguments, or (d) the implementation of the routine is long. In addition, the OVM calls user syscalls in the course of executing other code, which serve as OVM 'hooks'. Address subspaces are similar to user-defined data structures or a simple kind of OOP object. They present an interface similar to a flat array of values, or perhaps a struct; however, reads from and writes to these values are 'hooked' and call class-specific code.

Other features of Oot Assembly that go beyond typical assembly language include type checking (similar to the JVM's), polymorphic functions (both generic and ad-hoc), first-class functions which can be passed arguments and called in a single instruction, and capability security.

---

so how should it work if an instruction's 'context' (mb i'll call that 'type' or 'typecontext') resolves to type 0, DYN (DYNAMIC)?

this means that the instruction wants you to look at the actual values and determine their types. But unboxed values are not guaranteed to have type tags, so that is impossible in general in Oob.

but we could make it work for a subset of types, ones whose representation in Oob are specially constructed to be compatible with DYN.

for example:

maybe an instance of these types is a pointer to a function that, when called with the instruction, figures out how to apply this instruction
maybe an instance of these types is a pointer to a tuple (vtable, instance data), and then we index into the vtable with this instruction's opcode
maybe there is a default implementation of the instruction

in general, we could have a vmhook for this stuff. Eg when we see a DYN type context on a polymorphic instruction, we call some sort of "vmhandledyn" vmhook, passing this instruction's context and opcode and resolved operands, and then "vmhandledyn" figures out where to go from there.

---

it should be noted that the idea is not that an extensible OVM itself is easy to compile to an efficient runtime, but rather that an OVM, specialized with a specific set of custom instructions and syscalls and addr subspace classes, can be compiled to an efficient runtime. Note that after such specialization the 'metaprogrammability' of OVM has been used up.

---

arg, the same problem with not being able to RET a value bites us with first-class functions. That annoys me so much that i think i will change my mind and permit returning things.

To recall, the main issue was that if a function call were to be issued in the middle of a custom instruction, and if the implementation is interpreting, rather than inlining, custom instructions, then we can't get an address that points to the actual function call, because it is in the middle of a custom instruction, which, since it's not inlined, can't be addressed. In writing this, though, i realize that the same argument would prevent function calls in custom instructions at all, since of course function calls inside custom instructions should return to within the custom instruction.

Now, a non-inlining implementation could just use the implementation stack, instead of the Oob-visible callstack, to keep track of where it is in the custom instruction, but this is dirty, esp. because it means that deep call chains are vulnerable to stack overflow in the implementation stack even if they are ok in the Oob-visible stack (probably also issues with swapping stuff out/locking stuff in (green)threading). However, for custom instructions, this is bearable, since they won't be going too deep anyways, and we don't mind if we can't switch greenthreads right in the middle of these things.

For first-class functions, however, we might expect deep call chains. But, i suppose that implementations have ways around this; (a) they could compile the first class function call into two instructions, with the second instruction being a MOV that places the return argument where it needs to go; (b) they could have Pointers (maybe only when on the callstack) sometimes store a hidden field containing instructions about what to do with the return argument; (c) they could pass instructions about what to do with the return argument to the called function, and then add an epilog to it that does them. So, if using the implementation stack becomes a problem, there are ways out.

So i'm thinking about going ahead and adding a return value back to JI. But, a problem:

we really wanted two arguments for returns; one 'in' to send the return value, and one 'out' to accept it upon return. But, since we now have another argument for register swapping, we don't have room. Should we move the register swapping to a syscall?

or, the register swapping can just be via register-mode access to REGREG. Just like for swapping datastack or the callstack.

If we're really going to go hog-wild on the 'do everything by simply treating the mmapped regs as first-class', then may as well do control-flow by directly updating the PC, too, right? In which case we could get rid of J and JI (but not JZREL and JNZREL; they'd at least have to be conditional SKIPs, because we need a conditional in here; although i guess CSWAP could do in a pinch). But then how would SHORT mode do control flow? we could move the PC register to R3, but then we don't have room for any GPR. Or, we could move the PC register to R5, and provide ONLY register-mode access to it (which is fine).

hmm, that's intriguing. otoh, getting rid of J and JI isn't such a big win..

---

ok wait if we are using our interpreter's stack, then how do first-class functions know when to return control to the interpreter subroutine that called them? Do they have HALTs like syscalls? We can't just hook RET (even if we had a RET) because the first-class function might call a subfunction, which might call RET. Are first-class functions really just pointers, or are they more? Do we put a magic sentinel address in the callstack, and when it is jumped to, that causes a return of control?

i kinda like the sentinel address. We only have to check for it in JI (unless we also allow direct manipulation of PC).

but wait if we're willing to do that, why not just do the hidden outputspec on the callstack idea, so that we don't have to run a sub-runloop at all?

this is not so crazy; the interpreter could implement it by replacing the actual return address with a return address that goes to a short bit of code that does the desired MOV, and then returns.

jeez this is so much trouble just to avoid a POP instruction after the CALL. Is it really worth it? I dunno, maybe, because it'll make the source code more like HLL code, which probably make it more 'transpilable'. And certainly nicer to read.

---

oh i remember; the reason for reading input args from right to left was that on x86 the calling convention is to push arguments onto the stack from right to left, so that at the end, the first argument to be popped would be the leftmost.

---

ok i got the basic OVM implementation in Python up and running!

It has already has:

4 primitive addressing modes, 16 primitive instructions
custom instructions
custom syscalls
generic and ad-hoc polymorphism (but no typechecking yet)
first-class functions

todos:

u/l to gitlab (sync to github?)
- https://www.google.com/search?q=sync+gitlab+with+github&ie=utf-8&oe=utf-8
- https://about.gitlab.com/2016/05/10/feature-highlight-push-to-remote-repository/
docs
- overview
- tutorial
- reference
- design
tests
comment and prettify code
more todos in code
shorter name for 'addr subspace'

Maybe i should call the language 'OVM' instead of 'Oob' or 'OotB?'? Eg for the JVM, i think of it as just 'the JVM language', even though i suppose it might actually be called 'Java Bytecode'.

---

note: addr subspace 'backing' is another addrsubspace except in these special base cases

---

the other day when thinking about Ovm i was reminded of SEAM, Alice's VM: http://www.cs.tufts.edu/~nr/cs257/archive/neal-glew/mcrt/AliceML/multivm.pdf

SEAM tries to be language-agnostic by providing language-agnostic language services, and accepting an 'Execution Unit' module which actually interprets bytecode. SEAM itself provides language-agnostic constructs for:

registers
memory: "Seam’s abstract store implements a simple data graph: The basic node types are integers, chunks (raw byte data of arbitrary length) and blocks consisting of a number of ordered directed edges to other nodes. The data graph is an arbitrary directed graph with sharing and cycles"
frames (eg stack frames)
GC
scheduler

i feel like we probably have a few things to learn from SEAM. The reason we are not completely agnostic about how to execute bytecode, as SEAM is, is that our goal is to have a small portable core and to write these other language services in Ovm itself. Still, SEAM sounds like it has nice, general abstractions for frames and memory that may have somewhat similar goals to Ovm's.

---

to reiterate, i guess a lot of the complexity in the implementation comes from the need to use signatures during dispatch, which results from having a mutating address mode that depends on signature (stack). And some of it comes from address modes period.

Note that these are not necessary complexities; you could have everything have 'out in in' signature, and you could give up address modes (and have LOADI LOAD STORE PUSH POP instructions instead; we could give up LEA though).

We should consider that because it would serve our goal of portability, even though it may make transpilation harder (but it may make transpilation easier, b/c the language would be simpler; how would a transpiler deal with stack addr mode anyways?).

mb it would look like this:

19 instructions: annotate loadi loadk load store push pop cpy add-u16 sub-u16 leq-u16 add-ptr sub-ptr call3 syscall skipz skipnz jrel ji

if the PC is first-class, we can get rid of jrel and ji. loadk can be a syscall:

16 instructions, all can be signature 'out in in': annotate loadi load store push pop cpy add-u16 sub-u16 leq-u16 add-ptr sub-ptr call3 syscall skipz skipnz

Note that:

since the only signature is 'out in in', now we can never have custom instructions like ROT3 or CSWAP with signature 'inout inout inout'
the only primitive addr mode is Register
we probably want to make a 'zero register', so that we have some way to indicate discarded outputs
we can't do the 'displace registers in SHORT mode' trick, so SHORT mode must not have one of: zero register, data stack, call stack, hidden stack, GPR. Out of those, probably we should lose the zero register? Getting rid of the hidden stack is looking appealing too, though
by getting rid of 2 addr mode bits, we save 10 bits per instruction (which is enough for a 6th field, if we wanted one), and probably more importantly, 6 bits per SHORT instruction.
we could use 3 or 6 of those 6 bits in SHORT to give SHORT access to 8 or 16 registers, alleviating pressure to choose between zero register, data stack, call stack, hidden stack, GPR.
if we also get rid of the Meta and Unbox addr mode bits, we would save another 10 bits
the implementation can use the unused 'out' argument in skipz to deal with the inline expansion

right now i'm not sure this would be a good idea, but i'll continue to think about it. Perhaps i should implement it to see how much simpler it actually makes things. We still have 3-operand register mode, so we can still expression HLL z = x + y without temporaries.

Surprisingly, the thing that bothers me the most about this right now would be losing the possibility to have ROT3 and CSWAP custom instructions. I suppose that since the dispatcher doesn't care about signature anymore we could just declare that these are permitted (perhaps only in a certain opcode range).

Actually, upon another minute of thought, though, it also bothers me that we lose register indirect mode and have LOAD/STOREs. This sort of changes it from a memory-memory machine to a register machine, because now we introduce temporaries (the registers) in between statements like x.a = y.b; with register indirect these are "xa_addr = x + a_offset; yb_addr = y + b_offset; xa_addr<register_indirect> = yb_addr<register_indirect>;" but in contrast, LOAD/STORE this becomes "xa_addr = x + a_offset; yb_addr = y + b_offset; temp = LOAD yb_addr; STORE xa_addr = temp;"

So mb save 1 addr mode bit for register/register indirect.

Now we have 14 instructions:

annotate loadi push pop cpy add-u16 sub-u16 leq-u16 add-ptr sub-ptr call3 syscall skipz skipnz ?: empty slot ?: empty slot

(maybe a call3_inout in one of those empty slots?) (mb welcome back JREL and JI into those empty slots, then get rid of first-class PC?)

And we can have 8 registers in SHORT.

One issue here is that even if we add back JREL and JI, without J we can't in one step detect absolute jumps, because these would look like:

SYSCALL vmloadk idx_low idx_high (pushes result onto data stack, i guess?) tmp = POP JI tmp

otoh this would be a custom instruction so we could detect that.

yeah so we'd welcome back JREL and JI and have:

16 instructions: annotate loadi push pop cpy add-u16 sub-u16 leq-u16 add-ptr sub-ptr call3 syscall skipz skipnz jrel ji

Note that push, pop, jrel, call3 would not be strictly necessary, because all of these could be handled by sequences of simpler instructions (jrel would reply on being able to read the PC, even if not to write to it, and push,pop would rely on the stacks being ordinary pointers, or at least mmapped to appear that way; jrel could be implmented as one add-ptr-u16 instruction if the PC were both readable and writable and the offset were <=256). Mb put those at the end, to indicate this:

16 instructions: annotate loadi cpy add-u16 sub-u16 leq-u16 add-ptr sub-ptr syscall skipz skipnz ji jrel push pop call3

the 8 regs in SHORT would be:

zero data stack call stack hidden stack GPR (ERR) GPR (CALL3) PC REGREG

---

so now we can do complete function prologs/epilogs (including register swapping) in SHORT.

---

mb limit the length of any one function in a ob0 file to 32k, so that the signed 16-bit relative jumps can reach anywhere in the function

---

note that if you read the PC and then directly manipulate it via ADD-PTR-U16, then you could make assumptions about the spacing of code which is then invalidated by inline expansion. Which means that the only use for reading the PC is if you take the argument as-is (you are going to jump back to next instruction after the current one), or if there will be no inline expansion. I don't see how to get around this, though, b/c you have to place the PC on the callstack for function calling, unless we want to introduce a new type of pointer, 'opaque pointer'.

hmm that could be useful. Consider introducing 'opaque pointer'. Or mb 'code pointer', which would also be opaque.

ok, i like this. If relative jumps are always through the 'jrel' instruction, and if code pointers are opaque, then i believe that you can guarantee that inlining is safe as long as you (a) adjust the offsets in jrels, (b) adjust the code pointers given by the loadk syscall.

---

(which also means that some implementations could choose to implement a custom VMLOADKI instruction by actually embedding the constant in the code and then skipping over it)

---

do we really need push, pop in the primitives, given that they can be emulated by custom instructions:

push _ stack value: sub-ptr $stack 1 cpy *$stack $value ??but what register would hold the input parameters, $stack and $value?

well, yes, we do need them, because we need them to access the hidden stack within custom instruction implementations so as not to clobber a GPR.

also, it's good to have them, because they have (well-defined) side-effects beyond the 'out in in' signature (they affect the stack being targeted). In fact, their stack argument is actually inout, so push is sig 'inout in', and pop is sig 'out inout'.

incidentally, can custom instructions survive with push,pop but without dup,swap? Yes, but you'd have to temporarily use the data stack. That's fine, you won't clobber anything. Note that a free stack, such as the datastack, can serve as a single temporary register by using push datastack 0; ...*$datastack...pop datastack. I suppose that this does mean that custom instruction might have separate stack depth requirements for the hidden stack, the data stack, and the callstack.

dup_not_ds_hs stack: # note: in the following, 'stack' cannot be datastack or hiddenstack

because this is a custom instruction, that 'stack' argument will be on the hidden stack at the beginning of the custom instruction) push $datastack 0 #create a single 'temporary register' pop *$datastack *$hiddenstack #the stack that is being popped from is *$hiddenstack, that is, 'stack', and it's being written to datastack's TOS. So now 'stack' is down one item, and that item is TOS on datastack. push *$hiddenstack *$datastack #the stack that is being pushed to is *$hiddenstack, that is, 'stack', and we're pushing back the same item that was just popped. So now 'stack' is the same as when it started, but its top item is also TOS on datastack. push *$hiddenstack *$datastack #now we've dupped 'stack' pop 0 $datastack #free our 'temporary register' pop 0 $hiddenstack #consume the input 'stack' which was on hiddenstack

swap_not_ds_hs stack: # note: in the following, 'stack' cannot be datastack or hiddenstack push $datastack 0 pop *$datastack *$hiddenstack #we start out the same as with DUP, so here 'stack' is down one item, and that item is TOS on datastack. push $datastack 0 cpy *$datastack *$hiddenstack #now TOS on datastack is 'stack', and second on datastack is the (previous) first element on 'stack', and 'stack' is currently down one item pop *$hiddenstack *$datastack #now TOS on datastack is 'stack', and second on datastack is the (previous) first element on 'stack', and TOS on hiddenstack is the (previous) second element on 'stack', and 'stack' is currently down two items push $hiddenstack 0 pop *$hiddenstack $datastack #now TOS on datastack is the (previous) first element on 'stack', and TOS on hiddenstack is 'stack', and second on hiddenstack is the (previous) second element on 'stack', and 'stack' is currently down two items push *$hiddenstack *$datastack #now TOS on datastack is the (previous) first element on 'stack', and TOS on hiddenstack is 'stack', and second on hiddenstack is the (previous) second element on 'stack', and 'stack' is currently down one item, and TOS on 'stack' is what used to be the first element on 'stack' pop *$datastack $hiddenstack #now TOS on datastack is 'stack', and TOS on hiddenstack is the (previous) second element on 'stack', and 'stack' is currently down one item, and TOS on 'stack' is what used to be the first element on 'stack' push *$datastack *$hiddenstack #now TOS on datastack is 'stack', and TOS on hiddenstack is the (previous) second element on 'stack', and TOS on 'stack' is what used to be the second element on 'stack', and second on 'stack' is what used to be the first element on 'stack' pop 0 $datastack pop 0 $hiddenstack

(are these all the primitives for stack ops? i don't think it really matters, because we know that we can also 'save' a GPR to datastack, allowing us to implement another stack op assuming not only these, but also using another register)

---

ok, but the above can't be applied to either datastack or hiddenstack themselves. But we can do the register saving trick, as noted above:

dup_datastack: push $hiddenstack *$datastack push $datastack *$hiddenstack pop 0 $hiddenstack

and for dup_hiddenstack we use the register-saving trick:

swap_datastack: push $hiddenstack $GPR pop $GPR $datastack push $hiddenstack 0 pop *$hiddenstack $datastack push $datastack $GPR push $datastack *$hiddenstack pop 0 $hiddenstack pop $GPR $hiddenstack

and similarly for swap_hiddenstack

then to implement a general dup (and similarly for generic swap):

test whether the stack being operated on is the datastack, or the hiddenstack, or something else. Then branch to dup_not_ds_hs, dup_datastack, or dup_hiddenstack.

---

the previous DUP and SWAP implementations didn't use callstack, so i think it's safe to require that only things of type CODE_PTR ever get pushed to CALLSTACK.

---

and do we really need call3?

probably not:

call3 o i1 i2: swap $hiddenstack push $datastack *$hiddenstack pop 0 *$hiddenstack push $datastack *$hiddenstack pop 0 *$hiddenstack push $callstack $PC_REGISTER ji $CALL3_REGISTER push $hiddenstack 0 pop *$hiddenstack $datastack

---

so let's leave out call3 for now.

so 15 instructions: annotate loadi cpy add-u16 sub-u16 leq-u16 add-ptr sub-ptr syscall skipz skipnz ji jrel push pop

empty slot; reserved for implementation?

---

so if we have PTR, CODE_PTR, and STACK_PTR, how do we represent that?

ADD-ptr-u16 can only operate on PTR. ji can only operate on CODE_PTR. pop/push can operate on either PTR or STACK_PTR (but not CODE_PTR).

In terms of union types, pop/push operates on a union of STACK_PTR and PTR. But you could also say that there is a superclass, PTR_STACKPTR_SUPERCLASS, and that pop/push operates on that, and that STACK_PTR and PTR are subclasses of it. Since pop/push is still a primitive, and can't have additional adhoc polymorphism added to it, it doesn't have to be a polymorphic opcode though.

(there could also be an ever bigger superclass encompassing CODE_PTR too; but i guess that actually there's no need for user code to even see CODE_PTR as a pointer, since it can't directly access or manipulate offsets ('local_address' in the implementation) anymore)

---

limit the length of an ob0 file to 32k, so that the signed 16-bit relative jumps can reach anywhere in the file?

---

do we need any more extra state besides a CARRY (which we're going to put into ERR) for arithmetic? Nope, apparently the 6502 had Carry, Zero, Negative, Overflow bits. Zero and Negative could be tested after-the-fact and so presumably are just for remembering the result for later BNZ-type instructions, which we don't need. And according to this excellent article, Carry is only for unsigned arithmetic, and Overflow only for signed arithmetic, it's just that the same silicon (and instructions, i guess?) did both (assuming a 2s-complement signed representation), so if you separate signed and unsigned arithmetic instructions like we do, i think you should only need one or the other for any given instruction:

http://www.righto.com/2012/12/the-6502-overflow-flag-explained.html?m=1

---

so, the 'nested runloop's in custom instructions and in first-class function calls can be gotten rid of by inlining compilers.

leaving custom syscalls. Could we do custom syscalls just like function calls (using the ordinary callstack), except that we change some hidden state in the interpreter to reflect the new syscalls mapping?

then some simple-ish inlining interpreters could have no nested runloops at all.

added to to do list.

---

hmm.. also if code segments are constrained to 32k, then a simple interpreter could just expand them to 64, leaving most of the code empty, but then placing MOVs after first-class-function calls to handle the returns in a simple manner. That's probably less efficient that the whole 'epilog' thing i have now though.

still, another reason to insist that ob0 files be < 32k.

ok i'll change the limit from 64k to 32k.

---

i'm going back and forth a little on the -2addr mode simplification proposal. Issues:

not having an immediate addr mode does introduce temporaries -- they can be eliminated by constant propagation, but still
without a stack addr mode, we cant mimic stack ISA-style code without interspersing lots of pushes and pops

otoh:

except for when manipulating the stack itself, if you have register indirect addr mode on stack pointers but not stack addressing mode, the only thing you lose is that you need to insert 'push stack 0' to 'allocate' new items on the stack before writing to them in the next instruction with register indirect mode, rather than combining these two into one instruction, and then use 'pop stack' to 'deallocate' them rather than this being implicit in a stack-read operand

having separate push/pops is nice b/c now there can only be one stack-length-change annotation per instruction, and you can scan for the need for these by opcode. Having combined push/pops is nice because you don't have to assign '0' to a new stack location, meaning that that location may change type when the actual item is pushed, complicating typing. i guess the latter is slightly more valuable than the former?

i think we can get some of the benefits of stack addr mode while reducing complexity somewhat by:

eliminating signature-dependent stack addr mode -- field2 stack is pushed, field3 and field4 stack is popped
eliminating the +4 thingee
using the 'simplified' instructions (maybe without PUSH and POP, if we have stack addressing), which gives us 'in out out' signature consistency in the primitives.

this doesn't get us the extra data bit in SHORT mode. So SHORT mode would only have one GPR (ERR), and can't access PC or REGREG (although if we eliminate PUSH and POP, we could make these instructions rather than special registers; in fact, we could add an instruction SPECIAL to access both of these, or maybe just make that a SYSCALL).

some of the TODOs under the -2addrmode proposal in the code are still relevant.

---

i guess each addr subspace class corresponds to a ptr type? eg if you make an addr subspace class 'unicode_string' then that implies a ptr type PTR_UNICODE_STRING, which is the type of pointers pointing into addr subspaces of class 'unicode_string', right?

---

should access to the callstack be via a syscall?

---

so, we have room for movfrom and movto now in the 16 primitive operations, which could allow SHORT mode to access more than 4 registers. do we have an imm2 format (to allow 2 operands to be combined while using the addr mode bits of one of them as data in the combination)? if not, then SHORT mode is limited to 16 regs; if so, to 64 regs. Also, movto would be a special case violating our 'everything but LOADI is signature out,in,in' policy proposal.

what would movfrom and movto do when in MEDIUM mode and when the from/to 'big register' is a number more than 255? Perhaps these are absolute memory addresses? Or per-module-global memory addresses?

so how about we only add movfrom, not movto. this severely restricts SHORT mode but also makes it slightly easier to analyze, as out,in,in is maintained throughout, and only the stacks and GPR 0 can be (directly, that is, without thinking about function calls or syscalls) mutated in SHORT mode. Now we can preserve 'everything but LOADI is signature out,in,in'. This also suggests that, from the perspective of SHORT mode, registers 4-15 act like a constant table which can be loaded from but not written to.

Which suggests that in MEDIUM mode, movfrom > 255 would access the module constant table, possibly eliminating the need for a loadk addr mode.

but what about the motivation for a loadk addr mode that you want to use constant literals as input operands without needing a temporary variable to store them in? But even if you had a loadk addr mode in MEDIUM mode, only 256 of these constants can be directly accessed in this fashion, which probably isn't enough for a whole module. Perhaps these are local to each function, then? But in that case, perhaps we could get the same effect by just preinitializing various high registers to the constants for that function upon function entry?

one problem with preinitialization as opposed to loadk with temporaries is that it may make constant propagation in downstream transpilers harder, instead of easier as intended, because the downstream compiler would have to understand that the preinitialized registers are constants in order to make use of this, as opposed to just understanding that vmloadk returns a constant. But i think this is tractable.

We could also make 'movfroms inputs be a REGISTER number, not an actual value, in which case we could have 'computed register numbers', allowing a complicated way to access beyond register 15, and also allowing jump tables or arrays in the registers. I'm a little leery of this but it's worth thinking about.

thinking about movfrom in MEDIUM mode accessing a per-module-global constant table. If these are really constants then (when the index being accessed is >= 256), we could call vmloadk to get it (below 256, it's just a register). Mb we should build in a concept of 'currentmodule' into the base system, and then have the vmloadk call be able to ask which module it is being called from. If we do this, then we could build the concept of a constant table into the base system by having constant addr subspaces, and having a base vmloadk provided which maintains a lookup table of module# -> constantaddrsubspace.

if we are building in a concept of 'currentmodule', then getmodule could be a method on CODE objects (b/c it's lexical), and getcurrentmodule could be a syscall that just returns that (separate from the get_pc syscalss that returns a pointer). The currentmodule could be passed into the vmloadk (or you could pass in 0 to indicate the currentmodule, and then it would call getcurrentmodule for you).

is the addition of a 'getcurrentmodule' syscall too opinionated?

---

Is the single GPR technically necessary for custom instructions to be Turing-complete without touching memory? No, it's not necessary. The custom instructions already have access to two general-purpose stacks, the datastack and the hiddenstack. A two-stack pushdown automata is Turing-complete already [1].

---