proj-oot-ootAssemblyNotes6

---

i like the idea of custom instructions.

With custom instructions, we could have only a very few core instructions that the porter must implement; the rest could then themselves be written in Oot Assembly in terms of these core instructions (or in terms of each other; they could be defined in a fixed sequence). If the porter is up to it, they could directly implement some of the other instructions too; it would be configurable which of the pure Oot implementations would be used. This also suggests to the porter which things to provide platform-level implementations of first when they want to optimize later; first, provide a platform-optimized implementation of some or all of the non-core instructions.
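
For concreteness, here's a minimal Python sketch of the dispatch this implies (all names here are invented, not part of any spec): only the core opcodes need a native implementation from the porter; every other opcode falls back to a definition written in Oot Assembly itself, unless the porter registers a platform-optimized override.

    def op_add(args, mem):                      # a core op the porter must supply
        mem[args[0]] = mem[args[1]] + mem[args[2]]

    CORE_OPS = {'ADD': op_add}
    NATIVE_OVERRIDES = {}                        # porter may fill this in later
    OA_DEFINED_OPS = {                           # non-core ops, defined in terms of core ops
        'DOUBLE': [('ADD', lambda a: (a[0], a[1], a[1]))],
    }

    def execute(op, args, mem):
        if op in CORE_OPS:
            CORE_OPS[op](args, mem)
        elif op in NATIVE_OVERRIDES:             # optional platform-optimized version
            NATIVE_OVERRIDES[op](args, mem)
        else:                                    # interpret the pure-OA definition
            for (sub_op, remap) in OA_DEFINED_OPS[op]:
                execute(sub_op, remap(args), mem)

    mem = {0: 0, 1: 21}
    execute('DOUBLE', (0, 1), mem)               # expands to ADD 0, 1, 1
    assert mem[0] == 42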

That makes us into sort of the language of a macro assembler. Except that it's not that we necessarily have a compiler that must sub in the macros; we can just directly interpret them, so they don't add much to implementation complexity (and hence porting). Although mb we want (compile-time) macros also :)

It's also reminiscent of Forth; the custom opcodes are like Forth 'words' (except that we are register-based instead of stack-based; and except that the implementation is simplified by having room for only a set number of numerical opcodes, rather than actually using words like in Forth).

If we have custom instructions, we appear to have at least 2 and maybe 3 kinds of function calling:

---

n-ary array internal representation:

---

i'm sold on custom instructions. but this makes me want to go back to one of my other fav ideas, custom addr modes.

---

an example awesome new addr mode:

COW (copy-on-write) behind-the-scenes processing could be done in the addressing mode; eg when you READ an immutable datastructure with the COW addressing mode, behind the scenes the processor uses a structure that shares subtrees via pointers with its predecessors; and when you WRITE an immutable datastructure with the COW addressing mode, behind the scenes the processor only copies the new stuff, and creates a structure that shares subtrees via pointers with its predecessors
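
A minimal sketch of the copy-on-write idea itself (not the addressing-mode machinery, which is hypothetical): a write copies only the nodes along the written path and shares every untouched subtree with the predecessor structure.

    def cow_write(node, path, value):
        """Return a new tree that differs from `node` only at `path`."""
        if not path:
            return value
        key, rest = path[0], path[1:]
        new_node = dict(node)              # shallow copy: siblings stay shared
        new_node[key] = cow_write(node[key], rest, value)
        return new_node

    old = {'left': {'x': 1}, 'right': {'y': 2}}
    new = cow_write(old, ['left', 'x'], 99)
    assert new['right'] is old['right']    # untouched subtree is shared, not copied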

another one:

GC-aware addressing: hits whatever read barriers and write barriers are needed

another one:

subscription-aware addressing: upon a change to a watched address, notifies subscribers via an event

(of course we'll probably just include those things in the core machine, but the fact that this sort of thing exists makes one suspect that custom addr modes could be useful)

(also, that isn't an argument for CUSTOM addr modes per se, it's just an argument for more addr modes than the basic 4)

a reason to prefer CUSTOM addr modes: like custom instructions, it allows easier porting while preserving readability of Oot Assembly. Imagine if, instead of putting COW into the core machine, we had a separate custom COW addressing mode. This would mean that Oot Assembly code could still be written without COW stuff cluttering up the code (readability), but the implementer/porter (the person making a new Oot Assembly interpreter or compiler on a new platform) only has to implement the 'core' addressing modes; the COW stuff would already be implemented in pure Oot Assembly. Not sure if we'll do that (probably not), but the possibility makes the flexibility provided by custom addr modes more attractive.

also, in some cases adding more bits to the addressing modes is more efficient than adding more bits to operand-specific flags, because generally flags aren't orthogonal to addressing mode; for example, i proposed a boxed/unboxed flag, but what does it mean to have a 'boxed' immediate constant in an input operand? boxing implies multiple storage locations (ie bytes or words), with the metadata in one location and the data in another (unless you go for something like Hoon's idea of explicitly encoding data structures into single large integers, which i don't).
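
Purely illustrative: a 'boxed' value needs (at least) two storage slots, one for the type tag/metadata and one for the payload, whereas an unboxed value is just the payload itself; an immediate constant encoded inside the instruction has nowhere to put that metadata slot.

    unboxed = 5                           # the bare value: one storage location
    boxed = {'tag': 'int16', 'value': 5}  # metadata in one slot, payload in another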

---

ok so i'm sold on custom addressing modes. let's eliminate the operand flags and use them to have more addressing modes, most of which are custom. The idea being that flags that we really needed (eg boxing/unboxing) can just be cross-producted with the addressing modes, and this is slightly more efficient too because some combinations don't make sense (eg 'boxed' flag on an input operand in immediate addressing mode). This seems a little silly because it's the opposite of what i was doing with instruction polymorphism, where i have explicit types even though it would be more efficient to have the opcodes be the direct product of instruction and type. But there's only 16 of them so maybe it isn't so bad.

since we're going all custom, let's make the remaining two opcode flags custom as well.

also, although i've already pretty much picked out the types, let's make them be 'custom' as well, just in the sense that core Oot Assembly itself doesn't define or use most of them.

---

so, new proposal:

custom addr modes, custom opcodes, custom types, custom opcode flags

to allow efficient compilation, these things are all static (compile-time) and lexically scoped; eg you can't change the custom stuff at runtime

?: kinda like Forth with (infinite) registers, only one single stack (actually mb not; see below), and more metaprogramming (b/c you can metaprogram the addressing mode)

why only one stack: because it's nice to have one side of memory have the stack, and the other side be the heap. Eg if the stack starts at 0 and grows upwards and the heap starts at the high end of memory and grows downwards. You can use the stack as both a call stack and a data stack by pushing data onto it during each call. One issue with this is that when you return data, you have to pop the return address off of the stack; meaning that if you already have all the return data computed and pushed onto the stack, you have to move it all around before returning, which is a waste, especially if some data is just being 'returned' over multiple call levels (stack frames), whereas other data is being manipulated during the returns (so it's not a tailcall).
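
A toy sketch of that shuffle (layout invented for illustration): with one combined call/data stack, the callee's computed return values sit on top of the buried return address, so returning means popping them, digging out the return address, and re-pushing them.

    stack = ['caller locals', 'return-address']     # state on entry to the callee

    stack.append(10)                                # callee computes its two
    stack.append(20)                                # return values on the stack

    results = [stack.pop(), stack.pop()]            # pop everything above the return address
    return_address = stack.pop()
    stack.extend(reversed(results))                 # re-push the results for the caller
    assert stack == ['caller locals', 10, 20]
    assert return_address == 'return-address'       # control would now jump here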

otoh for readability it might be good to have two virtual stacks, and the constraint that the data stack starts empty (except for params) upon each fn call, and must be empty (except for return args) upon each function return. This allows the implementation to deal with the complexity of two stacks if desired.

todo: http://www.forth.org/svfig/Len/softstak.htm

---

so this proposal is:

4 x (4 type bits + 4 addr mode bits + 8 operand-or-opcode bits) = 64 bits. In the first field (the opcode field), the '4 type bits' aren't type bits: the first two are 'format bits', the last two are 'custom flag bits'.
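
A hedged decode sketch of this layout (field order and bit order are my assumptions, not settled): four 16-bit fields, each split into 4 type bits, 4 addressing-mode bits, and 8 operand-or-opcode bits, with the opcode field's 'type' nibble reinterpreted as 2 format bits + 2 custom flag bits.

    def decode(instr64):
        """Split a 64-bit instruction into its four 16-bit fields."""
        fields = []
        for i in range(4):
            field = (instr64 >> (48 - 16 * i)) & 0xFFFF      # most-significant field first
            type_bits = (field >> 12) & 0xF
            addr_mode = (field >> 8) & 0xF
            payload   = field & 0xFF                         # operand or opcode byte
            fields.append((type_bits, addr_mode, payload))
        opcode_field, operand_fields = fields[0], fields[1:]
        fmt          = (opcode_field[0] >> 2) & 0x3          # first two 'type' bits: format
        custom_flags = opcode_field[0] & 0x3                 # last two: custom flags
        return fmt, custom_flags, opcode_field[1], opcode_field[2], operand_fields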

The format bits (which are the first two bits of the instruction) are as follows:

Note that most opcodes are only compatible with one value of the format bits.

core addressing modes: immediate, constant table, direct, indirect

there are 16 addressing modes in total, so this leaves 12 custom addressing modes. Some suggestions:

memory map and memory-mapped special locations:
0: 'zero' register (send output to this register when storing the output is unnecessary; always reads zero)
1: PC (program counter) (maybe combine this with 0, since writing to the PC is not allowed (for clarity/making it easy to see where control-flow-altering instructions are))
2: status register
3: stack0 push/pop
4: stack1 push/pop
5: ?
6: ?
7: ?
0x0008-0x000B: direct read/write access to the first 4 stack0 locations (without pushing or popping)
0x000C-0x000F: direct read/write access to the first 4 stack1 locations (without pushing or popping)
0x0010-0x00FF: memory (that can be addressed by the opcode field)
0x0100-0xFEFF: main memory (that cannot be addressed by the opcode field)
0xFF00-0xFFFF: see following parenthetical phrase (x - 0xFF00 is the offset to the beginning of the phrase)

note: memory locations might be able to hold values larger than 16 bits.

note: this is not to say that the virtual machine can only access 64k of memory. A pointer can be placed in any of these memory locations, and that pointer can point to somewhere else. Likewise, the PC can point to memory locations outside of this map. These alien memory locations are always considered to be 'higher' than this memory (eg if compared, the alien memory address is 'greater than' 0xFFFF) but don't have to actually be that way in the target machine. The alien memory addresses can be incremented as usual. You might say that the 64k of provided 'memory' is really 64k 'registers'. These 64k locations are the only ones that can be accessed in direct addressing mode (?maybe more if the 24-bit operands in forms 01 and 10 are used? but mb the Oot Assembly machine model only guarantees the presence of 64k); however other locations can be accessed via indirect addressing mode. In fact, only the first 255 of these are accessible in direct addressing mode by ordinary, 3-operand-format instructions.

note: there are two stacks, intended to be used as data stack (stack0) and call stack (stack1) but the result is undefined if a program attempts to access beyond the end of the stack; this means that these can be implemented as a single stack

core types:
pointer
int16
other/custom static type (has already been statically typechecked)
dynamic (boxed/wrapped, tagged type; runtime typechecking requested)

there are 16 types in total, so this leaves 12. Suggestion:

boolean/single bit
unsigned int16
byte
float64
arbitrarily large/precise rational
unicode string
array
cons cell/linked list/binary tree
graph/hash/dict (opaque, as opposed to AST)
function/callable, thunk/promise/future/lazy value
AST/expression/block/logic
greenthread/process/channel(/actor?)

(note that some implementations might not support all of these types; eg an embedded implementation might not support floating point, arbitrarily large/precise rational, unicode)

(note that when i say int is 16 bits, the implementation is free to make it MORE than 16 bits; int here is just AT LEAST 16 bits)

suggested custom flags:

in the boxing:

some special(ish) instructions:

lexically-scoped-compile-time-only (although they can easily be interpreted)-metaprogrammy-instructions:

---

what's the max number of identifiers permitted by LLVM?

http://clang.llvm.org/doxygen/Serialization_2Module_8h_source.html

which is a C++ file, contains:

" / \brief The number of identifiers in this AST file. unsigned LocalNumIdentifiers?; "

https://en.wikipedia.org/wiki/C_data_types#Basic_types says 'unsigned' means 'unsigned int' which means 'at least 16 bits'. http://en.cppreference.com/w/cpp/language/types says that 'int' (and therefore 'unsigned', i think) has at least 16 bits, but usually 32 bits.

so, having 16-bit register identifiers in Oot doesn't seem too constraining/unusual, as Clang apparently assumes that the number of identifiers can fit in an unsigned int, which is only guaranteed to be 16 bits.

---

pretty cool, but i'm still dissatisfied. We want Oot Assembly to be somewhat similar to Oot Core, don't we? But the limitation of 8-bit operands means that identifier references in some Oot Core code will translate to single operands, while other identifier references in the same Oot Core code might translate to multi-instruction steps.

So let's look at the previous as a good exercise to focus the mind, and insist that we allow all 3 operands to be 16 bits.

we either end up back where we started, 4 16-bit fields, with at least 6 addressing mode bits in the first field, making it non-uniform, with no room for polymorphism, and with little room for custom addressing modes unless we don't give the opcode an addressing mode; or we move to a bigger representation.

let's look first at embracing non-uniformity of the first field. As we saw above, putting an addressing mode in the first field doesn't really allow beautiful virtualization because you have no room to put the virtualized addressing mode; so you still need explicit virtualization instructions, at which point, why bother with an addressable opcode (which will surely lead to inefficient implementations anyways, and will surely waste a lot of space as it will almost never be used).

if we really want custom addressing modes, we need at least 3 bits for addressing mode, so 3 operands x 3 bits = 9 bits, and (with 3 x 16-bit operands in a 64-bit instruction) we have at most 7 bits left for the opcode. If we still want 2 format bits, that's 5 bits for the opcode; alternately, we could carve up a 7-bit opcode space and say that each opcode only has one format, but that format is easy to read off of the first 2 bits. I like that the best.

We can still have some custom opcodes if we want, but it's not too much worse to just CALL these guys, esp. if the textual representation allows you to write procedure calls as looking just like ordinary instructions. Well.. actually it is a lot of trouble, because you have to burn one operand on the address being called. So have some custom opcodes.

We don't have room for custom flags, but we can have prefixes that execute user-defined code instead.

Custom addressing modes still seem cool.

---

old:

Note that this means that there are exactly 32 opcode slots available for each format. Reserve half of these for custom ops, so 16 core and 16 custom. If this isn't enough, consider eliminating class 01 (one operand), as it probably isn't used much (just JMP, right?), and increasing the sizes of classes 10 (two operands) or 11 (three operands) (JMP can become JMP X+Y). Probably safest to increase the 3 operands, since if you need a 2-operand instruction you could always phrase it as a 3-operand and then ignore one operand. In Lua VM [1] about 15 of 35 2- or 3-operand instructions were 2-operand (there were also about 2 1-operand instructions, JMP and CLOSE); so the 3s are a little more than half. So yeah, just eliminate one-operands and replace with another three-operand class.

---

So we have:

so this proposal is:

7 opcode bits, where the format can be read off from the first 2
3 x 3 addressing mode bits
3 x 16-bit operands

The format bits (which are the first two bits of the opcode) are as follows:

Half of 00 and 01, and all of 11, are reserved for custom ops. So there are 16 builtin prefix modifiers, and 16 custom prefix modifiers; 16 builtin two-operand instructions, and 16 custom two-operand instructions; and 32 built-in three operand instructions, and 32 custom three operand instructions.

core addressing modes: immediate, constant table, direct, indirect

there are 8 addressing modes in total, so this leaves 4 custom addressing modes. Some suggestions:

memory map and memory-mapped special locations:
0: 'zero' register (send output to this register when storing the output is unnecessary; always reads zero)
1: PC (program counter) (maybe combine this with 0, since writing to the PC is not allowed (for clarity/making it easy to see where control-flow-altering instructions are))
2: status register
3: stack0 push/pop
4: stack1 push/pop
5: ?
6: ?
7: ?
0x0008-0x000B: direct read/write access to the first 4 stack0 locations (without pushing or popping)
0x000C-0x000F: direct read/write access to the first 4 stack1 locations (without pushing or popping)
0x0010-0x00FF: memory (that can be addressed by the opcode field)
0x0100-0xFEFF: main memory (that cannot be addressed by the opcode field)
0xFF00-0xFFFF: see following parenthetical phrase (x - 0xFF00 is the offset to the beginning of the phrase)

note: memory locations might be able to hold values larger than 16 bits.

note: this is not to say that the virtual machine can only access 64k of memory. A pointer can be placed in any of these memory locations, and that pointer can point to somewhere else. Likewise, the PC can point to memory locations outside of this map. These alien memory locations are always considered to be 'higher' than this memory (eg if compared, the alien memory address is 'greater than' 0xFFFF) but don't have to actually be that way in the target machine. The alien memory addresses can be incremented as usual. You might say that the 64k of provided 'memory' is really 64k 'registers'. These 64k locations are the only ones that can be accessed in direct addressing mode (?maybe more if the 24-bit operands in forms 01 and 10 are used? but mb the Oot Assembly machine model only guarantees the presence of 64k); however other locations can be accessed via indirect addressing mode.

note: there are two stacks, intended to be used as data stack (stack0) and call stack (stack1) but the result is undefined if a program attempts to access beyond the end of the stack; this means that these can be implemented as a single stack

there are no explicit types nor polymorphism; the opcode determines the type signature of the instruction. Implicitly, the following are suggested:

core types:
pointer
int16
other/custom static type (has already been statically typechecked)
dynamic (boxed/wrapped, tagged type; runtime typechecking requested)

Suggestion for another 12 implicit types:

boolean/single bit
unsigned int16
byte
float64
arbitrarily large/precise rational
unicode string
array
cons cell/linked list/binary tree
graph/hash/dict (opaque, as opposed to AST)
function/callable, thunk/promise/future/lazy value
AST/expression/block/logic
greenthread/process/channel(/actor?)

(note that some implementations might not support all of these types; eg an embedded implementation might not support floating point, arbitrarily large/precise rational, unicode)

there are no flags. instruction-level flags can be emulated by custom prefix ops. operand-level flags can be emulated by custom opcodes.

suggestion for some custom prefix ops:

in the boxing:

some special(ish) instructions:

lexically-scoped-compile-time-only (although they can easily be interpreted)-metaprogrammy-instructions:

---

the idea is:

but most of all, we want Oot Assembly as a thought exercise to focus our mind on the core of Oot (while also perhaps keeping our feet almost grounded by getting our thinking closer to implementation realities just a little bit).

---

so i keep talking about 'non-standard models of the integers' and hierarchical memory, but i don't think we really need that. Just use pointers. If you do an operation on 'the contents of register 7' but the operation calls for some structure, then it is implicit that register 7 actually holds a pointer to that struct.

This might matter (or might not, i haven't thought carefully) if we allowed unsafe typing, but we don't; either everything is provably typed at compile-time, or it is dynamically typed with type tags (that are checked at runtime).

Note that if the contents of register 7 is typed as a struct (of some specific type), then the semantics of 'copy register 7 into register 8' is to copy the STRUCT (with COW), not the pointer.

So we actually have 3 kinds of copying:

What happens if you try to apply ALIAS to an atomic value, e.g. the integer 5? Well, there are two obvious possibilities: either there's an error, or it replaces 5 with a pointer to a newly allocated integer in the heap. I prefer the latter, because that way the programmer doesn't have to distinguish between things that fit in registers and things that are too big to fit in registers and so are pointers. Note that this means that, to the implementation, the type int16 is really two types: int16, and &int16. The programmer just sees it as one, though; the VM transparently dereferences the &int16 when it is read or written to.
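
A rough sketch of that 'latter' choice, with invented names: ALIAS silently promotes an atomic value to a heap cell, and ordinary reads/writes transparently dereference it, so the programmer sees one int16 type even though the implementation internally distinguishes int16 from &int16.

    heap = []

    class IntRef:                          # the implementation-level &int16
        def __init__(self, index): self.index = index

    def alias(regs, r):
        value = regs[r]
        if isinstance(value, IntRef):      # already a reference: nothing to do
            return value
        heap.append(value)                 # promote the atomic value to the heap
        regs[r] = IntRef(len(heap) - 1)
        return regs[r]

    def read(regs, r):
        v = regs[r]
        return heap[v.index] if isinstance(v, IntRef) else v

    def write(regs, r, value):
        v = regs[r]
        if isinstance(v, IntRef):
            heap[v.index] = value          # transparent dereference on write
        else:
            regs[r] = value

    regs = {7: 5}
    p = alias(regs, 7)                     # register 7 now holds &int16 behind the scenes
    write(regs, 7, 6)
    assert read(regs, 7) == 6 and heap[p.index] == 6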

---

so i call these guys 'registers' a lot, but these memory locations or registers or whatever actually have the semantics of 'local variables' rather than 'registers'. What does that mean?

---

ok a problem with the above proposal, in which there is no addressing mode on the opcode, is that 'custom instructions' with two operands (i call this 'call3') can be defined at compile time, but you can't dynamically have first-class call3 functions sitting in a variable and then call them.

you could add a prefix modifier instruction that does this. But that's ugly (because, a code analyzer would have to treat this as really just one instruction that does something completely different from what it looks like without the prefix modifier; which means that it must construct a different internal representation of the code than Oot Assembly; which makes me think, well that representation that it is constructing is cleaner/more regular, so couldn't we just make Oot Assembly be that representation?).

so i think we should go back to having an addressing mode in the opcode.

so, starting with the previous proposal, we need some more bits for the opcode field. If the operands are fixed at 16 bits and the addressing mode bits are fixed at >= 3, then we need 3 bits and the only place left to get them from is the 7-bit opcode field. But 7-3 = 4 bit opcodes is very little, especially if part of the point is to be able to make use of custom opcodes. I think i'd rather settle for operands of less than 16 bits.

Before i couldn't stand 8 bit operands, and it seems (given my desire for custom addressing modes and first-class call3 functions via an addressing mode on the opcode) i can't afford 16 bit operands in a 64-bit instruction, so the choices seem to be to either make the instruction bigger than 64 bits, or use 12 bits. If the instruction is bigger it should be a power of 2, so 128 bits. But then we have a ton of extra bits; we really only need 3 more. So let's do 12-bit operands.

This hurts but i did a quick survey on programming language limits on maximum numbers of identifiers at proj-plbook-plChInternalRepresentationsImpl, and although the most common modern choice is 64k there were some older versions of successful languages whose standard specified less (eg C and C++, which apparently used to be as few as 127 and 1024) and a few modern ones, eg Lua with a limit of 200 local vars. A 4k (2^12) limit is much more than those, and unlike 200 vars, which could conceivably be bumped into by humans and certainly will be by automated programming, 4k will probably almost never be bumped into by humans (unless they have really weird style) and automated programming at least has a fighting chance of avoiding it. I guess tradeoffs must be made one way or another, and it feels like 4k is enough to cause discomfort but not to kill, which means it might be a good idea; and it goes well with the sort of 'don't take more than you need'/D.I.Y./this-is-for-the-Oot-implementors-not-for-end-users/run-on-a-zillion-tiny-parallel-processors goal for Oot Assembly. I suspect that it won't be too hard for Oot implementations to work around this limit by breaking large procedures into parts and/or storing other variables in memory or on the stack; the main effect would probably be that optimizations won't work very well in those cases.

---

i call the idea of having custom (user-defined) opcodes with 3 operands 'call3'.

before i was saying that call3 should use the stack, eg executing a call3 should push the inputs onto the stack, and the call3 function should pop these and push its result onto the stack at the end, but this might make it hard for eg LLVM to trace the dataflow (i'm not sure if it actually would make it hard, i'm just guessing). So, we'd prefer to use llvm's 'CALL' function calls for this stuff (by 'this stuff' i mean where we have functions that do not need tailcalls, exceptions, or any other metaprogrammy messing with the stack; to put it another way, it may be better to make it easy to use LLVM's 'call' functionality here, since, unlike HLL calls where we want exceptions and possibly other metaprogrammy alteration of the call stack, call3 calls use a straightforward vanilla call stack.) (is that more or less overhead on poor llvm? after all, we also don't want llvm wasting time saving variables for calling conventions, etc; in fact we probably just want it to always inline CALL3s; but in any case, we have more flexibility if we at least have the option to implement CALL3s without using the (Oot Assembly-program-accessible) stack); and even if Oot Assembly was ported to a non-LLVM platform, that other platform would probably have ordinary function calls, and again we'd prefer to just use the platform's function calls for call3, again so that the platform can better optimize them. If opcode and constant-table custom (user-defined) ops are compile-time anyway, mb provide special 'syntax' (encoding) rather than just using the stack to pass operands; this makes it easier to use the llvm/platform call.

Now ultimately you still have to have a stack somewhere, because we want one call3 function to be able to call another call3 function within itself (and so on, to arbitrary depth). We don't want to just pass the arguments in registers, because then if one call3 calls another call3 it has to be aware that it is a call3 and save the registers using some sort of calling convention, which breaks the idea that call3s behave just like ordinary opcodes. But the stack can be part of the Oot Assembly implementation, or the underlying platform, and not be accessible to the program running in Oot Assembly (so there are 3 stacks; the hidden stack within the Oot Assembly implementation, used for call3 implementation, and then the data stack and the call stack which are accessible to the Oot Assembly program (which itself is implementing the HLL)).

Could reserve 2 locations in low memory for call3 inputs; or reserve 2 opcodes to load them; or the call3 functionality could place the inputs in existing locations but save the values that are currently there (transparently implementing a caller-saves calling convention). Note that the first and last options aren't that different because even if 2 locations are reserved just for call3, the call3 functionality has to transparently save what is there when one call3 calls another call3.
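
An illustrative sketch (names and reserved locations invented) combining the first and last options: two reserved locations receive the call3 inputs, and the call3 machinery transparently saves whatever was in them on a hidden, implementation-level stack, a caller-saves convention the OA program never sees, so one call3 can invoke another.

    CALL3_IN0, CALL3_IN1, CALL3_OUT = 5, 6, 7    # hypothetical reserved locations
    hidden_stack = []                            # lives in the implementation, not in OA memory

    def run_call3(mem, routine, in0, in1):
        hidden_stack.append((mem[CALL3_IN0], mem[CALL3_IN1], mem[CALL3_OUT]))
        mem[CALL3_IN0], mem[CALL3_IN1] = in0, in1
        routine(mem)                             # may itself perform run_call3(...)
        result = mem[CALL3_OUT]
        mem[CALL3_IN0], mem[CALL3_IN1], mem[CALL3_OUT] = hidden_stack.pop()
        return result

    mem = {CALL3_IN0: 0, CALL3_IN1: 0, CALL3_OUT: 0}
    def my_add(m): m[CALL3_OUT] = m[CALL3_IN0] + m[CALL3_IN1]
    assert run_call3(mem, my_add, 40, 2) == 42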

note that all this call3 stuff means that the interpreter must maintain a third stack for the call3s.

note that this distinction between low-level call3 calling and HLL stack calling is reminiscent of the LLVM call/invoke distinction.

---

so basically there are 3 constant tables: operand, opcode immediate, and opcode constant. is there any distinction between the latter two?

that is to say, each oot module gets its own constant tables (instructions and data) (but maybe not its own 'custom' ops, that might be per-program; or mb even they aren't 'custom' at all, they are part of Oot Assembly, just a part which doesn't have to be implemented directly; the reason is that the implementation is free to hardwire them).

should each module also assume that its 4k of memory is available?

or, should each 4k of memory be available separately not just for each module, but for each call? i guess so; these are really local identifiers, not machine registers, so don't worry about cost of register saving (maybe they are even on the stack, in the underlying implementation). otoh mb sometimes you'd want intra-module calls to reuse regs; this could be useful for inline-like decomposition of a large assembly routine into smaller subroutines.

So, we have two forms of call; a form in which the caller and callee each have their own 4k regs, and a form in which they share.

Mb only 'stack' call, not 'call3', gives new regs.

also mb absolute immediate and constant addresses eg in JMP are all implicitly module relative (pointers, however, can be shared between modules).

also CALL3 is not sandboxable; there is ambient authority, eg the callee has as many privileges as the caller

---

if the idea is that program analyzers can consider each single instruction in isolation (which is why i eschewed a 'call3' prefix op), and if we allow custom prefix ops, then we need to put some restrictions on prefix ops to prevent them from changing the semantics of the following instruction in ways that would break static analysis.

for example, a custom prefix op that creates a 'call3' is not okay because it could break pretty much any invariant. By contrast, a prefix op that causes the following instruction to be executed atomically is ok, because it only reduces the possibilities for what can happen (increases the invariants that we can assume).

i'm not sure exactly what requirement we want here. In any case, we're free to introduce non-custom prefix ops that break the requirement (although, as with a call3 prefix op, i'd prefer not to).

mb we want something like:

---

i notice that both the 'immediate constant' and the 'constant table' addressing modes are useless for the output operand (except in the special case when the output operand is immediate constant 0, in which case we might take this to mean 'discard the output value').

So, what should immediate or constant table mean for the output operand? should we use immediate 0 output for the zero register instead of PC? If we did that, then we wouldn't have to always say that writing to the PC was always 'discard', and then that would give us a clean solution for the call3 JMP-in-caller problem (that is, how you could program an instruction like JMP as a CALL3; because the point of JMP is to make the CALLER jump, not the subroutine). The solution is: when in a CALL3 subroutine, the PC is 'frozen' at the callsite to the current CALL3 subroutine, and writing to the PC does not immediately cause YOU to jump, but it does cause the CALLER to jump. Actually, this is still a little ugly, because now the PC is behaving differently (frozen) within call3s and within ordinary code. So mb just provide a special instruction, JMP_CALLER, for this. What it really does is just modify the RETURN address of the CALL3 (so now we have 3 special regs for call3; the two inputs, and the return address; and maybe a fourth, to put the returned value into).

another idea: maybe 'constant table' addressing mode on the output operand should mean output to particular locations within local variable structs, ie the operand represents a path, eg obj.subobj.field3 (this would otherwise require multiple steps to .__get parts of local vars, then .__get a part of those, etc); ie indirection chains. A problem with this proposal is that its usefulness is not restricted to output operands; we'd want that for reading too.

another idea: these addressing modes mean that the effective address of the operand (under some addressing mode) is both an input, and also an output. But i dont see how that would work; this would be something determined by the instruction, not the operand. Eg how could 'ADD' accept its output as an input?

another idea: could act as a flag causing a conditional branch or skip. But, one design goal is to keep those instructions that jump or skip separate, for the purpose of easy program analysis.

another idea: some other random flag. No, because if we used a whole bit for a flag, then we'd have 8 addr modes left for the output, but still have 16 addr modes for each input. But most likely the missing 8 modes are things that would be useful in both input and output, so we don't want to lose them. Instead of a whole bit, we just want to reallocate 2 modes in the output, the 'immediate constant' and 'constant table' modes.

so we have to think of something that is only useful for the output operand.

another idea: something that depends on both the output operand and one or more of the input operands.

how about: output is an offset of the input. Various ways to do this. Let i be the input operand and o the output operand. Let reg(x) be the contents of register x.

reg(i+o) # this one is pretty useless because we could have just added i and o at compile time and used direct addr mode
reg(i) + reg(o)

actually, that ignores the addressing mode of i; let's use the given mode of the input operand. Define a to be the effective address specified by the input operand i.

a + o
a + reg(o)
a + *reg(o)
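
A quick sketch of those three candidate meanings, where a is the effective address produced by the input operand, o is the raw output operand, and mem[] stands in for the register/memory array (all hypothetical):

    def out_addr_plus_const(mem, a, o):     # a + o
        return a + o

    def out_addr_plus_reg(mem, a, o):       # a + reg(o)
        return a + mem[o]

    def out_addr_plus_deref(mem, a, o):     # a + *reg(o)
        return a + mem[mem[o]]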

but wait a minute, what if the input operand addressing mode was itself immediate constant or constant table? Then we don't have an effective address. Which is actually fine as long as we don't use the expression *a; in this case i must be an offset, and that offset must be added to a pointer to yield an effective address. But wait, how could o be a pointer? It is a constant. Could it possibly be a constant pointer? How would you know a pointer at compile time? Maybe if it was a pointer to a global; back to the idea of having 'path constants' i guess. But more useful here, since we're not just taking that pointer, we're doing something with it. But... if we have path constants at all, wouldn't we want the object to just use it as the effective address? Which would mean letting 'constant table' only be valid for 'static global pointer constants' (which i hadn't really planned to have, but since the opportunity presents itself..). Now we still get to choose the meaning of 'immediate constant' in the output.

this stuff is reminiscent of 'indirect indexed' vs 'indexed indirect' in the 6502. i wonder which of those was more useful? http://wiki.nesdev.com/w/index.php/CPU_addressing_modes says "The (d),y mode ((Indirect Indexed)) is used far more often than (d,x). The latter implies a table of addresses on zero page." http://www.emulator101.com.s3-website-us-east-1.amazonaws.com/6502-addressing-modes.html says "While indexed indirect addressing will only generate a zero-page address, this mode's target address is not wrapped - it can be anywhere in the 16-bit address space. " (i don't understand this; i thought the end address was not necessarily zero-page here, just the starting one.. but that's the same for both of them, not a distinguishing factor; am i misunderstanding or did they make a mistake?). https://skilldrick.github.io/easy6502/ says "Indexed indirect...You won’t see this much." and "Indirect indexed is like indexed indirect but less insane.".

well let's see the simplest thing to do would be: let $i be the argument that results from operand i. $i + o. For $i + o to be an address, since o is an integer, $i must be an address. So eg if the addressing mode of i is also 'immediate constant', then $i is an integer, not an address, and this is invalid. We could define a special case for that. But (a) that seems to be coupling the two operands too much, and (b) it is too much to remember (too irregular). Otoh doing this sort of thing at all couples the operands a bit, and is a bit irregular. So let's not.

---

i wonder if i am being confused about using llvm CALLS for call3. In an Oot Assembly interpreter, the interpretation is done within a loop where you fetch each instruction and then execute it and then update the PC; if you executed a CALL there you'd leave the interpreter loop, right?

no, i was not confused; the llvm CALL instead of using the stack may not help in an Oot Assembly interpreter itself implemented in LLVM, but it will help when compiling Oot Assembly to LLVM.

it's possible that this architecture may also make it easier to use the PyPy RPython JIT generator; i doubt it, though.

---

so what does Oot Assembly offer that Forth does not?

---

so what does Oot Assembly offer that ordinary assembly with a macro assembler does not?

portability.

so what does Oot Assembly offer that a 'macro assembler' language does not?

maybe not too much, but are there such things? it seems to me that macro assembler languages are tightly coupled to their intended application of writing assembly language for various hardware ISAs, rather than being a very simple portable target language for bootstrapping an HLL.

another answer might be 'simplicity'.

so what does Oot Assembly offer that Parrot VM does not?

A quick glance suggests that Oot Bytecode is much simpler/more barebones/easier to implement/easier to port than Parrot; also it seems like Parrot might be implemented in C rather than self-hosting?

http://docs.parrot.org/parrot/latest/html/ https://github.com/parrot/parrot/tree/4fe3223ee11c4ab47c36f75a6be4a5d550c67133/docs

(is there documentation on porting Parrot? http://lists.parrot.org/pipermail/parrot-dev/2012-January/006527.html is the best i could find)

(note: at one point Parrot was even planning to move to a small set of core ops, did they ever do it? how does this compare to Lorito? http://trac.parrot.org/parrot/wiki/Lorito )

http://moarvm.org/ is perhaps closer. Interestingly, both Parrot and MoarVM have custom ops https://github.com/MoarVM/MoarVM/blob/master/docs/extops.markdown . But MoarVM still seems more complicated than Oot Assembly, harder to port, and not self-hosting (MoarVM is in C) (note: apparently the plan is for it to be self-hosting someday? http://www.perlmonks.org/?node_id=1048515 ) http://moarvm.org/features.html

what about LLVM? Same problem; much more complicated than what i have in mind. LLVM is self-hosting though; LLVM is written in C and there are compilers from C to LLVM written in C, so one could run these compilers on the LLVM source to get an LLVM representation of a compiler from C to LLVM.

also, we have fixed-size instructions, which Parrot does not

So in summary:

---

http://www.modernperlbooks.com/mt/2011/07/less-magic-less-c-a-faster-parrot.html notes that if you have some 'custom instructions' implemented in OA, and some platform-optimized instructions hardcoded in the OA implementation, this is a problem if the hardcoded instructions need to call other custom instructions implemented in OA

(my feeling is that we should mostly try to just say, ok, don't do that then; the platform-optimized hardcoded instructions written in some other language should only be allowed to depend on other platform-optimized hardcoded instructions. HLL library functions can be 'jetted' and implemented by calling some other language, and our FFI will provide ways to re-entrantly call back into the running Oot/OA program; and in fact OA itself may provide a way to call out and re-entrantly call back in; but this facility shouldn't be used inside of custom instructions; although, perhaps i'm misunderstanding; perhaps the complexity of providing any interop at all is the same as allowing interop within the custom instructions, in which case we just have to bite the bullet and do it)

---

with the 4k limit, i do worry a bit about SSA; if one tried to put the Oot Assembly into SSA form, one would have to make sure there weren't too many assignments in any one procedure.

huh, i guess in a way, SSA is like the opposite of register allocation.

---

potential inspirations for the structure of headers of modules:

https://github.com/parrot/parrot-docs5/blob/master/5.0.0/parrotbyte.pod

https://github.com/MoarVM/MoarVM/blob/master/docs/bytecode.markdown

LLVM's

https://github.com/parrot/parrot/blob/m0/docs/pdds/draft/pdd32_m0.pod

---

Parrot's M0 proposal says

"Call Frames

This is the central data structure at the M0 level. The call frame graph contains most of the state that is needed to execute M0 bytecode. ... There will be no stacks in M0, not even for argument passing. All internal control flow apart from basic gotos with code segments must be in continuation-passing style. This will give M0-based code several benefits. It will ensure immunity from stack-smashing attacks and related programming errors. It will also eliminate the nested (inferior) runloop problem and opens up interesting possibilities for an Erlang-like concurrency implementation.

CPS will also allow Parrot to use many pre-existing JIT algorithms, such as those described in the paper "Trace Based Compilation in Interpreter-less Execution Environments" [TBC]. " -- https://github.com/parrot/parrot/blob/m0/docs/pdds/draft/pdd32_m0.pod

so i understand what continuation-passing style (CPS) IS, but how does it let you avoid a stack? Say Procedure A calls Procedure B. One way or another, Procedure A has to store the state of its local variables while Procedure B is executing, and when Procedure B returns, Procedure A has to be able to access that saved state. With a stack, Procedure A's locals are stored in a call frame and pushed onto the stack; when Procedure B returns, Procedure A's call frame is right there again at the top of the stack (perhaps underneath return value(s) from Procedure B). When using continuation passing in place of a stack, instead of just having a stack you have to somehow pass this state to Procedure B as an explicit (at some level) argument; you still have to store the state one way or another, and make sure that Procedure A can access it again after Procedure B returns control. It seems like what M0 is proposing instead is for these call frames to be allocated on the heap and have pointers to each other. Which is more flexible (it allows for closures with lifetimes longer than the dynamic scope that created them). If i understand that correctly, a downside is that, since the heap is being used instead of the stack for local variables, performance is less, because it's faster to use the stack than the heap; the stack is especially faster for allocation and deallocation, so when calling and returning from functions (which is very common) i would expect significant overhead from this design choice.
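
A toy illustration of that difference (not M0's actual design): call frames allocated on the heap and linked by a caller pointer, so 'return' is just following that pointer; locals live as long as something references the frame, at the cost of a heap allocation per call instead of a cheap stack push.

    class Frame:
        def __init__(self, caller, locals_):
            self.caller = caller           # replaces "the frame underneath us on the stack"
            self.locals = locals_

    def call_b(caller_frame, arg):
        frame = Frame(caller_frame, {'arg': arg})        # heap allocation per call
        result = frame.locals['arg'] * 2
        return frame.caller, result                      # "return" = follow the caller pointer

    frame_a = Frame(None, {'x': 21})
    resumed_frame, value = call_b(frame_a, frame_a.locals['x'])
    assert resumed_frame is frame_a and value == 42      # A's locals are intact after B returns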

that being said, i plan to do something like that myself, so i'm not complaining too much.

---

in fact, the M0 proposal seems similar in many ways to what i've been thinking of. They are trying to minimize the core instruction set to allow easy porting (although i don't think they quite follow my idea of having many core instructions that almost everyone will want to use platform primitives for nevertheless being custom instructions, not core instructions, but letting the implementation override them). They have a simple instruction encoding (8-bit opcode, all instructions take 3 8-bit arguments, no addressing mode). Their call frames can hold a fixed limited number of locals (256; but they have a 'Register Spilling' mechanism in place for when more is needed); this is similar to my 4k 'registers' which are fresh for each scope. M0 has 4 primitive types; int, 'numeric', PMC (like my 'custom'), and String. Each M0 memory location is 8 bytes (just like mine) (although elsewhere in the document they say they are 32-bit, so either they got confused or i'm misunderstanding; oh i see now, the pointers have whatever native size they have, ints only require 32 bits, but the implementation could decide to just give all memory locations the same size, or it could give certain locations different sizes; there is a hardcoded division of memory locations into int (I), number (N), string (S), and object (P), what they call INSP registers) and for the things that can't fit in that (PMC and String) it just holds a pointer.

they used to call their call frames 'contexts', just like me at some places in this document, but found that 'call frame' was less confusing to ppl.

at one point they say "Op composition is considered magic-like and will not be part of M0.", what is op composition? Is this literally function composition or do they mean what i call custom instructions?

https://github.com/parrot/parrot/blob/m0/docs/pdds/draft/pdd32_m0.pod

---

MoarVM's feature list looks pretty good:

" Some key features provided by MoarVM? include:

---

let's simplify things further:

also i forgot to say above that custom instructions must not have cyclic dependencies, and must be defined in order such that when one of them is defined, any custom instructions used within its implementation have already been defined. (the order of the assigned opcodes may be different from dependency order).

there should probably be an entry point table in each module for all the custom ops, and an entry point table for exported functions, and a version, and offsets to the various tables at the beginning (table of contents)

also, let's get rid of the idea of 'regular-expression like' stack patterns. We want to avoid programs screwing up the stack; first, we want to avoid 'implementation-level' runtime crashes in Oot, and second, it's a security risk. So if there were a region in the stack with a variable number of entries, we'd probably want that region to be prefixed by an integer 'length' entry on the stack that said the size of said region, and then check that entry. So the variable-length thingee is like a 1-dim array (or bytestring). But much of the time when you do that, you could have just put the variable-length thingee on the heap and put a pointer to it on the stack. So just do that, it makes the implementation simpler, and it removes the annoying overhead of double-checking those 'length' entries at runtime.

so am i saying that we are going all the way and preventing stack1 from being used for non-standard control flow patterns, that entries put on it by subroutines must be removed when those subroutines terminate, except for a fixed number of declared return values? Maybe so.. but the thing is, if the OA user code can arbitrarily manipulate stack1, the user-level 'call stack', at runtime, then i don't think it's possible in general to statically know which program location will be at stack1 TOS (top of the stack) at a given lexical location at runtime; so therefore you also can't statically prove that stack0 will be in the correct state for that location.

For example, if function f() calls function g() then the arguments will go onto stack0 and the callsite will go onto stack1, and then you could statically verify that the stack map looks like what g() expects upon entry; and then if control flow was ordinary, g() would consume these inputs from stack0 and put a return argument onto stack0 and RETURN, which would pop the callsite from stack1 and change the PC to point to that, and then you could statically verify that the stack map looks like what that callsite within f() expects after the RETURN. BUT if control flow is extraordinary, eg if g() directly messes with stack1, swapping and inserting and removing various entries, then you don't statically know which pointer will be at stack1 TOS when g() RETURNs, so you can't statically know what the stack should look like at that point.

i propose that we do the same thing here that we are doing with gradual typing; most program code won't do this weird stuff, and stack1 will only be impacted by CALL and RETURN and associated opcodes, in which case the condition of the stack can be verified statically. But when code messes with stack1 explicitly, it must in addition contain an assertion w/r/t the state of the stack. This assertion will then be checked at runtime (but how? we aren't keeping around type tags for primitive types on the stack; i suppose the simplest thing to do would be to have another instruction, paired with and preceding the assertion instruction, that says 'start keeping track of what goes onto the stack from now on, i'm going to have to make an assertion about it later').
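
A sketch of that gradual-verification idea, with invented instruction names: code that manipulates stack1 directly first issues a 'start tracking' instruction and later an assertion about the stack's shape; the tracker records what gets pushed so the assertion can be checked at runtime.

    class Stack1:
        def __init__(self):
            self.items, self.tracked_types = [], None
        def begin_tracking(self):                 # 'start keeping track from now on'
            self.tracked_types = []
        def push(self, value, typ):
            self.items.append(value)
            if self.tracked_types is not None:
                self.tracked_types.append(typ)
        def assert_map(self, expected_types):     # the paired assertion instruction
            assert self.tracked_types == expected_types, 'stack map violated'
            self.tracked_types = None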

to make it easier to see when an explicit stack map is required, don't mmap stack1, use STACK1_POP and STACK1_PUSH and STACK1_PEEK and STACK1_POKE? well, no, since registers aren't memory addresses and can't be aliased, scanning for stack ops is just the same as scanning for direct addressing mode with the stack numbers. Custom addressing modes make this annoying though; so add some requirements on them that don't let them change the stack except in easy-to-see circumstances (below i added the concept of 'direct' custom addressing modes).

to make implementation by eg compiling to llvm easier, instead of memory-mapping the call3 input and output arguments, just use special opcodes (call3 input1, call3 input2, call3 output) (nah; as with stack push/pop, or a direct custom addressing mode, just scan for direct addressing mode with the proper mapped memory location)

there should be a 'comefrom' annotation and a 'phi' annotation; should they be mandatory? (and comefrom 0 refers to an exported entry point)

---

error handling:

OK, so we're going to have situations where some instructions can yield errors at runtime, eg divide by 0. How to handle this?

one option: The ERR register is set if an error is encountered (eg division by zero). OA code should check the ERR register after one or more potentially error-producing operations. Problem with this: programs which fail to check the ERR register everywhere they should will continue executing in an inconsistent state.

another option: have an 'error handler address' register, and control jumps to this location upon an error. The problem with this is that it's a hassle to set up these 'error handlers', and also sometimes you'd prefer to ignore an error or handle it inline, which this makes annoying. Also you don't want to bother setting the error handler upon each call, and if you did, then wouldn't you want to reset it to the caller's handler upon the completion of the call? But then you are messing with a stack of handlers.

another option: Golang's defer/panic/recover here http://blog.golang.org/defer-panic-and-recover . That's kind of like the simplest form of exception handling if you take the 'error handler address' and then extend it to a stack of error handlers. But it's still complicated.

another option: straight out exception handling, like LLVM. But that's still too complicated.

another option: x86 style maskable interrupts https://pdos.csail.mit.edu/6.828/2005/lec/lec8-slides.pdf . But then you get into loading and saving registers for the interrupts, which is annoying/complicated; and we don't really need I/O interrupts here b/c Oot Assembly is not intended for hard realtime systems, even though it is supposed to have relatively low and predictable latency.

another option: upon exceptions, JMP to the beginning of the program (the beginning of the program should then check if there was an error and if so empty the stack, etc). Resetting the program is probably what you want to do if you are running a program on critical equipment (eg a life support machine, or your car's brakes) and there is an unexpected error.

so i think the thing to do is to have at least 3 'error handling modes'; one of which crashes upon error, one of which just updates an ERR register, and one of which allows the program to handle the error, or reset, by JMPing to the beginning of the program.
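
A small sketch of those three modes (names invented): crash on error, record the error in an ERR register for the program to check, or record it and jump back to the start of the program, which is then expected to inspect ERR and clean up.

    MODE_CRASH, MODE_ERR_REG, MODE_RESET = range(3)

    def handle_error(state, err_code):
        if state['mode'] == MODE_CRASH:
            raise RuntimeError(f'OA runtime error {err_code}')
        state['ERR'] = err_code                  # always visible to the program
        if state['mode'] == MODE_RESET:
            state['PC'] = 0                      # restart; program checks ERR and empties the stacks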

---

so what are the usual ranges for relative branches?

ARM Cortex M0 has various instructions with different ranges; the smallest one is "Bcond" which is −256 bytes to +254 bytes [3].

6502 is ~+-256 [4]

MIPS is +-128k [5] a 16-bit signed value (times 4, i guess) [6]

Lua's TEST is a skip. Lua's FORLOOP has a relative operand of 18 bits.

MSP430 has conditional jumps with a 10-bit signed offset [7]

x86 has a JZ instruction form with a -128 to +127 offset called a 'short' conditional jump [8] [9]

Apparently that's all that's allowed in some sort of x86 16-bit mode [10]

Apparently the 8051 conditional jumps were limited to -128 to +127 [11]

Some PICs have a BRA instruction with -255 to +256 [12] although sometimes the range is greater [13] [14]

Some PICs have some sort of 2k page size related to GOTO [15]

AVR has a 12-bit signed offset [16] [17]

Z80 has 8-bit signed offset [18] [19]

So the smallest one that i see is +-127. The small ones are 8051, x86 SHORT jumps, x86 16-bit, Z80. Now, some of these offsets are in bytes, but others are in instructions. So some of these may be 127 bytes but less than 127 instructions. And indeed, the ones with the shortest relative jumps, Z80, 8051, and x86, are all 127 BYTES. What are their instruction lengths? x86: "75% of x86 instructions are shorter than 4 bytes. But if you multiply the percentage by length, you will find that these short instructions take only 53% of the code size. So another half of a typical executable file consists of instructions with 32-bit immediate values, which are 5 bytes or longer." [20]. Z80: "252 instructions are one byte, the rest are 2, 3 or 4 bytes long." [21] 8051: "Instructions are all 1 to 3 bytes long, consisting of an initial opcode byte, followed by up to 2 bytes of operands.".

So you could cover all those guys, and 75% of x86 instructions (but less than half, by total executable length), by assuming instructions of at most 4 bytes. And 128/4 = 32.

Of course, this wouldn't REALLY necessarily make the cut, because the compiled OA instructions would often take more than 1 platform instruction! So if you said that OA relative jumps had a +-32 limit (in units of OA instructions), you'd still end up with some cases where the compiled version spanned more than 127 platform bytes.

Even a 1-instruction SKIP could be too much if the OA instruction being skipped was a custom instruction with a lengthy implementation.

So maybe we really oughtta just stick with the +-2k limit that our 12-bit encoding gives us.

also, we don't want custom instructions cluttering up the module's JMP table, so they should have to only use relative jumps. So the relative jump range puts a limit on the size of custom instruction implementations (that could perhaps be partially circumvented by dynamically calculating the relative jump target, but anyways..). So for that reason, we want to leave the relative jump range big.

But, if custom instructions are implemented by inline expansion, then relative jumps in surrounding code must be updated. It would be awkward if, upon complete expansion of custom instructions, the range of some relative jumps would have to grow larger than +-2047. So one sort-of solution is:

That seems kind of draconian but it does have the aesthetically nice effect of preventing OA code from creating very high-level OA instructions. But it also forces user-level control flow structures to use module-global JMPs all over the place. And since we are limiting the 'jump table' to 4k entries, that could be a problem. I feel like this would definitely impinge on HLL programmers; if you mapped HLL modules directly to OA modules, then surely some of them would have more than 4k loop constructs within them! Another solution is to make the implementation handle it: tell the implementation that if they dare to inline expand the custom instructions, they are responsible for finding some other representation with a larger bit-width to keep track of the growing relative jumps. Implementations that can't deal with that can just use the 'hidden stack' implementation method instead of inline expansion.

Not sure quite what to do here. I think the idea that we can directly map HLL to instructions with 12-bit operands is showing its weakness here.

Well, one thing that many architectures do is have fewer, larger operands on JMP instructions. We could do that. We could have a 36-bit absolute JMP destination. Then we wouldn't even need a JMP table, which would be much more efficient anyhow. So we'd have to bring back those format bits -- but for a good purpose.

A related idea is to have multiple JMP operands but combine them somehow, eg the first one is an index into the JMP table, the second one is an offset from that. That would let us cover 4k JMP table entries * 4k = 16M. But it's less efficient.

A third idea is to keep the format bits out but to have the JMP instruction just combine its operands. This is inefficient and wastes the addressing mode bits but is a little simpler. But it has the advantage that, if the MSB operands are direct to registers holding pointers, instead of immediate, then this instruction doubles as an indirect JMP, eg the first operand locates a JMP table and the third is the offset within that table. And optimizing compilers could optimize the putting-together of immediate constants.

A fourth idea is to have some of the JMP operands be pointers, and some be offsets from those pointers. But this seems to me like it gunks up the constant table with stuff that should be in the JMP table (because we can't have pointer immediate values).

Let's have JMP combine its operands, and get rid of the JMP table.

Now, 2 operands for JMP, or 3? 2^24 is only 16M, which still may not be big enough for everyone. The problem with using all 3 is that the program analyzer will think that the third operand is being written to. We could make an exception for JMP, though.

I'm reluctant to make the exception. We'd also want an exception for the constant table, in that case, and we'd want an immediate LOADCONSTANT. Lua's unconditional JMP is only 18-bits (and that's a relative JMP, and it's signed).

y'know, now that absolute jumps are so big, and now that they have a base pointer, let's just remove the relative jump instruction. That makes custom instruction expansion easier.
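
A hedged sketch of what 'JMP combines its operands' could look like with 12-bit operands: two operands concatenate into a 24-bit offset and the remaining operand supplies a base pointer (making it double as an indirect JMP); the details here are illustrative, not settled.

    def jmp_target(base, hi12, lo12):
        offset = (hi12 << 12) | lo12      # two 12-bit operands -> 24-bit destination
        return base + offset              # base comes from a pointer operand (or the module start)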

Wording that i deleted:

" If custom instructions are implemented by inline expansion, then relative jumps in surrounding code must be updated. It would be awkward if, upon complete expansion of custom instructions, the range of some relative jumps would have to grow larger than ~2^24 (a little less, actually), the largest effective relative jump value that can be specified in immediate mode. But that's pretty big, so hopefully it won't happen.

Any instruction that takes a pointer, if given an immediate operand, interprets that as a pointer to the n-th Oot instruction in the code (not counting inline expansion of custom instructions), with respect to the beginning of the module/function/custom instruction currently being defined.

"

---

related to JMPs and labels:

http://blog.llvm.org/2010/01/address-of-label-and-indirect-branches.html

---

removed some old wording from below/changed my mind:

" this code assumes its inputs are in the 'input registers' and leaves its output in the 'output register'. The call site is in the 'link register'. If it wants to change where the PC will be set after it's done, it uses the JMP_UPON_RETURN instruction, which alters the link register. "

" todo: maybe dispense with 'format bits' except for prefix modifier vs. rest? Then everything is 3 operand, which is simpler. Also do we want a 3rd format bit to distinguish possibly-custom instrs from not (probably no, since very few core instrs). The first 2 bits of the opcode are 'format bits'. 00 means prefix modifier; 01 means two-operand instruction; 10 and 11 mean three-operand instructions. Note that there is space for 1024 opcodes in each of these categories (plus up to another 1024 in the constant table in each of these categories). The next bit is the 'custom bit', set for custom instructions. "

" all instructions are either 'prefix modifiers or metadata', or 3-operand instructions. Prefix modifiers or metadata have opcodes below 256. Instructions with opcodes below 0xC00 have the first operand as the 'destination' or output, and the latter two operands as inputs. Instructions with opcodes at 0xC00 and above also take the third operand as input. Instructions with opcodes at 0xF00 use all three operands as both 'input' and 'output'. "

"

CARRY, ZERO, NEGATIVE, OVERFLOW (CZSO)

These 1-bit registers are intended to imitate the 'status register's available on the ALU of various processor architectures, and arithmetic and comparison instructions are encouraged to set them in the conventional way:

They can also be used in other ways by non-arithmetic instructions as a way to take as input and/or yield as output up to four bits.

(todo: we don't really need NEGATIVE, right? should we get rid of it? also, is my interpretation of the 'conventional way' to use S for comparison correct? should we call it S, SIGN?) "

" The implementation-dependent bit is reserved for the use of the OA implementation (for example, it might use this to mix other data into OA code; or to mark instructions that have been JIT compiled; or to intermix some alternate instruction encoding with instructions in this encoding; or to mark certain instructions for some other purpose; note however that this bit is not in the instruction itself, but rather in the value obtained after the 4-bit addressing mode has been applied to the 12-bit operand).

The flag bit is TODO: what should they be for? Boxing? Atomicity? blocking/async? raise vs nullable types? use CZNO flags? natural language instruction encoding format flag?). "

"

SHORT FORMAT

1 format bit + 7-bit subinstruction + 7 x (1 subformat bit + 7-bit subinstruction)

The SHORT format is a 64-bit packed grouping of 8 sub-instructions, each one 8 bits long except for the first one, in which the first bit was consumed by the format bit.

The first instruction is a 'PUSH' instruction which pushes a value onto the stack. The next 6 instructions are each either PUSHs or operations which take their arguments from, and push their results to, the stack. The last instruction is either POP instruction which pops a value from the stack, or another operation.

The format of the first subinstruction is:

The format of the next 6 subinstructions are:

The format of the last subinstruction is:

The inputs and outputs of the OPERATION instructions are all implicitly the data stack (stack0) (eg the inputs are POPed from the stack and the outputs are PUSHd to the stack). Note that OPERATION instructions may consume any number of inputs from the stack (todo must this number be fixed for each operation, or can some eg consume the stack up to some terminator? i'm thinking if not fixed, then known at compile-time at least).

todo this format would be simpler and much more efficient to decode in parallel if the # of addr mode bits were the same between PUSH/POP and OPERATION; but it's hard for me to choose between 32 opcodes for operations vs 3 addressing modes for data access. On the one hand, 2 addr modes are all you really need, and 32 opcodes allows you to comfortably have opcodes for eg each of the 10 or so basic instructions of the SECD machine plus some basic arithmetic and composite data operations. On the other hand, a third addressing mode bit allows you to represent boxed data, and a third addressing mode bit could be useful for the opcode as well as for data because it could represent eg laziness or asynchrony in the opcode, and 16 opcodes allows eg the 10 or so basic instructions of the SECD machine plus a tiny core of basic arithmetic, and other basic operations can be represented through library functions. "

"

"

" instruction-level flags could be emulated by custom prefix ops. "

---

having the opcode number determine the format makes it harder to have a dense opcode dispatch table. Implementations that want dense dispatch tables would want to create a different dispatch table for each format if we do it this way. Of course, allowing custom instructions makes density hard anyway, but not that hard, since each module gets to remap the opcodes for any imported custom instructions. Consider going back to format bits, where the format bits aren't part of the opcode at all, eg the opcode bitwidth is reduced by the number of format bits. We have many fewer potential opcodes, but 4k was too much anyways. A bigger disadvantage is that it's harder to have unequal partition sizes for the opcode space, but 4k is a lot so mb we can work with this. If instructions of different formats can be intermixed, that makes program analysis slightly more annoying but just as efficient; the format bits are still stored in the instruction encoding, so we don't have to look them up in an external table or anything. We still might want to isolate prefix modifiers into their own numerical opcode space, since the dispatcher can't really branch to an instruction implementation for them; it must treat them specially by first reading the next instruction (well, it could do that after branching actually, but some implementations may prefer not to).

todo: so let's do that. If we had 2 format bits (1k max instructions), what would the 4 formats be, letting I, O, and B stand for input, output, and both (input and output) operands? prefix, IIO (eg ADD), IIB (eg CAS), BBB (eg ROT3 on memory locs), I (eg 36-bit JMP), IO (eg 24-bit JMP)? that's 6 forms already, too many for 2 bits. If we take 3 format bits then there are 512 opcodes in each partition; we have 2 partitions left over, so we could assign more to IIO, the most common one.

btw the IOB format yields 27 possibilities, plus prefix modifiers, so 28; and that's without 2- and 1-letter signatures, which add 9 + 3 possibilities, for 40 total. But with the constraint that I < O < B and that each sequence is non-decreasing, we only have: I O B II IO IB OO OB BB III IIO IIB IOO IOB IBB OOO OOB OBB BBB + prefix = 20 possibilities. Are any of those others useful? Maybe we could separate out 'prefix' and 'metadata', although i don't know why we'd need to; but i bet that could come in useful someday.

So if we have 3 format bits and if we can't reuse opcodes between formats, then we can only have 9 opcode bits, or 512 opcodes total. But if we can reuse opcodes, then we get 4k opcodes total, 512 per format, but if we want a dense representation then we have to have 8 opcode lookup tables, one for each format. If the interpreter is implemented in that way where you avoid jumping to a central node in between each instruction, to allow branch prediction to work better, then you have to duplicate the logic to choose one of these 8 tables at the end of each instruction, and this logic involves local branching, which is annoying/may somehow bother branch prediction, although i don't see how it would; it may even help. Also, if the operands are processed differently for different formats (are they?) then you have a bunch of branching logic repeated after each instruction implementation anyways.
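
A rough sketch of the 'one dense dispatch table per format' idea; the bit positions, widths, and handler names here are invented for illustration (the point is only that the format bits pick which small dense table gets indexed):

    # Hypothetical 64-bit instruction word: [3 format bits][9-bit opcode][...operands].
    # With 3 format bits we keep 8 small dense tables instead of one sparse one.

    FORMAT_SHIFT = 61
    OPCODE_SHIFT = 52
    OPCODE_MASK = 0x1FF                     # 9 bits -> 512 opcodes per format

    def h_add(ins):  return "ADD"           # stand-in handlers
    def h_jmp(ins):  return "JMP"
    def h_nop(ins):  return "NOP"

    # one dispatch table per format; unused slots fall back to a default handler
    dispatch = [[h_nop] * 512 for _ in range(8)]
    dispatch[2][0x001] = h_add              # format 2 (say, IIO), opcode 1
    dispatch[5][0x010] = h_jmp              # format 5 (say, I / absolute JMP)

    def step(ins):
        fmt = (ins >> FORMAT_SHIFT) & 0x7
        op  = (ins >> OPCODE_SHIFT) & OPCODE_MASK
        return dispatch[fmt][op](ins)       # format bits choose the table, opcode indexes it

    ins_add = (2 << FORMAT_SHIFT) | (0x001 << OPCODE_SHIFT)
    ins_jmp = (5 << FORMAT_SHIFT) | (0x010 << OPCODE_SHIFT)
    assert step(ins_add) == "ADD" and step(ins_jmp) == "JMP"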

Are the format bits even good for anything? Maybe we want to implement the addressing modes at the beginning of each instruction in any case. How is this worse than doing it at the end of each instruction? In that case, the format bits would still be useful for dataflow analysis, but do we really care about that? If we did, we could put that information in the module info table instead, right? But i guess one idea is that in theory a really optimized implementation might do speculative instruction decoding in parallel, applying the addressing modes to the operands of future instructions simultaneously with fetching the instruction implementations, and it could do that without looking at the module info table (which is harder than looking up a hardcoded format in a hidden info table hardcoded into the decoder, which we can't do because we have custom instructions). I guess that's nice; it goes along with my whole Oot Brain parallelization thingy, although i don't know enough about this stuff to know if that capability would really be practically useful. Anyhow, i'm persuaded, so let's keep either format bits or opcode ranges like we had before.

So the decision remains, do we want the format bits to be part of the opcode or not? If they are, then we can have density in each format, but not over the whole opcode; if they are not, then we can have density over all opcodes, but we only have 512 opcodes total rather than 512 per format. The thing is, it's not clear whether density over all opcodes matters at all; why not just have 8 lookup tables instead of 1? And on the other hand, it's not clear that we even want 512 opcodes total; that seems inelegant to me. Also, if we only had 512 opcodes total, i bet many implementations would just have a dense table for all of them; but maybe with 4k opcodes they still could do that. In the former case the table would be 512 * 8 bytes per 64-bit pointer = 4k, in the latter 4k * 8 bytes per 64-bit pointer = 32k. Hmm a 32k table is a significant portion of L1 cache, maybe implementations would avoid that.

https://github.com/MoarVM/MoarVM/blob/master/docs/extops.markdown says MoarVM has 16-bit opcodes but only uses about 512. https://github.com/perl6/nqp/blob/master/docs/ops.markdown says NQP only uses about 268 opcodes. But the opcode list at https://github.com/MoarVM/MoarVM/blob/master/src/core/oplist gives about 740 normal opcodes and about 60 special ones (for the specializer or optimizer, i think). But they have stuff like 'isprime' and 'mkdir' and 'stat_time' in MoarVM, which is probably more than we want.

JVM makes do with 1-byte opcodes. So does LuaJIT.

I don't think it's crazy to restrict to 512. Although the whole 'custom opcode' idea helps, too many opcodes is still not simple/beautiful, and furthermore, i like the idea of implementations planning for a max 4k opcode jump table much better than for 32k. On a machine with 16-bit addressing such as the 6502-based Apples, 512*2 bytes = 1k for the lookup table, which is still too big considering that BASIC implementations used to take only 4k-8k.

One remaining issue is, if we are restricting to 512, why not go a little further and restrict to 256 (one byte)? I have a feeling that on many platforms, there may be some efficiencies to using lookup tables indexed with a single byte as opposed to 9 bits, and probably we'd have memory savings in various program analysis data structures. In this case we'd have an extra 'flag' bit available again.

Hmm, but now that i think of it, we don't really need separate formats for I and IO as opposed to III and IIO, b/c recall that for absolute jumps to labels we can just have eg multiple input operands and multiply the first one by 2^12. So we only need prefix, IIO, IIB, BBB, which is 4 formats (2 format bits).

So do we have 2 flag bits? hmm..

---

should we do one of:

the latter would allow one opcode dispatch table per program, rather than one per module

the former would allow modules that don't use many opcodes to have small dispatch tables

the latter would be a much bigger win. This would really change how modules work, though; no longer could you treat modules like libraries to be plugged together; they are now just rigid subcomponents of one program, or at least, they cannot interoperate with modules that use different OA 'ISAs'.

now, if you have a bunch of modules with the same opcodes, the OA implementation could detect that and only use one dispatch table. So maybe we wouldn't be gaining much.

so let's encourage different modules to use the same mappings for custom instructions to opcodes.

i guess that means removing the import form that individually imports custom instructions to whatever opcodes you want. The import will just import all custom instructions from a module. If you want to mix-and-match you'll have to cut-and-paste from that module.
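
To illustrate why whole-module imports make table-sharing easy, here's a hypothetical sketch: if two modules import the same custom-instruction library wholesale, their opcode mappings come out identical, and an implementation can key a cache of dispatch tables on the mapping. The names and the assignment scheme below are invented for illustration:

    # Hypothetical: each module imports ALL custom instructions from a library
    # module, so the opcode mapping is a function of the library alone.  An
    # implementation can then cache one dispatch table per distinct mapping.

    def import_all(core_ops, library_ops):
        """Assign opcodes to every custom instruction in a fixed order."""
        mapping = dict(core_ops)                       # core opcodes are fixed
        next_op = max(core_ops.values()) + 1
        for name in sorted(library_ops):               # deterministic order
            mapping[name] = next_op
            next_op += 1
        return mapping

    CORE = {"ADD": 1, "JMP": 2, "CPY": 3}
    COW_LIB = {"COW_READ", "COW_WRITE"}

    module_a = import_all(CORE, COW_LIB)
    module_b = import_all(CORE, COW_LIB)

    # identical mappings -> the implementation only needs one dispatch table
    tables = {}
    for mapping in (module_a, module_b):
        key = tuple(sorted(mapping.items()))
        tables.setdefault(key, object())               # stand-in for a real table
    assert len(tables) == 1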

---

random assembler program/syntax examples:

http://luajit.org/dynasm_examples.html

---

probably should learn what all this means sometime:

http://article.gmane.org/gmane.comp.lang.lua.general/58908

---

is there any use for a JIT flag on instructions themselves?

http://wiki.luajit.org/Bytecode-2.0 indicates that "Function headers" have flags for JIT compilation, but apparently not the CALL instructions. So maybe there is no use. The JIT could add metadata instructions at the start of functions to use as a 'function header'.

Otoh one would think that it would be useful to JIT compile particular calls, not just functions in general. In which case the flag would be useful. So maybe reserve one bit for the implementation's use.

---

to reiterate, some other encoding ideas:

if we permit stack machines (pop all inputs, push output), or if we permit dropping the first operand and instead having one of implicit destination: accumulator register, or implicit destination: stack register, then we also get:

out of these, the smallest that would seem to fulfill the criteria of:

requires at least 4 + 3 x (2 + 3) = 19 bits

are:

are these better than the 64-bit-width encoding above? It depends:

sounds do-able, i'm not sure if desirable. The thing is that the only benefit here is code density, but OA is not intended to have high code density (you could design a more complicated encoding and instruction set if you wanted that, and compile to it).

if we have 2 instruction encoding format bits, then we could either:

can some of the others fit into that? Let's first look only at the 32-bit ones, b/c they fulfill the criteria mentioned above:

at this point we have only 16 or 32 opcodes and locals, so we're certainly getting into the territory where this might be a good 'high code density' subformat, but not good as the only format, given our goals with OA to preserve HLL intent in single instructions.

But if you were to have a 'high code density' subformat, then why not try to go even smaller, on the assumption that only the most-used opcodes and registers would be expressible in the subformat; and maybe even stack machines would be useful here. Let's assume 2 format bits:

(note these are not all possible combinations, i tried to only include potentially useful ones, and i kinda winged it so i may be missing some)

i think if we had 1 format bit for multiple instruction encoding formats, we'd have the 64-bit one described above and the natural-language-y encoding, and if we had 2 format bits for multiple instruction encoding formats, we'd use 1 for the 64-bit one described above, 1 for the natural language-y encoding, and 1 for something smaller than 32 bits. For the small one, that leaves:

but if we have 2 format bits, the small one only needs to use 1 format bit, so for 8 bits we have:

none of those are that appetizing; we'd have to pack 8 of these instructions into a 64-bit field to keep alignment, and i doubt there will be many 8-instruction sequences that either only use stack ops, or only use the top 4 opcodes, or only use one of accumulator or stack destination throughout (and we don't want the compiler to have to spit out different instructions for this case, that's too hard; if that difficulty were worth it, then a different ISA designed for code density rather than simplicity would be worth it).

So with 1 format bit and 16 bits, we have:

i bet the most useful out of those would be (eliminating stack machine and 6-bit opcodes and 2-bit operands):

Would 4-instruction sequences that fit either of those constraints be common enough to be worth the extra complexity of adding this instruction encoding format? The 1-bit addressing mode is not that bad because it would be trivial to emit LOADK instructions. The accumulator or stack destination might be uncommon unless instructions were written with that in mind, though. Note that instead of 'accumulator or stack destination' you could use instructions designed in the 2-operand style, eg instead of c = a + b, you'd have b = a + b. But would aligned stretches of 4 instructions that fall into that style naturally be common if we just want to use this for compression? Probably not.

So the only remaining choice is

The biggest problem with this is probably only having 3 bits for the opcode. Will there naturally be many aligned stretches of 4 instructions that only contain the top 8 instructions? Assuming that PUSH/POP are memory-mapped, https://www.strchr.com/x86_machine_code_statistics suggests that these might be something like: cpy call cmp add lea test je xor; including the implicit 'push' and 'pop', these would be 76% according to those stats. If instructions were I.I.D. then .76^4 ~= 1/3 would be the chance of seeing a run of 4 of these in a randomly drawn set of 4 instructions (since we randomly drew, i guess aligned is taken care of?). If we realized a savings of 75% in these cases (64 bits*4 -> 64 bits), the expected length would be 1*2/3 + .25*1/3 = 75%, ie a 25% savings in code density. (how much would one more bit in the opcode help? if we got 8 more common instructions, that would be another 8% according to the table on the webpage, and .84^4 is almost .5, so 1 - (.5 + .5*.25) = 37.5%, a .375/.25 - 1 = 50% increase in savings.) The real savings would likely be less because instructions are not I.I.D. and it's likely that it's very common to have a run of 'common' instructions broken by at least one 'uncommon' instruction; also we are assuming here that operands with more than 3 bits never occur, and we are neglecting the addition of the LOADKs and the need to reserve a register for that purpose.

Would using a 32-bit shorter format be any better? The table on the website suggests that the top 20 opcodes account for about .89 of the total. 1 - ((1 - .89^4) + .5*.89^4) = .31, so we'd only save about 30% this way; a little better, but not much.
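
Here are those back-of-the-envelope numbers spelled out as a tiny calculation (same i.i.d. simplification as above; the frequencies are the ones quoted from the strchr.com table):

    # Rough code-density estimate: an aligned run of 4 'common' instructions
    # compresses to a fraction of its size; any other run of 4 stays full size.

    def expected_savings(p_common, compressed_fraction=0.25):
        p_run = p_common ** 4                  # chance a run of 4 is all-common (i.i.d.)
        expected_len = (1 - p_run) * 1.0 + p_run * compressed_fraction
        return 1 - expected_len

    print(round(expected_savings(0.76), 3))        # 3-bit opcode, top 8 ops   -> ~0.25
    print(round(expected_savings(0.84), 3))        # 4-bit opcode, top 16 ops  -> ~0.37
    print(round(expected_savings(0.89, 0.5), 3))   # 32-bit subformat, top 20  -> ~0.31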

What if we got rid of the 1-bit addressing modes? This reduces our savings because instead of compressing 4 large instructions to 4 small instructions, we are compressing 4 large instructions to 6 small instructions, or something like that, if we assume that typically we would have to add 2 LOADs/STOREs for each run of 4 instructions (and we are still ignoring the occasional LOADKs). If we assume that we only realize a 50% savings, then we can't beat the calculations for the 32-bit case, or about a 30% savings.

If code density were a goal or if implementation simplicity were not one this might be worth it, but in our case i think a <30% savings in code size is not worth the extra complexity, especially since we have no empirical OA data with which to test these savings; and given that if code density were a goal AND implementation simplicity were not one, a more complicated instruction encoding would be worth it.

If we leave one bit in the instruction for the use of the implementation, then some implementations could implement one of the schemes here, or intermix compressed 'macroinstructions' with ordinary instructions, and maybe it could somehow help with JITing too. So let's do that.

just thinking about 16 bits, here is why it doesn't work out for us, to reiterate what we said before:

so that's 4 + 3*(2 + 3) = 19

now what about just switching everything to 32 bits? Now we can also accommodate an addr mode in the opcode field, allowing for dynamic first-class functions.

The alternatives are:

Really, neither of those are crazy, but they do have their downsides compared to our 64-bit scheme.

Is 64 registers enough? If we are treating them as registers, yes; the 'related work' section of The Need for Large Register Files in Integer Codes cites many other papers that think 64 or 32 is enough (even though they themselves argue that more is better in some cases). However, if we are treating them as variables, Table 7 of the same paper gives various programs with more than 64 global variables (eg perl, go). We are thinking of these more as locals than as globals, but it is still unsettling.

So in conclusion a 32-bit encoding would not be crazy and would even be a little simpler, but a 64-bit encoding lets us:

The cost is, however, perhaps a near-50% decrease in code density, which is large. Otoh, without empirical data to profile, it's not clear that the efficiency gains of higher code density would not be offset by losses from having a more spartan set of opcodes. And, as i've said before, if we really wanted an efficient ISA and didn't care about implementation complexity, we could probably design a much better one anyway.

---

a little later; i still think 'the accumulator or stack destination might be uncommon unless instructions were written with that in mind', and similarly that the stack machines would require some rewriting, but actually i think the obvious opportunities for an easy rewrite into this form would be pretty common so it may still be worth it: that is to say, expressions. An expression is a tree in which each result is used exactly once. The interior nodes of an expression tree are raw stack machine operations; the inputs and outputs are from/to other interior nodes and are specified implicitly by the RPN ordering of the instructions. In programs you often see expressions with more than two or three terms, so the interior nodes of these could be compiled to an RPN-style sequence of opcodes.

At the root and the leaf nodes we still need to do something else, though; at the leaves, we need to get values from elsewhere and load them onto the stack, and at the root we need to pop the stack and store the result elsewhere. In 'stack machine mode' these would be implemented by LOAD (PUSH) and STORE (POP), which would be the only instructions that need to mess with operands and addressing modes.
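
A toy sketch of that expression-to-stack-code idea; the opcode names and the LOAD/STORE framing are illustrative, not the actual SHORT format:

    # Hypothetical: compile an expression tree to RPN stack code.  Leaves become
    # LOAD (push a named local onto the stack); interior nodes become pure stack
    # ops; the root's result is finished off with a single STORE.

    def compile_expr(tree, out):
        if isinstance(tree, str):
            out.append(("LOAD", tree))            # leaf: fetch a local, push it
        else:
            op, left, right = tree
            compile_expr(left, out)
            compile_expr(right, out)
            out.append((op,))                     # interior node: pops 2, pushes 1

    def compile_assignment(dest, tree):
        code = []
        compile_expr(tree, code)
        code.append(("STORE", dest))              # root: pop result into a local
        return code

    # d = (a + b) * c
    code = compile_assignment("d", ("MUL", ("ADD", "a", "b"), "c"))
    assert code == [("LOAD", "a"), ("LOAD", "b"), ("ADD",),
                    ("LOAD", "c"), ("MUL",), ("STORE", "d")]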

So could we fit this 'stack mode' stuff into 8 bits? Really 7 bits, since we'd want a format bit.

Well, if we still want an addr mode on the opcode it's pretty tight. 3 bits for the addr mode, and then we only have 4 bits for the opcode field. No bits are left for flags, for boxing/unboxing, or for labeling the stack machine instructions with offsets to make the AST which is implicitly represented by the RPN easy to traverse.

Plus, we want variant instruction encodings for LOAD and STORE, so remove another bit for that. So 3 bits for the addr mode and 3 bits for the opcode. We have 6 bits left to encode LOAD and STORE. Reserve one bit to choose between them, now five bits left. Now if we have a 3-bit addr mode, we only have 2 bits to choose the register to LOAD or STORE into! That's not too great.

So we'd pretty much have to ditch the 3rd addr mode bit. Which would leave us with:

Now with only 8 registers, we aren't really going to be using these things as local variables; they would only be used as registers, which means that before issuing a sequence of these stack instructions, things would have to be CPYd from wherever they really are stored into the 8 usable registers. In which case, why not have that code place them directly on the stack? Similarly for STOREing. So we can ditch the LOADs/STOREs, and have one of:

Both of these are good choices. The 2/5 (2-bit addr mode) gives us 32 opcodes, and the 2 bit addr mode doesn't give us a chance to represent things like boxing, atomicity, asynchrony (vs blocking), or laziness, but we don't need some of these anyways when we're just fetching a function (eg we can assume that all function pointers are 'boxed' and that the stack is not in shared memory so no atomicity is needed), and for others we could perhaps use a language convention, and reify function application if we need to do the other thing. Eg the convention could be to apply functions strictly synchronously (block until they are done), and then if you want to apply a function lazily or to get a promise you have to use the APPLY opcode; such a convention lets us do low-level arithmetic fast, and since we can fit 8 of these operations in one 64-bit packet, there is plenty of room for a few APPLYs when dealing with HLL functions. With 2 addr mode bits we can still do immediate (the opcode is right there in the instruction, and is for a built-in instruction), constant table (the opcode is right there in the instruction, and is for a per-module library instruction), direct (we are applying a first-class function stored in a register), or indirect (we are applying a first-class function stored in the heap).

The 3/4 (3-bit addr mode) only gives us 16 opcodes but we don't have to think about how to make up for the loss of the custom addr modes.

I'm leaning towards the 2/5 here, as the whole point of this format is to be compact/frugal.
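
A decoding sketch for the 2/5 split; exactly where the pack-level format bit and any spare bits go is still undecided, so this just assumes 8 even bytes, each with a spare bit, a 2-bit addressing mode, and a 5-bit opcode/operand field:

    # Hypothetical SHORT-format decode: a 64-bit packet holds 8 sub-instructions,
    # one byte each.  Assumed byte layout: [1 spare bit][2-bit addr mode][5-bit opcode].

    MODE_IMMEDIATE, MODE_CONSTTABLE, MODE_DIRECT, MODE_INDIRECT = range(4)

    def decode_packet(packet64):
        subs = []
        for i in range(8):
            byte  = (packet64 >> (8 * i)) & 0xFF
            spare = (byte >> 7) & 0x1
            mode  = (byte >> 5) & 0x3           # 2-bit addressing mode
            op    =  byte       & 0x1F          # 5-bit opcode/operand field -> 32 values
            subs.append((spare, mode, op))
        return subs

    # build a packet whose first sub-instruction is (spare=0, mode=DIRECT, op=7)
    first = (0 << 7) | (MODE_DIRECT << 5) | 7
    packet = first                              # remaining 7 bytes left as zero
    assert decode_packet(packet)[0] == (0, MODE_DIRECT, 7)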

If we had a 64-bit 'middle format' and used this as a 'small format' then the structure of code would tend to alternate between these formats in a similar manner to how, in Haskell or Lisp, you alternate between 'let __ in' and then actual expressions which have many terms. The 'let's correspond to the 64-bit format being used to select local variables and load them onto the stack, and then in most cases the actual expression could be specified in one 64-bit packed field.

i kinda like that...

in fact it's even better than that, because we only need 1 format bit for the whole 64-bit packed instruction. So we get an extra bit for 7 of the 8 instructions within the pack.

---

might enable more runtime parallelism if we got rid of one of the signature classes (probably BII) and instead had a 2-operand format. This would allow the decoder to decode the address of JMPs ahead of time. Otoh that's not really useful for parallelism because it's just an integer; you still have to get the address associated with the actual label, and you're not going to do that until you see it's a JMP. So no.

---

" Condition codes vs. condition registers vs. compare&test

...

Granularity of atomicity ... Transactional memory? "

---

todo mb get rid of ALU regs and have 'compare' and 'test' instead?

---

if we have ALU regs, but this doesn't show up in the signature class, then doesn't that complicate dataflow analysis a lot?

---

i skimmed these:

http://people.cs.clemson.edu/~mark/conditioncodes.html

https://community.arm.com/groups/processors/blog/2010/07/16/condition-codes-1-condition-flags-and-codes

http://people.ee.duke.edu/~sorin/ece152/lectures/2.2-isa.pdf

http://opencores.org/or2k/Branching_and_condition_codes

https://gcc.gnu.org/ml/gcc/2001-01/msg01806.html

---

ok i guess i'm convinced that we don't need to, and shouldn't, mix ALU regs with branching.

but how could we do arithmetic without at least a CARRY? would you actually do compares before each potentially-carry-needing instruction?

yes (or afterwards): https://en.wikipedia.org/wiki/Status_register#CPU_architectures_without_arithmetic_flags
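
Sketch of the flag-less alternative that the Wikipedia page describes: compute the carry explicitly with an ordinary unsigned compare after the add (roughly what MIPS code does with SLTU), instead of reading a status register. A 64-bit word size is assumed here:

    # Carry without a flags register: after an unsigned wrapping add, the result is
    # smaller than either operand exactly when the add wrapped, so an ordinary
    # 'set-if-less-than' recovers the carry bit as an explicit value.

    WORD = 1 << 64

    def add_with_carry(a, b):
        result = (a + b) % WORD          # wrapping 64-bit add
        carry = 1 if result < a else 0   # SLTU result, a  -> explicit carry output
        return result, carry

    assert add_with_carry(2, 3) == (5, 0)
    assert add_with_carry(WORD - 1, 1) == (0, 1)   # wraps around -> carry set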

---

OK that makes me want to get rid of CZNO, for the sake of explicit dependencies. But what about ERR? if we keep ERR there seems to be no reason for ADD to not use it as a carry output.

And if we have ADD shouldn't we have at least one 'bit register'?

But then wouldn't we want 5 more bits for instruction signature class, for whether or not it might read from/write to ERR and/or the 'bit register', and whether or not it might have an error (as opposed to just an 'exception')?

---

OK, the point of the instruction signature class wasn't really to represent the instruction signature for data flow; that was just a useful side effect. The point was to say whether, during dispatch, the instruction needed things to be looked up and/or stored before/after it ran. So NO, we don't need to add bits there just to say whether or not the instruction might affect ERR and/or a carry register and whether or not it might crash.

And no, we don't need a 'bit register'. An ordinary reg can hold many things, one of which is a bit. We can have a bit data type in function signatures, though.

And yes, putting the carry into the ERR register is fine, although i dunno if i will, b/c the MIPS way sounds cleaner. Otoh this sort of thing (having another return value) is the whole point of the ERR register, so let's do it.

But i still want to get rid of CZNO.

---

i guess i'll make SHORT format have only 2 addr mode bits throughout. In most ordinary expressions, stuff like laziness and asynchrony and boxing/unboxing follows the language convention, and/or can be handled by a wrapper outside of the expression itself. Also, the point of SHORT is to increase code density, and probably most of the space in code is either taken up by high level operations, or low level ones (but which is it?), in which case one convention for boxing/unboxing will be fine.

Also, most other bytecode languages don't have any of this addressing mode/modifier stuff at all, and they get along just fine.

Code that needs to do something unusual can always just use MEDIUM format.

---