proj-oot-ootAssemblyNotes6

---

i like the idea of custom instructions.

With custom instructions, we could have only a very few core instructions that the porter must implement; the rest could then themselves be written in Oot Assembly in terms of these core instructions (or in terms of each other; they could be defined in a fixed sequence). If the porter is up to it, they could directly implement some of the other instructions too; it would be configurable which of the pure Oot implementations would be used. This also suggests to the porter which things to provide platform-level implementations of first when they want to optimize later; first, provide a platform-optimized implementation of some or all of the non-core instructions.
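
For concreteness, here's a minimal Python sketch of the dispatch this implies (all names here are invented, not part of any spec): only the core opcodes need a native implementation from the porter; every other opcode falls back to a definition written in Oot Assembly itself, unless the porter registers a platform-optimized override.

    def op_add(args, mem):                      # a core op the porter must supply
        mem[args[0]] = mem[args[1]] + mem[args[2]]

    CORE_OPS = {'ADD': op_add}
    NATIVE_OVERRIDES = {}                        # porter may fill this in later
    OA_DEFINED_OPS = {                           # non-core ops, defined in terms of core ops
        'DOUBLE': [('ADD', lambda a: (a[0], a[1], a[1]))],
    }

    def execute(op, args, mem):
        if op in CORE_OPS:
            CORE_OPS[op](args, mem)
        elif op in NATIVE_OVERRIDES:             # optional platform-optimized version
            NATIVE_OVERRIDES[op](args, mem)
        else:                                    # interpret the pure-OA definition
            for (sub_op, remap) in OA_DEFINED_OPS[op]:
                execute(sub_op, remap(args), mem)

    mem = {0: 0, 1: 21}
    execute('DOUBLE', (0, 1), mem)               # expands to ADD 0, 1, 1
    assert mem[0] == 42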

That makes us into sort of the language of a macro assembler. Except that it's not that we necessarily have a compiler that must sub in the macros; we can just directly interpret them, so they don't add much to implementation complexity (and hence porting). Although mb we want (compile-time) macros also :)

It's also reminiscent of Forth; the custom opcodes are like Forth 'words' (except that we are register-based instead of stack-based; and except that the implementation is simplified by having room for only a set number of numerical opcodes, rather than actually using words like in Forth).

If we have custom instructions, we appear to have at least 2 and maybe 3 kinds of function calling:

---

n-ary array internal representation:

---

i'm sold on custom instructions. but this makes me want to go back to one of my other fav ideas, custom addr modes.

---

an example awesome new addr mode:

COW (copy-on-write) behind-the-scenes processing could be done in the addressing mode; eg when you READ an immutable datastructure with the COW addressing mode, behind the scenes the processor uses a structure that shares subtrees via pointers with its predecessors; and when you WRITE an immutable datastructure with the COW addressing mode, behind the scenes the processor only copies the new stuff, and creates a structure that shares subtrees via pointers with its predecessors
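
A minimal sketch of the copy-on-write idea itself (not the addressing-mode machinery, which is hypothetical): a write copies only the nodes along the written path and shares every untouched subtree with the predecessor structure.

    def cow_write(node, path, value):
        """Return a new tree that differs from `node` only at `path`."""
        if not path:
            return value
        key, rest = path[0], path[1:]
        new_node = dict(node)              # shallow copy: siblings stay shared
        new_node[key] = cow_write(node[key], rest, value)
        return new_node

    old = {'left': {'x': 1}, 'right': {'y': 2}}
    new = cow_write(old, ['left', 'x'], 99)
    assert new['right'] is old['right']    # untouched subtree is shared, not copied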

another one:

GC-aware addressing: hits whatever read barriers and write barriers are needed

another one:

subscription-aware addressing: upon a change to a watched address, notifies subscribers via an event

(of course we'll probably just include those things in the core machine, but the fact that this sort of thing exists makes one suspect that custom addr modes could be useful)

(also, that isn't an argument for CUSTOM addr modes per se, it's just an argument for more addr modes than the basic 4)

a reason to prefer CUSTOM addr modes: like custom instructions, it allows easier porting while preserving readability of Oot Assembly. Imagine if, instead of putting COW into the core machine, we had a separate custom COW addressing mode. This would mean that Oot Assembly code could still be written without COW stuff cluttering up the code (readability), but the implementer/porter (the person making a new Oot Assembly interpreter or compiler on a new platform) only has to implement the 'core' addressing modes; the COW stuff would already be implemented in pure Oot Assembly. Not sure if we'll do that (probably not), but the possibility makes the flexibility provided by custom addr modes more attractive.

also, in some cases adding more bits to the addressing modes is more efficient than adding more bits to operand-specific flags, because generally flags aren't orthogonal to addressing mode; for example, i proposed a boxed/unboxed flag, but what does it mean to have a 'boxed' immediate constant in an input operand? boxing implies multiple storage locations (ie bytes or words), with the metadata in one location and the data in another (unless you go for something like Hoon's idea of explicitly encoding data structures into single large integers, which i don't).
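
Purely illustrative: a 'boxed' value needs (at least) two storage slots, one for the type tag/metadata and one for the payload, whereas an unboxed value is just the payload itself; an immediate constant encoded inside the instruction has nowhere to put that metadata slot.

    unboxed = 5                           # the bare value: one storage location
    boxed = {'tag': 'int16', 'value': 5}  # metadata in one slot, payload in another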

---

ok so i'm sold on custom addressing modes. let's eliminate the operand flags and use them to have more addressing modes, most of which are custom. The idea being that flags that we really needed (eg boxing/unboxing) can just be cross-producted with the addressing modes, and this is slightly more efficient too because some combinations don't make sense (eg 'boxed' flag on an input operand in immediate addressing mode). This seems a little silly because it's the opposite of what i was doing with instruction polymorphism, where i have explicit types even though it would be more efficient to have the opcodes be the direct product of instruction and type. But there's only 16 of them so maybe it isn't so bad.

since we're going all custom, let's make the remaining two opcode flags custom as well.

also, although i've already pretty much picked out the types, let's make them be 'custom' as well, just in the sense that core Oot Assembly itself doesn't define or use most of them.

---

so, new proposal:

custom addr modes, custom opcodes, custom types, custom opcode flags

to allow efficient compilation, these things are all static (compile-time) and lexically scoped; eg you can't change the custom stuff at runtime

?: kinda like Forth with (infinite) registers, only one single stack (actually mb not; see below), and more metaprogramming (b/c you can metaprogram the addressing mode)

why only one stack: because it's nice to have one side of memory have the stack, and the other side be the heap. Eg if the stack starts at 0 and grows upwards and the heap starts at the high end of memory and grows downwards. You can use the stack as both a call stack and a data stack by pushing data onto it during each call. One issue with this is that when you return data, you have to pop the return address off of the stack; meaning that if you already have all the return data computed and pushed onto the stack, you have to move it all around before returning, which is a waste, especially if some data is just being 'returned' over multiple call levels (stack frames), whereas other data is being manipulated during the returns (so it's not a tailcall).
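
A toy sketch of that shuffle (layout invented for illustration): with one combined call/data stack, the callee's computed return values sit on top of the buried return address, so returning means popping them, digging out the return address, and re-pushing them.

    stack = ['caller locals', 'return-address']     # state on entry to the callee

    stack.append(10)                                # callee computes its two
    stack.append(20)                                # return values on the stack

    results = [stack.pop(), stack.pop()]            # pop everything above the return address
    return_address = stack.pop()
    stack.extend(reversed(results))                 # re-push the results for the caller
    assert stack == ['caller locals', 10, 20]
    assert return_address == 'return-address'       # control would now jump here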

otoh for readability it might be good to have two virtual stacks, and the constraint that the data stack starts empty (except for params) upon each fn call, and must be empty (except for return args) upon each function return. This allows the implementation to deal with the complexity of two stacks if desired.

todo: http://www.forth.org/svfig/Len/softstak.htm

---

so this proposal is:

4 x (4 type bits + 4 addr mode bits + 8 operand-or-opcode bits) = 64 bits. In the first field (the opcode field), the '4 type bits' aren't type bits: the first two are 'format bits', the last two are 'custom flag bits'.
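
A hedged decode sketch of this layout (field order and bit order are my assumptions, not settled): four 16-bit fields, each split into 4 type bits, 4 addressing-mode bits, and 8 operand-or-opcode bits, with the opcode field's 'type' nibble reinterpreted as 2 format bits + 2 custom flag bits.

    def decode(instr64):
        """Split a 64-bit instruction into its four 16-bit fields."""
        fields = []
        for i in range(4):
            field = (instr64 >> (48 - 16 * i)) & 0xFFFF      # most-significant field first
            type_bits = (field >> 12) & 0xF
            addr_mode = (field >> 8) & 0xF
            payload   = field & 0xFF                         # operand or opcode byte
            fields.append((type_bits, addr_mode, payload))
        opcode_field, operand_fields = fields[0], fields[1:]
        fmt          = (opcode_field[0] >> 2) & 0x3          # first two 'type' bits: format
        custom_flags = opcode_field[0] & 0x3                 # last two: custom flags
        return fmt, custom_flags, opcode_field[1], opcode_field[2], operand_fields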

The format bits (which are the first two bits of the instruction) are as follows:

Note that most opcodes are only compatible with one value of the format bits.

core addressing modes: immediate, constant table, direct, indirect

there are 16 addressing modes in total, so this leaves 12 custom addressing modes. Some suggestions:

memory map and memory-mapped special locations:
0: 'zero' register (send output to this register when storing the output is unnecessary; always reads zero)
1: PC (program counter) (maybe combine this with 0, since writing to the PC is not allowed (for clarity/making it easy to see where control-flow-altering instructions are))
2: status register
3: stack0 push/pop
4: stack1 push/pop
5: ?
6: ?
7: ?
0x0008-0x000B: direct read/write access to the first 4 stack0 locations (without pushing or popping)
0x000C-0x000F: direct read/write access to the first 4 stack1 locations (without pushing or popping)
0x0010-0x00FF: memory (that can be addressed by the opcode field)
0x0100-0xFEFF: main memory (that cannot be addressed by the opcode field)
0xFF00-0xFFFF: see following parenthetical phrase (x - 0xFF00 is the offset to the beginning of the phrase)

note: memory locations might be able to hold values larger than 16 bits.

note: this is not to say that the virtual machine can only access 64k of memory. A pointer can be placed in any of these memory locations, and that pointer can point to somewhere else. Likewise, the PC can point to memory locations outside of this map. These alien memory locations are always considered to be 'higher' than this memory (eg if compared, the alien memory address is 'greater than' 0xFFFF) but don't have to actually be that way in the target machine. The alien memory addresses can be incremented as usual. You might say that the 64k of provided 'memory' is really 64k 'registers'. These 64k locations are the only ones that can be accessed in direct addressing mode (?maybe more if the 24-bit operands in forms 01 and 10 are used? but mb the Oot Assembly machine model only guarantees the presence of 64k); however other locations can be accessed via indirect addressing mode. In fact, only the first 255 of these are accessible in direct addressing mode by ordinary, 3-operand-format instructions.

note: there are two stacks, intended to be used as data stack (stack0) and call stack (stack1) but the result is undefined if a program attempts to access beyond the end of the stack; this means that these can be implemented as a single stack

core types:
pointer
int16
other/custom static type (has already been statically typechecked)
dynamic (boxed/wrapped, tagged type; runtime typechecking requested)

there are 16 types in total, so this leaves 12. Suggestion:

boolean/single bit
unsigned int16
byte
float64
arbitrarily large/precise rational
unicode string
array
cons cell/linked list/binary tree
graph/hash/dict (opaque, as opposed to AST)
function/callable, thunk/promise/future/lazy value
AST/expression/block/logic
greenthread/process/channel(/actor?)

(note that some implementations might not support all of these types; eg an embedded implementation might not support floating point, arbitrarily large/precise rational, unicode)

(note that when i say int is 16 bits, the implementation is free to make it MORE than 16 bits; int here is just AT LEAST 16 bits)

suggested custom flags:

in the boxing:

some special(ish) instructions:

lexically-scoped-compile-time-only (although they can easily be interpreted)-metaprogrammy-instructions:

---

what's the max number of identifiers permitted by LLVM?

http://clang.llvm.org/doxygen/Serialization_2Module_8h_source.html

which is a C++ file, contains:

" / \brief The number of identifiers in this AST file. unsigned LocalNumIdentifiers?; "

https://en.wikipedia.org/wiki/C_data_types#Basic_types says 'unsigned' means 'unsigned int' which means 'at least 16 bits'. http://en.cppreference.com/w/cpp/language/types says that 'int' (and therefore 'unsigned', i think) has at least 16 bits, but usually 32 bits.

so, having 16-bit register identifiers in Oot doesn't seem too constraining/unusual, as Clang apparently assumes that the number of identifiers can fit in an unsigned int, which is only guaranteed to be 16 bits.

---

pretty cool, but i'm still dissatisfied. We want Oot Assembly to be somewhat similar to Oot Core, don't we? But the limitation of 8-bit operands means that identifier references in some Oot Core code will translate to single operands, while other identifier references in the same Oot Core code might translate to multi-instruction steps.

So let's look at the previous as a good exercise to focus the mind, and insist that we allow all 3 operands to be 16 bits.

we either end up back where we started, 4 16-bit fields, with at least 6 addressing mode bits in the first field, making it non-uniform, with no room for polymorphism, and with little room for custom addressing modes unless we don't give the opcode an addressing mode; or we move to a bigger representation.

let's look first at embracing non-uniformity of the first field. As we saw above, putting an addressing mode in the first field doesn't really allow beautiful virtualization because you have no room to put the virtualized addressing mode; so you still need explicit virtualization instructions, at which point, why bother with an addressable opcode (which will surely lead to inefficient implementations anyways, and will surely waste a lot of space as it will almost never be used).

if we really want custom addressing modes, we need at least 3 bits for addressing mode, so 3 operands x 3 bits = 9 bits, and (with 3 x 16-bit operands in a 64-bit instruction) we have at most 7 bits left for the opcode. If we still want 2 format bits, that's 5 bits for the opcode; alternately, we could carve up a 7-bit opcode space and say that each opcode only has one format, but that format is easy to read off of the first 2 bits. I like that the best.

We can still have some custom opcodes if we want, but it's not too much worse to just CALL these guys, esp. if the textual representation allows you to write procedure calls as looking just like ordinary instructions. Well.. actually it is a lot of trouble, because you have to burn one operand on the address being called. So have some custom opcodes.

We don't have room for custom flags, but we can have prefixes that execute user-defined code instead.

Custom addressing modes still seem cool.

---

old:

Note that this means that there are exactly 32 opcode slots available for each format. Reserve half of these for custom ops, so 16 core and 16 custom. If this isn't enough, consider eliminating class 01 (one operand), as it probably isn't used much (just JMP, right?), and increasing the sizes of classes 10 (two operands) or 11 (three operands) (JMP can become JMP X+Y). Probably safest to increase the 3 operands, since if you need a 2-operand instruction you could always phrase it as a 3-operand and then ignore one operand. In Lua VM [1] about 15 of 35 2- or 3-operand instructions were 2-operand (there were also about 2 1-operand instructions, JMP and CLOSE); so the 3s are a little more than half. So yeah, just eliminate one-operands and replace with another three-operand class.

---

So we have:

so this proposal is:

7 opcode bits, where the format can be read off from the first 2
3 x 3 addressing mode bits
3 x 16-bit operands

The format bits (which are the first two bits of the opcode) are as follows:

Half of 00 and 01, and all of 11, are reserved for custom ops. So there are 16 builtin prefix modifiers, and 16 custom prefix modifiers; 16 builtin two-operand instructions, and 16 custom two-operand instructions; and 32 built-in three operand instructions, and 32 custom three operand instructions.

core addressing modes: immediate, constant table, direct, indirect

there are 8 addressing modes in total, so this leaves 4 custom addressing modes. Some suggestions:

memory map and memory-mapped special locations:
0: 'zero' register (send output to this register when storing the output is unnecessary; always reads zero)
1: PC (program counter) (maybe combine this with 0, since writing to the PC is not allowed (for clarity/making it easy to see where control-flow-altering instructions are))
2: status register
3: stack0 push/pop
4: stack1 push/pop
5: ?
6: ?
7: ?
0x0008-0x000B: direct read/write access to the first 4 stack0 locations (without pushing or popping)
0x000C-0x000F: direct read/write access to the first 4 stack1 locations (without pushing or popping)
0x0010-0x00FF: memory (that can be addressed by the opcode field)
0x0100-0xFEFF: main memory (that cannot be addressed by the opcode field)
0xFF00-0xFFFF: see following parenthetical phrase (x - 0xFF00 is the offset to the beginning of the phrase)

note: memory locations might be able to hold values larger than 16 bits.

note: this is not to say that the virtual machine can only access 64k of memory. A pointer can be placed in any of these memory locations, and that pointer can point to somewhere else. Likewise, the PC can point to memory locations outside of this map. These alien memory locations are always considered to be 'higher' than this memory (eg if compared, the alien memory address is 'greater than' 0xFFFF) but don't have to actually be that way in the target machine. The alien memory addresses can be incremented as usual. You might say that the 64k of provided 'memory' is really 64k 'registers'. These 64k locations are the only ones that can be accessed in direct addressing mode (?maybe more if the 24-bit operands in forms 01 and 10 are used? but mb the Oot Assembly machine model only guarantees the presence of 64k); however other locations can be accessed via indirect addressing mode.

note: there are two stacks, intended to be used as data stack (stack0) and call stack (stack1) but the result is undefined if a program attempts to access beyond the end of the stack; this means that these can be implemented as a single stack

there are no explicit types nor polymorphism; the opcode determines the type signature of the instruction. Implicitly, the following are suggested:

core types:
pointer
int16
other/custom static type (has already been statically typechecked)
dynamic (boxed/wrapped, tagged type; runtime typechecking requested)

Suggestion for another 12 implicit types:

boolean/single bit
unsigned int16
byte
float64
arbitrarily large/precise rational
unicode string
array
cons cell/linked list/binary tree
graph/hash/dict (opaque, as opposed to AST)
function/callable, thunk/promise/future/lazy value
AST/expression/block/logic
greenthread/process/channel(/actor?)

(note that some implementations might not support all of these types; eg an embedded implementation might not support floating point, arbitrarily large/precise rational, unicode)

there are no flags. instruction-level flags can be emulated by custom prefix ops. operand-level flags can be emulated by custom opcodes.

suggestion for some custom prefix ops:

in the boxing:

some special(ish) instructions:

lexically-scoped-compile-time-only (although they can easily be interpreted)-metaprogrammy-instructions:

---

the idea is:

but most of all, we want Oot Assembly as a thought exercise to focus our mind on the core of Oot (while also perhaps keeping our feet almost grounded by getting our thinking closer to implementation realities just a little bit).

---

so i keep talking about 'non-standard models of the integers' and hierarchical memory, but i don't think we really need that. Just use pointers. If you do an operation on 'the contents of register 7' but the operation calls for some structure, then it is implicit that register 7 actually holds a pointer to that struct.

This might matter (or might not, i haven't thought carefully) if we allowed unsafe typing, but we don't; either everything is provably typed at compile-time, or it is dynamically typed with type tags (that are checked at runtime).

Note that if the contents of register 7 is typed as a struct (of some specific type), then the semantics of 'copy register 7 into register 8' is to copy the STRUCT (with COW), not the pointer.

So we actually have 3 kinds of copying:

What happens if you try to apply ALIAS to an atomic value, e.g. the integer 5? Well, there are two obvious possibilities: either there's an error, or it replaces 5 with a pointer to a newly allocated integer in the heap. I prefer the latter, because that way the programmer doesn't have to distinguish between things that fit in registers and things that are too big to fit in registers and so are pointers. Note that this means that, to the implementation, the type int16 is really two types: int16, and &int16. The programmer just sees it as one, though; the VM transparently dereferences the &int16 when it is read or written to.
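
A rough sketch of that 'latter' choice, with invented names: ALIAS silently promotes an atomic value to a heap cell, and ordinary reads/writes transparently dereference it, so the programmer sees one int16 type even though the implementation internally distinguishes int16 from &int16.

    heap = []

    class IntRef:                          # the implementation-level &int16
        def __init__(self, index): self.index = index

    def alias(regs, r):
        value = regs[r]
        if isinstance(value, IntRef):      # already a reference: nothing to do
            return value
        heap.append(value)                 # promote the atomic value to the heap
        regs[r] = IntRef(len(heap) - 1)
        return regs[r]

    def read(regs, r):
        v = regs[r]
        return heap[v.index] if isinstance(v, IntRef) else v

    def write(regs, r, value):
        v = regs[r]
        if isinstance(v, IntRef):
            heap[v.index] = value          # transparent dereference on write
        else:
            regs[r] = value

    regs = {7: 5}
    p = alias(regs, 7)                     # register 7 now holds &int16 behind the scenes
    write(regs, 7, 6)
    assert read(regs, 7) == 6 and heap[p.index] == 6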

---

so i call these guys 'registers' a lot, but these memory locations or registers or whatever actually have the semantics of 'local variables' rather than 'registers'. What does that mean?

---

ok a problem with the above proposal, in which there is no addressing mode on the opcode, is that 'custom instructions' with two operands (i call this 'call3') can be defined at compile time, but you can't dynamically have first-class call3 functions sitting in a variable and then call them.

you could add a prefix modifier instruction that does this. But that's ugly (because, a code analyzer would have to treat this as really just one instruction that does something completely different from what it looks like without the prefix modifier; which means that it must construct a different internal representation of the code than Oot Assembly; which makes me think, well that representation that it is constructing is cleaner/more regular, so couldn't we just make Oot Assembly be that representation?).

so i think we should go back to having an addressing mode in the opcode.

so, starting with the previous proposal, we need some more bits for the opcode field. If the operands are fixed at 16 bits and the addressing mode bits are fixed at >= 3, then we need 3 bits and the only place left to get them from is the 7-bit opcode field. But 7-3 = 4 bit opcodes is very little, especially if part of the point is to be able to make use of custom opcodes. I think i'd rather settle for operands of less than 16 bits.

Before i couldn't stand 8 bit operands, and it seems (given my desire for custom addressing modes and first-class call3 functions via an addressing mode on the opcode) i can't afford 16 bit operands in a 64-bit instruction, so the choices seem to be to either make the instruction bigger than 64 bits, or use 12 bits. If the instruction is bigger it should be a power of 2, so 128 bits. But then we have a ton of extra bits; we really only need 3 more. So let's do 12-bit operands.

This hurts but i did a quick survey on programming language limits on maximum numbers of identifiers at proj-plbook-plChInternalRepresentationsImpl, and although the most common modern choice is 64k there were some older versions of successful languages whose standard specified less (eg C and C++, which apparently used to be as few as 127 and 1024) and a few modern ones, eg Lua with a limit of 200 local vars. A 4k (2^12) limit is much more than those, and unlike 200 vars, which could conceivably be bumped into by humans and certainly will be by automated programming, 4k will probably almost never be bumped into by humans (unless they have really weird style) and automated programming at least has a fighting chance of avoiding it. I guess tradeoffs must be made one way or another, and it feels like 4k is enough to cause discomfort but not to kill, which means it might be a good idea; and it goes well with the sort of 'don't take more than you need'/D.I.Y./this-is-for-the-Oot-implementors-not-for-end-users/run-on-a-zillion-tiny-parallel-processors goal for Oot Assembly. I suspect that it won't be too hard for Oot implementations to work around this limit by breaking large procedures into parts and/or storing other variables in memory or on the stack; the main effect would probably be that optimizations won't work very well in those cases.

---

i call the idea of having custom (user-defined) opcodes with 3 operands 'call3'.

before i was saying that call3 should use the stack, eg executing a call3 should push the inputs onto the stack, and the call3 function should pop these and push its result onto the stack at the end, but this might make it hard for eg LLVM to trace the dataflow (i'm not sure if it actually would make it hard, i'm just guessing). So, we'd prefer to use llvm's 'CALL' function calls for this stuff (by 'this stuff' i mean where we have functions that do not need tailcalls, exceptions, or any other metaprogrammy messing with the stack; to put it another way, it may be better to make it easy to use LLVM's 'call' functionality here, since, unlike HLL calls where we want exceptions and possibly other metaprogrammy alteration of the call stack, call3 calls use a straightforward vanilla call stack.) (is that more or less overhead on poor llvm? after all, we also don't want llvm wasting time saving variables for calling conventions, etc; in fact we probably just want it to always inline CALL3s; but in any case, we have more flexibility if we at least have the option to implement CALL3s without using the (Oot Assembly-program-accessible) stack); and even if Oot Assembly was ported to a non-LLVM platform, that other platform would probably have ordinary function calls, and again we'd prefer to just use the platform's function calls for call3, again so that the platform can better optimize them. If opcode and constant-table custom (user-defined) ops are compile-time anyway, mb provide special 'syntax' (encoding) rather than just using the stack to pass operands; this makes it easier to use the llvm/platform call.

Now ultimately you still have to have a stack somewhere, because we want one call3 function to be able to call another call3 function within itself (and so on, to arbitrary depth). We don't want to just pass the arguments in registers, because then if one call3 calls another call3 it has to be aware that it is a call3 and save the registers using some sort of calling convention, which breaks the idea that call3s behave just like ordinary opcodes. But the stack can be part of the Oot Assembly implementation, or the underlying platform, and not be accessible to the program running in Oot Assembly (so there are 3 stacks; the hidden stack within the Oot Assembly implementation, used for call3 implementation, and then the data stack and the call stack which are accessible to the Oot Assembly program (which itself is implementing the HLL)).

Could reserve 2 locations in low memory for call3 inputs; or reserve 2 opcodes to load them; or the call3 functionality could place the inputs in existing locations but save the values that are currently there (transparently implementing a caller-saves calling convention). Note that the first and last options aren't that different because even if 2 locations are reserved just for call3, the call3 functionality has to transparently save what is there when one call3 calls another call3.
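
An illustrative sketch (names and reserved locations invented) combining the first and last options: two reserved locations receive the call3 inputs, and the call3 machinery transparently saves whatever was in them on a hidden, implementation-level stack, a caller-saves convention the OA program never sees, so one call3 can invoke another.

    CALL3_IN0, CALL3_IN1, CALL3_OUT = 5, 6, 7    # hypothetical reserved locations
    hidden_stack = []                            # lives in the implementation, not in OA memory

    def run_call3(mem, routine, in0, in1):
        hidden_stack.append((mem[CALL3_IN0], mem[CALL3_IN1], mem[CALL3_OUT]))
        mem[CALL3_IN0], mem[CALL3_IN1] = in0, in1
        routine(mem)                             # may itself perform run_call3(...)
        result = mem[CALL3_OUT]
        mem[CALL3_IN0], mem[CALL3_IN1], mem[CALL3_OUT] = hidden_stack.pop()
        return result

    mem = {CALL3_IN0: 0, CALL3_IN1: 0, CALL3_OUT: 0}
    def my_add(m): m[CALL3_OUT] = m[CALL3_IN0] + m[CALL3_IN1]
    assert run_call3(mem, my_add, 40, 2) == 42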

note that all this call3 stuff means that the interpreter must maintain a third stack for the call3s.

note that this distinction between low-level call3 calling and HLL stack calling is reminiscent of the LLVM call/invoke distinction.

---

so basically there are 3 constant tables: operand, opcode immediate, and opcode constant. is there any distinction between the latter two?

that is to say, each oot module gets its own constant tables (instructions and data) (but maybe not its own 'custom' ops, that might be per-program; or mb even they aren't 'custom' at all, they are part of Oot Assembly, just a part which doesn't have to be implemented directly; the reason is that the implementation is free to hardwire them).

should each module also assume that its 4k of memory is available?

or, should each 4k of memory be available separately not just for each module, but for each call? i guess so; these are really local identifiers, not machine registers, so don't worry about cost of register saving (maybe they are even on the stack, in the underlying implementation). otoh mb sometimes you'd want intra-module calls to reuse regs; this could be useful for inline-like decomposition of a large assembly routine into smaller subroutines.

So, we have two forms of call; a form in which the caller and callee each have their own 4k regs, and a form in which they share.

Mb only 'stack' call, not 'call3', gives new regs.

also mb absolute immediate and constant addresses eg in JMP are all implicitly module relative (pointers, however, can be shared between modules).

also CALL3 is not sandboxable; there is ambient authority, eg the callee has as many privileges as the caller

---

if the idea is that program analyzers can consider each single instruction in isolation (which is why i eschewed a 'call3' prefix op), and if we allow custom prefix ops, then we need to put some restrictions on prefix ops to prevent them from changing the semantics of the following instruction in ways that would break static analysis.

for example, a custom prefix op that creates a 'call3' is not okay because it could break pretty much any invariant. By contrast, a prefix op that causes the following instruction to be executed atomically is ok, because it only reduces the possibilities for what can happen (increases the invariants that we can assume).

i'm not sure exactly what requirement we want here. In any case, we're free to introduce non-custom prefix ops that break the requirement (although, as with a call3 prefix op, i'd prefer not to).

mb we want something like:

---

i notice that both the 'immediate constant' and the 'constant table' addressing modes are useless for the output operand (except in the special case when the output operand is immediate constant 0, in which case we might take this to mean 'discard the output value').

So, what should immediate or constant table mean for the output operand? should we use immediate 0 output for the zero register instead of PC? If we did that, then we wouldn't have to always say that writing to the PC was always 'discard', and then that would give us a clean solution for the call3 JMP-in-caller problem (that is, how you could program an instruction like JMP as a CALL3; because the point of JMP is to make the CALLER jump, not the subroutine). The solution is: when in a CALL3 subroutine, the PC is 'frozen' at the callsite to the current CALL3 subroutine, and writing to the PC does not immediately cause YOU to jump, but it does cause the CALLER to jump. Actually, this is still a little ugly, because now the PC is behaving differently (frozen) within call3s and within ordinary code. So mb just provide a special instruction, JMP_CALLER, for this. What it really does is just modify the RETURN address of the CALL3 (so now we have 3 special regs for call3; the two inputs, and the return address; and maybe a fourth, to put the returned value into).

another idea: maybe 'constant table' addressing mode on the output operand should mean output to particular locations within local variable structs, ie the operand represents a path, eg obj.subobj.field3 (this would otherwise require multiple steps to .__get parts of local vars, then .__get a part of those, etc); ie indirection chains. A problem with this proposal is that its usefulness is not restricted to output operands; we'd want that for reading too.

another idea: these addressing modes mean that the effective address of the operand (under some addressing mode) is both an input, and also an output. But i dont see how that would work; this would be something determined by the instruction, not the operand. Eg how could 'ADD' accept its output as an input?

another idea: could act as a flag causing a conditional branch or skip. But, one design goal is to keep those instructions that jump or skip separate, for the purpose of easy program analysis.

another idea: some other random flag. No, because if we used a whole bit for a flag, then we'd have 8 addr modes left for the output, but still have 16 addr modes for each input. But most likely the missing 8 modes are things that would be useful in both input and output, so we don't want to lose them. Instead of a whole bit, we just want to reallocate 2 modes in the output, the 'immediate constant' and 'constant table' modes.

so we have to think of something that is only useful for the output operand.

another idea: something that depends on both the output operand and one or more of the input operands.

how about: output is an offset of the input. Various ways to do this. Let i be the input operand and o the output operand. Let reg(x) be the contents of register x.

reg(i+o) # this one is pretty useless because we could have just added i and o at compile time and used direct addr mode
reg(i) + reg(o)

actually, that ignores the addressing mode of i; let's use the given mode of the input operand. Define a to be the effective address specified by the input operand i.

a + o
a + reg(o)
a + *reg(o)
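
A quick sketch of those three candidate meanings, where a is the effective address produced by the input operand, o is the raw output operand, and mem[] stands in for the register/memory array (all hypothetical):

    def out_addr_plus_const(mem, a, o):     # a + o
        return a + o

    def out_addr_plus_reg(mem, a, o):       # a + reg(o)
        return a + mem[o]

    def out_addr_plus_deref(mem, a, o):     # a + *reg(o)
        return a + mem[mem[o]]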

but wait a minute, what if the input operand addressing mode was itself immediate constant or constant table? Then we don't have an effective address. Which is actually fine as long as we don't use the expression *a; in this case i must be an offset, and that offset must be added to a pointer to yield an effective address. But wait, how could o be a pointer? It is a constant. Could it possibly be a constant pointer? How would you know a pointer at compile time? Maybe if it was a pointer to a global; back to the idea of having 'path constants' i guess. But more useful here, since we're not just taking that pointer, we're doing something with it. But... if we have path constants at all, wouldn't we want the object to just use it as the effective address? Which would mean letting 'constant table' only be valid for 'static global pointer constants' (which i hadn't really planned to have, but since the opportunity presents itself..). Now we still get to choose the meaning of 'immediate constant' in the output.

this stuff is reminiscent of 'indirect indexed' vs 'indexed indirect' in the 6502. i wonder which of those was more useful? http://wiki.nesdev.com/w/index.php/CPU_addressing_modes says "The (d),y mode ((Indirect Indexed)) is used far more often than (d,x). The latter implies a table of addresses on zero page." http://www.emulator101.com.s3-website-us-east-1.amazonaws.com/6502-addressing-modes.html says "While indexed indirect addressing will only generate a zero-page address, this mode's target address is not wrapped - it can be anywhere in the 16-bit address space. " (i don't understand this; i thought the end address was not necessarily zero-page here, just the starting one.. but that's the same for both of them, not a distinguishing factor; am i misunderstanding or did they make a mistake?). https://skilldrick.github.io/easy6502/ says "Indexed indirect...You won’t see this much." and "Indirect indexed is like indexed indirect but less insane.".

well let's see the simplest thing to do would be: let $i be the argument that results from operand i. $i + o. For $i + o to be an address, since o is an integer, $i must be an address. So eg if the addressing mode of i is also 'immediate constant', then $i is an integer, not an address, and this is invalid. We could define a special case for that. But (a) that seems to be coupling the two operands too much, and (b) it is too much to remember (too irregular). Otoh doing this sort of thing at all couples the operands a bit, and is a bit irregular. So let's not.

---

i wonder if i am being confused about using llvm CALLS for call3. In an Oot Assembly interpreter, the interpretation is done within a loop where you fetch each instruction and then execute it and then update the PC; if you executed a CALL there you'd leave the interpreter loop, right?

no, i was not confused; the llvm CALL instead of using the stack may not help in an Oot Assembly interpreter itself implemented in LLVM, but it will help when compiling Oot Assembly to LLVM.

it's possible that this architecture may also make it easier to use the PyPy RPython JIT generator; i doubt it, though.

---

so what does Oot Assembly offer that Forth does not?

---

so what does Oot Assembly offer that ordinary assembly with a macro assembler does not?

portability.

so what does Oot Assembly offer that a 'macro assembler' language does not?

maybe not too much, but are there such things? it seems to me that macro assembler languages are tightly coupled to their intended application of writing assembly language for various hardware ISAs, rather than being a very simple portable target language for bootstrapping an HLL.

another answer might be 'simplicity'.

so what does Oot Assembly offer that Parrot VM does not?

A quick glance suggests that Oot Bytecode is much simpler/more barebones/easier to implement/easier to port than Parrot; also it seems like Parrot might be implemented in C rather than self-hosting?

http://docs.parrot.org/parrot/latest/html/ https://github.com/parrot/parrot/tree/4fe3223ee11c4ab47c36f75a6be4a5d550c67133/docs

(is there documentation on porting Parrot? http://lists.parrot.org/pipermail/parrot-dev/2012-January/006527.html is the best i could find)

(note: at one point Parrot was even planning to move to a small set of core ops, did they ever do it? how does this compare to Lorito? http://trac.parrot.org/parrot/wiki/Lorito )

http://moarvm.org/ is perhaps closer. Interestingly, both Parrot and MoarVM have custom ops https://github.com/MoarVM/MoarVM/blob/master/docs/extops.markdown . But MoarVM still seems more complicated than Oot Assembly, harder to port, and not self-hosting (MoarVM is in C) (note: apparently the plan is for it to be self-hosting someday? http://www.perlmonks.org/?node_id=1048515 ) http://moarvm.org/features.html

what about LLVM? Same problem; much more complicated than what i have in mind. LLVM is self-hosting though; LLVM is written in C and there are compilers from C to LLVM written in C, so one could run these compilers on the LLVM source to get an LLVM representation of a compiler from C to LLVM.

also, we have fixed-size instructions, which Parrot does not

So in summary:

---

http://www.modernperlbooks.com/mt/2011/07/less-magic-less-c-a-faster-parrot.html notes that if you have some 'custom instructions' implemented in OA, and some platform-optimized instructions hardcoded in the OA implementation, this is a problem if the hardcoded instructions need to call other custom instructions implemented in OA

(my feeling is that we should mostly try to just say, ok, don't do that then; the platform-optimized hardcoded instructions written in some other language should only be allowed to depend on other platform-optimized hardcoded instructions. HLL library functions can be 'jetted' and implemented by calling some other language, and our FFI will provide ways to re-entrantly call back into the running Oot/OA program; and in fact OA itself may provide a way to call out and re-entrantly call back in; but this facility shouldn't be used inside of custom instructions; although, perhaps i'm misunderstanding; perhaps the complexity of providing any interop at all is the same as allowing interop within the custom instructions, in which case we just have to bite the bullet and do it)

---

with the 4k limit, i do worry a bit about SSA; if one tried to put the Oot Assembly into SSA form, one would have to make sure there weren't too many assignments in any one procedure.

huh, i guess in a way, SSA is like the opposite of register allocation.

---

potential inspirations for the structure of headers of modules:

https://github.com/parrot/parrot-docs5/blob/master/5.0.0/parrotbyte.pod

https://github.com/MoarVM/MoarVM/blob/master/docs/bytecode.markdown

LLVM's

https://github.com/parrot/parrot/blob/m0/docs/pdds/draft/pdd32_m0.pod

---

Parrot's M0 proposal says

"Call Frames

This is the central data structure at the M0 level. The call frame graph contains most of the state that is needed to execute M0 bytecode. ... There will be no stacks in M0, not even for argument passing. All internal control flow apart from basic gotos with code segments must be in continuation-passing style. This will give M0-based code several benefits. It will ensure immunity from stack-smashing attacks and related programming errors. It will also eliminate the nested (inferior) runloop problem and opens up interesting possibilities for an Erlang-like concurrency implementation.

CPS will also allow Parrot to use many pre-existing JIT algorithms, such as those described in the paper "Trace Based Compilation in Interpreter-less Execution Environments" [TBC]. " -- https://github.com/parrot/parrot/blob/m0/docs/pdds/draft/pdd32_m0.pod

so i understand what continuation-passing style (CPS) IS, but how does it let you avoid a stack? Say Procedure A calls Procedure B. One way or another, Procedure A has to store the state of its local variables while Procedure B is executing, and when Procedure B returns, Procedure A has to be able to access that saved state. With a stack, Procedure A's locals are stored in a call frame and pushed onto the stack; when Procedure B returns, Procedure A's call frame is right there again at the top of the stack (perhaps underneath return value(s) from Procedure B). When using continuation passing in place of a stack, instead of just having a stack you have to somehow pass this state to Procedure B as an explicit (at some level) argument; you still have to store the state one way or another, and make sure that Procedure A can access it again after Procedure B returns control. It seems like what M0 is proposing instead is for these call frames to be allocated on the heap and have pointers to each other. Which is more flexible (it allows for closures with lifetimes longer than the dynamic scope that created them). If i understand that correctly, a downside is that, since the heap is being used instead of the stack for local variables, performance is less, because it's faster to use the stack than the heap; the stack is especially faster for allocation and deallocation, so when calling and returning from functions (which is very common) i would expect significant overhead from this design choice.
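
A toy illustration of that difference (not M0's actual design): call frames allocated on the heap and linked by a caller pointer, so 'return' is just following that pointer; locals live as long as something references the frame, at the cost of a heap allocation per call instead of a cheap stack push.

    class Frame:
        def __init__(self, caller, locals_):
            self.caller = caller           # replaces "the frame underneath us on the stack"
            self.locals = locals_

    def call_b(caller_frame, arg):
        frame = Frame(caller_frame, {'arg': arg})        # heap allocation per call
        result = frame.locals['arg'] * 2
        return frame.caller, result                      # "return" = follow the caller pointer

    frame_a = Frame(None, {'x': 21})
    resumed_frame, value = call_b(frame_a, frame_a.locals['x'])
    assert resumed_frame is frame_a and value == 42      # A's locals are intact after B returns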

that being said, i plan to do something like that myself, so i'm not complaining too much.

---

in fact, the M0 proposal seems similar in many ways to what i've been thinking of. They are trying to minimize the core instruction set to allow easy porting (although i don't think they quite follow my idea of having many core instructions that almost everyone will want to use platform primitives for nevertheless being custom instructions, not core instructions, but letting the implementation override them). They have a simple instruction encoding (8-bit opcode, all instructions take 3 8-bit arguments, no addressing mode). Their call frames can hold a fixed limited number of locals (256; but they have a 'Register Spilling' mechanism in place for when more is needed); this is similar to my 4k 'registers' which are fresh for each scope. M0 has 4 primitive types; int, 'numeric', PMC (like my 'custom'), and String. Each M0 memory location is 8 bytes (just like mine) (although elsewhere in the document they say they are 32-bit, so either they got confused or i'm misunderstanding; oh i see now, the pointers have whatever native size they have, ints only require 32 bits, but the implementation could decide to just give all memory locations the same size, or it could give certain locations different sizes; there is a hardcoded division of memory locations into int (I), number (N), string (S), and object (P), what they call INSP registers) and for the things that can't fit in that (PMC and String) it just holds a pointer.

they used to call their call frames 'contexts', just like me at some places in this document, but found that 'call frame' was less confusing to ppl.

at one point they say "Op composition is considered magic-like and will not be part of M0.", what is op composition? Is this literally function composition or do they mean what i call custom instructions?

https://github.com/parrot/parrot/blob/m0/docs/pdds/draft/pdd32_m0.pod

---

MoarVM's feature list looks pretty good:

" Some key features provided by MoarVM? include:

---

let's simplify things further:

also i forgot to say above that custom instructions must not have cyclic dependencies, and must be defined in order such that when one of them is defined, any custom instructions used within its implementation have already been defined. (the order of the assigned opcodes may be different from dependency order).

there should probably be an entry point table in each module for all the custom ops, and an entry point table for exported functions, and a version, and offsets to the various tables at the beginning (table of contents)

also, let's get rid of the idea of 'regular-expression like' stack patterns. We want to avoid programs screwing up the stack; first, we want to avoid 'implementation-level' runtime crashes in Oot, and second, it's a security risk. So if there were a region in the stack with a variable number of entries, we'd probably want that region to be prefixed by an integer 'length' entry on the stack that said the size of said region, and then check that entry. So the variable-length thingee is like a 1-dim array (or bytestring). But much of the time when you do that, you could have just put the variable-length thingee on the heap and put a pointer to it on the stack. So just do that, it makes the implementation simpler, and it removes the annoying overhead of double-checking those 'length' entries at runtime.

so am i saying that we are going all the way and preventing stack1 from being used for non-standard control flow patterns, that entries put on it by subroutines must be removed when those subroutines terminate, except for a fixed number of declared return values? Maybe so.. but the thing is, if the OA user code can arbitrarily manipulate stack1, the user-level 'call stack', at runtime, then i don't think it's possible in general to statically know which program location will be at stack1 TOS (top of the stack) at a given lexical location at runtime; so therefore you also can't statically prove that stack0 will be in the correct state for that location.

For example, if function f() calls function g() then the arguments will go onto stack0 and the callsite will go onto stack1, and then you could statically verify that the stack map looks like what g() expects upon entry; and then if control flow was ordinary, g() would consume these inputs from stack0 and put a return argument onto stack0 and RETURN, which would pop the callsite from stack1 and change the PC to point to that, and then you could statically verify that the stack map looks like what that callsite within f() expects after the RETURN. BUT if control flow is extraordinary, eg if g() directly messes with stack1, swapping and inserting and removing various entries, then you don't statically know which pointer will be at stack1 TOS when g() RETURNs, so you can't statically know what the stack should look like at that point.

i propose that we do the same thing here that we are doing with gradual typing; most program code won't do this weird stuff, and stack1 will only be impacted by CALL and RETURN and associated opcodes, in which case the condition of the stack can be verified statically. But when code messes with stack1 explicitly, it must in addition contain an assertion w/r/t the state of the stack. This assertion will then be checked at runtime (but how? we aren't keeping around type tags for primitive types on the stack; i suppose the simplest thing to do would be to have another instruction, paired with and preceding the assertion instruction, that says 'start keeping track of what goes onto the stack from now on, i'm going to have to make an assertion about it later').
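
A sketch of that gradual-verification idea, with invented instruction names: code that manipulates stack1 directly first issues a 'start tracking' instruction and later an assertion about the stack's shape; the tracker records what gets pushed so the assertion can be checked at runtime.

    class Stack1:
        def __init__(self):
            self.items, self.tracked_types = [], None
        def begin_tracking(self):                 # 'start keeping track from now on'
            self.tracked_types = []
        def push(self, value, typ):
            self.items.append(value)
            if self.tracked_types is not None:
                self.tracked_types.append(typ)
        def assert_map(self, expected_types):     # the paired assertion instruction
            assert self.tracked_types == expected_types, 'stack map violated'
            self.tracked_types = None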

to make it easier to see when an explicit stack map is required, don't mmap stack1, use STACK1_POP and STACK1_PUSH and STACK1_PEEK and STACK1_POKE? well, no, since registers aren't memory addresses and can't be aliased, scanning for stack ops is just the same as scanning for direct addressing mode with the stack numbers. Custom addressing modes make this annoying though; so add some requirements on them that don't let them change the stack except in easy-to-see circumstances (below i added the concept of 'direct' custom addressing modes).

to make implementation by eg compiling to llvm easier, instead of memory-mapping the call3 input and output arguments, just use special opcodes (call3 input1, call3 input2, call3 output) (nah; as with stack push/pop, or a direct custom addressing mode, just scan for direct addressing mode with the proper mapped memory location)

there should be a 'comefrom' annotation and a 'phi' annotation; should they be mandatory? (and comefrom 0 refers to an exported entry point)

---

error handling:

OK, so we're going to have situations where some instructions can yield errors at runtime, eg divide by 0. How to handle this?

one option: The ERR register is set if an error is encountered (eg division by zero). OA code should check the ERR register after one or more potentially error-producing operations. Problem with this: programs which fail to check the ERR register everywhere they should will continue executing in an inconsistent state.

another option: have an 'error handler address' register, and control jumps to this location upon an error. The problem with this is that it's a hassle to set up these 'error handlers', and also sometimes you'd prefer to ignore an error or handle it inline, which this makes annoying. Also you don't want to bother setting the error handler upon each call, and if you did, then wouldn't you want to reset it to the caller's handler upon the completion of the call? But then you are messing with a stack of handlers.

another option: Golang's defer/panic/recover here http://blog.golang.org/defer-panic-and-recover . That's kind of like the simplest form of exception handling if you take the 'error handler address' and then extend it to a stack of error handlers. But it's still complicated.

another option: straight out exception handling, like LLVM. But that's still too complicated.

another option: x86 style maskable interrupts https://pdos.csail.mit.edu/6.828/2005/lec/lec8-slides.pdf . But then you get into loading and saving registers for the interrupts, which is annoying/complicated; and we don't really need I/O interrupts here b/c Oot Assembly is not intended for hard realtime systems, even though it is supposed to have relatively low and predictable latency.

another option: upon exceptions, JMP to the beginning of the program (the beginning of the program should then check if there was an error and if so empty the stack, etc). Resetting the program is probably what you want to do if you are running a program on critical equipment (eg a life support machine, or your car's brakes) and there is an unexpected error.

so i think the thing to do is to have at least 3 'error handling modes'; one of which crashes upon error, one of which just updates an ERR register, and one of which allows the program to handle the error, or reset, by JMPing to the beginning of the program.
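
A small sketch of those three modes (names invented): crash on error, record the error in an ERR register for the program to check, or record it and jump back to the start of the program, which is then expected to inspect ERR and clean up.

    MODE_CRASH, MODE_ERR_REG, MODE_RESET = range(3)

    def handle_error(state, err_code):
        if state['mode'] == MODE_CRASH:
            raise RuntimeError(f'OA runtime error {err_code}')
        state['ERR'] = err_code                  # always visible to the program
        if state['mode'] == MODE_RESET:
            state['PC'] = 0                      # restart; program checks ERR and empties the stacks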

---

so what are the usual ranges for relative branches?

ARM Cortex M0 has various instructions with different ranges; the smallest one is "Bcond" which is −256 bytes to +254 bytes [3].

6502 is ~+-256 [4]

MIPS is +-128k [5] a 16-bit signed value (times 4, i guess) [6]

Lua's TEST is a skip. Lua's FORLOOP has a relative operand of 18 bits.

MSP430 has conditional jumps with a 10-bit signed offset [7]

x86 has a JZ instruction form with a -128 to +127 offset called a 'short' conditional jump [8] [9]

Apparently that's all that's allowed in some sort of x86 16-bit mode [10]

Apparently the 8051 conditional jumps were limited to -128 to +127 [11]

Some PICs have a BRA instruction with -255 to +256 [12] although sometimes the range is greater [13] [14]

Some PICs have some sort of 2k page size related to GOTO [15]

AVR has a 12-bit signed offset [16] [17]

Z80 has 8-bit signed offset [18] [19]

So the smallest one that i see is +-127. The small ones are 8051, x86 SHORT jumps, x86 16-bit, Z80. Now, some of these offsets are in bytes, but others are in instructions. So some of these may be 127 bytes but less than 127 instructions. And indeed, the ones with the shortest relative jumps, Z80, 8051, and x86, are all 127 BYTES. What are their instruction lengths? x86: "75% of x86 instructions are shorter than 4 bytes. But if you multiply the percentage by length, you will find that these short instructions take only 53% of the code size. So another half of a typical executable file consists of instructions with 32-bit immediate values, which are 5 bytes or longer." [20]. Z80: "252 instructions are one byte, the rest are 2, 3 or 4 bytes long." [21] 8051: "Instructions are all 1 to 3 bytes long, consisting of an initial opcode byte, followed by up to 2 bytes of operands.".

So you could cover all those guys, and 75% of x86 instructions (but less than half, by total executable length), by assuming instructions of at most 4 bytes. And 128/4 = 32.

Of course, this wouldn't REALLY necessarily make the cut, because the compiled OA instructions would often take more than 1 platform instruction! So if you said that OA relative jumps had a +-32 limit (in units of OA instructions), you'd still end up with some cases where the compiled version spanned more than 127 platform bytes.

Even a 1-instruction SKIP could be too much if the OA instruction being skipped was a custom instruction with a lengthy implementation.

So maybe we really oughtta just stick with the +-2k limit that our 12-bit encoding gives us.

also, we don't want custom instructions cluttering up the module's JMP table, so they should have to only use relative jumps. So the relative jump range puts a limit on the size of custom instruction implementations (that could perhaps be partially circumvented by dynamically calculating the relative jump target, but anyways..). So for that reason, we want to leave the relative jump range big.

But, if custom instructions are implemented by inline expansion, then relative jumps in surrounding code must be updated. It would be awkward if, upon complete expansion of custom instructions, the range of some relative jumps would have to grow larger than +-2047. So one sort-of solution is:

That seems kind of draconian but it does have the aesthetically nice effect of preventing OA code from creating very high-level OA instructions. But it also forces user-level control flow structures to use module-global JMPs all over the place. And since we are limiting the 'jump table' to 4k entries, that could be a problem. I feel like this would definitely impinge on HLL programmers; if you mapped HLL modules directly to OA modules, then surely some of them would have more than 4k loop constructs within them! Another solution is to make the implementation handle it: tell the implementation that if they dare to inline expand the custom instructions, they are responsible for finding some other representation with a larger bit-width to keep track of the growing relative jumps. Implementations that can't deal with that can just use the 'hidden stack' implementation method instead of inline expansion.

Not sure quite what to do here. I think the idea that we can directly map HLL to instructions with 12-bit operands is showing its weakness here.

Well, one thing that many architectures do is have fewer, larger operands on JMP instructions. We could do that. We could have a 36-bit absolute JMP destination. Then we wouldn't even need a JMP table, which would be much more efficient anyhow. So we'd have to bring back those format bits -- but for a good purpose.

A related idea is to have multiple JMP operands but combine them somehow, eg the first one is an index into the JMP table, the second one is an offset from that. That would let us cover 4k JMP table entries * 4k = 16M. But it's less efficient.

A third idea is to keep the format bits out but to have the JMP instruction just combine its operands. This is inefficient and wastes the addressing mode bits but is a little simpler. But it has the advantage that, if the MSB operands are direct to registers holding pointers, instead of immediate, then this instruction doubles as an indirect JMP, eg the first operand locates a JMP table and the third is the offset within that table. And optimizing compilers could optimize the putting-together of immediate constants.

A fourth idea is to have some of the JMP operands be pointers, and some be offsets from those pointers. But this seems to me like it gunks up the constant table with stuff that should be in the JMP table (because we can't have pointer immediate values).

Let's have JMP combine its operands, and get rid of the JMP table.

Now, 2 operands for JMP, or 3? 2^24 is only 16M, which still may not be big enough for everyone. The problem with using all 3 is that the program analyzer will think that the third operand is being written to. We could make an exception for JMP, though.

I'm reluctant to make the exception. We'd also want an exception for the constant table, in that case, and we'd want an immediate LOADCONSTANT. Lua's unconditional JMP is only 18-bits (and that's a relative JMP, and it's signed).

y'know, now that absolute jumps are so big, and now that they have a base pointer, let's just remove the relative jump instruction. That makes custom instruction expansion easier.
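
A hedged sketch of what 'JMP combines its operands' could look like with 12-bit operands: two operands concatenate into a 24-bit offset and the remaining operand supplies a base pointer (making it double as an indirect JMP); the details here are illustrative, not settled.

    def jmp_target(base, hi12, lo12):
        offset = (hi12 << 12) | lo12      # two 12-bit operands -> 24-bit destination
        return base + offset              # base comes from a pointer operand (or the module start)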

Wording that i deleted:

" If custom instructions are implemented by inline expansion, then relative jumps in surrounding code must be updated. It would be awkward if, upon complete expansion of custom instructions, the range of some relative jumps would have to grow larger than ~2^24 (a little less, actually), the largest effective relative jump value that can be specified in immediate mode. But that's pretty big, so hopefully it won't happen.

Any instruction that takes a pointer, if given an immediate operand, interprets that as a pointer to the n-th Oot instruction in the code (not counting inline expansion of custom instructions), with respect to the beginning of the module/function/custom instruction currently being defined.

"

---

related to JMPs and labels:

http://blog.llvm.org/2010/01/address-of-label-and-indirect-branches.html

---

removed some old wording from below/changed my mind:

" this code assumes its inputs are in the 'input registers' and leaves its output in the 'output register'. The call site is in the 'link register'. If it wants to change where the PC will be set after it's done, it uses the JMP_UPON_RETURN instruction, which alters the link register. "

" todo: maybe dispense with 'format bits' except for prefix modifier vs. rest? Then everything is 3 operand, which is simpler. Also do we want a 3rd format bit to distinguish possibly-custom instrs from not (probably no, since very few core instrs). The first 2 bits of the opcode are 'format bits'. 00 means prefix modifier; 01 means two-operand instruction; 10 and 11 mean three-operand instructions. Note that there is space for 1024 opcodes in each of these categories (plus up to another 1024 in the constant table in each of these categories). The next bit is the 'custom bit', set for custom instructions. "

" all instructions are either 'prefix modifiers or metadata', or 3-operand instructions. Prefix modifiers or metadata have opcodes below 256. Instructions with opcodes below 0xC00 have the first operand as the 'destination' or output, and the latter two operands as inputs. Instructions with opcodes at 0xC00 and above also take the third operand as input. Instructions with opcodes at 0xF00 use all three operands as both 'input' and 'output'. "

"

CARRY, ZERO, NEGATIVE, OVERFLOW (CZSO)

These 1-bit registers are intended to imitate the 'status register's available on the ALU of various processor architectures, and arithmetic and comparison instructions are encouraged to set them in the conventional way:

They can also be used in other ways by non-arithmetic instructions as a way to take as input and/or yield as output up to four bits.

(todo: we don't really need NEGATIVE, right? should we get rid of it? also, is my interpretation of the 'conventional way' to use S for comparison correct? should we call it S, SIGN?) "

" The implementation-dependent bit is reserved for the use of the OA implementation (for example, it might use this to mix other data into OA code; or to mark instructions that have been JIT compiled; or to intermix some alternate instruction encoding with instructions in this encoding; or to mark certain instructions for some other purpose; note however that this bit is not in the instruction itself, but rather in the value obtained after the 4-bit addressing mode has been applied to the 12-bit operand).

The flag bit is TODO: what should they be for? Boxing? Atomicity? blocking/async? raise vs nullable types? use CZNO flags? natural language instruction encoding format flag?). "

"

SHORT FORMAT

1 format bit + 7-bit subinstruction + 7 x (1 subformat bit + 7-bit subinstruction)

The SHORT format is a 64-bit packed grouping of 8 sub-instructions, each one 8 bits long except for the first one, in which the first bit was consumed by the format bit.

The first instruction is a 'PUSH' instruction which pushes a value onto the stack. The next 6 instructions are each either PUSHs or operations which take their arguments from, and push their results to, the stack. The last instruction is either POP instruction which pops a value from the stack, or another operation.

The format of the first subinstruction is:

The format of the next 6 subinstructions are:

The format of the last subinstruction is:

The inputs and outputs of the OPERATION instructions are all implicitly the data stack (stack0) (eg the inputs are POPed from the stack and the outputs are PUSHd to the stack). Note that OPERATION instructions may consume any number of inputs from the stack (todo must this number be fixed for each operation, or can some eg consume the stack up to some terminator? i'm thinking if not fixed, then known at compile-time at least).

todo this format would be simpler and much more efficient to decode in parallel if the # of addr mode bits were the same between PUSH/POP and OPERATION; but it's hard for me to choose between 32 opcodes for operations vs 3 addressing modes for data access. On the one hand, 2 addr modes are all you really need, and 32 opcodes allows you to comfortably have opcodes for eg each of the 10 or so basic instructions of the SECD machine plus some basic arithmetic and composite data operations. On the other hand, a third addressing mode bit allows you to represent boxed data, and a third addressing mode bit could be useful for the opcode as well as for data because it could represent eg laziness or asynchrony in the opcode, and 16 opcodes allows eg the 10 or so basic instructions of the SECD machine plus a tiny core of basic arithmetic, and other basic operations can be represented through library functions. "

"

"

" instruction-level flags could be emulated by custom prefix ops. "

---

having the opcode number determine the format makes it harder to have a dense opcode dispatch table. Implementations that want dense dispatch tables would want to create a different dispatch table for each format if we do it this way. Of course, allowing custom instructions makes density hard anyway, but not that hard, since each module gets to remap the opcodes for any imported custom instructions. Consider going back to format bits, where the format bits aren't part of the opcode at all, eg the opcode bitwidth is reduced by the number of format bits. We have many fewer potential opcodes, but 4k was too much anyways. A bigger disadvantage is that it's harder to have unequal partition sizes for the opcode space, but 4k is a lot so mb we can work with this. If instructions of different formats can be intermixed, that makes program analysis slightly more annoying but just as efficient; the format bits are still stored in the instruction encoding, so we don't have to look them up in an external table or anything. We still might want to isolate prefix modifiers into their own numerical opcode space, since the dispatcher can't really branch to an instruction implementation for them; it must treat them specially by first reading the next instruction (well, it could do that after branching actually, but some implementations may prefer not to).

todo: so let's do that. If we had 2 format bits (1k max instructions), what would the 4 formats be, letting I, O, and B stand for input, output, and both (input and output) operands? prefix, IIO (eg ADD), IIB (eg CAS), BBB (eg ROT3 on memory locs), I (eg 36-bit JMP), IO (eg 24-bit JMP)? that's 6 forms already, too many for 2 bits. If we take 3 format bits then there are 512 opcodes in each partition; we have 2 partitions left over, so we could assign more to IIO, the most common one.

btw the IOB format yields 27 possibilities, plus prefix modifiers, so 28; and that's without 2- and 1-letter signatures, which add 9 + 3 possibilities, for 40 total. But with the constraint that I < O < B and that each sequence is non-decreasing, we only have: I O B II IO IB OO OB BB III IIO IIB IOO IOB IBB OOO OOB OBB BBB + prefix = 20 possibilities. Are any of those others useful? Maybe we could separate out 'prefix' and 'metadata', although i don't know why we'd need to; but i bet that could come in useful someday.

So if we have 3 format bits and if we can't reuse opcodes between formats, then we can only have 9 opcode bits, or 512 opcodes total. But if we can reuse opcodes, then we get 4k opcodes total, 512 per format, but if we want a dense representation then we have to have 8 opcode lookup tables, one for each format. If the interpreter is implemented in that way where you avoid jumping to a central node in between each instruction, to allow branch prediction to work better, then you have to duplicate the logic to choose one of these 8 tables at the end of each instruction, and this logic involves local branching, which is annoying/may somehow bother branch prediction, although i don't see how it would; it may even help. Also, if the operands are processed differently for different formats (are they?) then you have a bunch of branching logic repeated after each instruction implementation anyways.
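
A rough sketch of the 'one dense dispatch table per format' idea; the bit positions, widths, and handler names here are invented for illustration (the point is only that the format bits pick which small dense table gets indexed):

    # Hypothetical 64-bit instruction word: [3 format bits][9-bit opcode][...operands].
    # With 3 format bits we keep 8 small dense tables instead of one sparse one.

    FORMAT_SHIFT = 61
    OPCODE_SHIFT = 52
    OPCODE_MASK = 0x1FF                     # 9 bits -> 512 opcodes per format

    def h_add(ins):  return "ADD"           # stand-in handlers
    def h_jmp(ins):  return "JMP"
    def h_nop(ins):  return "NOP"

    # one dispatch table per format; unused slots fall back to a default handler
    dispatch = [[h_nop] * 512 for _ in range(8)]
    dispatch[2][0x001] = h_add              # format 2 (say, IIO), opcode 1
    dispatch[5][0x010] = h_jmp              # format 5 (say, I / absolute JMP)

    def step(ins):
        fmt = (ins >> FORMAT_SHIFT) & 0x7
        op  = (ins >> OPCODE_SHIFT) & OPCODE_MASK
        return dispatch[fmt][op](ins)       # format bits choose the table, opcode indexes it

    ins_add = (2 << FORMAT_SHIFT) | (0x001 << OPCODE_SHIFT)
    ins_jmp = (5 << FORMAT_SHIFT) | (0x010 << OPCODE_SHIFT)
    assert step(ins_add) == "ADD" and step(ins_jmp) == "JMP"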

Are the format bits even good for anything? Maybe we want to implement the addressing modes at the beginning of each instruction in any case. How is this worse than doing it at the end of each instruction? In that case, the format bits would still be useful for dataflow analysis, but do we really care about that? If we did, we could put that information in the module info table instead, right? But i guess one idea is that in theory a really optimized implementation might do speculative instruction decoding in parallel, applying the addressing modes to the operands of future instructions simultaneously with fetching the instruction implementations, and it could do that without looking at the module info table (which is harder than looking up a hardcoded format in a hidden info table hardcoded into the decoder, which we can't do because we have custom instructions). I guess that's nice; it goes along with my whole Oot Brain parallelization thingy, although i don't know enough about this stuff to know if that capability would really be practically useful. Anyhow, i'm persuaded, so let's keep either format bits or opcode ranges like we had before.

So the decision remains, do we want the format bits to be part of the opcode or not? If they are, then we can have density in each format, but not over the whole opcode; if they are not, then we can have density over all opcodes, but we only have 512 opcodes total rather than 512 per format. The thing is, it's not clear whether density over all opcodes matters at all; why not just have 8 lookup tables instead of 1? And on the other hand, it's not clear that we even want 512 opcodes total; that seems inelegant to me. Also, if we only had 512 opcodes total, i bet many implementations would just have a dense table for all of them; but maybe with 4k opcodes they still could do that. In the former case the table would be 512 * 8 bytes per 64-bit pointer = 4k, in the latter 4k * 8 bytes per 64-bit pointer = 32k. Hmm a 32k table is a significant portion of L1 cache, maybe implementations would avoid that.

https://github.com/MoarVM/MoarVM/blob/master/docs/extops.markdown says MoarVM has 16-bit opcodes but only uses about 512. https://github.com/perl6/nqp/blob/master/docs/ops.markdown says NQP only uses about 268 opcodes. But the opcode list at https://github.com/MoarVM/MoarVM/blob/master/src/core/oplist gives about 740 normal opcodes and about 60 special ones (for the specializer or optimizer, i think). But they have stuff like 'isprime' and 'mkdir' and 'stat_time' in MoarVM, which is probably more than we want.

JVM makes do with 1-byte opcodes. So does LuaJIT.

I don't think it's crazy to restrict to 512. Although the whole 'custom opcode' idea helps, too many opcodes is still not simple/beautiful, and furthermore, i like the idea of implementations planning for a max 4k opcode jump table much better than for 32k. On a machine with 16-bit addressing such as the 6502-based Apples, 512*2 bytes = 1k for the lookup table, which is still too big considering that BASIC implementations used to take only 4k-8k.

One remaining issue is, if we are restricting to 512, why not go a little further and restrict to 256 (one byte)? I have a feeling that on many platforms, there may be some efficiencies to using lookup tables indexed with a single byte as opposed to 9 bits, and probably we'd have memory savings in various program analysis data structures. In this case we'd have an extra 'flag' bit available again.

Hmm, but now that i think of it, we don't really need separate formats for I and IO as opposed to III and IIO, b/c recall that for absolute jumps to labels we can just have eg multiple input operands and multiply the first one by 2^12. So we only need prefix, IIO, IIB, BBB, which is 4 formats (2 format bits).

So do we have 2 flag bits? hmm..

---

should we do one of:

the latter would allow one opcode dispatch table per program, rather than one per module

the former would allow modules that don't use many opcodes to have small dispatch tables

the latter would be a much bigger win. This would really change how modules work, though; no longer could you treat modules like libraries to be plugged together; they are now just rigid subcomponents of one program, or at least, they cannot interoperate with modules that use different OA 'ISAs'.

now, if you have a bunch of modules with the same opcodes, the OA implementation could detect that and only use one dispatch table. So maybe we wouldn't be gaining much.

so let's encourage different modules to use the same mappings for custom instructions to opcodes.

i guess that means removing the import form that individually imports custom instructions to whatever opcodes you want. The import will just import all custom instructions from a module. If you want to mix-and-match you'll have to cut-and-paste from that module.
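
To illustrate why whole-module imports make table-sharing easy, here's a hypothetical sketch: if two modules import the same custom-instruction library wholesale, their opcode mappings come out identical, and an implementation can key a cache of dispatch tables on the mapping. The names and the assignment scheme below are invented for illustration:

    # Hypothetical: each module imports ALL custom instructions from a library
    # module, so the opcode mapping is a function of the library alone.  An
    # implementation can then cache one dispatch table per distinct mapping.

    def import_all(core_ops, library_ops):
        """Assign opcodes to every custom instruction in a fixed order."""
        mapping = dict(core_ops)                       # core opcodes are fixed
        next_op = max(core_ops.values()) + 1
        for name in sorted(library_ops):               # deterministic order
            mapping[name] = next_op
            next_op += 1
        return mapping

    CORE = {"ADD": 1, "JMP": 2, "CPY": 3}
    COW_LIB = {"COW_READ", "COW_WRITE"}

    module_a = import_all(CORE, COW_LIB)
    module_b = import_all(CORE, COW_LIB)

    # identical mappings -> the implementation only needs one dispatch table
    tables = {}
    for mapping in (module_a, module_b):
        key = tuple(sorted(mapping.items()))
        tables.setdefault(key, object())               # stand-in for a real table
    assert len(tables) == 1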

---

random assembler program/syntax examples:

http://luajit.org/dynasm_examples.html

---

probably should learn what all this means sometime:

http://article.gmane.org/gmane.comp.lang.lua.general/58908

---

is there any use for a JIT flag on instructions themselves?

http://wiki.luajit.org/Bytecode-2.0 indicates that "Function headers" have flags for JIT compilation, but apparently not the CALL instructions. So maybe there is no use. The JIT could add metadata instructions at the start of functions to use as a 'function header'.

Otoh one would think that it would be useful to JIT compile particular calls, not just functions in general. In which case the flag would be useful. So maybe reserve one bit for the implementation's use.

---

to reiterate, some other encoding ideas:

if we permit stack machines (pop all inputs, push output), or if we permit dropping the first operand and instead having one of implicit destination: accumulator register, or implicit destination: stack register, then we also get:

out of these, the smallest that would seem to fulfill the criteria of:

requires at least 4 + 3 x (2 + 3) = 19 bits

are:

are these better than the 64-bit-width encoding above? It depends:

sounds do-able, i'm not sure if desirable. The thing is that the only benefit here is code density, but OA is not intended to have high code density (you could design a more complicated encoding and instruction set if you wanted that, and compile to it).

if we have 2 instruction encoding format bits, then we could either:

can some of the others fit into that? Let's first look only at the 32-bit ones, b/c they fulfill the criteria mentioned above:

at this point we have only 16 or 32 opcodes and locals, so we're certainly getting into the territory where this might be a good 'high code density' subformat, but not good as the only format, given our goals with OA to preserve HLL intent in single instructions.

But if you were to have a 'high code density' subformat, then why not try to go even smaller, on the assumption that only the most-used opcodes and registers would be expressible in the subformat; and maybe even stack machines would be useful here. Let's assume 2 format bits:

(note these are not all possible combinations, i tried to only include potentially useful ones, and i kinda winged it so i may be missing some)

i think if we had 1 format bit for multiple instruction encoding formats, we'd have the 64-bit one described above and the natural-language-y encoding, and if we had 2 format bits for multiple instruction encoding formats, we'd use 1 for the 64-bit one described above, 1 for the natural language-y encoding, and 1 for something smaller than 32 bits. For the small one, that leaves:

but if we have 2 format bits, the small one only needs to use 1 format bit, so for 8 bits we have:

none of those are that appetizing; we'd have to pack 8 of these instructions into a 64-bit field to keep alignment, and i doubt there will be many 8-instruction sequences that either only use stack ops, or only use the top 4 opcodes, or only use one of accumulator or stack destination throughout (and we don't want the compiler to have to spit out different instructions for this case, that's too hard; if that difficulty were worth it, then a different ISA designed for code density rather than simplicity would be worth it).

So with 1 format bit and 16 bits, we have:

i bet the most useful out of those would be (eliminating stack machine and 6-bit opcodes and 2-bit operands):

Would 4-instruction sequences that fit either of those constraints be common enough to be worth the extra complexity of adding this instruction encoding format? The 1-bit addressing mode is not that bad because it would be trivial to emit LOADK instructions. The accumulator or stack destination might be uncommon unless instructions were written with that in mind, though. Note that instead of 'accumulator or stack destination' you could use instructions designed in the 2-operand style, eg instead of c = a + b, you'd have b = a + b. But would aligned stretches of 4 instructions that fall into that style naturally be common if we just want to use this for compression? Probably not.

So the only remaining choice is

The biggest problem with this is probably only having 3 bits for the opcode. Will there naturally be many aligned stretches of 4 instructions that only contain the top 8 instructions? Assuming that PUSH/POP are memory-mapped, https://www.strchr.com/x86_machine_code_statistics suggests that these might be something like: cpy call cmp add lea test je xor; including the implicit 'push' and 'pop', these would be 76% according to those stats. If instructions were I.I.D. then .76^4 ~= 1/3 would be the chance of seeing a run of 4 of these in a randomly drawn set of 4 instructions (since we randomly drew, i guess aligned is taken care of?). If we realized a savings of 75% in these cases (64 bits*4 -> 64 bits), the expected length would be 1*2/3 + .25*1/3 = 75%, ie a 25% savings in code density. (how much would one more bit in the opcode help? if we got 8 more common instructions, that would be another 8% according to the table on the webpage, and .84^4 is almost .5, so 1 - (.5 + .5*.25) = 37.5%, a .375/.25 - 1 = 50% increase in savings.) The real savings would likely be less because instructions are not I.I.D. and it's likely that it's very common to have a run of 'common' instructions broken by at least one 'uncommon' instruction; also we are assuming here that operands with more than 3 bits never occur, and we are neglecting the addition of the LOADKs and the need to reserve a register for that purpose.

Would using a 32-bit shorter format be any better? The table on the website suggests that the top 20 opcodes account for about .89 of the total. 1 - ((1 - .89^4) + .5*.89^4) = .31, so we'd only save about 30% this way; a little better, but not much.
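
Here are those back-of-the-envelope numbers spelled out as a tiny calculation (same i.i.d. simplification as above; the frequencies are the ones quoted from the strchr.com table):

    # Rough code-density estimate: an aligned run of 4 'common' instructions
    # compresses to a fraction of its size; any other run of 4 stays full size.

    def expected_savings(p_common, compressed_fraction=0.25):
        p_run = p_common ** 4                  # chance a run of 4 is all-common (i.i.d.)
        expected_len = (1 - p_run) * 1.0 + p_run * compressed_fraction
        return 1 - expected_len

    print(round(expected_savings(0.76), 3))        # 3-bit opcode, top 8 ops   -> ~0.25
    print(round(expected_savings(0.84), 3))        # 4-bit opcode, top 16 ops  -> ~0.37
    print(round(expected_savings(0.89, 0.5), 3))   # 32-bit subformat, top 20  -> ~0.31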

What if we got rid of the 1-bit addressing modes? This reduces our savings because instead of compressing 4 large instructions to 4 small instructions, we are compressing 4 large instructions to 6 small instructions, or something like that, if we assume that typically we would have to add 2 LOADs/STOREs for each run of 4 instructions (and we are still ignoring the occasional LOADKs). If we assume that we only realize a 50% savings, then we can't beat the calculations for the 32-bit case, or about a 30% savings.

If code density were a goal or if implementation simplicity were not one this might be worth it, but in our case i think a <30% savings in code size is not worth the extra complexity, especially since we have no empirical OA data with which to test these savings; and given that if code density were a goal AND implementation simplicity were not one, a more complicated instruction encoding would be worth it.

If we leave one bit in the instruction for the use of the implementation, then some implementations could implement one of the schemes here, or intermix compressed 'macroinstructions' with ordinary instructions, and maybe it could somehow help with JITing too. So let's do that.

just thinking about 16 bits, here is why it doesn't work out for us, to reiterate what we said before:

so that's 4 + 3*(2 + 3) = 19

now what about just switching everything to 32 bits? Now we can also accommodate an addr mode in the opcode field, allowing for dynamic first-class functions.

The alternatives are:

Really, neither of those are crazy, but they do have their downsides compared to our 64-bit scheme.

Is 64 registers enough? If we are treating them as registers, yes; the 'related work' section of The Need for Large Register Files in Integer Codes cites many other papers that think 64 or 32 is enough (even though they themselves argue that more is better in some cases). However, if we are treating them as variables, Table 7 of the same paper gives various programs with more than 64 global variables (eg perl, go). We are thinking of these more as locals than as globals, but it is still unsettling.

So in conclusion a 32-bit encoding would not be crazy and would even be a little simpler, but a 64-bit encoding lets us:

The cost is, however, perhaps a near-50% decrease in code density, which is large. Otoh, without empirical data to profile, it's not clear that the efficiency gains of higher code density would not be offset by losses from having a more spartan set of opcodes. And, as i've said before, if we really wanted an efficient ISA and didn't care about implementation complexity, we could probably design a much better one anyway.

---

a little later; i still think 'the accumulator or stack destination might be uncommon unless instructions were written with that in mind', and similarly that the stack machines would require some rewriting, but actually i think the obvious opportunities for an easy rewrite into this form would be pretty common so it may still be worth it: that is to say, expressions. An expression is a tree in which each result is used exactly once. The interior nodes of an expression tree are raw stack machine operations; the inputs and outputs are from/to other interior nodes and are specified implicitly by the RPN ordering of the instructions. In programs you often see expressions with more than two or three terms, so the interior nodes of these could be compiled to an RPN-style sequence of opcodes.

At the root and the leaf nodes we still need to do something else, though; at the leaves, we need to get values from elsewhere and load them onto the stack, and at the root we need to pop the stack and store the result elsewhere. In 'stack machine mode' these would be implemented by LOAD (PUSH) and STORE (POP), which would be the only instructions that need to mess with operands and addressing modes.
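
A toy sketch of that expression-to-stack-code idea; the opcode names and the LOAD/STORE framing are illustrative, not the actual SHORT format:

    # Hypothetical: compile an expression tree to RPN stack code.  Leaves become
    # LOAD (push a named local onto the stack); interior nodes become pure stack
    # ops; the root's result is finished off with a single STORE.

    def compile_expr(tree, out):
        if isinstance(tree, str):
            out.append(("LOAD", tree))            # leaf: fetch a local, push it
        else:
            op, left, right = tree
            compile_expr(left, out)
            compile_expr(right, out)
            out.append((op,))                     # interior node: pops 2, pushes 1

    def compile_assignment(dest, tree):
        code = []
        compile_expr(tree, code)
        code.append(("STORE", dest))              # root: pop result into a local
        return code

    # d = (a + b) * c
    code = compile_assignment("d", ("MUL", ("ADD", "a", "b"), "c"))
    assert code == [("LOAD", "a"), ("LOAD", "b"), ("ADD",),
                    ("LOAD", "c"), ("MUL",), ("STORE", "d")]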

So could we fit this 'stack mode' stuff into 8 bits? Really 7 bits, since we'd want a format bit.

Well, if we still want an addr mode on the opcode it's pretty tight. 3 bits for the addr mode, and then we only have 4 bits for the opcode field. No bits are left for flags, for boxing/unboxing, or for labeling the stack machine instructions with offsets to make the AST which is implicitly represented by the RPN easy to traverse.

Plus, we want variant instruction encodings for LOAD and STORE, so remove another bit for that. So 3 bits for the addr mode and 3 bits for the opcode. We have 6 bits left to encode LOAD and STORE. Reserve one bit to choose between them, now five bits left. Now if we have a 3-bit addr mode, we only have 2 bits to choose the register to LOAD or STORE into! That's not too great.

So we'd pretty much have to ditch the 3rd addr mode bit. Which would leave us with:

Now with only 8 registers, we aren't really going to be using these things as local variables; they would only be used as registers, which means that before issuing a sequence of these stack instructions, things would have to be CPYd from wherever they really are stored into the 8 usable registers. In which case, why not have that code place them directly on the stack? Similarly for STOREing. So we can ditch the LOADs/STOREs, and have one of:

Both of these are good choices. The 2/5 (2-bit addr mode) gives us 32 opcodes, and the 2 bit addr mode doesn't give us a chance to represent things like boxing, atomicity, asynchrony (vs blocking), or laziness, but we don't need some of these anyways when we're just fetching a function (eg we can assume that all function pointers are 'boxed' and that the stack is not in shared memory so no atomicity is needed), and for others we could perhaps use a language convention, and reify function application if we need to do the other thing. Eg the convention could be to apply functions strictly synchronously (block until they are done), and then if you want to apply a function lazily or to get a promise you have to use the APPLY opcode; such a convention lets us do low-level arithmetic fast, and since we can fit 8 of these operations in one 64-bit packet, there is plenty of room for a few APPLYs when dealing with HLL functions. With 2 addr mode bits we can still do immediate (the opcode is right there in the instruction, and is for a built-in instruction), constant table (the opcode is right there in the instruction, and is for a per-module library instruction), direct (we are applying a first-class function stored in a register), or indirect (we are applying a first-class function stored in the heap).

The 3/4 (3-bit addr mode) only gives us 16 opcodes but we don't have to think about how to make up for the loss of the custom addr modes.

I'm leaning towards the 2/5 here, as the whole point of this format is to be compact/frugal.
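
A decoding sketch for the 2/5 split; exactly where the pack-level format bit and any spare bits go is still undecided, so this just assumes 8 even bytes, each with a spare bit, a 2-bit addressing mode, and a 5-bit opcode/operand field:

    # Hypothetical SHORT-format decode: a 64-bit packet holds 8 sub-instructions,
    # one byte each.  Assumed byte layout: [1 spare bit][2-bit addr mode][5-bit opcode].

    MODE_IMMEDIATE, MODE_CONSTTABLE, MODE_DIRECT, MODE_INDIRECT = range(4)

    def decode_packet(packet64):
        subs = []
        for i in range(8):
            byte  = (packet64 >> (8 * i)) & 0xFF
            spare = (byte >> 7) & 0x1
            mode  = (byte >> 5) & 0x3           # 2-bit addressing mode
            op    =  byte       & 0x1F          # 5-bit opcode/operand field -> 32 values
            subs.append((spare, mode, op))
        return subs

    # build a packet whose first sub-instruction is (spare=0, mode=DIRECT, op=7)
    first = (0 << 7) | (MODE_DIRECT << 5) | 7
    packet = first                              # remaining 7 bytes left as zero
    assert decode_packet(packet)[0] == (0, MODE_DIRECT, 7)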

If we had a 64-bit 'middle format' and used this as a 'small format' then the structure of code would tend to alternate between these formats in a similar manner to how, in Haskell or Lisp, you alternate between 'let __ in' and then actual expressions which have many terms. The 'let's correspond to the 64-bit format being used to select local variables and load them onto the stack, and then in most cases the actual expression could be specified in one 64-bit packed field.

i kinda like that...

in fact it's even better than that, because we only need 1 format bit for the whole 64-bit packed instruction. So we get an extra bit for 7 of the 8 instructions within the pack.

---

might enable more runtime parallelism if we got rid of one of the signature classes (probably BII) and instead had a 2-operand format. This would allow the decoder to decode the address of JMPs ahead of time. Otoh that's not really useful for parallelism because it's just an integer; you still have to get the address associated with the actual label, and you're not going to do that until you see it's a JMP. So no.

---

" Condition codes vs. condition registers vs. compare&test

...

Granularity of atomicity ... Transactional memory? "

---

todo mb get rid of ALU regs and have 'compare' and 'test' instead?

---

if we have ALU regs, but this doesn't show up in the signature class, then doesn't that complicate dataflow analysis a lot?

---

i skimmed these:

http://people.cs.clemson.edu/~mark/conditioncodes.html

https://community.arm.com/groups/processors/blog/2010/07/16/condition-codes-1-condition-flags-and-codes

http://people.ee.duke.edu/~sorin/ece152/lectures/2.2-isa.pdf

http://opencores.org/or2k/Branching_and_condition_codes

https://gcc.gnu.org/ml/gcc/2001-01/msg01806.html

---

ok i guess i'm convinced that we don't need to, and shouldn't, mix ALU regs with branching.

but how could we do arithmetic without at least a CARRY? would you actually do compares before each potentially-carry-needing instruction?

yes (or afterwards): https://en.wikipedia.org/wiki/Status_register#CPU_architectures_without_arithmetic_flags
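
Sketch of the flag-less alternative that the Wikipedia page describes: compute the carry explicitly with an ordinary unsigned compare after the add (roughly what MIPS code does with SLTU), instead of reading a status register. A 64-bit word size is assumed here:

    # Carry without a flags register: after an unsigned wrapping add, the result is
    # smaller than either operand exactly when the add wrapped, so an ordinary
    # 'set-if-less-than' recovers the carry bit as an explicit value.

    WORD = 1 << 64

    def add_with_carry(a, b):
        result = (a + b) % WORD          # wrapping 64-bit add
        carry = 1 if result < a else 0   # SLTU result, a  -> explicit carry output
        return result, carry

    assert add_with_carry(2, 3) == (5, 0)
    assert add_with_carry(WORD - 1, 1) == (0, 1)   # wraps around -> carry set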

---

OK that makes me want to get rid of CZNO, for the sake of explicit dependencies. But what about ERR? if we keep ERR there seems to be no reason for ADD to not use it as a carry output.

And if we have ADD shouldn't we have at least one 'bit register'?

But then wouldn't we want 5 more bits for instruction signature class, for whether or not it might read from/write to ERR and/or the 'bit register', and whether or not it might have an error (as opposed to just an 'exception')?

---

OK, the point of the instruction signature class wasn't really to represent the instruction signature for data flow; that was just a useful side effect. The point was to say whether, during dispatch, the instruction needed things to be looked up and/or stored before/after it ran. So NO, we don't need to add bits there just to say whether or not the instruction might affect ERR and/or a carry register and whether or not it might crash.

And no, we don't need a 'bit register'. An ordinary reg can hold many things, one of which is a bit. We can have a bit data type in function signatures, though.

And yes, putting the carry into the ERR register is fine, although i dunno if i will, b/c the MIPS way sounds cleaner. Otoh this sort of thing (having another return value) is the whole point of the ERR register, so let's do it.

But i still want to get rid of CZNO.

---

i guess i'll make SHORT format have only 2 addr mode bits throughout. In most ordinary expressions, stuff like laziness and asynchrony and boxing/unboxing follows the language convention, and/or can be handled by a wrapper outside of the expression itself. Also, the point of SHORT is to increase code density, and probably most of the space in code is either taken up by high level operations, or low level ones (but which is it?), in which case one convention for boxing/unboxing will be fine.

Also, most other bytecode languages don't have any of this addressing mode/modifier stuff at all, and they get along just fine.

Code that needs to do something unusual can always just use MEDIUM format.

---