i've been reading about Lua's VM from
http://luaforge.net/docman/83/98/ANoFrillsIntroToLua51VMInstructions.pdf
which is an excellent, easy-to-read document. I'm interested in the VM because Lua seems somewhat small, and the VM seems somewhat small, and i am looking at it as a basis set of operations for Lua (a core language). Seen from this perspective, there are some unnecessarily complex parts of the VM, doubtless included for efficiency, such as:
so i guess an assembly for Oot would want to be less complex. Other thoughts:
---
so, addressing modes:
(note: in this scheme there is no stack addressing mode!)
an alternative:
an alternative:
1 bit says: constant
if constant: one bit says: immediate constant or constant table constant
if non-constant: one bit says: direct or indirect
stacks, and the top-two elements of each stack, are memory-mapped (so, as described above, with 4 stacks, that's 16 memory-mapped locations; one bit is whether this is a straight read/write or a pop/push, and one bit is whether this is to the TOS or the second-element-of-stack, and two bits are which stack)
could also have one bit for whether we are accessing 'special' memory, like the memory-mapped stacks, or 'ordinary' memory; 'special' memory has the properties that (a) when you read from a special memory location, you might get a value different from what you last wrote to it, (b) 'reads' might cause mutation (e.g. when they are interpreted as POPs). We might extend this to say that special memory is 'volatile' and then accesses to it should be executed immediately and non-reordered because they might have arbitrary side-effects. We might also say that special memory is always thread-local and never shared.
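a toy sketch of how that 'special' memory might behave (the interface and details here are made up, just to illustrate that a 'read' can mutate):

    # toy model of 'special' memory-mapped stack slots: a read of a pop-flavored
    # location mutates the stack, so these are not ordinary loads and should not
    # be reordered
    class SpecialMemory:
        def __init__(self, nstacks=4):
            self.stacks = [[] for _ in range(nstacks)]

        def read(self, stack, depth, popping):
            # depth 0 = top-of-stack, depth 1 = second element
            s = self.stacks[stack]
            return s.pop(-1 - depth) if popping else s[-1 - depth]

        def write(self, stack, depth, pushing, value):
            s = self.stacks[stack]
            if pushing:
                s.insert(len(s) - depth, value)   # becomes the new TOS (depth 0) or second slot (depth 1)
            else:
                s[-1 - depth] = value             # plain overwrite of the mapped slot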
also, as noted above, we need an 'emulation'/'cascaded addressing mode' bit (at least for direct and indirect addressing, do we also need it for constants?). do we need PC-relative addressing? do we need table (eg displacement) addressing?
if we have: constant bit, direct/indirect bit, emulation bit, special bit, that's 4 bits, so if we have 16-bit words, we have 12 bits left, ie 4k. note that in this scheme, addresses can be up to 4k but constants can be up to 16k, because constants don't need 'emulation' or 'special' bits.
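to make the bit budget concrete, a decoding sketch (the bit positions are my own choice; only the widths and meanings come from the scheme above):

    # decode a 16-bit operand field under the scheme above:
    # constants get a 14-bit value (16k), non-constants a 12-bit address (4k)
    def decode_operand(field):
        if field >> 15:                                   # constant bit
            immediate = (field >> 14) & 0x1               # immediate vs constant-table
            return ('const', immediate, field & 0x3FFF)   # 14-bit value
        direct    = (field >> 14) & 0x1                   # direct vs indirect
        emulation = (field >> 13) & 0x1
        special   = (field >> 12) & 0x1
        return ('addr', direct, emulation, special, field & 0xFFF)  # 12-bit address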
(Instead, could use the ARM-ish barrel-shift-ish encoding where the 12 low-order bits of immediate constants are interpreted straightforwardly, but the two high-order bits shift the others, e.g. if the two high order bits are 00 then the value is the value given by the 12 low-order bits, if the two high order bits are 01 then the value is 2*(the value given by the 12 low-order bits), if 10 then 4*(the value given by the 12 low-order bits), and 11 is multiplication by 8, allowing representation of all values up to 4k, and some values with trailing zeros up to 32k. This sounds useful/efficient but is slightly more complicated, so i probably won't do it)
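for reference, the decode for that shifted-immediate idea would be just a small sketch like:

    # barrel-shift-ish immediate: low 12 bits are the value, the top 2 bits of
    # the 14-bit immediate field multiply it by 1, 2, 4, or 8
    def decode_shifted_immediate(field14):
        return (field14 & 0xFFF) << ((field14 >> 12) & 0x3)

    assert decode_shifted_immediate(0xFFF) == 4095
    assert decode_shifted_immediate((0b11 << 12) | 0xFFF) == 4095 * 8   # trailing-zero values up to ~32k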
---
in [1] i wrote:
" Note that Haskell is implemented a bit differently from many languages; this stems from Haskell's properties:
It is therefore assumed that Haskell programs will, at runtime, very frequently be dealing with unevaluated thunks, with closures which represent unevaluated or partially evaluated functions, and with the application of functions whose arity was unknown at compile time. "
due to these considerations, the GHC compiler appears at first rather complicated. For example, in reading about it i heard about 'spineless tagless G-machines', and i found a paper describing that and read about a weird system in which, to evaluate a function whose arity is only known at runtime, you put a bunch of potential arguments on the stack and then jump into code for the function, and then IT decides how many arguments to consume; if it generates another function as a result, it calls that function directly to consume the rest of the arguments, etc. Then in later papers i found that they dropped this weird 'push/enter' system for the simpler 'eval/apply' system that i would have expected (the caller looks up the arity of the function at runtime and then calls the function with only the arguments it will use, then calling the output of the function if appropriate). I also found that they later switched from a tagless system to one with tags (not type tags, but rather a tag for whether an object has been evaluated yet, and for which outermost constructor was used to generate the object; this is type-specific because different types have different sets of constructors, so we kind of have a type-within-a-type, and the type is known at compile time but the choice of constructor, the type-within-a-type, is not!). And then there are info tables describing each object, and the garbage collector has to read these, and the garbage collector also apparently can sometimes help in propagating tags for some reason, and it all sounds very complicated.
but now that i've read about that, i wonder if i can't boil down the basic operations that the runtime is trying to do into an intermediate VM language, and leave questions like push/enter vs. eval/apply, and of how/where you store the stuff that Haskell puts in type tags and info tables, to the VM implementation (compiler or interpreter). Then Oot Core could do all this Haskelly stuff but the reference implementation could be comprehensible. Of course, the optimized reference implementation could compile this intermediate language into something more efficient, like Haskell does.
Now, what is the difference between GHC Core and the thing i am talking about? I guess it's that i'm talking about an intermediate VM language for what instructions are being executed at runtime, rather than a representation of the source code that is being compiled.
operations like: APPLY (what else)
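for concreteness, a toy sketch of what an eval/apply-style APPLY might do (this is just my illustration, not GHC's actual mechanism; the names are made up):

    # eval/apply: the caller inspects the callee's arity at runtime and decides
    # between partial application, an exact call, and applying the result to
    # any leftover arguments
    class Closure:
        def __init__(self, code, arity, env=()):
            self.code = code      # a Python callable standing in for compiled code
            self.arity = arity    # total number of arguments the code expects
            self.env = env        # arguments captured so far

    def apply(fn, args):
        args = fn.env + tuple(args)
        if len(args) < fn.arity:                      # too few: build a partial application
            return Closure(fn.code, fn.arity, args)
        result = fn.code(*args[:fn.arity])            # exactly enough: ordinary call
        extra = args[fn.arity:]
        return apply(result, extra) if extra else result   # too many: apply the result to the rest

    add3 = Closure(lambda a, b, c: a + b + c, 3)
    assert apply(apply(add3, [1]), [2, 3]) == 6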
also, other 'runtime'-y things, such as COW (copy-on-write)
http://www.haskell.org/haskellwiki/Ministg
so, this suggests that expressions have a place in oot assembly; not only because we could have an AST instead of a linear sequence of VM instructions, but also because (for the same reason that lazy languages are like a reduction of expression graphs) thunks must contain expressions. also, we should think about how to generalize/unify this with dynamically passing around references to imperative code.
one piece of the puzzle that is missing is how to combine delimited continuations and friends (eg metaprogramming based on direct manipulation of the call stack) with haskell-style graph reduction. this may also be related to the puzzle of how to have tail calls but also support good stack traces; as ministg points out, "Lazy evaluation means that the evaluation stack used by the STG machine is not so useful for diagnosing uncaught exceptions. The main issue is that the evaluation context in which thunks are created can be different to the evaluation context in which thunks are evaluated. The same problem does not occur in eager languages because the construction of (saturated) function applications and their evaluation happens atomically, that is, in the same evaluation context. ". In fact, you might say that in such a language there are not 2, but 4, stack contexts for each function: the lexical (source code) ancestors of the function definition in the AST, the lexical (source code) ancestors of the function application in the AST, the ancestors in the call stack at the time of the function's creation, and the ancestors in the call stack at the time of the function's call. should also look at Di Franco's effort to combine call-stack metaprogramming with constraint propagation
people say 'synchronize' and have target language instructions like SYNC but what they really mean is often 'WAIT until the other thread gets here, or WAIT until the other thread does this', so i should call that opcode WAIT.
the idea of haskell data constructors as a type-within-a-type is interesting and possibly a fundamental way to generalize/simplify. So now we seem to have at least 3 kinds of types: atomic types (Hoon's odors), composite types (C structs, Python objects, Python and Haskell and Lisp lists, Python dicts, Hoon's structural typing), and the choice of data constructor within a Haskell type (eg is this List a Cons or a NilList? is this Maybe a Nothing or a Just?). in addition, there are interface types vs representation types, which may be an orthogonal axis
why does a G-machine reduce a graph and not a tree (recursion, but why?)? http://stackoverflow.com/questions/11921683/understanding-stg says "Haskell's pervasive recursion mean that Haskell expressions form general graphs, hence graph-reduction and not tree-reduction" but i'm not sure i understand that; wouldn't we just be working with a lazily generated (possibly infinite) tree?
another primitive that we may need to deal with is fexprs, that is, (first-class) functions that don't evaluate their arguments first: https://en.wikipedia.org/wiki/Fexpr (how is this different from any function in the context of a lazy language? should we just generalize this to strictness annotations on each argument?)
APPLY
COW
WAIT
GETTABLE? SETTABLE?
CPUID -- args in reg, first call CPUID(0) to get highest arg supported: http://en.m.wikipedia.org/wiki/CPUID; vendor id, processor id, feature bits, serial number, core topology, cache config, virtual and physical address sizes
GENSYM (see below)
MALLOC, MEMFREE
UNDEF (done with the contents of this register)
VARTYPE
PERSPECTIVE
PUSH
POP
GET
SET
CPY (deep copy; value copy, actually COW?)
MOV
FILL-IN-TEMPLATE (like python's '%' on strings, but with graphs)
MATCH (graph regex)
REPLACE (graph regex)
and stuff about grammars (parsing and producing?) and stuff about logic and oberon stuff (while loops, etc)?
SHALLOWCPY
ALIAS, GETALIAS (see [2])
DIRECTGET, DIRECTSET (non-indirection versions of array indexing; take the array stride as arguments; 1d arrays only?) (or, just GET and SET with modalities, but that would decrease code density)
ENTER, LEAVE (enter and leave block scopes; something similar may be/seems to be present in Ruby YARV bytecode and Perl5)
PUTSELF (both Lua and Ruby YARV appear to have an instruction to reference Self/This)
note: list of ops copied to and continued at [3]
from [4]: some core operations from Guile's VM:
a new (to me, not to the world) interpretation of Nock's core instructions:
select, identity, nock, isLeaf, successor, isStructurallyEqual:
in other words:
(note: no mention of stack manipulation/continuations/jumping here, except for 'run a function within an environment')
(note: we can construct functions because we can build an AST as data and then execute it as code using Nock)
(note: why is this 'functional'? maybe b/c: (a) operations are defined in terms of macros ie combinators, (b) to use eg the environment you copy it, you don't mutate it)
this interpretation shows the similarity between even Nock and various other core languages:
so some stack ops:
DROP DUP SWAP OVER ROT DUP2 DROP2 PICK TUCK
DRIP: a b c ... -> a a c ... (MOV reg1 reg2)
DIP: a b c ... -> b b c ... (MOV reg2 reg1)
NUP: a b c ... -> a b b c ... (PUSH reg2 onto the (rest of the) stack)
NIP: a b c ... -> a c ... (POP the (rest of the) stack into reg2)
TUCK: a b c ... -> a b a c ... (PUSH reg1 onto the (rest of the) stack)
TAKE: a b c ... -> c b ... (POP the (rest of the) stack into reg1)
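a toy model of these, treating the top two stack slots as reg1/reg2 and everything below as 'the rest of the stack' (just to pin down the semantics; index 0 is TOS):

    def drip(s): s[1] = s[0]            # a b c ... -> a a c ...
    def dip(s):  s[0] = s[1]            # a b c ... -> b b c ...
    def nup(s):  s.insert(2, s[1])      # a b c ... -> a b b c ...
    def nip(s):  s[1] = s.pop(2)        # a b c ... -> a c ...
    def tuck(s): s.insert(2, s[0])      # a b c ... -> a b a c ...
    def take(s): s[0] = s.pop(2)        # a b c ... -> c b ...

    s = ['a', 'b', 'c', 'd']
    tuck(s); assert s == ['a', 'b', 'a', 'c', 'd']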
PUSH POP OVER UNDER PEEK
cow, fold, map as core ops
---
some things u see in a lot of languages:
APPLY (call a function, possibly with partial arguments or currying)
APPLYTAIL
return
GOTO/JMP
conditional JMP
comparisons, esp. equality
a way to make a composite data structure
a way to index into or destructure that composite data structure
arithmetic and other 'primitive functions'
looping constructs
variable assignment
a way to make a closure (a function with an environment of variables to be used in that function)
switch/jump table (special case: 'case' of an ADT)
try, raise, catch
call FFI
and some have:
call/cc, delimited continuations, and other 'first-class stack frame metaprogramming'
first-class, executable representation of code
polymorphic dispatch
the nock operator (do something to the environment, then do something to the code, then evaluate the resulting code upon the resulting environment)
some popular data structures:
tuples
linked lists (used as binary trees in the Lisp or Nock way, e.g. first/rest)
random access/packed lists/arrays/vectors
dicts
graphs?
functions, closures, objects, tagged unions/ADTs
stacks seem to be used mostly for keeping track of evaluating an expression, which is shaped like a tree, linearly (eg in the AST you don't need the stack, but in bytecode you do). the call stack can be thought of as an extension to this
could have calls with arbitrarily many arguments; have the second operand have a sentinel that means 'composite argument', which is found outside of the fixed-width bytecode (are the composite args embedded following the instruction, or in a special constant table?)
---
---
it should be made clear that the use of <16-bit wide operands does not imply a word size of 16 bits. In fact, the word size is (unspecified? or specified to be infinite?). The <16-bit operands merely impose (a) an upper limit on the number of registers, (b) an upper limit on the number of entries in the constant table for a given code segment (but you can always go to a new code segment with a new constant table), (c) an upper limit on the size of immediate values encoded right in the operand, rather than in a constant table. Note that i say <16-bit operands because (reverse search for 'addressing mode') some of the bits in the operand field are used to select the addressing mode (so we only have between about 12 and 14 bits left for the actual values, depending on the addressing mode used, and on how we decide to encode addressing modes).
---
analysis of 8-bit and 16-bit video game consoles:
(note: often i say K when i mean KiB)
I looked at the tech specs of the NES (8-bit), the SNES (16-bit) and the Sega Genesis (sometimes called "16 bit" but appears to have a 32-bit processor with 24-bit addressing and a 16-bit wide data bus; this is sometimes called 16/32-bit). See data at [5], sections "8-bit video game consoles" and "16-bit video game consoles".
My main objectives were to determine how much memory was addressed, and what the instruction sets were. I am particularly interested in the question of whether the <16-bit operand field sizes that i am contemplating in Oot Assembly would be in the ballpark of the amounts of memory needed for systems like these.
Copying from the summary i wrote over there:
" The instruction sets were 6502 (and 65C816, a 6502 extension) (NES, SNES) and 68000 (Genesis). The address sizes were 16-bits (NES), 24-bit (SNES, Sega Genesis). The builtin RAM was 2 K (NES), 128 K (SNES), 72 K (Genesis). In-cartridge memory was typically up to 8 K, but could go up to 64K with 8K banks (NES; unsure about the other 2). (Program) ROM sizes were cheaply 32K (NES), typically up to 256K (NES) or 4 MiB? (SNES), max 512K (NES), in banks of 32K (NES).
Builtin graphic RAM was 2 K (NES) or 64K (SNES, Genesis), cartridge-expandable in banks of 8 KiB? (NES). Max typical CHR RAM was ~128 KiB? (NES, MMC1). Critical numbers regarding max # of sprites were 8 (NES), 64 (NES). "
How does this square with my <16-bit operand size limit proposal?
16 bits would be enough to directly address:
16 bits would NOT be enough to directly address:
But now because of addressing mode, i have less than 16 bits. So let's say i have 13 bits (8k addressable).
13 bits is enough to directly address:
13 bits would NOT be enough to directly address:
12 bits would not be enough to address any of these items.
How does this square with parallelism? The NES sprite critical numbers 8 and 64 seem relevant.
Conclusions:
---
so, if 13 bits is underpowered, maybe we should take the addressing mode bits out of the operands and put them in the first of the 4 16-bit fields, reducing the number of opcodes. We had too many opcodes anyways. If we only need 32k, that's just 15 bits, but since that's close there may be some benefit to having the 3 operand field lengths be an even 16 bits instead of a non-power-of-2. Perhaps on some architectures we'll have to execute fewer operations in order to emulate/execute Oot VM code if each of the 3 operands is an even 16 bits (not sure about that b/c we still have to unpack the same bits out of the first field).
Also, come to think of it, if Nintendo provides for 32K of ROM and that's too little for most cartridges so that bankswitching is common, then i guess that even a small program on an 8-bit system is often >32K, so Oot programs will surely be at least this big. And i guess it's kind of weird and potentially inefficient to have to put program offsets into a constant table when the program is the inner core of an interpreter. And i could be wrong about all these being the right numbers, so having at least 64k to address, when i think 4k to 8k to 32k is needed, future-proofs it a tiny bit.
A disadvantage to that is that if we need 3 addressing mode bits per operand, that's 9 new bits that we now stuff into the first field, leaving only 7 bits. If the first field also has an emulation bit, then there's only 6 bits left, e.g. 64 opcodes, the minimum that we contemplated.
Otoh, we are trying to be minimalistic here, so maybe that's okay.
as i said earlier, as with the Dalvik VM on Android, we expect that there will be faster VM languages to implement Oot in, and that implementations bent on efficiency might compile Oot VM (OVM, i guess) to an idiosyncratic or platform-specific faster language. So not having enough opcodes isn't really a big problem.
Also, we're going to have to have a Jets-like system to replace at least anything in the standard library, and maybe anything in the standard distribution, with optimized platform-specific FFI calls, so we don't need to put e.g. every flavor of addition into Oot Assembly (ovm? osm? oosm? oovm?)
Now, if we are doing this, we really can't go below 6 bits for the opcode, so the last scheme described above, in which each operand had 4 addressing mode bits (constant bit, direct/indirect bit, emulation bit, special bit), is too expensive. Even if you left those in the operands, that would mean the operand values would be only 12 bits, and the analysis above calls for at least 13 bits.
The obvious thing to do is to drop the 'special bit' and rely on a slightly dirtier (but more frugal with memory) 'zero page'-ish memory-mapped specials architecture for the stacks, and rely on more traditional mechanisms for threadlocals and volatiles (which makes some sense, because the amount of these that you'll want or have will depend on the platform and the application, and will not always correspond to approximately half of main memory).
so to summarize, the current scheme appears to be:
undefined word size
64-bit instruction size
instructions divided into 4 16-bit fields: ctrl, op1, op2, op3.
ctrl is a 16-bit field divided into smaller fields as follows:
hmm... the emulation here seems a bit unbalanced; what's the use of op1, op2, op3 getting to address 16 bits when emulating if ctrl can only address 6 bits? So let's just have an EMU instruction and give up on the emulation bits. so now the current scheme appears to be:
ctrl is a 16-bit field divided into smaller fields as follows:
it's tempting to take those 4 bits back and add custom opcodes and custom addressing modes...
but also, don't we need an immediate mode? so how about 3 bits for addressing mode:
and, as noted above, reserve one bit out of the remaining 7 bits for 'custom opcodes'. Leaving 6 bits for non-custom opcodes (64 opcodes).
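to pin the current scheme down, a little encode/decode sketch (the field order within ctrl is my arbitrary choice; only the widths come from the scheme above):

    # 64-bit instruction = 16-bit ctrl + three 16-bit operands;
    # ctrl = 6 opcode bits + 1 'custom opcode' bit + 3 addressing-mode bits per operand
    def encode(opcode, custom, modes, ops):
        assert opcode < 64 and custom < 2
        assert all(m < 8 for m in modes) and all(o < 0x10000 for o in ops)
        ctrl = opcode | (custom << 6) | (modes[0] << 7) | (modes[1] << 10) | (modes[2] << 13)
        return ctrl | (ops[0] << 16) | (ops[1] << 32) | (ops[2] << 48)

    def decode(instr):
        ctrl = instr & 0xFFFF
        opcode, custom = ctrl & 0x3F, (ctrl >> 6) & 1
        modes = [(ctrl >> s) & 0x7 for s in (7, 10, 13)]
        ops = [(instr >> s) & 0xFFFF for s in (16, 32, 48)]
        return opcode, custom, modes, ops

    i = encode(opcode=5, custom=0, modes=[1, 2, 0], ops=[100, 200, 300])
    assert decode(i) == (5, 0, [1, 2, 0], [100, 200, 300])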
---
this doesn't leave any bits for some of my other oot-y ideas, such as:
if we were to do any of these, we'd need to either use prefixes (annotations), or increase the instruction length. Instruction length is currently 64 bits (4 16-bit units), so the next power of 2 is 128 bits, giving us 4 more units. If we did that we could:
as noted elsewhere above, putting the type info separate slows down execution because each opcode implementation must dispatch on type; it's more efficient to just have a separate opcode for each operation-type combination.
arguments against this:
arguments for:
---
i probably had better ideas above, but this makes me wonder again, what a 32-bit instruction length would look like:
1 form bit
1 reserved bit
6 addr mode bits (2 addr mode bits x 3 operands)
6 opcode bits
18 operand bits (3 6-bit operands)
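a decoding sketch of that short form (the field placement is my guess; only the widths come from the list above):

    def decode_short(instr):
        form     = (instr >> 31) & 0x1
        reserved = (instr >> 30) & 0x1
        modes    = [(instr >> s) & 0x3 for s in (28, 26, 24)]   # 2 addr mode bits x 3 operands
        opcode   = (instr >> 18) & 0x3F                         # 6 opcode bits
        ops      = [(instr >> s) & 0x3F for s in (12, 6, 0)]    # 3 6-bit operands
        return form, reserved, modes, opcode, ops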
again, however, if we want to fix the length (to 128 bits, since that's what the longer format needs), then maybe these would come in packets of 4:
1 form bit
7 reserved bits
4*30 = 120 bits
also, if we are getting all complicated here, perhaps we should make the 'long' format shorter by just giving up on the 16 bits (!) and saying, go ahead and use the constant table more? No, because we want 16-bit indices into the constant table
well wait, what if we only had 11-bit indices into the constant table? we'd save 20 bits in the long format. we already have 16 unused bits, so that's 36 bits. we can give up 12 of the type info bits, for 48 bits saved. we can give up 12 of the flag bits, for 60 bits saved. we can give up 4 opcode bits, and now we're back to a 64-bit instruction length, albeit one with non-power-of-2 field lengths. In other words:
note the same argument about the EMU addr mode bit being useless if the length of the opcode field is shorter than the operand fields applies here. So instead:
the four flag bits could then be: lazy, map, maybe, typecheck/boundscheck
some comments:
---
an adder has three inputs and two outputs: two normal inputs and a carry input, and a normal output and a carry output. This may have implications for what the ideal arity of assembly instructions is.
In traditional assembly, the status register is used to provide hidden inputs and hidden outputs for arithmetic instructions. This is a particular case of the same general idea as accumulator registers, wherein we make instruction encoding shorter by hardcoding certain inputs and outputs to/from special registers. One could imagine that instead of having a fixed status register, every arithmetic instruction would take an additional input saying which register to use as the status register for input, and an additional output saying which register to use as the status register for output. Clearly in the majority of cases this wouldn't help much over just having a fixed status register, so it would waste a lot of space in the instruction encoding (plus, make the execution more complex and hence slower). From the point of view of functional programming, however, this is abusing side effects where you really should be just taking inputs and returning outputs.
In a higher-level language, the need for a carry input could possibly be obviated; for example, the second input could be a tuple which is a pair of (second input, carry). All you really need is two inputs (because you need to have a pair() primitive to create tuples; there's probably a way to get around this with one input, but it would be ugly).
In a higher-level language, the carry output seems similar to an error code.
In a higher-level language, one might think the carry output could be eliminated by using an Option type (a type with special sentinel values, e.g. IEEE floating point's +Inf, -Inf, NaN); carry could be expressed by returning +Inf or -Inf. This almost does the trick; however, it can't express the difference between adding 1 (first input) + 1 (second input) + 1 (carry input), which yields 11 (1 at the carry output, and 1 at the value output), vs. adding 1 + 0 + 1 or 0 + 1 + 1 or 1 + 1 + 0, all of which yield 10; if you just used a carry sentinel, then both of these cases would yield +Inf (the carry sentinel).
In a higher-level language, two outputs could be expressed using tuples (in this case, a pair); the single output would be (carry, value).
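eg, a one-bit full adder written this way (just a sketch to make the shape concrete):

    # the carry comes in as part of a tuple input and goes out as part of a tuple
    # output, instead of living in a hidden status register
    def add_with_carry(a, b_and_carry):
        b, carry_in = b_and_carry
        total = a + b + carry_in
        return (total >> 1, total & 1)   # (carry, value)

    assert add_with_carry(1, (1, 1)) == (1, 1)   # 1+1+1 = binary 11
    assert add_with_carry(1, (0, 1)) == (1, 0)   # 1+0+1 = binary 10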
I don't really have any firm conclusions here, those are just thoughts.
---
i guess there's a very simple argument to be made for 16-bit addressing:
16 bits is the smallest power of 2 that could possibly comfortably work; 8 bits is only 256 memory locations, which is clearly too small for many very simple applications
---
more evidence for 16-bit addressing, and more citations for which 16-bit or less CPU families were popular/notable:
6502 (8-bit; Apple I, Atari, Apple II, Nintendo Entertainment System, Commodore 64, BBC Micro)
65C02 (8-bit backwards-compatible extension; Apple IIc, enhanced Apple IIe, BBC Master)
65C816 (16-bit backwards-compatible extension; SNES, Apple IIgs)

8080 (various CP/M systems)
Z80 (backwards-compatible with 8080)
8086, 8088 (16-bit, backwards-compatible ("source" but not binary) with 8080; IBM PC)
80286 (16-bit, predecessor of the 32-bit 386 which is the foundation of present-day x86)

6800 (8-bit; Altair 680)
6809 (8-bit, backwards-compatible (source-compatible) with 6800; TRS-80 Color Computer)
68HC11 (8-bit embedded, based on the 6800)
68HC12 (16-bit MCU, mostly backwards-compatible with the 68HC11, 16-bit program counter)
68HC16 (16-bit MCU, based on the 68HC11)
68000 (16-bit; Sun, Mac, Amiga, Atari ST, Sega Genesis, Sinclair QL, early laser printers, various MCUs)
MSP430 (16-bit MCU)
PDP-11
"The Z80 and its derivatives and clones made up one of the most commonly used CPU families of all time, and, along with the MOS Technology 6502 family, dominated the eight-bit microcomputer market from the late 1970s to the mid-1980s." -- http://en.wikipedia.org/w/index.php?title=Zilog_Z80&oldid=634577740
The Z80 ISA was an extension of the Intel 8080 ISA (binary compatible with 8080 code). The Z80 was an 8-bit architecture with 16-bit addressing [6]. It had 8-bit registers which were in pairs, so the pair could be used as a 16-bit register.
" New syntax
Because Intel had a copyright on their assembly mnemonics,[20] a new assembly syntax had to be developed. This time a more systematic approach was used:
All registers and register pairs are explicitly denoted by their full names
Parentheses are consistently used to indicate "memory contents at" (indirection, or pointer dereferencing) with the exception of some jump instructions.[21]
All load and store instructions use the same mnemonic name, LD, for LOAD (a return to the simplistic Datapoint 2200 vocabulary); other common instructions, such as ADD and INC, use the same mnemonic regardless of addressing mode or operand size. This is possible because the operands themselves carry enough information.
These principles made it straightforward to find names and forms for all new Z80 instructions, as well as orthogonalizations of old ones, such as LD BC,(1234).
Apart from naming differences, and despite a certain discrepancy in basic register structure, the Z80 and 8086 syntax are virtually isomorphous for a large portion of instructions. Only quite superficial similarities (such as the word MOV, or the letter X, for extended register) exist between the 8080 and 8086 assembly languages, although 8080 programs can be assembled into 8086 object code using a special assembler or translated to 8086 assembly language by a translator program.[22][23]

Instruction set and encoding
The Z80 uses 252 out of the available 256 codes as single byte opcodes ("root instruction"); the four remaining codes are used extensively as opcode prefixes:[24] CB and ED enable extra instructions and DD or FD selects IX+d or IY+d respectively (in some cases without displacement d) in place of HL. This scheme gives the Z80 a large number of permutations of instructions and registers; Zilog categorizes these into 158 different "instruction types", 78 of which are the same as those of the Intel 8080[24] (allowing operation of 8080 programs on a Z80). The Zilog documentation further groups instructions into the following categories:
8-bit arithmetic and logic operations
16-bit arithmetic
8-bit load
16-bit load
Bit set, reset, and test
Call, return, and restart
Exchange, block transfer, and search
General purpose arithmetic and CPU control
Input and output
Jump
Rotate and shift
No multiply instruction is available in the original Z80.[25] Different sizes and variants of additions, shifts, and rotates have somewhat differing effects on flags because the flag-influencing properties of the 8080 were copied. Load instructions do not affect the flags (except for the special purpose I and R register loads). "
-- [7]
" The first widely adopted 8-bit microprocessor was the Intel 8080, being used in many hobbyist computers of the late 1970s and early 1980s, often running the CP/M operating system; it had 8-bit data words and 16-bit addresses. The Zilog Z80 (compatible with the 8080) and the Motorola 6800 were also used in similar computers. The Z80 and the MOS Technology 6502 8-bit CPUs were widely used in home computers and second- and third-generation game consoles of the '70s and '80s. Many 8-bit CPUs or microcontrollers are the basis of today's ubiquitous embedded systems. " -- http://en.wikipedia.org/w/index.php?title=8-bit&oldid=628245098
" The address bus is typically a double octet wide (i.e. 16-bit), due to practical and economical considerations. This implies a direct address space of only 64 KB on most 8-bit processors. " -- http://en.wikipedia.org/w/index.php?title=8-bit&oldid=628245098
the http://en.wikipedia.org/wiki/Motorola_6800 also had a 16-bit address bus
wikipedia also lists "12-bit" in http://en.wikipedia.org/w/index.php?title=Template:Computer_architecture_bit_widths&oldid=633925183 but this isn't a power of 2 and isn't as popular as 8-bit and 16-bit
" The IBM System/360 introduced byte-addressable memory with 8-bit bytes, as opposed to bit-addressable or decimal digit-addressable or word-addressable memory, although its general purpose registers were 32 bits wide, and addresses were contained in the lower 24 bits of those addresses." -- http://en.wikipedia.org/w/index.php?title=8-bit&oldid=628245098
4-bit processors often have 10 to 20-bit addressing:
" The first commercial microprocessor was the binary coded decimal (BCD-based) Intel 4004,[2][3] developed for calculator applications in 1971; it had a 4-bit word length, but had 8-bit instructions and 12-bit addresses." -- http://en.wikipedia.org/w/index.php?title=4-bit&oldid=630773377
" The TMS 1000, the world's first single-chip microprocessor, was a 4-bit CPU; it had a Harvard architecture, with an on-chip instruction ROM with 8-bit-wide instructions and an on-chip data RAM with 4-bit words. " -- http://en.wikipedia.org/w/index.php?title=4-bit&oldid=630773377 . It had 1k bytes of ROM (addressed by 10 bits, in a 6-bit registers and a 4-bit register) and 32 bytes of RAM (addressed as 256 bits of RAM by 2 4-bit registers).
" The Saturn processors, used in calculators such as the commonly used HP-48 scientific calculator, are 4-bit machines; as the Intel 4004 did, they string multiple 4-bit words together, e.g. to form a 20-bit memory address, and most of its registers are 64 bits, storing 16 4-bit digits." -- http://en.wikipedia.org/w/index.php?title=4-bit&oldid=630773377
" Very often, when referring to the word size of a modern computer, one is also describing the size of address space on that computer. ... But this does not always hold. Computers often have memory addresses larger or smaller than their word size. For instance, almost all 8-bit processors, such as 6502, supported 16-bit addresses—[citation needed] if not they would have been limited to a mere 256 byte memory. The 16-bit Intel 8088 had only an 8-bit external memory bus on early IBM PCs, and the 16-bit Intel 8086 supported 20-bit addressing,[citation needed] allowing it to access 1 MiB? rather than 64 KiBs?