proj-oot-ootAssemblyNotes3

i've been reading about Lua's VM from

http://luaforge.net/docman/83/98/ANoFrillsIntroToLua51VMInstructions.pdf

which is an excellent, easy-to-read document. I'm interested in the VM because Lua seems somewhat small, and the VM seems somewhat small, and i am looking at it as a basis set of operations for Oot (a core language). Seen for this purpose, there are some unnecessarily complex parts of the VM, doubtless present for efficiency, such as:

so i guess an assembly for Oot would want to be less complex. Other thoughts:

---

so, addressing modes:

(note: in this scheme there is no stack addressing mode!)

an alternative:

an alternative:

1 bit says: constant or non-constant

if constant: one bit says: immediate constant or constant-table constant

if non-constant: one bit says: direct or indirect

stacks, and the top two elements of each stack, are memory-mapped (so, as described above, with 4 stacks, that's 16 memory-mapped locations): one bit is whether this is a straight read/write or a pop/push, one bit is whether this is to the TOS or the second-element-of-stack, and two bits are which stack.

could also have one bit for whether we are accessing 'special' memory, like the memory-mapped stacks, or 'ordinary' memory. 'special' memory has the properties that (a) when you read from a special memory location, you might get a value different from what you last wrote to it, (b) 'reads' might cause mutation (e.g. when they are interpreted as POPs). We might extend this to say that special memory is 'volatile', and that accesses to it should be executed immediately and non-reordered because they might have arbitrary side-effects. We might also say that special memory is always thread-local and never shared.

also, as noted above, we need an 'emulation'/'cascaded addressing mode' bit (at least for direct and indirect addressing, do we also need it for constants?). do we need PC-relative addressing? do we need table (eg displacement) addressing?

if we have: constant bit, direct/indirect bit, emulation bit, special bit, that's 4 bits, so if we have 16-bit words, we have 12 bits left, ie 4k. note that in this scheme, addresses can be up to 4k but constants can be up to 16k, because constants don't need 'emulation' or 'special' bits.
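(to make the bit budget concrete, a minimal sketch in Python of decoding a 16-bit operand under this scheme; the exact bit positions are my assumption, the notes above don't fix them:)

    def decode_operand(op16):
        # assumed layout: mode bits at the top, payload at the bottom
        if (op16 >> 15) & 1:                      # constant bit
            # constants skip the emulation/special bits, so they keep
            # 14 payload bits -- values up to 16k, as noted above
            table = (op16 >> 14) & 1              # immediate vs constant-table
            return ('const_table' if table else 'const_imm', op16 & 0x3FFF)
        indirect  = (op16 >> 14) & 1
        emulation = (op16 >> 13) & 1
        special   = (op16 >> 12) & 1
        return ('indirect' if indirect else 'direct',
                op16 & 0x0FFF,                    # addresses up to 4k
                emulation, special)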

(Instead, could use the ARM-ish barrel-shift-ish encoding where the 12 low-order bits of immediate constants are interpreted straightforwardly, but the two high-order bits shift the others, e.g. if the two high order bits are 00 then the value is the value given by the 12 low-order bits, if the two high order bits are 01 then the value is 2*(the value given by the 12 low-order bits), if 10 then 4*(the value given by the 12 low-order bits), and 11 is multiplication by 8, allowing representation of all values up to 4k, and some values with trailing zeros up to 32k. This sounds useful/efficient but is slightly more complicated, so i probably won't do it)
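(for reference, a sketch of that decode, on a 14-bit constant payload of 2 shift bits + 12 value bits:)

    def decode_shifted(payload14):
        shift = (payload14 >> 12) & 0b11      # 00,01,10,11 -> x1, x2, x4, x8
        return (payload14 & 0xFFF) << shift

    assert decode_shifted(0b01_000000000011) == 6         # 2 * 3
    assert decode_shifted(0b11_111111111111) == 4095 * 8  # max 32760, just under 32k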

---

in [1] i wrote:

" Note that Haskell is implemented a bit differently from many languages; this stems from Haskell's properties:

It is therefore assumed that Haskell programs will, at runtime, very frequently be dealing with unevaluated thunks, with closures which represent unevaluated or partially evaluated functions, and with the application of functions whose arity was unknown at compile time. "

due to these considerations, the GHC compiler appears at first rather complicated. For example: in reading about it i heard about 'spineless tagless G-machines', and i found a paper describing that and read about a weird system in which, to evaluate a function whose arity is only known at runtime, you put a bunch of potential arguments on the stack and then jump into code for the function, and then IT decides how many arguments to consume; if it generates another function as a result, it calls that function directly to consume the rest of the arguments, etc. Then in later papers i found that they dropped this weird 'push/enter' system for the simpler 'eval/apply' system that i would have expected (the caller looks up the arity of the function at runtime and then calls the function with only the arguments it will use, then calling the output of the function if appropriate). I also found that they later switched from a tagless system to one with tags (not type tags, but rather a tag for whether an object has been evaluated yet, and for which outermost constructor was used to generate the object; this is type-specific because different types have different sets of constructors, so we kind of have a type-within-a-type, where the type is known at compile time but the choice of constructor, the type-within-a-type, is not!). And then there are info tables describing each object, and the garbage collector has to read these, and the garbage collector also apparently can sometimes help in propagating tags for some reason, and it all sounds very complicated.
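(a minimal sketch of eval/apply in Python, with a made-up Closure representation rather than GHC's actual heap objects: the caller checks arity at runtime, supplies exactly that many arguments, and keeps calling whatever comes back:)

    class Closure:
        """A function plus any arguments collected so far."""
        def __init__(self, arity, code, held=()):
            self.arity, self.code, self.held = arity, code, held

    def apply(f, args):
        args = list(args)
        while args:
            need = f.arity - len(f.held)
            if len(args) < need:                # undersaturated: build a
                return Closure(f.arity, f.code, # partial application (PAP)
                               f.held + tuple(args))
            f, args = f.code(*f.held, *args[:need]), args[need:]
            # oversaturated: f must have returned another function,
            # which now consumes the remaining arguments
        return f

    add = Closure(2, lambda x, y: x + y)
    inc = apply(add, [1])                       # "g = add 1"
    assert apply(inc, [41]) == 42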

but now that i've read about that, i wonder if i can't boil down the basic operations that the runtime is trying to do into an intermediate VM language, and leave questions like push/enter vs. eval/apply, and of how/where you store the stuff that Haskell puts in type tags and info tables, to the VM implementation (compiler or interpreter). Then Oot Core could do all this Haskelly stuff but the reference implementation could be comprehensible. Of course, the optimized reference implementation could compile this intermediate language into something more efficient, like Haskell does.

Now, what is the difference between GHC Core and the thing i am talking about? I guess it's that i'm talking about an intermediate VM language for what instructions are being executed at runtime, rather than a representation of the source code that is being compiled.

operations like: APPLY (what else)

also, other 'runtime'-y things, such as COW (copy-on-write)

http://www.haskell.org/haskellwiki/Ministg

so, this suggests that expressions have a place in oot assembly: not only because we could have an AST instead of a linear sequence of VM instructions, but also because thunks must contain expressions (for the same reason that lazy languages are like a reduction of expression graphs). also, should think about how to generalize/unify this with dynamically passing around references to imperative code.

one piece of the puzzle that is missing is how to combine delimited continuations and friends (eg metaprogramming based on direct manipulation of the call stack) with haskell-style graph reduction. this may also be related to the puzzle of how to have tail calls but also support good stack traces; as ministg points out, "Lazy evaluation means that the evaluation stack used by the STG machine is not so useful for diagnosing uncaught exceptions. The main issue is that the evaluation context in which thunks are created can be different to the evaluation context in which thunks are evaluated. The same problem does not occur in eager languages because the construction of (saturated) function applications and their evaluation happens atomically, that is, in the same evaluation context. ". In fact, you might say that in such a language there are not 2, but 4, stack contexts for each function: the lexical (source code) ancestors of the function definition in the AST, the lexical (source code) ancestors of the function application in the AST, the ancestors in the call stack at the time of the function's creation, and the ancestors in the call stack at the time of the function's calling. should also look at di franco's effort to combine call-stack metaprogramming with constraint propagation

people say 'synchronize' and have target language instructions like SYNC but what they really mean is often 'WAIT until the other thread gets here, or WAIT until the other thread does this', so i should call that opcode WAIT.

the idea of haskell data constructors as a type-within-a-type is interesting and possibly a fundamental way to generalize/simplify. So now we seem to have at least 3 types of types: atomic types (Hoon's odors), composite types (C structs, Python objects, Python and Haskell and Lisp lists, Python dicts, Hoon's structural typing), and the choice of data constructor within a Haskell type (eg is this List a Cons or a NilList? is this Maybe a Nothing or a Just?). in addition, there are interface types vs representation types, which may be an orthogonal axis
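(a tiny sketch of the constructor-tag idea, representing values as (tag, fields) pairs; this is an illustration, not Haskell's actual heap layout:)

    # Maybe a = Nothing | Just a -- the tag says which constructor built the
    # value; it's only meaningful relative to the type: a type-within-a-type
    NOTHING, JUST = 0, 1

    def from_maybe(default, m):
        tag, fields = m
        return fields[0] if tag == JUST else default

    assert from_maybe(0, (JUST, (42,))) == 42
    assert from_maybe(0, (NOTHING, ())) == 0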

why is a G-machine graph a graph and not a tree (recursion, but why?)? http://stackoverflow.com/questions/11921683/understanding-stg says "Haskell's pervasive recursion mean that Haskell expressions form general graphs, hence graph-reduction and not tree-reduction" but i'm not sure i understand that; wouldn't we just be working with a lazily generated (possibly infinite) tree?

another primitive that we may need to deal with is fexprs, that is, (first-class) functions that don't evaluate their arguments first: https://en.wikipedia.org/wiki/Fexpr (how is this different from any function in the context of a lazy language? should we just generalize this to strictness annotations on each argument?)

APPLY
COW
WAIT
GETTABLE? SETTABLE?
CPUID (args in reg; first call CPUID(0) to get highest arg supported: http://en.m.wikipedia.org/wiki/CPUID; vendor id, processor id, feature bits, serial number, core topology, cache config, virtual and physical address sizes)
GENSYM (see below)
MALLOC, MEMFREE
UNDEF (done with the contents of this register)
VARTYPE
PERSPECTIVE
PUSH, POP
GET, SET
CPY (deep copy; value copy, actually COW?)
MOV
FILL-IN-TEMPLATE (like python's '%' on strings, but with graphs)
MATCH (graph regex)
REPLACE (graph regex)
and stuff about grammars (parsing and producing?) and stuff about logic and oberon stuff (while loops, etc)?
SHALLOWCPY
ALIAS, GETALIAS (see [2])
DIRECTGET, DIRECTSET (non-indirection versions of array indexing; take the array stride as arguments; 1d arrays only?) (or, just GET and SET with modalities, but that would decrease code density)
ENTER, LEAVE (enter and leave block scopes; something similar seems to be present in Ruby YARV bytecode and Perl 5)
PUTSELF (both Lua and Ruby YARV appear to have an instruction to reference Self/This)

note: list of ops copied to and continued at [3]


from [4]: some core operations from Guile's VM:


a new (to me, not to the world) interpretation of Nock's core instructions:

select, identity, nock, isLeaf, successor, isStructurallyEqual:

in other words:

(note: no mention of stack manipulation/continuations/jumping here, except for 'run a function within an environment')

(note: we can construct functions because we can build an AST as data and then execute it as code using Nock)

(note: why is this 'functional'? maybe b/c: (a) operations are defined in terms of macros ie combinators, (b) to use eg the environment you copy it, you don't mutate it)

this interpretation shows the similarity between even Nock and various other core languages:


so some stack ops:

DROP DUP SWAP OVER ROT DUP2 DROP2 PICK TUCK

DRIP: a b c ... -> a a c ... (MOV reg1 reg2)
DIP: a b c ... -> b b c ... (MOV reg2 reg1)
NUP: a b c ... -> a b b c ... (PUSH reg2 onto the (rest of the) stack)
NIP: a b c ... -> a c ... (POP the (rest of the) stack into reg2)
TUCK: a b c ... -> a b a c ... (PUSH reg1 onto the (rest of the) stack)
TAKE: a b c ... -> c b ... (POP the (rest of the) stack into reg1)
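(a sketch of these on a Python list whose index 0 is TOS, so reg1 = s[0], reg2 = s[1], and 'the rest of the stack' starts at s[2]:)

    def drip(s): s[1] = s[0]          # a b c ... -> a a c ...
    def dip(s):  s[0] = s[1]          # a b c ... -> b b c ...
    def nup(s):  s.insert(2, s[1])    # a b c ... -> a b b c ...
    def nip(s):  del s[1]             # a b c ... -> a c ...
    def tuck(s): s.insert(2, s[0])    # a b c ... -> a b a c ...
    def take(s): s[0] = s.pop(2)      # a b c d ... -> c b d ...

    s = ['a', 'b', 'c', 'd']
    take(s)
    assert s == ['c', 'b', 'd']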

PUSH POP OVER UNDER PEEK

cow, fold, map as core ops

---

some things u see in a lot of languages:

APPLY (call a function, possibly with partial arguments or currying)
APPLYTAIL
return
GOTO/JMP
conditional JMP
comparisons, esp. equality
a way to make a composite data structure
a way to index into or destructure that composite data structure
arithmetic and other 'primitive functions'
looping constructs
variable assignment
a way to make a closure (a function with an environment of variables to be used in that function)
switch/jump table (special case: 'case' of an ADT)
try, raise, catch
call FFI

and some have:

call/cc, delimited continuations, and other 'first-class stack frame metaprogramming'
first-class, executable representation of code
polymorphic dispatch
the nock operator (do something to the environment, then do something to the code, then evaluate the resulting code upon the resulting environment)

some popular data structures:

tuples
linked lists (used as binary trees in the Lisp or Nock way, e.g. first/rest)
random access/packed lists/arrays/vectors
dicts
graphs?
functions, closures, objects, tagged unions/ADTs

stacks seem to be used mostly for keeping track, linearly, of the evaluation of an expression, which is shaped like a tree (eg in the AST you don't need the stack, but in bytecode you do). the call stack can be thought of as an extension of this


could have calls with arbitrary many arguments; have the second operand have a sentinel that means 'composite argument', which is found outside of the fixed-width bytecode (are the composite args embedded following the instruction, or in a special constant table?)
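(a decode-side sketch, assuming the composite args are embedded following the instruction; the sentinel value is made up:)

    COMPOSITE = 0xFFFF     # hypothetical sentinel in the second operand field

    def fetch_args(code, pc, op2):
        """Return (args, pc of the next instruction)."""
        if op2 != COMPOSITE:
            return [op2], pc + 1              # ordinary fixed-width case
        n = code[pc + 1]                      # arg count follows the instruction
        return list(code[pc + 2 : pc + 2 + n]), pc + 2 + n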

---

it should be made clear that the use of <16-bit wide operands does not imply a word size of 16 bits. In fact, the word size is (unspecified? or specified to be infinite?). The <16-bit operands merely impose (a) an upper limit on the number of registers, (b) an upper limit on the number of entries in the constant table for a given code segment (but you can always go to a new code segment with a new constant table), (c) an upper limit on the size of immediate values encoded right in the operand, rather than in a constant table. Note that i say <16-bit operands because (reverse search for 'addressing mode') some of the bits in the operand field are used to select the addressing mode (so we only have between about 12 and 14 bits left for the actual values, depending on the addressing mode used, and on how we decide to encode addressing modes).

---

analysis of 8-bit and 16-bit video game consoles:

(note: often i say K when i mean KiB?)

I looked at the tech specs of the NES (8-bit), the SNES (16-bit) and the Sega Genesis (sometimes called "16 bit" but appears to have a 32-bit processor with 24-bit addressing and a 16-bit wide data bus; this is sometimes called 16/32-bit). See data at [5], sections "8-bit video game consoles" and "16-bit video game consoles".

My main objectives were to determine how much memory was addressed, and what the instruction sets were. I am particularly interested in the question of whether the <16-bit operand field sizes that i am contemplating for Oot Assembly would be in the ballpark of the amounts of memory needed for systems like these.

Copying from the summary i wrote over there:

" The instruction sets were 6502 (and 65C816, a 6502 extension) (NES, SNES) and 68000 (Genesis). The address sizes were 16-bits (NES), 24-bit (SNES, Sega Genesis). The builtin RAM was 2 K (NES), 128 K (SNES), 72 K (Genesis). In-cartridge memory was typically up to 8 K, but could go up to 64K with 8K banks (NES; unsure about the other 2). (Program) ROM sizes were cheaply 32K (NES), typically up to 256K (NES) or 4 MiB? (SNES), max 512K (NES), in banks of 32K (NES).

Builtin graphic RAM was 2 K (NES) or 64K (SNES, Genesis), cartridge-expandable in banks of 8 KiB (NES). Max typical CHR RAM was ~128 KiB (NES, MMC1). Critical numbers regarding max # of sprites were 8 (NES), 64 (NES). "

How does this square with my <16-bit operand size limit proposal?

16 bits would be enough to directly address:

16 bits would NOT be enough to directly address:

But now because of addressing mode, i have less than 16 bits. So let's say i have 13 bits (8k addressable).

13 bits is enough to directly address:

13 bits would NOT be enough to directly address:

12 bits would not be enough to address any of these items.

How does this square with parallelism? The NES sprite critical numbers 8 and 64 seem relevant.

Conclusions:

---

so, if 13 bits is underpowered, maybe we should take the addressing mode bits out of the operands and put them in the first of the 4 16-bit fields, reducing the number of opcodes. We had too many opcodes anyways. If we only need 32k, that's just 15 bits, but since that's close there may be some benefit to having the 3 operand field lengths be an even 16 bits instead of a non-power-of-2. Perhaps on some architectures we'll have to execute fewer operations in order to emulate/execute Oot code if each of the 3 operands is an even 16 bits (not sure about that b/c we still have to unpack the same bits out of the first field).

Also, come to think of it, if Nintendo provided for 32K of ROM and that was too little for most cartridges, so that bankswitching was common, then i guess that even a small program on an 8-bit system is often >32K, so Oot will surely be at least this big. And i guess it's kind of weird and potentially inefficient to have to put program offsets into a constant table when the program is the inner core of an interpreter. And i could be wrong about all these being the right numbers, so having 64k to address, when i think 4k to 8k to 32k is needed, future-proofs it a tiny bit.

A disadvantage to that is that if we need 3 addressing mode bits per operand, that's 9 new bits that we now stuff into the first field, leaving only 7 bits. If the first field also has an emulation bit, then there's only 6 bits left, e.g. 64 opcodes, the minimum that we contemplated.

Otoh, we are trying to be minimalistic here, so maybe that's okay.

as i said earlier, as with the Dalvik VM on Android, we expect that there will be faster VM languages to implement Oot in, and that implementations bent on efficiency might compile Oot VM (OVM, i guess) to an idiosyncratic or platform-specific faster language. So not having enough opcodes isn't really a big problem.

Also, we're going to have to have a Jets-like system to replace at least anything in the standard library, and maybe anything in the standard distribution, with optimized platform-specific FFI calls, so we don't need to put e.g. every flavor of addition into Oot Assembly (ovm? osm? oosm? oovm?)

Now, if we are doing this, we really can't go below 6 bits for the opcode, so the last scheme described above, in which each operand had 4 addressing mode bits (constant bit, direct/indirect bit, emulation bit, special bit), is too expensive. Even if you left those in the operands, that would mean the operand values would be only 12 bits, and the analysis above calls for at least 13 bits.

The obvious thing to do is to drop the 'special bit' and rely on a slightly dirtier (but more frugal with memory) 'zero page'-ish memory-mapped specials architecture for the stacks, and rely on more traditional mechanisms for threadlocals and volatiles (which makes some sense, because the amount of these that you'll want or have will depend on the platform and the application, and will not always correspond to approximately half of main memory).


so to summarize, the current scheme appears to be:

undefined word size

64-bit instruction size

instructions divided into 4 16-bit fields: ctrl, op1, op2, op3.

ctrl is a 16-bit field divided into smaller fields as follows:

hmm... the emulation here seems a bit unbalanced; what's the use of op1, op2, op3 getting to address 16 bits when emulating if ctrl can only address 6 bits? So let's just have an EMU instruction and give up on the emulation bits. so now the current scheme appears to be:

ctrl is a 16-bit field divided into smaller fields as follows:

it's tempting to take those 4 bits back and add custom opcodes and custom addressing modes...

but also, don't we need an immediate mode? so how about 3 bits for addressing mode:

and, as noted above, reserve one bit out of the remaining 7 bits for 'custom opcodes'. Leaving 6 bits for non-custom opcodes (64 opcodes).
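(a decode sketch of this scheme in Python; which bits of the ctrl field hold what is my assumption:)

    def decode(instr64):
        ctrl = instr64 & 0xFFFF
        ops  = [(instr64 >> (16 * i)) & 0xFFFF for i in (1, 2, 3)]  # op1..op3
        modes  = [(ctrl >> (3 * i)) & 0b111 for i in range(3)]  # 3 addr mode bits x 3 operands
        custom = (ctrl >> 9) & 1          # 'custom opcode' bit
        opcode = (ctrl >> 10) & 0x3F      # 6 bits: 64 non-custom opcodes
        return opcode, custom, modes, ops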

---

this doesn't leave any bits for some of my other oot-y ideas, such as:

if we were to do any of these, we'd need to either use prefixes (annotations), or increase the instruction length. Instruction length is currently 64 bits (4 16-bit units), so the next power of 2 is 128 bits, giving us 4 more units. If we did that we could:

as noted elsewhere above, putting the type info separate slows down execution because each opcode implementation must dispatch on type; it's more efficient to just have a separate opcode for each operation-type combination.

arguments against this:

arguments for:

---

i probably had better ideas above, but this makes me wonder again, what a 32-bit instruction length would look like:

1 form bit
1 reserved bit
6 addr mode bits (2 addr mode bits x 3 operands)
6 opcode bits
18 operand bits (3 6-bit operands)

again, however, if we want to fix the length (to 128 bits, since that's what the longer format needs), then maybe these would come in packets of 4:

1 form bit
7 reserved bits
4*30 = 120 bits (four 30-bit instructions)

also, if we are getting all complicated here, perhaps we should make the 'long' format shorter by just giving up on the 16-bits (!) and say, go ahead and use the constant table more? No, because we want 16-bit indices into the constant table

well wait, what if we only had 11-bit indices into the constant table? we'd save 20 bits in the long format. we already have 16 unused bits, so that's 36 bits. we can give up 12 of the type info bits, for 48 bits saved. we can give up 12 of the flag bits, for 60 bits saved. we can give up 4 opcode bits, and now we're back to a 64-bit instruction length, albeit one with non-power-of-2 field lengths. In other words:

note the same argument about the EMU addr mode bit being useless if the length of the opcode field is shorter than the operand fields applies here. So instead:

the four flag bits could then be: lazy, map, maybe, typecheck/boundscheck

some comments:

---

an adder has 3 inputs and two outputs; two normal inputs and a carry input, and a normal output and a carry output. This may have implications for what the ideal arity of assembly instructions is.

In traditional assembly, the status register is used to provide hidden inputs and hidden outputs for arithmetic instructions. This is a particular case of the same general idea as accumulator registers, wherein we make instruction encoding shorter by hardcoding certain inputs and outputs to/from special registers. One could imagine that instead of having a fixed status register, every arithmetic instruction would take an additional input, and an additional output, saying which registers to use as the status registers for input and for output. Clearly in the majority of cases this wouldn't help much over just having a fixed status register, so it would waste a lot of space in the instruction encoding (plus, make the execution more complex and hence slower). From the point of view of functional programming, however, this is abusing side effects where you really should be just taking inputs and returning outputs.

In a higher-level language, the need for a carry input could possibly be obviated; for example, the second input could be a tuple which is a pair of (second input, carry). All you really need is two inputs (because you need to have a pair() primitive to create tuples; there's probably a way to get around this with one input, but it would be ugly).

In a higher-level language, the carry output seems similar to an error code.

In a higher-level language, one might think the carry output could be eliminated by using an Option type (a type with special sentinel values, e.g. IEEE floating point's +Inf, -Inf, NaN); carry could be expressed by returning +Inf or -Inf. This almost does the trick; however, it can't express the difference between adding 1 (first input) + 1 (second input) + 1 (carry input), which yields 11 (1 at the carry output, and 1 at the value output), vs. adding 1 + 0 + 1 or 0 + 1 + 1 or 1 + 1 + 0, all of which yield 10; if you just used a carry sentinel, then both of these cases would yield +Inf (the carry sentinel).

In a higher-level language, two outputs could be expressed using tuples (in this case, a pair); the single output would be (carry, value).
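(putting those together, a one-bit full adder in this style: the carry rides along in tuples instead of a status register:)

    def add_bit(a, b_and_carry):
        b, carry_in = b_and_carry          # second input is a (value, carry) pair
        total = a + b + carry_in
        return (total >> 1, total & 1)     # single output: (carry_out, value)

    assert add_bit(1, (1, 1)) == (1, 1)    # 1+1+1 = binary 11
    assert add_bit(1, (0, 1)) == (1, 0)    # 1+0+1 = binary 10: distinct results,
                                           # which a lone carry sentinel couldn't show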

I don't really have any firm conclusions here, those are just thoughts.

---

i guess there's a very simple argument to be made for 16-bit addressing:

16 bits is the smallest power of 2 that could possibly comfortably work; 8 bits is only 256 memory locations, which is clearly too small for many very simple applications

---

more evidence for 16-bit addressing, and more citations for which 16-bit or less CPU families were popular/notable:

6502 (8-bit; Apple I, Atari, Apple II, Nintendo Entertainment System, Commodore 64, BBC Micro)
65C02 (8-bit backwards-compatible extension; Apple IIc; Apple enhanced IIe; BBC Master)
65C816 (16-bit backwards-compatible extension; SNES; Apple IIgs)

8080 (various CP/M systems)
Z80 (backwards-compatible with the 8080)
8086, 8088 (16-bit, backwards-compatible ("source" but not binary) with the 8080; IBM PC)
80286 (16-bit, predecessor of the 32-bit 386, which is the foundation of present-day x86)

6800 (8-bit; Altair 680)
6809 (8-bit, backwards-compatible (source-compatible) with the 6800; TRS-80 Color Computer)
68HC11 (8-bit embedded, based on the 6800)
68HC12 (16-bit MCU, mostly backwards-compatible with the 68HC11, 16-bit program counter)
68HC16 (16-bit MCU, based on the 68HC11)

68000 (16-bit; Sun, Mac, Amiga, Atari ST, Sega Genesis, Sinclair QL, early laser printers, various MCUs)

MSP430 (16-bit MCU)

PDP-11

"The Z80 and its derivatives and clones made up one of the most commonly used CPU families of all time, and, along with the MOS Technology 6502 family, dominated the eight-bit microcomputer market from the late 1970s to the mid-1980s." -- http://en.wikipedia.org/w/index.php?title=Zilog_Z80&oldid=634577740

The Z80 ISA was an extension of the Intel 8080 ISA (binary compatible with 8080 code). The Z80 was an 8-bit architecture with 16-bit addressing [6]. It had 8-bit registers which were in pairs, so the pair could be used as a 16-bit register.

" New syntax

Because Intel had a copyright on their assembly mnemonics,[20] a new assembly syntax had to be developed. This time a more systematic approach was used:

    All registers and register pairs are explicitly denoted by their full names
    Parentheses are consistently used to indicate "memory contents at" (indirection, or pointer dereferencing) with the exception of some jump instructions.[21]
    All load and store instructions use the same mnemonic name, LD, for LOAD (a return to the simplistic Datapoint 2200 vocabulary); other common instructions, such as ADD and INC, use the same mnemonic regardless of addressing mode or operand size. This is possible because the operands themselves carry enough information.

These principles made it straightforward to find names and forms for all new Z80 instructions, as well as orthogonalizations of old ones, such as LD BC,(1234).

Apart from naming differences, and despite a certain discrepancy in basic register structure, the Z80 and 8086 syntax are virtually isomorphous for a large portion of instructions. Only quite superficial similarities (such as the word MOV, or the letter X, for extended register) exist between the 8080 and 8086 assembly languages, although 8080 programs can be assembled into 8086 object code using a special assembler or translated to 8086 assembly language by a translator program.[22][23] Instruction set and encoding

The Z80 uses 252 out of the available 256 codes as single byte opcodes ("root instruction"); the four remaining codes are used extensively as opcode prefixes:[24] CB and ED enable extra instructions and DD or FD selects IX+d or IY+d respectively (in some cases without displacement d) in place of HL. This scheme gives the Z80 a large number of permutations of instructions and registers; Zilog categorizes these into 158 different "instruction types", 78 of which are the same as those of the Intel 8080[24] (allowing operation of 8080 programs on a Z80). The Zilog documentation further groups instructions into the following categories:

    8-bit arithmetic and logic operations
    16-bit arithmetic
    8-bit load
    16-bit load
    Bit set, reset, and test
    Call, return, and restart
    Exchange, block transfer, and search
    General purpose arithmetic and CPU control
    Input and output
    Jump
    Rotate and shift

No multiply instruction is available in the original Z80.[25] Different sizes and variants of additions, shifts, and rotates have somewhat differing effects on flags because the flag-influencing properties of the 8080 were copied. Load instructions do not affect the flags (except for the special purpose I and R register loads). "

-- [7]

" The first widely adopted 8-bit microprocessor was the Intel 8080, being used in many hobbyist computers of the late 1970s and early 1980s, often running the CP/M operating system; it had 8-bit data words and 16-bit addresses. The Zilog Z80 (compatible with the 8080) and the Motorola 6800 were also used in similar computers. The Z80 and the MOS Technology 6502 8-bit CPUs were widely used in home computers and second- and third-generation game consoles of the '70s and '80s. Many 8-bit CPUs or microcontrollers are the basis of today's ubiquitous embedded systems. " -- http://en.wikipedia.org/w/index.php?title=8-bit&oldid=628245098

" The address bus is typically a double octet wide (i.e. 16-bit), due to practical and economical considerations. This implies a direct address space of only 64 KB on most 8-bit processors. " -- http://en.wikipedia.org/w/index.php?title=8-bit&oldid=628245098

the http://en.wikipedia.org/wiki/Motorola_6800 also had a 16-bit address bus

wikipedia also lists "12-bit" in http://en.wikipedia.org/w/index.php?title=Template:Computer_architecture_bit_widths&oldid=633925183 but this isn't a power of 2 and isn't as popular as 8-bit and 16-bit

" The IBM System/360 introduced byte-addressable memory with 8-bit bytes, as opposed to bit-addressable or decimal digit-addressable or word-addressable memory, although its general purpose registers were 32 bits wide, and addresses were contained in the lower 24 bits of those addresses." -- http://en.wikipedia.org/w/index.php?title=8-bit&oldid=628245098

4-bit processors often have 10 to 20-bit addressing:

" The first commercial microprocessor was the binary coded decimal (BCD-based) Intel 4004,[2][3] developed for calculator applications in 1971; it had a 4-bit word length, but had 8-bit instructions and 12-bit addresses." -- http://en.wikipedia.org/w/index.php?title=4-bit&oldid=630773377

" The TMS 1000, the world's first single-chip microprocessor, was a 4-bit CPU; it had a Harvard architecture, with an on-chip instruction ROM with 8-bit-wide instructions and an on-chip data RAM with 4-bit words. " -- http://en.wikipedia.org/w/index.php?title=4-bit&oldid=630773377 . It had 1k bytes of ROM (addressed by 10 bits, in a 6-bit registers and a 4-bit register) and 32 bytes of RAM (addressed as 256 bits of RAM by 2 4-bit registers).

" The Saturn processors, used in calculators such as the commonly used HP-48 scientific calculator, are 4-bit machines; as the Intel 4004 did, they string multiple 4-bit words together, e.g. to form a 20-bit memory address, and most of its registers are 64 bits, storing 16 4-bit digits." -- http://en.wikipedia.org/w/index.php?title=4-bit&oldid=630773377

" Very often, when referring to the word size of a modern computer, one is also describing the size of address space on that computer. ... But this does not always hold. Computers often have memory addresses larger or smaller than their word size. For instance, almost all 8-bit processors, such as 6502, supported 16-bit addresses—[citation needed] if not they would have been limited to a mere 256 byte memory. The 16-bit Intel 8088 had only an 8-bit external memory bus on early IBM PCs, and the 16-bit Intel 8086 supported 20-bit addressing,[citation needed] allowing it to access 1 MiB? rather than 64 KiBs? of memory. Popular Intel Pentium processors[which?] since introduction of Physical Address Extensions (PAE) support 36-bit physical addresses,[citation needed] while generally having only a 32-bit word. " -- http://en.wikipedia.org/wiki/Memory_address#Unit_of_address_resolution

" Other notable 16-bit processors include the Intel 8086, the Intel 80286, the WDC 65C816, and the Zilog Z8000. The Intel 8088 was program-compatible with the Intel 8086, and was 16-bit in that its registers were 16 bits long and arithmetic instructions, even though its external bus was 8 bits wide. " -- http://en.wikipedia.org/wiki/16-bit

" The 16/32-bit Motorola 68000 and Intel 386SX

The Motorola 68000 is sometimes called 16-bit because its internal and external data buses were 16 bits wide, however it could be considered a 32-bit processor in that the general purpose registers were 32 bits wide and most arithmetic instructions supported 32-bit arithmetic. The MC68000 was a microcoded processor with three internal 16-bit ALU units. Only 24-bits of the Program Counter were available on original DIP packages, with up to 16 megabytes of addressable RAM. MC68000 software is 32-bit in nature and forward-compatible with other 32-bit processors in the same family.[7] The MC68008 was a version of the 68000 with 8-bit external data path and 1 megabyte addressing. Several Apple Inc. Macintosh models; e.g., LC series, used 32-bit MC68020 and MC68030 processors on a 16-bit data bus to save cost.

Similar analysis applies to Intel's 80286 CPU replacement called the 386SX which is a 32-bit processor with 32-bit ALU and internal 32-bit data paths with a 16-bit external bus and 24-bit addressing of the processor it replaced.

The 68000 processor of the Sega Mega Drive was a highly advertised feature of the video game system. Due to the saturation of this advertising, the 1988-1995 era (fourth generation) of video game consoles is often called the 16-bit era. " -- http://en.wikipedia.org/wiki/16-bit

"The intel 80286 had a 24-bit address bus and was able to address up to 16 MB of RAM, compared to 1 MB for its predecessor. However cost and initial rarity of software using the memory above 1 MB meant that 80286 computers were rarely shipped with more than one megabyte of RAM.[8] Additionally, there was a performance penalty involved in accessing extended memory from real mode, as noted below." -- http://en.wikipedia.org/wiki/Intel_80286

"The protected mode which debuted in the 286 was extended to allow the 386 to address up to 4 GB of memory." -- http://en.wikipedia.org/wiki/Intel_80386

" Offset, a 16 or 32-bit displacement referring to a memory location (using any addressing mode). Pointer, a 16-bit selector together with a 16 or 32 bit offset. " -- http://en.wikipedia.org/wiki/Intel_80386

" Just as in the 80386, the 32-bit offset registers (x86-terminology for normal CPU registers) allowed a simple flat 4 GB memory model, by setting all segment registers to zero. " -- http://en.wikipedia.org/wiki/Intel_80486

---

with 3-operand assembly instructions, each instruction is like a node in a graph with 3 directed arcs: two incoming and one outgoing, such that the value of the outgoing arc is fixed by the node type (the opcode/opcode label; note: this suggests that 'type' may be a special kind of annotation, perhaps separate from 'label') and the values on the incoming arcs.

When all values are binary, this is a boolean logic gate. How many such functions are there? There are 2*2 possible incoming value patterns, and 2 possible outgoing value patterns to be chosen for each input, so 2^(2*2) = 16 possible binary logic gate types.
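(that count is easy to check mechanically:)

    from itertools import product

    # a 2-input gate is one choice of output bit per input pattern: a
    # truth table of length 4, so 2^(2*2) = 16 distinct gates
    gates = [dict(zip(product((0, 1), repeat=2), outs))
             for outs in product((0, 1), repeat=4)]
    assert len(gates) == 16
    AND = gates[0b0001]                    # truth table 0001 is the AND gate
    assert AND[(1, 1)] == 1 and AND[(1, 0)] == 0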

if we allow the node to also store a state (a single bit of memory), then there are (2*2)^(2*2*2) = 65536 possible stateful binary logic gate types.

i suppose another fundamental computation model is one in which the arcs are undirected, e.g. the value of every arc depends on the value of every other. I suppose this is like a cellular automata; at each step, the node's state updates, depending on what its previous state was and also on the values at all 3 arcs. However, unlike cellular automata, the gates are not arranged in a regular low-dimensional lattice, but rather in an arbitrary topology.

If the node didn't have any state separate from its inputs (which are all 3 arcs, now), then we'd have 2*2*2 output patterns to assign to each of the 2*2*2 input patterns, so (2*2*2)^(2*2*2) = a very big number (8^8 = 16,777,216) of gate types. If there were an internal state, then there are (2*2*2*2)^(2*2*2*2) = a very big number (16^16) of gate types. If we worked in trinary rather than binary, with no internal state, that's (3*3*3)^(3*3*3) = a very big number (27^27) of gate types.

Another model: we could restrict this to reversible computation. This is like the previous model, except that we demand that, given the values for all three arcs of a gate, and its internal state (if any), as of time t, we should be able to deduce what all three values and the internal state were at time t-1. That is, the mapping from time t-1 to time t must be invertible and therefore injective. This is quite restrictive; i think it means that the mapping is just a permutation. I guess this isn't a model for computation in general, but it is an important special case.

---

so some paradigms to be respected/used:

---

need to find a general way to do things like applying "map" to another function, then doing it

e.g. how to represent the following in Oot assembly:

map (+1) list

---

i guess if we have 16-bit opcodes, and this refers to the same memory space as the operands, then this is the key (to applying higher-order functions). We can start with a function, then transform it by operating on it with a higher-order function, and put the resulting function somewhere in memory, then apply it simply by issuing an instruction whose opcode is the memory storage location of the transformed function.

i guess this only works for functions that take no more than 2 inputs and give no more than 1 output, but that's ok.

we also need a way to do partial function application for these, so that we can eg be given f, and then say "g = map f", even if 'map' is really a function that takes two arguments. i suggest either a flag or a sentinel operand value. If we are having 128-bit instructions then there are plenty of flags.
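(a sketch of what dispatch might look like; the builtin table, the sentinel, and letting operands hold values directly are all simplifications for illustration:)

    NO_ARG   = 0xFFFF                         # hypothetical sentinel operand value
    BUILTINS = {0x01: lambda a, b: a + b}     # ... up to some cutoff

    def execute(opcode, op1, op2, mem):
        # below the cutoff, opcode names a builtin; above it, the opcode IS
        # the memory location of a (possibly transformed) function value
        f = BUILTINS[opcode] if opcode in BUILTINS else mem[opcode]
        if op2 == NO_ARG:
            return lambda y: f(op1, y)        # partial application, eg "g = map f"
        return f(op1, op2)

    mem = {0x1000: lambda g, xs: [g(x) for x in xs]}   # a 'map' living at 0x1000
    inc = lambda x: x + 1
    mapinc = execute(0x1000, inc, NO_ARG, mem)         # g = map inc
    assert mapinc([1, 2, 3]) == [2, 3, 4]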

---

things with more than 2 args can just pass a pointer to the argstruct

---

note: the 64k constant table is per-module

note: code can't fit in 64k. Code involves jumping/branching to many destinations, probably (at the assembly level) more than 64k in larger programs/modules. Perhaps one solution is:

---

if spare bits are provided in the instruction encoding for 'custom addressing modes' or 'custom instructions'/'fast functions', some platform extensions might go ahead and fix the definition of these, causing programs that actually have their own custom addr mode or instruction tables to malfunction. One solution is to double the number of spare bits, e.g. one addr mode bit for platform use, and one addr mode bit for per-program (dynamic) use. But that's a lot of bits (6 mostly unused bits per instruction; and if the opcode similarly has 1 spare bit for platform use and one for per-program/dynamic use, that's 2 more, for a total of 8 bits, leaving only 8 bits for predefined opcodes). Another is to define all of these as per-platform, and then define a special 'platform emulation platform' that allows redefinition; this has the benefit that support for the complex custom addr mode remapping stuff is not part of std oot assembly, and so doesn't have to be supported by most implementations. This requires only 4 reserved bits per instruction (3 addr mode, and 1 opcode; but we can just say '4 reserved bits' and let the platform use those as it will); this leaves us 12 non-reserved bits in the first of the 4 fields, of which 6 are taken up by addressing mode bits, leaving 6 bits for opcodes, or 64 opcodes. This is the tightest we would contemplate, as even the Lua VM has over 32 opcodes, and for the bloated x86 ISA i elsewhere counted about 678 mnemonics. Then there is a notion of a 'std oot program', which is one that doesn't use any platform-specific functionality.

i still sometimes flirt with the idea of having tagged memory, e.g. having the last 3 fields (and every memory loc) split into 12 bits or so of data, and 4 bits or so of metadata, and perhaps the first field, too, if we want to use a standard addressing mode on it. Or tagged pointers (descriptors?). This would allow nested lookups. My arguments against this were (a) efficiency loss, as implementations will natively deal with 8-bit and 16-bit quantities but not e.g. 12-bit ones, so each of the 4 fields must now be 'decoded' into an e.g. 12-bit subfield and a 4-bit subfield, (b) implementation complexity.

(b) is negated if the same or similar scheme is used identically on all 4 fields, since we have to split up the first field in any case.

if we could restrain ourselves to only 2 addr mode bits per field, then we could stuff them all into 8 bits in the first field, negating (a) and leaving space for 8-bit opcodes (256). This doesn't leave us with any reserved bits, however, nor does it allow for nested lookups.

using std addressing mode lookup on the first field would have various benefits:

also, we should probably fix which memory locations will be mapped into normal memory, and which are 'special memory' (e.g. stacks etc), even if most of the locs in special memory are currently reserved. The above suggests that we reserve the first 256 locs as 'special', although many of these are ordinary 'registers'.

---

we probably want a mechanism to attach __GET to memory locations. The obvious mechanism would be tagged memory. Another obvious one is to only recognize __GET when using special 'fetch' opcodes. Another one is to use the virtual MMU functionality to trap reads from certain memory locs. The first and third of these solutions impose an overhead on all or many memory accesses.

---

should we provide MMU functionality?

consists of: a page table, possibly with a 'nested page table' structure (for virtualization). do we want to provide for lookup via handler for some entries in the table, or just specify the page table data structure? (even if there is a handler, we can assume that 'hardware' will cache into a TLB instead of calling the handler each time). potential per-page metadata in the table: no_execute, dirty, cachable, memory concurrency guarantees (eg write-combining). if we have nested page tables, do we want to in addition allow some nodes to trap into a handler via Oot __GET?

benefits of an MMU: this would seem to be needed for efficient emulation.

costs: this would seem to impose an overhead on every memory read or write.

perhaps the program could specify that all memory locations below (or would above be better?) a certain value are 'real memory' (except for the 'special' low memory, in the case that we choose 'below' earlier in this sentence), and don't have to be looked up via the MMU? This would allow programs to opt out of most of the virtual MMU overhead (there may still be the small overhead of the extra compare and branch to determine whether it's real memory, if we choose 'below'), or to accept the overhead for only some memory accesses.
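(a sketch of that opt-out: one compare decides whether an address goes through the page table at all; the page size, threshold, and entry fields are illustrative:)

    PAGE = 4096
    REAL_MEM_LIMIT = 1 << 20     # program-declared: below this, 'real memory'

    def translate(addr, page_table):
        if addr < REAL_MEM_LIMIT:
            return addr                        # opt-out: just a compare + branch
        entry = page_table[addr // PAGE]       # 'hardware' would cache this in a TLB
        if 'handler' in entry:                 # optional trap, a la Oot __GET
            return entry['handler'](addr)
        # per-page metadata (no_execute, dirty, cachable, ...) would be checked here
        return entry['frame'] * PAGE + addr % PAGE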

---

on tagged pointers:

the info table thing makes me think, hey, instead of trying to stuff everything into tagged pointers, perhaps just support something like an info table at the Oot Assembly level and let the issue of what information is stored with the pointer and what isn't be a platform-dependent implementation issue

todo toread:

---

to make room for one more addressing mode, we could combine immediate and constant table immediate, saying that immediate only goes to 32k, and a value above 32k is a constant table immediate reference. But do register and register indirect really need 32k more addressability than immediate and constant table? Why not just take away a bit uniformly if we're going to do that?

i suppose really we only need one addressing mode, register; immediate, constant-table immediate, and register indirect could be replaced by special instructions (loadi, loadci, load/store). This would create a lot of unnecessary register traffic, however.

y'know, even with only 4 addressing modes, we won't be doing many 'fast functions' if we allow addressing in the first field, because that's already 2*4 = 8 addr mode bits, so only 8 bits are left for the opcode 'operand'

which makes me seriously want to consider doing everything in 12 bits, not 16, in which case we'd have 8 bits left over. ... but 4k might not be enough to cover all variables and functions likely to appear in a module? and we might need more than 4k constants to cover all of the jumps over 4k likely to appear in a module?

otoh Lua has a stack size of 256, a limit of 200 locals, and 60 upvals: http://luaforge.net/docman/83/98/ANoFrillsIntroToLua51VMInstructions.pdf . and we could always have jump and branch instructions that combine the two operands, giving 24-bit jumps instead of 12. in fact, we could also have a "load_immediate_constant" instruction that combines the two operands, allowing access to a 24-bit constant table! which is a lot.
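(ie, with 12-bit operands:)

    def combine(op1_12, op2_12):
        # jump/branch or load_immediate_constant with both operand fields
        # fused into one 24-bit value
        return (op1_12 << 12) | op2_12

    assert combine(0xFFF, 0xFFF) == (1 << 24) - 1   # 16M, which is a lot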

could always increase the instruction size over 64-bits. But the next obvious thing to do is double it, and that's some big instructions! and we don't really need another 64 bits, we only need another 8 or 16 or so. But 72 or 80-bit instructions? i dunno if it's a good idea for the instruction size to not be a power of 2.

so this is looking attractive. But what about the inefficiency of unpacking every operand? well, Lua does it, so it can't be that bad.

otoh 'nm /usr/lib/x86_64-linux-gnu/libc.a | wc' gives about 16k symbols already (as does nm /usr/lib/x86_64-linux-gnu/libc.a | grep -v '^$' | grep -v '.o:' | wc); nm /usr/lib/x86_64-linux-gnu/libc.a | grep -v '^$' | grep -v '.o:' | grep -v '_' | wc gives about 2.8k symbols

---

in addition to the commands that squash the two operands together, could also have an LEA-like command that evaluates the first operand, then indexes into it using the value of the second operand on that base.

so, even if the operands were limited to 4k in size, if each module only exported fewer than 4k functions, and if there were fewer than 4k modules total imported into the program (including submodules of imported modules), then without analyzing stuff, you could use this instruction to load a reference to any accessible function. if there were more than 4k total imported/accessible modules, you could still load each of them into a 24-bit constant table (it's unlikely there would be more than 64k accessible modules).

and with analysis, you could identify the 2k or so most used variables and 0-ary or unary or binary functions, and load references to them into the constant table

(the way Lua does it is to load function references into registers with GETGLOBAL, and then call a function reference in a register with CALL; GETGLOBAL implicitly uses LOADK to load from the constant table, and then looks up the loaded string in the globals table; GETGLOBAL has two operands, an 18-bit global identifier and the destination register; so this would be like the two-step procedure given above as a fallback: load a reference to the function via the 24-bit constant table into a 'register' (a local variable, of which we have 4k), then execute it from the 'registers')

---

after sleeping on it, i still like the 12-bit thing. While a little icky, the efficiency cost of shuffling a few bits for each field for each instruction is probably relatively small; it's not like it's an additional memory access or anything. Increasing the instruction size beyond 64 bits might be an appreciable cost, as (a) 64-bit machines, which are currently standard, are probably great at fetching 64-bit words, and they might be slower at fetching, say, 72-bit words (b) 64-bit (8-byte) instructions are already 'bloated' and just for the sake of having a design constraint it seems like it would be good to be reluctant to push that further (c) i don't understand the cost, if any, of having fixed non-power-of-two instruction widths.

The 4k limits will not catch everything, as 64k would, but it'll work for the vast majority of cases (the 80/20 rule, but probably more like 99% than 80%).

I like being able to address the first field, and to use this for single-instruction calls to first-class functions with no more than 2 inputs. This should let many Haskell-ish algorithms be expressed in a natural way in Oot assembly. Because sometimes we'll hit the 4k limits, sometimes we'll have to fallback to multistep function calls, but at least this will be less common.

I REALLY like that i have a spare 2 bits per-field now. Whether laziness or error-handling or 'execute this' addressing modes or custom addressing modes or nested indirection or emulation or reversible computing or quantum computing or something else (somehow combining combinatory logic, lambda calculus, logic programming, and turing machines?), i bet i'll find some use for these.

---

also, one note: we are back to having a spare bit, or at least two spare addressing modes, in the destination operand, because there's no need for immediate and constant addressing modes there

(except: we do need a way to specify that the return argument is discarded)

---

interesting discussion of an ISA proposal:

https://fail0verflow.com/blog/2012/dcpu-16-review.html

the only thing i really learned from it that matters here (that i didn't already know) is the need for "instructions that add in (the overflow register) in one go, much like “add with carry” and “subtract with carry” in other CPUs. This allows for much more compact code that can add/sub/shift wider ints at one instruction per word."

---

otoh if we WANTED to find an excuse to exceed 64-bit instruction, we could have a go at my 5-based proposal.

So now if each field's payload is x bits, we have 5 fields, for 5*x bits so far, and if each field's addressing modes and other per-field metadata were y bits, we'd have

5*(x+y)

so far. Plus we'd probably want z bits of overall flags, so 5*(x+y) + z. So if x = 16, y = 4, z = 16, we already have:

116 bits

which is only 12 bits from 128 bits.

if x = 12, y = 4, z = 12, we'd have:

92 bits

which is 4 bits from 96 (a multiple of 12), and 36 bits from 128.

---

we could indicate a throwaway result by writing it to memory location (register) #0. This could be a 'zero register'.

---

what would 'the fifth element' in the 5-element system be?

tenses, whether or not the statement is 'hypothetical'/quoted?

which turing-equivalent machine type the instruction is written for?

which adjacent instructions the instruction is connected to (branching, conditional execution, or some sort of weird neural connectionist/graph edges thing)?

reified graph or hypergraph stuff or cons cells or parentheses?

'looking outward' to represent position within an expression tree (instead of explicit stack ops etc to represent a computation path through an expression tree linearly) (perhaps either or both of absolute identifiers or offsets to adjacent nodes)?

annotation (type or name or (provenance or probability or trust) or otherwise)?

annotation of runtime information/runtime scratchpad, such as whether the result of this node has been memoized and if so, where?

oot views?

'levels'?

whether addresses are to follow bindings and __GETs or to act directly/concretely on their stated target even if it's a symlink?

whether or not metaprogramming overrides of instructions are to be respected?

something else regarding layers of indirection?

which language the instruction is written in?

data flow routing between operands or between instructions?

concurrency/parallelization stuff?

mutable state domains, state versions, 'worlds'?

transactions this instruction is inside?

list of 'code regions' this instruction is inside / list of metaprogramming directives it's subject to?

error handling/'backwards execution'?

'receptors'/component architecture?

instruction encoding format?

more than one of the above?

body of assumed knowledge (logic programming) (like state version?), or mb body of permitted inference rules?

memory ordering consistency model, reordering time (relation to state version) ?

primary key (annotation; redundant in that memory loc already provides a primary key)? could memory loc already be an implicit 5th field? perhaps...

(note: this is mainly in the context of possibly using an extra field for this, but also in the context of possibly using the spare addr mode bits for it)

---

my current thought is that the best 'fifth field' would be a primary key/unique identifier for this statement (a form of 'meta'), so that arbitrary annotations could be attached to it elsewhere; however, as noted above, the memory location at which the statement is found serves as an implicit primary key, so we don't really need an explicit 5th field to have this in every instruction.

so, for now let's stick with 4 fields.

which means we can fit it in 64 bits, which means we should. so if we want the first field to be more than 8 bits, so that we can practically use 'fast functions' as such and use memory-to-memory first-class functions with some headroom for lots of in-scope variables, and we also want at least 4 addr modes, then we should use the 12/12/12/12 bit format, not the 8/16/16/16 bit format. so we have 9 spare bits (2*4 + 1): 2 spare addr-mode-ish bits per field, plus one spare bit for the destination field, because both the immediate and the constant addr modes are useless for that one (really 3 spare bits for the destination field, if you assume the addr mode bits must be used the same way across all 4 fields).

i feel joy looking forward to getting to put those spare bits to use

---

some ideas on using these:

well, any sort of per-field (per-argument) annotation

lazy/strict

quoted symlink vs follow the symlink implicitly; or similarly, quoted __get and __set vs access the field directly

parallelism annotations, eg 'it's worth it to spawn a thread to do this sub-computation'

'execute what you find there and give me the result' (orthogonal) address mode

async annotation, eg 'start computing this argument now and sleep this thread until it's done' (or if async is the default, blocking annotation)

error handling mode, eg 'return the null in a nullable type if there's an error, vs. raise an exception if there's an error'

copy-on-write vs. shallow-copy (alias) a reference

stack addressing modes (to make it easy for the reader to 'paste together' various linear instructions into an expression tree, eliminating the intermediate output registers)

16 address modes instead of 4

take the addr of (the '&' prefix in C) (the opposite of the register indirect addr mode, which is the '*' prefix in C)

note that the above are just shorthand for doing things that you could do by pre-processing the argument or post-processing the result (in some cases 'by default', eg if you want __gets and __sets and/or async and/or strictify to be the default): eg strictify (if lazy is the default) or thunkify (if strict is the default) with a separate instruction as a preprocessing step, taking from a register and storing back into that register; similarly, follow a symlink (if that's not the default) or fetch the quoted (potential) symlink (if following is the default); execute the function you find in the register; launch a parallel or async process; process errors in the argument; COW or shallow copy the data; etc. That's okay because aside from 'register', the other 3 of our 4 core addressing modes are also like that: eg instead of using an immediate addr mode you could precede this instr with a LOADI instruction, and instead of register indirect addressing mode you could precede this instruction with an LEA instruction. But we're (tentatively) allowing those 3 addressing modes because they reduce the need for very temporary, statically introduced/eliminated variables (registers), which make the code harder to read, and which require a dataflow analysis to precede various optimizations. So the above are the same idea. (otoh maybe we actually should get rid of those 3 addr modes! the large number of things in the above list suggests that we won't be able to really get rid of such temporary variables anyways, so if the reader/automatic code analyzer already has to do dataflow/reconstruct an expression tree in any case, why bother to minimize temporary variables at all?)

others ideas that aren't like that (or at least, might not be like that):

quote/antiquote inside macros (i think what i mean here is quasiquote, unquote, splice: see http://www.gnu.org/software/mit-scheme/documentation/mit-scheme-ref/Quoting.html ; note that a quasiquote is just a quote that recognizes unquote and splice inside of it)

implicit map, eg 'if this is a list, implicitly map this over the list and return a list, vs. operate on the whole list'

provide cardinal directions for how to locate this instruction vs. its arguments in space, e.g. 'this argument is South of this instruction'

(or, 'input' vs 'output' as a direction, whatever that means); mb for asking questions, ie inverses: "C = ADD(A,B)" is really "? = ADD(A,B)", but you could also ask "C = ADD(A,?)", which is equivalent to subtraction (see the sketch just after this list)

something about the 'reverse computation' mode of bubbling up exceptions, etc

something about locking, coloring arguments the same way if they need to be synchronized, or ordering argument evaluation (or saying they can be unordered) or something like that

even if things are async by default, whether the current instruction should begin executing immediately and immediately return a promise, or should be encapsulated in a thunk and lazily wait until its result is needed (a third alternative, which is what normally happens in ordinary, synchronous languages, would be to block until the current instruction completes; but this can be achieved by doing either of the above and then afterwards executing a special block/strictify instruction on the result location)

custom addr mode/instruction wrapper

language/universal machine paradigm selector

as a generic boundary/mark used by other metaprogramming features
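
as a toy illustration of the 'asking questions'/inverses idea from the list above, here's a minimal sketch (Python; the function name and calling convention are invented) of treating a missing argument as the thing to solve for:

    # hedged sketch: 'C = ADD(A,?)' as solving for the hole
    def add_query(c=None, a=None, b=None):
        # exactly one argument should be None (the '?'); return its value
        if c is None:
            return a + b    # ? = ADD(A,B): ordinary addition
        if b is None:
            return c - a    # C = ADD(A,?): equivalent to subtraction
        if a is None:
            return c - b    # C = ADD(?,B): also subtraction
        raise ValueError('no hole to solve for')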

in general one could look at these either as arbitrary colored 'boundaries' in the instruction graph, or as a modification to the way that the operand is interpreted that changes the destination of the edge we draw between the node representing this instruction and the node representing its argument.

note: the above ideas also give some ideas to the sorts of 'defaults' that we want in the oot library ecosystem; things like laziness, __get and __set by default, symlink-ish binding, immutable data, async by default, inverses; these are things that are 'just' syntactic sugar for common patterns, but that you want to be in the language in order to encourage the library ecosystem to default to them

---

in the previous, we think again about the comparison between a linear language representing an expression by writing intermediate results to registers (or to the stack), vs an expression tree. Consider the idea of annotations being cardinal spatial directions; really, the registers or stack locations form 'directions' in that, if one is just doing an expression evaluation, the expression tree can be reconstructed by pasting the bits of code together that read/write to the same register (or stack location); eg "return 1 + (2*3)" could be "t = 2*3; return 1 + t", and you can reconstruct the original tree by substituting in t.

When it doesn't work like that is when the intermediate variable will be used multiple times (hmm, echoes of linear logic?), or when its evaluation causes side-effects; "t = 2*3; return 1 + t + 2*t" is more efficient than "return 1 + (2*3) + 2*(2*3)".

Which suggests another use for annotations: to say 'this variable is just a one-time temporary variable and you can throw it away and reconstruct the expression tree'. Note that 'strict stacks' with only PUSH and POP and no peeking enforce the same constraint, given the additional constraint that each expression is only permitted to add exactly one thing to the stack (i think... i haven't thought it through, that might not be true). This leads to the interesting idea of having a special stack register, or addressing mode, that allows the compiler to make its intent clear when it is just creating a linearization of an expression tree.

Related to this is the idea of having a special stack register or addressing mode upon which a JVM-esque guarantee of stack layout static predictability is enforced.

---

out of the above, the only ideas that jump out at me as almost a must-have are:

---

more ideas for using those 3 bits of flags:

atomicity

uniqueness typing annotations

reference counting

boxing (boxed vs. unboxed values)

something relating to capabilities

whether or not runtime type checking is needed

whether or not runtime bounds checking is needed

'who' is running this code (and what to do/who to notify if there is an error, eg what is the 'robustness paradigm' (is it 'continue going no matter what' or 'if the correct answer is uncertain, refuse the temptation to guess'? do we need to run this operation multiple times and compare results to be sure?))

evaluation strategy (call by name, call by need, call by value, call by reference, etc)

variable sigil annotations, eg to mark some variable within a probability expectation expression as a random variable

hmm, i guess this adds some color to the idea of switching computing modes between turing-equivalent paradigms; eg maybe in a Turing mode "C = ADD(X,Y)" is an imperative (with one-directional flow of information, and where X and Y and C are registers), whereas in a Logic mode it's a fact, where X and Y and C are just 'variables'. so could have imperative vs logic vs quantum/fixpoint-of-gates vs combinatorial/callByName vs matching graph structures (SATisfaction?) (= logic? = assertions?). eg, C = ADD(X,Y):

imperative: "take X and Y and add them and put the result into C"

logic/satisfaction/matching/assertion: "assume that it is a true fact that C = ADD(X,Y)" or maybe "put the fact ADD(X,Y) into C"

quantum/fixpoint-of-gates: "assume that it is a true fact that C = ADD(X,Y), where this is a dataflow that is being continuously re-evaluated"

combinatorial/callByName: "take the program code X and the program code Y, combine them under an ADD node, and put the result into C"

mb characterize in terms of time? something is: statically universally true (a truth), something that the computer should do now, something that is ongoing/recurring, something that has been done, something that can be done whenever, etc
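
to make the paradigm-mode idea a little more concrete, here's a minimal sketch (Python; the mode names and data structures are just stand-ins for whatever the real mechanism would be) of the same instruction being read under three of the modes above:

    # hedged sketch: C = ADD(X,Y) under different paradigm modes
    def interpret(mode, c, x, y, regs, facts):
        if mode == 'imperative':
            regs[c] = regs[x] + regs[y]   # do it now; one-directional flow into C
        elif mode == 'logic':
            facts.add(('ADD', x, y, c))   # assert a relation; no direction of flow
        elif mode == 'callByName':
            regs[c] = ('ADD', x, y)       # combine program code under an ADD node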

---

CNOT "the controlled NOT gate which XORs the first bit to the second bit and leaves the first bit unchanged" CCNOT "It has 3-bit inputs and outputs; if the first two bits are set, it inverts the third bit, otherwise all bits stay the same." and CSWAP the three-bit gate that swaps the last two bits if the first bit is 1.", two universal gates for classical computation that are notably generalized to quantum gates http://en.wikipedia.org/wiki/Quantum_gate

---

note: if we want to allow 3-element 'gates' like CCNOT and CSWAP, then i guess we still CAN use the special addr semantics for the dest argument, since in these gates the dest is always still potentially written to, it's just that sometimes it's also read from, and sometimes other addrs are also written to

note: i don't quite see how quantum circuits could be analogous to classical circuits; if you have a gate which swaps two of its operands, then that gate can't be continually active, can it, because in that case if you feed in a "1" and a "0" those'll be swapped infinitely fast. hmm.. but maybe that's the point; then we end up saying that the state of those nodes becomes a superposition of 0 and 1; i guess we're looking for a fixpoint for the circuit as a whole.

---

i guess boxed/unboxed might be another almost-'must have'

immediate mode doesn't make sense with the destination operand, so i was thinking of using that bit for something else, such as whether to start executing the current instruction now, or wait until it is needed.

but boxed/unboxed also doesn't make sense with immediate mode, so maybe i should combine one of the other bits into a 3-bit addr mode that includes boxed/unboxed-ness. but then the dest operand wouldn't have an 'immediate mode' bit free.

also, the maybe/option/exception tower

perhaps the maybe/option/exception tower could be combined with 'boxed' and with the 'indirection flag'

could mb want a per-operand bit for a synchronization/atomicity; eg each instruction guarantees atomicity with respect to those operands which have their synchronization/atomicity flag set

but maybe a per-instruction 'lock' prefix instruction (plus lock annotations for multi-instruction sections) are good enough; per-operand granularity might not be needed very badly

we could also have a 'threadlocal' bit. But this is just like saying that half of the memory, by address, is threadlocal, and adding one bit to the memory address width.

do we want to guarantee that some memory locations are threadlocal? my instinct is to guarantee the first 256 locs, or some portion thereof, and to reserve some of those for ordinary scratch use. this is closer and closer to real registers. if we have register-like things, we need at least 32 of them. and probably at least 4x32 and a 'register window' thingee so a supervisor process can swap them out easily while swapping threads.

also, although it's unnecessary, we may want a displacement addr mode, as it's supposedly very common (see Self:proj-plbook-plChAssemblyFrequentInstructions)

---

should memory location '0' be reserved to represent an undefined pointer, and cause an error when read or written to? or should it represent 'discard' when writing to it, and 0 when reading from it?

https://en.wikipedia.org/wiki/Tagged_pointer#Null_versus_aligned_pointer and https://en.wikipedia.org/wiki/Tagged_pointer#Disadvantages allege that treating memory location 0 as an invalid pointer is extremely common, to the point of being a standard

---

apparently on the VAX, using an immediate addr mode in the dest operand caused self-modifying code! [8] do we want that?

---

so some 'registers' we want are:

0 reg

accumulator (for when more than 2 inputs or 1 output is needed)

status flags

pc

4 stacks

do we need threadlocal regs? do we need more kinds of sentinel/invalid pointers?

---

some more support for using 12-bit addrs: in the ARM ISA, page size starts at 4k and goes up to 16MB 'super-sections'; this is 12 bits to 24 bits [9]

---

this page is interesting: http://esolangs.org/wiki/User:Ian/Computer_architectures

---

i guess if you want more addressing modes, you can use one addr mode bit to mean 'split the operand in two, apply register addressing to one of them, and apply the specified addr mode to the other, and add the results'

---

in that case i think we'd have a complete system:

4 fields. First the destination/output field, then the opcode field, then 2 input fields (some instructions may combine some of these fields).

for each field:

first 4 addr mode bits, then 12 operand bits

for each of the first 3 fields:

there are four basic addr modes: immediate, constant immediate, register, and register indirect. Two bits are used for this; the first bit says whether or not the mode is one of the two immediate modes, and the second bit selects within the pair.

there is one bit that indicates whether or not the data item is 'boxed'. This may as well just be a 'custom addr mode' bit.

there is one bit that indicates whether or not the operand should be 'split' into two 6-bit pieces.

for the fourth, destination field:

same as above, except that there are no immediate or constant immediate modes, only register and register indirect. The bit that was used to indicate immediate (which is the first bit in every instruction) is put to another use. Probably one of:

the first choice: 'whether or not the two input operands are combined'; this is redundant, but allows the decoder to immediately begin decoding the opcode and input operands in parallel without wasting work (the wasted work would be decoding each input operand assuming they are separate, and also decoding them together, and then choosing which result to use based on the opcode later). i guess avoiding that wasted work isn't terribly important, so maybe that would be a bad choice?

we can generalize this into a 'custom instruction operands format' bit. In other words, the opcode field is interpreted normally, but the instruction gets to interpret the other 3 fields (48 bits) however it wants. This allows user-specified functions to implement things like a packed representation of 4 no-operand instructions, etc. Which in turn means that Oot itself may as well have such things (but this opens up a small can of worms, namely, we'd really like some bits to specify how to link multiple instructions together with stacks, etc., but we don't appear to have any left; we could try to use this 'custom instruction operands format' itself to indicate that this 64-bit instruction is really 4 separate instructions, but then the second instruction couldn't be immediate/constant mode, which sucks)

if we DIDN'T generalize it into a 'custom instruction operands format' bit, and instead used it as a 'packed encoding' bit, we could pack 2 instructions with 4-bit instead of 12-bit operands into 64 bits (8 x (4-bit addr + 4-bit operand)) (and we'd still have a bit free from the immediate-mode destination in the second packed instruction) (note that only the first 16 opcodes are accessible in this way, though; although we could make up a more elaborate 'packed' format to enable more opcode bits there; eg if we cannibalize the destination field of the first packed instruction, implicitly pushing it to the first stack, then we gain 8 bits and can use that to make both opcode fields 4+4=8 bits; however this can't quite work, because the first destination field is where the bit we were using lives; so place the packed instructions backwards, and now the bit we used is in the second instruction's destination field, where it is okay; now we have 2 instructions with 8-bit opcodes (256 possible opcodes) and 4-bit operands; that seems pretty useful, actually, i bet a large percentage of code would fit into that; and a smart compiler or interpreter could optimize away the stack push in the first instruction if the second instruction pops from the same stack).

But that makes things more efficient without adding expressivity.

or we could use it for autoincrement. But that also makes things more efficient without adding expressivity.

or we could use it to 'comment out' this instruction, and/or to use it as an annotation rather than an instruction. This could be important because currently the opcode field is first resolved using its addressing mode bits; so if annotation were just a range of opcodes, that would force us to do resolution even on annotations, and it would also leave the awkward possibility that an innocent-looking instruction could be an annotation depending on the dynamic state of registers (that might be useful though for some annotations like transactions :) but seriously, it breaks the idea of annotations as static).

so far, annotation sounds like the best bet. in which case, yes, we'd want to move that field to the front so that this would be the first bit.

heck, let's keep it undecided for now, leaning towards annotation.
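
as a concreteness check, here's a minimal sketch (Python) of pulling the 4 fields out of one 64-bit instruction; the field order and the bit layout within the addr-mode nibble are my assumptions from the description above, and are still subject to the bit-shuffling discussed later:

    # hedged sketch: 64-bit instruction = 4 x (4 addr-mode bits + 12 operand bits)
    # assumed field order: destination, opcode, input1, input2
    def decode(instr64):
        fields = []
        for i in range(4):
            f = (instr64 >> (48 - 16 * i)) & 0xFFFF
            mode = f >> 12            # the 4 addr-mode bits
            operand = f & 0xFFF       # the 12 operand bits
            fields.append((mode, operand))
        return fields                 # [(mode, operand), ...] x 4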

---

TAL

TALT (Toward a Foundational Typed Assembly Language, Karl Crary)

TALx86: A Realistic Typed Assembly Language

https://www.google.com/search?q=eli+gottlieb+programming+language&oe=utf-8&oq=eli+gottlieb+programming+language&gs_l=mobile-heirloom-serp.3..30i10.13666.16574.0.16726.23.11.6.0.0.1.313.1539.3j4j2j1.10.0.msedr...0...1c.1.34.mobile-heirloom-serp..13.10.681.tzB73Id3HX8

https://www.google.com/search?q=%22functional+assembly+language%22&oe=utf-8&oq=%22functional+assembly+language%22&gs_l=mobile-heirloom-serp.3..0i30j0i8i30.2229.4766.0.5044.2.2.0.0.0.0.281.443.0j1j1.2.0.msedr...0...1c.1.34.mobile-heirloom-serp..0.2.443.-xG-OiOU5_E

https://www.google.com/search?q=%22lazy+assembly+language%22&oe=utf-8&oq=%22lazy+assembly+language%22&gs_l=mobile-heirloom-serp.3...13629.16885.0.18865.14.12.0.0.0.1.328.1830.4j5j2j1.12.0.msedr...0...1c.1.34.mobile-heirloom-serp..7.7.988.mgxYJ5ga8lc

https://github.com/ppedemon/Bluejelly

---

so how are the fixed opcodes specified? i guess they would be immediates. But then how are fast functions specified? they should be immediates, too, right? so is the convention that, for any given module, the meaning of all 4096 immediates is defined in a fixed way for the whole module?

or is the convention that an immediate just actually refers to one of the first 4096 memory locations, and you execute the fn stored in that location?

in the latter case, mb put fixed instructions at the beginning of the constant table, rather than immediate, to free up low registers? in that case, there may be a problem for split constant addr mode in operands, because taking one of these addresses and then adding something else is meaningless. or, define the opcode field to subtract 256, and then use the first 256 opcodes as fixed instructions, and the rest as references to whatever function is in register r - 256 (now the high 256 regs are inaccessible as fast functions). and/or is immediate mode signed? if so, could just reserve half the addr space, and cram the stdlib in there

hmm, this seems too complex. What if we use the first convention: each module somehow fixes the definition of the 4096 'immediate' functions. Using the convention that a 'function' is specified by an entry address, some of these 4096 table entries could actually be pointers to lookup tables, allowing the split addressing modes to be useful (e.g. if the opcode field contains 'unboxed, split, immediate: 293, 3' that means 'unboxed: immediate 293 + register 3', which means 'to get the entry point of the function being called, take the per-module-fixed address #293 and add the contents of register 3 to it'; in other words, the per-module fixed address table will contain, at entry 293, a pointer to a lookup table, and register 3 contains an index into this table; the contents of the lookup table are pointers).

hmm. The dumb part of that is that now immediate mode is kind of redundant, because constant immediate mode is also some lookup table. Also, this cripples the use of split addressing; what if we want a constant index 293 into the table whose base address is in register 3? We'd like 'split, immediate: 293, 3' to mean that, not what we just said.

I guess what is simplest is neither of these two systems, but rather: 'a reference to a 'function' is a pointer to the entry point of that function; therefore, executing function 'immediate 293' means 'JMP 293', e.g. address 293 should contain the first instruction of the function, 294 should contain the second instruction, etc'. So in that case, the way to get to the fixed instructions would be the constant immediate addressing mode.

If we're worried about cluttering the low constant table (preventing the first 64 entries, which is all we get for the split addressing in constant immediate mode, from being used for anything useful), we could either say that there are two constant tables, one for the opcode field and one for the other fields, or we could say that it's the high constant table, not the low, that we reserve for fixed instructions. Only the latter of these allows the low constant table to be used by the program to create 64 constant lookup tables that can be used for register-selected fast functions (admittedly, this is probably a weird special case, but maybe it would be good for emulation?), so maybe the latter is better. In other words, reserve the last 2048 entries in the constant table, allowing the program to have only 2048 of them. This is making our 12-bit space look even more crowded, but i think we can handle it (if we want more than 2048 constants, we can load an immediate into a register, then use that register as an index into a table whose base address is in the constant table; giving us 2048*4096 total entries, ie 23 bits' worth, about 8M entries).

So, to sum up: an opcode is just like an assembly-language CALL (put the return addr into a link register, then JMP) to the address indicated by the resolved value of the opcode field. Ie, go to the effective address specified by the opcode field, take the value there, use that value as a pointer and CALL it (JMP to it). So if the field says 'immediate 3', then actually JMP to the sequence of instructions held in register 3, register 4, etc; that is, set the PC to 3.

To access fixed instructions, use 'constant immediate' mode.
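
a minimal sketch of that dispatch (Python; resolve_referent and the vm fields are invented names, standing in for the addressing-mode resolution described above):

    # hedged sketch: executing the opcode field = CALL to its resolved referent
    def dispatch(vm, opcode_mode, opcode_operand):
        entry = resolve_referent(vm, opcode_mode, opcode_operand)  # fn entry address
        vm.link_register = vm.pc + 1   # put the return addr into a link register
        vm.pc = entry                  # then JMP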

---

hmm, maybe a better idea: something else we could do to help with emulation: treat immediate mode in the opcode field specially, indicating that we should only execute a single instruction, and then return; so immediate mode 3 doesn't mean 'JMP to 3', but rather 'execute the instruction in register 3 in emulation'.

---

or, how about: to resolve the problems with cluttering up the low constant table, use a different constant table for split addressing and normal addressing.

and to resolve the problems with split immediate, just define 'immediate non-split' to be special, to reference the fixed instructions.

---

what does 'boxed' in the opcode field mean? i think it means the normal thing; the referent is not a plain pointer, but some sort of thunk or nullable/maybe/optionType thunk, whose base type is pointer. We can also work in indirection/nested/chained pointer dereferencing here, e.g. a boxed pointer could be a thunk that says 'go to address X, fetch value x from it, and then do an LEA on the value of x eg as if x were the field'.

---

so, to sum up:

---

is it easy/worthwhile to have one or two one-instruction restricted CASE statements? Seems like you could reasonably do this with 4 cases; since you have 48 operand bits to work with, use 4 of them for an address mode, and now you have 44 bits, which is 11 bits per case entry point (PC-relative, i assume). You could maybe also have 8 cases with 5 bits per case entry point (again, PC-relative).
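
a minimal sketch of the 4-case variant (Python; i'm assuming the 44 bits hold 4 consecutive signed 11-bit PC-relative targets, high bits first):

    # hedged sketch: unpack 4 signed 11-bit PC-relative case targets from 44 bits
    def decode_case4(bits44):
        targets = []
        for i in range(4):
            t = (bits44 >> (33 - 11 * i)) & 0x7FF
            if t & 0x400:       # sign-extend the 11-bit value
                t -= 0x800
            targets.append(t)   # add each to the PC to get a case entry point
        return targets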

---

seems like a goal of Oot assembly is shaping up to be:

a linear, fixed-width (or at least fixed-width packs) instruction set architecture that reads more like a high-level language than usual, thereby allowing the preservation of some 'intent' and less need for inessential register transfers; with limited customization/parameterization points (customizable operand decoding, customizable unboxing).

---

so the current proposal is:

4 fields. First the destination/output field, then the opcode field, then 2 input fields (some instructions may combine some of these fields).

for each field:

first 4 addr mode bits, then 12 operand bits

for the destination/output field:

for the opcode field:

for all fields, except for the special cases given above, the typical interpretation of 4 addr mode bits is:

immediate: the referent is the operand.

immediate constant table: the referent is the entry in the static per-module constant table indexed by the operand.

register: the effective address is the memory location pointed to by the operand.

register indirect: the effective address is the value in the memory location pointed to by the operand.

In general, if an effective address is given, the referent is the value in the memory location indicated by the effective address.

If the operand is unsplit, then the operand is 12-bits and is resolved as indicated above. If the operand is split, it is composed of two 6-bit operands. The first of these is resolved using the given addressing mode. The second is resolved using 'register' addressing mode. These two values are then added together, and the result is taken as the effective address. If the operand is split and the mode is immediate constant, a different constant table is used than for the unsplit case.

If the addr mode is 0011 (immediate boxed), a special case obtains (todo, split operand with other kind(s) of parens in register indirect addr mode; see 6502 X and Y registers and associated addressing modes? but wait then we can't specify boxedness of the result; so let's do something else).

Some instructions may choose to reparse/specially parse their 48 bits of input and destination fields, for example, in order to take 1 input with 28 operand bits instead of 2 inputs with 12 operand bits each.
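
here's a minimal sketch of that resolution (Python); the bit order within the addr-mode nibble, the table names (vm.mem, vm.constants, vm.split_constants), and the reading of 'resolved' for the register modes in the split case are my assumptions, and the 'immediate boxed' special case is left out:

    # hedged sketch: resolve one field (4 addr-mode bits + 12 operand bits) to its referent
    IMM, CONST, REG, REG_IND = range(4)

    def referent(vm, mode2, operand, split=False):
        if mode2 == IMM:
            return operand                        # immediate: referent is the operand
        if mode2 == CONST:
            table = vm.split_constants if split else vm.constants
            return table[operand]                 # per-module constant table
        ea = operand if mode2 == REG else vm.mem[operand]  # register / register indirect
        return vm.mem[ea]

    def resolve(vm, mode4, operand12):
        mode2 = mode4 >> 2         # the two basic addr-mode bits
        boxed = (mode4 >> 1) & 1   # boxed bit; ignored in this sketch
        split = mode4 & 1
        if not split:
            return referent(vm, mode2, operand12)
        hi, lo = operand12 >> 6, operand12 & 0x3F      # two 6-bit operands
        first = referent(vm, mode2, hi, split=True)    # first half: the given mode
        second = vm.mem[lo]                            # second half: 'register' mode
        return vm.mem[first + second]                  # their sum is the effective address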

in other words, useless addressing modes that can be used for other things include:

---

ok wait, actually unsplit immediate boxed totally makes sense, that's boxed literals. This allows us to create small boxed literals without referencing the constant table. So we need to specify an encoding for the 12-bit operand.

How about: 1 bit null/not null, 3 bits for the type, 8 bits for the payload. So we have 8 types (all subtypes of graph, of course, but:): graph, string, list, hashtable (?), int, real, type, fn (do we need a DYN here? or is 'graph' that? (i think the latter)); do we need a separate 'uint' and 'sint'?. If the 'not null' bit is set, then the 8-bit payload specifies an element of the type; in most cases this is a list of one element, a string of one character, an int, etc, but in the case of eg 'type' and 'graph' we still have to make up a standard encoding. If the 'null' bit is set, then the 8-bit payload can specify (a) the maybe/option says 'None', not 'Just'; or, (b) this is an empty list, an empty string, etc; or (c) 253 other things tbd.
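
a minimal sketch of decoding that operand (Python; the bit order and the type-code numbering are my assumptions):

    # hedged sketch: 12-bit boxed literal = 1 null bit + 3 type bits + 8 payload bits
    TYPES = ['graph', 'string', 'list', 'hashtable', 'int', 'real', 'type', 'fn']

    def decode_boxed_literal(op12):
        not_null = (op12 >> 11) & 1
        typ = TYPES[(op12 >> 8) & 0x7]
        payload = op12 & 0xFF
        if not not_null:
            return (typ, 'none-or-empty', payload)  # None, empty list/string, or tbd
        if typ == 'string':
            return (typ, chr(payload))              # a one-character string
        return (typ, payload)                       # an int, a one-element list, etc.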

also, i think i've decided that the spare bit in the destination operand should mean 'annotation or extraordinary instruction'.

also, we need to move the 'split/unsplit' addr mode bit to the first position, so that annotations/extraordinaries can be recognized as '00' in the first two bits of the whole thing

also, perhaps we should say that all jumps/branches will be ordinary arithmetic instructions operating upon the PC register? This allows most (but not all) of them to be recognized without decoding simply by scanning the code for instructions with the PC register in the destination operand in 'register' mode. This could be helpful for optimized interpreters that decode future instructions in parallel and do pipeliney stuff to partially pre-execute them. This would also eliminate the need for an instruction form that combines all 3 operands; but otoh it also forbids that in the case of branches, meaning that branch distance would be limited to 28 bits instead of 44 bits. Also, if we allow GOTOs and the like, which we probably do, then an optimized interpreter would have to conservatively assume that any CALL might contain a further branch. And i guess it wouldn't be that much extra trouble for an optimized interpreter to compare each immediate-mode opcode to a lookup table of branching instructions. So i guess right now i'm against this idea.

---

copied the current proposal to ootAssemblyThoughts

---

why are stacks faster than heaps?

i think the speed difference is mostly due to memory management paradigm. A new object is allocated on the stack with a single instruction (the cost being that the total size of the stack is fixed). In the heap, this is not always the case, particularly if some sort of garbage collection or defragmentation is triggered upon allocation.

Some citations about this:

http://stackoverflow.com/questions/161053/c-which-is-faster-stack-allocation-or-heap-allocation

http://bytes.com/topic/c-sharp/answers/244034-why-stack-faster

http://www.programmerinterview.com/index.php/data-structures/difference-between-stack-and-heap/

http://gribblelab.org/CBootcamp/7_Memory_Stack_vs_Heap.html
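
to illustrate (Python; a caricature of an allocator, not a real one): stack allocation is a single pointer bump, and freeing a whole frame is a single pointer move, whereas a heap allocator may have to search for space and may trigger collection:

    # hedged sketch: why stack allocation is one instruction's worth of work
    class StackAlloc:
        def __init__(self, size):
            self.sp = 0
            self.size = size                # the cost: total size is fixed
        def alloc(self, n):
            addr = self.sp
            self.sp += n                    # the whole allocation: one pointer bump
            if self.sp > self.size:
                raise MemoryError('stack overflow')
            return addr
        def free_frame(self, addr):
            self.sp = addr                  # free everything above addr at once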

---

so far oot asm is turning out to be not that different from ordinary assembly language. I suspect that fixed-width instructions and the inclusion of pointers are two big factors that end up affecting/determining a lot of the design. One difference is that our memory locations aren't really memory locations (they can 'hold' arbitrarily large data objects), and another is that we'll probably avoid pointer arithmetic (or at least, avoid writing pointer arithmetic into the spec) by changing the defn of the 'split' addressing mode to be getattr rather than addition.

---

todo: http://paulkoerbitz.de/posts/Understanding-Pointers-Ownership-and-Lifetimes-in-Rust.html looks good even though it is out of date (eg some gems: "Owned and borrowed pointers fit a lot of use cases, but sometimes they are not enough. With these pointer types, each piece of memory must have an ultimate owner. This means that the ownership of all objects on the heap must be representable as a directed acyclic graph. Up to version 0.8, Rust offered managed pointers with the syntactic form @. These are currently being removed and will be replaced by the Rc and Gc pointers in the standard library"; lifetimes; a borrowing must have a shorter lifetime than the owner; things become immutable (to the owner, yes, but also to the borrower? i'm not sure) while borrowed)

http://doc.rust-lang.org/0.12.0/guide-lifetimes.html#lifetimes seems to say:

(1) lifetime names are opaque identifiers; you can say that a function returns a lifetime, that a variable takes a lifetime, etc; there is also the predefined 'static' lifetime. Lifetimes are written with a prefix apostrophe, eg 'lifetimename. (2) lifetimes are also labels that can be jumped to in some cases (e.g. in 'breaks' out of loops)

---

memory locs are supposed to be like variables with numbers for names instead of letters. could just say that all 'memory locations' are kept for the life of the program, and only boxed objects are on the heap, and the program predeclares the number of memory locs it needs.

if memory locs are like var names, then that explains why we dont want to allow pointer arith with them

but then we couldn't 'mmap' arbitrarily long arrays into linear stretches of this memory. so mb let the program dynamically expand and contract the max memory loc? the interpreter might demand that the os give it a single, contiguous chunk, however, which may make it infeasible for this region to be too large. also, mmapping in this way is only useful if we do pointer arithmetic, which we may not want to do anyways

if we disallow ptr arith, and if memory numbers are like variable names, then we can't represent 'gensym' (by keeping track of a max variable number and adding 1 to it when we hand it out), so gensym must then be a primitive (an oot assembly instruction)

also, if there is a struct-within-a-struct, wouldn't it appear to the program that the parts of this struct exist at ordinary, though unknown and arbitrarily large, 'memory addresses'? the gc convention though could be that these language-provided, 'opaque' memory addresses are garbage-collected, not roots.

the contents of the cells ('memory locs') are actually stored in them if unboxed and small, or are a pointer otherwise; that is, if large unboxed values are possible (mb they shouldn't be). note that if a cell might be something small or it might be something large, and the program puts in something small and then reads expecting something large, there is an invalid ptr error. so maybe force boxed dynamic type in that case. does oot asm type cells, or do we trust the program not to segfault in this way? if large things are a ptr anyways, mb we should just go ahead and box them; then the typing is just boxed, or unboxed int, or unboxed real, or unboxed (some other small type).

---

to rephrase the last section:

top-level memory locs are supposed to be like variables with numbers for names instead of letters. Which explains why 'pointer arithmetic' seems unnecessary. However, within a struct, you also have a sub-array of memory locations representing the fields within the struct. And if the structure is itself of type array, then some field within it, some sub-location within it, contains the array's actual data. The array's actual data will itself be represented as an array of sub-memory locations. And 'pointer arithmetic' on these locations makes total sense, because we are just doing arithmetic on the array index.

(and btw, in case i change my mind and we do disallow ptr arithmetic, that would mean that 'gensym' would need to be a primitive instruction provided by the platform, because we wouldn't be able to keep track of some max variable number and add 1 to it when we hand it out; which means that we should probably provide gensym in any case)

as for memory management:

if the top-level memory locs are acting like local variables and/or like registers, then they are acting like things which in other platforms are not heap-allocated and would not be garbage-collected. So perhaps we just say that all of the top-level memory locs which are accessed directly by the program are kept forever for the life of the program. So what is on the heap? First, we could say that all boxed objects are stored on the heap (when a boxed object is 'in' a directly-accessed memory loc, what is actually there is really just a ptr). Second, we could say that if the program goes inside of a boxed object and fetches the address of one of its fields, then even though this address is conceptually some integer-numbered top-level memory location, actually it refers to memory on the heap, and is garbage-collected. Similarly, if a program does 'malloc', it gets a pointer that conceptually appears to it to be just an integer that refers to some memory loc, but actually this integer refers to something on the heap.

Furthermore, when the program takes the address of something inside a boxed structure, or of something returned by malloc, this address is conceptually guaranteed to be an 'infinitely large' number, so that it is guaranteed not to conflict with any memory location that the program could ever access by other means. In other words, the conceptual structure of the integers representing the addresses of the top-level memory locations is a non-standard model of the integers, in which there is one island of integers before infinity, and many islands of integers after infinity that are infinitely far away from all other islands. Under the covers, of course, such addresses are merely pointers into real memory, in contrast with the integer locations of ordinary top-level 'memory locations'. Note that pseudo-arbitrarily-large memory locs == locations not constructed by the program but returned from elsewhere == boxed objects == memory which is garbage-collected. Ordinary, 'small' memory locs == memory locs constructed by the program == holds unboxed objects and pointers to boxed objects == memory which is not garbage-collected (but which forms the roots for garbage collection tracing).

in addition, this 'non-standard model of arithmetic' allows getconstant(x) to represent a path, not just a single number; e.g. we are pretending that a reference to someobject.somesubobject.somefield is just a reference to an ordinary yet very large integer index of the 'memory cells' (when in fact it's a pointer into main memory).

so, the contents of the cells are actually stored in them if the contents are unboxed, or if the contents are boxed, the cell actually contains a pointer to the boxed value in the heap.

also, instead of allowing the program to construct arbitrarily large memory locs, we could just limit this to 4096 memory locs; at which point this would become in some sense simply a 4096-register architecture. However, one could at the same time allow integer references to memory locs to be interchangeable with pointers to the heap, as above, to maintain the fiction that a single memory loc can 'contain' an arbitrarily large composite object; so in this sense it would still be a memory-memory architecture, not a register one (eg you could still do 'mov' between two memory locs directly, without copying the contents to be moved to a register in the middle). However, this would imply that you can't do things like copy an arbitrarily long array directly into the top-level memory locs and then iterate thru it there; i doubt that would be a big loss though.
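
a minimal sketch of this two-tier model (Python; all names invented, no bounds checking): small integers address the fixed top-level locs, which are never collected and serve as GC roots, while the 'pseudo-arbitrarily large' addresses are really handles into a collected heap:

    # hedged sketch: top-level locs + 'beyond-infinity' heap handles
    class Memory:
        HEAP_BASE = 1 << 32                  # addresses >= this are heap handles
        def __init__(self, nlocs=4096):
            self.locs = [0] * nlocs          # top-level memory locs (GC roots)
            self.heap = {}                   # boxed objects; garbage-collected
            self.next_handle = Memory.HEAP_BASE
        def alloc(self, obj):                # malloc / taking the addr of a field
            addr = self.next_handle
            self.next_handle += 1
            self.heap[addr] = obj
            return addr                      # looks like just a (very large) integer
        def load(self, addr):
            return self.locs[addr] if addr < Memory.HEAP_BASE else self.heap[addr]
        def store(self, addr, val):
            if addr < Memory.HEAP_BASE:
                self.locs[addr] = val
            else:
                self.heap[addr] = val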

---

indirect addressing here is essentially a form of metaprogramming, namely it's like providing a variable name as a string to a getattr or setattr function, and getting (or setting) the value of that variable, except instead of the string we have the numeric id.

may make typechecking harder..? well, not if the indirect address is not resolving to another top-level address, but rather to a heap address. Hmm.. maybe actually make that a requirement?

---

i guess actually it's fine for 'split' addressing to be:

  #x + Y
  getconstant(#x) + Y
  X + Y
  (X) + Y

or if we use . (getattr) instead of +:

  #x + Y
  getattr(getconstant(#x), Y)
  getattr(X, Y)
  getattr((X), Y)

(note that getattr doesn't make sense for #x + Y)

but now that i realized that we need to address arrays too, maybe + would actually be useful.

if we flipped the order of split addressing, and used getattr:

  getattr(Y, #x)
  getattr(Y, getconstant(#x))
  getattr(Y, X)
  getattr(Y, (X))

if we used plus, and typechecking ensured that when you 'add' two things, only one of them can be a pointer, and the other must be an offset, and then it's like getattr(the_one_which_is_a_pointer, the_one_which_is_an_offset), that could be interesting; then we have the union of the above:

  getattr(getconstant(#x), Y)
  getattr(X, Y)
  getattr((X), Y)
  getattr(Y, #x)
  getattr(Y, getconstant(#x))
  getattr(Y, (X))

where to 'getattr' of an array's data field is to index into the array

but it seems like if we wanted that, we'd really want to explicitly represent it in a bit, rather than use polymorphism.

now we could just say "whatever, pointer + a constant = an address, we don't have to use typechecking to know that that's what's happening, we just add them". But the problem with that is that we are trying to abstract away from field size here, so the runtime can't actually just add some number to a pointer if the number refers to 'number of fields' instead of 'number of bytes'; instead, the runtime has to know what sort of structure that pointer is referring to in order to know how to translate number of fields into number of bytes.


which reminds me of the problems of having this 'a memory location holds things of any size' property in the first place. Are we really spending 64 bits to store a small integer or a single character? I guess we're only offering 4096 memory locations * 8 bytes = 32k (and that's assuming we don't allow eg 'mmapping' an array into the top-level memory locs), but still, this seems a little excessive.

what does java do about this? i think they just bite the bullet and say that even small values occupy 64 bits: "If the size of an object is not a multiple of 8 bytes, it is rounded up to next multiple of 8." -- https://en.wikipedia.org/wiki/Java_performance#Memory_usage

one answer to this could be to have a different pool of memory cells for things of different sizes, but now we're getting complicated.

even if we did have uneven cell sizes, then what about cell locations whose type changes during runtime?

also, what use are 4096 registers anyways when we may need stack frames that persist as we call subroutines and then return? or ARE these stack frames?

they certainly are not the entire stack; eg the Java default stack size is much bigger than 4k: http://docs.oracle.com/cd/E13150_01/jrockit_jvm/jrockit/jrdocs/refman/optionX.html#wp1024112

i think the best answers so far are:

so now we have added another function to boxed things; binding (eg binding a 'local' stack frame cell to a source upval). So boxing


as for 'split' addressing, we could just destroy the uniformity and then it's easier; just choose 4 out of the 6 things above (and swap X and Y as needed to make it easy to remember): i'd probably choose:

  getattr(X, #y)
  getattr(getconstant(#x), Y)
  getattr(X, Y)
  getattr((X), Y)

these are useful because eg:

getattr(X, #y): get a specific field from the object held in X

getattr(getconstant(#x), Y): reference a computed element from some long chain of namespaces, e.g. argv_array

getattr(X, Y): good for array indexing X[Y]

getattr((X), Y): X holds a pointer to an array which resides deep in some heap structure. (X) is the array; we are referencing index Y of that array

hmm.. actually, X + getconst(#y) (that is, getattr(X, getconst(#y))) might be (much) better iff getconst(#y) can represent a whole path (query) instead of a single choice of a field; for example, if getconst(#y) can represent .b.c.d, and X is representing a, then we could do a.b.c.d in one instruction this way (because the representation of the path/query .b.c.d is encoded in the constant table).


the idea from the end of the previous section is important enough that i wanted to emphasize it in its own section: instead of representing only the choice of a field, 'memory offsets' should represent a whole PATH, that is, a choice of a field and then optionally of a subfield, etc. We might interchangeably think of a PATH and of a QUERY. Eg .b.c.d (apply this offset/path/query to variable 'a', and you get a.b.c.d); eg [b][c] (apply this offset/path/query to 2-d array 'a', and you get a[b][c]); eg x y (apply the function f to these two arguments, and you get (f x y)).

note that with regard to function application, we are reminded of the 0-ary function issue, which we saw elsewhere also corresponds to the distinction between applying a complete slice index to a table and getting back a table with one element (for consistency with partial slices, which return eg columns of 2-d tables) vs. applying a complete slice index to a table and getting back a non-table value. And also of the issue of __set needing a 'path' through an object, as opposed to __get just needing a value.

And also of the connection between adding offsets to pointers, and getattr (doing a single field selection), and having offsets able to represent paths (nested field selection). In C adding an offset to a pointer would work as a nested field selection iff the fields before the selected field were of known length; but in our case we want to abstract away from that byte length of fields. You might call paths 'queries', which also implies that more complicated, 'dynamic' methods of selection might obtain; basically, anything computation that walks the graph and returns a pointer to a destination node. We see a similar distinction here as between effective address and referent in addressing mode. However 'a pointer to a destination node' is not really a 'path', so we may need to rethink calling these 'paths'; otoh b/c of metaprogramming like __get, __set, in some cases it will be important which nodes you pass thru on the way to another node, because the node you finally reached may be only a virtual node simulated by running __gets and __sets on nodes that you went thru to get here.
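
a minimal sketch of walking such a path (Python; the dict-based object representation and the '__get' hook name are stand-ins for whatever oot actually uses):

    # hedged sketch: an 'offset' as a whole path/query, not a single field choice
    def follow_path(node, path):                # e.g. path = ['b', 'c', 'd'] for .b.c.d
        for field in path:
            if isinstance(node, dict) and '__get' in node:
                node = node['__get'](field)     # virtual node: the field is computed
            else:
                node = node[field]              # plain field selection or array indexing
        return node

so follow_path(a, ['b', 'c', 'd']) plays the role of a.b.c.d, and the intermediate nodes are visited in order, which matters when some of them have __get hooks.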


since we have pointer values floating around here, this brings up the possibility of segfault crashes (eg if you store an unboxed float (that is, a double, since it's 64 bits) into a memory cell, and then try to read it back with the boxed bit set, that will try to dereference the float as if it were a pointer, but it's not, so probably segfault). We would like not to crash (the VM segfaulting is worse than the program crashing, because the VM loses control and can't propagate an exception). How does the JVM cope with this? i looked into it and it seems the JVM actually verifies the bytecode to make sure it typechecks during classloading, before executing anything. I would have thought that would make class loading way too slow, but apparently the verifier is only a small proportion of the cost of loading a classfile in the JVM (evidence: 40% speedup of the verifier only translated to a 5% speedup of large program startup time [10]) (apparently the cost is mostly I/O, and also in many cases (eg Clojure startup) initialization routines provided by the code in the classfile).

so heck, if it's so cheap, let's do that.

so if we statically typecheck everything, that makes it harder to have dynamism and run-time metaprogramming. We would rather have some things typecheck to DYN, meaning that their type is not known statically (at least not to our theorem-prover), and so runtime typechecking is needed. But we're out of bits. So let's just use the boxing bits for that, too (typechecking only on input, i guess?).

if we really need to do stuff with pointers without the speed cost of dynamically typechecking at runtime, and the verifier can typecheck this stuff at load time, then we could always just interact with the pointers using unboxed instructions that accept pointers.

so, the verifier checks everything which is unboxed (eg manipulated with instructions without their 'boxed' bit set; this might include manipulation of boxed structures directly), and the runtime dynamically typechecks (and bounds checks) everything which is boxed.
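
a minimal sketch of that division of labor (Python; is_valid_box, box_type, and unbox are hypothetical helpers):

    # hedged sketch: unboxed access is trusted (verified at load time);
    # boxed access is typechecked dynamically
    def read(vm, addr, boxed_bit, expected_type=None):
        v = vm.mem[addr]
        if not boxed_bit:
            return v                        # the verifier proved this safe at load time
        if not is_valid_box(v):             # runtime tag check
            raise TypeError('not a boxed value')
        if expected_type is not None and box_type(v) != expected_type:
            raise TypeError('dynamic typecheck failed')
        return unbox(v)                     # may force a thunk, follow a chain, etc.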


so what is boxing doing now? a lot:

we should also add:

in other words, 'boxing' seems to be the default for our inefficient, high-level, safe, dynamic, lazy, symlink-having language; and 'unboxing' is like a special mode that you can use to do more CTM stuff for tight loops, etc.


note that the typechecking verifier will probably have to know about the concept of an array in order to make array access efficient. Java bytecode knows about arrays.

---

Perl's runtime actually has a number of stacks, toread:

http://docstore.mik.ua/orelly/perl/prog3/ch18_03.htm

"

operand stack
    That's the stack we already talked about.

save stack
    Where localized values are saved pending restoration. Many internal routines also localize values without your knowing it.

scope stack
    The lightweight dynamic context that controls when the save stack should be "popped".

context stack
    The heavyweight dynamic context; who called whom to get where you are now. The caller function traverses this stack. Loop-control functions scan this stack to find out which loop to control. When you peel back the context stack, the scope stack gets peeled back appropriately, which restores all your local variables from the save stack, even if you left the earlier context by nefarious methods such as raising an exception and longjmp(3)ing out.

jumpenv stack
    The stack of longjmp(3) contexts that allows us to raise exceptions or exit expeditiously.

return stack
    Where we came from when we entered this subroutine.

mark stack
    Where the current variadic argument list on the operand stack starts.

recursive lexical pad stacks
    Where the lexical variables and other "scratch register" storage is kept when subroutines are called recursively.

And of course, there's the C stack on which all the C variables are stored. Perl actually tries to avoid relying on C's stack for the storage of saved values, since longjmp(3) bypasses the proper restoration of such values. "

---

should we have scope annotations to say when we're done with a particular variable/resource? should we have for loops, while loop annotations or instructions? is that too ASTy for bytecode (even if it's linearized)?

---

i don't understand what this means, but apparently the JVM had a problem verifying its jsr/ret instructions: [11] (see [12], section "The issue with JSR/RET"). WOULDNT THE SAME PROBLEMS APPLY TO MY 'FAST FUNCTIONS'?


so still not clear on:

---

well now that i've thought of a bunch more flags that i want, maybe we should cut things down from 12 bits to 10 bits, freeing up 2 bits per field, or 8 bits overall

1024 registers is still a lot, right guys? (yeah)

1024 constants per module is still a lot, right guys? (yeah)

1024 functions per module and imported modules is still a lot, right guys? (umm.. no, but i guess it'll do, since we can always fall back to GETCONSTANT, which will combine input fields to get 20 bits)

---

if we had more flags, we could also have one for how to deal with failure, eg whether to return a null (a maybe-null, or a straight nullable type), or to raise an exception

---