proj-oot-ootAssemblyNotes3

i've been reading about Lua's VM from

http://luaforge.net/docman/83/98/ANoFrillsIntroToLua51VMInstructions.pdf

which is an excellent, easy-to-read document. I'm interested in the VM because Lua seems somewhat small, and the VM seems somewhat small, and i am looking at it as a basis set of operations for Oot (a core language). Seen for this purpose, there are some unnecessarily complex parts of the VM, doubtless present for efficiency, such as:

so i guess an assembly for Oot would want to be less complex. Other thoughts:

---

so, addressing modes:

(note: in this scheme there is no stack addressing mode!)

an alternative:

an alternative:

1 bit says: constant or non-constant

if constant: one bit says: immediate constant or constant-table constant

if non-constant: one bit says: direct or indirect

stacks, and the top two elements of each stack, are memory-mapped (so, as described above, with 4 stacks, that's 16 memory-mapped locations): one bit is whether this is a straight read/write or a pop/push, one bit is whether this is to the TOS or the second-element-of-stack, and two bits are which stack.

could also have one bit for whether we are accessing 'special' memory, like the memory-mapped stacks, or 'ordinary' memory. 'special' memory has the properties that (a) when you read from a special memory location, you might get a value different from what you last wrote to it, (b) 'reads' might cause mutation (e.g. when they are interpreted as POPs). We might extend this to say that special memory is 'volatile', and that accesses to it should be executed immediately and non-reordered because they might have arbitrary side-effects. We might also say that special memory is always thread-local and never shared.

also, as noted above, we need an 'emulation'/'cascaded addressing mode' bit (at least for direct and indirect addressing, do we also need it for constants?). do we need PC-relative addressing? do we need table (eg displacement) addressing?

if we have: constant bit, direct/indirect bit, emulation bit, special bit, that's 4 bits, so if we have 16-bit words, we have 12 bits left, ie 4k. note that in this scheme, addresses can be up to 4k but constants can be up to 16k, because constants don't need 'emulation' or 'special' bits.
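(to make the bit budget concrete, a minimal sketch in Python of decoding a 16-bit operand under this scheme; the exact bit positions are my assumption, the notes above don't fix them:)

    def decode_operand(op16):
        # assumed layout: mode bits at the top, payload at the bottom
        if (op16 >> 15) & 1:                      # constant bit
            # constants skip the emulation/special bits, so they keep
            # 14 payload bits -- values up to 16k, as noted above
            table = (op16 >> 14) & 1              # immediate vs constant-table
            return ('const_table' if table else 'const_imm', op16 & 0x3FFF)
        indirect  = (op16 >> 14) & 1
        emulation = (op16 >> 13) & 1
        special   = (op16 >> 12) & 1
        return ('indirect' if indirect else 'direct',
                op16 & 0x0FFF,                    # addresses up to 4k
                emulation, special)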

(Instead, could use the ARM-ish barrel-shift-ish encoding where the 12 low-order bits of immediate constants are interpreted straightforwardly, but the two high-order bits shift the others, e.g. if the two high order bits are 00 then the value is the value given by the 12 low-order bits, if the two high order bits are 01 then the value is 2*(the value given by the 12 low-order bits), if 10 then 4*(the value given by the 12 low-order bits), and 11 is multiplication by 8, allowing representation of all values up to 4k, and some values with trailing zeros up to 32k. This sounds useful/efficient but is slightly more complicated, so i probably won't do it)
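(for reference, a sketch of that decode, on a 14-bit constant payload of 2 shift bits + 12 value bits:)

    def decode_shifted(payload14):
        shift = (payload14 >> 12) & 0b11      # 00,01,10,11 -> x1, x2, x4, x8
        return (payload14 & 0xFFF) << shift

    assert decode_shifted(0b01_000000000011) == 6         # 2 * 3
    assert decode_shifted(0b11_111111111111) == 4095 * 8  # max 32760, just under 32k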

---

in [1] i wrote:

" Note that Haskell is implemented a bit differently from many languages; this stems from Haskell's properties:

It is therefore assumed that Haskell programs will, at runtime, very frequently be dealing with unevaluated thunks, with closures which represent unevaluated or partially evaluated functions, and with the application of functions whose arity was unknown at compile time. "

due to these considerations, the GHC compiler appears at first rather complicated. For example: in reading about it i heard about 'spineless tagless G-machines', and i found a paper describing that and read about a weird system in which, to evaluate a function whose arity is only known at runtime, you put a bunch of potential arguments on the stack and then jump into code for the function, and then IT decides how many arguments to consume; if it generates another function as a result, it calls that function directly to consume the rest of the arguments, etc. Then in later papers i found that they dropped this weird 'push/enter' system for the simpler 'eval/apply' system that i would have expected (the caller looks up the arity of the function at runtime and then calls the function with only the arguments it will use, then calling the output of the function if appropriate). I also found that they later switched from a tagless system to one with tags (not type tags, but rather a tag for whether an object has been evaluated yet, and for which outermost constructor was used to generate the object; this is type-specific because different types have different sets of constructors, so we kind of have a type-within-a-type, where the type is known at compile time but the choice of constructor, the type-within-a-type, is not!). And then there are info tables describing each object, and the garbage collector has to read these, and the garbage collector also apparently can sometimes help in propagating tags for some reason, and it all sounds very complicated.
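(a minimal sketch of eval/apply in Python, with a made-up Closure representation rather than GHC's actual heap objects: the caller checks arity at runtime, supplies exactly that many arguments, and keeps calling whatever comes back:)

    class Closure:
        """A function plus any arguments collected so far."""
        def __init__(self, arity, code, held=()):
            self.arity, self.code, self.held = arity, code, held

    def apply(f, args):
        args = list(args)
        while args:
            need = f.arity - len(f.held)
            if len(args) < need:                # undersaturated: build a
                return Closure(f.arity, f.code, # partial application (PAP)
                               f.held + tuple(args))
            f, args = f.code(*f.held, *args[:need]), args[need:]
            # oversaturated: f must have returned another function,
            # which now consumes the remaining arguments
        return f

    add = Closure(2, lambda x, y: x + y)
    inc = apply(add, [1])                       # "g = add 1"
    assert apply(inc, [41]) == 42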

but now that i've read about that, i wonder if i can't boil down the basic operations that the runtime is trying to do into an intermediate VM language, and leave questions like push/enter vs. eval/apply, and of how/where you store the stuff that Haskell puts in type tags and info tables, to the VM implementation (compiler or interpreter). Then Oot Core could do all this Haskelly stuff but the reference implementation could be comprehensible. Of course, the optimized reference implementation could compile this intermediate language into something more efficient, like Haskell does.

Now, what is the difference between GHC Core and the thing i am talking about? I guess it's that i'm talking about an intermediate VM language for what instructions are being executed at runtime, rather than a representation of the source code that is being compiled.

operations like: APPLY (what else)

also, other 'runtime'-y things, such as COW (copy-on-write)

http://www.haskell.org/haskellwiki/Ministg

so, this suggests that expressions have a place in oot assembly: not only because we could have an AST instead of a linear sequence of VM instructions, but also because thunks must contain expressions (for the same reason that lazy languages are like a reduction of expression graphs). also, should think about how to generalize/unify this with dynamically passing around references to imperative code.

one piece of the puzzle that is missing is how to combine delimited continuations and friends (eg metaprogramming based on direct manipulation of the call stack) with haskell-style graph reduction. this may also be related to the puzzle of how to have tail calls but also support good stack traces; as ministg points out, "Lazy evaluation means that the evaluation stack used by the STG machine is not so useful for diagnosing uncaught exceptions. The main issue is that the evaluation context in which thunks are created can be different to the evaluation context in which thunks are evaluated. The same problem does not occur in eager languages because the construction of (saturated) function applications and their evaluation happens atomically, that is, in the same evaluation context. ". In fact, you might say that in such a language there are not 2, but 4, stack contexts for each function: the lexical (source code) ancestors of the function definition in the AST, the lexical (source code) ancestors of the function application in the AST, the ancestors in the call stack at the time of the function's creation, and the ancestors in the call stack at the time of the function's calling. should also look at di franco's effort to combine call-stack metaprogramming with constraint propagation

people say 'synchronize' and have target language instructions like SYNC but what they really mean is often 'WAIT until the other thread gets here, or WAIT until the other thread does this', so i should call that opcode WAIT.

the idea of haskell data constructors as a type-within-a-type is interesting and possibly a fundamental way to generalize/simplify. So now we seem to have at least 3 types of types: atomic types (Hoon's odors), composite types (C structs, Python objects, Python and Haskell and Lisp lists, Python dicts, Hoon's structural typing), and the choice of data constructor within a Haskell type (eg is this List a Cons or a NilList? is this Maybe a Nothing or a Just?). in addition, there are interface types vs representation types, which may be an orthogonal axis
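(a tiny sketch of the constructor-tag idea, representing values as (tag, fields) pairs; this is an illustration, not Haskell's actual heap layout:)

    # Maybe a = Nothing | Just a -- the tag says which constructor built the
    # value; it's only meaningful relative to the type: a type-within-a-type
    NOTHING, JUST = 0, 1

    def from_maybe(default, m):
        tag, fields = m
        return fields[0] if tag == JUST else default

    assert from_maybe(0, (JUST, (42,))) == 42
    assert from_maybe(0, (NOTHING, ())) == 0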

why is a G-machine graph a graph and not a tree (recursion, but why?)? http://stackoverflow.com/questions/11921683/understanding-stg says "Haskell's pervasive recursion mean that Haskell expressions form general graphs, hence graph-reduction and not tree-reduction" but i'm not sure i understand that; wouldn't we just be working with a lazily generated (possibly infinite) tree?

another primitive that we may need to deal with is fexprs, that is, (first-class) functions that don't evaluate their arguments first: https://en.wikipedia.org/wiki/Fexpr (how is this different from any function in the context of a lazy language? should we just generalize this to strictness annotations on each argument?)

APPLY
COW
WAIT
GETTABLE? SETTABLE?
CPUID (args in reg; first call CPUID(0) to get highest arg supported: http://en.m.wikipedia.org/wiki/CPUID; vendor id, processor id, feature bits, serial number, core topology, cache config, virtual and physical address sizes)
GENSYM (see below)
MALLOC, MEMFREE
UNDEF (done with the contents of this register)
VARTYPE
PERSPECTIVE
PUSH, POP
GET, SET
CPY (deep copy; value copy, actually COW?)
MOV
FILL-IN-TEMPLATE (like python's '%' on strings, but with graphs)
MATCH (graph regex)
REPLACE (graph regex)
and stuff about grammars (parsing and producing?) and stuff about logic and oberon stuff (while loops, etc)?
SHALLOWCPY
ALIAS, GETALIAS (see [2])
DIRECTGET, DIRECTSET (non-indirection versions of array indexing; take the array stride as arguments; 1d arrays only?) (or, just GET and SET with modalities, but that would decrease code density)
ENTER, LEAVE (enter and leave block scopes; something similar seems to be present in Ruby YARV bytecode and Perl 5)
PUTSELF (both Lua and Ruby YARV appear to have an instruction to reference Self/This)

note: list of ops copied to and continued at [3]


from [4]: some core operations from Guile's VM:


a new (to me, not to the world) interpretation of Nock's core instructions:

select, identity, nock, isLeaf, successor, isStructurallyEqual:

in other words:

(note: no mention of stack manipulation/continuations/jumping here, except for 'run a function within an environment')

(note: we can construct functions because we can build an AST as data and then execute it as code using Nock)

(note: why is this 'functional'? maybe b/c: (a) operations are defined in terms of macros ie combinators, (b) to use eg the environment you copy it, you don't mutate it)

this interpretation shows the similarity between even Nock and various other core languages:


so some stack ops:

DROP DUP SWAP OVER ROT DUP2 DROP2 PICK TUCK

DRIP: a b c ... -> a a c ... (MOV reg1 reg2)
DIP: a b c ... -> b b c ... (MOV reg2 reg1)
NUP: a b c ... -> a b b c ... (PUSH reg2 onto the (rest of the) stack)
NIP: a b c ... -> a c ... (POP the (rest of the) stack into reg2)
TUCK: a b c ... -> a b a c ... (PUSH reg1 onto the (rest of the) stack)
TAKE: a b c ... -> c b ... (POP the (rest of the) stack into reg1)
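(a sketch of these on a Python list whose index 0 is TOS, so reg1 = s[0], reg2 = s[1], and 'the rest of the stack' starts at s[2]:)

    def drip(s): s[1] = s[0]          # a b c ... -> a a c ...
    def dip(s):  s[0] = s[1]          # a b c ... -> b b c ...
    def nup(s):  s.insert(2, s[1])    # a b c ... -> a b b c ...
    def nip(s):  del s[1]             # a b c ... -> a c ...
    def tuck(s): s.insert(2, s[0])    # a b c ... -> a b a c ...
    def take(s): s[0] = s.pop(2)      # a b c d ... -> c b d ...

    s = ['a', 'b', 'c', 'd']
    take(s)
    assert s == ['c', 'b', 'd']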

PUSH POP OVER UNDER PEEK

cow, fold, map as core ops

---

some things u see in a lot of languages:

APPLY (call a function, possibly with partial arguments or currying)
APPLYTAIL
return
GOTO/JMP
conditional JMP
comparisons, esp. equality
a way to make a composite data structure
a way to index into or destructure that composite data structure
arithmetic and other 'primitive functions'
looping constructs
variable assignment
a way to make a closure (a function with an environment of variables to be used in that function)
switch/jump table (special case: 'case' of an ADT)
try, raise, catch
call FFI

and some have:

call/cc, delimited continuations, and other 'first-class stack frame metaprogramming'
first-class, executable representation of code
polymorphic dispatch
the nock operator (do something to the environment, then do something to the code, then evaluate the resulting code upon the resulting environment)

some popular data structures:

tuples
linked lists (used as binary trees in the Lisp or Nock way, e.g. first/rest)
random access/packed lists/arrays/vectors
dicts
graphs?
functions, closures, objects, tagged unions/ADTs

stacks seem to be used mostly for keeping track, linearly, of the evaluation of an expression, which is shaped like a tree (eg in the AST you don't need the stack, but in bytecode you do). the call stack can be thought of as an extension of this


could have calls with arbitrary many arguments; have the second operand have a sentinel that means 'composite argument', which is found outside of the fixed-width bytecode (are the composite args embedded following the instruction, or in a special constant table?)
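(a decode-side sketch, assuming the composite args are embedded following the instruction; the sentinel value is made up:)

    COMPOSITE = 0xFFFF     # hypothetical sentinel in the second operand field

    def fetch_args(code, pc, op2):
        """Return (args, pc of the next instruction)."""
        if op2 != COMPOSITE:
            return [op2], pc + 1              # ordinary fixed-width case
        n = code[pc + 1]                      # arg count follows the instruction
        return list(code[pc + 2 : pc + 2 + n]), pc + 2 + n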

---

it should be made clear that the use of <16-bit wide operands does not imply a word size of 16 bits. In fact, the word size is (unspecified? or specified to be infinite?). The <16-bit operands merely impose (a) an upper limit on the number of registers, (b) an upper limit on the number of entries in the constant table for a given code segment (but you can always go to a new code segment with a new constant table), (c) an upper limit on the size of immediate values encoded right in the operand, rather than in a constant table. Note that i say <16-bit operands because (reverse search for 'addressing mode') some of the bits in the operand field are used to select the addressing mode (so we only have between about 12 and 14 bits left for the actual values, depending on the addressing mode used, and on how we decide to encode addressing modes).

---

analysis of 8-bit and 16-bit video game consoles:

(note: often i say K when i mean KiB?)

I looked at the tech specs of the NES (8-bit), the SNES (16-bit) and the Sega Genesis (sometimes called "16 bit" but appears to have a 32-bit processor with 24-bit addressing and a 16-bit wide data bus; this is sometimes called 16/32-bit). See data at [5], sections "8-bit video game consoles" and "16-bit video game consoles".

My main objectives were to determine how much memory was addressed, and what the instruction sets were. I am particularly interested in the question of whether the <16-bit operand field sizes that i am contemplating for Oot Assembly would be in the ballpark of the amounts of memory needed for systems like these.

Copying from the summary i wrote over there:

" The instruction sets were 6502 (and 65C816, a 6502 extension) (NES, SNES) and 68000 (Genesis). The address sizes were 16-bits (NES), 24-bit (SNES, Sega Genesis). The builtin RAM was 2 K (NES), 128 K (SNES), 72 K (Genesis). In-cartridge memory was typically up to 8 K, but could go up to 64K with 8K banks (NES; unsure about the other 2). (Program) ROM sizes were cheaply 32K (NES), typically up to 256K (NES) or 4 MiB? (SNES), max 512K (NES), in banks of 32K (NES).

Builtin graphic RAM was 2 K (NES) or 64K (SNES, Genesis), cartridge-expandable in banks of 8 KiB (NES). Max typical CHR RAM was ~128 KiB (NES, MMC1). Critical numbers regarding max # of sprites were 8 (NES), 64 (NES). "

How does this square with my <16-bit operand size limit proposal?

16 bits would be enough to directly address:

16 bits would NOT be enough to directly address:

But now because of addressing mode, i have less than 16 bits. So let's say i have 13 bits (8k addressable).

13 bits is enough to directly address:

13 bits would NOT be enough to directly address:

12 bits would not be enough to address any of these items.

How does this square with parallelism? The NES sprite critical numbers 8 and 64 seem relevant.

Conclusions:

---

so, if 13 bits is underpowered, maybe we should take the addressing mode bits out of the operands and put them in the first of the 4 16-bit fields, reducing the number of opcodes. We had too many opcodes anyways. If we only need 32k, that's just 15 bits, but since that's close there may be some benefit to having the 3 operand field lengths be an even 16 bits instead of a non-power-of-2. Perhaps on some architectures we'll have to execute fewer operations in order to emulate/execute Oot code if each of the 3 operands is an even 16 bits (not sure about that b/c we still have to unpack the same bits out of the first field).

Also, come to think of it, if Nintendo provided for 32K of ROM and that was too little for most cartridges, so that bankswitching was common, then i guess that even a small program on an 8-bit system is often >32K, so Oot will surely be at least this big. And i guess it's kind of weird and potentially inefficient to have to put program offsets into a constant table when the program is the inner core of an interpreter. And i could be wrong about all these being the right numbers, so having 64k to address, when i think 4k to 8k to 32k is needed, future-proofs it a tiny bit.

A disadvantage to that is that if we need 3 addressing mode bits per operand, that's 9 new bits that we now stuff into the first field, leaving only 7 bits. If the first field also has an emulation bit, then there's only 6 bits left, e.g. 64 opcodes, the minimum that we contemplated.

Otoh, we are trying to be minimalistic here, so maybe that's okay.

as i said earlier, as with the Dalvik VM on Android, we expect that there will be faster VM languages to implement Oot in, and that implementations bent on efficiency might compile Oot VM (OVM, i guess) to an idiosyncratic or platform-specific faster language. So not having enough opcodes isn't really a big problem.

Also, we're going to have to have a Jets-like system to replace at least anything in the standard library, and maybe anything in the standard distribution, with optimized platform-specific FFI calls, so we don't need to put e.g. every flavor of addition into Oot Assembly (ovm? osm? oosm? oovm?)

Now, if we are doing this, we really can't go below 6 bits for the opcode, so the last scheme described above, in which each operand had 4 addressing mode bits (constant bit, direct/indirect bit, emulation bit, special bit), is too expensive. Even if you left those in the operands, that would mean the operand values would be only 12 bits, and the analysis above calls for at least 13 bits.

The obvious thing to do is to drop the 'special bit' and rely on a slightly dirtier (but more frugal with memory) 'zero page'-ish memory-mapped specials architecture for the stacks, and rely on more traditional mechanisms for threadlocals and volatiles (which makes some sense, because the amount of these that you'll want or have will depend on the platform and the application, and will not always correspond to approximately half of main memory).


so to summarize, the current scheme appears to be:

undefined word size

64-bit instruction size

instructions divided into 4 16-bit fields: ctrl, op1, op2, op3.

ctrl is a 16-bit field divided into smaller fields as follows:

hmm... the emulation here seems a bit unbalanced; what's the use of op1, op2, op3 getting to address 16 bits when emulating if ctrl can only address 6 bits? So let's just have an EMU instruction and give up on the emulation bits. so now the current scheme appears to be:

ctrl is a 16-bit field divided into smaller fields as follows:

it's tempting to take those 4 bits back and add custom opcodes and custom addressing modes...

but also, don't we need an immediate mode? so how about 3 bits for addressing mode:

and, as noted above, reserve one bit out of the remaining 7 bits for 'custom opcodes'. Leaving 6 bits for non-custom opcodes (64 opcodes).
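(a decode sketch of this scheme in Python; which bits of the ctrl field hold what is my assumption:)

    def decode(instr64):
        ctrl = instr64 & 0xFFFF
        ops  = [(instr64 >> (16 * i)) & 0xFFFF for i in (1, 2, 3)]  # op1..op3
        modes  = [(ctrl >> (3 * i)) & 0b111 for i in range(3)]  # 3 addr mode bits x 3 operands
        custom = (ctrl >> 9) & 1          # 'custom opcode' bit
        opcode = (ctrl >> 10) & 0x3F      # 6 bits: 64 non-custom opcodes
        return opcode, custom, modes, ops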

---

this doesn't leave any bits for some of my other oot-y ideas, such as:

if we were to do any of these, we'd need to either use prefixes (annotations), or increase the instruction length. Instruction length is currently 64 bits (4 16-bit units), so the next power of 2 is 128 bits, giving us 4 more units. If we did that we could:

as noted elsewhere above, putting the type info separate slows down execution because each opcode implementation must dispatch on type; it's more efficient to just have a separate opcode for each operation-type combination.

arguments against this:

arguments for:

---

i probably had better ideas above, but this makes me wonder again, what a 32-bit instruction length would look like:

1 form bit
1 reserved bit
6 addr mode bits (2 addr mode bits x 3 operands)
6 opcode bits
18 operand bits (3 6-bit operands)

again, however, if we want to fix the length (to 128 bits, since that's what the longer format needs), then maybe these would come in packets of 4:

1 form bit
7 reserved bits
4*30 = 120 bits (four 30-bit instructions)

also, if we are getting all complicated here, perhaps we should make the 'long' format shorter by just giving up on the 16-bits (!) and say, go ahead and use the constant table more? No, because we want 16-bit indices into the constant table

well wait, what if we only had 11-bit indices into the constant table? we'd save 20 bits in the long format. we already have 16 unused bits, so that's 36 bits. we can give up 12 of the type info bits, for 48 bits saved. we can give up 12 of the flag bits, for 60 bits saved. we can give up 4 opcode bits, and now we're back to a 64-bit instruction length, albeit one with non-power-of-2 field lengths. In other words:

note the same argument about the EMU addr mode bit being useless if the length of the opcode field is shorter than the operand fields applies here. So instead:

the four flag bits could then be: lazy, map, maybe, typecheck/boundscheck

some comments:

---

an adder has 3 inputs and two outputs; two normal inputs and a carry input, and a normal output and a carry output. This may have implications for what the ideal arity of assembly instructions is.

In traditional assembly, the status register is used to provide hidden inputs and hidden outputs for arithmetic instructions. This is a particular case of the same general idea as accumulator registers, wherein we make instruction encoding shorter by hardcoding certain inputs and outputs to/from special registers. One could imagine that instead of having a fixed status register, every arithmetic instruction would take an additional input, and an additional output, saying which registers to use as the status registers for input and for output. Clearly in the majority of cases this wouldn't help much over just having a fixed status register, so it would waste a lot of space in the instruction encoding (plus, make the execution more complex and hence slower). From the point of view of functional programming, however, this is abusing side effects where you really should be just taking inputs and returning outputs.

In a higher-level language, the need for a carry input could possibly be obviated; for example, the second input could be a tuple which is a pair of (second input, carry). All you really need is two inputs (because you need to have a pair() primitive to create tuples; there's probably a way to get around this with one input, but it would be ugly).

In a higher-level language, the carry output seems similar to an error code.

In a higher-level language, one might think the carry output could be eliminated by using an Option type (a type with special sentinel values, e.g. IEEE floating point's +Inf, -Inf, NaN); carry could be expressed by returning +Inf or -Inf. This almost does the trick; however, it can't express the difference between adding 1 (first input) + 1 (second input) + 1 (carry input), which yields 11 (1 at the carry output, and 1 at the value output), vs. adding 1 + 0 + 1 or 0 + 1 + 1 or 1 + 1 + 0, all of which yield 10; if you just used a carry sentinel, then both of these cases would yield +Inf (the carry sentinel).

In a higher-level language, two outputs could be expressed using tuples (in this case, a pair); the single output would be (carry, value).
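(putting those together, a one-bit full adder in this style: the carry rides along in tuples instead of a status register:)

    def add_bit(a, b_and_carry):
        b, carry_in = b_and_carry          # second input is a (value, carry) pair
        total = a + b + carry_in
        return (total >> 1, total & 1)     # single output: (carry_out, value)

    assert add_bit(1, (1, 1)) == (1, 1)    # 1+1+1 = binary 11
    assert add_bit(1, (0, 1)) == (1, 0)    # 1+0+1 = binary 10: distinct results,
                                           # which a lone carry sentinel couldn't show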

I don't really have any firm conclusions here, those are just thoughts.

---

i guess there's a very simple argument to be made for 16-bit addressing:

16 bits is the smallest power of 2 that could possibly comfortably work; 8 bits is only 256 memory locations, which is clearly too small for many very simple applications

---

more evidence for 16-bit addressing, and more citations for which 16-bit or less CPU families were popular/notable:

6502 (8-bit; Apple I, Atari, Apple II, Nintendo Entertainment System, Commodore 64, BBC Micro)
65C02 (8-bit backwards-compatible extension; Apple IIc; Apple enhanced IIe; BBC Master)
65C816 (16-bit backwards-compatible extension; SNES; Apple IIgs)

8080 (various CP/M systems)
Z80 (backwards-compatible with the 8080)
8086, 8088 (16-bit, backwards-compatible ("source" but not binary) with the 8080; IBM PC)
80286 (16-bit, predecessor of the 32-bit 386, which is the foundation of present-day x86)

6800 (8-bit; Altair 680)
6809 (8-bit, backwards-compatible (source-compatible) with the 6800; TRS-80 Color Computer)
68HC11 (8-bit embedded, based on the 6800)
68HC12 (16-bit MCU, mostly backwards-compatible with the 68HC11, 16-bit program counter)
68HC16 (16-bit MCU, based on the 68HC11)

68000 (16-bit; Sun, Mac, Amiga, Atari ST, Sega Genesis, Sinclair QL, early laser printers, various MCUs)

MSP430 (16-bit MCU)

PDP-11

"The Z80 and its derivatives and clones made up one of the most commonly used CPU families of all time, and, along with the MOS Technology 6502 family, dominated the eight-bit microcomputer market from the late 1970s to the mid-1980s." -- http://en.wikipedia.org/w/index.php?title=Zilog_Z80&oldid=634577740

The Z80 ISA was an extension of the Intel 8080 ISA (binary compatible with 8080 code). The Z80 was an 8-bit architecture with 16-bit addressing [6]. It had 8-bit registers which were in pairs, so the pair could be used as a 16-bit register.

" New syntax

Because Intel had a copyright on their assembly mnemonics,[20] a new assembly syntax had to be developed. This time a more systematic approach was used:

    All registers and register pairs are explicitly denoted by their full names
    Parentheses are consistently used to indicate "memory contents at" (indirection, or pointer dereferencing) with the exception of some jump instructions.[21]
    All load and store instructions use the same mnemonic name, LD, for LOAD (a return to the simplistic Datapoint 2200 vocabulary); other common instructions, such as ADD and INC, use the same mnemonic regardless of addressing mode or operand size. This is possible because the operands themselves carry enough information.

These principles made it straightforward to find names and forms for all new Z80 instructions, as well as orthogonalizations of old ones, such as LD BC,(1234).

Apart from naming differences, and despite a certain discrepancy in basic register structure, the Z80 and 8086 syntax are virtually isomorphous for a large portion of instructions. Only quite superficial similarities (such as the word MOV, or the letter X, for extended register) exist between the 8080 and 8086 assembly languages, although 8080 programs can be assembled into 8086 object code using a special assembler or translated to 8086 assembly language by a translator program.[22][23] Instruction set and encoding

The Z80 uses 252 out of the available 256 codes as single byte opcodes ("root instruction"); the four remaining codes are used extensively as opcode prefixes:[24] CB and ED enable extra instructions and DD or FD selects IX+d or IY+d respectively (in some cases without displacement d) in place of HL. This scheme gives the Z80 a large number of permutations of instructions and registers; Zilog categorizes these into 158 different "instruction types", 78 of which are the same as those of the Intel 8080[24] (allowing operation of 8080 programs on a Z80). The Zilog documentation further groups instructions into the following categories:

    8-bit arithmetic and logic operations
    16-bit arithmetic
    8-bit load
    16-bit load
    Bit set, reset, and test
    Call, return, and restart
    Exchange, block transfer, and search
    General purpose arithmetic and CPU control
    Input and output
    Jump
    Rotate and shift

No multiply instruction is available in the original Z80.[25] Different sizes and variants of additions, shifts, and rotates have somewhat differing effects on flags because the flag-influencing properties of the 8080 were copied. Load instructions do not affect the flags (except for the special purpose I and R register loads). "

-- [7]

" The first widely adopted 8-bit microprocessor was the Intel 8080, being used in many hobbyist computers of the late 1970s and early 1980s, often running the CP/M operating system; it had 8-bit data words and 16-bit addresses. The Zilog Z80 (compatible with the 8080) and the Motorola 6800 were also used in similar computers. The Z80 and the MOS Technology 6502 8-bit CPUs were widely used in home computers and second- and third-generation game consoles of the '70s and '80s. Many 8-bit CPUs or microcontrollers are the basis of today's ubiquitous embedded systems. " -- http://en.wikipedia.org/w/index.php?title=8-bit&oldid=628245098

" The address bus is typically a double octet wide (i.e. 16-bit), due to practical and economical considerations. This implies a direct address space of only 64 KB on most 8-bit processors. " -- http://en.wikipedia.org/w/index.php?title=8-bit&oldid=628245098

the http://en.wikipedia.org/wiki/Motorola_6800 also had a 16-bit address bus

wikipedia also lists "12-bit" in http://en.wikipedia.org/w/index.php?title=Template:Computer_architecture_bit_widths&oldid=633925183 but this isn't a power of 2 and isn't as popular as 8-bit and 16-bit

" The IBM System/360 introduced byte-addressable memory with 8-bit bytes, as opposed to bit-addressable or decimal digit-addressable or word-addressable memory, although its general purpose registers were 32 bits wide, and addresses were contained in the lower 24 bits of those addresses." -- http://en.wikipedia.org/w/index.php?title=8-bit&oldid=628245098

4-bit processors often have 10 to 20-bit addressing:

" The first commercial microprocessor was the binary coded decimal (BCD-based) Intel 4004,[2][3] developed for calculator applications in 1971; it had a 4-bit word length, but had 8-bit instructions and 12-bit addresses." -- http://en.wikipedia.org/w/index.php?title=4-bit&oldid=630773377

" The TMS 1000, the world's first single-chip microprocessor, was a 4-bit CPU; it had a Harvard architecture, with an on-chip instruction ROM with 8-bit-wide instructions and an on-chip data RAM with 4-bit words. " -- http://en.wikipedia.org/w/index.php?title=4-bit&oldid=630773377 . It had 1k bytes of ROM (addressed by 10 bits, in a 6-bit registers and a 4-bit register) and 32 bytes of RAM (addressed as 256 bits of RAM by 2 4-bit registers).

" The Saturn processors, used in calculators such as the commonly used HP-48 scientific calculator, are 4-bit machines; as the Intel 4004 did, they string multiple 4-bit words together, e.g. to form a 20-bit memory address, and most of its registers are 64 bits, storing 16 4-bit digits." -- http://en.wikipedia.org/w/index.php?title=4-bit&oldid=630773377

" Very often, when referring to the word size of a modern computer, one is also describing the size of address space on that computer. ... But this does not always hold. Computers often have memory addresses larger or smaller than their word size. For instance, almost all 8-bit processors, such as 6502, supported 16-bit addresses—[citation needed] if not they would have been limited to a mere 256 byte memory. The 16-bit Intel 8088 had only an 8-bit external memory bus on early IBM PCs, and the 16-bit Intel 8086 supported 20-bit addressing,[citation needed] allowing it to access 1 MiB? rather than 64 KiBs? of memory. Popular Intel Pentium processors[which?] since introduction of Physical Address Extensions (PAE) support 36-bit physical addresses,[citation needed] while generally having only a 32-bit word. " -- http://en.wikipedia.org/wiki/Memory_address#Unit_of_address_resolution

" Other notable 16-bit processors include the Intel 8086, the Intel 80286, the WDC 65C816, and the Zilog Z8000. The Intel 8088 was program-compatible with the Intel 8086, and was 16-bit in that its registers were 16 bits long and arithmetic instructions, even though its external bus was 8 bits wide. " -- http://en.wikipedia.org/wiki/16-bit

" The 16/32-bit Motorola 68000 and Intel 386SX

The Motorola 68000 is sometimes called 16-bit because its internal and external data buses were 16 bits wide, however it could be considered a 32-bit processor in that the general purpose registers were 32 bits wide and most arithmetic instructions supported 32-bit arithmetic. The MC68000 was a microcoded processor with three internal 16-bit ALU units. Only 24-bits of the Program Counter were available on original DIP packages, with up to 16 megabytes of addressable RAM. MC68000 software is 32-bit in nature and forward-compatible with other 32-bit processors in the same family.[7] The MC68008 was a version of the 68000 with 8-bit external data path and 1 megabyte addressing. Several Apple Inc. Macintosh models; e.g., LC series, used 32-bit MC68020 and MC68030 processors on a 16-bit data bus to save cost.

Similar analysis applies to Intel's 80286 CPU replacement called the 386SX which is a 32-bit processor with 32-bit ALU and internal 32-bit data paths with a 16-bit external bus and 24-bit addressing of the processor it replaced.

The 68000 processor of the Sega Mega Drive was a highly advertised feature of the video game system. Due to the saturation of this advertising, the 1988-1995 era (fourth generation) of video game consoles is often called the 16-bit era. " -- http://en.wikipedia.org/wiki/16-bit

"The intel 80286 had a 24-bit address bus and was able to address up to 16 MB of RAM, compared to 1 MB for its predecessor. However cost and initial rarity of software using the memory above 1 MB meant that 80286 computers were rarely shipped with more than one megabyte of RAM.[8] Additionally, there was a performance penalty involved in accessing extended memory from real mode, as noted below." -- http://en.wikipedia.org/wiki/Intel_80286

"The protected mode which debuted in the 286 was extended to allow the 386 to address up to 4 GB of memory." -- http://en.wikipedia.org/wiki/Intel_80386

" Offset, a 16 or 32-bit displacement referring to a memory location (using any addressing mode). Pointer, a 16-bit selector together with a 16 or 32 bit offset. " -- http://en.wikipedia.org/wiki/Intel_80386

" Just as in the 80386, the 32-bit offset registers (x86-terminology for normal CPU registers) allowed a simple flat 4 GB memory model, by setting all segment registers to zero. " -- http://en.wikipedia.org/wiki/Intel_80486

---

with 3-operand assembly instructions, each instruction is like a node in a graph with 3 directed arcs: two incoming and one outgoing, such that the value of the outgoing arc is fixed by the node type (the opcode/opcode label; note: this suggests that 'type' may be a special kind of annotation, perhaps separate from 'label') and the values on the incoming arcs.

When all values are binary, this is a boolean logic gate. How many such functions are there? There are 2*2 possible incoming value patterns, and 2 possible outgoing value patterns to be chosen for each input, so 2^(2*2) = 16 possible binary logic gate types.
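(that count is easy to check mechanically:)

    from itertools import product

    # a 2-input gate is one choice of output bit per input pattern: a
    # truth table of length 4, so 2^(2*2) = 16 distinct gates
    gates = [dict(zip(product((0, 1), repeat=2), outs))
             for outs in product((0, 1), repeat=4)]
    assert len(gates) == 16
    AND = gates[0b0001]                    # truth table 0001 is the AND gate
    assert AND[(1, 1)] == 1 and AND[(1, 0)] == 0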

if we allow the node to also store a state (a single bit of memory), then there are (2*2)^(2*2*2) = 65536 possible stateful binary logic gate types.

i suppose another fundamental computation model is one in which the arcs are undirected, e.g. the value of every arc depends on the value of every other. I suppose this is like a cellular automata; at each step, the node's state updates, depending on what its previous state was and also on the values at all 3 arcs. However, unlike cellular automata, the gates are not arranged in a regular low-dimensional lattice, but rather in an arbitrary topology.

If the node didn't have any state separate from its inputs (which are all 3 arcs, now), then we'd have 2*2*2 output patterns to assign to each of the 2*2*2 input patterns, so (2*2*2)^(2*2*2) = a very big number (8^8 = 16,777,216) of gate types. If there were an internal state, then there are (2*2*2*2)^(2*2*2*2) = a very big number (16^16) of gate types. If we worked in trinary rather than binary, with no internal state, that's (3*3*3)^(3*3*3) = a very big number (27^27) of gate types.

Another model: we could restrict this to reversible computation. This is like the previous model, except that we demand that, given the values for all three arcs of a gate, and its internal state (if any), as of time t, we should be able to deduce what all three values and the internal state were at time t-1. That is, the mapping from time t-1 to time t must be invertible and therefore injective. This is quite restrictive; i think it means that the mapping is just a permutation. I guess this isn't a model for computation in general, but it is an important special case.

---

so some paradigms to be respected/used:

---

need to find a general way to do things like applying "map" to another function, then doing it

e.g. how to represent the following in Oot assembly:

map (+1) list

---

i guess if we have 16-bit opcodes, and this refers to the same memory space as the operands, then this is the key (to applying higher-order functions). We can start with a function, then transform it by operating on it with a higher-order function, and put the resulting function somewhere in memory, then apply it simply by issuing an instruction whose opcode is the memory storage location of the transformed function.

i guess this only works for functions that take no more than 2 inputs and give no more than 1 output, but that's ok.

we also need a way to do partial function application for these, so that we can eg be given f, and then say "g = map f", even if 'map' is really a function that takes two arguments. i suggest either a flag or a sentinel operand value. If we are having 128-bit instructions then there are plenty of flags.
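(a sketch of what dispatch might look like; the builtin table, the sentinel, and letting operands hold values directly are all simplifications for illustration:)

    NO_ARG   = 0xFFFF                         # hypothetical sentinel operand value
    BUILTINS = {0x01: lambda a, b: a + b}     # ... up to some cutoff

    def execute(opcode, op1, op2, mem):
        # below the cutoff, opcode names a builtin; above it, the opcode IS
        # the memory location of a (possibly transformed) function value
        f = BUILTINS[opcode] if opcode in BUILTINS else mem[opcode]
        if op2 == NO_ARG:
            return lambda y: f(op1, y)        # partial application, eg "g = map f"
        return f(op1, op2)

    mem = {0x1000: lambda g, xs: [g(x) for x in xs]}   # a 'map' living at 0x1000
    inc = lambda x: x + 1
    mapinc = execute(0x1000, inc, NO_ARG, mem)         # g = map inc
    assert mapinc([1, 2, 3]) == [2, 3, 4]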

---

things with more than 2 args can just pass a pointer to the argstruct

---

note: the 64k constant table is per-module

note: code can't fit in 64k. Code involves jumping/branching to many destinations, probably (at the assembly level) more than 64k in larger programs/modules. Perhaps one solution is:

---

if spare bits are provided in the instruction encoding for 'custom addressing modes' or 'custom instructions'/'fast functions', some platform extensions might go ahead and fix the definition of these, causing programs that actually have their own custom addr mode or instruction tables to malfunction. One solution is to double the number of spare bits, e.g. one addr mode bit for platform use, and one addr mode bit for per-program (dynamic) use. But that's a lot of bits (6 mostly unused bits per instruction; and if the opcode similarly has 1 spare bit for platform use and one for per-program/dynamic use, that's 2 more, for a total of 8 bits, leaving only 8 bits for predefined opcodes). Another is to define all of these as per-platform, and then define a special 'platform emulation platform' that allows redefinition; this has the benefit that support for the complex custom addr mode remapping stuff is not part of std oot assembly, and so doesn't have to be supported by most implementations. This requires only 4 reserved bits per instruction (3 addr mode, and 1 opcode; but we can just say '4 reserved bits' and let the platform use those as it will); this leaves us 12 non-reserved bits in the first of the 4 fields, of which 6 are taken up by addressing mode bits, leaving 6 bits for opcodes, or 64 opcodes. This is the tightest we would contemplate, as even the Lua VM has over 32 opcodes, and for the bloated x86 ISA i elsewhere counted about 678 mnemonics. Then there is a notion of a 'std oot program', which is one that doesn't use any platform-specific functionality.

i still sometimes flirt with the idea of having tagged memory, e.g. having the last 3 fields (and every memory loc) split into 12 bits or so of data, and 4 bits or so of metadata, and perhaps the first field, too, if we want to use a standard addressing mode on it. Or tagged pointers (descriptors?). This would allow nested lookups. My arguments against this were (a) efficiency loss, as implementations will natively deal with 8-bit and 16-bit quantities but not e.g. 12-bit ones, so each of the 4 fields must now be 'decoded' into an e.g. 12-bit subfield and a 4-bit subfield, (b) implementation complexity.

(b) is negated if the same or similar scheme is used identically on all 4 fields, since we have to split up the first field in any case.

if we could restrain ourselves to only 2 addr mode bits per field, then we could stuff them all into 8 bits in the first field, negating (a) and leaving space for 8-bit opcodes (256). This doesn't leave us with any reserved bits, however, nor does it allow for nested lookups.

using std addressing mode lookup on the first field would have various benefits:

also, we should probably fix which memory locations will be mapped into normal memory, and which are 'special memory' (e.g. stacks etc), even if most of the locs in special memory are currently reserved. The above suggests that we reserve the first 256 locs as 'special', although many of these are ordinary 'registers'.

---

we probably want a mechanism to attach __GET to memory locations. The obvious mechanism would be tagged memory. Another obvious one is to only recognize __GET when using special 'fetch' opcodes. Another one is to use the virtual MMU functionality to trap reads from certain memory locs. The first and third of these solutions impose an overhead on all or many memory accesses.

---

should we provide MMU functionality?

consists of: a page table, possibly with a 'nested page table' structure (for virtualization). do we want to provide for lookup via handler for some entries in the table, or just specify the page table data structure? (even if there is a handler, we can assume that 'hardware' will cache into a TLB instead of calling the handler each time). potential per-page metadata in the table: no_execute, dirty, cachable, memory concurrency guarantees (eg write-combining). if we have nested page tables, do we want to in addition allow some nodes to trap into a handler via Oot __GET?

benefits of an MMU: this would seem to be needed for efficient emulation.

costs: this would seem to impose an overhead on every memory read or write.

perhaps the program could specify that all memory locations below (or would above be better?) a certain value are 'real memory' (except for the 'special' low memory, in the case that we choose 'below' earlier in this sentence), and don't have to be looked up via the MMU? This would allow programs to opt out of most of the virtual MMU overhead (there may still be the small overhead of the extra compare and branch to determine whether it's real memory, if we choose 'below'), or to accept the overhead for only some memory accesses.
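(a sketch of that opt-out: one compare decides whether an address goes through the page table at all; the page size, threshold, and entry fields are illustrative:)

    PAGE = 4096
    REAL_MEM_LIMIT = 1 << 20     # program-declared: below this, 'real memory'

    def translate(addr, page_table):
        if addr < REAL_MEM_LIMIT:
            return addr                        # opt-out: just a compare + branch
        entry = page_table[addr // PAGE]       # 'hardware' would cache this in a TLB
        if 'handler' in entry:                 # optional trap, a la Oot __GET
            return entry['handler'](addr)
        # per-page metadata (no_execute, dirty, cachable, ...) would be checked here
        return entry['frame'] * PAGE + addr % PAGE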

---

on tagged pointers:

the info table thing makes me think, hey, instead of trying to stuff everything into tagged pointers, perhaps just support something like an info table at the Oot Assembly level and let the issue of what information is stored with the pointer and what isn't be a platform-dependent implementation issue

todo toread:

---

to make room for one more addressing mode, we could combine immediate and constant table immediate, saying that immediate only goes to 32k, and a value above 32k is a constant table immediate reference. But do register and register indirect really need 32k more addressability than immediate and constant table? Why not just take away a bit uniformly if we're going to do that?

i suppose really we only need one addressing mode, register; immediate, constant-table immediate, and register indirect could be replaced by special instructions (loadi, loadci, load/store). This would create a lot of unnecessary register traffic, however.

y'know, even with only 4 addressing modes, we won't be doing many 'fast functions' if we allow addressing in the first field, because that's already 2*4 = 8 addr mode bits, so only 8 bits are left for the opcode 'operand'

which makes me seriously want to consider doing everything in 12 bits, not 16, in which case we'd have 8 bits left over. ... but 4k might not be enough to cover all variables and functions likely to appear in a module? and we might need more than 4k constants to cover all of the jumps over 4k likely to appear in a module?

otoh Lua has a stack size of 256, a limit of 200 locals, and 60 upvals: http://luaforge.net/docman/83/98/ANoFrillsIntroToLua51VMInstructions.pdf . and we could always have jump and branch instructions that combine the two operands, giving 24-bit jumps instead of 12. in fact, we could also have a "load_immediate_constant" instruction that combines the two operands, allowing access to a 24-bit constant table! which is a lot.
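(ie, with 12-bit operands:)

    def combine(op1_12, op2_12):
        # jump/branch or load_immediate_constant with both operand fields
        # fused into one 24-bit value
        return (op1_12 << 12) | op2_12

    assert combine(0xFFF, 0xFFF) == (1 << 24) - 1   # 16M, which is a lot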

could always increase the instruction size over 64-bits. But the next obvious thing to do is double it, and that's some big instructions! and we don't really need another 64 bits, we only need another 8 or 16 or so. But 72 or 80-bit instructions? i dunno if it's a good idea for the instruction size to not be a power of 2.

so this is looking attractive. But what about the inefficiency of unpacking every operand? well, Lua does it, so it can't be that bad.

otoh 'nm /usr/lib/x86_64-linux-gnu/libc.a | wc' gives about 16k symbols already (as does nm /usr/lib/x86_64-linux-gnu/libc.a | grep -v '^$' | grep -v '.o:' | wc); nm /usr/lib/x86_64-linux-gnu/libc.a | grep -v '^$' | grep -v '.o:' | grep -v '_' | wc gives about 2.8k symbols

---

in addition to the commands that squash the two operands together, could also have an LEA-like command that evaluates the first operand, then indexes into it using the value of the second operand on that base.

so, even if the operands were limited to 4k in size, if each module only exported fewer than 4k functions, and if there were fewer than 4k modules total imported into the program (including submodules of imported modules), then without analyzing stuff, you could use this instruction to load a reference to any accessible function. if there were more than 4k total imported/accessible modules, you could still load each of them into a 24-bit constant table (it's unlikely there would be more than 64k accessible modules).

and with analysis, you could identify the 2k or so most used variables and 0-ary or unary or binary functions, and load references to them into the constant table

(the way Lua does it is to load function references into registers with GETGLOBAL, and then call a function reference in a register with CALL; GETGLOBAL implicitly uses LOADK to load from the constant table, and then looks up the loaded string in the globals table; GETGLOBAL has two operands, an 18-bit global identifier and the destination register; so this would be like the two-step procedure given above as a fallback: load a reference to the function via the 24-bit constant table into a 'register' (a local variable, of which we have 4k), then execute it from the 'registers')

---

after sleeping on it, i still like the 12-bit thing. While a little icky, the efficiency cost of shuffling a few bits for each field for each instruction is probably relatively small; it's not like it's an additional memory access or anything. Increasing the instruction size beyond 64 bits might be an appreciable cost, as (a) 64-bit machines, which are currently standard, are probably great at fetching 64-bit words, and they might be slower at fetching, say, 72-bit words (b) 64-bit (8-byte) instructions are already 'bloated' and just for the sake of having a design constraint it seems like it would be good to be reluctant to push that further (c) i don't understand the cost, if any, of having fixed non-power-of-two instruction widths.

The 4k limits will not catch everything, as 64k would, but it'll work for the vast majority of cases (the 80/20 rule, but probably more like 99% than 80%).

I like being able to address the first field, and to use this for single-instruction calls to first-class functions with no more than 2 inputs. This should let many Haskell-ish algorithms be expressed in a natural way in Oot assembly. Because sometimes we'll hit the 4k limits, sometimes we'll have to fallback to multistep function calls, but at least this will be less common.

I REALLY like that i have a spare 2 bits per-field now. Whether laziness or error-handling or 'execute this' addressing modes or custom addressing modes or nested indirection or emulation or reversible computing or quantum computing or something else (somehow combining combinatory logic, lambda calculus, logic programming, and turing machines?), i bet i'll find some use for these.

---

also, one note: we are back to having a spare bit, or at least two spare addressing modes, in the destination operand, because there's no need for immediate and constant addressing modes there

(except: we do need a way to specify that the return argument is discarded)

---

interesting discussion of an ISA proposal:

https://fail0verflow.com/blog/2012/dcpu-16-review.html

the only thing i really learned from it that matters here (that i didn't already know) is the need for "instructions that add in (the overflow register) in one go, much like “add with carry” and “subtract with carry” in other CPUs. This allows for much more compact code that can add/sub/shift wider ints at one instruction per word."

---

otoh if we WANTED to find an excuse to exceed 64-bit instruction, we could have a go at my 5-based proposal.

So now if each field's payload is x bits, we have 5 fields, for 5*x bits so far, and if each field's addressing modes and other per-field metadata were y bits, we'd have

5*(x+y)

so far. Plus we'd probably want z bits of overall flags, so 5*(x+y) + z. So if x = 16, y = 4, z = 16, we already have:

116 bits

which is only 12 bits from 128 bits.

if x = 12, y = 4, z = 12, we'd have:

92 bits

which is 4 bits from 96 (a multiple of 12), and 36 bits from 128.

---

we could indicate a throwaway result by writing it to memory location (register) #0. This could be a 'zero register'.

---

what would 'the fifth element' in the 5-element system be?

tenses, whether or not the statement is 'hypothetical'/quoted?

which turing-equivalent machine type the instruction is written for?

which adjacent instructions the instruction is connected to (branching, conditional execution, or some sort of weird neural connectionist/graph edges thing)?

reified graph or hypergraph stuff or cons cells or parentheses?

'looking outward' to represent position within an expression tree (instead of explicit stack ops etc to represent a computation path through an expression tree linearly) (perhaps either or both of absolute identifiers or offsets to adjacent nodes)?

annotation (type or name or (provenance or probability or trust) or otherwise)?

annotation of runtime information/runtime scratchpad, such as whether the result of this node has been memoized and if so, where?

oot views?

'levels'?

whether addresses are to follow bindings and __GETs or to act directly/concretely on their stated target even if it's a symlink?

whether or not metaprogramming overrides of instructions are to be respected?

something else regarding layers of indirection?

which language the instruction is written in?

data flow routing between operands or between instructions?

concurrency/parallelization stuff?

mutable state domains, state versions, 'worlds'?

transactions this instruction is inside?

list of 'code regions' this instruction is inside / list of metaprogramming directives it's subject to?

error handling/'backwards execution'?

'receptors'/component architecture?

instruction encoding format?

more than one of the above?

body of assumed knowledge (logic programming) (like state version?), or mb body of permitted inference rules?

memory ordering consistency model, reordering time (relation to state version) ?

primary key (annotation; redundant in that memory loc already provides a primary key)? could memory loc already be an implicit 5th field? perhaps...

(note: this is mainly in the context of possibly using an extra field for this, but also in the context of possibly using the spare addr mode bits for it)

---

my current thought is that the best 'fifth field' would be a primary key/unique identifier for this statement (a form of 'meta'), so that arbitrary annotations could be attached to it elsewhere; however, as noted above, the memory location at which the statement is found serves as an implicit primary key, so we don't really need an explicit 5th field to have this in every instruction.

so, for now let's stick with 4 fields.

which means we can fit it in 64 bits, which means we should. so if we want the first field to be more than 8 bits, so that we can practically use 'fast functions' as such and use memory-to-memory first-class functions with some headroom for lots of in-scope variables, and we also want at least 4 addr modes, then we should use the 12/12/12/12 bit format, not the 8/16/16/16 bit format. so we have 9 spare bits (2*4 + 1): 2 spare addr-mode-ish bits per field, plus one spare bit for the destination field, because both the immediate and the constant addr modes are useless for that one (really 3 spare bits for the destination field, if you assume the addr mode bits must be used the same way across all 4 fields).

i feel joy looking forward to getting to put those spare bits to use

---

some ideas on using these:

well, any sort of per-field (per-argument) annotation

lazy/strict

quoted symlink vs follow the symlink implicitly; or similarly, quoted __get and __set vs access the field directly

parallelism annotations, eg 'it's worth it to spawn a thread to do this sub-computation'

'execute what you find there and give me the result' (orthogonal) address mode

async annotation, eg 'start computing this argument now and sleep this thread until it's done' (or if async is the default, blocking annotation)

error handling mode, eg 'return the null in a nullable type if there's an error, vs. raise an exception if there's an error'

copy-on-write vs. shallow-copy (alias) a reference

stack addressing modes (to make it easy for the reader to 'paste together' various linear instructions into an expression tree, eliminating the intermediate output registers)

16 address modes instead of 4

take the addr of (the '&' prefix in C) (the opposite of the register indirect addr mode, which is the '*' prefix in C)

note that the above are just shorthand for doing things that you could do by pre-processing the argument or post-processing the result (in some cases 'by default', eg if you want __gets and __sets and/or async and/or strictify to be the default): eg strictify (if lazy is the default) or thunkify (if strict is the default) with a separate instruction as a preprocessing step, taking from a register and storing back into that register; similarly, follow a symlink (if that's not the default) or fetch the quoted (potential) symlink (if following is the default); execute the function you find in the register; launch a parallel or async process; process errors in the argument; COW or shallow copy the data; etc. That's okay because aside from 'register', the other 3 of our 4 core addressing modes are also like that: eg instead of using an immediate addr mode you could precede this instr with a LOADI instruction, and instead of register indirect addressing mode you could precede this instruction with an LEA instruction. But we're (tentatively) allowing those 3 addressing modes because they reduce the need for very temporary, statically introduced/eliminated variables (registers), which make the code harder to read, and which require a dataflow analysis to precede various optimizations. So the above are the same idea. (otoh maybe we actually should get rid of those 3 addr modes! the large number of things in the above list suggests that we won't be able to really get rid of such temporary variables anyways, so if the reader/automatic code analyzer already has to do dataflow/reconstruct an expression tree in any case, why bother to minimize temporary variables at all?)

others ideas that aren't like that (or at least, might not be like that):

quote/antiquote inside macros (i think what i mean here is quasiquote, unquote, splice: see http://www.gnu.org/software/mit-scheme/documentation/mit-scheme-ref/Quoting.html ; note that a quasiquote is just a quote that recognizes unquote and splice inside of it)

implicit map, eg 'if this is a list, implicitly map this over the list and return a list, vs. operate on the whole list'

provide cardinal directions for how to locate this instruction vs. its arguments in space, e.g. 'this argument is South of this instruction'

(or, 'input' vs 'output' as a direction, whatever that means); mb for asking questions, ie inverses: "C = ADD(A,B)" is really "? = ADD(A,B)", but you could also ask "C = ADD(A,?)", which is equivalent to subtraction (see the sketch just after this list)

something about the 'reverse computation' mode of bubbling up exceptions, etc

something about locking, coloring arguments the same way if they need to be synchronized, or ordering argument evaluation (or saying they can be unordered) or something like that

even if things are async by default, whether the current instruction should begin executing immediately and immediately return a promise, or should be encapsulated in a thunk and lazily wait until its result is needed (a third alternative, which is what normally happens in ordinary, synchronous languages, would be to block until the current instruction completes; but this can be achieved by doing either of the above and then afterwards executing a special block/strictify instruction on the result location)

custom addr mode/instruction wrapper

language/universal machine paradigm selector

as a generic boundary/mark used by other metaprogramming features
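
as a toy illustration of the 'asking questions'/inverses idea from the list above, here's a minimal sketch (Python; the function name and calling convention are invented) of treating a missing argument as the thing to solve for:

    # hedged sketch: 'C = ADD(A,?)' as solving for the hole
    def add_query(c=None, a=None, b=None):
        # exactly one argument should be None (the '?'); return its value
        if c is None:
            return a + b    # ? = ADD(A,B): ordinary addition
        if b is None:
            return c - a    # C = ADD(A,?): equivalent to subtraction
        if a is None:
            return c - b    # C = ADD(?,B): also subtraction
        raise ValueError('no hole to solve for')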

in general one could look at these either as arbitrary colored 'boundaries' in the instruction graph, or as a modification to the way that the operand is interpreted that changes the destination of the edge we draw between the node representing this instruction and the node representing its argument.

note: the above ideas also give some ideas to the sorts of 'defaults' that we want in the oot library ecosystem; things like laziness, __get and __set by default, symlink-ish binding, immutable data, async by default, inverses; these are things that are 'just' syntactic sugar for common patterns, but that you want to be in the language in order to encourage the library ecosystem to default to them

---

in the previous, we think again about the comparison between a linear language representing an expression by writing intermediate results to registers (or to the stack), vs an expression tree. Consider the idea of annotations being cardinal spatial directions; really, the registers or stack locations form 'directions' in that, if one is just doing an expression evaluation, the expression tree can be reconstructed by pasting the bits of code together that read/write to the same register (or stack location); eg "return 1 + (2*3)" could be "t = 2*3; return 1 + t", and you can reconstruct the original tree by substituting in t.

When it doesn't work like that is when the intermediate variable will be used multiple times (hmm, echoes of linear logic?), or when its evaluation causes side-effects; "t = 2*3; return 1 + t + 2*t" is more efficient than "return 1 + (2*3) + 2*(2*3)".

Which suggests another use for annotations: to say 'this variable is just a one-time temporary variable and you can throw it away and reconstruct the expression tree'. Note that 'strict stacks' with only PUSH and POP and no peeking enforce the same constraint, given the additional constraint that each expression is only permitted to add exactly one thing to the stack (i think... i haven't thought it through, that might not be true). This leads to the interesting idea of having a special stack register, or addressing mode, that allows the compiler to make its intent clear when it is just creating a linearization of an expression tree.

Related to this is the idea of having a special stack register or addressing mode upon which a JVM-esque guarantee of stack layout static predictability is enforced.

---

out of the above, the only ideas that jump out at me as almost a must-have are:

---

more ideas for using those 3 bits of flags:

atomicity

uniqueness typing annotations

reference counting

boxing (boxed vs. unboxed values)

something relating to capabilities

whether or not runtime type checking is needed

whether or not runtime bounds checking is needed

'who' is running this code (and what to do/who to notify if there is an error, eg what is the 'robustness paradigm' (is it 'continue going no matter what' or 'if the correct answer is uncertain, refuse the temptation to guess'? do we need to run this operation multiple times and compare results to be sure?))

evaluation strategy (call by name, call by need, call by value, call by reference, etc)

variable sigil annotations, eg to mark some variable within a probability expectation expression as a random variable

hmm, i guess this adds some color to the idea of switching computing modes between turing-equivalent paradigms; eg maybe in a Turing mode "C = ADD(X,Y)" is an imperative (with one-directional flow of information, and where X and Y and C are registers), whereas in a Logic mode it's a fact, where X and Y and C are just 'variables'. so could have imperative vs logic vs quantum/fixpoint-of-gates vs combinatorial/callByName vs matching graph structures (SATisfaction?) (= logic? = assertions?). eg, C = ADD(X,Y):

imperative: "take X and Y and add them and put the result into C"

logic/satisfaction/matching/assertion: "assume that it is a true fact that C = ADD(X,Y)" or maybe "put the fact ADD(X,Y) into C"

quantum/fixpoint-of-gates: "assume that it is a true fact that C = ADD(X,Y), where this is a dataflow that is being continuously re-evaluated"

combinatorial/callByName: "take the program code X and the program code Y, combine them under an ADD node, and put the result into C"

mb characterize in terms of time? something is: statically universally true (a truth), something that the computer should do now, something that is ongoing/recurring, something that has been done, something that can be done whenever, etc
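
to make the paradigm-mode idea a little more concrete, here's a minimal sketch (Python; the mode names and data structures are just stand-ins for whatever the real mechanism would be) of the same instruction being read under three of the modes above:

    # hedged sketch: C = ADD(X,Y) under different paradigm modes
    def interpret(mode, c, x, y, regs, facts):
        if mode == 'imperative':
            regs[c] = regs[x] + regs[y]   # do it now; one-directional flow into C
        elif mode == 'logic':
            facts.add(('ADD', x, y, c))   # assert a relation; no direction of flow
        elif mode == 'callByName':
            regs[c] = ('ADD', x, y)       # combine program code under an ADD node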

---

CNOT "the controlled NOT gate which XORs the first bit to the second bit and leaves the first bit unchanged" CCNOT "It has 3-bit inputs and outputs; if the first two bits are set, it inverts the third bit, otherwise all bits stay the same." and CSWAP the three-bit gate that swaps the last two bits if the first bit is 1.", two universal gates for classical computation that are notably generalized to quantum gates http://en.wikipedia.org/wiki/Quantum_gate

---

note: if we want to allow 3-element 'gates' like CCNOT and CSWAP, then i guess we still CAN use the special addr semantics for the dest argument, since in these gates the dest is always still potentially written to, it's just that sometimes it's also read from, and sometimes other addrs are also written to

note: i don't quite see how quantum circuits could be analogous to classical circuits; if you have a gate which swaps two of its operands, then that gate can't be continually active, can it, because in that case if you feed in a "1" and a "0" those'll be swapped infinitely fast. hmm.. but maybe that's the point; then we end up saying that the state of those nodes becomes a superposition of 0 and 1; i guess we're looking for a fixpoint for the circuit as a whole.

---

i guess boxed/unboxed might be another almost-'must have'

immediate mode doesn't make sense with the destination operand, so i was thinking of using that bit for something else, such as whether to start executing the current instruction now, or wait until it is needed.

but boxed/unboxed also doesn't make sense with immediate mode, so maybe i should combine one of the other bits into a 3-bit addr mode that includes boxed/unboxed-ness. but then the dest operand wouldn't have an 'immediate mode' bit free.

also, the maybe/option/exception tower

perhaps the maybe/option/exception tower could be combined with 'boxed' and with the 'indirection flag'

could mb want a per-operand bit for a synchronization/atomicity; eg each instruction guarantees atomicity with respect to those operands which have their synchronization/atomicity flag set

but maybe a per-instruction 'lock' prefix instruction (plus lock annotations for multi-instruction sections) are good enough; per-operand granularity might not be needed very badly

we could also have a 'threadlocal' bit. But this is just like saying that half of the memory, by address, is threadlocal, and adding one bit to the memory address width.

do we want to guarantee that some memory locations are threadlocal? my instinct is to guarantee the first 256 locs, or some portion thereof, and to reserve some of those for ordinary scratch use. this is closer and closer to real registers. if we have register-like things, we need at least 32 of them. and probably at least 4x32 and a 'register window' thingee so a supervisor process can swap them out easily while swapping threads.

also, although it's unnecessary, we may want a displacement addr mode, as it's supposedly very common (see Self:proj-plbook-plChAssemblyFrequentInstructions)

---

should memory location '0' be reserved to represent an undefined pointer, and cause an error when read or written to? or should it represent 'discard' when writing to it, and 0 when reading from it?

https://en.wikipedia.org/wiki/Tagged_pointer#Null_versus_aligned_pointer and https://en.wikipedia.org/wiki/Tagged_pointer#Disadvantages allege that treating memory location 0 as an invalid pointer is extremely common, to the point of being a standard

---

apparently on the VAX, using an immediate addr mode in the dest operand caused self-modifying code! [8] do we want that?

---

so some 'registers' we want are:

0 reg

accumulator (for when more than 2 inputs or 1 output is needed)

status flags

pc

4 stacks

do we need threadlocal regs? do we need more kinds of sentinel/invalid pointers?

---

some more support for using 12-bit addrs: in the ARM ISA, page size starts at 4k and goes up to 16MB 'super-sections'; this is 12 bits to 24 bits [9]

---

this page is interesting: http://esolangs.org/wiki/User:Ian/Computer_architectures

---

i guess if you want more addressing modes, you can use one addr mode bit to mean 'split the operand in two, apply register addressing to one of them, and apply the specified addr mode to the other, and add the results'

---

in that case i think we'd have a complete system:

4 fields. First the destination/output field, then the opcode field, then 2 input fields (some instructions may combine some of these fields).

for each field:

first 4 addr mode bits, then 12 operand bits

for each of the first 3 fields:

there are four basic addr modes: immediate, constant immediate, register, and register indirect. Two bits are used for this; the first bit says whether or not the mode is one of the two immediate modes, and the second bit selects within the pair.

there is one bit that indicates whether or not the data item is 'boxed'. This may as well just be a 'custom addr mode' bit.

there is one bit that indicates whether or not the operand should be 'split' into two 6-bit pieces.

for the fourth, destination field:

same as above, except that there are no immediate or constant immediate modes, only register and register indirect. The bit that was used to indicate immediate (which is the first bit in every instruction) is put to another use. Probably one of:

the first choice: 'whether or not the two input operands are combined'; this is redundant, but allows the decoder to immediately begin decoding the opcode and input operands in parallel without wasting work (the wasted work would be decoding each input operand assuming they are separate, and also decoding them together, and then choosing which result to use based on the opcode later). i guess avoiding that wasted work isn't terribly important, so maybe that would be a bad choice?

we can generalize this into a 'custom instruction operands format' bit. In other words, the opcode field is interpreted normally, but the instruction gets to interpret the other 3 fields (48 bits) however it wants. This allows user-specified functions to implement things like a packed representation of 4 no-operand instructions, etc. Which in turn means that Oot itself may as well have such things (but this opens up a small can of worms, namely, we'd really like some bits to specify how to link multiple instructions together with stacks, etc., but we don't appear to have any left; we could try to use this 'custom instruction operands format' itself to indicate that this 64-bit instruction is really 4 separate instructions, but then the second instruction couldn't be immediate/constant mode, which sucks)

if we DIDN'T generalize it into a 'custom instruction operands format' bit, and instead used it as a 'packed encoding' bit, we could pack 2 instructions with 4-bit instead of 12-bit operands into 64 bits (8 x (4-bit addr + 4-bit operand)) (and we'd still have a bit free from the immediate-mode destination in the second packed instruction) (note that only the first 16 opcodes are accessible in this way, though; although we could make up a more elaborate 'packed' format to enable more opcode bits there; eg if we cannibalize the destination field of the first packed instruction, implicitly pushing it to the first stack, then we gain 8 bits and can use that to make both opcode fields 4+4=8 bits; however this can't quite work, because the first destination field is where the bit we were using lives; so place the packed instructions backwards, and now the bit we used is in the second instruction's destination field, where it is okay; now we have 2 instructions with 8-bit opcodes (256 possible opcodes) and 4-bit operands; that seems pretty useful, actually, i bet a large percentage of code would fit into that; and a smart compiler or interpreter could optimize away the stack push in the first instruction if the second instruction pops from the same stack).

But that makes things more efficient without adding expressivity.

or we could use it for autoincrement. But that also makes things more efficient without adding expressivity.

or we could use it to 'comment out' this instruction, and/or to use it as an annotation rather than an instruction. This could be important because currently the opcode field is first resolved using its addressing mode bits; so if annotation were just a range of opcodes, that would force us to do resolution even on annotations, and it would also leave the awkward possibility that an innocent-looking instruction could be an annotation depending on the dynamic state of registers (that might be useful though for some annotations like transactions :) but seriously, it breaks the idea of annotations as static).

so far, annotation sounds like the best bet. in which case, yes, we'd want to move that field to the front so that this would be the first bit.

heck, let's keep it undecided for now, leaning towards annotation.
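
as a concreteness check, here's a minimal sketch (Python) of pulling the 4 fields out of one 64-bit instruction; the field order and the bit layout within the addr-mode nibble are my assumptions from the description above, and are still subject to the bit-shuffling discussed later:

    # hedged sketch: 64-bit instruction = 4 x (4 addr-mode bits + 12 operand bits)
    # assumed field order: destination, opcode, input1, input2
    def decode(instr64):
        fields = []
        for i in range(4):
            f = (instr64 >> (48 - 16 * i)) & 0xFFFF
            mode = f >> 12            # the 4 addr-mode bits
            operand = f & 0xFFF       # the 12 operand bits
            fields.append((mode, operand))
        return fields                 # [(mode, operand), ...] x 4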

---

TAL

TALT (Toward a Foundational Typed Assembly Language, Karl Crary)

TALx86: A Realistic Typed Assembly Language

https://www.google.com/search?q=eli+gottlieb+programming+language&oe=utf-8&oq=eli+gottlieb+programming+language&gs_l=mobile-heirloom-serp.3..30i10.13666.16574.0.16726.23.11.6.0.0.1.313.1539.3j4j2j1.10.0.msedr...0...1c.1.34.mobile-heirloom-serp..13.10.681.tzB73Id3HX8

https://www.google.com/search?q=%22functional+assembly+language%22&oe=utf-8&oq=%22functional+assembly+language%22&gs_l=mobile-heirloom-serp.3..0i30j0i8i30.2229.4766.0.5044.2.2.0.0.0.0.281.443.0j1j1.2.0.msedr...0...1c.1.34.mobile-heirloom-serp..0.2.443.-xG-OiOU5_E

https://www.google.com/search?q=%22lazy+assembly+language%22&oe=utf-8&oq=%22lazy+assembly+language%22&gs_l=mobile-heirloom-serp.3...13629.16885.0.18865.14.12.0.0.0.1.328.1830.4j5j2j1.12.0.msedr...0...1c.1.34.mobile-heirloom-serp..7.7.988.mgxYJ5ga8lc

https://github.com/ppedemon/Bluejelly

---

so how are the fixed opcodes specified? i guess they would be immediates. But then how are fast functions specified? they should be immediates, too, right? so is the convention that, for any given module, the meaning of all 4096 immediates is defined in a fixed way for the whole module?

or is the convention that an immediate just actually refers to one of the first 4096 memory locations, and you execute the fn stored in that location?

in the latter case, mb put fixed instructions at the beginning of the constant table, rather than immediate, to free up low registers? in that case, there may be a problem for split constant addr mode in operands, because taking one of these addresses and then adding something else is meaningless. or, define the opcode field to subtract 256, and then use the first 256 opcodes as fixed instructions, and the rest as references to whatever function is in register r - 256 (now the high 256 regs are inaccessible as fast functions). and/or is immediate mode signed? if so, could just reserve half the addr space, and cram the stdlib in there

hmm, this seems too complex. What if we use the first convention: each module somehow fixes the definition of the 4096 'immediate' functions. Using the convention that a 'function' is specified by an entry address, some of these 4096 table entries could actually be pointers to lookup tables, allowing the split addressing modes to be useful (e.g. if the opcode field contains 'unboxed, split, immediate: 293, 3' that means 'unboxed: immediate 293 + register 3', which means 'to get the entry point of the function being called, take the per-module-fixed address #293 and add the contents of register 3 to it'; in other words, the per-module fixed address table will contain, at entry 293, a pointer to a lookup table, and register 3 contains an index into this table; the contents of the lookup table are pointers).

hmm. The dumb part of that is that now immediate mode is kind of redundant, because constant immediate mode is also some lookup table. Also, this cripples the use of split addressing; what if we want a constant index 293 into the table whose base address is in register 3? We'd like 'split, immediate: 293, 3' to mean that, not what we just said.

I guess what is simplest is neither of these two systems, but rather: 'a reference to a 'function' is a pointer to the entry point of that function; therefore, executing function 'immediate 293' means 'JMP 293', e.g. address 293 should contain the first instruction of the function, 294 should contain the second instruction, etc'. So in that case, the way to get to the fixed instructions would be the constant immediate addressing mode.

If we're worried about cluttering the low constant table (preventing the first 64 entries, which is all we get for the split addressing in constant immediate mode, from being used for anything useful), we could either say that there are two constant tables, one for the opcode field and one for the other fields, or we could say that it's the high constant table, not the low, that we reserve for fixed instructions. Only the latter of these allows the low constant table to be used by the program to create 64 constant lookup tables that can be used for register-selected fast functions (admittedly, this is probably a weird special case, but maybe it would be good for emulation?), so maybe the latter is better. In other words, reserve the last 2048 entries in the constant table, allowing the program to have only 2048 of them. This is making our 12-bit space look even more crowded, but i think we can handle it (if we want more than 2048 constants, we can load an immediate into a register, then use that register as an index into a table whose base address is in the constant table; giving us 2048*4096 total entries, ie 23 bits' worth, about 8M entries).

So, to sum up: an opcode is just like an assembly-language CALL (put the return addr into a link register, then JMP) to the address indicated by the resolved value of the opcode field. Ie, go to the effective address specified by the opcode field, take the value there, use that value as a pointer and CALL it (JMP to it). So if the field says 'immediate 3', then actually JMP to the sequence of instructions held in register 3, register 4, etc; that is, set the PC to 3.

To access fixed instructions, use 'constant immediate' mode.
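
a minimal sketch of that dispatch (Python; resolve_referent and the vm fields are invented names, standing in for the addressing-mode resolution described above):

    # hedged sketch: executing the opcode field = CALL to its resolved referent
    def dispatch(vm, opcode_mode, opcode_operand):
        entry = resolve_referent(vm, opcode_mode, opcode_operand)  # fn entry address
        vm.link_register = vm.pc + 1   # put the return addr into a link register
        vm.pc = entry                  # then JMP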

---

hmm, maybe a better idea: something else we could do to help with emulation: treat immediate mode in the opcode field specially, indicating that we should only execute a single instruction, and then return; so immediate mode 3 doesn't mean 'JMP to 3', but rather 'execute the instruction in register 3 in emulation'.

---

or, how about: to resolve the problems with cluttering up the low constant table, use a different constant table for split addressing and normal addressing.

and to resolve the problems with split immediate, just define 'immediate non-split' to be special, to reference the fixed instructions.

---

what does 'boxed' in the opcode field mean? i think it means the normal thing; the referent is not a plain pointer, but some sort of thunk or nullable/maybe/optionType thunk, whose base type is pointer. We can also work in indirection/nested/chained pointer dereferencing here, e.g. a boxed pointer could be a thunk that says 'go to address X, fetch value x from it, and then do an LEA on the value of x eg as if x were the field'.

---

so, to sum up:

---

is it easy/worthwhile to have one or two one-instruction restricted CASE statements? Seems like you could reasonably do this with 4 cases; since you have 48 operand bits to work with, use 4 of them for an address mode, and now you have 44 bits, which is 11 bits per case entry point (PC-relative, i assume). You could maybe also have 8 cases with 5 bits per case entry point (again, PC-relative).
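
a minimal sketch of the 4-case variant (Python; i'm assuming the 44 bits hold 4 consecutive signed 11-bit PC-relative targets, high bits first):

    # hedged sketch: unpack 4 signed 11-bit PC-relative case targets from 44 bits
    def decode_case4(bits44):
        targets = []
        for i in range(4):
            t = (bits44 >> (33 - 11 * i)) & 0x7FF
            if t & 0x400:       # sign-extend the 11-bit value
                t -= 0x800
            targets.append(t)   # add each to the PC to get a case entry point
        return targets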

---

seems like a goal of Oot assembly is shaping up to be:

a linear, fixed-width (or at least fixed-width packs) instruction set architecture that reads more like a high-level language than usual, thereby allowing the preservation of some 'intent' and less need for inessential register transfers; with limited customization/parameterization points (customizable operand decoding, customizable unboxing).

---

so the current proposal is:

4 fields. First the destination/output field, then the opcode field, then 2 input fields (some instructions may combine some of these fields).

for each field:

first 4 addr mode bits, then 12 operand bits

for the destination/output field:

for the opcode field:

for all fields, except for the special cases given above, the typical interpretation of 4 addr mode bits is:

immediate: the referent is the operand.

immediate constant table: the referent is the entry in the static per-module constant table indexed by the operand.

register: the effective address is the memory location pointed to by the operand.

register indirect: the effective address is the value in the memory location pointed to by the operand.

In general, if an effective address is given, the referent is the value in the memory location indicated by the effective address.

If the operand is unsplit, then the operand is 12-bits and is resolved as indicated above. If the operand is split, it is composed of two 6-bit operands. The first of these is resolved using the given addressing mode. The second is resolved using 'register' addressing mode. These two values are then added together, and the result is taken as the effective address. If the operand is split and the mode is immediate constant, a different constant table is used than for the unsplit case.

If the addr mode is 0011 (immediate boxed), a special case obtains (todo, split operand with other kind(s) of parens in register indirect addr mode; see 6502 X and Y registers and associated addressing modes? but wait then we can't specify boxedness of the result; so let's do something else).

Some instructions may choose to reparse/specially parse their 48 bits of input and destination fields, for example, in order to take 1 input with 28 operand bits instead of 2 inputs with 12 operand bits each.
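
here's a minimal sketch of that resolution (Python); the bit order within the addr-mode nibble, the table names (vm.mem, vm.constants, vm.split_constants), and the reading of 'resolved' for the register modes in the split case are my assumptions, and the 'immediate boxed' special case is left out:

    # hedged sketch: resolve one field (4 addr-mode bits + 12 operand bits) to its referent
    IMM, CONST, REG, REG_IND = range(4)

    def referent(vm, mode2, operand, split=False):
        if mode2 == IMM:
            return operand                        # immediate: referent is the operand
        if mode2 == CONST:
            table = vm.split_constants if split else vm.constants
            return table[operand]                 # per-module constant table
        ea = operand if mode2 == REG else vm.mem[operand]  # register / register indirect
        return vm.mem[ea]

    def resolve(vm, mode4, operand12):
        mode2 = mode4 >> 2         # the two basic addr-mode bits
        boxed = (mode4 >> 1) & 1   # boxed bit; ignored in this sketch
        split = mode4 & 1
        if not split:
            return referent(vm, mode2, operand12)
        hi, lo = operand12 >> 6, operand12 & 0x3F      # two 6-bit operands
        first = referent(vm, mode2, hi, split=True)    # first half: the given mode
        second = vm.mem[lo]                            # second half: 'register' mode
        return vm.mem[first + second]                  # their sum is the effective address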

in other words, useless addressing modes that can be used for other things include:

---

ok wait, actually unsplit immediate boxed totally makes sense, that's boxed literals. This allows us to create small boxed literals without referencing the constant table. So we need to specify an encoding for the 12-bit operand.

How about: 1 bit null/not null, 3 bits for the type, 8 bits for the payload. So we have 8 types (all subtypes of graph, of course, but:): graph, string, list, hashtable (?), int, real, type, fn (do we need a DYN here? or is 'graph' that? (i think the latter)); do we need a separate 'uint' and 'sint'?. If the 'not null' bit is set, then the 8-bit payload specifies an element of the type; in most cases this is a list of one element, a string of one character, an int, etc, but in the case of eg 'type' and 'graph' we still have to make up a standard encoding. If the 'null' bit is set, then the 8-bit payload can specify (a) the maybe/option says 'None', not 'Just'; or, (b) this is an empty list, an empty string, etc; or (c) 253 other things tbd.
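
a minimal sketch of decoding that operand (Python; the bit order and the type-code numbering are my assumptions):

    # hedged sketch: 12-bit boxed literal = 1 null bit + 3 type bits + 8 payload bits
    TYPES = ['graph', 'string', 'list', 'hashtable', 'int', 'real', 'type', 'fn']

    def decode_boxed_literal(op12):
        not_null = (op12 >> 11) & 1
        typ = TYPES[(op12 >> 8) & 0x7]
        payload = op12 & 0xFF
        if not not_null:
            return (typ, 'none-or-empty', payload)  # None, empty list/string, or tbd
        if typ == 'string':
            return (typ, chr(payload))              # a one-character string
        return (typ, payload)                       # an int, a one-element list, etc.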

also, i think i've decided that the spare bit in the destination operand should mean 'annotation or extraordinary instruction'.

also, we need to move the 'split/unsplit' addr mode bit to the first position, so that annotations/extraordinaries can be recognized as '00' in the first two bits of the whole thing

also, perhaps we should say that all jumps/branches will be ordinary arithmetic instructions operating upon the PC register? This allows most (but not all) of them to be recognized without decoding simply by scanning the code for instructions with the PC register in the destination operand in 'register' mode. This could be helpful for optimized interpreters that decode future instructions in parallel and do pipeliney stuff to partially pre-execute them. This would also eliminate the need for an instruction form that combines all 3 operands; but otoh it also forbids that in the case of branches, meaning that branch distance would be limited to 28 bits instead of 44 bits. Also, if we allow GOTOs and the like, which we probably do, then an optimized interpreter would have to conservatively assume that any CALL might contain a further branch. And i guess it wouldn't be that much extra trouble for an optimized interpreter to compare each immediate-mode opcode to a lookup table of branching instructions. So i guess right now i'm against this idea.

---

copied the current proposal to ootAssemblyThoughts

---

why are stacks faster than heaps?

i think the speed difference is mostly due to memory management paradigm. A new object is allocated on the stack with a single instruction (the cost being that the total size of the stack is fixed). In the heap, this is not always the case, particularly if some sort of garbage collection or defragmentation is triggered upon allocation.

Some citations about this:

http://stackoverflow.com/questions/161053/c-which-is-faster-stack-allocation-or-heap-allocation

http://bytes.com/topic/c-sharp/answers/244034-why-stack-faster

http://www.programmerinterview.com/index.php/data-structures/difference-between-stack-and-heap/

http://gribblelab.org/CBootcamp/7_Memory_Stack_vs_Heap.html
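
to illustrate (Python; a caricature of an allocator, not a real one): stack allocation is a single pointer bump, and freeing a whole frame is a single pointer move, whereas a heap allocator may have to search for space and may trigger collection:

    # hedged sketch: why stack allocation is one instruction's worth of work
    class StackAlloc:
        def __init__(self, size):
            self.sp = 0
            self.size = size                # the cost: total size is fixed
        def alloc(self, n):
            addr = self.sp
            self.sp += n                    # the whole allocation: one pointer bump
            if self.sp > self.size:
                raise MemoryError('stack overflow')
            return addr
        def free_frame(self, addr):
            self.sp = addr                  # free everything above addr at once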

---

so far oot asm is turning out to be not that different from ordinary assembly language. I suspect that fixed-width instructions and the inclusion of pointers are two big factors that end up affecting/determining a lot of the design. One difference is that our memory locations aren't really memory locations (they can 'hold' arbitrarily large data objects), and another is that we'll probably avoid pointer arithmetic (or at least, avoid writing pointer arithmetic into the spec) by changing the defn of the 'split' addressing mode to be getattr rather than addition.

---

todo: http://paulkoerbitz.de/posts/Understanding-Pointers-Ownership-and-Lifetimes-in-Rust.html looks good even though it is out of date (eg some gems: "Owned and borrowed pointers fit a lot of use cases, but sometimes they are not enough. With these pointer types, each piece of memory must have an ultimate owner. This means that the ownership of all objects on the heap must be representable as a directed acyclic graph. Up to version 0.8, Rust offered managed pointers with the syntactic form @. These are currently being removed and will be replaced by the Rc and Gc pointers in the standard library"; lifetimes; a borrowing must have a shorter lifetime than the owner; things become immutable (to the owner, yes, but also to the borrower? i'm not sure) while borrowed)

http://doc.rust-lang.org/0.12.0/guide-lifetimes.html#lifetimes seems to say:

(1) lifetime names are opaque identifiers; you can say that a function returns a lifetime, that a variable takes a lifetime, etc; there is also the predefined 'static' lifetime. Lifetimes are written with a prefix apostrophe, eg 'lifetimename. (2) lifetimes are also labels that can be jumped to in some cases (e.g. in 'breaks' out of loops)

---

memory locs are supposed to be like variables with numbers for names instead of letters. could just say that all 'memory locations' are kept for the life of the program, and only boxed objects are on the heap, and the program predeclares the number of memory locs it needs.

if memory locs are like var names, then that explains why we dont want to allow pointer arith with them

but then we couldn't 'mmap' arbitrarily long arrays into linear stretches of this memory. so mb let the program dynamically expand and contract the max memory loc? the interpreter might demand that the os give it a single, contiguous chunk, however, which may make it infeasible for this region to be too large. also, mmapping in this way is only useful if we do pointer arithmetic, which we may not want to do anyways

if we disallow ptr arith, and if memory numbers are like variable names, then we can't represent 'gensym' (by keeping track of a max variable number and adding 1 to it when we hand it out), so gensym must then be a primitive (an oot assembly instruction)

also, if there is a struct-within-a-struct, wouldn't it appear to the program that the parts of this struct exist at ordinary, though unknown and arbitrarily large, 'memory addresses'? the gc convention though could be that these language-provided, 'opaque' memory addresses are garbage-collected, not roots.

the contents of the cells ('memory locs') are actually stored in them if unboxed and small, or are a pointer otherwise; that is, if large unboxed values are possible (mb they shouldn't be). note that if a cell might be something small or it might be something large, and the program puts in something small and then reads expecting something large, there is an invalid ptr error. so maybe force boxed dynamic type in that case. does oot asm type cells, or do we trust the program not to segfault in this way? if large things are a ptr anyways, mb we should just go ahead and box them; then the typing is just boxed, or unboxed int, or unboxed real, or unboxed (some other small type).

---

to rephrase the last section:

top-level memory locs are supposed to be like variables with numbers for names instead of letters. Which explains why 'pointer arithmetic' seems unnecessary. However, within a struct, you also have a sub-array of memory locations representing the fields within the struct. And if the structure is itself of type array, then some field within it, some sub-location within it, contains the array's actual data. The array's actual data will itself be represented as an array of sub-memory locations. And 'pointer arithmetic' on these locations makes total sense, because we are just doing arithmetic on the array index.

(and btw, in case i change my mind and we do disallow ptr arithmetic, that would mean that 'gensym' would need to be a primitive instruction provided by the platform, because we wouldn't be able to keep track of some max variable number and add 1 to it when we hand it out; which means that we should probably provide gensym in any case)

as for memory management:

if the top-level memory locs are acting like local variables and/or like registers, then they are acting like things which in other platforms are not heap-allocated and would not be garbage-collected. So perhaps we just say that all of the top-level memory locs which are accessed directly by the program are kept forever for the life of the program. So what is on the heap? First, we could say that all boxed objects are stored on the heap (when a boxed object is 'in' a directly-accessed memory loc, what is actually there is really just a ptr). Second, we could say that if the program goes inside of a boxed object and fetches the address of one of its fields, then even though this address is conceptually some integer-numbered top-level memory location, actually it refers to memory on the heap, and is garbage-collected. Similarly, if a program does 'malloc', it gets a pointer that conceptually appears to it to be just an integer that refers to some memory loc, but actually this integer refers to something on the heap.

Furthermore, when the program takes the address of something inside a boxed structure, or of something returned by malloc, this address is conceptually guaranteed to be an 'infinitely large' number, so that it is guaranteed not to conflict with any memory location that the program could ever access by other means. In other words, the conceptual structure of the integers representing the addresses of the top-level memory locations is a non-standard model of the integers, in which there is one island of integers before infinity, and many islands of integers after infinity that are infinitely far away from all other islands. Under the covers, of course, such addresses are merely pointers into real memory, in contrast with the integer locations of ordinary top-level 'memory locations'. Note that pseudo-arbitrarily-large memory locs == locations not constructed by the program but returned from elsewhere == boxed objects == memory which is garbage-collected. Ordinary, 'small' memory locs == memory locs constructed by the program == holds unboxed objects and pointers to boxed objects == memory which is not garbage-collected (but which forms the roots for garbage collection tracing).

in addition, this 'non-standard model of arithmetic' allows getconstant(x) to represent a path, not just a single number; e.g. we are pretending that a reference to someobject.somesubobject.somefield is just a reference to an ordinary yet very large integer index of the 'memory cells' (when in fact it's a pointer into main memory).

so, the contents of the cells are actually stored in them if the contents are unboxed, or if the contents are boxed, the cell actually contains a pointer to the boxed value in the heap.

also, instead of allowing the program to construct arbitrarily large memory locs, we could just limit this to 4096 memory locs; at which point this would become in some sense simply a 4096-register architecture. However, one could at the same time allow integer references to memory locs to be interchangeable with pointers to the heap, as above, to maintain the fiction that a single memory loc can 'contain' an arbitrarily large composite object; so in this sense it would still be a memory-memory architecture, not a register one (eg you could still do 'mov' between two memory locs directly, without copying the contents to be moved to a register in the middle). However, this would imply that you can't do things like copy an arbitrarily long array directly into the top-level memory locs and then iterate thru it there; i doubt that would be a big loss though.
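
a minimal sketch of this two-tier model (Python; all names invented, no bounds checking): small integers address the fixed top-level locs, which are never collected and serve as GC roots, while the 'pseudo-arbitrarily large' addresses are really handles into a collected heap:

    # hedged sketch: top-level locs + 'beyond-infinity' heap handles
    class Memory:
        HEAP_BASE = 1 << 32                  # addresses >= this are heap handles
        def __init__(self, nlocs=4096):
            self.locs = [0] * nlocs          # top-level memory locs (GC roots)
            self.heap = {}                   # boxed objects; garbage-collected
            self.next_handle = Memory.HEAP_BASE
        def alloc(self, obj):                # malloc / taking the addr of a field
            addr = self.next_handle
            self.next_handle += 1
            self.heap[addr] = obj
            return addr                      # looks like just a (very large) integer
        def load(self, addr):
            return self.locs[addr] if addr < Memory.HEAP_BASE else self.heap[addr]
        def store(self, addr, val):
            if addr < Memory.HEAP_BASE:
                self.locs[addr] = val
            else:
                self.heap[addr] = val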

---

indirect addressing here is essentially a form of metaprogramming, namely it's like providing a variable name as a string to a getattr or setattr function, and getting (or setting) the value of that variable, except instead of the string we have the numeric id.

may make typechecking harder..? well, not if the indirect address is not resolving to another top-level address, but rather to a heap address. Hmm.. maybe actually make that a requirement?

---

i guess actually it's fine for 'split' addressing to be:

  #x + Y
  getconstant(#x) + Y
  X + Y
  (X) + Y

or if we use . (getattr) instead of +:

  #x + Y
  getattr(getconstant(#x), Y)
  getattr(X, Y)
  getattr((X), Y)

(note that getattr doesn't make sense for #x + Y)

but now that i realized that we need to address arrays too, maybe + would actually be useful.

if we flipped the order of split addressing, and used getattr:

  getattr(Y, #x)
  getattr(Y, getconstant(#x))
  getattr(Y, X)
  getattr(Y, (X))

if we used plus, and typechecking ensured that when you 'add' two things, only one of them can be a pointer, and the other must be an offset, and then it's like getattr(the_one_which_is_a_pointer, the_one_which_is_an_offset), that could be interesting; then we have the union of the above:

  getattr(getconstant(#x), Y)
  getattr(X, Y)
  getattr((X), Y)
  getattr(Y, #x)
  getattr(Y, getconstant(#x))
  getattr(Y, (X))

where to 'getattr' of an array's data field is to index into the array

but it seems like if we wanted that, we'd really want to explicitly represent it in a bit, rather than use polymorphism.

now we could just say "whatever, pointer + a constant = an address, we don't have to use typechecking to know that that's what's happening, we just add them". But the problem with that is that we are trying to abstract away from field size here, so the runtime can't actually just add some number to a pointer if the number refers to 'number of fields' instead of 'number of bytes'; instead, the runtime has to know what sort of structure that pointer is referring to in order to know how to translate number of fields into number of bytes.


which reminds me of the problems of having this 'a memory location holds things of any size' property in the first place. Are we really spending 64 bits to store a small integer or a single character? I guess we're only offering 4096 memory locations * 8 bytes = 32k (and that's assuming we don't allow eg 'mmapping' an array into the top-level memory locs), but still, this seems a little excessive.

what does java do about this? i think they just bite the bullet and say that even small values occupy 64 bits: "If the size of an object is not a multiple of 8 bytes, it is rounded up to next multiple of 8." -- https://en.wikipedia.org/wiki/Java_performance#Memory_usage

one answer to this could be to have a different pool of memory cells for things of different sizes, but now we're getting complicated.

even if we did have uneven cell sizes, then what about cell locations whose type changes during runtime?

also, what use are 4096 registers anyways when we may need stack frames that persist as we call subroutines and then return? or ARE these stack frames?

they certainly are not the entire stack; eg the Java default stack size is much bigger than 4k: http://docs.oracle.com/cd/E13150_01/jrockit_jvm/jrockit/jrdocs/refman/optionX.html#wp1024112

i think the best answers so far are:

so now we have added another function to boxed things; binding (eg binding a 'local' stack frame cell to a source upval). So boxing


as for 'split' addressing, we could just destroy the uniformity and then it's easier; just choose 4 out of the 6 things above (and swap X and Y as needed to make it easy to remember): i'd probably choose:

  getattr(X, #y)
  getattr(getconstant(#x), Y)
  getattr(X, Y)
  getattr((X), Y)

these are useful because eg:

getattr(X, #y): get a specific field from the object held in X

getattr(getconstant(#x), Y): reference a computed element from some long chain of namespaces, e.g. argv_array

getattr(X, Y): good for array indexing X[Y]

getattr((X), Y): X holds a pointer to an array which resides deep in some heap structure. (X) is the array; we are referencing index Y of that array

hmm.. actually, X + getconst(#y) (that is, getattr(X, getconst(#y))) might be (much) better iff getconst(#y) can represent a whole path (query) instead of a single choice of a field; for example, if getconst(#y) can represent .b.c.d, and X is representing a, then we could do a.b.c.d in one instruction this way (because the representation of the path/query .b.c.d is encoded in the constant table).


the idea from the end of the previous section is important enough that i wanted to emphasize it in its own section: instead of representing only the choice of a field, 'memory offsets' should represent a whole PATH, that is, a choice of a field and then optionally of a subfield, etc. We might interchangeably think of a PATH and of a QUERY. Eg .b.c.d (apply this offset/path/query to variable 'a', and you get a.b.c.d); eg [b][c] (apply this offset/path/query to 2-d array 'a', and you get a[b][c]); eg x y (apply the function f to these two arguments, and you get (f x y)).

note that with regard to function application, we are reminded of the 0-ary function issue, which we saw elsewhere also corresponds to the distinction between applying a complete slice index to a table and getting back a table with one element (for consistency with partial slices, which return eg columns of 2-d tables) vs. applying a complete slice index to a table and getting back a non-table value. And also of the issue of __set needing a 'path' through an object, as opposed to __get just needing a value.

And also of the connection between adding offsets to pointers, and getattr (doing a single field selection), and having offsets able to represent paths (nested field selection). In C adding an offset to a pointer would work as a nested field selection iff the fields before the selected field were of known length; but in our case we want to abstract away from that byte length of fields. You might call paths 'queries', which also implies that more complicated, 'dynamic' methods of selection might obtain; basically, anything computation that walks the graph and returns a pointer to a destination node. We see a similar distinction here as between effective address and referent in addressing mode. However 'a pointer to a destination node' is not really a 'path', so we may need to rethink calling these 'paths'; otoh b/c of metaprogramming like __get, __set, in some cases it will be important which nodes you pass thru on the way to another node, because the node you finally reached may be only a virtual node simulated by running __gets and __sets on nodes that you went thru to get here.
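
a minimal sketch of walking such a path (Python; the dict-based object representation and the '__get' hook name are stand-ins for whatever oot actually uses):

    # hedged sketch: an 'offset' as a whole path/query, not a single field choice
    def follow_path(node, path):                # e.g. path = ['b', 'c', 'd'] for .b.c.d
        for field in path:
            if isinstance(node, dict) and '__get' in node:
                node = node['__get'](field)     # virtual node: the field is computed
            else:
                node = node[field]              # plain field selection or array indexing
        return node

so follow_path(a, ['b', 'c', 'd']) plays the role of a.b.c.d, and the intermediate nodes are visited in order, which matters when some of them have __get hooks.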


since we have pointer values floating around here, this brings up the possibility of segfault crashes (eg if you store an unboxed float (that is, a double, since it's 64 bits) into a memory cell, and then try to read it back with the boxed bit set, that will try to dereference the float as if it were a pointer, but it's not, so probably segfault). We would like not to crash (the VM segfaulting is worse than the program crashing, because the VM loses control and can't propagate an exception). How does the JVM cope with this? i looked into it and it seems the JVM actually verifies the bytecode to make sure it typechecks during classloading, before executing anything. I would have thought that would make class loading way too slow, but apparently the verifier is only a small proportion of the cost of loading a classfile in the JVM (evidence: 40% speedup of the verifier only translated to a 5% speedup of large program startup time [10]) (apparently the cost is mostly I/O, and also in many cases (eg Clojure startup) initialization routines provided by the code in the classfile).

so heck, if it's so cheap, let's do that.

so if we statically typecheck everything, that makes it harder to have dynamism and run-time metaprogramming. We would rather have some things typecheck to DYN, meaning that their type is not known statically (at least not to our theorem-prover), and so runtime typechecking is needed. But we're out of bits. So let's just use the boxing bits for that, too (typechecking only on input, i guess?).

if we really need to do stuff with pointers without the speed cost of dynamically typechecking at runtime, and the verifier can typecheck this stuff at load time, then we could always just interact with the pointers using unboxed instructions that accept pointers.

so, the verifier checks everything which is unboxed (eg manipulated with instructions without their 'boxed' bit set; this might include manipulation of boxed structures directly), and the runtime dynamically typechecks (and bounds checks) everything which is boxed.
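
a minimal sketch of that division of labor (Python; is_valid_box, box_type, and unbox are hypothetical helpers):

    # hedged sketch: unboxed access is trusted (verified at load time);
    # boxed access is typechecked dynamically
    def read(vm, addr, boxed_bit, expected_type=None):
        v = vm.mem[addr]
        if not boxed_bit:
            return v                        # the verifier proved this safe at load time
        if not is_valid_box(v):             # runtime tag check
            raise TypeError('not a boxed value')
        if expected_type is not None and box_type(v) != expected_type:
            raise TypeError('dynamic typecheck failed')
        return unbox(v)                     # may force a thunk, follow a chain, etc.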


so what is boxing doing now? a lot:

we should also add:

in other words, 'boxing' seems to be the default for our inefficient, high-level, safe, dynamic, lazy, symlink-having language; and 'unboxing' is like a special mode that you can use to do more CTM stuff for tight loops, etc.


note that the typechecking verifier will probably have to know about the concept of an array in order to make array access efficient. Java bytecode knows about arrays.

---

Perl's runtime actually has a number of stacks, toread:

http://docstore.mik.ua/orelly/perl/prog3/ch18_03.htm

"

operand stack
    That's the stack we already talked about.

save stack
    Where localized values are saved pending restoration. Many internal routines also localize values without your knowing it.

scope stack
    The lightweight dynamic context that controls when the save stack should be "popped".

context stack
    The heavyweight dynamic context; who called whom to get where you are now. The caller function traverses this stack. Loop-control functions scan this stack to find out which loop to control. When you peel back the context stack, the scope stack gets peeled back appropriately, which restores all your local variables from the save stack, even if you left the earlier context by nefarious methods such as raising an exception and longjmp(3)ing out.

jumpenv stack
    The stack of longjmp(3) contexts that allows us to raise exceptions or exit expeditiously.

return stack
    Where we came from when we entered this subroutine.

mark stack
    Where the current variadic argument list on the operand stack starts.

recursive lexical pad stacks
    Where the lexical variables and other "scratch register" storage is kept when subroutines are called recursively.

And of course, there's the C stack on which all the C variables are stored. Perl actually tries to avoid relying on C's stack for the storage of saved values, since longjmp(3) bypasses the proper restoration of such values. "

---

should we have scope annotations to say when we're done with a particular variable/resource? should we have for loops, while loop annotations or instructions? is that too ASTy for bytecode (even if it's linearized)?

---

i don't understand what this means, but apparently the JVM had a problem verifying its jsr/ret instructions: [11] (see [12], section "The issue with JSR/RET"). WOULDNT THE SAME PROBLEMS APPLY TO MY 'FAST FUNCTIONS'?


so still not clear on:

---

well now that i've thought of a bunch more flags that i want, maybe we should cut things down from 12 bits to 10 bits, freeing up 2 bits per field, or 8 bits overall

1024 registers is still a lot, right guys? (yeah)

1024 constants per module is still a lot, right guys? (yeah)

1024 functions per module and imported modules is still a lot, right guys? (umm.. no, but i guess it'll do, since we can always fall back to GETCONSTANT, which will combine input fields to get 20 bits)

---

if we had more flags, we could also have one for how to deal with failure, eg whether to return a null (a maybe-null, or a straight nullable type), or to raise an exception

---