proj-oot-ootAssemblyNotes2

--

so, Oot Assembly may be fixed-length, not for efficiency, but for ease of implementation. Code density is not a concern because we expect that a compiler will compile Oot Assembly to a target system. We do, however, wish to preserve 'intent' so as to aid optimizations on a target system.

we'll probably have arbitrary word size, and allow the storage of complex data structures in one 'memory location'.

We might stick to things addressable by 1,2,4,16 bits. When to use what (um, this is kinda obvious..):

ok, to add some meat to that:

note that, in order to capture enough 'basis sets of primitives' to preserve intent across various computation paradigms, we want more than 16 opcodes. So, according to the above scheme (which conspicuously omits 8 bits/256), we should have the opcode field be 16 bits. Note that this gives us room for 'custom opcodes', and for using custom opcodes for subroutines, Forth-like.

PL book: Stack Machines has comments by two authors saying that 32 and 16-32 is a good size for an on-chip stack buffer (and that this must be at least 2). It also has comments that say that 2 or 3 stacks is enough (so we'll have 4), and that the average stack size on some benchmarks is 4. I get the sense that many systems allow direct access to at least the second element on the stack, and certainly common stack primitives include OVER and ROT, which involve 2-3 elements.

I'd like to provide access to elements beyond the top of the stack, so how many? Either 2 or 4 or 16. My general principle is to take what seems to be a generous-but-conceivably-necessary number and then go to the next number up to allow for maximal flexibility; so either we say 4 seems like a good number and go one up to 16, or we say many other systems appear to use 2, so we should use 4.

At one point, http://www.eecg.utoronto.ca/~laforest/Second-Generation_Stack_Computer_Architecture.pdf references Chapman's 'stack quarks', defined here: http://clubweb.interbaun.com/~rc/Timbre/ContentPages/Timbre/SQP/StackQuarksPaper.html#pgfId-916641 , which appear to be a basis set of operations for dealing with a stack when the top two elements of the stack are held in registers (e.g. the stack is implemented as two registers and then a stack). The basis set is:

DRIP: a b c ... -> a a c ... (MOV reg1 reg2)
DIP: a b c ... -> b b c ... (MOV reg2 reg1)
NUP: a b c ... -> a b b c ... (PUSH reg2 onto the (rest of the) stack)
NIP: a b c ... -> a c ... (POP the (rest of the) stack into reg2)
TUCK: a b c ... -> a b a c ... (PUSH reg1 onto the (rest of the) stack)
TAKE: a b c ... -> c b ... (POP the (rest of the) stack into reg1)

hmm.. if we have an auxiliary field of 4 bits, how many stack operations can we support? If we want to pick the top 2 items out of the top 2 items, that's 2^2 = 4 ops (a b -> aa, ab, ba, bb) (but note that one of those is the identity, NOP). For 3 items, we have 3^3 = 27: abc -> aaa, aab, aac, aba, abb, abc, aca, acb, acc, bxx etc. So this doesn't nicely fit into 16.

if we wanted to include the 'stack quarks', we'd probably also want at least: PUSH, POP, SWAP, DUP, DROP. And maybe also ROT, OVER. So that's 13 already. You might also include "PUSH starting with reg2 (outside value -> reg2)" and "POP from reg2 to outside". And maybe UNDER. Is there some way to organize all these cleanly into 16 choices? Maybe try pairs of stack quarks and then eliminate duplicates? But that doesn't really help; we may as well have different opcodes if that's what we're doing.

Anyways, that particular architecture locks us in to using "2" as the available stack lookahead, not 4 and certainly not 16 (so by the above rule about generous-but-conceivably-necessary++, maybe we should use 4!).

hmm, how about: 0: reg1, 1: reg2, 2: rest-of-stack, 3: something else?.

Or how about: 0: read/write reg1, 1: read/write reg2, 2: push/pop reg1, 3: push/pop reg2. So that's 2 bits. The other two bits can specify which stack, out of 4 stacks.
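to make that concrete, here's a rough sketch of the decode (Python, purely illustrative; the particular bit assignment, low 2 bits = access kind and high 2 bits = stack number, is just an assumption):

    # Hypothetical decode of the 4-bit operand sketched above.
    # Bit layout is an assumption: high 2 bits = which stack, low 2 bits = access kind.
    ACCESS_KINDS = ['rw-reg1', 'rw-reg2', 'pushpop-reg1', 'pushpop-reg2']

    def decode_special_operand(operand4):
        stack = (operand4 >> 2) & 0b11        # which of the 4 stacks
        kind = ACCESS_KINDS[operand4 & 0b11]  # how the top-two registers are accessed
        return stack, kind

    # e.g. decode_special_operand(0b1110) -> (3, 'pushpop-reg1')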

And, instead of making the stack stuff an addressing mode, we could memory-map the first 16 memory locations.

So instead of having 16 symmetric GPRs, we effectively have 8 GPRs which are arranged spatially as the top-2 elements of 4 stacks!

DRIP, DIP, PUSH, POP, DUP (MOV reg1 -> reg2(push)), OVER (mov reg2 -> reg1(push)), SWAP (reg2(pop) -> reg1(push)), NUP (reg2 -> reg2(push)), NIP (reg2(pop) -> trash) are then primitive ops. TUCK, TAKE, ROT, (and maybe UNDER, what is UNDER?) are composite ops.

the benefit of having stacks (and/or memory->memory instructions) is that intent is preserved; e.g. if you push the result of g(x) onto the stack and then later pop it and feed that to f, you know that's just the expression f(g(x)); if you had put it in a register, you'd have to do a liveness analysis to figure out that that's the last use of that value before it gets overwritten later.

a crucial point is that the program cannot assume that these special stacks are memory-mapped (and therefore that it can directly access items buried deep in the stack). Except maybe for one of them. This makes it easy to translate these programs to true 'stack machines' which really do not allow direct access to items buried in the stack.

the philosophy, then, is that, by following various constraints, a compiler could produce an Oot Assembly program suited to running immediately on a variety of architectures. E.g. by preferring memory locs below 16, it could be suitable for a register machine; by preferring the stacks, it could be suitable for a stack machine; etc. Furthermore, optimization of arbitrary Oot Assembly code for these sorts of things can take place in the form of a transformation from Oot Assembly to Oot Assembly. That is to say, Oot Assembly is able to express a variety of these paradigms.

could just put a stack 'under' every memory location, rather than making the first 16 memory locations special. But i suspect that's not the right way to go, because there probably won't be any hardware implementations that match it.

the PC, and the stack pointer for the mmaped stack, could be memory-mapped, or we could have a 'meta object' which is memory mapped, with the PC etc as fields within it

i guess there's no good reason to allow the PC to be written to directly as the result of e.g. ADD for relative JMP. That would let us have fewer opcodes, but the implementation is just as complex, in fact possibly more so on some architectures in which not all manipulations of the PC are supported.

why memory-map the tops of the stacks, and memory-map the distinction between MOV and PUSH? Because we want every instruction to be able to directly push or pop from any of the stacks. Why? Because this allows us to preserve intent; if you have temporary registers, you have to do (escape? lifetime?) analysis later if you want to connect inputs to outputs during optimization passes. Why not just add extra fields to each operand, so each operand has both a 16-bit space and a 4-bit space, and a selector within the addressing mode that says which of these we are using? We could do that, i guess; it seems like a gratuitous waste of at least 5 bits though, since at least one of the 16 bits or the 4 bits will always be empty. On that note, though, it doesn't seem like such a waste to just add 1 bit to the addressing mode to toggle whether the operand is interpreted as a memory address, or as a 'special address'. Now the stacks, the stack pointer for the 4th stack, etc. are all 'special addresses', and the entire 64k of normal memory can be addressed cleanly, without having to subtract 16 or 32 or whatever first. Yeah, i like that better.

---

note: to allow 'emulation', could have some of the bits in the opcode field actually be an addressing mode for the operand. The reason i say to use bits within the opcode field instead of having another field for the opcode addressing mode is that this way, you can have more than one layer, e.g. if the top level opcode field says 'refer to memory location 433 to get the real opcode', then memory location 433 can itself say 'refer to memory location 824 to get the real opcode'.

This is https://en.wikipedia.org/wiki/Addressing_mode#Multi-level_memory_indirect (other keywords? nested, cascaded). William Stallings. Computer Organization and Architecture(?). Chapter 11 page 390 says "There does not appear to be any particular advantage to this approach, and its disadvantage is that three or more memory references could be required to fetch an operand". One can see that this adds overhead to even 'normal' indirect addressing fetches (because we have to check the result for additional indirection), and also adds one bit of overhead to each memory location; but these disadvantages could be small if we have a separate cascaded indirect addressing mode rather than overload the normal indirect mode with this; and, if cascading indirection is only checked for if the previous address itself indicated such. The advantage that i see is that this could allow the instruction to make it clear when it is using indirection (for eg emulation), which preserves more intent, compared to having an emulator manually load the parts of the instruction to be emulated into some register and then run the instruction.

We could extend this scheme, not just to the opcode field but to every field, at the cost of putting the addressing mode bits 'inside' the operand field (the low-order bits, so as not to constrain the word size). So which addressing modes do we need?

0: immediate
1: direct
2: indirect
3: special (e.g. stack ops, top of stacks, etc)
4: cascading indirect

yarg so that's at least 3 bits. We might just want to make the third bit a 'cascading' bit, to allow things like 'POP on stack 2, cascading'.

so then we have the unpleasant choice of either reducing our immediate word size from 16 to 13, or increasing our 'natural' platform word size requirement from 16 to 19. I think the former is preferable. Now we can only directly address 8k. (note: x86 supports 8-bit, 16-bit, or 32-bit relative offsets in jumps: http://x86.renejeschke.de/html/file_module_x86_id_147.html )
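a sketch of what decoding such an operand field might look like (illustrative only; it assumes the 2 mode bits are the lowest bits, then the cascading bit, then 13 bits of payload):

    # Minimal sketch of the 16-bit operand layout proposed here (an assumption,
    # not a fixed spec): low 2 bits = addressing mode, next bit = 'cascading',
    # remaining 13 bits = payload (so only 8k directly addressable).
    MODES = ['immediate', 'direct', 'indirect', 'special']

    def decode_operand16(field):
        mode = MODES[field & 0b11]
        cascading = bool((field >> 2) & 1)
        payload = field >> 3            # 13-bit value, 0..8191
        return mode, cascading, payload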

---

this makes the case where we need to load a constant in order to access something bigger than our fixed-length direct addressing field allows more common. let's think about that for a second:

---

note: when we want to use a constant longer than the fixed-length size of our operand fields, what do we do? Some options:

i guess the last of these is best. Self-modifying code is complex to think about and may not be efficiently supported by many platforms. Making every instruction variable-length complicates decoding. A constant table is more orderly, but it removes locality of reference, hurting caching (i.e. you'd want a constant from the table to be in the cache even if the code using that constant is distant from the table; otoh, if you require every constant bigger than the fixed-length operand size to be in the table, then there will be a lot of constants that are only used once, so the constant table will be really big, and since cache lines hold contiguous memory, lots of cache space will be wasted).

so, how shall this work? shall the embedded constants be loaded into a little dynamic 'constant table'? But then an optimizer would have to perform live analysis (what i was previously calling 'lifetime analysis'; i just looked up the proper name) on this constant table. So shall we just load the constants into memory? But then, in the common case where the constant is only used once, the optimizer again ends up performing extra live analysis. So shall we just make this an arbitrary-length prefix for the next instruction? Yes, i think so.

So, in the end we do have a form of arbitrarily-long variable length instructions, but a very regular one that is easy to decode.

Whatever the normal instruction length is, the constant-setting prefix is a chain built out of building blocks of the same length. Each building block is in fact just a special kind of instruction. The chain is both C-string- and Pascal-string-style, i.e. the first member of the chain describes its length, and there is also a terminator at the end (or, in this case, i guess we don't need a terminator, because we just have an ordinary instruction at the end).

Now that we have this, should we just make the basic instruction operand field size really small? Like, 4 or 8 bits instead of 16? This allows us to more easily embed on 8-bit processors. If we use 8 bits and then use 3 of those for addressing mode, we have 5 bits left for data. This would allow us 32 direct opcodes, which is tantalizingly close to the ideal, and also fits with the order of magnitude of the number of registers available in x86. Having a small size like this also means that little space is wasted when we have a sequence of instructions that don't need larger operands; i know i said i don't want to care about efficiency here, but if we've already made the choice to have arbitrary-length prefixes, then what we have here is just a choice of bit-width which only affects efficiency. So ideally we would benchmark this later and vary it to whatever turns out to be optimal, but we still have to make an initial guess (and realize that probably we won't actually bother to benchmark it until it is too hard to change it, so we may be stuck with this choice).

So what about trying 8 bits? We have 3 operands, after all, so the instruction size still gets bigger than 16 bits already (although now we are taking the addressing mode out of those 8 bits, so that's less than before). The only thing that annoys me about this choice is that the ASCII alphabet starts at 65 and goes thru 122, so we'd need 7 bits of data (not 5) just to put character constants in our code without prefixes.

so, in this manner, we might even just have a 32-bit instruction size, i.e. an 8-bit opcode field and three 8-bit operands. No room for flags or types or auxiliary arguments or anything else, though. 'Instruction formats' would be indicated by reserved opcodes instead of bits. But maybe we would make prefixes for flags and types, too, just treating them as general 'annotations' of code; this makes them more extensible/customizable, i guess, which is what we wanted.

This is spartan, but appealingly simple, almost minimalistic (but not minimalistic in a code density sense, since we do have 3 operands instead of 0,1, or 2).

So how do the constant prefixes work? I guess they could specify which operand they are giving, and then give it. I guess they could have a normal opcode field, and then the 3 operand fields contribute towards specifying the operand. But are the actual values in the operand field of the actual instruction disregarded? Maybe it's better to use those as the 5 lowest-order bits; this way, the 3 addressing mode bits are always in the same place (assuming that this encoding is 'big endian', i.e. the most significant bits are in the prefix and the least significant bits are in the actual instruction at the end). The downside is that now we have constants that can be of odd bit-width; e.g. if we use one prefix instruction, which contributes 24 bits, then we have 24+5 = 29 bits. Oh well. I guess we already had that with the 5 data bits in the normal instructions.
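a sketch of how the constant would be reassembled under this scheme (assuming, as above, 24 bits per prefix instruction and 5 low-order bits in the actual instruction; names are illustrative):

    # A sketch of reassembling a prefixed constant, assuming each prefix
    # instruction contributes 24 high-order bits and the final instruction's
    # operand field contributes the lowest 5 bits ('big endian', most-significant
    # chunk first).
    def assemble_constant(prefix_chunks, operand_low5):
        value = 0
        for chunk in prefix_chunks:      # each chunk is a 24-bit integer
            value = (value << 24) | (chunk & 0xFFFFFF)
        return (value << 5) | (operand_low5 & 0b11111)

    # one prefix chunk + 5 in-instruction bits = a 29-bit constant, as noted above
    assert assemble_constant([0xFFFFFF], 0b11111) == 2**29 - 1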

note that the constant can only be used once, in the next actual instruction. It is not saved.

note that since the prefixes are ordinary instructions, the emulation machinery works well with them.

should we have a separate opcode for the prefix length (which is itself a variable-length prefix)? Or is this mixed in with the other prefix opcodes? Or maybe by default the prefix is of length 1, but if it's longer, the special prefix-length opcode is needed? I like that third option.

so now we need a bunch of opcodes for prefixes. Such as:

PREFIX-LENGTH-HEADER
CONSTANT-PREFIX
FLAG-PREFIX
TYPE-PREFIX
CUSTOM-ANNOTATION-PREFIX
TRANSACTION-PREFIX
HINT-PREFIX (can be ignored by platforms that don't understand them; maybe all custom annotations are hints? or maybe not?)

or perhaps we need to say which field we are targetting in the CONSTANT-PREFIX opcode? So

OPCODE-CONST-PREFIX
SRC1-CONST-PREFIX
SRC2-CONST-PREFIX
DEST-CONST-PREFIX

note that using an opcode too large to fit in the 5 data bits always requires the use of OPCODE-CONST-PREFIX.

It is an error to have a PREFIX-LENGTH-HEADER which is not followed by the indicated number of PREFIX instructions.

since one instruction may have multiple prefixes, should the PREFIX-LENGTH-HEADER indicate the distance to the actual instruction, or just the length of THIS prefix?

also, should we use suffixes instead of prefixes? if so, then instead of adding the 5 data bits in the actual instruction, we have a 'see suffix' sentinel (probably the highest value, e.g. 31). But then what about flags, transactions, arbitrary annotations? Maybe those must remain as prefixes.

so now we also need a SUFFIX-LENGTH-HEADER. And maybe its 3 operands say how many instructions of each type of suffix there is (again, with the highest value (is this 255 or 31?) a sentinel indicating a longer length; how would we encode that, exactly?). Allowing the implementation to load them in parallel, and also to know where the next instruction begins.

--

also, for annotation, we need:

ie

ANNOTATE-POINT(property = label)
ANNOTATE-REGION-BEGIN(property = label)
ANNOTATE-REGION-END(property = label)
ANNOTATE-REGION-BEGINEND(property = label)

eg

ANNOTATE-POINT(GOTOLABEL = 'A')
a (some random instruction)
ANNOTATE-REGION-BEGIN(LEXICAL_TRANSACTION = 'A')
b
ANNOTATE-REGION-BEGINEND(SOURCE_LINE_NO = 33)
c
ANNOTATE-REGION-BEGIN(LEXICAL_TRANSACTION = 'B')
d
ANNOTATE-REGION-END(LEXICAL_TRANSACTION = 'A')
e
ANNOTATE-REGION-BEGINEND(SOURCE_LINE_NO = 34)
f
ANNOTATE-REGION-END(LEXICAL_TRANSACTION = 'B')
g

in this example, when you GOTO label A, you start out at instruction 'a'. LEXICAL_TRANSACTION 'A' goes from instructions b thru d. LEXICAL_TRANSACTION 'B' goes from instructions d thru f (overlapping LEXICAL_TRANSACTION 'A'). Instructions c, d, e derive from source code line #33, and instructions f and g derive from source code line #34.

--

note: since we now have cascading indirect memory addressing, we have to worry about the possibility of indirection loops. We could have a fixed cutoff on the number of cascades (i don't like that), or we could have a stack tracking the cascade, and check it for repeats (but to make this efficient, we have to limit its size anyhow, so isn't this the same? no, b/c at least with that, we could spill to memory).
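a sketch of the loop-checking idea (illustrative; it assumes memory maps an address to a (cascade-flag, value) pair and uses a list as the 'stack tracking the cascade'):

    # Sketch of resolving a cascading-indirect operand while guarding against
    # indirection loops; 'memory' is assumed to map address -> (cascade_flag, value).
    def resolve_cascading(memory, addr):
        seen = []
        while True:
            if addr in seen:
                raise RuntimeError('indirection loop at address %d' % addr)
            seen.append(addr)            # the stack tracking the cascade
            cascade, value = memory[addr]
            if not cascade:
                return value
            addr = value                 # keep following the chain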

also, we should still consider just having an EMU opcode (or, at least, having all operands or none of them be cascading) instead of permitting cascading addressing for each operand separately. But this seems somewhat complex to encode..

---

if we did want the possibility to reuse these embedded constants (suffixes), we could have a LOADK opcode that addressed a suffix using PC-relative addressing.

so mb a better way of thinking about it is, there are these embedded constants, and as an optimization for use-once constants, instructions can use an immediately following embedded constant without a LOADK.

so if that's the way it is, then the SUFFIX-LENGTH-HEADER should just be for a single constant, not specifying 4 constants at once and what they are for. And we don't need to distinguish OPCODE-CONST-PREFIX, SRC1-CONST-PREFIX, SRC2-CONST-PREFIX, DEST-CONST-PREFIX, SUFFIX-LENGTH-HEADER; we just have a CONST-HEADER. If multiple suffix CONSTs are used for a single instruction, then they are expected to be found in the same order as the fields in the instruction (which is OPCODE, SRC1, SRC2, DEST). It is an error if there are not enough of them. This is okay because that can be statically checked.
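a sketch of that matching rule (illustrative; 'needs_const' stands for whichever fields contained the suffix sentinel):

    # Sketch of pairing suffix CONSTs with the fields that requested them, in the
    # fixed field order OPCODE, SRC1, SRC2, DEST.
    FIELD_ORDER = ['opcode', 'src1', 'src2', 'dest']

    def bind_suffix_constants(needs_const, suffix_consts):
        # needs_const: dict field -> bool; suffix_consts: list of constant values
        consts = iter(suffix_consts)
        bound = {}
        for field in FIELD_ORDER:
            if needs_const.get(field):
                try:
                    bound[field] = next(consts)
                except StopIteration:
                    raise ValueError('not enough suffix constants')  # statically checkable error
        return bound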

(how does emulation interact with this? if you have an indirect opcode which resolves to CONST-HEADER? if you need to put a constant in emulated code? how do you trap within an emulation? do you have virtual memory within an emulation? i'm guessing: yes, you optionally trap, yes, you optionally have VM. Btw isn't VM just like __get, __set?!? it seems to me that the code being emulated will need to have CONSTs in it, so yes, an indirect opcode can resolve to CONST-HEADER. The other way around, i'm not so sure; an immediate (hardcoded) opcode of CONST-HEADER whose operands are indirect? I think the 'operands' in CONSTs have no addressing modes, e.g. they are always immediate)

---

i guess constants with a value greater than 30 but less than 2^24 will be quite common, so maybe have opcodes for each of:

(allow CONST-HEADER to specify length 0? Sure, why not)

(why not just have CONST-BODY double as CONST-BODY-SINGLETON? That's fine if you are always being told where a constant begins. But if you want to be able to look at an arbitrary place in the middle of code and quickly figure out where the previous and next instructions are, with CONST-BODY-SINGLETON you immediately know that the prev 32-bit word is an instruction and the next word is an instruction, but without it if you land on a CONST-BODY you just have to go backwards until you find either another instruction or a CONST-HEADER, and you just have to go forwards until you find another instruction. Come to think of it, that's not much worse. Still, i feel there's something to be said for being able to look at the places where CONST-HEADER and CONST-BODY-SINGLETON appear and then know where the constants are and where the instructions are and how long the constants are, as opposed to having to consider CONST-BODYs in context and count them. By this reasoning, we do need PREFIX-LENGTH-HEADERs also.)

---

i guess the design goal of "preserving intent" implies "don't map a specific operation S to a special case of a general operation G if, by looking at the result in the form of G, you cannot discover that this is actually S without doing non-local analysis". A special case of this criterion is: "If something is being loaded and then only used once, don't just load it into a register or memory location, and then use it, because then liveness analysis, a non-local form of analysis, is needed to infer that it is only used once". (this is because the scope of an operation affects its 'type', and lexical scoping (e.g. using something once) is treated as a general case of dynamic scoping; although that's a screwy way of thinking about it i guess)

---

" Fundamentally, x86 is quite similar to RISC architectures in a number of dimensions. ALU operations are largely the same; there are only so many ways to do addition, subtraction and multiplication. However, the front-end is quite different and one of the most challenging aspects of modern x86 CPUs. The instruction caches are kept coherent with data caches, and the variable length instructions make decoding quite complex. x86 instructions range in size from 1-15 bytes, with length-changing prefixes, inconsistent operand positions and complex microcoded instructions. Since the P6, these instructions have been transformed into more tractable fixed length uops that can be tracked by an out-of-order core. "

i don't quite understand what it means to keep icache and dcache coherent with each other, and i'm not sure i understand enough about decoding complexity to know if the scheme i described for large constants makes it worse.

---

note: perhaps, as with the JVM, sizes/positions in the stack should be invariant when you are at the same point in Oot Assembly in the same depth in the call stack? This helps with the sorts of analyses that i detailed above (e.g. if you do g(x), push the result, then later pop it and feed it to f(), that's f(g(x)); we want the optimizer to be able to easily statically tell, given an item on the stack, who produced it and who will consume it).

--

maybe, in addition to emulation, do 'subexpressions' via an addressing mode?

e.g. how to represent A * (B + x)?

instead of:

PUSH x
CALL L
MUL (A, POP)

L: x = POP
   y = ADD(B,x)
   PUSH y
   RETURN

want something like:

MUL (A, SUBEXPR L(x))

L: x = POP
   y = ADD(B,x)
   PUSH y
   RETURN

(or maybe assume that functions return one result, and use 'RETURN y' instead of 'PUSH y')

this may be too much, because what if L takes more than one input? Also, you'd need an extra operand to handle specifying both L and x. So want:

PUSH X
MUL (A, SUBEXPR L)

L: x = POP
   y = ADD(B,x)
   PUSH y
   RETURN

not sure if that is really of any use, though.. i guess really what i want is:

MUL (A, SUBEXPR SE1)

SE1: PUSH X
     CALL L
     END-SUBEXPR (or just RETURN?)

L: x = POP
   y = ADD(B,x)
   PUSH y
   RETURN

so do we add an addressing mode for SUBEXPR? Or somehow combine it with the SPECIAL addressing mode? Or what?

there appears to be lots of room in SPECIAL; even if we only have 5 operand bits in one instruction, we're only using about 20 of those so far.

another way to go is to literally put the code in there, like an S-EXPR:

z = MUL (A, (SUBEXPR: PUSH X CALL L ) )

L: x = POP
   y = ADD(B,x)
   PUSH y
   RETURN

which would actually be encoded:

MUL (A, BEGIN-SUBEXPR-SENTINEL)
PUSH X
CALL L
END-SUBEXPR-SENTINEL

that's less efficient to execute, but again, it preserves intent even better (we're getting closer to a straight serialization of the AST... which might be what we want).

---

also need a PEEK opcode to peek deeply into a stack (note that this may be inefficient on some machines)

--

we saw with Nock how crystalline it can be to have binary trees instead of trees with arbitrarily many children at each node.

maybe, since it seems that Oot Assembly instructions have 4 parts, here we want 4-ary (quaternary) trees.

e.g.

           *
       /  /  \  \
    z =  MUL   A   *
               /  /  \  \
           ret =  ADD   B   x

seems like the 'ret = ' is unnecessary.. but what if we had multiple return args?

--

i guess multiple return args is no longer 'expression-like'. So we have trinary trees:

  z =      *
         /  |  \
      MUL   A    *
               /  |  \
            ADD   B    x

i guess prefixes would be annotations on these trees?

note: we could also think of the opcode as the annotation on the link

note: by that logic, we might demand quaternary trees, so that we can annotate both the link and the node, or so that we could have an opcode and also a position label.

" Named Graphs and quad stores - DBTune blog blog.dbtune.org/post/2010/12/21/Named-Graphs-and-implementations Dec 21, 2010 - Now, most triple stores implement Named Graphs by, in fact, storing quads. They store tuples of the form (subject, predicate, object, graph).

Virtuoso Open-Source Wiki : RDF Triple Store FAQ virtuoso.openlinksw.com › Dashboard › Main Virtuoso Universal Server In actuality, many Triple Stores are in fact Quad Stores, due to the need to maintain RDF Data provenance within the data management system. " -- https://www.google.com/search?client=ubuntu&hs=Fcg&channel=fs&q=triple+store+quad+store&spell=1&sa=X&ei=m6YHVI7IJM3MigKl8IGwBQ&ved=0CBwQBSgA&biw=900&bih=1286

note: the 4th element in the quad seems usually to be used to indicate 'context', e.g. a namespace, rather than something more like this. Also, note that the way we are using this, the 4th element isn't a unique id either, which is what some people propose adding to quadstores to make quint stores. i searched for RDF tuplestores that use quints, and found AllegroGraph, where the 4th and 5th elements are context (graph name / provenance) and unique id (label) ( http://www.palladiumconsulting.com/tag/graph-databases/ ).

seems to me that 'context' and 'id' can be collapsed, and we just need 4 things. but maybe we need 5 things because we may want: opcode (relation), src1 (object), src2 (no equiv in RDF? i guess you'd make src1 composite and pack a pair into src1), dest (subject), annotation (4th and 5th elements in RDF-land)

an assertional non-computation example might be: "Bob is located to the left of the fire hydrant". Dest is 'Bob', src1 and src2 are 'the fire hydrant' and 'to the left of', relation/opcode is 'isLocated', and we attach a fifth element to allow us to identify/annotate this sentence. But you could clearly collapse src1 and src2 into one object, 'to the left of the fire hydrant'. Is there a better example?

see also '5-polarity' way above

i kinda like having both src1, src2. It provides an interesting bit of symmetry/asymmetry to a 5-polarity model.

i guess when you really need src2 in natural language, you are getting into things that are clumsy to think about in English. Such as:

"Bob's score in Rock-paper-scissors depended upon whether his move dominated his opponent's move."

compiles into two quints:

dest: bob's score
relation: depended on
src1: quint #2
(no src2)
5: the id of this quint is #1

dest: ('return'; the thing that bob's score depends upon)
src1: his move
src2: his opponent's move
relation/opcode: whether src1 dominated src2
5: the id of this quint is #2

Interestingly, each of these has one empty field, but they are not the same. 'dest' in quint #2 is just a reverse link to 'src1' in quint #1. There is no src2 in quint #1.

--

" "Computers from the second generation keep their stacks separate from main memory and usually on-chip. This has the advantage of causing no memory traffic, but typically limits access to the topmost two to four elements since the stack is no longer addressable. This limitation requires that the topmost elements be permuted to bring a value to the top of the stack (Section 7.3.3). "

so we might want to provide direct addressing to the top 4 elements, not just the top 2? That would take up all 32 special addresses, though

i'm inclined not to do that. But it would have the advantage of guaranteeing that even if all 4 fields of an instruction used indirect addressing via a POP, those 4 values would already be in registers (but i thought the result field could only PUSH, not indirect via POP?)

but if we say that the result field cannot POP, only PUSH, and we neglect opcode indirection (let it take longer), then we only need 2. But if we have trinary trees maybe it's best to have 3 values available? Or 4?

--

if all 4 fields of an instruction manipulate the stack, there's an implicit ordering. It is:

OPCODE, SRC1, SRC2, DEST

---

now that i feel more comfortable with the simple and inefficient yet extensible instruction encoding, it doesn't matter as much what the bit-width of the operands are, because they are extensible, and because anything extra no longer needs to fit in the instruction, but is in the prefixes.

still, i want to remark that i am having second thoughts about the 8-bit initial proposal for operand bit-width, as the 30 items may not be enough for memory-memory addressing much of the time; also it doesn't jibe with the 1,2,4,16 thing. So let's propose 16-bit operands (64-bit instructions) initially. After taking out the 2 addressing bits and the one cascading bit, we have 13 bits (8k items addressed).

compare to Lua: Lua has 32-bit instructions with fields of about 8 bits (some are 9) for operands (and 6 bit opcodes). The single-operand format (for JMP etc) has a signed 18-bits, so "jumps and control structures cannot exceed a jump distance of about 131071". "By default, Lua has a maximum stack frame size of 250. This is encoded as MAXSTACK in llimits.h. The maximum stack frame size in turn limits the maximum number of locals per function, which is set at 200, encoded as LUAI_MAXVARS in luaconf.h. Other limits found in the same file include the maximum number of upvalues per function (60), encoded as LUAI_MAXUPVALUES, call depths, the minimum C stack size, etc."

so Lua chose 8. But then, Lua didn't burn 3 bits per operand on addressing mode. Remember, we're doing that to preserve intent. Also, Lua doesn't have our complex optional suffix for large constants (instead, they have (per-function?) constant tables).

If we really want to preserve intent, shouldn't we have a displacement addr mode too, to reduce the frequency of:

(LEA) addr = LEA base, offset
(MOV) addr = value

more on x86 LEA:

example: LEA ESI, [EBX + 8*EAX + 4]

"Load Effective Address calculates its src operand in the same way as the mov instruction does, but rather than loading the contents of that address into the dest operand, it loads the address itself. lea can be used not only for calculating addresses, but also general-purpose unsigned integer arithmetic (with the caveat and possible advantage that FLAGS are unmodified). This can be quite powerful, since the src operand can take up to 4 parameters: base register, index register, scalar multiplier and displacement, e.g. [eax + edx*4 -4] (Intel syntax) or -4(%eax, %edx, 4) (GAS syntax). The scalar multiplier is limited to constant values 1, 2, 4, or 8 for byte, word, double word or quad word offsets respectively. This by itself allows for multiplication of a general register by constant values 2, 3, 4, 5, 8 and 9, as shown below (using NASM syntax):

lea ebx, [ebx*2]      ; Multiply ebx by 2
lea ebx, [ebx*8+ebx]  ; Multiply ebx by 9, which totals ebx*18
"


ARM has no LEA (!): " Because the ARM has no `load effective address' instruction the assembler provides ADR, which will always assemble to produce ADD, SUB, MOV or MVN instructions to generate the address. The syntax is: ... The expression can be register-relative, program-relative or numeric. ADR must assemble to one instruction, whereas ADRL allows a wider range of effective addresses to be assembled in two instructions" -- http://hackipedia.org/Platform/3D0/html,%203DO%20SDK%20Documentation/Type%20A/tktfldr/augfldr/7augb.html

so if we provided an LEA as good as x86's, we'd either have special registers, or up to four operands. (note however that since our 'memory locations' are objects, not memory locations, we probably don't need eg the multiplication by a scalar constant; and also, it's unclear if our field references will be by integer (ie translate field names to keywords) or by some absurdly slow hashtable lookup based on field name (for full dynamicity), or some combo)

but if we did this, we still couldn't represent chained __gets, eg "bob.hisarray[2].c"

if the goal is just to make it easy to see that a particular bit of computation is being done only for the purpose of getting an address that will then be used exactly once in an upcoming instruction, a better way of doing this is to make it a subexpression.

which suggests that we really want an AST, not a linear sequence of bytecode.

on the question of bytecode vs. AST:

here's a good stackoverflow on compiling vs. executing the AST: http://stackoverflow.com/questions/20674854/advantages-of-compiling-a-language-vs-executing-the-ast-as-soon-as-it-is-constru

" Interpreting an AST is normally much slower than running machine code that does the same thing. A factor of 20 is typical. ... Compiling to bytecode is a middle ground ... few modern language systems use pure AST interpreters. (A notable exception is command shells, but most of them were developed long ago, and speed benchmarks are not common.) They're just too slow. My context is lisp interpreters. I've implemented a couple. Here for example is one set of Scheme benchmarks. The columns corresponding to AST interpreters are pretty easy to pick out. ... Another rough benchmark: Perl uses a heavily optimized AST interpreter. To add 10 million floats in a tight loop on my machine requires about 7 seconds. Compiled C (gcc -O1) takes about 1/20th of a second. ... There are many shades between "pure" AST interpeters and compiled machine code. Depending on the language, it may be possible to compile symbol offsets into the AST. This is sometimes called "fast links". The technique typically speeds things up by a factor or 2 or more. Then there are "compile-to-bytecode and go" systems like Python, PHP, Perl, Ruby 1.9+. Their bytecode is effectively threaded code (opcodes can cause very complicated things to occur), so they are closer to ASTs than machine code. Then there are the JIT bytecode interpreters I mentioned above. the factor of 20 pure AST interpreter is one bookend and machine code is the other. In the middle there are many variants, each with advantages and disadvantages. "

" One clear dividing line between interpreter and compiler is precomputed addresses or frame offsets for symbols. In a "pure" interpreter there are none. So adding 4 numbers requires 4 lookups in the runtime environment, normally a hash table - at least 100 instructions. In good compiled code, adding 4 integers on an x86 needs 2 instructions and one more to store the result. " -- http://stackoverflow.com/questions/20674854/advantages-of-compiling-a-language-vs-executing-the-ast-as-soon-as-it-is-constru

so if we really want subexpressions, we may want an AST, which means that perhaps we are coming to the end of an 'assembly'-like Oot Assembly (otoh, above we saw that we might want to embed mini-ASTs, for expressions, inside imperative linear sequences of instructions; i think that's what Python's AST does). Some features of an 'assembly' seem to be:

Python's AST (example, modified, from http://eli.thegreenplace.net/2009/11/28/python-internals-working-with-python-asts/ ):

In [1]: import ast

In [2]: node = ast.parse("""
   ...: favs = ['berry', 'apple']
   ...: name = 'peter'
   ...:
   ...: for item in favs:
   ...:     print '%s likes %s' % (name, item)
   ...: """)

In [3]: ast.dump(node)
Out[3]: "Module(body=[Assign(targets=[Name(id='favs', ctx=Store())], value=List(elts=[Str(s='berry'), Str(s='apple')], ctx=Load())), Assign(targets=[Name(id='name', ctx=Store())], value=Str(s='peter')), For(target=Name(id='item', ctx=Store()), iter=Name(id='favs', ctx=Load()), body=[Print(dest=None, values=[BinOp(left=Str(s='%s likes %s'), op=Mod(), right=Tuple(elts=[Name(id='name', ctx=Load()), Name(id='item', ctx=Load())], ctx=Load()))], nl=True)], orelse=[])])"

this isn't to say that we can't also have bytecode, for efficiency and ease of porting. But that's an implementation decision. Here, recall, we have four motivations (grep for 'four motivations' above to get the long version):

1) learn the basics of the (imperative) assembly computation paradigm
2) learn the 'flavor' of assembly
3) work 'from the bottom up' to create Oot Core
4) a portable target language, which could be efficiently run in a VM/interpreter if necessary
4a) minimal primitives
4b) minimal loss of higher-level intent
4c) cross-platform preprocessing and optimization should have been done already

how minimalistic is a typical scripting language AST? Here's Python's:

http://svn.python.org/projects/python/tags/r32b1/Parser/Python.asdl

also, regarding the above on hash table lookup to resolve names in interpreters: if someone does "bob.hisarray[2].c" we certainly do want to be at least theoretically able to produce a compiled version that does not look up hash tables at runtime, unless one of bob, bob.hisarray, or bob.hisarray[2].c is known to be possibly dynamically modified at runtime; we also want to statically check for type errors in this lookup, e.g. to statically know whether bob.hisarray exists and if its elements have a '.c' subelement, again unless these things are known to be special-case dynamics.

also, if we go to an AST, i'm not sure that there's any reason to have a fixed max number of operands per primitive (children of each AST node)

---

https://en.wikipedia.org/wiki/Lisp_machine#Technical_overview

" The processor did not run Lisp directly, but was a stack machine with instructions optimized for compiled Lisp. The early Lisp Machines used microcode to provide the instruction set. For several operations type checking and dispatching was done in hardware at runtime. There was for example only a single addition operation that could be used with various numeric types (integer, float, rational and complex numbers). The result was a very compact compiled representation of Lisp code.

The following example uses a function that counts the number of elements of a list for which a predicate returns 'true'.

(defun example-count (predicate list) (let ((count 0)) (dolist (i list count) (when (funcall predicate i) (incf count)))))

The disassembled machine code for above function (for the Ivory microprocessor from Symbolics):

Command: (disassemble (compile #'example-count))

  0  ENTRY: 2 REQUIRED, 0 OPTIONAL      ;Creating PREDICATE and LIST
  2  PUSH 0                             ;Creating COUNT
  3  PUSH FP|3                          ;LIST 
  4  PUSH NIL                           ;Creating I
  5  BRANCH 15
  6  SET-TO-CDR-PUSH-CAR FP|5
  7  SET-SP-TO-ADDRESS-SAVE-TOS SP|-1
 10  START-CALL FP|2                    ;PREDICATE 
 11  PUSH FP|6                          ;I 
 12  FINISH-CALL-1-VALUE
 13  BRANCH-FALSE 15
 14  INCREMENT FP|4                     ;COUNT 
 15  ENDP FP|5
 16  BRANCH-FALSE 6
 17  SET-SP-TO-ADDRESS SP|-2
 20  RETURN-SINGLE-STACK

"

"

Optimizations typical of Lisp machines include:

" -- http://www.faqs.org/faqs/lisp-faq/part2/section-20.html

http://www.sts.tu-harburg.de/~r.f.moeller/symbolics-info/development-environment/index.html

" In the early 1970’s several groups of researchers utilized two novel hardware technologies to improve the efficiency of Lisp: tagged architectures and micro- programming. Tagged architectures store type information with each data word, tracking dynamically changing types (a prominent feature of the Lisp language) with essentially no overhead. Micro-program- ming allows a simpler compiler by imple- menting complex instructions that more closely match the high-level semantics of the Lisp language. Experimenting with these techniques eventually led to commer- cial introductions of “Lisp Machines

Other novel features of these machines include:

• The “stack cache”— Execution of Lisp is stack-oriented. By caching the top elements of the stack, efficiency approaching that of register-based execution is achieved without complex register allocation algorithms in the compiler. The stack cache also supports fast function call and return, as if the function arguments and values are passed in registers.
• Garbage-collection support— Lisp’s storage management system is often implemented as a “garbage collector”, which automatically reclaims unused objects. The Lisp Machine virtual memory hardware, in concert with the type tags, tracks object-reference loads and stores so the software can determine quickly which objects are in use and efficiently reclaim those that are not.
• Instruction emulation— When the Lisp Machine hardware encounters an exceptional situation (for example, an integer arithmetic operation that exceeds the hardware imposed implementation limit or an operation on a software-defined type) the hardware traps out to a software “emulator” subroutine. This subroutine implements the full semantics of the operation, as specified by Lisp, as though it is being handled by the machine instruction.

Early Lisp Machines implemented their micro-programmed architectures with a writable control store, which meant the instruction set, and to a certain extent other architectural features of the machine, could be changed by simply writing, compiling, and loading new micro-code. This flexibility led to further experimentation and evolution of the hardware support for Lisp. ...

Today, most RISC hardware architectures support either large register sets or register windows, which bear great similarity to a stack cache. Close examination of some of the newest RISC architectures will reveal support for tagged data, to efficiently implement generic arithmetic operations. Architecture research papers continue to evaluate the merits of read and write “barriers”, pioneered by Lisp Machine garbage collectors to track object references and speed automatic storage reclamation, and “fast traps” to allow expeditious handling of exceptional conditions in a manner similar to the Lisp Machine instruction emulation. " -- http://www.sts.tu-harburg.de/~r.f.moeller/symbolics-info/literature/LispM.pdf

" Why the Need for Machines Optimized for LISP? In Tom Knight’s thesis, he points out three primary reasons that motivated his efforts, starting in 1974, to develop the original Lisp machines at the MIT AI Lab. He asserts in his introduction that the main difficulties are: • Inadequate virtual address space for large user programs. • Inadequate computing power for development if intelligent programming tools. • Inefficient information coding of compiled instructions.

...

Design of the LISP Machine
The primary design criterion for the Lisp machine was to run Lisp quickly enough to allow efficient development of AI programs. Russell Noftsker asserts that a developer would be 10 times as productive on the Symbolics platform as he or she would be on an alternative platform. An extended list of the design criteria, according to the Symbolics Technical Overview, includes each of the following tenets:
• Fast Lisp execution
• Dedicated personal computer and console
• Tagged architecture (run-time data-type checking and generic instructions)
• Virtual memory
• Integrated local area network
• Interactive, high-resolution, bit-mapped graphics
Conspicuously absent from this list is time-sharing, which was considered at the time to be a necessity for any successful computing environment. Tom Knight and Richard Greenblatt did not believe in making a time-sharing system (although the system did support an interesting alternative scheme of multiplexing between users that will be discussed later).

...

Core Instruction Set and Microcoded Extensions
From a hardware perspective, the instructions of the Lisp processor architecture are chosen carefully to make certain functionalities easy for the compiler to implement. The core instruction set of the Lisp processor is augmented by a microcoding capability that allows macros for commonly used Lisp functions to run extraordinarily efficiently by downloading macros for the commonly used Lisp functions into a cache-like structure on the processor. Together, the core instruction set and the microcoded functions effectively implement Lisp in hardware. One example of a functionality that uses these codes effectively is the ephemeral-object garbage collector written by David A. Moon at Symbolics (well after the original Lisp machines were made). While most garbage collectors cause noticeable pauses in program execution, Moon’s garbage collector was so cleverly written (according to Russell Noftsker) that it would actually cause the system to run faster with garbage collection than it would without garbage collection. This garbage collection scheme was facilitated by the design of the instruction set, which included instructions that would only be used by a garbage collector. (Interestingly, Symbolics at one time was working with Intel to build a development platform based on the 386, which led to the inclusion of an extra instruction in the final 386 architecture to facilitate garbage collection).
Data Tags
The Lisp machines dedicated certain bits of the architecture as data-type bits. In the original 32-bit architecture, 4 bits of every data chunk would indicate the type of the data in the other 28 bits. This is ideal for Lisp because Lisp does not have strong typecasting like C, Fortran, or Cobol do. In a list representation, a pointer to data can point to any type of data (either numerical, ascii, or perhaps another list). This eliminates the need for data type declarations in programs and also catches bugs at runtime, which dramatically improves system reliability. Data tags represent a significant advance beyond the state of the art.
Virtual Memory
Virtual memory addressing allows the machine to run larger programs while abstracting memory issues from the programmer. This feature is essential to writing large programs such as expert systems.
Error Correction Codes
With error correction codes (ECC), memory or disk corruption could be detected and fixed at runtime. This is essential for building a large, reliable program in an environment where physical matters (such as floods or flying bullets) could affect the hardware. In addition, memory and disks were inherently unreliable at that time and therefore ECC was highly desirable even at the cost of 4 or 8 extra bits for every 32, 36, or 40 bit word (depending on the generation of Lisp machine).

"

-- http://www.sts.tu-harburg.de/~r.f.moeller/symbolics-info/Symbolics.pdf

---

separate opcodes for CPY (copy, by which we mean copy for unboxed small values, and copy-on-write for references/structs) and MOV (move, and guarantee that the old version is no longer accessible; e.g. useful for unique pointers)
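a sketch of the intended difference in semantics (illustrative only; the environment model and names here are assumptions):

    # CPY copies small unboxed values outright (and would share structs
    # copy-on-write); MOV transfers ownership so the old name is gone.
    def cpy(env, dst, src):
        env[dst] = env[src]          # for a struct this would be a COW reference, not a deep copy

    def mov(env, dst, src):
        env[dst] = env.pop(src)      # old location no longer accessible (unique-pointer style)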

---

so here's the design rationale for the current instruction encoding proposal:

goals:

1) learn the basics of the (imperative) assembly computation paradigm
2) learn the 'flavor' of assembly
3) work 'from the bottom up' to create Oot Core
4) a portable target language, which could be efficiently run in a VM/interpreter if necessary
4a) minimal primitives
4b) minimal loss of higher-level intent
4c) cross-platform preprocessing and optimization should have been done already

therefore:

numberology aesthetic decisions:

practically:

aesthetically:

therefore, numerologically:

Bit field width guidelines:

therefore:

in sum:

64-bit 'instruction words'

'instruction words' may be:

which type of instruction word this one is can be determined by the first 16 bits (maybe 1 bit says whether this is an 'actual instruction' or something else? e.g. 2^15 opcodes, and 2^15 potential other things? maybe 1 bit of the other things says whether it is a constant or something else, allowing 14 bits of this field to be used as the payload in constants? or maybe the other way around, allowing 15 bits for the payload in constants?)

'actual instructions' are divided into 4 16-bit fields: opcode, src1, src2, dest. Note that the opcode field is also the same 16-bits that is used to discriminate instruction word types; so some opcodes are reserved for non-actual-instructions.

3 bits of every actual-instruction field (INCLUDING the opcode field) are reserved: 2 addressing mode bits, 1 emulation mode bit. This leaves 13 bits for data (2^13 = 8192 values, i.e. up to 8191; but see below, where this shrinks to 12 bits).

if the maximum value (excluding the addressing mode bits) appears in any immediate-addressed field of an 'actual instruction', then this means that you must look after that instruction to find a constant, treated as an instruction suffix (although the constant may also be independently referred to by the 'loadk' instruction). If multiple operands of the same instruction indicate suffix constants, the suffix constants are placed in this order: opcode, src1, src2, dest.

the addressing modes are: immediate, direct, indirect, special.

special addressing mode includes operations on 4 stacks, memory-mapped as follows: for each of the 4 stacks, you can address the first-item-of-stack (top-of-stack) or the second-item-of-stack; for each stack and each item, there is one location for direct read/write and one for push/pop (read vs. write, and push vs. pop, are determined by whether the operand is a source or a destination). So, 16 special locations corresponding to these 4*2*2 combinations. (what is the rest of special addressing mode mapped to?)
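one possible concrete numbering of those 16 locations (just an assumption; nothing above fixes the layout):

    # Sketch of a layout for the 16 stack-related special locations:
    # stack in 0..3, item in 0..1 (top, second), pushpop: False = direct read/write,
    # True = push/pop. Yields addresses 0..15.
    def special_stack_address(stack, item, pushpop):
        return (stack << 2) | (item << 1) | int(pushpop)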

annotation instruction words include point annotations, begin-region annotations, end-region annotations, and beginend-region annotations

constants are either single-instruction-word constants (64-16 = 48 data bits long), or constant-header constants (whose payload is 48 bits of constant length info; if it is the max value then another constant-header comes after the end, to indicate the length of the next segment), or constant-terminator constants (no payload? or the final value? combine this with single-instruction-word constants?), or constant-payload constants (48 bits of constant info in the middle of the constant? or increase this by allocating many opcodes for constant-payloads?). note: why not unaltered constants? that seems preferable, actually.. but maybe not, as that would destroy the ability to plop down in some random place in the code and start deciphering it.. we don't really want people putting large binary blobs in the middle of their code.. this is just for theoretical purity, to allow for the theoretical possibility of immediate values of unbounded magnitude.. we could adulterate this somewhat by allowing baked-in mechanisms for constants of up to some multiple of instruction word size, but not larger; but this tiny savings in space probably isn't worth the complexity, because practically speaking, how often do you need to encode a number larger than 32 bits into an operand in immediate addressing? so, no unaltered constants.

note: another approach would be to really memory-map the stack guys; this would allow one-instruction popping something from a stack to use in indirect addressing; the downside is that it would require us to have a 'low memory' area and do arithmetic for other memory accesses. We could either reserve a small number of addresses (eg 32), or we could reserve half the memory for specials (leaving us with 4k immediately addressable items instead of 8k). We would then also have a left over addressing mode to use later (maybe make it custom? or reserved? or displacement?).

yeah, i'm leaning towards actually memory mapping the specials, using up another bit to switch between actual memory and special addresses. So we have only 4k of immediately addressable memory.

so: each 16-bit field is composed of 4 bits of addressing, and 12 bits of data.

(is this crazy? no; eg ARM has 12 bit immediate operands, and then use shifts to bring them up to 32 bits: http://alisdair.mcdiarmid.org/2014/01/12/arm-immediate-value-encoding.html ; it does this in one instruction, but we can do the same thing with multiple instructions, which isn't so bad; although realistically since we have suffix constants we'd just use those instead)
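a sketch of decoding an instruction word under this layout (the exact bit positions, putting the opcode field in the most significant 16 bits, and the name of the fourth addressing mode are all assumptions):

    # Sketch of pulling apart a 64-bit instruction word: four 16-bit fields
    # (opcode, src1, src2, dest), each with 2 addressing-mode bits, 1 emulation
    # bit, 1 special-vs-memory bit, and 12 data bits.
    MODES = ['immediate', 'direct', 'indirect', 'reserved']  # 4th mode left over, as noted above

    def decode_field(field16):
        return {
            'mode': MODES[field16 & 0b11],
            'emulation': bool((field16 >> 2) & 1),
            'special': bool((field16 >> 3) & 1),   # special address vs. ordinary memory
            'data': field16 >> 4,                  # 12 bits, 0..4095
        }

    def decode_instruction(word64):
        fields = [(word64 >> shift) & 0xFFFF for shift in (48, 32, 16, 0)]
        return dict(zip(('opcode', 'src1', 'src2', 'dest'), map(decode_field, fields)))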

hmm, maybe we could use most of the 'special' memory area to put a jump table for the 'custom' opcodes. If we have 12 bits (after the 2 addressing mode bits, the 1 emulation mode bit, the 1 'special or actual memory' bit), that's about the same as the situation with opcodes (2 addressing mode bits, 1 emulation mode bit, 1 'opcode or not' bit). So we have 4k special memory locations and 4k opcodes. So we could say the top 2k opcodes are custom, and reserve the bottom 2k opcodes. And we could use the bottom 2k special memory locations for whatever (16 locs for the stack, a bunch left), and then say the top 2k special memory locations is a jump table for the 2k custom opcodes. Is that a crazy number? By my count, (see [3]), there's about 678 x86 instructions, so 2k reserved instructions is comfortable.

this would seem to make sense as the stack addresses are really more 'addressy' than they are a special kind of addressing mode. and as i said earlier, if we do this, we even have room for a custom addressing mode.

i like that.


todo: should probably check out http://docs.tendra.org/reference/xhtml/guide/ , also the various target languages in [4]

---

might be useful to have Ethereum-like 'gas', 'msize', 'call' (with gas parameter) opcodes

---

instead of having self-sufficient type-specific instructions eg ADD_UINT8, could make everything dynamically typed and polymorphic by default, and then have a type annotation prefix when you want to operate on non-dynamic (statically typed, unboxed) values. Eg:

c = ADD(a,b)                              -- a and b are boxed values with runtime dynamic typing
c :: uint8 = ADD(a :: uint8, b :: uint8)  -- a, b, and c are unboxed uint8 values; the operation is really ADD_UINT8
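a sketch of the dispatch this implies (illustrative; the names ADD_BOXED/ADD_UINT8, and the idea that a TYPE-PREFIX simply names the operand type, are assumptions):

    # Untyped operations fall back to a dynamically typed, boxed ADD; a type
    # annotation selects a specialized, unboxed variant.
    SPECIALIZED = {('ADD', 'uint8'): 'ADD_UINT8',
                   ('ADD', None): 'ADD_BOXED'}     # dynamic/boxed default

    def select_op(opcode, type_prefix=None):
        return SPECIALIZED[(opcode, type_prefix)]

    # c = ADD(a, b)                      -> select_op('ADD')          == 'ADD_BOXED'
    # c::uint8 = ADD(a::uint8, b::uint8) -> select_op('ADD', 'uint8') == 'ADD_UINT8'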

issues with this:

other ways to go would be:

i am torn here. On the one hand, i don't want to specify e.g. which types of number representations ADD must work on in the base system; e.g. i don't want to say 'every Oot Assembly implementation must have an ADD_UINT8'; one implementation might implement only ADD_UINT8, and another might implement only an arbitrary-precision ADD, and in each case Oot libraries will emulate the other if the program calls for it; and further, i don't want arbitrary assignment of opcodes to UINT8, SINT256, Stwos_complementINT64, SbiasedINT128, etc., i'd rather have a regular scheme, in which case why not just have a type number. On the other hand, it seems to me that many language implementations (e.g. GHC, i think) have an intermediate target language in which all polymorphic functions that can be specialized already are (e.g. no need to do a complex lookup from opcode*type to actual function number), and it seems that something called Oot ASSEMBLY should be able to serve that function.

maybe the answer is:

---

potential format for type annotations:

---

might want to lookup 'typed assembly language' again

https://www.google.com/search?q=typed+assembly+language+instruction+encoding https://www.google.com/search?q=typed+assembly+language

http://www.cis.upenn.edu/~stevez/papers/MCGG99.pdf (TALx86; this looks like a good one; although we don't want to burden Oot Assembly with a typing system that expresses the schema of complex data types (i'm not sure if TALx86 does that but it might))

haven't skimmed these yet:

http://www.cs.princeton.edu/~appel/papers/sftal.pdf http://adam.chlipala.net/berkeley/classes/cs263.pdf http://www.cs.cmu.edu/~crary/papers/2003/talt/talt.pdf http://www.cs.cornell.edu/talc/papers/tal-popl.pdf

---

should we specify encoding endianness? that is, do we specify in terms of actual format of bits on the wire, or in terms of the operations to be performed in C? We could follow Lua, which according to http://luaforge.net/docman/83/98/ANoFrillsIntroToLua51VMInstructions.pdf specifies endianness in binary storage formats but not after information is read into memory ("In binary chunks, endianness is significant, but while in memory, an instruction can be portably decoded or encoded in C using the usual integer shift and mask operations").
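a sketch of that approach (assuming, arbitrarily, little-endian for the serialized chunk; once the words are in memory as integers, decoding uses only shifts and masks, as in the other sketches above, so host byte order never matters):

    # Pin down byte order only for the serialized binary chunk; everything after
    # this point is plain integer arithmetic.
    import struct

    def read_instruction_words(chunk_bytes):
        count = len(chunk_bytes) // 8
        # '<Q' = little-endian unsigned 64-bit instruction words
        return list(struct.unpack('<%dQ' % count, chunk_bytes[:count * 8]))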

---

COW copy-on-write opcode

---

now that i've thought through this, though, it may be better to use some existing 'assembly language' intermediate language, such as LLVM's ( http://www.aosabook.org/en/llvm.html ), instead of rolling my own. Not sure though.. it still seems attractive to choose my own set of primitives, esp. since i will want interoperability with so many other languages

---

could always forget quints and go to n-tuples + 3 (opcode, output, annotation, n-tuple of arguments); and, in fact, instead of an n-tuple, we could have a dict, to allow for named and default arguments: (opcode, output, annotation, argument dictionary)