Bayle Shanks's website: proj-oot-ootAssemblyNotes5

how does our role system map to natural language parts of speech and phrasal categories?

verb (phrases) are role 0 (operations)
noun (phrases) are roles 1 and 2 (role 1 is object or indirect object, role 2 is subject)
pronouns are register references within noun phrases; a bare subrole 1 or subrole 2 payload would be a noun
we are still lacking conjuctions, of both types
determiners would be treated like adjectives
interjections would be meta-syntactic 'shortcuts', eg packed representations, which we are considering for role 6
we don't have particles as such; we do have adverbs. Although perhaps our role 7 length headers could be considered particles, in that they have a purely syntactic/meta/grammatical purpose.
adpositions are modifier phrases (role 3)
adverbs are modifiers (role 3)
adjectives are modifiers within noun phrases (subrole 3 within roles 1 and 2)

so what does this suggest we add?

an additional role to separate modality modifiers (things that change https://en.wikipedia.org/wiki/Grammatical_category such as mood modifiers that indicate eg hypothetical mood (another example: a lazy mode for an operation) from 'filtering constraint' modifiers such as 'yellow' in "Move the yellow ball over there".
a role for conjunctions, eg "increment (every_element_of-and: [a,b,c])" and "if (any-or: [a,b,c]) then"
indirect objects

note: we also have parens and quoting in role 7; i guess these aren't considered natural language lexicals because they mainly appear in writing (but you can speak "Jane said, quote, i love him, end quote"), but they have are parsed as tokens as if they are separate words.

so adding the first two of those, our roles currently are:

0: verbs on the top level; subrole 0 is the 'head' of whatever phrase it is in
1: object or indirect object (noun phrase)
2: subject (noun phrase)
3: filtering constraint modifier
4: modality modifier
5: conjunction
6: packed representation/alternate form for this 64-bit packet
7: things relating to combining multiple 64-bit packets into one sentence (sentence length, parens, quoting, etc)

note on filtering constraint vs modality:

i mean modality more broadly than it is used in linguistics; tense, aspect, mood, adverbs, and prepositional phrases are all 'modality' modifiers here. In my scheme, tense, aspect and adverbs modify the verb, mood modifiers the whole sentence.

Adjectives are examples of filtering constraints modifying a noun, and prepositional phrases are filtering constraints modifing a whole sentence (or a noun; 'He gave the bone to the dog in cage #3').

But is there modality for nouns and filtering constraints for verbs? If not, maybe we don't really need two roles for these.

if we give packed repr its own role like this, and, like role 7, give it special meaning (its primary meaning) when it is used at the beginning of a 64-bit packet, now we can use the payload of the role 6 16-bit word for something. I propose we use it as 8 bits of modality. Then the 48 remaining bits are for (8-bit opcode + 3x4-bit operands) + (8-bit opcode) + (8-bit opcode + 3x4-bit operands), as before.

note: we may not want to actually define the packed reprs yet, because they are an optimization. Just define them as things that consist of a defined 8-bit role 6 header followed by an opaque 64-bit packet that can be statically peephole-translated into other sentence forms.

on addressing modes:

note that 'direct with offset' is an uncommon mode, the common one is 'indirect with offset':

"MIPS does not provide a lot of choice when it comes to addressing modes for load and store operations. None at all, really. The only addressing mode supported by MIPS is address register indirect with offset. This, of course, is the same as ARMs basic register indirect mode, although variations such as incrementing the register are not allowed." an example is given of (paraphrased) 'Load r2 from memory pointed at by (r3 + 4)'

so eg we dont need a split mode for 'direct with offset' eg 'register (3 + r1)' (could be written 'r(3 + r1)'); we only need 'the memory pointed at by (r1 + 3)' (and i guess if you have memory-mapped registers, these would be the same thing). And in our model, + is replaced by .__GET, so this becomes 'r1.__GET(3)' (equivalent to r1.3, r1[3], etc).

note that, to unify structs and hashtables, you should be able to address r3["string"], not just r3[3]; also the case of structs, like r3.field2, are replaced by structs with integer-enumerated fields, like r3.2, and further, this is unified with hashtables as a special case where the lookup key is an integer and the keys are statically known; and the implementation use of struct instead of a hashtable is just an optimization when the compiler detects this special case (note that recognizing that only a fixed set of integers are ever gotten from certain tables may require some dependent typing, but probably an easier-than-usual special case of it, because since this is just a shortcut for structs, it is much like nondependent typing)

now, what i mean by you should be able to address r3["string"]; you should be able to use some LEA-like instruction to get the 'effective address' of the expression r3["string"] and then store it and use it later in any place where an address is accepted, in a similar way to how in traditional/'real'/hardware assembly language you can get the address of r3[2] by taking the indirect index of the object in r3 with the offset 2.

this implies two things:

this is somewhat like memory addresses in that everything (the local variables, and also any part of any object on the heap) is in a single 'address space'
but the semantics of 'indirect with offset' are different because we are using it to do a .__GET rather than pointer arithmetic. With hardware assembly indirect with offset, if r3 = &x[33], then &x[36] can be referenced by doing indirect-with-offset on register r3 with offset 3. However, in our case, that would refer to x[33][2].

note that in the previous the notation was a little confusing; in hardware assembly, "r3 = &x[33]" makes sense because a pointer to x[33] is in r3, and operations on r3 would affect that pointer. But in Oot Assembly, the corresponding situation is for r3 to be thought of as containing x[33] itself, and operations on it would affect x[33], not a pointer to x[33]. So r3 = x[33]; this is a statement of equivalence (==), not an assignment statement, but it is pointer equivalence, not structural (value) equivalence.

Furthermore, these semantics are confusing when it comes to unboxed values. If r3 = 9, and then we add 1 to r3, what does that mean? It doesn't mean that we alter the meaning of "9" systemwide. r3 = 9 means there is a single copy of the value 9 in r3, and mutations to it affect that copy, not some system-wide thing.

If we were to run "MOV r3 to r4", this would literally mean 'take what is in r3 out of r3 and place it into r4'. r3 then becomes undef (at least as far as the type system is concerned).

If we were to run "CPY r3 to r4", this would literally mean 'make a (non-aliased) (deep) copy of the value that is in r3 and place it into r4'. CPY would be implemented by the VM as COW (copy-on-write) under the covers. Further writes to r4 would not change the value in r3.

If we were to run "ALIAS r4 to r3", this would literally mean 'into r4, place a symlink that links to r3". Further writes to r4 would change the value in r3.

So what if we say "ALIAS r3 to x[33]; MOV 3 to r3"; does this mean x[33] = 3? I think it does. If we want to look at the alias itself, we use a meta-command, like ALIAS or GETALIAS. Or maybe ALIAS is just SET with a meta-modifier (or 'alias' modifier), and GETALIAS is just GET with that same modifier.

So what if we say "ALIAS r3 to x[33]; ALIAS r4 to r3; ALIAS r3 to y[2]"; do changes to r4 now affect x[33], or y[2]? Todo.

So it might be easier to think in terms of two assignment operators, "=" for values and "=&" for aliases (symlinks), and two equality operators, '==' for structural equality and '=&' for pointer equality, in a higher level language.

tangentially, note that since we are also unifying syntax for not just structs and hashes/arrays, but also function calling, we can just write x[33] as x 33, which is the same as x.__get(33).

now what are the possible split modes (where x y = x.__get(y))? taking every permutation of the four basic addr modes, we'd have 4*4 = 16 possibilities from 00 to 33: 00: i1 i2 01: i1 k2 02: i1 r2 03: i1 *r2 10: k1 i2 11: k1 k2 12: k1 r2 13: k1 *r2 20: r1 i2 21: r1 k2 22: r1 r2 23: r1 *r2 30: *r1 i2 31: *r1 k2 32: *r1 r2 33: *r1 *r2

recall however that our registers arent different from our memory. So is r1 and r2 identical? No, we can take the 'r's (direct addressing) to represent meta-addressing the aliases, and the '*r's to represent addressing the values

Now let's write all those in terms of array indexing to make their function clearer:

00: i1[i2] 01: i1[k2] 02: i1[r2] 03: i1[*r2] 10: k1[i2] 11: k1[k2] 12: k1[r2] 13: k1[*r2] 20: r1[i2] 21: r1[k2] 22: r1[r2] 23: r1[*r2] 30: *r1[i2] 31: *r1[k2] 32: [*r1 r2] 33: [*r1 *r2]

an aside on aliases:

The idea of 'alias' and 'value' addressing is that each memory cell has two 'levels'; the 'alias' level, which might contain a pointer (or actually, maybe an entire 'alias record', which is a pointer with some attached metadata), or nothing is the cell is not a symlink but an actual value; and the 'value' level, which in the case of non-symlinked cells contains the contents of the cells, and which in the case of the symlinked cells looks like it contains the contents of the symlink target. So when you use value addressing on a symlink cell (that is, one with a non-empty alias level), you are really manipulating the cell that is the symlink target; when you use alias addressing on this cell, you are reading or mutating its symlink record.

In order to CREATE an alias, a third addressing mode is needed, because you want to take the 'address' of a cell, and place it into the alias level of a different cell. Instead of 2 different alias-and-addressing relating modes, we could have some special instructions. Note that an 'address-of' mode could not be written to, only read, so maybe that one should be omitted?

We also need a sentinel to represent an empty alias cell. We can use 0, which would mean that the PC cannot be aliased, which is no great loss.

Also, is this 'alias level' identical to the 'meta level' above a 'perspective' (view), or do we need yet another mode for that? i'm guessing it's a field within the metadata in the meta level. But mb the meta level associated with an address-of the cell, not of the value contained at that address.

so, back to those split addressing modes. Which of these are the most useful? Regardless of whether we use one addressing mode to represent the address-of or to represent accesses to the 'alias level', it won't be very common to apply .__GET (to index into) these things; addresses themselves can't be indexed into (although we COULD allow straight-out pointer arithmetic addition when you try to do that :) ), and if you want to index into metadata attached to an alias record, you can always alias it to the value level of another cell first.

so let's eliminate all of the 'r's above. but, to make things easier to read and write, instead let's just get rid of the *s and just remember that we are dealing only with the value level here, not the address-of or alias levels (see below; i like address-of better).

so we have

00: i1[i2] 01: i1[k2] 02: i1[r2] 10: k1[i2] 11: k1[k2] 12: k1[r2] 20: r1[i2] 21: r1[k2] 22: r1[r2]

now, i1[i2], i1[k2], and i1[r2] don't make much sense, because integers cannot be indexed into. k1[i2], and k1[k2] are not necessary, because they could have been statically computed and added to the constant table.

So we only have 4 modes left:

k1[r2] r1[i2] r1[k2] r1[r2]

so these are the split addressing modes. A little thought shows examples when each of them would be useful:

k1[r2]: let k1 = ARGV. ARGV[r2]. r1[i2]: let r1 = x. x.3 r1[k2]: let r1 = the sin fn, let k2 = pi. sin(pi) (recall that f(x) = f[x] = f x) r1[r2]: let r1 = x. x[r2]

if we used an 'alias' addressing mode, and a special instruction/opcode for GET_ADDRESS_IDENTIFIER and CREATE_ALIAS then:

If we wanted to look at or manipulate metainformation in the alias record, we would first move the alias record itself into an ordinary value cell, by using issuing a CREATE_ALIAS instruction whose input had alias addressing (r) and whose destination had value addressing (*r) (this create an alias from the destination cell to the alias record which is controlling the input cell; if we wanted to alias a destination cell to the same symlink target as a source cell, we would instead do CREATE_ALIAS with value addressing in the input (the more typical case). (note: if this is what we're doing, tho, then 'alias' addressing in CREATE_ALIAS's output is useless, which seems like a waste). Note: this is irregular in that the CREATE_ALIAS and GET_ADDRESS_IDENTIFIER opcodes would interpret value addressing different from every other opcode.

A more regular solution would be to use an address-of addressing mode, and to have GETALIAS and SETALIAS operations. To create an alias from r4 to r3, you would SETALIAS(input= address-of(r3), destination= address-of(r4)). To create an alias from r4 to whatever r3 is directly aliased to, GETALIAS(input=address-of(r3), destination=r1); SETALIAS(input= r1, destination= address-of(r4)). Etc. I like this better.

A downside there is that address-of can't go in the destination operand. But i guess that's good, it means we have a spare 1/2 addressing mode. hmm, mb too clever, but we could use that in place of SETALIAS... presumably SETALIAS will be somewhat common, but GETALIAS will not. to clear a symlink, we can copy 0 in using this addressing mode

Note that SETALIAS is a special case of setting .__GET and .__SET.

Note that due to perspectives (views), if aaa is an array, even though most likely aaa is represented in memory as a struct containing two fields, a length field, and a pointer to data in the heap, the standard view of an array would just be the data itself.

Eg if r3 'contained' an array, and you wanted the third element of the array, you wouldn't do:

r3.1[2]

you'd just do

r3[2]

if you wanted to access the length field directly, you'd use something more like

GETMETA(r3).VIEWS."whatever-the-name-is-of-a-view-with-a-length-field-and-a-data-field".0

---

need 'load' and 'store' instrs to do dereferencing since we no longer have dereferencing built into the addressing mode (instead we have a C-like & operator)

we dont actually need loadstore b/c could always put an address that we have into a cell's alias level and then read/write to that cell's value

i guess thats what this system gives you; like python, you can have local variable represneting structs that you address directly like 'x', not like '*x', but you can also alias two locals togeether, eg changing x can change y.

---

in oot impure (including aliased, except for views) variables and fns would have an & to their left. vars without & are just values (from the point of view of this statespace mask). does this mean we have a non-cell address space , too? mb not b/c of statespace masking. but otoh dont want to have to do indirection upon every memory access, right, or worse actually traverse a chain of symlinks?

this is why i wanted an addressing mode to override getters and setters, as opposed to having an address-of mode. so we have 3 contending modes; addressof, override/nonmeta, and alias level/meta level. i had conceived of the overrride and meta and alias level as similar, the level at which you bypass the getters and setters (or assign to them), because there you can directly access the obj representation w/o going thru getters and setters. but i guess if u wanted to use that to touch the data itself, youd have to do 'x.0' (where x refers to x's meta record b/c we're in meta addressing mode). so you have to use split mode/offset just to get to the data, so u cant also use the offset to index into the data.

this seems more common to me than address_of. so if this were the other addr mode, then the code would just say '3, direct mode' when it wants to directly access 3, and '3, indirect mode' when it wants to access a potential symlink; the compiler knows which address corresponds to which local varname, and doesnt need to look it up at runtime with address-of. or, address_of can be special opcodes. in an HLL addr-of could still be something like &, like in C (or &&, since we've already used & as a warning sigil for impurity or aliasing)

the question is, when r3 is a potential symlink, what does 'direct mode' yield? i am tempted to say it should yield the alias record.

also, instead of having a 'symlink or content (terminator)' flag in the record, the alias record could just be the payload with an addressing mode. if this mode is itself indirect, which it would always be if this is an actual and not just a potential symlink, then a chain of links could be followed (like the nested indirect addr mode on some old ISAs)

but now what if you want to access the META of an object to which you only have a symlink? if direct mode is for both aliasing and symlinks, you cant do this w/o implementing ur own symlink chain walkeer, which seems silly.

another possibility is that the direct mode does access the meta node after going thru symlinks, but the meta node is dynamically constructed and has fields that tell you about each link on the chain you took to get here this time. (but see below: this would mean we wouldnt have an efficient unboxed mode).

but dont we want to make it fast for the runtime to add a=3; b=4; c = a+b without indirection? with that suggestion you still have to do c+0 = *(a.0) + *(b.0)

well mb. if a and b were known at compiletime, then a.0 and b.0 could be in the constant table. so we dont have to lookup 0 from a. and if a and b hold small unboxable values, and the meta for those nodes are never used, the compiler could secretly not have a meta node for those, and could just store their actual unboxed value on the stack or wherever it is stored.

also remember that even if the meta records cant be all dynamically constructed, we dont have to store the meta record in the local variable table value and the content in a field within this record. we can have a separate 'local var meta' table. we could also have a separate table for pure local vars, and local vars for which we need to store some meta records

so we're contemplating replacing 'direct registers' with 'direct memory' and 'indirect addressing mode' (memory via registers) with 'indirect' (follow symlinks, possibly nested; and/or use __get and __set). theres other kinds of indirection also; lazy thunks/promises, timeouts, etc

IMPORTANT: really, if you need to access meta or read an alias or something, you can use an extended sentence and have a modality modifier in ur noun phrase. so we dont need to worry too much about these special cases; we effectively have lots of addr modes available. we probably just need one default indirect mode with our language defaults (all indirection on), and then also a fast, direct addr modes, unboxed, no symlinks, etc, so that it's possible to write portable relatively efficient tight loops in the self-hosting reference implementation.

so, all indirect, and all direct. the all direct mode is NOT used for metaprogrammy things like directly accessing symlink chains, meta nodes, etc; it's used for straight out jvm-like unboxed arithmetic, with nary a pointer in sight (well i dunno about that, we might still have pointers to some arrays and even boxed structs, but we'll see).

IMPORTANT: we might not want to directly expose the 'internals' of the meta nodes, symlinks etc anyways; we should provide special commands for those so if we decide to return a virtual representation of these, we dont have to trap 'direct' memory accesses to create an illusion of this virtual reepresentation in 'direct' memory.

so we still have the question of what a 'direct' access sees if it tries to access a boxed and/or symlinked value.

IMPORTANT: otoh we know what is boxed and unboxed at compile time.. could just never have direct accesses allowed to potentially boxed or symlinked things. hmm, i like that.

which means, yes, it is a separate addr space. but it doesnt correspond to lack of & identifier prefix sigil in oot. i'm thinking the compiler figures out what can be unboxed, with the possible aid of annotations (can this be done with precondition demands? eg the precondition contract says you can only pass in nonsymlinks with default getters and setters, and furthermore they must be of types that can be unboxed? or do we need to somehow put some constraints on how it can be used that are postcondition demands (which you would think would never be needed but perhaps there are some 'implicit postcondition promises' about how you can use the returned values that would need to be overridden)?)

hmm this might actually be close to something that almost corresponds to &; the part that i can think of that is distinct from & is laziness; because if something were 'strictifiable' (meaning no chance of an infinite loop when you strictify it) with no visible mutable state (effectively pure; the '&' sigil; note that 'effectively pure' excluses aliasing also, because something aliased can be 'mutated') then it can just be strictified into an unboxed value at the time of assignment, even if you have to call a getter to do that.

potential syntax for 'strictifiable' vars ("unlifted" in the haskell terminology; types without "bottom"; operationally, types that cannot be thunks; haskell has unboxed, unlifted; and also boxed, unlifted; and also boxed, lifted types, there cant be any unboxed, lifted types because you would need a boxing struct to hold the thunk reference) not haskell's "data structs that automatically strictify" but rather a Maybe-like construct that forces u to explicitly do "strict assignment" (eg strictify then assign) when u assign to the var.

haskell seems to use a '#' postfix for unlifted types (see https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler/TypeType ). let's use that ourselves. i dunno if it'll mean unboxed, unlifted, or possibly boxed, unlifted for us. this suggests =# (or mb #=) for strictifying assignment. we might also want to use # postfix as a strictification operator in other expressions. eg mb every unlifted type has a # at the end of its name, and mb in ordeer to assign to such a thing from something else, you must do a =# assignment, to let the reader kno that this is where strictification occurs

this also suggests using a# for 'strictify a'. in which case we dont need =#. btw, imo having the sender do this is easier to read then GHC's BangPatterns? (see https://wiki.haskell.org/Seq ) or Haskell's "$!" (strict application) operator (but we should consider the syntax !x instead of x#)

also do we need something similar when we 'freeze' a mutable object by doing a deepcopy of a &var into a purevar? after all a deepcopy might also take a long time, might strictify, etc. hmm think about i/o monads; how does that work? monadic bind. from within an impure IO monad fn you can eg read a keystroke and bind it to a local var, then pass it into a pure fn as if the value were pure.

since we already have &impurevars, i dont think we need further special notation for purevar = &impurevar; the concatenation of "= &" is easy enough to see when scanning/grepping

IMPORTANT: all those intermediate modes between full direct and full indirect are just examples of views

---

some terminology from HTTP2:

The stream is a virtual channel within a connection, which carries bidirectional messages. Each stream has a unique integer identifier (1, 2, ..., N).
The message is a logical HTTP message, such as a request, or response, which consists of one or more frames.
The frame is the smallest unit of communication, which carries a specific type of data—e.g., HTTP headers, payload, and so on.

    HTTP/2 streams, messages and frames "

so perhaps we should call the 16- or 24-bit (or 64-bit) things 'frames' instead of 'words' to match this?

---

if we have regexp * (and regexp ?) in our stack maps, how do we verify that a program has consumed exactly those? one easy way would be to have 'stack markers' which are opaque constants that you push onto the stack (maybe labeled with integers)

---

our current proposal forces you to go thru metaprogrammy GET (and SET) protocol to index into an array using the split addr modes. When the compiler knows for sure that there is no metaprogramming override and knows the array layout in memory, want a way to directly index in without indirection, for high-speed numerics.

i guess one way to do this would be having special opcodes DIRECTGET and DIRECTSET. Another way would be just to have GET and SET with modalities, but this would reduce code density in one place we probably do care about it (numerics inner loops).

---

thinking about upvariables

perhaps each scope must statically state how many new locals n it is using.

then the memory cells from 0-3 are reserved, and from 4 to (n-3) are the new locals. Values above that are the old locals, or lexically bound upvariables, in order. i suppose that if the closure doesn't contain all of the locals used in the enclosing scope, there is no reason that they have to take up 'space' here, eg only bound upvariables need be included, not necessarily all locals in the enclosing scope.

behind the scenes, there is probably a saguaro tree of lexical scope frames, not a single block of all of them concatenated

since the number of bound upvariables in each enclosing lexical scope is known, and since the number of scopes in the tree is finite, we have a statically known bound on all variables of this type.

what about dynamically scoped variables? are these treated as module-level 'globals'? if so, do we then need getglobal/setglobal or are these assumed to be statically known, in which case we can place their locations after the lexical stuff? or do we require the dynamically scoped variables to ALSO be lexically bound, eg at the module level if not elsewhere?

---

hmm we might want to use

(namespace type, namespace, identifier)

for identifiers, instead of just single integers, because we have lexically scoped variables and dynamically scoped variables, and we have multiple enclosing lexical scopes.

Or maybe we a separate addressing mode for each 'namespace type'? Or maybe we limit the number of namespaces of various types (eg 'lexical depth of at most 16') and then have (namespace type x namespace, identifier)?

This would also allow us to be clearer about 'non-archimedian addressing', eg just have a 'heap' namespace type and use 'namespace' to refer to regions of the heap, and then 'identifier' is an address within that region.

note that the way that eg Lua does it is having separate instruction, eg LOADK, GETUPVAL, GETGLOBAL, instead of separate addressing modes or namespace types.

If we use either addressing modes or namespace types, and if we use namespace numbers, this certainly eats up a lot of bits for identifiers.

Eg with 2 namespace types and 2 namespaces per type, that's already all 4 bits that we had in split addressing mode with an 8-bit payload.

so i suppose that we'd say that with 8-bit payloads, we'd always implicitly assume namespace type 0, namespace 0, although we'd probably still map the PC and the stack pointer into that namespace (not sure tho?).

otoh dont we want to unify object attribute lookup, lexically scoped var lookup, dynamically scoped var lookup? in which case each object is its own namespace, so we have a ton of those -- and namespace lookup is just GET

but otoh we want to resolve lexically scoped lookup at compile time

---

hmm i feel like we're overduplicating the same concept on too many levels again... just like i feel like registers and memory locations and stack frame offsets are too many duplicates of the same thing...

what i was trying to do was remove the need for eg LOADK and GETUPVAL and GETGLOBAL, and to unify object GET and array GET and structure GET (and function calling)

but i guess you have to have some "pronouns" eg primitive 'registers' because when you use an expression to specify an effective address to which you want to put the result of something, you need to place that function in a 'register' if you are then going to use a fixed-length instruction to say to put something into that location (but.. i guess the alternative is to have an AST with no fixed length 'instructions', only fixed-length opcodes?)

anyhow, we already have GET, and we have 'noun phrases', all these references to various kinds of namespaces should be done then, right? but they're not.. we have to resolve references to locals and upvals differently, right?

hmmm

i guess Lua's use of LOADK and GETUPVAL and GETGLOBAL do make some sense.. because they allow ordinary local lookups to be really fast (less indirection). Lua also has GETTABLE.

---

i guess we could have the first 16 constants be namespaces such as UPVARS, etc, and use the k[r] split addressing mode to access these quickly

hmm.. then we'd really prefer a k[i] split addressing mode, though, right? to address the first 16 elements of each of the 16 special namespaces?

---

i guess a nice thing about using a stack is that (a) there isnt a limit on how deep expressions can be, unlike our 2-level-deep limit above, and (b) you dont have to have lots of UNDEFs after each statement for one-time-use nouns

so want to encourage use of the data stack (especially) for 'subexpressions' that conceptually form part of one conceptual statement

---

does oot byc (bytecode; oob) really need 2 levels if we have the stack for singleuse pronouns? i guess so; 2 lvls gives us extensible modality and addr modes

Intel SGX security enclave instructions:

http://web.stanford.edu/class/ee380/Abstracts/150415-slides.pdf

EEXIT EGETKEY EREPORT EENTER ERESUME

"We present a new compression-friendly wire format for the LLVM IR. We compiled 3511 real-world programs from the Debian archive and observed that on average we only need 68% of the space used by gzipped bitcode. Our format achieves this by using two simple ideas. First, it separates similar information to different byte streams to make gzip's job easier. Second, it uses relative dominator-scoped indexing for value references to ensure that locally similar basic blocks stay similar even when they are inside completely different functions." -- https://aaltodoc.aalto.fi/handle/123456789/13468

azakai 5 hours ago

Of course, several of the people working on WebAssembly? are actually from the PNaCl? team. And others, like the Emscripten people, have lots of LLVM experience as well. WebAssembly? is being designed while taking into account previous approaches in this field, including of course using LLVM bitcode for it.

LLVM bitcode is simply not a good fit for this:

It is not portable.
It is not designed to be compact.
It is not designed to be quick to decode and start up
It has undefined behavior
It is not sandboxed.
It is a moving target (see recent big changes to LLVM IR on geps and loads/stores)

Each of those can be tackled with a lot of effort. But overall, better results are possible using other approaches.

---

pro-big-endian:

drfuchs 1 day ago

Because big-endian matches how most humans have done it for most of history ("five hundred twenty one" is written "521" or "DXXI", not "125" or "IXXD"). Because the left-most bit in a byte is the high-order bit, so the left-most byte in a word should be the high-order byte. Because ordering two 8-character ascii strings can be done with a single 8-byte integer compare instruction (with the obvious generalizations). Because looking for 0x12345678 in a hex dump (visually or with an automatic tool) isn't a maddening task. Because manipulating 1-bit-per-pixel image data and frame buffers (shifting left and right, particularly) doesn't lead to despair. Because that's how any right-thinking person's brain works.

The one place I've seen little-endian actually be a help is that it tends to catch "forgot to malloc strlen PLUS ONE for the terminating NUL byte" bugs that go undetected for much longer on big-endian machines. Making such an error means the NUL gets written just past the end of the malloc, which may be the first byte of the next word in the heap, which (on many implementations of malloc) holds the length of the next item in the heap, which is typically a non-huge integer. Thus, on big-endian machines, you're overwriting a zero (high order byte of a non-huge integer) with a zero, so no harm done, and the bug is masked. On little-endian machines, though, you're very likely clobbering malloc's idea of the size of the next item, and eventually it will notice that its internal data structures have been corrupted and complain. I learned this lesson after we'd been shipping crash-free FrameMaker? for years on 68000 and Sparc, and then ported to the short-lived Sun386i.

http://geekandpoke.typepad.com/geekandpoke/2011/09/simply-explained-1.html

pro-little-endian: daemin 1 day ago

Little endian is easier to deal with at the machine level as you don't need to adjust pointer offsets when referencing a smaller sized variable. A pointer to an 8, 16, 32, 64, 128 bit quantity will be the same. Big endian you will need to adjust the pointer up/down accordingly.

---

lowlevel threadlocal analog: https://en.wikipedia.org/wiki/Scratchpad_memory

---

" starting with a system with a separate CPU chip and DRAM chip(s), add small amounts of "coprocessor" computational ability to the DRAM, working within the limits of the DRAM process and adding only small amounts of area to the DRAM, to do things that would otherwise be slowed down by the narrow bottleneck between CPU and DRAM: zero-fill selected areas of memory, copy large blocks of data from one location to another, find where (if anywhere) a given byte occurs in some block of data, etc. The resulting system—the unchanged CPU chip, and "smart DRAM" chip(s) -- is at least as fast as the original system, and potentially slightly lower in cost. The cost of the small amount of extra area is expected to be more than paid back in savings in expensive test time, since there is now enough computational capability on a "smart DRAM" for a wafer full of DRAM to do most testing internally in parallel, rather than the traditional approach of fully testing one DRAM chip at a time with an expensive external automatic test equipment.[1] "

-- https://en.wikipedia.org/wiki/Computational_RAM

---

some types of instructions:

'pure' operations (eg arithmetic)
conditional jumps
stores, loads, I/O

Oot Assembly should perhaps be particularly good at 'extensible opcodes' for pure operations; the need for extensible control flow is the least important (as extensible control flow could make it harder for the underlying platform to do clever optimizations eg pipelining). Impure yet non-control-flow stuff could possibly be marked with the impurity-type flags (see ootStateNotes1.txt) to allow the platform to optimize it.

---

the 16 binary boolean functions are:

0 x and y x and -y x -x and y y -(x eq y) x or y -(x or y) x eq y -y x or -y -x -x or y -(x and y) 1

note that another name for "-(x eq y)" is "xor"

that is:

1: a constant (0)
2: projections (x,y)
3: and, or, eq
6: (1+2+3): the negations of each of the above: 1, -x, -y, nand, nor, neq (xor)
4: and for 'and' and 'or', things where you negate one of the inputs (but not the output): -x and y, x and -y, -x or y, x or -y

so if 'not' is available, the following suffices: 0, x, y, and, or, eq

of course if 'not' is available, simply 'and' suffices, b/c you can make nand and with nand, 0 and 1, which is already boolean-universal. but it's nicer to have these others. Also it's nicer to have both 0 and 1.

and ppl expect xor.

So i suggest that we use:

0,1,x,y,and,or,eq,xor,not

---

http://hackersdelight.org/basics2.pdf provides some other basic functions that we might want to provide in a library or in core (abs, etc)

---

another thought on bit-width:

64-bits does seem to be the biggest bit-width that PCs need, and it is the square of 8.

However, ipv6 has 128-bit addressing, so it might be nice to go larger. In some applications, you might want to address memory within a machine itself addressed by ipv6, which would require 128+64 bits; the next power of 2 above that is 256. Similarly, you might want to have a 'virtual' 64-bit address space within a 'real' 64-bit address space, etc. Design notes for the EVM (Ethereum Virtual Machine) point out that a lot of crypto uses mod 256 arithmetic. 256 is the square of 16, and so is in the sequence 2,2*2=4,4*4=16,16*16=256.

So, maybe go up to 256 bits for the largest available bit-width.

I am still partial of a two-headed solution of 16-bit as a 'small/native' bit-width to allow compact instruction encodings and quick manipulation of small quantities, combined with a larger bit-width to allow a ridiculously large uniform virtual space address (and/or networked address space).

Or possibly even a multi-headed solution that allows any bitwidth up to the max; in this case this would require 8 bits to specify a bit-width.

Or a three-headed solution that allows 1-bit, 16-bit, or 256-bit. Having a 1-bit modifier available lets us avoid the need for special bit SET and CLEAR instructions.

Or a four-headed solution that allows 1-bit, 2-bit, 4-bit, 16-bit, or 256-bit (capturing the first part of the sequence 1, 2^1=2, 2^2=4, 2^4 = 16, and then also 256).

Since 2-bit may not be used much, could also do 1-bit, OTHER, 4-bit, 16-bit, 256-bit where OTHER could be used for either "native pointer size" or "object reference type" or "arbitrary bit length (true arithmetic, not modulo arithmetic)". Besides adding flexibility, having an OTHER that means 'native pointer size' in some contexts makes up for the omission of 64-bit.

To make that a power of 2, also omit 4 bit, leaving:

1-bit, OTHER, 16-bit, 256-bit

But perhaps the real solution is to allow optional instruction modifiers that specify any bit width between 1 and 256, inclusive, as well as native pointer size/reference and true (multiple precision) arithmetic.

if there were any preferred values, i would guess that having both native pointer and true arithmetic (separately) would be the most important ones. So drop 1-bit

native, 16-bit, 256-bit, true arith

---

also, i don't remember my exact 'grammatical' modifier scheme but it seems like there are probably around 16 types of modifiers that could be hardcoded, eg one of the codes could mean 'bit width' and then there would be 4 bits to choose a bit width (so for 8 bits AFTER the bits that say 'this is a modifier', 4 of them would specify 'this modifier is a bit-width modifier' and 4 of them would specify the actual bit-width.

---

Lua VM "is optimized for coding the more common conditional statements rather than conditional expressions" [1] by having VM instructions EQ, LT, and LE be conditional skips, rather than returning a boolean value; if you want to return a boolean value, you use a conditional skip followed by LOADBOOLs in the conditioned execution paths; LOADBOOL itself has a skip variant, so as to allow this to be done without JMPs (but the Lua VM code generator uses a JMP anyways, presumably so that the interpreter can assume that an EQ/LT/LE is always immediately followed by a JMP (eg the non-skipped codepath is a JMP)

---

Lua VM's parameterized TESTSET instruction can be either OR and AND depending on whether a third parameter is 1 or 0. The trick is that both OR and AND can be written in the form: "if A != K then return B else return K"

(sorta; the second K, within 'else return K', is not in TESTSET)

---

just to reiterate, 3 different designs for instruction encoding:

a simple, regular design. Maybe 64-bit instructions with 3 16-bit operands. Maybe some extensibile addressing modes, maybe flexible bit widths, maybe some typing. I have some sort of numerological obsession with 16-bit stuff for assembly code.
because the previous seems overly wasteful in terms of instruction width (and hence code density) (eg Lua VM has 32-bit instructions; ARM thumb has 16- and 32-bit instructions; 6502 has 1-, 2- and 3-byte instructions (btw are there any longer ones in 6502?), but at the expense of only having a few registers (i guess the zero page is effectively the replacement for more registers), and having large register irregularity (instructions that can only be used on certain registers, and certain addressing mode/instruction/register combinations; otoh ~50% of x86 code, which is variable-length, is consumed by instructions of 32-bits or less, and the other half is consumed by longer ones, often b/c of 32-bit immediate values [2]; something similar to Python stack-based bytecode could have ~32-bit instructions [3] [4]), i am sometimes pulled towards an even simpler encoding, with less opcodes and addressing modes and smaller operand bit-widths
and then i sometimes think, how could you actually make assembly really extensible and metaprogrammable and also able to preserve high-level intent? My answer to this ends up looking like natural language, which is i guess another top-level goal: how could you make a linear 'assembly language' that was flexible in the way that natural language is, but still at least possible to somewhat efficiently and redundantly parse given a lot of parallelism on the part of the instruction decoder?

And then sometimes i think about how to mix these, using some 'instruction format' leading bits to determine which of them a particular instruction is. Note that if you had one bit for 'instruction format' in a compact encoding and 1 more bit in an intermediate encoding, then you could have 3 separate formats while only wasting one bit in the compact encoding.

I'd like to note some other motivations:

for the first one: if Oot was running on a massively parallel processor, then most likely most of the individual CPUs in this processor would be a lot less powerful than eg the Intel Core CPU in my laptop right now. Like the brain, we'd expect them to be low-frequency (and error-prone!) so that we can have a lot of them, without dissipating too much heat, and cheaply enough for consumers (and me!) to buy. This is one reason i'm obsessed with 16-bit. The idea is that each of these CPUs will probably be running some sort of simple low-level runtime or interpreter, doing stuff like routing between processors, and the real program (and perhaps even the real language interpreter/runtime) is run on top of that.
for the first one over the second: if the operands are smaller than ~16-bits (eg if they are only 8 bits) then either, like Lua, we have to have a noticably low limit on the number of variables in some case (eg Lua's limit on upvalues), or we must have an indirection layer between the addressable registers in the Oot Assembly and the actual variables in the program. Which means that the high level will be doing register allocation to fit in the Oot Assembly's constraints. But in that case, why introduce the Oot Assembly abstraction, as it's introducing unnecessary accidental complexity? In that case, better to just optimize for the hardware ISA directly. Note that LLVM has an apparently infinite number of 'local identifiers' like %1 %2 etc; i think it's supposed to be essentially like a register machine with an unbounded number of registers
we want regularity to make it dead simple to port Oot Assembly to a new platform. Hence the idea of regular 16-bit operands; hence the idea of abstracting over bit-widths; hence having few opcodes. Also, we want some fixed-width-ness to assist parallel instruction decoding
we want to preserve high-level intent so that translating HLL to Oot Assembly, then translating that to the target platform, doesn't obscure opportunities for usage of platform-specific primitives
we want to encapsulate 'platform' operations such as garbage collection, greenthread scheduling, capability-based security, symlinks, transactional memory from the HLL implementation
we want the 16-bit regular design to capture the 'essence of assembly language' without the accidental complexity
we want Oot Assembly to make it easier to port Oot by isolating the platform-specific stuff below it, while still being dead easy to implement on a new platform. But if this was the motivation, why not just stick with LLVM? Maybe because LLVM seems way more complicated than what i'm proposing. I imagine that advanced Oot implementations would eventually grow out of Oot assembly and translate Oot Core (and sometimes Oot!) directly to LLVM. Also, there are some ways in which Oot is more opinionated than LLVM (eg some form of garbage collection is assumed to be supported by Oot), so where LLVM must be expressive enough to let languages handle this stuff in various ways, Oot does not. On the other hand, Oot must fully support tail recursion, which LLVM had some trouble with last time i checked (unless you use manually jumps rather than letting LLVM handle the call stack for you). And we can encapsulate the type system and memory representation of data types, which LLVM doesn't want to constrain too much.
why not JVM? It's too slow to startup, and also you have to write a layer on top of it to support tail recursion, and also too much accidental complexity with bit widths
why not RISC-V? Being a practical instruction set, there is a lot of accidental complexity there (bit widths, etc) which we don't want.
and if portability is the motivation, why do we want another layer below Oot Core? I'm really not sure that we do.
in order to escape accidental complexity, i keep flirting with polymorphic arithmetic in Oot Assembly, but this would be too inefficient (and perhaps too much meddling in the HLL type system) for a real/hardware assembly language like RISC-V
i'm still torn about whether the low level should support arbitrary-precision arithmetic (eg 'actual mathematical integers' that can be arbitrarily large). This is very inefficient; otoh do we really want to allow different implementations to only support some bit-widths? Or otoh do we standardize on 16 bits?
one requirement is for Oot Assembly to support efficient emulation of some other VM (such as JVM) on top of it. Oot Assembly must be especially efficient at self-virtualization (emulated Oot Assembly on top of Oot Assembly), and should support things like asking the hardware to directly run the 'emulated' Oot Assembly, with some memory mapping between emulated machine and platform, except to trap into the emulator upon certain instructions and memory accesses to certain locations. The virtualization must support airtight sandboxing for security
one motivation for the third natural-language-y design is that you want to annotate the code for the purpose of metaprogramming, and adding explicit annotations within a small fixed-length instruction would make it too large (otoh you could always just have 'annotation' opcodes that only annotate, and then have a relative pointer to the instruction (and possibly operand within the instruction) that they are annotating

As i write this, i am convincing myself (again) that the second option (the one where we strive for a decently compact instruction encoding) is senseless; it doesn't fulfill most of the goals that the other ones have that it doesn't, and if the Oot implementation is concerned with efficiency to the extent that it cares about code density in an IL, then it would be better to skip Oot Assembly and target the actual platform directly, or at least target something more complicated and practical like LLVM.

---

assertion of fact (opposite: query fact (and then match pattern to multiassign results)) (you could assert an equation rather than a special value assignment, but you could assert a value assignment too) vs command vs assignment (which is like pronouns) and also (although mb not separate from the above) RESTful interactions like GET address, SET (PUT) document=value; CRUD; VERBs applied to NOUNs, possibly with other arguments too (eg the value being assigned to a document in PUT)

---

if we are thinking massive parallelism with 16-bit, that brings to mind 2^16 MSP430s, how expensive would that be?

Cheapest MSP430 is about $0.40 per unit. 2^16*.4 = 26214.4. Not to mention the impracticality of wiring those together.

too expensive; so probably how we would really do this is to run the CPUs using the massive parallelism of a GPU. I think that is probably SIMD, whereas we need MIMD, so i guess we will end up running a Turing-head-ish state machine for each virtual Oot Assembly CPU on the GPU. (what are the bit-widths like for that? is there any savings on contemporary commodity GPUs for using a smaller bit-width? they are probably setup to handle large floating point numbers, we just want small integers)

(quick skim of http://docs.nvidia.com/cuda/parallel-thread-execution suggests that CUDA PTX does have some savings for 16-bit stuff; "A PTX program may execute on a GPU with either a 16-bit or a 32-bit data path."

(quick skim of OpenCL?: OpenCL? does have a 16-bit integer data type ("short" [5] and "ushort")

however, http://stackoverflow.com/questions/24238915/opencl-float-vs-uint-expected-performance-gain esp. http://stackoverflow.com/questions/24238915/opencl-float-vs-uint-expected-performance-gain#comment37436788_24238915 benchmarked a GPU and found that using int16 was WORSE than using float

(furthermore http://stackoverflow.com/questions/24238915/opencl-float-vs-uint-expected-performance-gain#comment37436588_24238915 says that "The new-ish "scalar" architectures don't seem to support stuffing several 16 bit operations into one vectorized instruction either.")

still possibly useful for us to go 16-bit ints, though, because in theory hardware could be built to make this faster

---

(would MSP430 even be what we want? a search on Digikey in the 'Embedded - Microcontrollers' category for lowest price first shows the following things cheaper than any MSP430:

0.31361: EFM8BB10F2G-A-QFN20R - Silicon Labs 8-bit CIP-51 8051 0.34800: MB95F272KPF-G-SNE2 - Spansion 8-bit F2MC-8FX 0.35000: PIC10F200T-I/OT Microchip Technology 8-bit PIC 0.35280: CY8C4013SXI-400T - Cypress Semiconductor Corp 32-bit ARM Cortex MO 0.36400: STM8S003F3U6TR - STMicroelectronics 8-bit STM8 0.36400: STM8S003F3P6TR - STMicroelectronics 8-bit STM8 0.37800: ATTINY4-TSHR - Atmel 8-bit AVR 0.38000: PIC10F220T-I/OT - Microchip Technology 8-bit PIC 0.38000: PIC10F204T-I/OT - Microchip Technology 8-bit PIC 0.38000: PIC10F202T-I/OT - Microchip Technology 8-bit PIC 0.38500: MSP430G2001IPW14R - Texas Instruments 16-bit MSP430

---

one simple (but inefficient to directly interpret) way to achieve bit width agnosticism would be to, instead of including the bit width in the instruction, have an instruction (or special memory map, like a stack map) to set/annotate/declare the 'type' of a memory location. Then other instructions would use the appropriate operations for this type. The inefficiency is that if different memory locations have different bit widths, and these are all intermixed into one set of infinite registers (or one stack) then either you have to pad all of them to the largest bit-width, or you have to compute a mapping from number-of-register to address-in-actual-memory and consult this mapping upon each update.

otoh if that mapping is immutable over the course of the program then a JIT interpreter could deal with it efficiently...

so i guess all we need here is the additional constraint that once the type of a memory location is set, it can't be changed, at least not within that dynamical context (eg the current function call) and perhaps not even the lexical context (eg all calls to the same function).

now, what about arrays? use the non-standard model of the integers, eg some 'addresses' have an infinite number of (indirectly addressable?) successor addresses before you reach the next (directly addressable?) 'address'. And so on, recursively (in case you need a 2-D array). Perhaps these successor addresses must all have the same type? No, b/c we need arrays of structs, too.

Restrict to repetitions of compile-time-known type structs? No, because in fact the application program we are running might want to emulate a heap within itself, but still use typed operations to manipulate the objects on this heap.

hmm... if we emulate a heap we can't really have compile-time-known types for each memory location per lexical context, can we? Mb make a 'heap emulating' function cast everything to int and back.

---

mb could mix the 16-bit regular and natural language style designs by:

having annotations as separate instructions which precede or follow the instructions (even if they annotate a subpart of that instruction) they are annotating
handling noun phrases like "the yellow dog over there" via an addressing mode that says "the target of this operand is given by a succeeding string of further instructions"
handling verb phrases similarly
using parenthesis instructions to signal the subexpressions, including the natlang (natural language)ish noun and verb phrases

this would seem to make sense because probably the 'bread and butter' eg most of the code of the 'small program running on each of a zillion 16-bit processors' will be ordinary imperative instructions, or at least ordinary parenthesized expressions, not weird natural language stuff with variable-length clauses. (actually it might be mostly expressions, right? or mostly graph data structure? yeah but most likely it will look like existing code, which is imperative assembly language).

one difficulty is how to specify a natural language argument. if we want 16-bit operands and 64-bit instructions, we have 3x16 bit operands and 16 bits left for the 'head'. If we use 6 bits of the head for 2-bits of addressing mode for each of the 3 operands, we have 10 bits left. But with only 4 addressing modes, we probably want: immediate constant, constant table constant, register, indirect. Leaving no room for 'parenthesized natlang-ish clause to follow'. Now, we could limit the bit-length of immediate constants. Or we could spend 3 more of the 10 leftover bits, to give 7 bits for opcodes and 8 addressing modes. But we also may want 'addressing modes' for the opcode (at the least, we need a way to specify that there are modifiers for this instruction to follow, right? or we could just only use prefix modifier instructions). 4 bits for opcodes seems a bit small (but maybe not really, if we can put longer opcodes in subsequent bytes; and we don't want many opcodes at this level anyways, right?).

these solns all sound ok, actually. we probably only really need 12 bit immediate constants. we don't need more than 64 opcodes, and we could make room with 16. we DO want 16-bit direct addressed operands, so that they can directly represent program variables.

---

for the same reason that a compact encoding isn't desired (that if efficiency is needed, it's better to target the actual platform), and because we have subexpressions in here, we might just think of Oot Assembly as more of a snazzy way to serialize the Oot Core AST.

---

otoh if we had a long opcode field (say 12 bits) then we can jump to 'custom instructions' in memory just like built-in instructions. Also, for first-class functions (that happen to have 2 arguments and one output), would like to be able to have register indirect addressing mode available for opcode field.

but then opcode field certainly can't be 16 bits, because if all 4 fields are 16 bits there are no bits left for any addressing mode (could combine addressing mode with opcode, but if that's regular it takes up just as much space, and we don't want irregularity). if opcode field is 8 bits then there's not that many 'custom instructions' in memory and it doesn't really work for first-class functions.

having 128 bit instructions seems too big (even though i just said it doesn't matter). Which brings us back to the 12-bit operand field proposal.

4 fields, each with a 12-bit operand and a 4-bit addressing mode. The first field holds the opcode, then the destination operand, then the 2 source operands.

We don't really need 16 addressing modes but we can use 8:

immediate constant table register direct register indirect stack direct stack indirect argument given by parenthetical phrase following this instruction custom addressing mode

(why am i going back to having stack addressing modes? b/c stack and heap are separate memory allocation regions)

(we don't need a separate stack push/pop addressing mode, because this can be stack direct/indirect when applied to stack address 0; if you want to read/write the first stack addr without pushing/popping, it is duplicated in stack addr 1)

if we stick with 8 addressing modes and 12-bit operand fields then we have 4 bits free (one for each field)

---

also note that since we're abstracting away from bit-with, the registers might be able to hold things > 16 bits.

---

ok one thing we can do is give less bits to the opcode than to the 3 operand fields. that just means that if we direct-call a function via the opcode field, is has to be in low memory. not a problem.

(i've been though all of this, or something similar to this, way before, but i guess i'll do it again)

so if we had 3 16-bit operands, and 4x3 addressing mode bits (3 each for 4 fields), then we have 4 bits left over for the opcode. This corresponds to 16 "direct-call-capable" registers at the beginning; also 16 built-in opcode's (immediate), also 16 constant table custom subroutines.

ok, we can work with that. But it seems kind of overly luxurious to have 3 addressing mode bits for ea. field including the opcode, and only 16 opcodes. So let's cut it back to 2 addressing mode bits again. the stack addressing can be memory-mapped. And the 'argument given by parenthetical phrase following this instruction' can be value 0xFFFF. So each operand can only directly address from 0 to 2^64 - 2, instead of the usual 2^64 - 1. Not so bad, and in fact the following parenthetical phrase could just hold 0xFFFF so we can represent that value, it's just less efficient. (actually the direct addresses are somewhat less than this b/c of the stack-mapped addresses; i'm thinking 16 of them)

and we lose the custom addressing mode. Which i liked, but really it's unnecessary and we can do without it (and it would make the implementation more complicated to boot). Anyway, if we have natlang-ish phrases with modifiers and prepositions, we can get it back via 'prepositions'.

for the 'see following parenthetical phrase' we might want to reserve more than just 0xFFFF, because we might have up to 4 parenthetical phrases following this instruction, one for each field, and we'd like a parallel instruction decoder to be able to skip straight to each one. Especially now that we realize that we can still represent any values lost via the parenthetical phrase itself. So how long do we want to allow these phrases to be? Probably one of 4, 16, 64, 256. If they are 64 then we may have to skip up to 64x3 = 192 for the last one. So just make the skip up to 256.

um, actually, i don't want to have memory-mapped locations in both low and high memory, because then the interpreter has to do 2 checks instead of 1. So lets put the parenthetical skip in low memory, too. We'll have room for slightly less than 256 in the skip, but that's okay b/c we only needed 192 (and could probably do with much less, even).

y'know, it would be nice to encode the end of the parenthetical phrase, too; that way the interpreter can read it in with a single memcpy command without first reading the phrase length from the beginning of the phrase. So use half the phrase skip bits for phrase length. Now we only have enough bits for max length 16 phrases, but that's already enough. But wait, now the max skip is also 16, it needs to be at least 3*maxphraselen. So we have 8 bits here (minus the first 0x10), so use the first 5 bits for the skip, and the remaining 3 for the length; max phrase length is now 8 (the skip will never be able 24 and we can accomodate up to 32 so we're not using 8 values here, which is fine).

wait, but the opcode can only address the first 256 (0x00FF); ok that's more important than having only 1 check for the interpreter. So keep the parenthetical skip info in high memory. Also, there's really no need to have the phrase length or to bound the length of the parenthetical clauses to 64 or whatever, that's just for efficiency; as long as there's only 1 'long' clause then it can go at the end.

---

so the above proposal is:

8-bit opcode 4x2 addressing mode bits 3x16 operand bits

addressing modes: immediate constant table direct indirect

memory map and memory-mapped special locations: 0: 'trash' register (output to this register when the output is unnecessary) 1: stack push/pop 2-0x000F (15): direct read/write access to first 14 stack locations (without pushing or popping) 0x0010-0x00FF: memory (that can be addressed by the opcode field) 0x0100-0xFEFF: main memory (that cannot be addressed by the opcode field) 0xFF00-0xFFFF: see following parenthetical phrase (x - 0xFF00 is offset to the beginning of the phrase)

---

arguments like "the little yellow dog", which would now be in following parenthetical phrases, is really a parenthesized QUERY; "(the argument is the result of: (find: the little yellow dog))"

---

one thing to consider: if we have an 'extension mechanism' for parenthetical clauses already, why not just have <16-bit operands and use the extension mechanism when a 16-bit operand is desired? i feel like 256 opcodes is too many...

one issue with that idea and regularity: if we have 2 addressing modes for each of 4 fields, that's 8 bits for addressing. These 8 bits should probably be in the same segment as the opcode, and then there should be 3 other segments, for 4 total equal-sized segments, which lets the size of the whole instruction be a power of 2. We need at least some bits for the opcode. To make this segment a power of two, that means it has to be at least the size of the next power of 2, or 8 bits. Which is already 64-bit instruction width.

unless.. you take the addressing mode bits out of the other operands, rather than having them each be a power of 2. Eg if each operand (and the opcode) were 6 bits, and each operand (and opcode) has 2 additional addr mode bits, then we get 32-bit instruction width. If each operand (and the opcode) were 2 bits + 2 addr mode bits, then we get 16-bit instruction width. Both of these choices are feasible for an assembly language, but neither directly meets our goal of being able to easily represent HLL variables without running out of addressable registers. But what about using the extension mechanism of parenthetical clauses? I still don't like 2-bit fields, because i want more than 4 opcodes. 6-bit fields may be feasible, however; MOST of the time we'll be dealing with <64 variables; and 64 opcodes is fine.

The main issue is that dealing with 6-bit boundaries may reduce speed more than the increase in code density increases speed (and perhaps it's not as 'beautiful' to have 6-bit thingees in there). Hmm, i wonder which is more important.

To restore 'beauty' we could take 2 bits of those 6-bit operands away to make 2 flag bits. So now we have 4x (4 operand bits + 2 addressing mode bits + 2 flag bits) = 32-bits instead of 4x (6 operand bits + 2 addr mode bits). This appeals to me because everything is a power of 2, and there are only 16 opcodes.

But surely there will be plenty of situations where we have more local variables/values to store than these 16 directly addressable registers will hold (esp. since many of them will be taken up by memory-mapped stack, etc), so we'll be using the suffix extension a lot -- does it make anything more complicated to use extensions as much as this? Yes -- peephole optimizations -- the optimizer will have to either (a) construct its own representation of the code, to unify the case in which the suffix extension is used with the case in which it is not, or (b) write multiple, more complex patterns for each optimization that matches when the suffix extension is involved, or (c) ignore optimization opportunities when the suffix extension is involved. We want implementation to be easy, so we'd prefer (c). With only the first 16 registers being seen by the optimizer, we're probably missing out a lot here (recall that we're also using the optimizer to sometimes recover semantics so we can substitute platform-specific instructions). So i think the cost may be too much there.

With 6 bits, i think that would be managable. So we have a 32-bit width possibility

Thinking about this reminds me that it would be nice to have some modifier flags, though. How about:

4x (8 operand bits + 2 addr mode bits + 6 flag bits)

This would recover format uniformity between the opcode field and the operand field.

With 6 flag bits we may even have enough to get back to one of my favorite (inefficient but pretty) ideas, to have type annotations (for built-in types) right there in the instructions. We could have 4 bits for 16 built-in types, and then have 2 flag bits left over for something else (parallelism? blocking? evaluation strategy? follow symlinks?). In the opcode field, these 4 bits could be used for something else. Having the type annotations right there might be good for the peephole optimizer too, for the same reason as before (although when custom types pop up we might still have to resort to modifier prefix instructions, which gives the optimizer the same problem i just discussed).

ideas for types:

types: boolean/single bit byte unsigned int16 int16 float64 arbitrarily large int arbitrarily large/precise rational unicode string array cons cell/linked list/binary tree graph/hash/dict (opaque/boxed) function/callable thunk/promise/future AST/expression/block/logic other/custom static type (has already been statically typechecked) dynamic (boxed/wrapped, tagged type; runtime typechecking requested)

what about: OOP object, pointer, nil, thread, non-native pointer, symlink/reference, discriminated union, variable environment/context, namespace, interface/type class, type constructor, annotation, keyword/symbol, continuation, maybe, parallel arrays, sets, clojure atoms, clojure agents, something related to STM, records/tuples

well we really need 'pointer'. OOP object can be expressed with custom static type, and/or dynamic. discriminated union can be expressed with custom static type, and/or dynamic.

do we really need both 'arbitrarily large int' and 'arbitrarily large/precise rational'? Integers are really importing in math and computing. But Lua only has a 'number' type, and code which really cares about having low-level ints can build its own out of the provided 16-bit ints. And most languages don't offer any arbitrarily large anything. So scrap 'arbitrarily large int'.

do we really need 'nil'? nah, CLR doesn't have it.

do we really need 'thread'? few languages have this as a primitive (Lua and Erlang do, for example). So i guess no.

is 'array' really useful if we aren't saying here what it's an array OF? or should we make all arrays of a fixed width, say 64 bits, or 16 bits, or the native pointer size?

could we combine thunk/promise/future and opaque function/callable?

i guess variable environment/context and namespace can both be just dicts

type constructor and interface/typeclass are both meta-ish things so maybe they can stay as custom static type, and/or dynamic (not sure about that tho). Annotation can be some other atomic type (eg AST, or graph). keyword/symbol can be int16.

i'm worried about not having continuation, although b/c it's meta-y we might ignore it.

i'm worried about not having maybe, as it's common, although i dont think other languages have that. Could we combine maybe (maybe-tower) and thunk?

do we really need 'unsigned int' type at this level? perhaps just have 2 versions of each relevant instruction for this?

sets are usually custom types

records/tuples are just custom types (which may or may not be 'boxed')

i'm worried about not having clojure atoms and agents, although i guess this could be part of the boxing

so:

types: pointer byte int16 unsigned int16 float64 arbitrarily large/precise rational boolean/single bit unicode string array cons cell/linked list/binary tree graph/hash/dict (opaque/boxed) function/callable, thunk/promise/future AST/expression/block/logic thread other/custom static type (has already been statically typechecked) dynamic (boxed/wrapped, tagged type; runtime typechecking requested)

---

flag ideas:

lazy
blocking
parallel
boxed/unboxed
as expression (AST), or as value
follow symlinks
dynamic typechecking requested/statically typechecked already
implicit coercion to this type requested
isThunk
platform-specific native width
wrapped with Maybe
lenient flag
atomic operation/transaction/memory barriers around this/volatile
use similar platform primitive if available, even if not identical to Oot semantics
in case of error, trap/raise; or return error; or saturate; or wrap-around; or set error flag/error status register; return value, or return tuple (result, errorcode)?
- return value, or return tuple (result, errorcode) (or Maybe)?
- return in case of error, trap/raise, or not
  - in case of error, trap/raise or not (and if 'not', do we set an error register, or do we wrap the result in a Maybe?)?
something to do with virtualization
parallel array annotation
timeout
something to do with vectorization
something to do with capabilities
something to do with metadata
something to do with synchronization
in-place
deterministic/pure marker
idempotent marker
quoted
symbolic (similar to quoted)
check for redefinition/overridding via metaprogramming
use similar platform primitive if available, even if not identical to Oot semantics (note: a BIG change to semantics, eg if platform ADD is used on a rational and overflows and wraps around, should cause an error; however the very fact that that error is possible is a change to semantics. Another time we would use this flag is something like using platform floating-point ADD on Oot integers, which exposes us to floating-point roundoff errors)
check for redefinition/overridding via metaprogramming/from user code vs from language implementation code

is there any case in which 'lenient flag' would not be present on a thunk? sure, when you do async I/O and get a promise; once that promise completes you are encouraged to strictly evaluate the result. We can use the function/promise/future type for that. So lets use a lenient flag to mean both 'this is a thunk of a value of the indicated type, not itself a value of the indicated type' and also 'lenient evaluation requested'.

i'd really like a 'wrapped with Maybe' flag but i guess 'follow symlinks' is more important, if we have symlinks at all.

note: if we have a 'lenient flag' as a flag in the instruction, rather than in the value, this means that we have to have multiple versions of each function, one for each combination of 'lenient flags' set in the arguments. This makes sense because low-level code doesn't want to use boxed values for everything, and to impose lenient flag checks between every register access, so we want a way to bypass that junk. Otoh for functions with many arguments, there is a combinatorial explosion. Perhaps for functions with too many arguments, the compiler simply boxes the inputs and then uses lenient flag checks throughout.

perhaps instead of 'lenient flag' what we really mean is 'boxed', where 'boxed' values must be 'unboxed' before use, which includes checking if they are a thunk, and if so, if they request leniency. So just have 'boxed' and don't bother with 'lenient flag'.

We could even put a maybe-tower in our generic 'box'.

if we have a 'follow symlinks' flag, don't we need a 'reference' or 'symlink' data type? we can't just use pointer for this b/c we need a terminator, right? or do we just have a symlink table that says what is symlinked to what? or are symlinks just some custom type? or is a boxed pointer a reference/symlink? or is symlinking part of boxing? i guess one of the latter four ideas suffices for now, so one way or another we can have 'follow symlinks' without a symlink primitive data type.

i'm thinking the thing to do is create a second type of 'boxing', 'symlink boxing'. or should that just be a part of ordinary boxing?

i guess we don't need a 'dynamic typecheck needed' flag b/c really you only want to typecheck things once upon function entry and then assume the type is correct. So inserting a typecheck is the compiler's job (the runtime can verify that typing is proper though), and dynamic typechecks are separate instructions. We still want a 'dynamic' type, though, so that we can pass around boxed things with dynamic types without caring about what they are.

i guess a general guideline here is:

the operand flags are for things regarding data
if there is an attribute that is carried with the data, we could use a flag like this, or we could put it in the boxing
if most code would like to be agnostic w/r/t some flag, then it should be in the boxing, not here; this is for when the code cares about whether the data has the flag or not
one reason to put stuff in a flag rather than in the boxing is when you really want to be able to work with unboxed values to save time in tight loops
given that we have a boxing/unboxing flag, what remains should be something orthogonal to that, eg something that you might want to use even with unboxed data; otherwise it could be part of the box
- i dunno if 'follow symlinks' meets that criterion; maybe not

if we had a flag: shared memory/IO/do under lock/atomic operation/transaction/memory barriers around this/volatile (note: if multiple input operands have this flag set, they must be gathered atomically or in a transaction; if at least one input operand and the output operand has this set, the entire instruction must be atomic or in a transaction) (when this is not set, the platform can do its usual instruction reordering tricks, assuming that no other threads are concerned with this value)

that would be cool, but why not just simply reserve part of the memory for this stuff?

0x0100-0x8FFF: local main memory (that cannot be addressed by the opcode field) 0x9000-0xFEFF: potentially shared main memory (that cannot be addressed by the opcode field); this memory is automatically accessed atomically/with memory barriers/with locking/volatile/in transactions/etc, and instructions which refer to this memory in multiple input operands gather them atomically or transactionally, and instructions which refer to this memory in both input and output operands are atomic or transactional?

nah, that's a huge overhead if we end up using that memory locally

maybe it should be an operation on the memory address rather than in the instruction encoding flags, tho. but that would mean checking a memory map upon each op.

my next-favorite choices for flags: ?: quoted/symbolic ?: 'do in parallel' here, we can have separate opcodes for that) ?: deterministic/pure marker ?: idempotent marker

---

do we want a memory-mapped 'status register' with zero, carry, negative, overflow?

---

i assume that we would have a 'type' prefix modifier to indicate the precise type for compile-time custom types. But one problem with that is that our 8-bit operand fields aren't up to the job here. Surely many programs (or modules within programs, i suppose) will be working with more than 255 types at once. For most things, we say, 'whatever, each item in a memory location can be arbitrarily large, so if the fixed-size instruction encoding can't contain it, just use direct addressing to point to a location in memory instead'. But for compile-time custom types, we want each type to be clearly annotated at compile time, so we want them to be immediate or immediate constant, not direct addressing; and immediate constant doesn't help us much because our problem is not containing the size of a type identifier, but rather, actually being able to reference many types within the same program segment. This underlines the need for non-standard instruction encoding forms with larger-bitwidth operands, no types, and perhaps no flags or addressing modes.

more generally, we may frequently need more than 256 constant values within one region of code, eg for operands of the custom type id prefix modifier. One soln would be to have some prefix modifiers and other 'special' instructions not themselves have type info, giving them 12 bits instead of 8 per operand. If we are having things with different numbers of operands and with other instruction formats in general, i'd like one or more 'format bits' to make this explicit (rather than it being based on the opcode).

i'm going to remove 2 of the opcode-only flags and replace them with 2 form bits.

---

so this proposal is:

4 x (8-bit operand-or-opcode + 2 addr mode bits + 2 flag bits + 4 type bits (type bits are more flag when with opcode))

(addressing modes and memory map same as previous proposal, types are new:)

addressing modes: immediate constant table direct indirect

memory map and memory-mapped special locations: 0: 'zero' register (send output to this register when storing the output is unnecessary; always reads zero) 1: PC 2: status register 3: stack push/pop 4-0x000F (15): direct read/write access to first 12 stack locations (without pushing or popping) 0x0010-0x00FF: memory (that can be addressed by the opcode field) 0x0100-0xFEFF: main memory (that cannot be addressed by the opcode field) 0xFF00-0xFFFF: see following parenthetical phrase (x - 0xFF00 is offset to the beginning of the phrase)

types: pointer boolean/single bit byte unsigned int16 int16 float64 arbitrarily large/precise rational unicode string array cons cell/linked list/binary tree graph/hash/dict (opaque, as opposed to AST) function/callable, thunk/promise/future/lazy value AST/expression/block/logic greenthread/process/channel(/actor?) other/custom static type (has already been statically typechecked) dynamic (boxed/wrapped, tagged type; runtime typechecking requested)

(note that some implementations might not support all of these types; eg an embedded implementation might not support floating point, arbitrarily large/precise rational, unicode)

(note that when i say int is 16 bits, the implementation is free to make it MORE than 16 bits; int here is just AT LEAST 16 bits)

flags (2 bits): boxed/unboxed shared memory/IO/do under lock/atomic operation/transaction/memory barriers around this/volatile (note: if multiple input operands have this flag set, they must be gathered atomically or in a transaction; if at least one input operand and the output operand has this set, the entire instruction must be atomic or in a transaction) (when this is not set, the platform can do its usual instruction reordering tricks, assuming that no other threads are concerned with this value)

opcode-only flags (4 bits):

2 format bits (# of operands with addressing modes and types: 0 (also indicates that this is a prefix modifier), 1, 2, 3)
blocking/async; do we block right now, or is the return value a promise?
(a) in case of error, trap/raise, or (b) wrap the result in a Maybe (does this mean that even ADD can create a Maybe? is a carry an error for ADD? or, on ADD, does this choose between using the status register for carry, vs returning a tuple?)

in the boxing:

type tag
is a thunk?
lenient evaluation requested?
follow symlink?
if array, type and length, and whether the 'data parallel' flag is set (ie if parallelizable operation is done on array, should we fork greenthreads to do this in parallel on the various array members?)
is this an immutable data structure with behind-the-scenes copy-on-write?
is this a behind-the-scenes distributed object?
is this a behind-the-scenes transaction-requiring object, or shared memory requiring locking?
are there subscribers to changes to this value?
is this value 'dataflow persistent' like a spreadsheet cell?
are operations on this value supposed to be subjected to the '__evaluate' magic protocol?
are changes in this value supposed to be serialized into an undo-able journal?

some special(ish) instructions:

left-brace: begins a quoted block
right-brace: terminates a quoted block
antiquote (optionally multilevel)
map
pmap, parallel map
lambda-left-brace

---

a simple/naive implementation would be to push the 2 operands, the type, and the flags onto the stack and then call the given function; the function would pop these and push the return value; in this way the interface between the dispatcher and the opcode implementation would be very simple

---

on the choices of some of the flags:

boxed
atomic
platform primitive
check for redefinition in metaprogramming

may be good choices b/c:

these are all things that are orthogonal to the instruction (they can be applied to any instruction)
they are things that you could 'check for' each time, but you don't want to because for low-level code doing simple operations like ADD, you don't want a bunch of extra checks

re: 'check for redefinition in metaprogramming': you might think that you always want to check everything within a dynamic context, so just virtualize, then check everything; but what you actually want to do is to not check certain 'low level' subroutines that are within/provided by the language implementation runtime. You don't want to just reserve memory for that, b/c at times you may want the language implementation to express its code within memory 'owned' by the higherlevel program. So this marker is really a 'is this from user code or language implementation code' flag.

re: check for redef: another alternative is to always fix the operation of primitive operations on integers and other primitive types, but always check for overrides on custom types

---

it would be neat of oot assembly could even cleanly represent (map (+ 1) [1 2 3]). i bet it can:

memory[0] = [1 2 3] MAP (lambda-left-brace ['x y] (ADD x y) right-brace) memory[0]

---

i'm pretty happy with this result so far. It's nice and regular and simple.

Two remaining blemishes are:

too many opcodes. I want a small beautiful set of opcodes, not 256. Part of the point of explicit types is so that we don't need to have a bunch of variants of eg basic arithmetic for various types.
the polymorphic types make dispatch inefficient. Either you physically instantiate a 'virtual dispatch table' dispatching on a 'virtual opcode' made out of the concatenate of the opcode and all of the types (which leads to a huge, sparse dispatch table), or you have a hashmap or you walk (or binary search) the virtual dispatch table upon each call (or have a 2-level dispatch table etc), or you stuff polymorphic branching logic into each instruction implementation. Out of these, i prefer binary search/two-level dispatch table, but this is inefficient. The efficient thing to do is to just have opcodes be specialized to type (polymorphic instruction x type) like in normal assembly; you don't have ADD (int) (int) (int) and ADD (float) (float) (float), you have ADDINT and ADDFLOAT. But this increases the incidential complexity of the instruction set.

A few relevant alternatives:

use only a small subset of opcodes for the 'real' polymorphic instructions. Then fill the rest of opcode space with opcodes specialized to type, like in normal assembly; these are all specialized variants of the few, 'real' instructions.
get ride of polymorphism. Replace the explicit types in the instruction encoding with either larger operands or more flags.
have only a few built-in opcodes, and allow the program to use the rest of opcode space with custom opcodes which could represent eg core library functions

Why do i want explicit types in the instructions anyways? There are a few practical reasons:

quicker typechecking
allow optimizers to apply rules that match on instructions regardless of their argument types (eg maybe ADD is assumed to always be associative)

but really my reason is to preserve intent/readability, and maybe even beauty. Basically i just want programs to be in a polymorphic language, even if it's a target language. Also it's closer to what i envision for Oot Core; in Oot Core i imagined mandatory type annotations all over the place. Related to this, the real target language is platform-specific anyways; if we're making an Oot Assembly at all, it should be at the same time dead simple (so it's easy to port), but also if we can get it simply, somewhat 'higher level' than the platform target, so that the next-level code is less complex, making the Oot implementation as a whole more comprehensible; (and this goes double if we decide to skip Oot Assembly, in which case my purpose in designing will turns out to have been just to clarify what the 'primitives of Oot' are)