Bayle Shanks's website: proj-oot-ootAssemblyNotes4

in Oot, do we want a base register to make addresses relative?

probably not; that's probably assumed/encapsulated away

---

ppl say that jsr/ret made the JVM verifier much more difficult to write. See [1], section The issue with JSR/RET. Why?

Maybe because ret jumps to an address in a register rather than to a fixed address?

Is jsr/ret the only computed goto in jvm?:

no, there is also lookupswitch, but there you have a table of possible return values (and tableswitch, but that seems to be equivalent to JMP?)
ret appears to be the only instruction in which it is not immediately obvious what all the possible jump destinations are (eg you could try things like jsr twice, then SWAP the return addresses, etc)

---

timeout bit

---

timeout bit

copied from [2]:

so what is boxing doing now? a lot:

maybe/option/nullable types
lazy/promises
binding/symlinking variables (which is also used for upvals)
typechecking needed
any value that does not fit in one cell (64-bit cells, i guess)

we should also add:

__get and __apply (eg only implicitly call these when the 'boxed' bit is set)
and maybe locking?

in other words 'boxing' seems to be the default for our inefficienct, high-level, safe, dynamic, lazy, symlink-having language; and 'unboxing' is like a special mode that you can use to do more CTM stuff for tight loops, etc.

so still not clear on:

if we are using that bit to say whether something should start evaluating now (even if it's a promise, not something being strictly evaluated), or wait until its called for
how to indicate lazy/strict
how to annotate whether something is worth doing in parallel

if we had more flags, we could also have one for how to deal with failure, eg whether to have (a maybe-null, or a straight nullable type), or to raise an exception

---

heck i dont wanna decide which flag bits are important, and decide between 8bit, 10bit, 12bit operands. unlike kant, i'm not confident that i'll think of all the modes i want forever. also it seems like a waste of space to have either large operands or many modes, but i dont know which one b/c that's empirical.

ok so maybe back to the variable-length, natural-languagey model.

design requirements: a variable length encoding that:

can be decoded in parallel
simple to implement a decoder for
in specific cases can reduce to something close to the fixed width schemes i already thought up

so accept variable length instruction 'sentences', where each 'instruction word' is fixed length(?) (and possibly has a marker to say if its 'ongoing' or not)

perhaps 'quoted' should also be part of the marker?

'referent' marker? eg 'this modifier modifies the word tagged 011'?

and/or 16 fixed part-of-speeches? eg destination, destination modifier, operand1, operand1 modifier, instruction, etc

sentence terminator bit, word terminator bit, clause terminator bit?

but if you allow conjunctions, then its already an AST

heck mb go that with anyway

---

yeah i think this sort of thing is the way to go.

---

two kinds of AND; simply parallelizing, and building a big noun (which i'll call 'constructive'). this is like whether to autoveectorize over a list, or to treat the list as a list

---

do we still need an 'oot bytecode' below oot core? probably (for things like COW). is there a need for an 'oot core+' which is like oot core, but more colloquial, and which is what the meetaprogramming matches upon (eg macros would run on oot, or oot core+, or on oot core)

---

kant's table of judgements... but want to generalize that sort of thing

'universal' (perhaps i should say 'generic', because i bet some natural languages dont have all of theese, anyway im constructing a programming language not a natural one languagee features): which-language (is this imperative programming, logic programming, etc?) POS-in-sentence (subject, verb, object.. or, opcode, destination, input1, input2) mode (and tense, etc) words, phrases, sentences (what abouut morphemes, etc?) base word of a phrase, and modifiers (eg adjectives in a noun phrasee vs the noun; perhaps addressing mode vs operand? how is mode distinct from this?) arguments (in bot nat. language and prog. lang) compound phrases, and conjunctions (common ones: and, or, xor) and parens lattice of truth values; not quoting annotation

     again, should review natural language declensions and conjugations...

---

what can we do with 4 grammar bits?

sentence terminator, phrase terminator, 2 POS bits, allowing for 4 sentence components, eg, whole sentence, subject, verb, object. but some sentences have 5 components: he gave the apple to johnny in the house so need at least 3 bits for component choice (note: in that examplee, the componeets are like destination, operand, argument2, argument1, mode modifiers)

could impose an ordering rule that components must go in order, or at least, that 0 goes first (or last). this allows us to split sentences at the beginning of phrase 0. now we dont need the sentence teermination bit (at the cost of having to possibly read past the end of the sentence to know that the sentence has ended, if the rule is that 0 goes first). so we neeed 4 bits so far.

could also say that component 'base' goes eeither first or last in phrase; now we dont need to both identify the base, and also to identify phrase termination.

still need to distinguish modifiers from base and addressing modes.

so can do everything with 16 pos: component 0 component 1 base c2b c3b c4b c0addr c1addr c2addr c3addr c0mod c1mod c2mod c3mod annotation quote (more accurately, 'quoteed' or inside a quote) parens or other grammatical symbol

this doesnt quite suffice if component 0 can contain multiple preepositional phrases the preposition in prepositional phrases is like a verb; the rest is a nounphrase-like argument

how about: each component (recursively) has a base, an addr mode, up to 3 arguments, and modes/modifier subcomponents. that's too many, want at most 5. so combine base+addr mode. addr modes are subcomponents.

conjunctions are subcomponents (eg an AND would have base 'AND', a mode/modifier saying whether it's parallel or constructive, and 2 arguments)

---

so we say the base component must come first.

---

it take 3 bits to represent 5, so dont know what to do with the otheer 3 alternatives. One of them can be 'annotation'. Parens could be forms of annotation. So we have two left; could just waste them on parens, including a relative jump to allow a reader to quickly find the matching parens. So that's 3 bits. If we can do with no more than 1 more overhead bit we'll have a nice round 4 bits.

wait dont parens need to say which component they're part of? so we can't have them be their own components

is the last bit some kind of terminator? word, phrase, sentence?

---

mb instead of component i should say 'grammatical role'? grol for short? i think "phrasal category" is the linguistics term [3], but that's too long and the word 'category' has too many other meanings here.

---

so to summarize the current natural language proposal outlined here:

a variable length encoding that:

can be decoded in parallel
simple to implement a decoder for
in specific cases can reduce to something close to the fixed width schemes i already thought up
is extensible with regards to which mode bits we have

we have 'syllables' (or should we say grapheme? phoneme?)s which combine into 'words'; 'words' which combine into 'phrases'; 'phrases' combine into 'clauses'; and 'clauses' combine into 'sentences'

in our system, a 'clause' is ('is like'? or just 'is'?) a recursive, embedded, parethesized sentence. It is not a 'sentence' only because it is not toplevel. Semantically, due to quoting and unquoting and assigning first-class-functions to variables, etc, both clauses and non-clauses are similar; however, the language won't DO anything yet if it has just read a clause, because the context that clauses is embedded into might quote it, make it hypothetical, make it conditional, etc.

in our system, 'phrases' can be single-word or multi-word.

a 'phrase' might be an expression; e.g. in "x = 3 + (2*(*y))", "2*(*y)" is a "phrase"

each (word?) has a 4-bit 'morpheme' that give syntactic 'overhead'. 3 bits of this tells what its grammatical role is within its (phrase? clause?):

role 0 (base) e.g. "Johnny's dad gave him an apple", "dad" is the base of the subject phrase; the verb is the base of the whole sentence; in assembly, the operand (or in previous Oot assembly proposal, the operand field) is the base of the instruction
role 1 (argument 1) eg subjects in natural language, destination fields in assembly
role 2 (argument 2) eg direct objects in natural language, first input arguments fields in assembly
role 3 (argument 3) eg indirect objects in natural language, second input arguments fields in assembly
role 4 (modifier/mode attaching to the whole (phrase? clause?)) eg in natural language, modes (imperative/interrogative, etc), in natural language, propositional phrases; in assembly, modes (whether or not this is lazy, hypothetical; whether this is an imperative or logic programming sentence)
role 5: ?: maybe open parens?; but dont these need to be marked with their role in the enclosing phrase?
role 6: ?: maybe close parens?; but dont these need to be marked with their role in the enclosing phrase??
role 7: ?: annotation, maybe syntactical/grammatical metasymbols like parens, quote, etc; but dont these need to be marked with their role in the enclosing phrase??

what we do with the other bit isn't decided yet

the rule is that role 0 comes first in any phrase. so if you see a non-role 0 and then a role-0, that's how you know you crossed a phrase boundary this isnt good enough because two adjacent phrases might both have only role 0s; so we still need a phrase delimiter; so mb use that bit for that

open and close parens contain the relative address of the matching parens

should we also mark when something is part of a quotation, to make it ez to skip without parsing?

should we also mark, for something which is acting like a 'pronoun', where the referent is, and for a modifier, where the modify-ee is?

two kinds of AND; simply parallelizing, and building a big noun (which i'll call 'constructive'). this is like whether to autoveectorize over a list, or to treat the list as a list

phrases can have subphrases whose grammatical role within the parent phrase is the same 5 things

I gave the ball to the person who had given it to me:

"the person who had given it to me" is the indirect object, and in our system would be a subphrase, with argument1 'the person' and modifier 'who had given it to me'; 'who' here would just be an artifact of English, the modifier is that constraint tha 'person' 'had given it to me'.

similarly 'Billy's dad' is a subphrase with argument1 'dad' and (modifier?) Billy's

the 'modifiers' in the above might be 'base' in our system, eg like VERBS in natural language (contrary to the way linguistics treats them)

conjunctions are also subphrases; the 'base' is the type of conjunction, argument2 and argument3 are the two things being conjoined, the modifier specifies eg if this is a parallel or constituitive conjunction. eg "alpha and beta" has base AND, argument2='alpha', argument3='beta', and there is a mode too

modes are assumed to be defaults if not specified

---

rather than tagging the roles, could just say that the 4 bits say:

word ends phrase ends sentence ends (what's the 4th one? something to indicate when we skip a phrase, eg for a sentence with only phrase 0 and phrase 4? how would we do that?)

---

to cut down on subphrase overhead, we could make addr modes mandatory parts of words

---

one issue is that you can have multiple modifier phrases under a phrase, so you have to allow multiple phrase 4s

args 2 and 3 are symmetric, so if can have multiple, then why not just combine them, now just 4 choices = 2 bits for grammatical roles (just call these "parts of speech", POS, for short)

so now we're only 1 bit over our supposed budget of 4 bits (eow (end of word) + eop (end of phrase) + eos (end of sentence) + POS tag (2 bits)) = 5

we could eliminate the 'boxed' bit from the addr mode; so that now we would have 3 addr mode bits and 5 grammar bits. If words started at 16 bits long, then 8 bits would be left for the operand.

---

but do multi-word phrases even make sense if we can't markup each word in that phrase with its role in WITHIN THAT PHRASE?

hmm.. how about: if the "new phrase" bit is set, then role of this word within the clause is role 0 (base), and the inflection (2-bit of grammatical role) refers to the role of that phrase within the clause. Otherwise, the inflection refers to the role of that word within the phrase. but actually we want 'end-of-phrase' not 'beginning-of-phrase', right? so put 'base' at the end, then.

we say that 'syllables' (which may or may not correspond to 'words' on the underlying platform; our concept of 'word' is somewhat higher-level) are 16 bits.

so, now i think we have a complete system.

similarly to the fixed system, we can fit 4 fields into 64 bits, except we lost the 'boxed' bit, and we only have 8 operand bits instead of 12:

first 16-bit syllable:

8 bit operand field: the opcode operand
3 bit addr mode field: the opcode addr mode (usually immediate, unsplit)
5 bit grammar field: EOW; EOP; no EOS; role 0

second 16-bit syllable:

8 bit operand field: the destination operand
3 bit addr mode field: the destination operand addr mode
5 bit grammar field: EOW; EOP; no EOS; role 1 (role 1 b/c this is relative to the whole clause/sentence, not to this phrase, which is role 0, assumed because EOP is set)

third 16-bit syllable:

8 bit operand field: the input1 operand
3 bit addr mode field: the input1 operand addr mode
5 bit grammar field: EOW; EOP; no EOS; role 2 (role 2 b/c this is relative to the whole clause/sentence, not to this phrase, which is role 0, assumed because EOP is set)

fourth 16-bit syllable:

8 bit operand field: the input2 operand
3 bit addr mode field: the input2 operand addr mode
5 bit grammar field: EOW; EOP; EOS; role 2 (role 2 b/c this is relative to the whole clause/sentence, not to this phrase, which is role 0, assumed because EOP is set)

(note that there are two role 2s here, because there are two input operands)

note how easy it would be to have longer than 8-bit operands; just unset the EOW bit (in which case 11 of the 16 bits of the additional word are added to the operand, in big endian fashion; long operands still have the other 5 grammar field bits (but should they? we only really need the EOW one..)). So we can easily have a 19-bit operand, a 30-bit operand, etc.

if we need mode/modifier bits, either on the operand or the sentence level, we simply add some role 3s. (do role 3s have addr mode bits? i guess so, just for uniformity). we'll get 8 mode bits. If we need more, just unset the EOW.

---

so now what do annotations look like?

---

we need pronouns and conjunctions, too. i feel like a pronoun is like an addr mode. or do we need them? why not just load a referent into a register? that accomplishes the same thing.

---

'ootby' (oot assembly/oot bytecode) has a positional input argument syntax, is that ok? in that case we need a 'argument not provideed yet' sentinel (of 3 forms ?: the one that creates a function expecting that argument, and the one that assumes a default if there is one, and ow also expects that arg; and one that creates a partial fn that wont execute even if 0-ary)

---

how to give modes of more than 1bit?

should we provide a keyword arg option? (for role 2)

---

parens (subordinate clauses) might not be necessary, bc they can be replaced by previous sentences that load into registers

eg 'The bad man bit the dog':

r1 = 'the bad man' r1 bit the dog

---

even if recursive expressions with pronouns are not allowed, we at least want scoped regs, like haskell's let... in ..., so that we can tell when a temporary register that was just created for one expression is done. For this, however, we could just use an 'UNDEF' instr. i guess it wouldn't hurt to have type declaration annotations at variable initialization too...

---

maybe the concept of 'paragraph', paragraph-local vars

is this part of a general pattern of optimizing non-recursion, similar to link registers which optimize for only one level of subroutine? no, because in fact a parargaph can contain recursive calls, so this is a diffeerent concept

on the topic of link registers, do we want a similar req on fast fns? no b/c then ordinary partially-applied functions can't be used as fast functions.

---

'base' is a bad name, b/c in the case of 'noun phrases' (arguments), it makes you think of the operand, not the addressing mode, but it should be the addressing mode. mb it should be something like 'verb'.

addr mode is verb, not operand. but here we have addr mode/operand pairs, so that's ok. mb call it 'operation'? hmm.. really in general it's not necessarily active, it's just the type of meaning we're assigning to the rest. interpretation? too many other meanings. extensible grammar determinant? too long. root? naw, too man other meanings. 'base' doesnt sound do bad now. but 'operation' is probably the best one so far. maybe 'connective', as in logical connective? too abstract-sounding

so, 'op'.

note that there are 'main ops', which are the ops for top-level role 0, and then there are 'subops', which are ops for role 0 in other top-level roles.

note also that:

the typical 'ops' in roles 1 and 2 contain not just the addressing mode, but also the operand, to save space in the typical case
we need to have a way to have other ops in roles 1 and 2, AS WELL AS longer operands. Perhaps multisyllable subops in roles 1 and 2 should be reinterpreted as all-addressing mode (all subop), and then you use role 2 (or mb role 1?) for the operand?
mb we should make the destination role role 2, and the input role role 1, if we intend to use the input role for longer operands

---

useful assembler directives:

 ORG $400: what is to follow should be located at memory location $400 [http://alanclements.org/68k.html]
 DC.B label 12: declare a byte of memory, label its address 'label', and initialize it with 12 [http://alanclements.org/68k.html]
 END $400: "End of program and entry point" [http://alanclements.org/68k.html]

possibly useful instructions: STOP #$2700

"The 68K has an explicit halt instruction called STOP that usually takes the parameter $2700 (this parameter is loaded into the processor status word)." -- [4]

---

let's see again how the semwebish stuff looks in this format:

old fashioned AST: a = 2 + 3: assign(a, +(2,3)) us: +,a,(2,3) the boy is happy: subject: boy, verb:is, object: happy: us: is,boy,happy or: the boy is happy: subject: boy, edge:emotionalState, object: happy: us: emotionalState,boy,happy the boy is happy today: us: same as above, but with additional role3 with 'today' bob knows fred: triple: subject: bob, predicate: knows, object: fred us: knows, bob, fred a graph edge: a--label1-->b: us: label1, a, b or maybe: 'edge', label1, (a,b) hypergraph edge with roles: src1: A, src2: B, dest1: C, dest2: D, label1: E ?us?: 'edge', A, {src2: B, dest1: C, dest2: D, label1: E} ?us?: 'edge', , {src1: A, src2: B, dest1: C, dest2: D, label1: E}

and remember the hidden fifth field; the primary key (id) (address) of this instruction

or, in the case of graphs, we could stuff edge ids into the structure itself:

hypergraph edge with roles: src1: A, src2: B, dest1: C, dest2: D, label1: E ?us?: 'edge', A, {src2: B, dest1: C, dest2: D, label1: E} ?us?: 'edge', id-of-this-edge, {src1: A, src2: B, dest1: C, dest2: D, label1: E}

NOTE THAT HERE WE FIND OURSELVES NEEDING A DICT INSIDE THE EDGE PRIMITIVE

could get around this by putting the dict's keys (the roles) into role 1 (i forgot to switch roles 1 and 2, so currently role1 is still 'dest')

or, could represent hyperedge as a collection of edges:

'hyperedge', id-of-this-hyperedge, (subedge-id-1, subedge-id-2, subedge-id-3, subedge-id-4, subedge-id-5) 'subedge', subedge-id-1, ('src1', A) 'subedge', subedge-id-2, ('src2', B) 'subedge', subedge-id-3, ('dest1', C) 'subedge', subedge-id-4, ('dest2', D) 'subedge', subedge-id-5, ('label1', E)

or if we assume the ids are implicit:

'hyperedge', ?, (subedge-id-1, subedge-id-2, subedge-id-3, subedge-id-4, subedge-id-5) 'subedge', parent-hyperedge-id, ('src1', A) 'subedge', parent-hyperedge-id, ('src2', B) 'subedge', parent-hyperedge-id, ('dest1', C) 'subedge', parent-hyperedge-id, ('dest2', D) 'subedge', parent-hyperedge-id, ('label1', E)

or mb better:

'begin-hyperedge-annotation', ?, (subedge-id-1, subedge-id-2, subedge-id-3, subedge-id-4, subedge-id-5) 'subedge', 'src1', A 'subedge', 'src2', B 'subedge', 'dest1', C 'subedge', 'dest2', D 'subedge', 'label1', E 'end-hyperedge-annotation'

or:

'begin-hyperedge-annotation', ?, (subedge-id-1, subedge-id-2, subedge-id-3, subedge-id-4, subedge-id-5) 'src1-role', parent-hyperedge-id, A 'src2-role', parent-hyperedge-id, B 'dest1-role', parent-hyperedge-id, C 'dest2-role', parent-hyperedge-id, D 'label1-role, parent-hyperedge-id', E 'end-hyperedge-annotation'

can we merge the idea of a 'role' here and our grammatical 'role'? sure, the latter is a special case of the former, if we consider our instructions to be hyperedges with 5 edges (roles 0-3, plus an identifier for the edge).

... out of the above, i think i like the one where in the subedges of the hyperedge, dest=hyperedgeid, and both the role and the destination node were 'input arguments' (role 2)

---

maybe, just as instructions are arranged linearly, we should allow graph edges to stand in a partial order with each other?

not sure if that's a good idea though...

---

"For example, Motorola's 68300 chips have a cool TBLS (table lookup and interpolate) instruction that can replace hundreds of lines of C code. And it executes in six clock cycles. With one assembly instruction. C compliers never use it.

Hitachi's SH7708 processor has a 3D matrix transform instruction that does a whopping 16 multiplies and 12 adds at once, all with floating-point numbers. It's used for calculating the angle of reflection between a 3D light source and a 3D polygon. Obviously useful for video games. But only assembly programmers ever use it.

...

Occasionally a killer instruction comes along -- like MAC -- but that's rare. " -- http://www.embedded.com/electronics-blogs/shop-talk/4024514/Do-processors-and-instruction-sets-matter-

---

instead of 'operation' or 'op', one could say 'construct'. Nameclashes with 'cons' and 'construction' though.

---

role 1 is also 'output params', or lhs(s), and role 2 is also 'input params' or rhs(s)

---

y'know, since we have constants, we don't need address 0 to be a zero register. But we still want a way for an instruction to indicate that it discards its output. We could just omit the output phrase in that case, but i'm guessing we want to reserve that for implicit stack inputs/outputs. So how about we make 0 the PC, and say you can't write directly to the PC, and that any attempted write to the PC means 'discard'?

This seems to fit b/c for the same reason we want to allow an instruction to specify discard (to hint to the compiler/interpreter that the result value is unused), we want to force jumps to be via special instructions, instead of just arithmetic instructions output into the PC.

---

i like how the aesthetic/goals of Oot Assembly/Oot Bytecode are coming together. In the last section, we justified two things (having a way to explicitly state that an output value is 'discard'ed, and forcing jumps to use specific instructions) by expressiveness. Also, the variable length encoding is justified by expressiveness too, as is the use of memory-memory.

the variable length expressiveness also shows that we don't really value performance that much.

Although this is still a linearization of the program, not an AST.

---

need some sort of annotation to say "the next bunch of things are compile-time hints, if you're interpreting this, skip over X instructions" (where X is eg 16)

eg when we reach the end of a scope we may have a bunch of UNDEFs. An interpreter might choose to just skip all these at runtime, especially if it did a scope analysis upon initial load (load time) already.

---

actually since instructions are variable length, shouldn't jumps be specified in terms of actual numbers of bytes in the bytecode, rather than numbers of instructions?

what does the JVM do? (looked it up) yeah i think goto and jsr deal in byte offsets, not instruction offsets

---

interesting idea:

"and a 1-bit SCC field determines whether the condition code is to be updated" -- http://alanclements.org/rischistory.html

i don't think we have the bits for it right now though (without using the extended form). unless this is a bit taken out of the opcode, which it might be (error handling mode)

---

wow, this manual seem like it would be too detailed and irrelevant but i find it inspiring by illuminating the sort of 'intent' that needs to be preserved long enough for low-level optimizers to have a crack at it, as well for illuminating how some of the issues i've wrestled with are dealt with in the recent x86 ISA implementations:

http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

see ootIntelOptimizeManualNotes1.txt

---

the cmov instruction on x86, conditional move, is different from a branch in that BOTH branches are speculatively executed.

so this is like my idea for a neural-net like concurrency where sometime both branches are executed in parallel by different sets of neurons, and then sometimes an actual branch is made.

---

Intel User/Source Coding Rule 17 suggests using the PAUSE instruction inside fast spin locks. This suggests that in Oot Core, either we should have PAUSE or a fast spin lock or something higher than that should be a primitive (i think that something higher than a spin lock should be a primitive).

---

Intel Optimization Manual 3.4.2.1 has some details on micro-ops: "Stores execute internally as two separate micro-ops: store-address and store-data."

there are a few other details on specific micro-ops in this document, but i didn't see a good overview. Maybe there is a table somewhere that says how many microops per instruction.

---

" 3.4.2.3 lea xor xor loop: add sub jnz rdx, buff - 4 rcx, LEN eax, eax eax, [rdx + 4 * rcx] rcx, 1 loop Length-Changing Prefixes (LCP) The length of an instruction can be up to 15 bytes in length. Some prefixes can dynamically change the length of an instruction that the decoder must recognize. Typically, the pre-decode unit will estimate the length of an instruction in the byte stream assuming the absence of LCP. When the predecoder encoun- ters an LCP in the fetch line, it must use a slower length decoding algorithm. With the slower length decoding algorithm, the predecoder decodes the fetch in 6 cycles, instead of the usual 1 cycle. Normal queuing throughout of the machine pipeline generally cannot hide LCP penalties. The prefixes that can dynamically change the length of a instruction include: • • operand size prefix (0x66) address size prefix (0x67) "

---

http://stackoverflow.com/questions/18113995/performance-optimisations-of-x86-64-assembly-alignment-and-branch-prediction says that Intel used to support 'branch hints', presumably that means hints about which branch is more probable. This could be a good idea in Oot, particularly because it fits in with my observation from speech recognition that prior probabilities are key and therefore that UIs should be told how probable the program thinks various UI actions are in this state (to enable UIs such voice recognition, swype, etc to work better). But specifying any more than which branch in a switch table is most probable is probably too many bits for a fixed encoding, and in addition this is only a hint, so this should be in annotations.

---

intel's rule for 'hinting' without explicitly hinting which branch is most probable:

Assembly/Compiler Coding Rule 3. (M impact, H generality) Arrange code to be consistent with the static branch prediction algorithm: make the fall-through code following a conditional branch be the likely target for a branch with a forward target, and make the fall-through code following a conditional branch be the unlikely target for a branch with a backward target.

---

i was wondering how Intel squares variable-length instructions with alignment and efficient JMPing. It doesn't:

intel says:

Assembly/Compiler Coding Rule 12. (M impact, H generality) All branch targets should be 16-byte aligned.

---

" You might not have expected this, but in 64-bit assembly for the AMD64, pointers to code and data whose addresses are within the executable are still only 32-bits. This ties in with the fact that RIP-relative addressing limits the size of the executable to 2GB. Pointers to external addresses, such as functions in Dlls, are 64-bit wide so that the function can be anywhere in memory see call address sizes.

...

RIP-Relative addressingtop Some instructions in the AMD64 processor which address data or code, use RIP-Relative addressing to do so. The relative address is contained in a dword which is part of the instruction. When using this type of addressing, the processor adds three values: (a) the contents of the dword containing the relative address (b) the length of the instruction and (c) the value of RIP (the current instruction pointer) at the beginning of the instruction. The resulting value is then regarded as the absolute address of the data and code to be addressed by the instruction. Since the relative address can be a negative value, it is possible to address data or code earlier in the image from RIP as well as later. The range is roughly ±2GB, depending on the instruction size. Since relative addressing cannot address outside this range, this is the practical size limit of 64-bit images.

Data alignment All data must be aligned on a "natural boundary". So a byte can be byte-aligned, a word should be 2-byte aligned, a dword should be 4-byte aligned, and a qword should be 8-byte aligned. A tword should also be qword aligned. GoAsm? deals with alignment automatically for you when you declare local data (within a FRAME or USEDATA area). But you will need to organise your own data declarations to ensure that the data is properly aligned. The easiest way to do this is to declare all qwords first, then all dwords, then all words and finally all bytes. Twords (being 10 bytes) would put out the alignment for later declarations, so you could declare all those first and then put the data back into alignment ready for the qwords by using ALIGN 8.

As for strings, in accordance with the above rules, Unicode strings must be 2-byte aligned, whereas ANSI strings can be byte aligned.

When structures are used they need to be aligned on the natural boundary of the largest member. All structure members must also be aligned properly, and the structure itself needs to be padded to end on a natural boundary (the system can write in this area). Because of the importance of this, from Version 0.56 (beta), GoAsm? aligns structures automatically for you. See automatic alignment and padding of structures and structure members for more.

...

It is also a requirement that the structure is enlarged so that it ends on the natural boundary of its largest member

...

In practice it was found that the system wrote to the area of padding ... in certain circumstances. This shows the importance of complying with these rules (otherwise you could find that data after the structure could be written over).

" -- http://www.godevtool.com/GoasmHelp/64bits.htm#diffrel

---

notice the modal demand/promise operator in preconditions/postconditions; note that modal logic here just being like the mode bits in Oot Assembly Notes 4 syntax! so that's a big win

more generally, useful points about our syntax include:

attach first-class labels (and therefore annotations) to any point or static or dynamic region of code (includign AST nodes/subtrees/"potential scopes"), or to the reified connection between an AST node and its child (these can be used to attach extensible grammatical roles)
attach first-class modalities to any AST node/subtree (= potential scope)

note: if roles are extensible then we need more an EOR (end-of-role) bit. So, steal an addr mode bit; no more split addressing modes unless you use longform addr modes (verb in roles 1 or 2).

so our syntax labels the ast nodes (values), and labels the edges (roles). and includes region annotations, and quick-jump parens

note: the fifth role is implicitly annotation (address of the instruction in the instruction stream/global memory)

hey lets say that variable-length instructions first have to say their length!

yknow, these arbitrarily long 'words' will be quite obnoxious for implementers, should probably give them a max length and then also have a fixed number of length prefix bits to say how long the word is. if just one such bit, then we have a choice between a 16-bit word and a 32-bit word, so payloads of either 8 bits or 24 bits. but if that's it, i'm still tempted to just go with a fixed-length encoding, at least at the word level.

if we fully reify both roles and addr modes.. but then we get more than 64 bits for the common case of an opcode and 3 operands

each sentence could be prefixed with an annotation byte that says if it is short form or long form, and if it has some directed relation with its neighbord (does it come after the previous sentence, or is it incomparable with it? is it a child of it, a sibling with it, or an uncle or even an anceestor? so 2 bits). recall that sentences already are implicitly labeled via their position in the instruction stream. also, we need a way to begin/end new 'languages' within the instruction stream, eg for composite data literals.

i guess if we are doing fixed length, things like roles need at least 4 bits in general. eg allow at least 16 roles max.

so suggests: 16bit payload, 4 bits role, 2 bit addr mode, 1 bit EOP, 1 bit EOS. but that suggests an 8bit payload.. which makes constant addr mode not as useful. oh i know; make the 'input' and 'ouput' subphrase roles special, with a 24-bit payload instead of 8. actually, do that with all of them except role 0, which is then a 'packed' role in the subphrase, which makes sense bc it is the op, which is almost grammatical.

so this way we can still achieve our target of a 64-bit 3-operand instruction in the typical case; the typical case is represented by having each phrase be a single word of subrole 0.

instead of 32bit longwords with 24bit payload, could give some more bits to addr mode, etc and have 16bit payload and mb a bitlimited second payload for the addr mode to use, instead of splitting those 16 bits. eg a 16bit payload, 8 bits as above, 4 bit secondary addr mode, and 4 bit secondary addr mode payload (so now we're back to 16 registers again, because that's all that the second payload can see)

do we want this to potentially be an AST format too, with parens and subtrees, or even a dag or graph format? that would be in keeping with Oot's focus on graphs and on oot core language homoiconicity

in that case, do we need some of those 8 free bits for quickjump parens and annotations and reified edges? and/or for phrase lengths? or for form specifiers (eg extended form or not, variable form or fixed form, etc)?

do we want a fixed-field sentence form? mb in which case, mb 2 forms, 64bit fixed field, and 64bit phrase? prob. not..

mandatory sentence size field for var length sentences? 4 bits? mb part of a modality phrase, or is this special, in which case, what to do with the next 12 bits? length in terms of 16bit words? mb specify len of each phrase? but then max len is actually 12, not 16. which mb okay; the last 4 lengths are special forms.

so, then first bit is a form bit?

if the first word of the sentence is a 16-bit prefix, and if role 0, subrole 0 is only 16 bits, and other words are all 32 bits, then we are 32-bit aligned at least. not sure if that's feasible tho; we dont really need 32 bits for each other word. if all subrole 0s aree only 16 bits, and other subroles are 32 bits, then alignment is uncertain. also we dont appear to have much 64bit alignment at all.

i'm a little worried about the non-64-bit alignment of our variable-length form w/r/t jump targets. i dont really know what to do about it, tho. x86 assembly has a similar problem (i think in x86 they sometimes pad with NOPs to make jump targets/branch targets align).

note that we do need the modes in modal logic to take arbitrary arguments, so that we can do things like this: https://en.wikipedia.org/wiki/Dynamic_logic_%28modal_logic%29#Language

we'd do that with a mode 'operator' in subrole 0, and then 'inputs' in subroles 2 or 1.

ok to reduce implementation complexity i think it's important that the bytecode be an array of elements of some fixed size (16 bit, 32-bit, etc), and to have the 'payload' of what we are calling 'words'. Incidentally, i think that language needs to be changed; it's an unfortunate fact that if we talk about 'words' while describing our bytecode format, the reader will think of a fixed-length array of 8-, 16-, 32-, or 64-bit packets, like platform words, rather than of the broad linguistic place of 'word' in the hierarchy morpheme/word/phrase/sentence, which is what we have been using it for to date. Let's change it instead to either 'lexeme', adopting the vocabulary of https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization , even though the use of 'lexeme' here is different from its usage in linguistics, or to 'chunk' or 'packet' or 'field' or something. Ironically, i bet this sort of thinking is why compilers started (mis)using 'lexeme' to represent a substring being grouped into a token, rather than the linguistic meaning of it; they probably didn't want to use 'word' b/c that makes computing ppl think of fixed-length platform words, and they figured lexeme was an equivalence class for meaning (almost but not quite), so they used that (before Wikipedia it probably wasnt too easy to quickly doublecheck that sort of thing). We might still use 'word' to indicate the fixed-size array elements.

Also, instead of the linguistic term 'phrase', we might adopt the logical term 'term'.

or, we could use 'term' instead of word/lexeme/chunk/field.

i guess that means:

giving up on the ideal of arbitrarily long (extensible) terms
giving up on the ideal of sufficiently compact bytecode to be efficient

If the latter is too much to bear, we can introduce alternate forms.

as noted before, the minimal reasonable size for things is 16 bits; 8 bits (256 values) is just too small for some things (such as addressing), and we want powers of two. So let's go with that (the alternative is to make our programs even more inefficient by forcing every lexeme to be 32 or even 64 bits, even though most of the values will be between 0 and 16!)

given all the junk we see above that we want to stuff into each word, we want at least 8 bits of 'header' for each 'payload'. So if each 'payload' is 16 bits, we need two words for that.

since we are interspersing payloads and headers arbitrarily, and we dont want to constrain the payloads without having parsed the entire stream before a point, we can't resync and tell which bytes are part of payloads and which are headers. So, let's constrain things so that the first thing on every 64-bit boundary is a header. So if we have at least 8 bits of header first, then we have 56 more bits to put payloads and headers in before we need another header.

eg if payloads are 24 bits, then we're 32-bit aligned and hence 64- too. If payloads are one of [8,24,40,56], with certain constraints on nearby payloads (eg if we go 8,40, then we must go 8,8 next), then we remain 64-bit aligned.

interestingly, assuming instructions needing big payloads are much less common than small ones, we could just go 8-8,8-8,8-24 in every 64-bit packet. This is getting a little complicated for the supposedly trivial to implement, efficiency-hating design.

also if we wanted to go in the other direction and make things even more 'fixed' width, we could require each phrase to fit in at most one 64-bit packet, and not to go over the 64-bit packet edges. This interests me.. we could still fit 4 16-bit chunks within a 64-bit phrase.

also, the implementation is not TOO difficult if the combined payload size is variable but there is an upper bound on the bit-width of the combined payload.

also, we could have a header at the beginning that 'maps' a sentence or phrase, rather than all those role bits in each header. but in general i doubt we could easily save many bits that way, even if you required roles to stricly increment, because sometimes you skip them, and sometimes a role repeats yet you still need to mark the boundary between repetitions (at the sentence level).

so, let's do that for now. so we have 64-bit packets, and phrases can't span packets, but we can have multiple phrases in one packet, and we can have payloads that combine, and sentences also must end on 64-bit packets. A sentence header says how many 64-bit packets the sentence consumes (max 16 (128 bytes) would mean max 64 16-bit phrases).

heck, might even make it 4-bit payloads and 4-bit headers and combine them as before with an EOW. But then a 4-bit header isnt enuf for EOW, EOP, EOS, 4-bit role in every header.

128 byte sentences seems expressive but might put too heavy a load on the resources of the decoder. So maybe max 4.

So max 4 words in a 64-bit packet, and max 4 64-bit packets in a sentence; that's max 16 8-bit payloads.

if we only allow 8-bit or 16-bit payloads, then there's less chance of not getting to have a big payload when you want b/c it's near a 64-bit boundary, but there's still a chance; so may as well allow up to 32-bit or whatever.

so i guess we're pretty much back to our old design, except for the 64-bit phrase boundary limit, and the 256-bit sentence limit.

and instead of using a EOW terminator bit, just have a two-bit payload length in the header.

so the header is now: EOS, EOP, 4-bit role, 2-bit payload len,

---

remember the point of all this: the point is so that when you have an expression of the form:

let a = (addressing mode on operand expression) let b = (addressing mode on operand expression) let c = (addressing mode on operand expression) c = add(a, b)

so that you don't have to put a,b,c into registers, knowing that they are going to be immediately consumed and then never used again; or, another way of looking at it, this is equivalent to that but is a way to hint that a,b,c are going to be immediately consumed and never used again. Supposedly this keeps the 'intent' more. Also, we get to add modality in an extensible way.

---

what should the sentence header contain? if we have 16 roles now we can have role 0 just for the sentence header. it could contain a map of the sentence, where it breaks into phrases, i guess. That would only take 16 bits at the most, in fact 15 bits because the sentence header takes up some room and we already know how much. But we need 2 bits for the sentence length. We could steal the EOS and EOP bits but that would be irregular.

but i'm somewhat unhappy with the whole idea of a sentence header because it appears to compromise the dream of being able to fit a whole 4-field sentence into one 64-bit packet.

also; we know that phrases must end at the end of each 64-bit field, so that's a wasted bit. And we know that sentences can't end except at the end of each 64-bit field, so that's some wasted bits.

so really we only need 1 bit for both EOP/EOS, because it switches between them. so we have an extra bit.

also, if we removed the 256-bit sentence length limit, then we need an EOS bit, and we can't have a sentence length field, so we can't do better.

so let's do that.

---

so we have:

header field looks like: 2 length of payload bits, EOP or EOS bit (if last field in 64-bit packet), 4 role bits, 1 reserved bit
phrases cannot go over 64-bit packet boundaries
there is no limit on the length of sentences

wait we have no addr mode now :(

we can either:

give up variable payload length, make it either 8-bits always or 24-bits always, and have 3 bits of addr mode
have only 1 bit for variable payload length, to choose between 8-bits and 24-bits, and have 2 bits of addr mode
give up 16 roles, make it 2 role bits, and have 3 bits of addr mode
have 8 roles, make it 3 role bits, and have 2 bits of addr mode
have only 1 bit for variable payload length, to choose between 8-bits and 24-bits, have only 8 roles, and have 3 bits of addr mode
use the reserved bit for 1 bit of addr mode, immediate vs register (no constant.. but can use phrases for that)

recall that no matter what we do, we can use phrases for more addressing modes. This would suggest one of:

use the reserved bit for 1 bit of addr mode, immediate vs register (no constant.. but can use phrases for that)
have only 1 bit for variable payload length, to choose between 8-bits and 24-bits, and have 2 bits of addr mode
have 8 roles, make it 3 role bits, and have 2 bits of addr mode

otoh those split addressing modes might be crucial for allowing many useful sentences to fit within 64-bits; we often need register + offset, after all.

so far i like this one the best:

have only 1 bit for variable payload length, to choose between 8-bits and 24-bits, have only 8 roles in 3 role bits, and have 3 bits of addr mode

so:

header field looks like: 1 length of payload bit (8 or 24 bits), EOP or EOS bit (depending upon last field in 64-bit packet), 3 role bits, 3 addr mode reserved bits

---

oh i know: say that if there is no sentence header, then the sentence is 64 bits long.

also, kinda stinks to have the EOS at some random bit in the middle of the field. Let's move that to the beginning, and make the phrase bits Begin New Phrase rather than End Of Phrase (the EOS bit is still EOS, pertaining to the end of the current 64-bit packet).

so:

header field looks like: 1 length of payload bit (8 or 24 bits), EOP or EOS bit (depending upon last field in 64-bit packet), 3 role bits, 3 addr mode reserved bits
phrases cannot go over 64-bit packet boundaries
256-bit sentence length limit
sentence header is mandatory at the beginning of any sentence longer than one 64-bit packet. sentence header is role 7. Payload of sentence header is: 2 bit length (x 64 bits), then a 'map' of phrase boundaries (with the first and last bit omitted b/c phrase end is assumed) (we got some extra bits in the 24-bit payload case (which is needed for >128-bit sentences) here; 2 length bits, 14 bit phrase map, still 8 bits left
we are 8-bit aligned and 64-bit aligned. We could relatively easily achieve 32-bit alignment if we wanted by prohibiting starting a 24-bit payload in the second packet slot (eg prohibiting a 32-bit item starting at bit 16)
there is a 'modifier' role whose payload is split in two; the first half is the choice of modifier field (eg 'color'), and the second is the modifier value (eg 'blue')
- mb, to save space, there is also a 'modifier' role that is just a bitfield with predefined modifier fields at each location, who knows

interestingly, with 8 roles and 24 bit payloads we have enough to encode natural languages like English (8 parts of speech, native speaker active vocabulary about 20k; in fact number of total words in dictionary estimated at about 1mil, which is less than 24-bit max 224 = 16M), (assuming that we have parens) except that the 4-word length on phrases and the 16-word length on sentences is too limiting. If you wanted to get rid of those limits, you would simply remove the requirement that phrases cannot cross 64-bit packet boundaries, and remove the requirement for a header field at the beginning of a sentence whose length is longer than 64-bits.

---

ok yeah let's loosen those restrictions a little by giving up the phrase map in longer sentences and giving up the single-packet phrase restriction. Implementations can still execute single-packet phrases quicker if they want. Also let's clarify parens and annotations.

header field looks like: 1 length of payload bits, BNP or EOS bit (if first field in 64-bit packet), 3 role bits, 3 addr mode bits
256 phrase limit
224 x 64-bit sentence length limit
sentence header is mandatory at the beginning of any sentence longer than one 64-bit packet. sentence header is role 7 with register addressing mode. Payload of sentence header is then length of the sentence, in 64-bit packets. Lengths 0 or 1 can have a special meaning, and we can have two more special meanings if the sentence header is 24 bits but 0 or 1 is chosen. (all these special meanings are currently reserved). Sentence header must have register addressing mode, but the payload is interpreted in constant addressing mode.
role 7 with immediate addressing mode is left parens; sentence header with constant immediate addressing mode is right parens; like sentence headers, parens may only occur in the first slot at a 64-bit boundary; the payload for a parens is the number of 64-bit packets included in the corresponding parens
while in parens, the EOS bit is never set; parens can only include phrases, not complete sentences
role 7 with register indirect addressing mode is region annotation begin
role 7 with split register addressing mode is unquote (used only within quasiquote)
role 7 with split register indirect addressing mode is region annotation end. Region annotation end is not needed after an annotation with region length of 0 (point annotation) or 1 (annotation whose region is this 64-bit field)
role 7 with split immediate addressing mode is left-quasiquote, and role 7 with split constant addressing mode is right-quasiquote
quasiquotes have payloads like parens. parens, annotation regions, quasiquotes have 'flavors' and split payloads; half of the payload selects the flavor, the other half is the length (this means 4k is the largest allowed parens length)
the most-significant-bit of the quasiquote flavors means 'there is no unquote (that ascends all the way back up to this level) in this quasiquote' (so there are only 2k distinct flavors)
the unquote payload is split into a 'flavor' and a depth (for nested unquote).
role 0 is 'head' (whose semantics are language-dependent; but there must be at least one head in each phrase)
we are 8-bit aligned and 64-bit aligned. We could relatively easily achieve 32-bit alignment if we wanted by prohibiting starting a 24-bit payload in the second packet slot (eg prohibiting a 32-bit item starting at bit 16)
there is a 'modifier' role whose payload is split in two; the first half is the choice of modifier field (eg 'color'), and the second is the modifier value (eg 'blue')
- mb, to save space, there is also a 'modifier' role that is just a bitfield with predefined modifier fields at each location, who knows
(visible) side effects of a sentence do not (visibly) occur until after all phrases have been decoded (including interpretation inside parens and interpretation of unquote inside quasiquotes)
phrases and subroles within a phrase are ordered, however, there is no guarantee that they will be decoded in order (in case their values rely on changing values in memory locations)
the first memory location is the PC. The PC cannot be written to; writes to the PC are discarded
the next three memory locations are stack-associated regs; the first of these is the stack pointer; the second is a pseudo-location such that a write to it is a PUSH (what about indirect addressing?) and read to it (even indirect addressing) is a POP; the third is a pseudo-location that always holds the TOS. Note that because our pointers are opaque, the 'stack pointer' may also be implemented as a pseudo-location.
the next memory loc is a GPR, but it's also implicitly used as a 'status register' for those instructions that need one (containing bits for arithmetic carry, overflow, etc)
the rest of the locs are GPRs (note that we've given up on the multiple-stack idea, because we already have only have 11 GPRs for split 8-bit instructions, and that's cutting it close already)
instructions which don't specify a needed operand are assumed to have the stack be that operand (eg register addressing to the PUSH/POP pseudo-location) (to do partial function application, use a modality)
- or should partial fn application be the default? mb so if we're doing a lot of laziness..

profile implementations may place the following limitations on sentence form (we can pack a specification of these restrictions into (less than) 16 bits, in such a way so that a bitwise 'implies' operation can immediately tell if a program is runnable on a given platform):

parens, quoting, unquoting, annotations, region annotations are all optional features
sentence length may be limited to either 256 64-bit fields or to some power of two smaller than 256
max phrases per sentence may be limited to some power of two smaller than 256
phrases may be limited to not cross 64-bit packet boundaries
starting a 24-bit payload in the second packet slot could be prohibited (to achieve 32-bit alignment)
24-bit payloads may be prohibited (making everything 8-bit only)

note: the official implementation does NOT support all these, eg it does NOT long phrases (in the instruction stream; in quoted data structures it does i guess)! this is for use in alternate languages/quoting!

this architecture makes some thing 'registery':

the first 16 memory locs are registery in that a split 8-bit payload can only address these
the first 256 memory locs are registery in that an unsplit 8-bit payload can only address these
the first 4k memory locs are registery in that a split 24-bit payload can only address these

this appears to meet our goals of:

relatively easy to implement decoder
can be decoded in parallel
preserves a lot of intent
extensible
can encode linear instructions streams, but also expression trees, and oot nets

and provide the following features:

stack support
extensible addressing modes (via phrases)
extensible modalities (via phrases)
can fit one traditional instruction into 64 bits
8-bit alignment
64-bit alignment
constant table is large enough to hold references to linked-in libraries
verb indirection allows higher-order functions
parens, quasiquoting, annotations including region annotations

remaining questions:

do the memory locs more like registers, which hold their values in between function calls, or local variables, whose meaning changes in between function calls? Eg when the stack frame shifts, do these other things shift?
dont we need a frame pointer along with that stack pointer?
are default operands partial fn application, or are they stack operands?
do we have a separate 'data stack' and 'return stack'? separate 'call stack' too? if so, which one is the stack pointer for?

how many GPRs are enough? we'd like to be able to code tight loops entirely with 8-bit payloads; when these are split, we can only access the first 16 memory locs. But we want to use some of these for stack(s), status regs, etc.

https://fail0verflow.com/blog/2012/dcpu-16-review.html alleges "Eight 16-bit registers are also not enough to perform 64-bit operations entirely in registers.". http://stackoverflow.com/a/16248421/171761 alleges "Measurements (and experience) have shown that 8 registers was not really quite enough to avoid such spills, and since spills touch memory, they slow down the processor considerably. I think 32 is considered (carefully measured) to be more than enough registers, but that would have required 2 bits, and 16 is close enough to ideal to be very practical."

http://web.eecs.umich.edu/~tnm/trev_test/techRepts/2000.10.The_Need_for_Large_register_files_in_Integer_codes.pdf cites various studies "For exam- ple, one study suggests that the number of processor registers that can be effectively used is lim- ited to a couple dozen [1]. Others have suggested that existing compiler technology cannot make effective use of a large number of registers [2]. Studies on the RISC I architecture refer to earlier work which shows that 4 to 8 windows of 22 registers each (for a total of 80 to 144 registers) is sufficient to house the locals and parameters for over 95% of function calls [3, 4]. In addition there has been much effort in the area of optimizing spill code; this is indicative of a need for more registers. Floating point and DSP codes are generally thought to be able to take advantage of a large number of processor registers if the compiler utilizes advanced transformations such as loop unrolling, register renaming, accumulator and induction variable expansion, and software pipelin- ing. One study confirmed this by showing that for a set of loop nests from the PERFECT and SPEC suites, register utilization increases by a factor of 2.6 after all ILP transformations were applied. While most optimized loops required fewer than 128 registers, the average register usage was about 70 for the loops studied [5]. However, other work which used some similar benchmarks found that after 32 registers, adding registers produced only a marginal performance improve- ments, particularly if sophisticated code generation techniques are used [6]. Finally another study showed that optimal performance for a number of Livermore kernels requires 128 to 256 registers [7]." and their own results say "64 or more registers enables performance improvements from 5% to 20%....most programs can easily use 100 to 200 registers when multiple active functions are considered for simultaneous allocation to the register file."

http://alanclements.org/rischistory.html says "Measurements made by Tanenbaum, and by Halbert & Kessler indicated that for over 95% of dynamically called procedures just twelve words of storage are sufficient for all their arguments and local values."

so our current 11 regs are probably a good compromise. It may or may not be a good idea to sacrifice one more of those to get a 2nd-on-stack pseudo-reg; but right now we have only the first 4 regs are pseudo-regs, which is a power of two, which is nice (presumably in a highly optimized architecture this would allow the mandatory 'is this a real reg or an mmapped one?' check to be slightly faster).

---

remaining questions (copied from above):

do the memory locs more like registers, which hold their values in between function calls, or local variables, whose meaning changes in between function calls? Eg when the stack frame shifts, do these other things shift?
dont we need a frame pointer along with that stack pointer?
are default operands partial fn application, or are they stack operands?
do we have a separate 'data stack' and 'return stack'? separate 'call stack' too? if so, which one is the stack pointer for?

i think the memory locs are more like local variables

but we do want to use numbers in the same space to refer to linked functions (which are module-level globals, i guess)

note that with nonarchimedian model, we can have our cake and eat it too; first infinity is local vars, etc.

probably do need a way to get to upvals, tho? possibly use GETUPVAL and SETUPVAL like Lua? Or just encapsulate that with __GET and __SET and put the value in a normal memory cell that intercepts gets and sets internally?

we might want to present an 'activation frame' abstraction that contains a stack of pointers to activation frames (whether these are allocated on the stack, for efficiency, or on the heap, for closures). This could be a pseudo-stack/virtual abstraction to allow the programmer to see and alter the call stack in a straightforward way while not actually being the way that variable access is implemented internally (to reduce indirection).

so i think we do want a 'call stack'. but i don't want to burn another 3 low registers on a call-stack-as-array, call-stack-as-stack, and call-stack-TOS.

We might not even need TOS anymore now that we have 'noun phrases'. Addressing a stack's TOS can be done in a single noun phrase.

Also, i guess if we wanted, we could directly use PUSH and POP opcodes which took an extra input argument to say which stack to PUSH or POP. This way it leaves GET and SET open for peeking and poking past the TOS; and that means that we could just store the stack in one location, rather than 3. We could make 'stack' a special data type, or just have PUSH and POP apply to all resizable arrays (or even all graphs) and then have 'stack' be a hidden implementation type of arrays (linear graphs). i favor the latter.

Note the contrast with the way things are done in an HLL; here we have a limited set of verbs, such as GET, SET, PUSH, POP, and we're applying these to various objects, some of which only support some interfaces. In a HLL, if we want to push x onto stack a, we take the target object (a) and the NAME of the verb we want, eg 'PUSH', and dynamically do something like 'f = a.__getattr__('PUSH'); f(x)'. The distinction is that there, the available verbs can be expanded arbitrarily by the program, and hence we have to do an extra lookup. Note that the Oot philosophy is to use verbs GET and SET for treating a stack as if it were an array, rather than making up stack-specific verbs like PEEK and POKE with the same semantics.

In a way, i kind of like this 'assembly' approach; have a limited set of verbs, like www REST, and use that as a core language.

also, since we have perspectives, if we have a stack object but we want GET to POP and SET to PUSH, we can simply take a perspective on the stack object that has that behavior.

if we decide that call stack manipulation will be rare, then instead of burning a low register to provide access to it, we could have a 'GETCALLSTACK' instruction to place it into a register; or similarly (and better), bind it to a low constant (so now it takes two steps to access the call stack; first, fetch it and put it into a register; second, access it).

if we provide call stack manipulation, one could imagine implementing CALL at a higher level as follows: push the arguments onto the stack; create a new stack frame, placing those arguments 'within' the new frame; push the new stack frame onto the stack; execute a VM instruction that says 'now remap the memory cell to correspond to local variables in this new stack frame'; pass control to the callee. Of course CALL is a common operation so an optimized implementation would override our high-level implementation and implement it natively, but that's cool. This seems reminiscent of Guile's VM, which seems to expose a lot of these sort of details. Which argues against it; my first thought when i say Guile's VM is that "this sort of thing is too complicated for Oot". But at least we could.

---

if stack regs are just an abstraction, should we allow them to be remapped to different stacks?

---

the stack problem is an important one: how do you have interfaces with different r/w interface semantics in bytecode land? can this be dynamic (at runtimee, which local avrs have stack semantics and which dont switch) or must it be static? in the former case, what about verification?

i guess in general we want things like __GET__ and __SET__ to be able to cause non-standard behavior for reading from and writing to ANY memory location. We still need a way to circumvent that, however. I guess an addressing mode modifier to enable a bit that says to ignore that?

---

we might want to make a distinction between 'Oot Assembly' and 'Oot Bytecode' as follows.

In the bytecode, we don't want polymorphism, because (to review) if at runtime we encounter an instruction to add register 0 to register 1, we don't want to have to lookup the types of register 0 and register 1 to see whether we should jump to the integer addition routine, the floating point addition routine, etc. Similarly, when we GET a value from an object whose type is statically known, we don't want to check at runtime if that object is actually a linked list or a stack or an array before jumping to the relevant GET routine.

we could encode the statically-known types in a separate 'type' field in the opcode, then look up the handler in a table that is indexed by (opcode, type). But not all opcodes can be used with all types (or are handled differently with all types), so this table would be much larger than it needs to be. We could store the table sparsely or as a hash, but then lookup takes a long time, and this is something that we have to do upon every instruction.

so it seems sensible to say, ok, when we statically know the type, do this at compile time, so let the opcode directly specifies the routine that will actually be called. So have an 'addinteger' opcode for adding integers, an 'addfloat' opcode for adding floats, etc.

but in the assembly code, we may want polymorphism, because it reduces the size of the language, allowing the core operations to become clearer (eg just write ADD not ADDINT) (also it may save some code from being written twice; would that be a C++-style 'templates' approach?)

i guess this may be another difference between bytecode and AST interpretation. Or possibly just between Oot Core and Oot Bytecode

(yeah, actually, i think this should just be a diff. b/t Oot Core and Oot Bytecode; no use creating a separate 'assembly' layer)

(also, i guess Oot Core would be expression trees, whereas Oot Bytecode might try to linearize first, althogh Oot Bytecode can support trees via its parens)

---

might want to call the assembly 'ootas'

---

do we really need/want an operand stack at all? we don't NEED one since we're a register machine, but i thought it might be nice to have. But mb its not.

i particularly thought it would be nice to permit compact instruction encodings that use the stack, but currently we cant have sentences smaller than 64 bits anyways.

even so, though, it helps us preserve intent, which might be nice to help a compiler compiling the bytecode to native

---

we could still efficiently encode some alternate forms as follows:

if the first field is role 7 small, use the high order bit for a special purpose (if they want to have eg a 200 x 64-bit length sentence, then they'll have to use role 7 large). Now you have 7 bits left in this payload, plus the other 48 bits.

for example, for a compact encoding of up to 6 instructions, use those 7 bits as a map, where 1s indicate boundaries between instructions (we also have 2 spare bits here). Now the rest of the 64-bit packet consists of 6 bytes. There are no headers; each instruction starts with an opcode, then if there is another slot given to it in the map, that's the first input operand (in register addressing mode), and if there is another slot, that's the next input operands (in register addressing mode); operands not given are assumed to be the operand stack; all destination operands are assumed to be the stack except for the last instruction, whose first operand, if any, is the destination operand.

or could reserve one of the role 7 addr mode schemes for this (we could probably take the combination of EOS being true, and one of the role 7 addr modes that is incompatible with that, like parens)

instead of using the stack by default, could simply:

have 2 instructions with an 8-bit opcode, 3 4-bit operands, 3 2-bit addr modes (the header took up the first 16 bits; the header payload is available to us tho, so we have 8 bits reserved as of now)
or, have 3 instructions; after the 8-bit role 7 header, 24 bits (8, 3*4, 3*2, as in the previous bullet point), then an 8 bit opcode whose operands are all stack, then another 24 bits; this 'short form' lets us do 3 instructions in one 64-bit packet, as long as (a) no modality is needed, (b) they only use the first 16 'registers', (c) the middle instruction can be all stack, (d) only the first 4 addressing modes are used

these sort of schemes seem good but a lot of trouble.. mb they don't belong in Oot Bytecode, which eschews efficiency. although i kinda like the last one.

---

so we could have the first 4 memory locs be:

PC (and discard writes)
status word (more generally, location used for 'extra' input/outputs of some operands, ie sort of like a minor accumulator)
operand stack
call stack

---

if we do give an operand stack, do we want to allow it to be used for calls and returns? but then wouldnt we also want to provide explicit call and return ops? eg we may want to offer both of:

CALL (f, 53, 44)

and also

PUSH 53 PUSH 54 CALL-WITH-ARGS-ON-STACK (f, 2)

(note that we offer std calling methods that can handle keywords, not just positional params, although postitonal params are more efficient and should be defaulted to)

---

do we want to allow ppl to embed 2-level expression trees in a single sentence using phrases? Eg "c = a*3 + b*4"; add, c, (phrase: mult, a, 3), (phrase: mult, b, 4) (that's exactly 8 chunks needed, so 2 64-bit packets)?

if we do this, then the heads of noun phrases might become the same as for overall sentences (eg the opcodes), with the destination of noun phrases being a secret mini-stack in the VM

if we don't want this, then just don't say that you can put opcodes in the noun phrases.

---