proj-oot-ootAssemblyNotes8

https://en.m.wikipedia.org/wiki/Find_first_set gives more info on why count trailing zeros (ctz) and count leading zeros (clz) are popular operations; they are related to log base 2.
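As a quick illustration (mine, not from the linked article): ctz and clz both recover floor(log2(x)) information, and `x & -x` isolating the lowest set bit is the standard trick. A sketch for a 16-bit word:

```python
# ctz/clz for a 16-bit word, and their relation to log base 2.
WIDTH = 16

def ctz(x: int) -> int:
    """Index of the lowest set bit; x & -x isolates it."""
    assert 0 < x < 1 << WIDTH
    return (x & -x).bit_length() - 1

def clz(x: int) -> int:
    """Leading zeros in a WIDTH-bit word: WIDTH - 1 - floor(log2(x))."""
    assert 0 < x < 1 << WIDTH
    return WIDTH - x.bit_length()

print(ctz(0b10100))  # 2
print(clz(0b10100))  # 11 (16 - 5 significant bits)
print(ctz(1 << 10))  # 10 == log2(1024); for powers of 2, ctz is exactly log2
```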

---

why have meta bits in the graph data representation, rather than only having meta bits in access fns, eg in the arguments to whatever ops we have in program code to mimic __get and __set? b/c we want the graph to be able to 'modify' accesses, ie views; ie we want a node to be able to have an edge whose target is not just another node, but the 'meta view' of that other node.

---

so is the stuff i'm thinking of meta level 3? stuff like the 'metagraph'? no, actually, right now i think that stuff is still just meta level 2. Recall, metalevel 1 ('hyper') is 'a message about a message'; metalevel 2 ('meta') is 'a message about the connection between (a message about a message M) and the underlying/target message M'; metalevel 3 is about the metalevels, that is, the relation between metalevel1 and metalevel2. Now, you could say that the metagraph is itself metalevel3 because it may contain nodes representing 'views' corresponding to different meta-levels. But it probably contains a node representing a view for metaedges, and another node representing a view for meta-meta-edges. This is distinct from having one node representing 'hyper' and another node representing 'meta'. Also, metalevel 2 already contains the concept of an edge that points to an edge. Also, recall that metalevel 3 is analogous to perspective, eg the way that a 2-d perspective view of a 3-d scene has converging lines to cram 3-d information into 2-d in a 'bent' way; which is analogous to 'rotating' information and presenting it in different ways. A single 'metagraph' doesn't seem to contain this (although perhaps it would, if there were ways to transform the metagraph in various ways to represent metainformation in different ways, such as transforming from a representation in which (there is one node representing a view with metaedges and another node representing a view with metametaedges) into a representation in which (there is one node representing metalevel1 and another node representing metalevel2...)).

Similarly, one might think that the 'metagraph' is in fact metalevel4 because it contains many levels in one graph, but i don't think it is, as level 4 is supposed to be even further out there than level 3, and in fact is supposed to possibly generate/include new metalevels.

---

We only allow 12 bit operands in instructions. We are a '16-bit' system; however it is possible for the actual memory to be larger than 2^16, and for the (opaque) pointers to be implemented as larger than 16 bits. However, the 12/16 difference suggests that on at least some naive or small implementations, there will be 4 bits 'left over' in some cases. We could use these as type tags or 'fat pointer' bounds checks [1].

How would we use these 4 bits? How does Haskell GHC use pointer tagging? Who else does that?
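For reference, GHC's dynamic pointer tagging stores a small constructor index (or function arity) in the low 2-3 bits of heap pointers, which word alignment leaves free. Here's a toy sketch of the 4-spare-bits idea above: a 16-bit cell holding a 12-bit address with a 4-bit type tag in the high bits. The tag values (Parrot-ish 'INSP') and high-bit placement are my own illustrative assumptions, not part of any proposal here:

```python
# Hypothetical: 4-bit type tag in the high bits of a 16-bit cell
# whose low 12 bits are an address/operand.
TAG_INT, TAG_NUM, TAG_STR, TAG_OTHER = 0, 1, 2, 3  # illustrative values

def tag_ptr(addr: int, tag: int) -> int:
    assert 0 <= addr < 1 << 12 and 0 <= tag < 1 << 4
    return (tag << 12) | addr

def untag_ptr(word: int):
    return word >> 12, word & 0xFFF  # (tag, addr)

word = tag_ptr(0x3AB, TAG_STR)
print(untag_ptr(word))  # (2, 939)
```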

---

" nabla9 5 hours ago

Memory allocation feature asked is not a language feature. It has to be baked into web assembly if you want to use portable byte masking with pointers.

It's very low level implementation level detail that enables fast execution of dynamically typed languages. Boxing has high cost and it consumes memory. Type tags embedded in pointers can be very fast. For that to work, you need objects that are aligned with power-of-two byte boundaries.

Adding this feature enables efficient execution strategy for scripting and dynamic languages.

reply "

---

i guess if we only had 4 types, Parrot's "INSP" is a good bet: integers, numbers (floats), strings, other. We could also replace 'floats' with either 'boxed' or 'pointer (reference)' or 'native pointer' or 'dyn'.

---

XPC primitive types:

-- [2]

---

some random ideas on what it might take to make an IL (such as WebAssembly?) readable:

http://wllang.com/ https://github.com/wllang/wll

---

"Usually, a method accepts either a string or a number or a function or an object, etc" [3]

---

created ootVm.txt; moved old proposal from ootAssemblyThoughts to below this; copied current proposal from top of ootAssemblyNotes7.txt to ootVm.txt.

todos:

---

perhaps we put Java-esque restrictions on the data stack for safety? We still need to permit call stack manipulation though, so the HLL can easily implement continuations.

A different strategy would be to let the language implementation do whatever it likes, and only check user code libraries for stack safety restrictions.

todo

---

OLD proposal from ootAssemblyThoughts.txt moved to here (new version copied to ootVm.txt):

motivations and goals

The idea here is that Oot might have 3 well-defined compilation stages:

Oot -> Oot Core -> this Assembly/Bytecode -> platform-specific code

"Oot" would be the high-level language. It would be built from Oot Core using Oot Core's metaprogramming facilities.

The rationale for Oot Core is:

The rationale for Oot Assembly is:

Oot Assembly will make porting Oot easier by:

Some properties we want Oot Assembly to have:

Some things that many traditional assembly languages do that we don't:

What do we mean by 'preserve intent', and why do we want that?

What we mean is "don't map a specific operation S to a special case of a general operation G if, by looking at the result in the form of G, you cannot discover that this is actually S without doing non-local analysis".

Some examples:

There are three reasons to 'preserve intent':

Efficiency: We want Oot Assembly to be reasonably efficient to interpret; however, efficiency should not be at the expense of preservation of intent or customizability. We expect that performance-minded platform-specific implementations might compile Oot Assembly into custom high-performance bytecode formats anyways.

Examples of the consequences of this choice:

current proposal

note: as of 201602 this has been superseded by the proposal in ootAssemblyNotes7.

Design goals

Ease of implementation

(Massively) parallelizable decoding

supports linear instruction stream; also optionally supports tree structure

Message constituents with a hard upper bound on their length in bits (in this case, 32 bits is the upper bound of werds, and the payload is max length of 24 bits)

supports at least 12-bit addressing (in fact, we support 24-bit addressing, although using some addressing modes on operands of more than 12 bits requires two instructions)

Preservation of HLL intent

Extensibility

note: efficiency is NOT a primary design goal; as with the Dalvik encoding of JVM bytecode, or the LuaJIT bytecode, we expect that performance-minded implementations may create their own variant encodings. Eg the primary purpose of the 64-bit frame alignment is to support parallelized decoding (although efficiency is a secondary design goal).

Note that in this design, the operation might be indirectly specified via a reference to memory, rather than being immediately specified in the bytecode. This should be useful for encoding of calling first-class functions assigned to local variables.

Syntax

Note that the syntax of Oot Bytecode is a very general syntax that could be used for multiple languages (by varying the operations available, the modalities, the constraints, and the semantics of memory cells). Indeed, within Oot, this syntax is used for at least two 'languages'; one is Oot Instruction Bytecode and the other is Oot Graph Bytecode.

Sentences and phrases: Oot Bytecode is divided into variable-length sentences. Sentences consist of one or more variable-length phrases. Phrases consist of one or more werds.

Each werd is either 16 bits or 32 bits. A 16-bit werd consists of an 8-bit header and an 8-bit payload. A 32-bit werd consists of an 8-bit header and a 24-bit payload.
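A minimal decoder sketch for this werd layout. The exact bit ordering of the header fields isn't pinned down in these notes; this assumes (most- to least-significant) 1 EOS/BOP bit, 3 role bits, and 4 addressing-mode bits, which sums to the stated 8-bit header:

```python
# Split a werd into its assumed header fields and payload.
def split_werd(werd: int, wide: bool) -> dict:
    payload_bits = 24 if wide else 8  # 32-bit werd vs 16-bit werd
    header = werd >> payload_bits     # 8-bit header in both cases
    return {
        "eos_or_bop": header >> 7,          # 1 bit (assumed position)
        "role":       (header >> 4) & 0b111,  # 3 bits: 8 roles
        "addr_mode":  header & 0b1111,        # 4 bits (assumed width)
        "payload":    werd & ((1 << payload_bits) - 1),
    }

# a 16-bit werd: header 0b1_010_0011, payload 0x7F
print(split_werd(0b1010_0011_0111_1111, wide=False))
```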

An aside on terminology: The natural language linguistic concept of 'word' is a good fit for what is here called a 'werd' because, like linguistic words, Oot Bytecode werds are composed of a 'root morpheme' (the payload) and other syntactical and modifier morphemes (the header). However, in computing, the term 'word' is already in use to refer to architecture-specific fixed-length 'words', so to avoid confusion, i changed the spelling slightly. If this annoys you, feel free to call them "words"; in fact, in most contexts, i use the spelling "word" myself, and i only use "werd" when i am particularly worried about confusion with architecture-defined words.

Werds are grouped into 64-bit frames. Phrases cannot span multiple frames unless the parentheses construct is used.

Each phrase has one of 8 'roles' within the sentence. Each werd has one of (the same 8) subroles within the phrase. It's possible for multiple phrases within a sentence, or for multiple werds within a phrase, to have the same role or subrole. The ordering of phrases or werds with the same role or subrole is significant. Otherwise, the ordering is insignificant (except that some roles have positional constraints, namely, the first werd of a phrase is always subrole 0, and phrase roles 6 and 7 can only appear as the first phrase of a 64-bit frame).

The bits in the 8-bit werd header are as follows:

EOS/BOP: if this is the first werd in the 64-bit frame, then it's an EOS. Otherwise it's a BOP. EOS means that this 64-bit frame is the last 64-bit frame in the current sentence. BOP means that this werd begins a new phrase.

role: if this is a BOP, then this is the role of this phrase within the sentence, and the role of this werd within the phrase is role 0. Otherwise, this is the subrole of the werd within the phrase. The 8 roles are:

todo: what about 'metadata'?

addressing modes:

the 'split' modes split the payload in half (so an 8-bit payload is split into 2 4-bit payloads, and a 24-bit payload into 2 12-bit payloads), and apply the specified addressing mode to each half, then combine the two in a 'GET' (or index-into) operation. For example, mode 5 retrieves the contents of the memory cell indicated by the first half of the payload (the direct mode), and then finds the index within that data structure indicated by the second half of the payload; for example, if the first half of the payload is '3' and the second half is '5', and memory location 3 contains an array, then the effective address indicated would be the 5th element within this array.

the semantics of memory cells, indirect addressing, and the GET operation are language-specific
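The split-mode example above can be sketched like this, with plain Python indexing standing in for the language-specific GET:

```python
# Split addressing mode: halve the payload, 'direct' on the first half,
# GET/index-into with the second half.
def split_mode_get(payload: int, payload_bits: int, memory):
    half = payload_bits // 2
    first, second = payload >> half, payload & ((1 << half) - 1)
    container = memory[first]  # direct mode on the first half
    return container[second]   # GET with the second half

# memory location 3 contains an array; payload halves are (3, 5)
memory = {3: ["a", "b", "c", "d", "e", "f"]}
print(split_mode_get((3 << 4) | 5, 8, memory))  # element 5: 'f'
```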

more details on some of the roles:

role 0 phrase details

role 1 phrase details

in this role, addressing modes yield the value found at the effective address

werds with subrole 2 within this phrase can be used to pass 'named arguments'; eg the name of the argument would be in subrole 2, and the value of the argument would be in subrole 1 (or implicit in subrole 0)

role 2 phrase details

in this role, addressing modes yield an effective address

werds with subrole 2 within this phrase can be used to pass 'named return arguments'; eg the name of the return argument would be in subrole 2, and the lhs expression of the return argument would be in subrole 1 (or implicit in subrole 0)

role 3 phrase details

todo: there should be a way to include a single 8-bit modality payload, but also a way to include arbitrary settings of named modality fields to values.

role 4 phrase details

role 5 phrase details

role 6 phrase details

Role 6 is generically defined as anything that can be translated in a context-independent manner (that is, the translator is only allowed to look at a contiguous group of 64-bit frames at a time, and must be stateless in between groupings) into one or more sentences.

When the payload is 8-bits, the 8-bit payload is modality (that is, it is interpreted as if there were a single-werd phrase with role modality, where the werd had role 0, and this was the payload of the werd), and the rest of the 64-bit frame is as follows:

When the payload is 24-bits, the first 8 bits select the number of 64-bit frames included in the packed segment, the next 8 bits specifies a language-specific packed representation, and the last 8 bits are language-defined. So far this is not used by Oot.

role 7 phrase details

Role 7 reinterprets the addressing mode field to indicate what type of grouping construct is present. All role 7 constructs apply at the granularity of entire 64-bit frames.

The payload of role 7 is always split; the first half of the payload is how many 64-bit frames are spanned by the construct, and the second half is the 'type' of the construct. Frame lengths 0 and 1 may have special meanings, todo. Note that grouping-ending constructs such as right parens have an identical payload to the matching left-parens; since this includes the number of frames spanned, this makes it efficient to jump from the right parens to the corresponding left parens (or vice versa). The most-significant-bit of the 'type' of the construct is reserved. For quasiquote, this means whether or not there is any antiquote within this quasiquote; the semantics of this bit for other modes is reserved for future use.

As an optional extension, a language may support arbitrary-length constructs. In this case, if the length of the payload is the maximum value (2^24 - 1), then this indicates that the construct is actually arbitrarily longer than 2^24 - 2, and at displacement 2^24 - 2 will be found another construct opening werd. Most languages don't support this arbitrary-length construct feature, in which case payload value 2^24 - 1 is illegal.
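A sketch of parsing a role 7 payload per the description above: a 12-bit frame count in the first half (assumed here to be the more significant half), a 12-bit construct type in the second, with the type's most significant bit reserved; the 2^24 - 1 payload is rejected, as in languages without the arbitrary-length extension:

```python
# Parse a 24-bit role 7 payload into (frame count, construct type, reserved bit).
def parse_role7(payload: int) -> dict:
    if payload == (1 << 24) - 1:
        raise ValueError("arbitrary-length constructs not supported")
    frames = payload >> 12          # how many 64-bit frames are spanned
    ctype = payload & 0xFFF         # 'type' of the construct
    return {"frames": frames, "type": ctype & 0x7FF, "reserved": ctype >> 11}

print(parse_role7((10 << 12) | 0x801))  # 10 frames, type 1, reserved bit set
```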

some of these constructs are used to embed things from a different 'language' within bytecode of a default language. todo figure out which ones?

todo: can parens only enclose single phrases, or can they enclose whole sentences?

todo: What about EOS bits within these constructs?

todo: is a region annotation of length 0 a point annotation?

todo: can 'sentence length' also specify a 'foreign language' sentence?

note: the difference between 'foreign languages' in role 7 vs language-specific packed representations is that a role 7 'foreign language' uses the same format/syntax as described here, but varies the semantics (eg list of operations, list of modalities, list of constraints, semantics of memory cells, semantics of indirect addressing mode and GET), whereas role 6 is an extension mechanism to contain segments of arbitrary format/syntax.

languages

a language using this syntax must define:

and must either define the semantics of or forbid the use of:

and may define:

numerology

There are 2^24 accessible memory cells (24-bit addressing). The size of the constant table must be less than 2^24.

Because an 8-bit payload can be split into two 4-bit payloads in split addressing mode, memory cells 0-15 can be accessed somewhat more easily than others. Similarly, constant table entries 0-15 can be accessed more easily than others.

Similarly, the largest memory cell location that can be accessed using split mode addressing is 2^12 - 1 (using a 24-bit payload split into 2 12-bit payloads); and likewise constant table entries up to 2^12 - 1 can be accessed more easily than others.

profiles

There are many aspects of this format which aid extensibility but which may make implementation more difficult. Therefore, although they are described above, the 'default profile' turns them off.

todo: let's have short construct length limits by default so that a default-profile interpreter doesn't have to reserve much memory

segment format

todo: need to specify constant table format, format for file containing both constant table and bytecode, etc? or is this implementation-dependent? (i'm leaning towards implementation-dependent, although in that case we should still specify it for the Oot language)

natural languages

some inspiration was taken from the syntax of natural languages (i took a course once in head-driven phrase structure grammar, so i wouldn't be surprised if this turned out to be particularly close to that). In addition, this syntax is probably sufficient for encoding most natural language sentences (assuming you are willing to do things like sometimes use multiple Oot Bytecode sentences to represent one natural language sentence).

Here's how you might encode some natural language constructs:

semantics: Oot Instruction Bytecode

indirect: todo; this probably will have something to do with aliasing or symlinking but it's not certain (see [[ootAssemblyNotes5?]]). In any case, the 'direct' mode is the 'fast' one.

In direct and indirect mode, the werds reference memory cells. A memory cell is considered to hold an entire arbitrarily-sized data structure; that is to say, a memory cell number is semantically more like a local variable than it is like an actual location in memory.

memory cell 0 is the PC (i think 1-3 should be special too, relating to stack(s), todo)

only supports up to 2^12-4 local variables

does not support arbitrary-length role 7 constructs

does not support modules whose AST has more than approximately 2^24 nodes

does not support modules which have more than 2^24 module-level werds (i'm flirting with a 2^12 limit here, although i think that's too low for many existing libraries; but note that if you import/link to external libraries a,b,c, as long as the references to those libraries are hierarchical (a.f, not just f), then 4096 is probably fine)

does not support composite literals with more than about 2^12 AST nodes (eg an array of strings is a composite literal; a string is an atomic literal)

todo


more 'special' registers from other architectures:

"

    Accumulator register (AX). Used in arithmetic operations.
    Counter register (CX). Used in shift/rotate instructions and loops.
    Data register (DX). Used in arithmetic operations and I/O operations.
    Base register (BX). Used as a pointer to data (located in segment register DS, when in segmented mode).
    Stack Pointer register (SP). Pointer to the top of the stack.
    Stack Base Pointer register (BP). Used to point to the base of the stack.
    Source Index register (SI). Used as a pointer to a source in stream operations.
    Destination Index register (DI). Used as a pointer to a destination in stream operations.

" -- https://en.wikibooks.org/wiki/X86_Assembly/X86_Architecture#General-Purpose_Registers_.28GPR.29_-_16-bit_naming_conventions

---

could have an A (accumulator) register if we really want to have 3-input instructions (all operands are inputs, and the output goes into register 'A')

and while we're there, may as well consider adding an X register, too (memory base or offset or counter)

examples of useful 3-input instructions:

if we had an X register, then with 3 addr mode bits we could have:

so if we did that, then we'd have one more addr mode/flag bit left per operand, plus 2 instruction-wide ones left

---

is this relevant?: http://clang.llvm.org/docs/SafeStack.html

---

i think i'm nearing the end of this stage of my oot assembly/bytecode/VM journey. I feel like i've refreshed my knowledge of assembly language, explored the major different types and instruction 'basis sets' of VMs and ISAs, got a little bit of intuition for (at least a few of the most initially obvious parts of) ISA and VM design, started to get a picture of what the implementation consequences of various HLL language choices are, and of how HLL constructs map to these things, identified various salient features and issues that might inspire my HLL design, and sketched a VM design for Oot. I think it may be time to put the VM design aside and go back to treating it as an 'implementation detail', and translate/'upproject' some of my VM design into parts of the design for Oot Core. There are a few things that make me think this:

some of the salient things, and correspondences, i've learned about on my oot assembly/bytecode/VM "journey":

(already copied this list to a todo in assembly language principles)

some stuff i still have to learn:

---

---

to reiterate/summarize some of the functions of an HLL:

---

should i look again at adding more Kant-ish stuff inspired by his Tables of Categories and Judgements?

---

of course the simpler even-more-16-bit alternative to the MEDIUM format in ootAssemblyNotes7 would be:

4x2=8 addr mode bits + 8 opcode operand bits + 3x16 operand bits

but there's no encoding bits there. If we ditch addr mode bits for the opcode (ie ditch custom instructions), we can get encoding bits instead:

2 encoding bits + 3x2=6 addr mode bits + 8 opcode operand bits + 3x16 operand bits

A slightly more complex alternative would be:

2 encoding bits + 4x2=8 addr mode bits + 6 opcode operand bits + 3x16 operand bits

this only permits 64 built-in opcodes, but that would be enough if we are just trying to have a core which is the intersection of popular assembly commands (like webassembly, but even less), rather than all the sort of rich semantic ops we have been thinking of. But also it permits only 64 values to be used in the register direct and register indirect addressing modes for custom instructions.

If we wanted more opcodes we could reduce the operand sizes:

2 encoding bits + 4x2=8 addr mode bits + 12 opcode operand bits + 3x14 operand bits

but if we are going to have the operand bitwidth a non-power-of-two, we should also consider reducing the operand size further, to 12 bits, and putting the extra 6 bits to use to allow addressing modes, at which point we're almost back to the proposal in ootAssemblyNotes7; this would also have the virtue of making the operand bitwidth the same for the opcode operand and the other operands, which might make register direct and register indirect addressing for custom opcodes more uniform:

2 encoding bits + 1x2 addr mode bits + 4x3=12 addr mode bits + 12 opcode operand bits + 3x12 operand bits

to get back to the proposal in ootAssemblyNotes7, we'd reduce the opcode operand bits by 4 and use this to (a) give the opcode operand 4 addr mode bits instead of 2, like the other operands, (b) create 2 'custom flag bits':

2 encoding bits + 2 custom flag bits + 4x4=16 addr mode bits + 8 opcode operand bits + 3x12 operand bits
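A quick sanity check (my own, not in the notes) that each candidate layout above fills exactly 64 bits:

```python
# Field widths for the candidate MEDIUM instruction layouts; each must sum to 64.
layouts = {
    "no encoding bits":  [("addr mode", 4 * 2), ("opcode operand", 8), ("operands", 3 * 16)],
    "no custom opcodes": [("encoding", 2), ("addr mode", 3 * 2), ("opcode operand", 8), ("operands", 3 * 16)],
    "6-bit opcode":      [("encoding", 2), ("addr mode", 4 * 2), ("opcode operand", 6), ("operands", 3 * 16)],
    "14-bit operands":   [("encoding", 2), ("addr mode", 4 * 2), ("opcode operand", 12), ("operands", 3 * 14)],
    "uniform 12-bit":    [("encoding", 2), ("addr mode", 2 + 4 * 3), ("opcode operand", 12), ("operands", 3 * 12)],
    "ootAssemblyNotes7": [("encoding", 2), ("custom flags", 2), ("addr mode", 4 * 4), ("opcode operand", 8), ("operands", 3 * 12)],
}

for name, fields in layouts.items():
    assert sum(bits for _, bits in fields) == 64, name
print("all layouts fill 64 bits")
```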

---

note: out of those, a less uniform but rather simple one is:

2 encoding bits + 4x2=8 addr mode bits + 12 opcode operand bits + 3x14 operand bits

this is simpler than the current proposal in ootAssemblyNotes7 b/c there are no 'custom flag bits' nor 'custom addr modes'; yet it still allows for 4k custom instructions, as well as for 16k immediate constants.

16k (14 bitwidth) may be significantly better than 4k (12 bitwidth): with 4k one can say that a compiler would 'almost never' need to exceed it in its various limits (# of fns in a module, # of local vars, etc) in normal use, but one would still be a little worried (esp. for total # of fns accessible within a module), whereas with 16k i think you can comfortably say you will never need to exceed it. Note that the current limit for many existing compilers is 64k, which is pretty close (only 4x); see proj-plbook-plChInternalRepresentationsImpl, section 'How many bits for various things?'; also note that on my machine currently, 'nm -D /lib/x86_64-linux-gnu/libc.so.6 | wc' says 2242 lines, which is too close for comfort to 4k but reasonably distant from 16k.

(recall: Lua has 6 opcode bits and 8, 9, or 18 operand bits [4])

Note also that if you have a 64k memory, but only 16k constants, then you have a 16k constant table, so you can actually have jumps to locations higher than 16k, it's just that you can only have 16k such distinct jumps within one module.

the non-uniformity is b/c the first 4k of memory is special, in that that's the only place that register addressing can look to dynamically execute 'custom instructions' (eg this is like first-class functions for assembly).

Compared with one of the simpler 16-bit operand proposals, the only difference here is that we have 14-bit operands instead of 16-bit, and a 12-bit opcode operand instead of 6 bits -- so we avoid having the crunch of only 64 opcodes (which may be more of a curse than a blessing if we don't want to make each port implement all those opcodes; of course, we could have a small 'core' and have the rest implemented in a library and then macro'd to that core, eg compiler intrinsics eg a stdlib), and we get 16k instead of 4k limits.

A downside is that 14 bits is a less 'round' number than 12 bits. It wouldn't be crazy to do 15 bit operands and 9 bit opcodes, either, but that's odd numbers of bits so i'd avoid it.

I gotta say, the simplicity (implementation and static analysis simplicity) of ditching the custom addr modes and custom flags appeals to me, even though it'll make the assembly code more verbose.

With 4k opcodes, we could easily say that we have 2k standard opcodes and a 2k space of custom ones.

Otoh, from the perspective of implementation and static analysis simplicity, if we have (static/immediate constant) custom opcodes (as opposed to the dynamic, register-addressed ones), then static analysis isn't much more complex if we add custom addr modes and custom flags; because in all these cases the static analysis just has to do the macro replacement before the analysis.

And even 2k opcodes just seems like way too many. (a) both MoarVM and x86 have a count of ~1k, (b) a dispatch table with 1k items and 4 bytes per item (say, on a 64-bit x86 machine) yields a 4k dispatch table, which is the largest 'small' object size imo b/c page sizes and caches are sometimes 4k.

But with this sort of instruction encoding, if you took away 4 opcode bits to get 8-bit opcodes (and only 256 words of special 'first class function' memory), then what would you do with the extra 4 bits? It's not enough to bring the other operands up to 16 bits. And if we add custom flags or custom opcodes then we're not making things simpler anyways.

Could we decrease the addr mode bits from 4 to 3 and increase the operand size from 12 to 14? No, b/c the addr mode decrease only frees 4 bits, but we need 6, so we'd have to cut the opcode bits from 8 to 6, which is not worth it.

Also, note that if we're worried about making static analysis easy, then SHORT mode is fine b/c it can be macro compiled into MEDIUM, but LONG mode cannot.

---

i guess i'm cool with custom per-instruction flags and custom addr modes and custom instructions as long as their semantics can be resolved via compile time macros (eg a precompile step can compile them out, so the actual interpreter doesn't have to worry about them). They could be used not just for readability but also for potential platform primitives (eg if there is no 'matrix multiply' in the core language, there could be a 'matrix multiply' custom instruction, which some platforms would implement via a platform primitive, and others would resolve via the macro; eg atomic instruction flags; etc).

hmm... but if there is self modifying code, then the runtime would have to deal with macro expansion too.

i guess that's okay. The point is that the macro expansion is simple enough to do.

---

i keep thinking that 12-bit addressing (4k values) may not be enough. Even my old TI-85 calculator had 32k (16k 16-bit words), and the currently popular TI-84+ has 128k (64k 16-bit words, or 16k 64-bit words).

and we've seen from the history of computing that there is very strong pressure to have a decent pointer bitwidth -- almost nothing can overcome the network effects of processor ISAs, but the need to move from 8->16->32->64-bit width seems like it did. And here i am below 16-bits! I know this is only the instruction encoding, not the actual bitwidth, but still.

the argument against is that we already have the LONG encoding, so who cares if MEDIUM can only reach 4k?

so i think i'm currently in favor of:

2 encoding bits + 4x2=8 addr mode bits + 12 opcode operand bits + 3x14 operand bits

so, yes custom instructions (but only 12-bit addressing for them; nonuniform), no custom flag bits, no custom addr modes.

Since I am on the fence, the argument that this is simpler by excluding custom flags and custom addr modes should probably push me towards this one.

I made a note in ootAssemblyNotes7 that i am considering changing to this.
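A hypothetical pack/unpack for the favored layout (2 encoding bits, 4x2 addr-mode bits, 12-bit opcode operand, 3x14-bit operands); the field ordering within the 64-bit frame is my own assumption, not specified here:

```python
# 2 + 4*2 + 12 + 3*14 = 64 bits, packed MSB-first in the assumed order.
def pack(enc, modes, opcode, operands):
    assert enc < 4 and opcode < 1 << 12
    assert len(modes) == 4 and all(m < 4 for m in modes)
    assert len(operands) == 3 and all(o < 1 << 14 for o in operands)
    word = enc
    for m in modes:
        word = (word << 2) | m
    word = (word << 12) | opcode
    for o in operands:
        word = (word << 14) | o
    return word

def unpack(word):
    operands = [(word >> s) & 0x3FFF for s in (28, 14, 0)]
    opcode = (word >> 42) & 0xFFF
    modes = [(word >> s) & 0b11 for s in (60, 58, 56, 54)]
    return word >> 62, modes, opcode, operands

w = pack(1, [0, 1, 2, 3], 0x0AB, [100, 200, 300])
print(unpack(w))  # (1, [0, 1, 2, 3], 171, [100, 200, 300])
```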

---

ok still happy with that but one other related problem:

i don't really like having the first 16 locations of memory be magic. This: (a) necessitates an extra comparison when executing simple instructions such as ADD(0x20, #1, #2) (b) makes it slightly more annoying for analysis software to scan for eg PUSHs and POPs; they might want to use an IR with an extra 'stack' addr mode for that purpose, which we don't want to have to do

otoh the reason we had magic memory locations is that the other alternative is to introduce an extra addr mode bit, which reduces our (directly) addressable memory by a factor of 2 (in this case, 8k memory locations) to gain just 16 memory locations that we are going to use.

So semantically, it would be clearer to have another addr mode bit and have stack addressing modes. But in terms of instruction encoding length, this would be wasteful.

(a) and (b) probably aren't actually that important. for (a), ordinary computation isn't that costly, what is costly is hitting memory or branching. for (b), it's not like analysis software won't have to construct auxiliary data structures no matter what we do, and it's not like we're introducing anything dynamic or conceptually different; to see if we're accessing the stack, just check if the addressing mode is direct or indirect and the address in the operand is one of the magic stack addresses.

another thing to worry about is the annoyance of potential stack overflow. Like divide-by-zero, this introduces a pervasive potential error. If stack ops were confined to PUSH and POP instructions (and similar) then it would be easier to identify instructions that have this failure mode; if we have stack addressing in any form though, then any instruction could have this failure mode depending on its addressing mode and operand.

now, even if we did invent new addressing modes for this stuff, let's say we added a third addr mode bit and got 4 new addressing modes, that wouldn't be enough: if we added two stack addr modes (for PUSH and POP, in direct and indirect addressing modes) and two stack-as-normal-memory addr modes (for reading and writing the interior of the stack, in direct and indirect addressing modes), that's already 4 new addressing modes; but we also want a second stack, and possibly other special locations like PC, ERR, MODE. Of course, we could say that some of these things would be other bits in the operand; eg in stack addressing mode, we steal one operand bit to specify the stack, another operand bit to specify whether we are pushing/popping or just doing direct memory access on the stack, etc; but this is sorta equivalent to making the addressing mode field into a variable-length prefix-encoded field and also incurs the same annoyances/inefficiencies as just having special memory locations.

Even if we just made two new addressing modes ('special direct' and 'special indirect'), then (a) isn't solved that much b/c popping and pushing to the stack, a common operation, would still require an extra comparison within the special addressing modes.

so i guess i've argued myself into keeping the special memory locations.

---

incidentally i'd like to consider cleaning house a little on those magic memory locations. Recall that in the SHORT format, the first and last instructions have a 5-bit operand, so they can access any of the magic memory locations as well as the first 16 real memory locations. So our choice of magic memory location is important b/c this is what SHORT instructions can access. If 16 memory locs is not enough, we might want to add more GPRs to the magic locs (we already have 4 GPRs in there).

eg i dont think we really need:

10: current size of stack 0 11: current size of stack 1

in there.

and it might be nice to add direct read/write to the second item in the stack.

and how about putting all the stack ops near 0 to make it less annoying for code analysis to see what's touching the stack?

we might want to make the JMP opcode the first opcode (1 or 2; 0 should be comment or annotation) for similar reasons, but since custom instructions can branch or jump this won't help much.

so:

0: stack0 push/pop
1: direct read/write access to the top of stack0 (without pushing or popping)
2: direct read/write access to the second element of stack0 (without pushing or popping)
3: stack1 push/pop
4: direct read/write access to the top of stack1 (without pushing or popping)
5: direct read/write access to the second element of stack1 (without pushing or popping)
6: PC (program counter) (do we forbid writing to this? i think so; we want to make programs use control flow ops like JMP for easier analysis)
8: FLAGS
9: ERR
10: ARG0 (call3 operand0/output)
11: ARG1 (call3 operand1/usually input)
12: ARG2 (call3 operand2/usually input)
13-16: GPRs
0x0010-0x2000: general purpose registers (GPRs) (accessible in the opcode field)
0x2000-0x4000: general purpose registers (GPRs) (not accessible in the opcode field)

---

great slides on JVM, including author's opinion on what the JVM should do and what should be done above it in libraries:

http://www.azulsystems.com/blog/wp-content/uploads/2011/03/2011_WhatDoesJVMDo.pdf

---

y'know, looking at SELF has me convinced of the importance of the indirect-indexed addressing mode; in a late bound dynamic OOP language, you are often going to a memory location, which itself contains the address of an object instance of a known type, and then adding a constant offset to that address to get the actual address of some field within that object, and that address is the effective address. This is one statement in the HLL; eg "instance.x = 3", where 'instance' is a local object.
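A minimal sketch of this addressing mode (pointer-at-a-location plus constant offset) for "instance.x = 3"; the memory layout, the location of 'instance', and the field offset are all assumed for illustration:

```python
mem = [0] * 64

# suppose local variable 'instance' lives at location 5 and holds the
# address of the object; field 'x' is at constant offset 2 within it
INSTANCE_LOC = 5
X_OFFSET = 2

mem[INSTANCE_LOC] = 20          # the object starts at address 20

def ea_indirect_indexed(operand_loc, offset):
    # follow the pointer stored at operand_loc, then add a constant offset;
    # the result is the effective address
    return mem[operand_loc] + offset

ea = ea_indirect_indexed(INSTANCE_LOC, X_OFFSET)
mem[ea] = 3                      # instance.x = 3, in one addressing step
print(mem[22])                   # -> 3
```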

so we want 3 bits of addr mode (8 addr modes), not 2. What should the other 3 addr modes be?

indexed-indirect is an obvious choice, by symmetry, but apparently it is much more rare than indirect-indexed. Also, what about good ol' split mode, which is just taking the sum of the two split arguments to get the effective address?

should we also go back to having 2 custom address modes here?

---

the previous suggests a MEDIUM encoding format of:

2 format bits + 3 addr mode bits + 11 opcode operand bits + 3 x (3 addr mode bits + 13 operand bits)

note that it's kind of ugly to have an odd number of bits in all the operands, in addition to having a different number of bits in the opcode operand and the others. Not sure if we should care; eg Lua 5 VM has some 9 bit operands ([5] page 3).

also, we are back to only 2k for the opcode-addressable space (and 8k for the other guys).

---

if we have custom addr modes, but we want them to be module-static, but we want to be able to dynamically write code, then what happens when one module writes a function and then passes it to a second module, which runs it? does the code somehow magically 'remember' the custom addr mode context in which it was written? that sounds like a lot of runtime complexity. Or do custom addr modes function according to the module in which they are executed? In which case there might be some surprising bugs when code is run from a different module; also, again, this might add some runtime complexity (as you have to track 'current module' context). Of course, we could make it dynamic, but i don't like that either.

we could also just make custom addr modes a compile-time only thing; actually executing them at runtime would be illegal, as they are reserved addressing modes for use in compiler macros. But in that case they don't actually help with code density for the purpose of icache, so why standardize them?

---

so if 8 addr modes and no custom, then we have 3 addr modes left. Ideas:

PC-relative or stack addressing seem like the obvious choices to me...

---

so how about the other 4 addr modes are:

?

(PC-relative can be done via 'split' with the register 0 as one of the splits, assuming 0 is memory-mapped PC register)

(note: but by the same argument, stack direct access would be unnecessary b/c you can do 'split' with the stack pointer as one register... unless it's a hardware stack that's not mmapped.) wait, no, you can't: you have to get the VALUE at the stack register and add the other argument to that -- so it's actually indexed indirect

(are the arguments to 'split' signed or unsigned? how about the first arg is unsigned, the second is signed?)

(another mode we are missing (in addition to indexed-indirect); split where both are indirect, ie follow the first pointer, follow the second pointer, add the results, that's your effective address)
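The unsigned-first/signed-second convention suggested above can be sketched as follows; the field widths (8/3) are an assumption for illustration:

```python
def sign_extend(value, bits):
    # interpret a `bits`-wide field as two's-complement
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)

def split_effective_address(a, b, b_bits=3):
    # first split arg unsigned, second signed: EA = a + sign_extend(b)
    return a + sign_extend(b, b_bits)

print(split_effective_address(100, 0b111))  # b = -1 -> 99
print(split_effective_address(100, 0b011))  # b = +3 -> 103
```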

---

hmm, actually if we had an indirect pre/post- increment mode, and a memory-mapped stack pointer address, then we effectively have PUSH/POP anyways.

so how about:

note: if indirect pre/post-de/increment is to be used for push/pop, this implies a commitment on whether stacks grow up or down

apparently stacks usually grow down, so to POP we read and then increment the stack pointer, and to PUSH we decrement the stack pointer and then write, so it would be:
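A sketch of those push/pop semantics on a downward-growing stack (the memory size and initial stack pointer are assumed):

```python
mem = [0] * 16
sp = 16          # stack pointer starts just past the top of memory

def push(val):
    global sp
    sp -= 1        # pre-decrement...
    mem[sp] = val  # ...then write

def pop():
    global sp
    val = mem[sp]  # read...
    sp += 1        # ...then post-increment
    return val

push(1)
push(2)
print(pop())  # -> 2
print(pop())  # -> 1
```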

---

array indexing appears to be indirect-indirect:

x[y] appears to mean 'fetch the value stored in x, fetch the value stored in y, add them together, that's your effective address'

indirect-indexed-indirect could also be useful for PC-relative addressing; if the PC is mmapped to address 0, then the first argument could be 0 and the second the offset; this would mean fetch the PC, add the offset to the PC, fetch the value stored at that address, that's your effective address.

is there any HLL use like this for indexed-indirect? i can't think of any yet.. i guess b/c this is just 'split' and then 'indirect'

so how about:

maybe indirect pre/post-de/increment-indirect?

note: i guess if memory locs are local vars then raw 'split' without any indirection makes less sense; dynamic arrays and other variable-length stuff will always be stored indirectly, never in a chunk of directly-addressable memory. i guess a static-sized array could be stored in directly addressable memory, though.

---

so when there is a CALL, do the directly addressable 'memory locations'/'registers' persist? Or do we abstract over ABI calling conventions (eg callee saves vs caller saves, passing params on the stack vs registers, whether local variables are in the stack or in registers or in some mix, etc), and just say that everything directly addressable is fresh (local variables; like registers with caller saves), and everything indirectly accessible persists (the heap)?

i guess i prefer the latter. But then how do we pass data between functions when there are more than two params to be passed? Do we have to always allocate it on the heap? Can't we pass stuff on the stack?

i guess what we want is a way for the 'assembly' code to declare stuff about the data it will be creating, and then let the compiler/interpreter/runtime decide whether this actually goes on the stack or on the heap

but we want to be able to do closures, which means sometimes putting activation records on the heap rather than the stack (saguaro stack). We also want first-class activation records.

---

we want to have explicit for loops, while loops, etc in this language. But how to do that without real expressions (oot assembly is linear)? Mb the way that Lua bytecode does it?

---

toread https://www.gnu.org/software/guile/docs/master/guile.html/Object-File-Format.html#Object-File-Format

---

yknow the idea that the bitwidth of the opcode operand is 2 bits less than the other operands is bumming me out. First, it means that there are addresses that are reachable by ordinary instructions but can't be used to store functions in (or rather, we'll have to CPY or MOV the functions before executing them). Second, it's less beautiful. Third, some compilers and other code analysis tools may decide to have two lists of addresses, one for short addresses that can be used for functions and one for slightly longer ones, which will make their code harder to read.

the operand size in the opcode operand is 11 b/c we have 2 form bits and 3 addr mode bits, and 16 bits total. The operand size in the other operands is 2 bits bigger because we don't need the form bits.

now, we could just say that instructions are 64 bits, but they don't divide cleanly into 4x 16 bit parts. That's probably the best practical solution; even Lua divides its 32-bit instructions into fields at non-power-of-two locations ([6] page 3). Breaking the 64-bit instruction into parts is a little annoying to read but it only has to be done in a few places in the code, and it won't actually hurt performance that much because just doing arithmetic is fast (accessing memory, and jumping, is what's slow).

but first let's explore slightly more what we could do with the 16-bit constraint.

The cleanest thing would be to just have a 1-bit addr mode for the instruction operand (i guess we'd pick constant and indirect?), and a 3-bit addr mode for the other operands. But we really want the indirect indexed addr mode for vtable dispatch in OOP instances.

If we insist on 3-bit addr modes throughout (but see below), then we have two bits per operand to spare.

This is kind of annoying, because i can think of far more things we'd like to do with another per-instruction bit than with another per-operand bit (pushing us to consider a 64-bit fixed-length instruction that is not broken into 4 16-bit parts).

An obvious use for one of the per-operand bits is to indicate whether the argument is boxed or unboxed.

We still have one more operand bit. I can't decide quite what it should be. Two candidates are:

'maybe' will probably be the most commonly used here, although it's not as semantically meta. But 'await' also sounds useful. Otoh i kind of like 'atomic' because unlike the others, it's hard to represent as separate instructions at all.

so maybe the best choice for 2 bits would be:

note that the split-mode addressing also creates a difference between how usable low memory is vs the rest.

---

we could have a 2-bit addr mode for the instruction operand, maybe:

note that it may make sense to get rid of 'constant' indexing. But we want 'constant' for 'standard library' and compile-time import sort of things. But otoh this brings up the same problem that we thought we had with custom addr modes; when code is dynamically created by one module and executed by another, whose instruction constant table is used?

also here we omitted 'split mode' and 'stack mode', but i bet executing code out of an array or off the stack could have some use cases.

then if we added one 'boxed/unboxed' bit for operands, we'd be good.

---

for the 'split mode' addressing modes, if there are 11 bits total, let's have the split be 8/3. That way even in split mode we can address 256 addresses initially. This provides a situation where there are essentially 256 'registers', which is nice; we could have a limit of ~256 local variables. Is this workable? Probably; this is similar to Lua's limit of about 200 local variables and 60 upvalues.

for the 4th new addr mode, we could have another indirect-indexed, except that we use the 3 bit part for the initial indirection, and the 8 bit part for the final index. This allows us to have a bigger index when we do indirection on one of the 'special' mmapped low registers, eg on the PC or on the stack pointer.

It also suggests that we limit the special registers to 8.
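Decoding an 11-bit operand with this 8/3 split might look like the following sketch (the field order -- low 3 bits for the special register, high 8 bits for the index -- is an assumption):

```python
def decode_split_8_3(operand):
    assert 0 <= operand < (1 << 11)
    reg3 = operand & 0b111        # 3-bit part: one of 8 special registers
    idx8 = operand >> 3           # 8-bit part: index 0..255
    return reg3, idx8

# e.g. indirect on special register 2 (say, a stack pointer) plus index 40:
print(decode_split_8_3((40 << 3) | 2))  # -> (2, 40)
```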

---

earlier, i was thinking about the 'types' of instructions as to whether the first, second, and third operands are input or output or both. This would affect whether pre-decrement or post-increment would be used when that addr mode was used, and maybe some other stuff like that; and it could be used for code analysis.

a simpler thing to do would be, with regards to whether pre-decrement or post-increment would be used and stuff like that, just set the I vs O types of all instructions to OII, that is, the first operand is always treated as 'output' and the other two are always treated as 'input'. This doesn't mean that they can't actually do different things than that (so program analysis tools still have to do a lookup by instruction), but it makes instruction decoding simpler.

---

the SEAM meta-VM uses a graph store as a generic memory. The graph store also has 'references' (aliases), which are used for completed futures (a special case of their "transients") ("Furthermore, a reserved label, internal to the store, is used to represent references. A replaced transient updates its label in place to become a reference, and stores an edge to the node the transient was replaced with. "), so that is very interesting for us.

---

how do we represent a lambda eg:

lambda x,y: x + y

in that code there will be an ADD instruction for sure, but should we have a flag on it to indicate that it is just 'quoted code' and not code that is about to be executed? this may help parallel execution interpreters. But i think no: instruction decoding can happen in parallel (so we have fixed-length instructions etc), and speculative execution, but there is no guarantee that the speculative execution engine can know beforehand which code will actually be executed.

This suggests that maybe we SHOULD allow 'blocks' and/or 'expressions' eg BEGIN and END delimiters surrounding eg an FOR loop or other block.

And if we have that, then why not prefix instructions?

But then should we just have variable-length instructions with an instruction word and then following words specifying the operands?

maybe.. but let's hold onto the fixed-length instruction idea a little longer.

---

wait, i've been misunderstanding the meaning of 'indirect indexed'. https://www.c64-wiki.com/index.php/Indirect-indexed_addressing describes it as more like what i have been calling 'indirect-indirect'. In both, after you take the contents of the first effective address and dereference it (the 'indirect'), you add something to that; in C64's 'indirect-indexed', the thing you add is itself the value contained at some location (like my indirect-indirect), but in what i have been calling 'indirect-indexed', i imagined that the second thing added was an immediate constant.

so mb i should call that 'indirect-immediate'.
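A sketch distinguishing the two modes under these names (the memory contents are assumed for illustration):

```python
mem = [0] * 64
mem[1] = 10       # pointer stored at location 1
mem[2] = 5        # index value stored at location 2

def ea_indirect_immediate(ptr_loc, k):
    # follow the pointer, then add an immediate constant
    # (what this note had been calling 'indirect-indexed')
    return mem[ptr_loc] + k

def ea_indirect_indirect(ptr_loc, idx_loc):
    # follow the pointer, then add the *value stored at* idx_loc
    # (C64-style 'indirect-indexed', with idx_loc playing the Y register)
    return mem[ptr_loc] + mem[idx_loc]

print(ea_indirect_immediate(1, 5))   # -> 15
print(ea_indirect_indirect(1, 2))    # -> 15
```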

---

note that if we have an indirect-indexed mode with an 8/3 split, then that makes the first 8 registers special. So maybe we should leave some of them as GPRs.

of course, we could just have a 6/5 split, in which case we would have 32 'most special' memory locs, alleviating register pressure somewhat.

but the whole point here is that these locations ARE the HLL local variables. 32 isn't enough. So this would just be an optimization, not something that can be used by default for every local variable.

---

so currently we're looking at:

64-bit instructions in 4x16-bit parts:

2 bits: form, 3 bits: addr mode, 11 bits: operand
3x: 1 bit: atomic, 1 bit: boxed, 3 bits: addr mode, 11 bits: operand

with addr modes:

however, since we have 'constant table' custom instructions, should we also just have custom addr modes?

also, consider changing the split to 6/5, freeing up one custom addr mode.
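The 4x16-bit layout above can be decoded with plain shifts and masks; here's a sketch (the bit order within each 16-bit part -- flags in the high bits, operand in the low 11 -- is an assumption):

```python
def decode(instr):
    # split the 64-bit instruction into 4 16-bit parts, high part first
    parts = [(instr >> (48 - 16 * i)) & 0xFFFF for i in range(4)]
    p0 = parts[0]
    decoded = {
        "form":     p0 >> 14,            # 2 bits
        "op_mode":  (p0 >> 11) & 0b111,  # 3 bits
        "opcode":   p0 & 0x7FF,          # 11 bits
        "operands": [],
    }
    for p in parts[1:]:
        decoded["operands"].append({
            "atomic": p >> 15,           # 1 bit
            "boxed":  (p >> 14) & 1,     # 1 bit
            "mode":   (p >> 11) & 0b111, # 3 bits
            "value":  p & 0x7FF,         # 11 bits
        })
    return decoded

d = decode(0x0000_8000_0000_0001)
print(d["form"], d["operands"][0]["atomic"], d["operands"][2]["value"])
```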

---

but by the same argument against the 6/5 split (that these things are supposed to directly represent the HLL language constructs, things that the compiler can use every time in its naive translation, so if they don't work for as many locals as are allowed in the HLL, then don't include them), an 8/3 split isn't very useful either (unless the only point of the 3 side is the special mmapped low variables, which is possible). Eg if the point of indirect-indexed is array access: if the array index is stored in a local above location 7 then this won't work, so a naive compiler can't rely on doing this every time and must have some backup way of compiling array accesses for when the array index is stored in a higher-numbered local address -- so why include it at all?

So we'd really only want an 8/8 split before we include this stuff. Which we definitely don't have room for, unless we go back to full 16-bit operands.

---

to have full 16-bit operands, and addr modes, we'd have to put all the addr mode bits in the first 16-bit part. So with 2 addr mode bits per, that's 4x2=8 + 2 form bits, so only 6 bits are left for the opcode. Which means only 64 locations can be used for first-class functions.

Which by the above argument rules out first-class functions. And that's not even with 3-bit addr modes.

(note that the argument may not be what we want; b/c a naive compiler can always use LARGE encoding if MEDIUM can't fit all the locals)

---

even if we were to settle for an 8-bit opcode operand with no first-class functions (no addr mode bits), we don't have enough addr mode bits to have both 16-bit operands and also 3 addr mode bits:

2 form bits + 6 (=2x3) addr mode bits + 8-bit opcode + 3x 16-bit operands

---

if we demanded 3 16-bit operands and 3 addr modes for each (so that we can have split addressing modes with 8 bits in each split), then we can only have a 5-bit opcode:

2 form bits + 9 (=3x3) addr mode bits + 5-bit opcode + 3x 16-bit operands

---

what would it cost to get 4 uniform 16-bit operands and 3 addr mode bits? at least:

2 form bits + 12 (=3x4) addr mode bits + 64 (=4x16) operand bits

78 bits total

this would let us have 256 locals, first-class functions, and use split addressing modes.

---

visual studio warns when there is more than 64 locals:

https://msdn.microsoft.com/en-us/library/ms182263.aspx?f=255&MSPPError=-2147217396

---

and if we retained first-class functions and had a limit of 64 HLL locals (and 64 structure elements) and didn't care about 16-bit boundaries:

2 form bits + 2 reserved bits + 12 (=3x4) addr mode bits + 12-bit opcode + 3x 12-bit operands

hmm, that's rather symmetrical...

---

if we only have 1 form bit instead of 2, we can make the above on 16-bit boundaries:

1 form bit + 3 addr mode bits + 12 opcode operand bits + 3x (1 reserved bit + 3 addr mode bits + 12 operand bits)

---

really though, as cool as it would be to have x[y] + q[z] be a single assembly statement (by using split addressing modes), i think that when you think of "x[y] + q[z]" you think of three operations:

  temp1 = x[y]
  temp2 = q[z]
  sum = temp1 + temp2

so you don't really need indirect indexed addr mode. Similarly for instance0.f(x); this is:

  temp1 = instance0.f
  temp1(x)

so we don't really need these 'split' addressing modes anyways.

so split addressing modes just aren't feasible (unless you want to limit the HLL to 64 locals), and don't really fit in our paradigm anyways.

stack addressing mode (without mmapping) would be cool, though, and it's not a split, but i'm not sure if it alone is worth adding a whole addressing mode bit. We can always have PUSH and POP instructions.

so, unless we want custom addr modes, we don't need 3 addr mode bits, only 2.

---

so mb:

2 form bits + 2 addr mode bits + 12 opcode operand bits + 3x (2 reserved bits + 2 addr mode bits + 12 operand bits)

---

but i like stack addressing mode without mmap...

if we had another bit, what would the other modes be? mb 2 custom, and what about the third? mb add a Y register (and/or X) and do real/traditional 'indirect-indexed'?

---

another thing to note:

the compiler might need some temporary variables from time to time. So if we have 4k memory locations, the HLL must allow fewer than 4k local vars. Eg Lua has some 8-bit operands but only allows 200 vars.

the simplest thing to do would be to set the limit at 1/2. eg if we have 12 bit operands, set the HLL limit of local vars at 2^11.

---

so i guess if we have a per-module constant table, we may as well also have per-module custom constant-table instructions and custom addressing and custom flags.

---

the semantics of SHORT encoding is not necessarily stack; stuff only interacts with the actual stack to the extent that the instruction takes more from the stack than it put there, or pushes more to the stack than it pops (at which point the actual change to stack state is only guaranteed at the end of the sequence). Otherwise the compiler is free to put intermediate results in registers or even in hidden locations that the program can't see. Primitive stack ops (such as DUP and SWAP) still have their proper effect, but if the program calls a subroutine that follows the stack pointer, the stuff may not actually be in the stack. This allows the compiler to optimize away the intermediate results. The point of SHORT is to allow you to write expressions, not to actually manipulate the stack in the middle of it.

---

so mb back to:

2 form bits + 2 addr mode bits + 12 opcode operand bits + 3x (2 custom addr mode bits + 2 addr mode bits + 12 operand bits)

6 predefined addr modes, 10 custom (another way to say that: 4 bits total, 3 predefined bits with 6 predefined modes + 2 custom modes + 1 custom bit)

(last one because it could be useful for scanning arrays also)

---

hmm i do miss stack addressing. i guess we need to bring back 'mmapped' stacks.

also, extending the discussion above with SHORT addressing, i guess mb IN GENERAL the 'stack' is not really a normal thing residing in memory. it should be illegal to store an address within the stack somewhere else. The stack is only good for PUSHing and POPing and PEEKing and POKEing, plus the stack primitive ops (DUP, SWAP, etc), and things composed out of those.

there can be other userlevel stacks (even a data stack and a return stack) but these are not the Oot Assembly stack.

This ensures that the Oot Assembly stack can be optimized away during compilation.
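A sketch of such a stack exposed only through its primitive ops (the class name and method set are assumptions; the point is that no address into the stack can escape, so an implementation is free to keep any of it in registers or optimize it away):

```python
class OotStack:
    def __init__(self):
        self._items = []          # representation is private by design

    def push(self, v): self._items.append(v)
    def pop(self): return self._items.pop()
    def peek(self, i=0): return self._items[-1 - i]    # read, no pop
    def poke(self, i, v): self._items[-1 - i] = v      # write in place
    def dup(self): self._items.append(self._items[-1])
    def swap(self):
        self._items[-1], self._items[-2] = self._items[-2], self._items[-1]

s = OotStack()
s.push(1); s.push(2)
s.swap()
print(s.pop(), s.pop())  # -> 1 2
```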

also, mb the same with the PC. Mb we should be a Harvard architecture?

after all, if eg Java could autogenerate new Java and immediately run it, what good would stack maps be? you'd lose safety guarantees (eg the guarantee that the runtime checked the stack map upon loading and verified that you won't ever POP an empty stack, etc).

---

so some lessons learned recently:

8-bit limits (256) are practically sufficient for things like numbers of local variables, as shown by Lua. However, i feel more comfortable if they are 12 bits. i would be even more comfortable if they were 16 bits, but...

since 4 fields of 16 bits is already 64 bits, assuming we want addr modes or form bits at all and that we demand that an instruction is 64 bits and that we demand 3 operands plus an opcode operand, we can't have all 4 operands be 16 bits.

the reason i want one or more built-in 'stacks' in a register machine is more for (a) expressions and (b) call stacks than for other reasons; i want the compiler to be able to compile away the stack(s) and so they can't be directly accessed in some low-level ways.

mmapped registers for PC and stack seem a little yuck

if we have first-class functions, then having a smaller opcode operand than the other operands is a little yuck, because it means the low memory addressable by this smaller opcode is the only 'uniform' GPRs (the higher memory can be addressed by the other operands but not the opcode operand, hence in this way it is less 'uniform')

split-mode addressing has the same problem; if it is to be relied upon by the HLL compiler, then each split of the mode must be able to address all the 'uniform' GPRs. But that probably means at least 8 bits for each split, which means the operand size must be at least 16 bits, which as we saw above is unworkable if we want the opcode operand to be the same size as the others.

some split-mode addressing that otherwise sounded useful is indirect-indexed (array access) and indirect-immediate (struct access).

i would really like a stack addressing mode, which is just indirect plus pre-decrement on write or post-increment on read, applied to the stack pointer.

various combos of pre/post decrement/increment could also be useful for scanning through an array.

polymorphic opcode instruction encoding seems more readable, but imposes needless efficiency loss

immediate and indirect addressing modes represent LOADK and LOAD instructions, respectively.

some assembly concepts and related HLL concepts:

---

if we have first-class functions, then what about partial function application?

---

mb have 3 different constant tables:

---

one nice thing about disallowing control flow in SHORT is that CFG analysis can skip the SHORTs. What if we only allow CALL? CALL always returns, right? But what about nonlocal returns, like throwing an exception? So if we have a special CALL that disallows nonlocal returns, then we're good (except that the code inside the CALL can still (a) crash (b) infinite loop).

---

one idea instead of an APPLY opcode is to use ordinary instructions for partial application but with a sentinel for 'leave this argument unapplied'. This sentinel could be to use 999 in constant addr mode, or something like that (000 in constant addr mode could be Nil for nullable types/maybe, although are we using a Maybe tower? If so, how would that work?).
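A sketch of the sentinel idea, using a Python object as a stand-in for the 'hole' value and ADD as the example instruction (all names assumed):

```python
HOLE = object()   # sentinel: 'leave this argument unapplied'

def add(x, y):
    return x + y

def exec_instr(op, a, b):
    # an ordinary binary instruction, except that a HOLE operand turns
    # the result into a partially applied function instead of a value
    if a is HOLE and b is HOLE:
        return op
    if a is HOLE:
        return lambda x: op(x, b)
    if b is HOLE:
        return lambda y: op(a, y)
    return op(a, b)

print(exec_instr(add, 2, 3))       # -> 5
inc = exec_instr(add, HOLE, 1)     # 'ADD _, 1' = partial application
print(inc(41))                     # -> 42
```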

---

remember that we need an ANNOTATION/COMMENT opcode

---

and should we have one or more PREFIX opcodes in MEDIUM?

---

so Oot Assembly is coming together. It has 8-bit or 12-bit operands. It has 4 basic addressing modes (and maybe also a stack addressing mode and one or two more). It has goto, CALL, partial application, first-class functions, basic arithmetic (and POPCOUNT), conditional branch, boolean logic, Oberon-ish loops, S and K and I combinators, stack ops. Maybe some other composite data types (structs, arrays, multidim arrays, strings, cons cells/linked lists, dicts? graphs? streams? sets? what else? other collections?). It has garbage collection (like Go's, or just refcounting?). It has map and fold and filter and function composition. It has lenient evaluation, async I/O, tail calls, closures, optional first-class heap-allocated saguaro stack frames. It has hygienic macros and possibly operatives/vau/fexprs. It has option/maybe types (optionally nullable types). It has gradual typing and non-optional stack maps for the virtual (builtin) stack (and the safety guarantees/expressiveness limitations of stack maps). It has dynamic wind/unwind. It has fluid variables. It has limits of 2k local vars and 2k functions per module and 2k constants per module. Does it have exceptions, handler trees, continuations and prompts (and 'guarded continuations', i don't know what that is)? It has various 3-operand 'gates' such as Fredkin, Toffoli (CSWAP, CCNOT) [7]. It has immutable COW data. It has centis-like graph regex. It might have DFAs and NFAs and NockSpec?-like reduction and grammars. It has atomic ops and compare-and-swap. It might have some mmapped low registers, including maybe a read-only PC that doubles as an immutable 'discard' destination, and possibly including either a virtual stack pointer or a set of memory locations that stand for a stack addressing mode.

And this is also looking a lot like Oot Core. One difference is that Oot Core is not a linearized (flat) sequence of instructions, it has expressions and blocks (it's an AST, not a flat array of instructions). Another difference is that Oot Core functions can take many (up to 64? 256? 2k?) arguments, including both positional and keyword arguments, whereas Oot Assembly instructions have 3 operands.

It would be nice to have struct indexing or array indexing in single instructions, but the exploration of 'split addressing modes' above shows that there is no way to do this in a uniform fashion given the constraints of 64-bit instructions, first-class opcode operands, and at least 8 bits in each split (struct limits of at least 255 members).

It might have SYSCALL or STDLIBCALL opcodes (to allow an additional 4k builtins, separating builtin 'instructions' from stuff like 'open a file').

---