proj-oot-old-150618-ootAssemblyThoughts

todos


motivations and goals

The idea here is that Oot might have 3 well-defined compilation stages:

Oot -> Oot Core -> this Assembly/Bytecode -> platform-specific code

"Oot" would be the high-level language. It would be built from Oot Core using Oot Core's metaprogramming facilities.

The rationale for Oot Core is:

The rationale for Oot Assembly is:

Oot Assembly will make porting Oot easier by:

Some properties we want Oot Assembly to have:

Some things that many traditional assembly languages do that we don't:

What do we mean by 'preserve intent', and why do we want that?

What we mean is "don't map a specific operation S to a special case of a general operation G if, by looking at the result in the form of G, you cannot discover that this is actually S without doing non-local analysis".

Some examples:

There are three reasons to 'preserve intent':

Efficiency: We want Oot Assembly to be reasonably efficient to interpret; however, efficiency should not be at the expense of preservation of intent or customizability. We expect that performance-minded platform-specific implementations might compile Oot Assembly into custom high-performance bytecode formats anyways.

Examples of the consequences of this choice:


(copied from [1])

The idea of 'alias' and 'value' addressing is that each memory cell has two 'levels'. The 'alias' level might contain a pointer (or actually, maybe an entire 'alias record', which is a pointer with some attached metadata), or nothing if the cell is not a symlink but an actual value. The 'value' level, in the case of non-symlinked cells, contains the contents of the cell, and in the case of symlinked cells looks like it contains the contents of the symlink target. So when you use value addressing on a symlink cell (that is, one with a non-empty alias level), you are really manipulating the cell that is the symlink target; and when you use alias addressing on this cell, you are reading or mutating its symlink record.
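As a rough illustration of the two levels, here is a minimal Python sketch. Everything in it is hypothetical: the names Cell, Memory, read_value, write_value, read_alias and write_alias are made up for this example, the alias level holds a bare cell index rather than a full alias record, 'nothing' is modeled as None, and following chains of symlinks (and refusing cycles) is an assumption that the notes above do not settle.

```python
NO_ALIAS = None  # assumption: "nothing" at the alias level means "not a symlink"

class Cell:
    def __init__(self, value=None):
        self.alias = NO_ALIAS  # alias level: index of the symlink target, or nothing
        self.value = value     # value level: the cell's own contents

class Memory:
    def __init__(self, size):
        self.cells = [Cell() for _ in range(size)]

    def _target(self, i):
        # Follow symlinks until reaching a cell that holds an actual value.
        while self.cells[i].alias is not NO_ALIAS:
            i = self.cells[i].alias
        return i

    def read_value(self, i):
        # Value addressing: a symlinked cell looks like its target.
        return self.cells[self._target(i)].value

    def write_value(self, i, v):
        # Value addressing: writing to a symlinked cell mutates its target.
        self.cells[self._target(i)].value = v

    def read_alias(self, i):
        # Alias addressing: read the symlink record of cell i itself.
        return self.cells[i].alias

    def write_alias(self, i, target):
        # Alias addressing: mutate the symlink record, refusing cycles.
        j = target
        while j is not NO_ALIAS:
            if j == i:
                raise ValueError("alias cycle")
            j = self.cells[j].alias
        self.cells[i].alias = target

mem = Memory(8)
mem.write_value(3, "hello")
mem.write_alias(4, 3)          # cell 4 becomes a symlink to cell 3
mem.write_value(4, "changed")  # value addressing on 4 really mutates cell 3
```

After these steps, value addressing on either cell 3 or cell 4 sees the same contents, while alias addressing on cell 4 sees the symlink record (here just the index 3).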

In order to CREATE an alias, a third addressing mode is needed, because you want to take the 'address' of a cell, and place it into the alias level of a different cell. Instead of 2 different alias-related addressing modes, we could have some special instructions. Note that an 'address-of' mode could not be written to, only read, so maybe that one should be omitted?

We also need a sentinel to represent an empty alias cell. We can use 0, which would mean that the PC cannot be aliased, which is no great loss.

Also, is this 'alias level' identical to the 'meta level' above a 'perspective' (view), or do we need yet another mode for that? i'm guessing it's a field within the metadata in the meta level

if we used an 'alias' addressing mode, and a special instruction/opcode for GET_ADDRESS_IDENTIFIER and CREATE_ALIAS then:

If we wanted to look at or manipulate metainformation in the alias record, we would first move the alias record itself into an ordinary value cell, by issuing a CREATE_ALIAS instruction whose input had alias addressing (r) and whose destination had value addressing (*r). This creates an alias from the destination cell to the alias record which is controlling the input cell; if we wanted to alias a destination cell to the same symlink target as a source cell, we would instead do CREATE_ALIAS with value addressing in the input (the more typical case). (Note: if this is what we're doing, tho, then 'alias' addressing in CREATE_ALIAS's output is useless, which seems like a waste.) Note: this is irregular in that the CREATE_ALIAS and GET_ADDRESS_IDENTIFIER opcodes would interpret value addressing differently from every other opcode.

A more regular solution would be to use an address-of addressing mode, and to have GETALIAS and SETALIAS operations. To create an alias from r4 to r3, you would SETALIAS(input= address-of(r3), destination= address-of(r4)). To create an alias from r4 to whatever r3 is directly aliased to, GETALIAS(input=address-of(r3), destination=r1); SETALIAS(input= r1, destination= address-of(r4)). Etc. I like this better.
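The two recipes above can be sketched in Python as follows. This is only a sketch under stated assumptions: the Machine class and its method names are hypothetical, an address identifier is modeled as a plain cell index, the alias level holds only that index (no metadata), and 0 is used as the empty-alias sentinel as suggested earlier, so cell 0 can never be an alias target.

```python
EMPTY = 0  # sentinel for an empty alias level; cell 0 cannot be an alias target

class Machine:
    def __init__(self, size):
        self.alias = [EMPTY] * size  # alias level of each cell
        self.value = [None] * size   # value level of each cell

    def address_of(self, r):
        # address-of mode: yields the cell's own identifier (read-only).
        return r

    def resolve(self, r):
        # Value addressing follows symlinks; the bounded loop detects cycles.
        for _ in range(len(self.alias)):
            if self.alias[r] == EMPTY:
                return r
            r = self.alias[r]
        raise RuntimeError("alias cycle")

    def load(self, r):
        return self.value[self.resolve(r)]

    def store(self, r, v):
        self.value[self.resolve(r)] = v

    def getalias(self, src_addr, dst):
        # GETALIAS(input=address-of(src), destination=dst)
        self.store(dst, self.alias[src_addr])

    def setalias(self, target_addr, dst_addr):
        # SETALIAS(input=target, destination=address-of(dst))
        self.alias[dst_addr] = target_addr

m = Machine(8)
m.store(5, "x")
# Create an alias from r3 to r5.
m.setalias(m.address_of(5), m.address_of(3))
# Create an alias from r4 to whatever r3 is directly aliased to.
m.getalias(m.address_of(3), 1)
m.setalias(m.load(1), m.address_of(4))
```

Clearing a symlink in this model is just `m.setalias(EMPTY, m.address_of(4))`, which matches the idea of copying 0 into the alias level.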

A downside there is that address-of can't go in the destination operand. But i guess that's good, it means we have a spare 1/2 addressing mode. hmm, mb too clever, but we could use that in place of SETALIAS... presumably SETALIAS will be somewhat common, but GETALIAS will not. to clear a symlink, we can copy 0 in using this addressing mode

Note that SETALIAS is a special case of setting .__GET and .__SET.


current proposal

Design goals

Ease of implementation

(Massively) parallelizable decoding

supports linear instruction stream; also optionally supports tree structure

Message constituents with a hard upper bound on their length in bits (in this case, 32 bits is the upper bound of werds, and the payload is max length of 24 bits)

supports at least 12-bit addressing (in fact, we support 24-bit addressing, although using some addressing modes on operands of more than 12 bits requires two instructions)

Preservation of HLL intent

Extensibility

note: efficiency is NOT a primary design goal; as with the Dalvik encoding of JVM bytecode, or the LuaJIT bytecode, we expect that performance-minded implementations may create their own variant encodings. Eg the primary purpose of the 64-bit frame alignment is to support parallelized decoding (although efficiency is a secondary design goal).

Note that in this design, the operation might be indirectly specified via a reference to memory, rather than being immediately specified in the bytecode. This should be useful for encoding of calling first-class functions assigned to local variables.

Syntax

Note that the syntax of Oot Bytecode is a very general syntax that could be used for multiple languages (by varying the operations available, the modalities, the constraints, and the semantics of memory cells). Indeed, within Oot, this syntax is used for at least two 'languages'; one is Oot Instruction Bytecode and the other is Oot Graph Bytecode.

Sentences and phrases: Oot Bytecode is divided into variable-length sentences. Sentences consist of one or more variable-length phrases. Phrases consist of one or more werds.

Each werd is either 16 bits or 32 bits. A 16-bit werd consists of an 8-bit header and an 8-bit payload. A 32-bit werd consists of an 8-bit header and a 24-bit payload.
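To make the two werd sizes concrete, here is a small Python sketch of packing and unpacking a werd. The function names pack_werd and unpack_werd are hypothetical, and placing the header in the most significant 8 bits is an assumption; the notes do not fix the bit order, nor do they say here how a decoder tells a 16-bit werd from a 32-bit one, so the width is passed in explicitly.

```python
def pack_werd(header, payload, bits):
    """Pack an 8-bit header and a payload into a 16-bit or 32-bit werd.

    Assumption: the header occupies the most significant 8 bits.
    """
    if bits not in (16, 32):
        raise ValueError("a werd is either 16 or 32 bits")
    payload_bits = bits - 8  # 8-bit payload or 24-bit payload
    if not 0 <= header < (1 << 8):
        raise ValueError("header must fit in 8 bits")
    if not 0 <= payload < (1 << payload_bits):
        raise ValueError("payload does not fit in this werd size")
    return (header << payload_bits) | payload

def unpack_werd(werd, bits):
    """Return (header, payload) for a werd of the given width."""
    if bits not in (16, 32):
        raise ValueError("a werd is either 16 or 32 bits")
    payload_bits = bits - 8
    return werd >> payload_bits, werd & ((1 << payload_bits) - 1)

short_werd = pack_werd(0xA5, 0x3C, 16)
long_werd = pack_werd(0x01, 0xABCDEF, 32)
```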

An aside on terminology: The natural language linguistic concept of 'word' is a good fit for what is here called a 'werd' because, like linguistic words, Oot Bytecode werds are composed of a 'root morpheme' (the payload) and other syntactical and modifier morphemes (the header). However, in computing, the term 'word' is already in use to refer to architecture-specific fixed-length 'words', so to avoid confusion, i changed the spelling slightly. If this annoys you, feel free to call them "words"; in fact, in most contexts, i use the spelling "word" myself, and i only use "werd" when i am particularly worried about confusion with architecture-defined words.

Werds are grouped into 64-bit frames. Phrases cannot span multiple frames unless the parentheses construct is used.
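One consequence of the frame size is that only a handful of werd layouts are possible per frame. The Python sketch below enumerates them, assuming (this is not stated above) that a frame is always filled exactly, with no padding; frame_layouts is a hypothetical helper name.

```python
def frame_layouts(frame_bits=64, werd_sizes=(16, 32)):
    """List every ordered sequence of werd sizes that exactly fills a frame."""
    if frame_bits == 0:
        return [[]]
    layouts = []
    for size in werd_sizes:
        if size <= frame_bits:
            for rest in frame_layouts(frame_bits - size, werd_sizes):
                layouts.append([size] + rest)
    return layouts

layouts = frame_layouts()
for layout in layouts:
    print(layout)
```

Under that assumption a frame holds between two and four werds, which bounds how much work a decoder has to do per frame.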

Each phrase has one of 8 'roles' within the sentence. Each werd has one of (the same 8) subroles within the phrase. It's possible for multiple phrases within a sentence, or for multiple werds within a phrase, to have the same role or subrole. The ordering of phrases or werds with the same role or subrole is significant. Otherwise, the ordering is insignificant (except that some roles have positional constraints, namely, the first werd of a phrase is always subrole 0, and phrase roles 6 and 7 can only appear as the first phrase of a 64-bit frame).
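The ordering rule can be read as: two sentences mean the same thing if, for every role, they list the same phrases in the same order. A minimal Python sketch of that reading follows. It is an interpretation, not a definition from these notes; canonical_form is a hypothetical name, phrases are modeled as (role, payload) pairs with made-up payloads, and the positional constraints on roles 6 and 7 are ignored.

```python
from collections import defaultdict

def canonical_form(phrases):
    """Group (role, payload) phrases by role, keeping order within each role."""
    by_role = defaultdict(list)
    for role, payload in phrases:
        by_role[role].append(payload)
    return dict(by_role)

a = [(1, "x"), (2, "dest"), (1, "y")]
b = [(2, "dest"), (1, "x"), (1, "y")]  # roles interleaved differently
c = [(1, "y"), (2, "dest"), (1, "x")]  # the two role-1 phrases swapped

same_meaning = canonical_form(a) == canonical_form(b)
different_meaning = canonical_form(a) != canonical_form(c)
```

The same grouping would apply one level down, to werds and their subroles within a phrase.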

The bits in the 8-bit werd header are as follows:

EOS/BOP: if this is the first werd in the 64-bit frame, then it's an EOS. Otherwise it's a BOP. EOS means that this 64-bit frame is the last 64-bit frame in the current sentence. BOP means that this werd begins a new phrase.

role: if this is a BOP, then this is the role of this phrase within the sentence, and the role of this werd within the phrase is role 0. Otherwise, this is the subrole of the werd within the phrase. The 8 roles are:

addressing modes:

the 'split' modes split the payload in half (so an 8-bit payload is split into 2 4-bit payloads, and a 24-bit payload into 2 12-bit payloads), apply the specified addressing mode to each half, and then combine the two in a 'GET' (or index-into) operation. For example, mode 5 retrieves the contents of the memory cell indicated by the first half of the payload (the direct mode), and then finds the index within that data structure indicated by the second half of the payload. So if the first half of the payload is '3' and the second half is '5', and memory location 3 contains an array, then the effective address indicated would be the 5th element within this array.

the semantics of memory cells, indirect addressing, and the GET operation are language-specific
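The split-mode example above (first half 3, second half 5) can be sketched in Python like this. The names split_payload and get_split are hypothetical. Taking the 'first half' to be the high-order bits is an assumption, and GET is modeled as plain zero-based Python indexing purely for illustration, since GET is language-specific.

```python
def split_payload(payload, payload_bits):
    """Split a payload into (first_half, second_half).

    Assumption: the first half is the high-order half of the payload.
    """
    half = payload_bits // 2  # 4 bits for an 8-bit payload, 12 for a 24-bit one
    mask = (1 << half) - 1
    return (payload >> half) & mask, payload & mask

def get_split(memory, payload, payload_bits):
    """Direct mode on the first half, then GET using the second half as index."""
    cell, index = split_payload(payload, payload_bits)
    container = memory[cell]  # direct mode: contents of the memory cell
    return container[index]   # GET, modeled here as zero-based indexing

memory = {3: list("abcdefgh")}   # memory cell 3 holds an array
short_payload = 0x35             # 8-bit payload: halves 3 and 5
long_payload = (3 << 12) | 5     # 24-bit payload: halves 3 and 5
```

Both payloads name the same element of the array in cell 3; the 24-bit form just has room for larger cell numbers and indices.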

more details on some of the roles:

role 0 phrase details

role 1 phrase details

in this role, addressing modes yield the value found at the effective address

werds with subrole 2 within this phrase can be used to pass 'named arguments'; eg the name of the argument would be in subrole 2, and the value of the argument would be in subrole 1 (or implicit in subrole 0)

role 2 phrase details

in this role, addressing modes yield an effective address

werds with subrole 2 within this phrase can be used to pass 'named return arguments'; eg the name of the return argument would be in subrole 2, and the lhs expression of the return argument would be in subrole 1 (or implicit in subrole 0)

role 3 phrase details

todo: there should be a way to include a single 8-bit modality payload, but also a way to include arbitrary settings of named modality fields to values.

role 4 phrase details

role 5 phrase details

role 6 phrase details

Role 6 is generically defined as anything that can be translated in a context-independent manner (that is, the translator is only allowed to look at a contiguous group of 64-bit frames at a time, and must be stateless in between groupings) into one or more sentences.

When the payload is 8 bits, the 8-bit payload is the modality (that is, it is interpreted as if there were a single-werd phrase with role modality, where the werd had role 0, and this was the payload of the werd), and the rest of the 64-bit frame is as follows:

When the payload is 24 bits, the first 8 bits select the number of 64-bit frames included in the packed segment, the next 8 bits specify a language-specific packed representation, and the last 8 bits are language-defined. So far this is not used by Oot.

role 7 phrase details

Role 7 reinterprets the addressing mode field to indicate what type of grouping construct is present. All role 7 constructs apply at the granularity of entire 64-bit frames.

The payload of role 7 is always split; the first half of the payload is how many 64-bit frames are spanned by the construct, and the second half is the 'type' of the construct. Frame lengths 0 and 1 may have special meanings, todo. Note that grouping-ending constructs such as right parens have an identical payload to the matching left-parens; since this includes the number of frames spanned, this makes it efficient to jump from the right parens to the corresponding left parens (or vice versa). The most-significant-bit of the 'type' of the construct is reserved. For quasiquote, this means whether or not there is any antiquote within this quasiquote; the semantics of this bit for other modes is reserved for future use.
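A small Python sketch of decoding a role 7 payload along these lines. The function name decode_role7_payload and the returned field order are hypothetical, and treating the 'first half' as the high-order bits is an assumption.

```python
def decode_role7_payload(payload, payload_bits=24):
    """Return (frames_spanned, construct_type, reserved_bit) for a role 7 werd.

    Assumption: the first half of the payload is the high-order half.
    """
    half = payload_bits // 2
    frames_spanned = payload >> half
    type_field = payload & ((1 << half) - 1)
    reserved_bit = bool(type_field >> (half - 1))          # MSB of the type
    construct_type = type_field & ((1 << (half - 1)) - 1)  # remaining type bits
    return frames_spanned, construct_type, reserved_bit

# A construct spanning 3 frames, type 2, with the reserved bit set.
opening = (3 << 12) | (1 << 11) | 2
closing = opening  # the closing werd carries an identical payload
```

Because the opening and closing werds carry the same payload, a decoder standing on either one can read the frame count and skip to the other end without scanning what lies between.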

As an optional extension, a language may support arbitrary-length constructs. In this case, if the length of the payload is the maximum value (2^24 - 1), then this indicates that the construct is actually arbitrarily longer than 2^24 - 2, and at displacement 2^24 - 2 will be found another construct opening werd. Most languages don't support this arbitrary-length construct feature, in which case payload value 2^24 - 1 is illegal.

some of these constructs are used to embed things from a different 'language' within bytecode of a default language. todo figure out which ones?

todo: can parens only enclose single phrases, or can they enclose whole sentences?

todo: What about EOS bits within these constructs?

todo: is a region annotation of length 0 a point annotation?

todo: can 'sentence length' also specify a 'foreign language' sentence?

note: the difference between 'foreign languages' in role 7 vs language-specific packed representations is that a role 7 'foreign language' uses the same format/syntax as described here, but varies the semantics (eg list of operations, list of modalities, list of constraints, semantics of memory cells, semantics of indirect addressing mode and GET), whereas role 6 is an extension mechanism to contain segments of arbitrary format/syntax.

languages

a language using this syntax must define:

and must either define the semantics of or forbid the use of:

and may define:

numerology

There are 2^24 accessible memory cells (24-bit addressing). The size of the constant table must be less than 2^24.

Because an 8-bit payload can be split into two 4-bit payloads in split addressing mode, memory cells 0-15 can be accessed somewhat more easily than others. Similarly, constant table entries 0-15 can be accessed more easily than others.

Similarly, the largest memory cell location that can be accessed using split mode addressing is 2^12 - 1 (using a 24-bit payload split into 2 12-bit payloads). Similarly, constant table entries up to 2^12 - 1 can be accessed more easily than others.
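The arithmetic behind these limits can be checked with a short Python sketch. The helper name smallest_payload_bits is hypothetical; it only restates the payload widths given above (8 or 24 bits, halved in split mode).

```python
def smallest_payload_bits(index, split=False):
    """Return the smallest payload width (8 or 24) that can hold this index.

    In split mode only half of the payload is available for the index.
    Returns None if the index does not fit even in a 24-bit payload.
    """
    for bits in (8, 24):
        usable = bits // 2 if split else bits
        if 0 <= index < (1 << usable):
            return bits
    return None

limits = {
    "split, 8-bit payload": (1 << 4) - 1,    # cells 0-15
    "split, 24-bit payload": (1 << 12) - 1,  # cells up to 2^12 - 1
    "unsplit, 24-bit payload": (1 << 24) - 1,
}
```

So cell 15 is reachable from a 16-bit werd in split mode, cell 16 needs a 32-bit werd, and anything past 2^12 - 1 cannot be named in a single split-mode half at all.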

profiles

There are many aspects of this format which aid extensibility but which may make implementation more difficult. Therefore, although they are described above, the 'default profile' turns them off.

todo: let's have short construct length limits by default so that a default-profile interpreter doesn't have to reserve much memory

segment format

todo: need to specify constant table format, format for file containing both constant table and bytecode, etc? or is this implementation-dependent? (i'm leaning towards implementation-dependent, although in that case we should still specify it for the Oot language)

natural languages

some inspiration was taken from the syntax of natural languages (i took a course once in head-driven phrase structure grammar, so i wouldn't be surprised if this turned out to be particularly close to that). In addition, this syntax is probably sufficient for encoding most natural language sentences (assuming you are willing to do things like sometimes use multiple Oot Bytecode sentences to represent one natural language sentence).

Here's how you might encode some natural language constructs:

semantics: Oot Instruction Bytecode

indirect: todo; this probably will have something to do with aliasing or symlinking but it's not certain (see [[ootAssemblyNotes5]]). In any case, the 'direct' mode is the 'fast' one.

In direct and indirect mode, the werds reference memory cells. A memory cell is considered to hold an entire arbitrarily-sized data structure; that is to say, a memory cell number is semantically more like a local variable than it is like an actual location in memory.

memory cell 0 is the PC (i think 1-3 should be special too, relating to stack(s), todo)

only supports up to 2^12-4 local variables

does not support arbitrary-length role 7 constructs

does not support modules whose AST has more than approximately 2^24 nodes

does not support modules which have more than 2^24 module-level werds (i'm flirting with a 2^12 limit here, although i think that's too low for many existing libraries; but note that if you import/link to external libraries a,b,c, as long as the references to those libraries are hierarchical (a.f, not just f), then 4096 is probably fine)

does not support composite literals with more than about 2^12 AST nodes (eg an array of strings is a composite literal; a string is an atomic literal)

todo